Introduction

Viruses account for the majority of newly emerging human pathogens. Over the past few years, many different viruses such as SARS coronavirus, MERS coronavirus, avian influenza viruses, hantaviruses, Zaire ebolavirus, or Zika virus have (re)emerged as human pathogens [1,2,3,4,5,6,7]. Vaccines are the most efficient and cost-effective tools to fight infectious diseases, particularly virus infections. Millions of people and domestic animals worldwide still suffer from many devastating infectious diseases for which no (efficient) vaccines exist. We lack a rapid, universal, and reliable strategy that could be used for attenuation of viruses and production of vaccines.

Of the three basic types of viral vaccines, modified live virus vaccines are the most efficacious and are preferred for healthy individuals, because they evoke broad, strong, and durable immune responses and generally outperform inactivated and subunit vaccines [8,9,10].

Traditionally, modified live virus vaccines have been prepared empirically by serial passage of virulent viruses in cell culture and/or laboratory animals [8, 9]. However, attenuation by this procedure is costly, time consuming, and highly unpredictable [9]. While serial passage results in accumulation of a large number of mutations, often only a handful of them contribute to attenuation [11, 12]. Consequently, some vaccines prepared by serial passage are prone to reversion to virulence [13,14,15]. This safety concern is the biggest limiting factor for use of such vaccines [8, 9, 11].

Recent advances in the de novo synthesis of DNA ushered in the era of synthetic biology and the nascent field of modified live virus vaccines that are prepared by large-scale recoding of pathogen genomes [16,17,18,19]. In contrast to viral attenuation by serial passage, the attenuating mutations are introduced into viral genomes deliberately, according to rationally designed recoding principles.

The overall concept of attenuation by large-scale recoding is simple and effective: viruses with recoded genomes replicate efficiently in culture systems, which is favorable for viral vaccine production, but their replication capacity and virulence in vivo are severely reduced or absent [17, 20]. The reduced reproductive fitness enables the host to gain the upper hand in controlling virus replication by innate and adaptive immune responses.

Typically, the goal of recoding is to change the dinucleotide, codon, or codon pair composition of the viral genome, because all three types of (interrelated) modifications have been shown to yield replication-competent but severely attenuated viruses. Importantly, while recoding introduces hundreds of point mutations into viral genomes, the amino acid sequences of the encoded proteins are preserved. Consequently, the recoded viruses are antigenically identical to their pathogenic parents. The antigenic identity and replicative potential enable attenuated viruses to induce immune responses that are similar to those induced by virulent strains. Recoded viruses represent very promising vaccine candidates, because it might be possible to achieve the desired level of attenuation by adjusting the level of recoding [17]. In addition, viruses attenuated by large-scale recoding are extremely genetically stable, which is explained by the sheer number of introduced mutations [16, 17, 20,21,22].

Virus Attenuation by Codon Deoptimization

Amino acids, except for methionine and tryptophan, can be encoded by two or more synonymous codons, but these are used at unequal frequencies, a phenomenon known as codon bias. Synonymous codons also differ in translational accuracy [23], propensity to mutate to non-synonymous and non-conservative codons [24, 25], abundance of the tRNAs that decode them [26], and capacity to allow non-standard (wobble) base pairing between the third base of the codon and the first base of the anticodon [27].
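Codon bias can be made concrete by tabulating, for each amino acid, how often each of its synonymous codons is used. The following is a minimal Python sketch under stated assumptions: it uses the standard genetic code, expects an in-frame coding sequence, and the function names `split_codons` and `synonymous_fractions` are introduced here for illustration only.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    """Split an in-frame coding sequence into codons (RNA 'U' is read as 'T')."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def synonymous_fractions(seq):
    """For each codon seen, return its share among codons for the same amino acid."""
    counts = Counter(split_codons(seq))
    per_amino_acid = Counter()
    for codon, n in counts.items():
        per_amino_acid[CODON_TABLE[codon]] += n
    return {codon: n / per_amino_acid[CODON_TABLE[codon]]
            for codon, n in counts.items()}


# Toy sequence: Met, four alanines (GCT used twice), Trp.
fractions = synonymous_fractions("ATGGCTGCTGCGGCCTGG")
```

In this toy sequence, GCT accounts for half of the alanine codons, whereas the single-codon amino acids methionine and tryptophan necessarily have a fraction of 1. Real codon usage tables are compiled from many genes, not from one short sequence.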

Codon choice affects translation efficiency [28], protein folding [29], and mRNA stability [30], but the significance of codon bias remains unclear despite decades of investigation. The prevailing hypothesis predicts that frequently used codons are translated more rapidly than rare codons, because frequent codons are often decoded by abundant tRNAs [26, 31, 32]. Consequently, use of rare codons reduces translation rates and protein yield, because rare codons are decoded slowly by rare tRNAs [30, 33]. Yet little direct in vivo evidence supports this hypothesis [34, 35]. In addition, it was shown that codon-optimized genes are often not translated as efficiently as expected [33].

The first attenuated virus prepared by large-scale recoding was a poliovirus (Enterovirus C), in which the codon usage of the capsid coding region was modified [16, 18]. The rationale for recoding was the opposite of codon optimization strategies. The goal was to modify viral genomes to contain more codons that are infrequently used by the virus [18] or the virus host [16], because it was assumed that these might reduce the speed of translation elongation and thus also protein yield. Codon deoptimization resulted in severely attenuated viruses in vitro [16, 18] and in vivo [16]. As expected, maximization of codons that are underrepresented in the virus host decreased the translation capacity and protein yields of the recoded viruses. Surprisingly, viruses that contained an increased number of their own infrequent codons showed unaltered protein production, but diminished viral RNA yields and specific infectivity of purified virions [18].
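The basic idea of codon deoptimization, replacing each codon with a rarely used synonymous codon, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure used in the cited studies: `TOY_HOST_USAGE` holds made-up usage values (per thousand codons) for a few alanine and leucine codons only, and the helper names are hypothetical. Codons without an entry in the table are left unchanged.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Illustrative values only (assumption): NOT a measured host codon usage table.
TOY_HOST_USAGE = {
    "GCT": 18.0, "GCC": 28.0, "GCA": 16.0, "GCG": 7.0,               # alanine
    "CTG": 40.0, "CTC": 20.0, "CTT": 13.0, "TTG": 13.0,
    "TTA": 8.0, "CTA": 7.0,                                          # leucine
}


def split_codons(seq):
    """Split an in-frame coding sequence into codons."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def translate(seq):
    """Translate an in-frame coding sequence with the standard genetic code."""
    return "".join(CODON_TABLE[c] for c in split_codons(seq))


def codon_deoptimize(seq, usage):
    """Replace each codon with the least-used synonymous codon listed in `usage`."""
    recoded = []
    for codon in split_codons(seq):
        amino_acid = CODON_TABLE[codon]
        candidates = [c for c, aa in CODON_TABLE.items()
                      if aa == amino_acid and c in usage]
        # Keep the original codon when the table has no entry for this amino acid.
        recoded.append(min(candidates, key=usage.get) if candidates else codon)
    return "".join(recoded)


original = "ATGGCCCTGTGG"                      # Met-Ala-Leu-Trp
recoded = codon_deoptimize(original, TOY_HOST_USAGE)
```

With these toy values, the alanine codon GCC becomes GCG and the leucine codon CTG becomes CTA, while the encoded peptide is unchanged. Published designs typically recode only selected genome regions and may choose codons less aggressively than this greedy rule.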

Since then, others have followed suit, and many other viruses, including rabies virus [36], influenza A virus, human respiratory syncytial virus [22, 37, 38], lymphocytic choriomeningitis virus [39, 40], and foot-and-mouth disease virus [41], have been recoded using the codon deoptimization principles. In most cases, codon deoptimization produced viruses that were highly attenuated in vitro and in vivo [39,40,41,42]. However, some codon-deoptimized viruses remained pathogenic [36], or became only moderately attenuated [22].

Interestingly, experiments with human respiratory syncytial virus showed that viruses that were codon deoptimized according to the viral host codon usage had decreased protein production and were attenuated, whereas viruses that were deoptimized according to the virus codon usage were not [22].

Virus Attenuation by Codon Pair Deoptimization

The fact that codon usage alone could not explain the observed differences in protein production implied that other sequence features, such as neighboring nucleotides or codons (codon context), must influence translation elongation. Recent studies have accumulated compelling evidence that different mRNA context cues modulate eukaryotic translation (reviewed in [43]).

Like the use of individual codons, the pairing of adjacent codons in protein coding genes is not random, a phenomenon known as codon pair bias [44, 45]. Some codon pairs are found in open reading frames (ORFs) significantly more or less frequently than would be expected based on the overall frequencies of the two codons that form a particular codon pair [16, 21, 44]. The level of under- and overrepresentation of each codon pair can be measured with the codon pair score (CPS) statistic [21].
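The CPS is commonly written as the natural logarithm of the ratio of observed to expected codon pair occurrences, with the expected count computed so that it is independent of both codon bias and amino acid pair frequency: CPS = ln(N_AB / ((N_A × N_B) / (N_X × N_Y) × N_XY)), where codons A and B encode amino acids X and Y. The text above does not spell out the formula, so this form is an assumption based on the usual definition. A minimal Python sketch, with hypothetical function names, could look as follows.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(seq):
    """Split an in-frame coding sequence into codons."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def codon_pair_scores(orfs):
    """Return {(codon_a, codon_b): CPS} for every codon pair observed in `orfs`."""
    codon_n, amino_n = Counter(), Counter()
    pair_n, amino_pair_n = Counter(), Counter()
    for orf in orfs:
        codons = split_codons(orf)
        for codon in codons:
            codon_n[codon] += 1
            amino_n[CODON_TABLE[codon]] += 1
        for a, b in zip(codons, codons[1:]):      # adjacent, in-frame pairs
            pair_n[(a, b)] += 1
            amino_pair_n[(CODON_TABLE[a], CODON_TABLE[b])] += 1

    scores = {}
    for (a, b), observed in pair_n.items():
        x, y = CODON_TABLE[a], CODON_TABLE[b]
        expected = (codon_n[a] * codon_n[b]) / (amino_n[x] * amino_n[y]) \
            * amino_pair_n[(x, y)]
        scores[(a, b)] = math.log(observed / expected)
    return scores


# Toy ORF sets (two codons each): Ala-Glu encoded in different ways.
biased = codon_pair_scores(["GCTGAA", "GCCGAG"])
unbiased = codon_pair_scores(["GCTGAA", "GCTGAG", "GCCGAA", "GCCGAG"])
```

In the biased toy set, the pair GCT-GAA occurs twice as often as expected and scores ln 2; in the unbiased set every pair scores 0. Negative scores mark underrepresented pairs. Meaningful scores require counts from a large set of ORFs, such as all annotated coding sequences of the host.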

Codon pair bias was found in every species studied [46] and can be radically dissimilar between different species [20], but closely related species have essentially the same codon pair bias [45, 47]. Its existence has been known for many years, but it was on the periphery of scientific inquiry, and thus its biological significance and the forces shaping it are only poorly understood [46].

Attenuation by codon pair (bias) deoptimization, also known as “synthetic attenuated virus engineering” (SAVE), was pioneered in 2008 by the group of Eckard Wimmer at Stony Brook University, who explored the effects of altering codon pair bias by recoding poliovirus [21]. The attenuation of viruses by codon pair deoptimization involves reshuffling of existing codons in a protein coding sequence without changing the codon bias or amino acid composition of the encoded protein [20, 21]. The goal of reshuffling is to maximize the number of codon pairs that are underrepresented in the protein coding sequences of the virus host.
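The reshuffling step can be illustrated with a simple search that swaps synonymous codons between positions and keeps a swap only if it does not raise the mean codon pair score. This is a sketch under stated assumptions, not the published algorithm: a plain hill-climbing search stands in for whatever optimizer the original studies used, `TOY_CPS` is a made-up score table, and all function names are hypothetical. Because codons are only exchanged between positions that encode the same amino acid, both the protein sequence and the codon usage are preserved.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks stop codons).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Made-up scores (assumption): negative means underrepresented in the host.
TOY_CPS = {("GCC", "GAA"): -1.0, ("GCT", "GAG"): -1.0}


def split_codons(seq):
    """Split an in-frame coding sequence into codons."""
    seq = seq.upper().replace("U", "T")
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def mean_cps(codons, cps):
    """Mean score over adjacent codon pairs; pairs missing from `cps` count as 0."""
    pairs = list(zip(codons, codons[1:]))
    return sum(cps.get(p, 0.0) for p in pairs) / len(pairs)


def codon_pair_deoptimize(seq, cps, steps=2000, seed=0):
    """Shuffle synonymous codons to lower the mean codon pair score."""
    rng = random.Random(seed)
    codons = split_codons(seq)
    if len(codons) < 2:
        return "".join(codons)
    positions = {}                                # amino acid -> codon positions
    for i, codon in enumerate(codons):
        positions.setdefault(CODON_TABLE[codon], []).append(i)
    swappable = [idx for idx in positions.values() if len(idx) > 1]
    if not swappable:
        return "".join(codons)
    best = mean_cps(codons, cps)
    for _ in range(steps):
        i, j = rng.sample(rng.choice(swappable), 2)
        codons[i], codons[j] = codons[j], codons[i]
        score = mean_cps(codons, cps)
        if score <= best:
            best = score                          # keep the swap
        else:
            codons[i], codons[j] = codons[j], codons[i]   # undo the swap
    return "".join(codons)


original = "GCTGAAGCCGAG"                         # Ala-Glu-Ala-Glu
shuffled = codon_pair_deoptimize(original, TOY_CPS)
```

The shuffled sequence uses exactly the same codons as the original and encodes the same peptide, but its mean codon pair score is lower. A hill-climbing search can stall in local minima on real genes, so a practical design tool would likely use a more robust optimizer.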

In the seminal study, recoding by codon pair deoptimization involved the P1 region of the virus, which encodes the viral capsid [21]. Remarkably, a poliovirus with a fully codon pair-deoptimized P1 region, “PV-Min,” could not be rescued in cell culture, despite the fact that no new rare codons were introduced into the recoded viral segment. On the other hand, the “PV-Max” virus, with a codon pair-optimized P1 segment, had the biological properties of the wild-type parent.

Since its initial description, codon pair deoptimization enabled rapid and highly efficient attenuation of a wide variety of viruses, including influenza A virus [20, 48, 49•], human immunodeficiency virus [50], human respiratory syncytial virus [51], vesicular stomatitis Indiana virus [52], and dengue virus [53•]. Some of the recoded viruses have shown 100,000-fold attenuation in comparison to pathogenic parents and have been successfully used as highly protective experimental vaccines with a wide margin of safety [48, 49•].

Two main competing hypotheses propose different molecular mechanisms for the attenuation of viruses by codon pair deoptimization. One hypothesis suggests that the increased numbers of underrepresented, or “non-preferred,” codon pairs in recoded sequences are themselves the reason for attenuation, because they create conditions that are not conducive to efficient protein production or processing [21, 53•]. It is speculated that the physical properties of some tRNA molecules hamper their efficient interaction at the adjacent A- and P-sites of the translating ribosome. As a consequence, codon pair-deoptimized sequences do not support efficient protein translation and are prone to increased mistranslation, stalled translation, or premature termination [49•]. The alternative hypothesis suggests that it is not the codon pairs themselves, but the increased number of CpG (and TpA) dinucleotides present in codon pair-deoptimized sequences (see below for explanation), that is responsible for the decrease of mRNA levels and thus also of protein yields and virus attenuation [54,55,56].

While recoding by codon pair deoptimization has always led to decrease of protein production, it is unknown whether this decrease is caused primarily by suboptimal protein translation, or could be also caused by the reduced mRNA levels, because it was shown that codon pair deoptimization can be responsible for extensive reduction of mRNA levels [48]. However, the reduction of mRNA levels does not occur universally, and often the reduction of RNA levels is disproportional to the magnitude of reduction of protein levels [48, 49].

Typically, codon pair deoptimization introduces several hundred nucleotide changes into recoded genes. It is not known which of these changes are responsible for reduced protein production and the ensuing virus attenuation. There are three possible options: (1) reduction of protein production is caused by a large number of underrepresented codon pairs that each exert a small negative effect on protein production, (2) reduction of protein production is caused primarily by a small number of codon pairs that exert strong negative effects on protein production, and (3) other, as yet unknown, sequence features are responsible for decreased mRNA stability or faster turnover of mRNA transcripts.

In 2016, in an elegant study, Gamble et al. provided compelling evidence—through experimentation with 35,000 GFP variants in yeast species Saccharomyces cerevisiae—that codon pairs rather than individual codons can exert a potent effect on translation elongation [57••]. The study identified 17 inhibitory codon pairs that were implicated in low protein production of the superfolding GFP. The inhibitory effect could not be assigned to individual codons, or six-base sequence, or encoded dipeptide, since reduced protein production was observed only when both codons of the inhibitory pair were present, in-frame, and adjacent in a proper order. The correct ordering suggested that tRNA interactions with mRNA on the ribosome mediated the inhibitory effect.

Codon and Codon Pair Deoptimization Increases the Number of CpG Dinucleotides in Recoded Genes

It was discovered that codon and codon pair deoptimization of vertebrate viruses not only increases the number of codons, or codon pairs that are underrepresented in coding sequences of the host, but also increases the frequency of CpG and, to lesser degree, TpA (UpA) dinucleotides in recoded sequences [45, 54, 55, 58]. The increase of CpG and TpA dinucleotides by codon pair deoptimization is inadvertent, as codon pairs that contain CpG and TpA dinucleotides at the codon pair boundary (NNC-p-GNN) are among the most underrepresented codon pairs in vertebrates [45, 55]. For example, 97 of the 100 most underrepresented codon pairs contain CpG at the codon pair boundary [45]. As a result, recoding by codon pair deoptimization does not increase the number of CpG dinucleotides that are already present in the shuffled codons, because recoding preserves codon bias, but creates new CpG and TpA dinucleotides at the boundary between the new codon pairs.
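The distinction between dinucleotides inside codons and dinucleotides that span a codon boundary is easy to check computationally. The sketch below is an illustration only: it assumes an in-frame coding sequence that starts at position 0, and the function name `count_dinucleotide` is hypothetical. It shows how swapping two synonymous threonine codons leaves codon usage untouched yet creates a new CpG and a new TpA at codon boundaries.

```python
def count_dinucleotide(seq, dinucleotide="CG"):
    """Return (within_codon, at_codon_boundary) counts for an in-frame sequence."""
    seq = seq.upper().replace("U", "T")
    within = boundary = 0
    for i in range(len(seq) - 1):
        if seq[i:i + 2] == dinucleotide:
            # Index 2 of each codon is its last base, so a match starting
            # there spans the boundary to the next codon (NNC-GNN).
            if i % 3 == 2:
                boundary += 1
            else:
                within += 1
    return within, boundary


# Thr-Glu-Thr-Lys, before and after swapping the two threonine codons.
before = "ACTGAAACCAAG"     # ACT GAA ACC AAG
after = "ACCGAAACTAAG"      # ACC GAA ACT AAG

cpg_before = count_dinucleotide(before, "CG")
cpg_after = count_dinucleotide(after, "CG")
tpa_before = count_dinucleotide(before, "TA")
tpa_after = count_dinucleotide(after, "TA")
```

Neither sequence contains a CpG or TpA inside any codon, but the reshuffled sequence gains one of each at a codon boundary (ACC-GAA and ACT-AAG).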

Similarly, because CpG and TpA dinucleotides are significantly suppressed in the genomes of higher eukaryotes [59], synonymous codons that contain CpG and TpA dinucleotides, for example, alanine’s GCG, or leucine’s CTA and TTA codons, are also infrequently used in protein coding sequences. Thus, codon deoptimization of vertebrate viruses also results in an elevated number of CpG and TpA dinucleotides in recoded sequences. It is, therefore, unclear whether the increase of underrepresented codons, codon pairs, or less-favored dinucleotides in recoded sequences is primarily responsible for virus attenuation.

Consequently, an alternative hypothesis suggests that the cause of attenuation is to be found in the increased number of CpG (and TpA) dinucleotides, which are recognized by an as yet uncharacterized self/non-self-recognition system that stimulates enhanced innate immune responses to such recoded viruses [55, 56]. Since codon pair preferences and dinucleotide frequencies are intimately related (the most underrepresented codon pairs contain CpG and TpA dinucleotides at the codon pair boundary), dissecting the effects of the two phenomena is exquisitely difficult [56, 60].

Virus Attenuation by Increase of CpG/TpA Dinucleotides Frequencies

Vertebrate genomes have low CpG levels and the CpG suppression can be plausibly explained by the methylation-deamination hypothesis [61]. This hypothesis suggests that abundant methylation of cytosine in CpG dinucleotides is responsible for CpG suppression, because methylated cytosine often mutates to thymine by spontaneous deamination [62]. As a result, methylated CpG dinucleotides decay into TpG (and CpA) dinucleotides over time.

It remains enigmatic why CpG and TpA (UpA) dinucleotides also occur at lower frequency in the genomes of most RNA and small DNA viruses that infect vertebrates [45, 63]. For example, human papillomaviruses exhibit a frequency of CpG dinucleotides in their genomes that is only ~ 50% of the expected number. Even more strikingly, CpG dinucleotides are reduced to only ~ 25% of the expected number in human immunodeficiency viruses and to less than 10% in human polyomaviruses [45]. Because CpG methylation does not occur on RNA, neither the methylation-deamination hypothesis nor viral sequence constraints can explain the underrepresentation of CpG dinucleotides in the genomes of vertebrate RNA viruses [64].
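Percentages of this kind are usually expressed as an observed/expected (O/E) ratio, in which the expected dinucleotide frequency is the product of the frequencies of its two bases. The Python sketch below assumes this common definition, which the text does not state explicitly, and the function name is hypothetical. A ratio of 0.5 would correspond to the ~ 50% figure quoted for papillomaviruses.

```python
def observed_expected_ratio(seq, dinucleotide="CG"):
    """Observed/expected ratio of a dinucleotide: f(XY) / (f(X) * f(Y))."""
    seq = seq.upper().replace("U", "T")
    first, second = dinucleotide
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    freq_first = seq.count(first) / n
    freq_second = seq.count(second) / n
    if freq_first == 0 or freq_second == 0:
        raise ValueError("expected frequency is zero for this sequence")
    # Count overlapping occurrences over the n - 1 dinucleotide positions.
    observed = sum(seq[i:i + 2] == dinucleotide for i in range(n - 1)) / (n - 1)
    return observed / (freq_first * freq_second)


enriched = observed_expected_ratio("ACGTACGT")    # CpG far above expectation
depleted = observed_expected_ratio("GGCC")        # contains C and G but no CpG
```

Values below 1 indicate suppression and values above 1 indicate enrichment. The toy sequences are far too short for a meaningful estimate; the ratio is normally computed over whole genomes or long coding regions.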

An alternative explanation for suppression of CpG dinucleotides in the genomes of small viruses suggests that CpG dinucleotides act as immunostimulatory motifs that trigger antiviral immune responses [45, 54, 64]. However, the identity of the hypothetical receptors recognizing CpG-rich RNA molecules remains elusive.

A recent study by Takata et al. showed that the host zinc-finger antiviral protein (ZAP) is the long-suspected ssRNA CpG receptor: it binds specifically to CpG-rich RNA and targets it for degradation by the RNA exosome [65••]. These results suggest that the selective pressure exerted by ZAP drives vertebrate RNA viruses to reduce the levels of CpG dinucleotides in their genomes. Thus, increasing the number of CpG dinucleotides in viral genomes could be responsible for viral attenuation, because viral RNA with high CpG content is better recognized and then removed from the cytoplasm. Since ZAP expression is induced by interferon, viruses that can block interferon responses or counter the action of ZAP should be resistant to the selection pressure exerted by ZAP. It remains to be determined whether ZAP is the only host factor that can recognize CpG-rich sequences, and what immune evasion strategies viruses employ to avoid the action of ZAP.

Early experiments with a recoded poliovirus that had artificially elevated CpG and UpA dinucleotides in its capsid coding region showed that recoding only minimally affected protein production, protein processing, or the overall production of viral particles, but had a significant negative effect on virus fitness, especially on the specific infectivity of viral particles [66]. The fitness of the recoded virus was reduced to the threshold of viability when CpG and UpA dinucleotides were maximized within the recoded genome segment [66].

Since both codon pair bias and CpG dinucleotides appear to influence virus replication, a study with echovirus 7 attempted to separate the effects of the two phenomena by creating mutants in which the two parameters were varied independently. The authors either increased the frequency of CpG and UpA dinucleotides in the viral genome while keeping codon pair bias constant, or vice versa [54]. Phenotypic characterization of the resulting mutants showed that only alteration of the CpG and UpA frequencies, but not of codon pair bias, had a negative effect on viral fitness [54]. Interestingly, a complementary study showed that virus mutants that lacked CpG and UpA dinucleotides in their genomes had enhanced replication, produced larger plaques, and readily outcompeted wild-type parents in competition assays [55].

A subsequent study from the same group demonstrated that elevation of CpG frequencies in influenza A virus can also result in moderate attenuation of the virus in vitro and in vivo [58]. However, since recoded viruses with increased CpG or UpA frequencies did not have the same codon bias as the control—unmodified and permuted—viruses, changes in the codon bias could confound observed viral properties. In addition, because influenza A virus antagonizes ZAP activity [67, 68], it remains to be determined if ZAP alone can inhibit replication of recoded influenza A viruses with increased CpG levels.

Conclusion

Although the capacity to attenuate viruses by the three alternative attenuation methods is yet to be directly compared, viruses that are designed by codon pair deoptimization show consistently high levels of attenuation, to the extent that some codon pair-deoptimized viruses are nonviable in permissive cells [21, 52].

The major drawback of attenuation by alteration of codon, codon pair, or CpG dinucleotide frequencies is that the molecular mechanisms responsible for attenuation remain largely unknown. Until this problem is solved, it will not be possible to improve this attenuation method further and to rationally design better and safer vaccines. In addition, it will not be possible to assess reversion of attenuation based on observed genetic changes in attenuated viruses. Also, it is yet to be determined whether DNA viruses can be attenuated by the same principles as small RNA viruses.

The continuously decreasing cost of synthetic DNA might soon allow us to characterize the phenotype of thousands of differently recoded viruses. Once phenotype is connected with genotype, unbiased and agnostic approaches might be able to precisely identify sequence features that are essential and sufficient for development of highly effective and safer viral vaccines.

In contrast to existing attenuation methods, recoded vaccine candidates can be designed within minutes and produced synthetically within days. The potential applications that might originate from these approaches are immense: they could be universally applicable for attenuation of many known viruses and bacteria, and also of yet unknown viral threats as they emerge [17, 20].