The GeneOptimizer Algorithm: using a sliding window approach to cope with the vast sequence space in multiparameter DNA sequence optimization
One of the main advantages of de novo gene synthesis is that it frees the researcher from any limitations imposed by the use of natural templates. To make the most of this opportunity, efficient algorithms are needed to calculate a coding sequence that combines different requirements, such as adapted codon usage or avoidance of restriction sites, in the best possible way. We present an algorithm where a “variation window” covering several amino acid positions slides along the coding sequence. Candidate sequences are built comprising the already optimized part of the complete sequence and all possible combinations of synonymous codons representing the amino acids within the window. The candidate sequences are assessed with a quality function, and the first codon of the best candidate’s variation window is fixed. Subsequently, the window is shifted by one codon position. As an example of freely accessible software implementing the algorithm, we present the Mr. Gene web application. Additionally, two experimental applications of the algorithm are shown.
Keywords: Gene synthesis; Codon optimization; Sequence optimization algorithm; Synthetic genes; Expression optimization
In many cases de novo gene synthesis has become the preferred access route to biological DNA sequences. As the prices for synthetic genes have dropped considerably over the past years, gene synthesis today often outcompetes classic genetic engineering methods economically. Another great advantage over the use of naturally occurring templates is that the synthetic gene can literally be designed on an “electronic drawing board”. Not only is it possible to freely add genetic elements such as a promoter or restriction sites flanking the coding region of a gene, but the coding sequence itself can also be optimized for specific experimental requirements. This is possible because most amino acids are encoded by more than one codon, and some by as many as six. Therefore, the DNA sequence can be altered without changing the corresponding amino acid sequence.
One of the most common applications is the adaptation of the codons used to the specific codon usage of a heterologous expression host. In the simplest form of optimization, a plain backtranslation is performed, where each amino acid is represented by the synonymous codon most frequently used in highly expressed genes of the production host. Obviously, this procedure can generate undesired restriction sites, expression-constraining sequence elements, repetitive base stretches, etc. On the other hand, it may be desirable to introduce certain DNA motifs or to avoid similarities to naturally occurring sequences. The challenge is therefore to find the sequence that represents the best compromise between different and sometimes conflicting requirements. Without doubt, the best solution would be to generate all possible combinations of codons representing a given amino acid sequence, assess all of them with a quality function and finally choose the one with the highest quality score. Unfortunately, the number of possible combinations is in the range of 10^47 even for a rather small protein of 100 amino acids, making the outlined approach impossible to perform in practice.
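The size of this sequence space can be checked with a short calculation. The following is a minimal sketch, assuming the standard genetic code; the function name `count_encodings` is introduced here for illustration only.

```python
from math import prod

# Number of synonymous codons per amino acid in the standard genetic code
# (one-letter amino acid codes; the 61 sense codons in total).
CODON_COUNTS = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def count_encodings(protein: str) -> int:
    """Number of distinct DNA sequences that encode the given protein."""
    return prod(CODON_COUNTS[aa] for aa in protein)


if __name__ == "__main__":
    # Two residues: leucine (6 codons) followed by glutamate (2 codons).
    print(count_encodings("LE"))
```

With an average of roughly three synonymous codons per residue, a protein of 100 amino acids has about 3^100 ≈ 5 × 10^47 possible encodings, which is why an exhaustive search over the full sequence is out of reach.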
One way to reduce the sequence space that has to be evaluated is to rely on statistical optimization methods. In multiple iterations, synonymous codons are exchanged at random, while the choice of a certain synonymous codon may be controlled by a probability distribution based on the codon usage of the host organism (Villalobos et al. 2006). In each iteration the resulting sequence is assessed with respect to the desired parameters. If the codon change improves the overall quality score, the change is kept; otherwise it is discarded. In a variation of the method, known as “simulated annealing”, codon changes leading to a worse sequence score are also allowed. In this case, the probability of acceptance for a new sequence is controlled by a “temperature parameter”-dependent Boltzmann distribution. This means that at the beginning of the optimization process, when the “temperature” is high, changes leading to a worse overall score are more likely to be accepted than in later iterations with a lower temperature parameter (Hoover and Lubkowski 2002; Rocha et al. 2008).
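The acceptance rule of simulated annealing can be illustrated as follows. This is a generic sketch of the Metropolis criterion, not the implementation used in the cited works; the function name `accept` and the convention that a higher score is better are assumptions made for this example.

```python
import math
import random


def accept(delta: float, temperature: float, rng: random.Random) -> bool:
    """Decide whether a codon change is kept.

    delta is new_score - old_score (higher scores are better). Improvements
    are always accepted; a worse score is accepted with the Boltzmann
    probability exp(delta / temperature).
    """
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)


if __name__ == "__main__":
    rng = random.Random(0)
    for temperature in (10.0, 1.0, 0.1):
        kept = sum(accept(-1.0, temperature, rng) for _ in range(10_000))
        print(f"T={temperature}: {kept / 10_000:.2f} of worse changes kept")
```

At a high temperature almost every worsening change is kept, whereas at a low temperature the procedure approaches the strict "keep only improvements" variant described first.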
While these methods often lead to sequences with well balanced overall properties, they also suffer from several drawbacks.
Many of the optimization parameters to be taken into account represent local sequence properties, spanning a region of a few dozen bases, rather than global phenomena. This is obvious for short sequence motifs, such as restriction sites, splice site recognition patterns, etc.
Regarding properties like the GC content it is normally much less important to achieve a certain overall GC content than to avoid spikes in GC distribution, i.e. short sequence stretches with either very high or very low GC content.
The former assertion can also be extended to inverse repetitions, where long distance inverse repetitions are considered to be biologically less important than neighbouring ones, which can form stable hairpin loops, and several other features. As Monte Carlo Methods take only a tiny fraction of the whole sequence space into account, in most cases a less than optimal solution with respect to the theoretically ideal combination of codons representing the desired properties will be found in finite time.
Although it is impossible to assess all possible codon combinations representing a given amino acid sequence, the considerations above make clear that, for many sequence features, it is acceptable to reduce the search space by performing an exhaustive search for the best solution only inside a small “variation window”, which is moved along the whole sequence.
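The variation window idea can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the GeneOptimizer implementation: the toy codon table `CODONS`, the preference values in `PREFERENCE`, the penalty of 10 for a BamHI site (GGATCC) and the function names are all hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List

# Hypothetical toy data: three amino acids and made-up codon preferences.
CODONS: Dict[str, List[str]] = {
    "M": ["ATG"], "G": ["GGA", "GGC"], "S": ["TCC", "AGC"],
}
PREFERENCE = {"ATG": 1.0, "GGA": 1.0, "GGC": 0.8, "TCC": 1.0, "AGC": 0.7}


def quality(seq: str) -> float:
    """Toy quality function: codon preference minus a penalty per BamHI site."""
    codon_score = sum(PREFERENCE[seq[i:i + 3]] for i in range(0, len(seq), 3))
    return codon_score - 10.0 * seq.count("GGATCC")


def optimize(protein: str,
             codons: Dict[str, List[str]],
             quality: Callable[[str], float],
             window: int = 3) -> str:
    """Slide a variation window along the protein, fixing one codon per step."""
    fixed = ""
    for pos in range(len(protein)):
        residues = protein[pos:pos + window]
        best_score, best_first = float("-inf"), ""
        # Exhaustive search over all codon combinations inside the window,
        # each appended to the already optimized part of the sequence.
        for combo in product(*(codons[aa] for aa in residues)):
            score = quality(fixed + "".join(combo))
            if score > best_score:
                best_score, best_first = score, combo[0]
        fixed += best_first  # only the first codon of the best window is fixed
    return fixed


if __name__ == "__main__":
    print(optimize("MGS", CODONS, quality, window=1))
    print(optimize("MGS", CODONS, quality, window=2))
```

In this toy case a window of one codon first commits to the preferred glycine codon GGA and must then fall back to the poor serine codon AGC to avoid the restriction site, whereas a window of two codons sees the conflict in advance and chooses GGC followed by TCC, which gives the higher total score.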
Materials and methods
All shown or described sequence optimizations were performed with the GeneOptimizer software suite, which was developed in-house by the Geneart Corporation.
Transient expression of Mip1-alpha in 293T cells
MIP1-alpha alleles were cloned via 5′ XhoI–EcoRI 3′ into pcDNA3.1 (Invitrogen). Recombinant plasmids were produced in E. coli (XL10 Gold) and purified using the Endo-free Qiagen Maxi Kit (Qiagen) according to the manufacturer's instructions. Ten μg of each plasmid were transfected into 293T cells using the Ca-Phosphate method (Graham and van der Eb 1973). Cell culture supernatants were harvested 48 h post transfection. Amounts of secreted MIP1-alpha were determined using a commercial sandwich ELISA Kit (R&D Systems, mouse CCL3/MIP-1α).
Protein expression in E. coli
Standard and methylation site-optimized expression constructs were transformed into E. coli BL21(DE3). Two independent colonies were inoculated into 0.5 ml Luria–Bertani (LB) broth containing kanamycin (50 μg/ml), and grown overnight at 30°C with shaking at 160 rpm. Overnight cultures were then diluted in 50 ml of freshly prepared LB broth containing kanamycin (50 μg/ml). Cells were grown to an OD600 between 0.5 and 0.7 at 37°C and induced with 1 mM IPTG. After induction, cells were shifted to 30°C and grown for a further 4 h. Cells were harvested by centrifugation at 4,000 × g for 10 min, resuspended in 5 ml lysis buffer (PBS, 1% Triton X100, 20 μl ProteaseInhibitor), and flash-frozen in liquid N2. Lysis was performed by one freeze/thaw cycle, followed by the addition of 20 μl lysozyme (0.5 mg/ml), incubation on ice for 10 min and sonication for 30 s.
Quantification of expression
Protein concentration was measured using the DC Protein Assay (Bio-Rad), and equal amounts were loaded on 4–20% SDS–PAGE gels (Invitrogen) for Western Blot analysis. Western Blot signals were detected using BM Chemiluminescence Western-Blotting-Substrate (POD) (Roche) or SuperSignal West-Femto-Maximum-Sensitivity-Substrate (ThermoScientific) and quantified using GelProAnalyzer-Software6 (INTAS). Corresponding standard and methylation site-optimized constructs were analyzed in triplicate on the same gel using α-Penta-His antibodies. Quantified results were averaged and the ratio of standard to methylation site-optimized construct was determined. Lysate from E. coli cells transformed with the empty expression construct served as the negative control.
Presentation of the algorithm
The performance of the algorithm can be improved by the stepwise calculation of the individual quality functions, adding their score to the total score and comparing the latter with the best total score already achieved by a previously assessed test sequence. If the remaining quality functions can only contribute negatively to the total score by definition (e.g. the GC content score) and a higher total score has already been reached with a different test sequence, the calculation of the remaining quality functions can be omitted.
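The early termination described above can be sketched as follows. This is an illustrative sketch only: the split into `bonus_terms` (which may contribute positively) and `penalty_terms` (which by definition return values of zero or less), as well as the function name, are assumptions made for this example.

```python
from typing import Callable, Optional, Sequence

ScoreTerm = Callable[[str], float]


def score_with_cutoff(seq: str,
                      bonus_terms: Sequence[ScoreTerm],
                      penalty_terms: Sequence[ScoreTerm],
                      best_so_far: float) -> Optional[float]:
    """Return the total score, or None if seq cannot beat best_so_far.

    Terms that can be positive are summed first. Each penalty-only term is
    evaluated only while the running total still exceeds the best score
    already reached by another test sequence.
    """
    total = sum(term(seq) for term in bonus_terms)
    for term in penalty_terms:
        if total <= best_so_far:
            return None  # remaining terms are <= 0, so this candidate has lost
        total += term(seq)
    return total if total > best_so_far else None


if __name__ == "__main__":
    codon_term = lambda s: 3.0                       # stand-in codon score
    gc_term = lambda s: -0.5                         # stand-in GC penalty
    print(score_with_cutoff("ATGGGC", [codon_term], [gc_term], best_so_far=2.0))
    print(score_with_cutoff("ATGGGC", [codon_term], [gc_term], best_so_far=3.5))
```

Ordering the penalty-only terms so that the most expensive ones come last maximizes the saving, because those are the evaluations most often skipped.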
Suitable methods for calculating the individual Score_q values for important optimization criteria are explained below.
One important aim of sequence optimization for heterologous expression is to take into account the codon usage of the host organism by re-encoding the amino acid sequence with codons that have a high c_ij in the expression host. Simple single-parameter gene optimization is often done by choosing the codon with the highest c_ij for each amino acid. As soon as additional optimization parameters are to be considered, the c_ij values are no longer suitable for use in the quality function, because they do not allow one to compare the quality of codons encoding different amino acids. This is because the number of synonymous codons is not the same for each amino acid. For example, glutamate is encoded by two codons, while leucine has six. The second most frequently used glutamate codon may therefore have a c_ij of 0.4, surpassing the “best” leucine codon with a c_ij of 0.25.
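One common way to make codons of different amino acids comparable is to scale each codon fraction by the fraction of the best synonymous codon, which gives the relative adaptiveness known from the codon adaptation index. The sketch below illustrates this normalization with the numbers from the example above; the remaining fractions and the function name are hypothetical and do not come from a real codon usage table.

```python
from typing import Dict


def relative_adaptiveness(fractions: Dict[str, float]) -> Dict[str, float]:
    """Scale the codon fractions of one amino acid so that the best codon gets 1.0."""
    best = max(fractions.values())
    return {codon: value / best for codon, value in fractions.items()}


# Illustrative fractions only (each set sums to 1 for its amino acid).
GLUTAMATE = {"GAA": 0.60, "GAG": 0.40}
LEUCINE = {"CTG": 0.25, "CTC": 0.20, "TTG": 0.15,
           "CTT": 0.15, "TTA": 0.15, "CTA": 0.10}

if __name__ == "__main__":
    print(relative_adaptiveness(GLUTAMATE))
    print(relative_adaptiveness(LEUCINE))
```

After scaling, the best leucine codon scores 1.0 and the second glutamate codon about 0.67, so the ranking now reflects how far each codon is from the optimum for its own amino acid.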
It is a very common requirement that the optimized sequence must not contain certain DNA motifs. One reason can be that they would impede further processing of the synthetic gene, as is the case for internal restriction sites. Unintentionally introduced sequence patterns, such as promoter recognition sequences, internal ribosomal entry sites or splice sites, can also have an undesired effect in the biological system itself. On the other hand, it may be part of the requirements to computationally introduce DNA sequence motifs into the optimized sequence. This can be combined with a positional constraint, allowing one to cut the DNA at a certain position with a restriction enzyme. Another task can be to generate a sequence with as many CpG motifs as possible to enhance its immunogenicity or expression in a mammalian system (Notka et al. 2007).
Many different algorithms and methods have been developed for the prediction of biologically active sequence patterns. In the simplest case, a regular expression search can be performed, which gives a yes or no answer. This is, however, only possible for highly conserved motifs, such as restriction sites. In this case, the scoring function examines a 3-prime part of the test sequence and returns the number of sequence segments matching the regular expression, multiplied by a motif-specific “penalty” or “bonus”. To ensure that all occurrences are found, the length of the considered part of the test sequence must correspond to the length of the longest motif.
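A regular expression based motif score of this kind might look as follows. This is a minimal sketch: the motif table `MOTIFS`, its weights and the `tail_length` parameter are assumptions chosen for illustration.

```python
import re
from typing import Dict, Optional

# Hypothetical motif table: negative weight = penalty, positive = bonus.
MOTIFS: Dict[str, float] = {
    "GGATCC": -10.0,   # BamHI site
    "GAATTC": -10.0,   # EcoRI site
    "CG": 0.5,         # CpG dinucleotide
}


def motif_score(seq: str,
                motifs: Dict[str, float] = MOTIFS,
                tail_length: Optional[int] = None) -> float:
    """Sum of (occurrences x weight) over all motifs in the 3-prime tail of seq."""
    tail = seq if tail_length is None else seq[-tail_length:]
    total = 0.0
    for pattern, weight in motifs.items():
        # A lookahead matches at every start position, so overlapping
        # occurrences are counted as well.
        hits = len(re.findall(f"(?={pattern})", tail))
        total += weight * hits
    return total


if __name__ == "__main__":
    print(motif_score("ATGGGATCCAAA"))
    print(motif_score("ACGCGT", {"CG": 0.5}))
```

Restricting the search to the tail avoids recounting motifs in the already fixed part of the sequence at every window position.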
The GC content is an important characteristic of a DNA sequence. DNA with extremely high or low GC content is more difficult to handle with standard molecular biological techniques such as PCR or sequencing. The gene synthesis process itself, which is often based on the correct hybridization of oligonucleotides with a subsequent PCR, may also be impaired (Sanli et al. 2001). In the biological system, too, strong deviations from an equal GC/AT distribution can lead, for example, to genetic instability of the constructs (Lee et al. 2002).
It is nevertheless not sufficient to monitor only the overall GC content of the sequence; spikes of very low or very high GC content along the sequence must also be avoided.
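The difference between overall and local GC content can be made concrete with a windowed calculation. The sketch below is one possible formulation, assuming a squared-deviation penalty for the worst window; the function names, the target of 50% and the penalty form are illustrative choices, not the scoring used by the authors.

```python
from typing import List


def gc_windows(seq: str, window: int = 40) -> List[float]:
    """GC fraction of every full window along the sequence (running count)."""
    window = min(window, len(seq))
    is_gc = [base in "GC" for base in seq.upper()]
    count = sum(is_gc[:window])
    fractions = [count / window]
    for i in range(window, len(seq)):
        # Slide by one base: add the entering base, drop the leaving one.
        count += is_gc[i] - is_gc[i - window]
        fractions.append(count / window)
    return fractions


def gc_penalty(seq: str, target: float = 0.5, window: int = 40) -> float:
    """Score <= 0: negative squared deviation of the worst window from target."""
    worst = max(abs(fraction - target) for fraction in gc_windows(seq, window))
    return -(worst ** 2)


if __name__ == "__main__":
    balanced = "GCAT" * 20          # 50% GC everywhere
    spiked = "G" * 40 + "A" * 40    # 50% GC overall, but two extreme stretches
    print(gc_penalty(balanced), gc_penalty(spiked))
```

Both example sequences have an overall GC content of 50%, yet only the second one is penalized, because it contains one window of pure G and one of pure A.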
Repetitions, inverse repetitions
DNA stretches of high similarity to each other (“repetitions”) in a gene can lead to genetic instability, as recombination events are fostered (Bzymek and Lovett 2001; Chen et al. 1987). Also gene synthesis methods employing a batch-Polymerase Extension Reaction of overlapping oligonucleotides will be hampered, since the partially similar oligonucleotides will cause false hybridization events to occur (Czar et al. 2009). The same is true for inverted repeats, which can, when they are sufficiently near to each other, lead to quite stable hairpin loops, which in turn can contribute to low energetic mRNA-secondary structures. Especially at the 5′ end, stable mRNA secondary structures can significantly reduce expression by hampering translation initiation (Kudla et al. 2009; Griswold et al. 2003).
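Neighbouring inverted repeats that could form a hairpin can be located with a simple exact-match search. This is a deliberately simplified sketch: it finds only perfect stems of a fixed length, whereas a realistic quality function would more likely use an alignment or an energy model. The parameters `stem` and `max_loop` and the function names are hypothetical.

```python
from typing import List, Tuple

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def hairpin_stems(seq: str, stem: int = 6, max_loop: int = 20) -> List[Tuple[int, int]]:
    """Return (start, partner_start) pairs of perfect inverted repeats.

    A hit means that seq[start:start + stem] can pair with
    seq[partner_start:partner_start + stem], with at most max_loop
    unpaired bases between the two arms.
    """
    hits = []
    for i in range(len(seq) - 2 * stem + 1):
        arm = reverse_complement(seq[i:i + stem])
        first = i + stem                                  # loop length 0
        last = min(first + max_loop, len(seq) - stem)     # longest allowed loop
        for j in range(first, last + 1):
            if seq[j:j + stem] == arm:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # GCGCAA ... TTGCGC can fold back on itself around a TTTT loop.
    print(hairpin_stems("CCGCGCAATTTTTTGCGCCC"))
```

Limiting `max_loop` reflects the point made above that only inverted repeats close to each other are expected to form stable hairpins.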
Homologies to reference sequences
A third application of sequence alignment in the quality function is the avoidance of similarities to a given reference sequence. This can be important for the development of DNA vaccines, where recombination events between the vaccine and the wildtype virus must be avoided. Another example is siRNA-resistant genes, which are used to restore the gene function after the original gene has been silenced. When the original phenotype can be restored, the change in phenotype can be attributed to the silenced gene with higher confidence than with siRNA silencing alone (Dong-Ho and Rossi 2003). Such genes can be optimized for increased expression and, at the same time, for reduced homology to the wildtype gene. Again, a local alignment between a terminal part of the test sequence and the reference sequence can be used in the quality function.
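A local alignment score of the kind mentioned above can be computed with the Smith-Waterman recurrence. The following is a generic textbook sketch with linear gap costs; the scoring values (match 1, mismatch -2, gap -2) and the function name are assumptions, and the source does not state which alignment method or parameters were actually used.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 1,
                          mismatch: int = -2,
                          gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between a and b."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for char_a in a:
        cur = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if char_a == char_b else mismatch)
            # A cell never drops below 0, which makes the alignment local.
            cell = max(0, diagonal, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


if __name__ == "__main__":
    tail = "AAAGATTACATTT"       # 3-prime part of a test sequence
    reference = "CCGATTACAGG"    # reference to which similarity is unwanted
    print(local_alignment_score(tail, reference))
```

In a quality function, this score would be multiplied by a negative weight so that candidates resembling the reference sequence are disfavoured.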
Exemplary effects of various quality functions
To exemplify the effect of various quality functions, the DNA sequence coding for the green fluorescent protein from Aequorea victoria (GenBank: X83960.1) is at first optimized only to the highest possible CAI with respect to E. coli K12 codon usage (Kazusa). In subsequent optimizations a two-parameter quality function is used, which also accounts for a desired GC content of 50% (within a 40 nucleotide window), and the weighting factor for this parameter is increased stepwise. To visualize the properties of the resulting sequences, the codons used are classified by their w_ij × 100 value and plotted as a histogram, and the GC content is calculated within a window of 40 bases and plotted against the sequence position.
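For reference, the CAI is conventionally defined as the geometric mean of the relative adaptiveness values of all codons in a sequence. The sketch below computes it for a toy table; the values in `W` are made up for illustration and are not taken from the E. coli K12 table used here.

```python
import math
from typing import Dict

# Hypothetical relative adaptiveness values for a few codons.
W: Dict[str, float] = {"GAA": 1.0, "GAG": 0.5, "CTG": 1.0, "CTA": 0.25}


def cai(seq: str, w: Dict[str, float]) -> float:
    """Codon adaptation index: geometric mean of w over all codons of seq."""
    usable = len(seq) - len(seq) % 3          # ignore an incomplete last codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    log_sum = sum(math.log(w[codon]) for codon in codons)
    return math.exp(log_sum / len(codons))


if __name__ == "__main__":
    print(cai("GAACTG", W))   # only best codons
    print(cai("GAGCTA", W))   # only second-choice codons
```

A sequence built exclusively from the best codon of each amino acid reaches the maximum CAI of 1, which corresponds to the first optimization step described above.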
These snapshots of consecutive optimization/analysis cycles illustrate a typical approach in determining a suitable set of weighting parameters for the quality functions to achieve a sequence representing the desired characteristics. However, once a suitable set of parameter values has been determined, these values can be used as a good starting point for similar optimizations, e.g. “optimization for expression in E. coli” and then often the optimization will yield a satisfying sequence with the first run.
As the tendency in processor development is towards multicore processors rather than a further increase in clock speed, it is noteworthy that the algorithm offers several opportunities for parallelization. This may, for example, be done by distributing the calculation of the quality function for a certain position of the variation window over several processor cores, i.e., by dealing with an equal number of window position-specific codon combinations on each core. It is important that the threads can exchange information about the best total score obtained for any of the already assessed combinations, so that the calculation of quality functions can still be performed stepwise and cancelled as soon as it becomes obvious that a certain combination can no longer reach a better score than the already established best score.
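The partitioning of codon combinations across workers can be sketched as follows. This is a simplified illustration with hypothetical function names. It omits the shared best-score cutoff described above, and it uses a thread pool for brevity; in CPython, a process pool would be needed to obtain a real speed-up for CPU-bound scoring.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import Callable, List, Sequence, Tuple

Result = Tuple[float, Tuple[str, ...]]


def best_in_chunk(prefix: str,
                  combos: Sequence[Tuple[str, ...]],
                  quality: Callable[[str], float]) -> Result:
    """Best (score, combination) within one worker's share of the window."""
    scored = ((quality(prefix + "".join(combo)), combo) for combo in combos)
    return max(scored, key=lambda item: item[0])


def parallel_best(prefix: str,
                  choices: List[List[str]],
                  quality: Callable[[str], float],
                  workers: int = 2) -> Result:
    """Split all codon combinations of one window position across workers."""
    combos = list(product(*choices))
    # Round-robin split gives each worker a nearly equal number of combinations.
    chunks = [combos[i::workers] for i in range(workers)]
    chunks = [chunk for chunk in chunks if chunk]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda chunk: best_in_chunk(prefix, chunk, quality), chunks)
        return max(results, key=lambda item: item[0])


if __name__ == "__main__":
    toy_quality = lambda s: s.count("C") - 10 * s.count("GGATCC")
    print(parallel_best("ATG", [["GGA", "GGC"], ["TCC", "AGC"]], toy_quality))
```

Because every worker returns only its local optimum, the final comparison is cheap; the price of this simplification is that no worker can stop early on the basis of a better score found elsewhere.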
To give an example of the actual time requirements, a typical optimization run was performed on a coding sequence comprising 738 codons. All required algorithms were implemented in VB.NET and executed on a standard personal computer (Windows XP, AMD Athlon 64 X2 Dual Core Processor 5200+, 2.60 GHz, 3 GB RAM). For the optimization, a Homo sapiens codon usage table was employed, in which codons with a relative adaptiveness w < 0.30 were not taken into consideration. When a quality function comprising the codon quality, GC content and the check for sequence motifs was used, with m = 5, a runtime of 70 s was measured. With inclusion of the check for repetitions, the runtime increased to 96 s.
Application example Mr. Gene
The workflow starts with the input of the original sequence, which can either be provided as DNA or amino acid sequence. The user may also choose between several optimization templates for different common organisms or opt for proceeding without optimization. The templates not only provide for the correct codon usage table, but also control the other optimization parameters, such as which DNA motifs to exclude from the optimized sequence by default, or the weight of the different optimization goals in the quality function. This especially helps non-experts to successfully design and optimize their genes on their own.
Besides the coding sequence, the user can also provide 5-prime and 3-prime untranslated regions and an optional cloning vector.
In the next step, additional motifs to be excluded from the optimized sequence can be chosen, either from a repository of common motifs, such as restriction sites, or by manually entering an individual nucleotide pattern. In this step the organism-specific codon usage table may also be altered.
The provided data is now used to compute the optimized sequence, which is then presented to the user on the next screen of the assistant. In case some undesired DNA patterns—such as recognition sites or extended repetitive elements—could not be eliminated automatically from the sequence, these sites are listed on the upper part of the screen. In the lower part, the sequence is presented inside a codon editor, which allows each used codon to be altered individually by the user. This is done by simply clicking on the relevant sequence position and choosing a synonymous codon from the appearing drop-down list. The list of problematic sequence parts is also coupled with the codon editor, where the selected sequence part is highlighted. After the sequence has been edited manually, it can be re-checked anytime as to whether problematic sites are still present.
A further screen provides a graphical comparison of the properties of the original sequence to the optimized sequence. Included are a codon usage histogram and two plots showing the distribution of codon usage and GC content along the sequence.
The final sequence can either be directly ordered for synthesis at the Mr. Gene company, saved as project, or exported as a PDF file.
Application examples of the algorithm
It has already been pointed out that a distinct feature of the algorithm is the ability to introduce defined DNA motifs into the optimized sequence, where the number of generated patterns can be controlled by the relative weighting of the optimization parameters (e.g. codon usage versus bonus per introduced motif). This greatly facilitates studying the relationship between the number of certain DNA patterns in a coding sequence and a supposedly associated effect.
A second experiment was designed to illustrate the versatility of the algorithm in implementing different optimization strategies known from the literature. Their effects on the expression of one exemplary protein, the murine macrophage inflammatory protein Mip1-alpha, are also demonstrated.
For the first sequence variant, a simple backtranslation was performed by using a solely CAI-based quality function in the optimization, i.e. for each amino acid, the synonymous codon with the highest relative adaptiveness value was chosen.
The variants including the wildtype sequence were synthesized by Geneart and 293T cells were used to transiently express the genes. The amounts of secreted protein were measured using a commercially available ELISA.
We have presented a deterministic algorithm for the optimization of a coding sequence, which has significant advantages over stochastic methods especially as far as local sequence properties are concerned. This is most obvious with the task of introducing a defined motif into the sequence, which may only be possible with one specific combination of codons within the variation window. On the other hand, it is possible to find a codon combination that eliminates a weight matrix-defined motif by changing the nucleotides most important for the biological activity, often without compromising other important sequence properties.
It might be argued that the directionality of the algorithm (i.e. optimizing the sequence from the 5-prime start to the 3-prime end) is a disadvantage compared to stochastic methods, which normally take a more global approach in assessing the sequence within each iteration of the optimization process. For example, consider an amino acid sequence that contains two identical sequence parts, one at the beginning and one at the end of the sequence. When the aim is to achieve a good CAI and at the same time avoid extensive repetitions in the optimized DNA sequence, the presented algorithm can only eliminate the repetition by introducing worse synonymous codons (with respect to their codon usage) in the second sequence part, while a stochastic algorithm is able to distribute the worse codons evenly between the two parts. However, it has been shown that rare codons are much better tolerated at the 3-prime part of the sequence than at the beginning, which actually turns the supposed disadvantage into an advantage (Goldman et al. 1995; Vervoort et al. 2000).
While we have shown how a number of important sequence properties can be accounted for in the quality function, it is obvious that further optimization parameters, for example the consideration of codon pairings (Gutman and Hatfield 1989), can easily be included in the calculation of the total quality score.
The authors would like to thank Thomas Hofmeister, Andreas Wolf and Wolfgang Strenzl for their efforts in implementing the algorithm in different software applications. DR thanks Jörg Enderlein for his support in the course of the development of the GeneOptimizer software application. This work was supported by the Bundesministerium für Bildung und Forschung (BMBF) through grants 0313850 (“OLIGO” to R.W.) and 0313687 (“BioChancePLUS” to R.W.).
Conflict of interest
The authors declare competing financial interests: Geneart AG performs gene design optimization as a free service with the genes that it sells.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- Ikemura T (1981) Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J Mol Biol 151:389–409. doi: 10.1016/0022-2836(81)90003-6
- Lee SG, Kim DY, Hyun BH, Bae YS (2002) Novel design architecture for genetic stability of recombinant Poliovirus: the manipulation of G/C contents and their distribution patterns increases the genetic stability of inserts in a Poliovirus-based RPS-vax vector system. J Virol 76(4):1649–1662