Background

Messenger RNA was originally not expected to have any secondary structure, because it was simply supposed to pass through the ribosomes (as a magnetic tape passes a tape-recorder) [1] and any secondary structure was thought to interfere with the translation process. However, mRNA has considerable total folding energies (TFE) depending on the number and distribution of complementary bases (407 kcal/1000 bases, mean ± S.E.M., n = 147). The TFE is the sum of two components. First, the compositional component (CFE), which is determined only by the numbers of the four bases and their relative proportions, is equal to the dG value of shuffled sequences. Second, there is the positional component (PFE), which is determined by the position of bases and can be less or more than the CFE. This component is called Folding Free Energy (FFE) and is the difference between the dG values of intact and shuffled RNAs. Most attention has been paid to the FFE because it is required to form a unique structure, while the CFE defines numerous equally possible structures. While the CFE is a measure of random complexity, the FFE is the measure of ordered, structural complexity.

The secondary structure and complexity of mRNA became an important issue because it influences the accessibility of the mRNA to regulatory molecules (proteins, microRNA), its stability and its level of expression. In addition, new theoretical considerations and experimental evidence suggest that mRNAs may even play role in carrying and providing structural information for translated proteins and might serve as chaperons.

It is suggested that mRNA secondary structures are conserved and play important roles in (a) splicing [2], (b) control of gene expression [3], (c) discontinuous translation and pauses in protein synthesis [4, 5], (d) determining the protein secondary structure [68] and (e) regulating gene expression level and accuracy [9].

The aim of this work is to study the relationship between mRNA primary and secondary structure, base composition and structural complexity (measured as folding energy). Special attention is paid to the role of wobble bases in modifying mRNA thermodynamics.

Methods

Ninety coding sequences of proteins with known 3D structures were selected from the PDB. This selection contained 26,370 codons. Care was taken to avoid very similar structures in the selections. The propensity towards alpha helices was monitored during selection and structures with very high and very low alpha helix contents were also selected to ensure a wide range of structural representations. The possible influence of different kingdom (prokaryotes and eukaryotes) or variations in environmental conditions on mRNA folding stability were not considered in this study.

Single-stranded RNA molecules can form local secondary structures through the interactions of complementary segments. Watson-Crick (WC) base pair formation lowers the average free energy, dG, of the RNA and the magnitude of change is proportional to the number of base pair formations. Therefore, folding energy is used to characterize the local complementarity of nucleic acids.

I used a nucleic acid secondary structure predicting tool, mfold [10], to obtain dG values and the lowest dG was used in the calculations. Backtranslation and wobble base manipulations were performed by an online backtranslation tool [11]. The Backtranslation Tool (developed by Entelechon, Germany) creates DNA sequences from protein sequences using optional Genetic Code and Codon Usage Tables. Additionally, the user can determine special rules to select between available synonymous codons. I used this tool to create nucleic acid sequences where synonymous codons 1) were used accordingly to human CUT, 2) only the most frequent codons were used, 3) the codons were used with equal frequency (A = T = G = C in wobble positions) or 4) only A or C was used in wobble positions.

A JAVA tool called SeqForm, developed by us, was used to select sequence residues in predefined phases (every third in our case) and for residue replacements [12]. Amino acid collocations were detected and evaluated by another tool, SeqX [13].

Linear regression analyses and Student's t-tests were used for statistical analysis of the results.

Results and discussion

Folding free energy of coding sequences

There is disagreement in the literature regarding the positional (free) folding energy of coding sequences. It has been shown that mRNAs have greater negative folding energies than shuffled or codon choice-randomized sequences [14], but this is not generally accepted [1]. I was able to find statistically significant dG differences between intact and shuffled mRNAs (Figure 1), but this FFE is only a small fraction (~10%) of the total folding energy. Also, there were many exceptions in this pool where the shuffled sequences had dG values equal to or even lower than those of the native sequences, which indicates that the primary structure (sequence) inhibits secondary structure formation in these cases.

Figure 1
figure 1

Effect of wobble bases on the dG of CDS. The TFE of mRNA is indicated in native sequences (CDS), after residue randomization (shuffle) and after the indicated manipulation of the wobble bases. (For details see the text). Each column represent the mean ± S.E.M.; n is indicated in the columns.

The simple measurement of global folding energy is often criticized and is not regarded as a reliable measure of mRNA structure conservation [1]. There is compelling evidence for conserved, local secondary structures in the coding regions of eukaryotic mRNAs and pre-mRNAs [15] and widespread selection for local RNA secondary structure in coding regions of bacterial genes [16, 17]. Therefore I agree that global folding free energy measurements probably have little value in studies of single-stranded nucleic acid structures.

The potential role of wobble basis in determining and regulating the folding energy of coding sequences

Synonymous codons are not used with equal frequency in redundant codons; however, the frequency pattern of codon usage is well conserved in the same species. The cause or reason for this bias is not known, nor is it known whether the bias has any effect on biological functions. Many biological parameters are known to correlate with codon bias, for example (a) translational selection, (b) GC composition, (c) strand-specific mutational bias, (d) amino acid conservation, (e) protein hydropathy, (f) transcriptional selection, (g) structural elements in the coded protein, (h) tRNA copy number, (i) speed of translation and even (j) RNA stability or (k) gene length (for review see [18]) [19, 20]. It is logical to expect that codon usage bias should have some effect on mRNA structure and folding energies.

Indeed, modifications of synonymous codon usage have very pronounced effects on the dG values of coding sequences (Figure 1). The TFE of our sequences (100%) was reduced to 88% by shuffling (all nucleotides, not only the third), to 87% by equalization of wobble base usage frequencies, and to 82% by back-translation of the protein sequence using the uniformly human Codon Usage Table (CUT). Much greater dG reduction was achieved when only the most frequent wobble bases were permitted (reduction to 38% of the intact mRNA) or when only the bases A or C were permitted in the wobble position (reduction to 22% of the intact molecule). It is possible to accomplish four-fold changes in mRNA folding energy by changing only the wobble bases, with no change in the primary sequence of the coded proteins. I conclude that the wobble bases are strong regulators of the total folding energy of mRNAs and probably even the mRNA structure.

The literature also suggests, along with our previous results [21], that there is a connection between the 3D structures of mRNA and the protein it encodes. This possibility makes the idea of wobble base regulation of nucleic acid structure even more interesting, because it indicates that the excess information in the redundant codon (carried by the wobble bases) may be used to modify or regulate the protein structure. To explore this possibility we compared the TFEs of intact and wobble base-modified mRNA sequences to the propensity towards different structural elements in the coded proteins (Figure 2). There is a positive correlation between mRNA TFE and the frequency of helices in the coded proteins. In contrast, the correlation between dG and beta sheet frequency is negative. This means that RNA complexity is proportional to beta sheet-type and other amino acid collocations, but inversely related to alpha helix frequency. The relationship between mRNA and protein structure will definitely be the subject of further evaluation because of its fundamental importance for further understanding the translation process and protein folding. The correlation between nucleic acid composition (sequence), nucleic acid folding energy (structure) and protein structural elements (helix, sheet) is a strong indication that protein-folding information is present in the redundant exon bases [8, 21], and coding sequences function as chaperons [7].

Figure 2
figure 2

Correlation between dG of CDS and Structural Elements of the Coded Protein. The relative frequencies of the two main structural elements in 77 proteins are plotted against the folding energy of their coding sequences.

Correlation between mRNA sequence composition and total folding energy

There is a strong, negative, linear correlation between the length (L) of a protein coding sequence (CDS) and its dG as well as its G+C (guanine + cytosine) content and folding energy (Figure 3). C+G content is known to have major influence on the TFE of a nucleic acid [22]. The total C+G content varied from 27 to 70% in our sequence collection (50.3 ± 8.2, mean ± SD, n = 80). The distribution of C+G differed between the different codon positions: more C+G was found in the first and third codon positions than in the second. Consequently, the FEs between RNA subsequences comprising the first and third codon positions were significantly lower than those involving the subsequences formed by 2nd codon residues (Figure 4).

Figure 3
figure 3

Correlation between the Length (a), G+C content (b) and the TFE of mRNAs.

Figure 4
figure 4

Distribution of C+G and Folding Energy in Codon-Related Subsequences of mRNA. Codons were identified in intact mRNA sequences and phase-selected for subsequences containing only the first (1_1), second (2_2), third (3_3) codon letters or combinations of these subsequences. Subsequence 1_2 means, for example, that the 3rd codon letters were removed from the original mRNA. The G+C content was counted and the dGs were measured by mfold. Each column represent the mean ± S.E.M of n = 87 determinations. a: FE of intact mRNAs and subsequences, b: G+C content of intact mRNAs and subsequences.

The TFE represents the 3D complexity of the structures formed by nucleic acids where Watson-Crick base pairs play a fundamental role. There are 3 H bonds between C and G (dG = -1524 kcal/1000 bases) but only two between A and T (dG = -365 kcal/1000 bases). This more than 4-fold dG difference explains the dominance of C+G over A+T in determining TFE.

Two major factors determine the folding energies of nucleic acids: first, the G+C content or (G+C)/(A+T) ratio; second, the availability of bases for Watson-Crick base pairing, which may be characterized by the 1000–2|G-C|-2|A-T| values. The |G-C| and |A-T| values are the absolute difference between WC pairs, i.e. the non-pairing fraction. The value 1000 – (non-pairing fractions) will give the maximum possible number of WC pairs in a 1000 base-long sequence.

There are strong linear correlations between these values (based on the primary sequence of the mRNA) and the free energy (Figure 5). The correlation is strong enough to be able to predict the dG value of a nucleic acid from the base composition of its primary sequence. This indicates that the relationship between nucleic acid sequence composition and folding energy is simple. I used coding sequences in this experiment, but this relationship is not unique to CDS; it is generally valid for all kinds of nucleic acids. The nucleotides at the wobble positions have a large influence on the folding energy of mRNAs. Synonymous codons are seemingly equivalent to each other in respect of amino acid coding, but they are not at all synonymous or freely interchangeable with regard to mRNA folding energy and secondary structure.

Figure 5
figure 5

Correlation between nucleotide composition and Folding Energy in mRNA sequences. The wobble base compositions of 78 mRNA sequences (ORIG) were modified to contain equal numbers of synonymous codons (EQUAL); only one, the most frequent synonymous codon (1 MAX); or A or C, but not T or G (A OR C). Predictions of dG were made using the different equations in the figure and plotted against the mfold values determined.

Contribution of synonymous codons to the folding energy of coding sequences

To analyze the role of individual codons on the mRNA structure further, I designed 64 different artificial nucleic acid sequences. Each sequence contained a 100-fold repetition of a single codon, i.e. they were repeating poly-codons. The sequences (64 × 64) were hybridized using the DINAMelt Server two-stage hybridization tool [23]. The 4096 codon affinity (dG) values were sorted into 400 subgroups corresponding to the 20 × 20 coded amino acid pairs. The 400 accumulated (summed) codon affinity values are given in Table 1.

Table 1 Accumulated Affinity between Codons

Fifty-one of the 400 possible amino acid pairs are coded by codons with no affinity for each other. The accumulated dG values of the codons encoding the remaining 349 amino acid pairs show statistically highly significant positive correlations to the calculated and real frequencies of amino acid pairs in real proteins (Figure 6).

Figure 6
figure 6

Correlation between the affinity (dG) of artificial poly-codons and the propensity for the coded amino acid pairs. The affinities of artificial poly-codons for each other (64 × 64 = 4096 dG) were determined. Synonymous codons were combined and added to 20 × 20 = 400 subgroups corresponding to the possible pairs formed by the respective coded amino acids. Frequency of calculated amino acid (AA)-pairs= sqrt [(# of synonymous codon of amino acid A)x(# of synonymous codon of amino acid B)]. Frequencies of real AA-pairs were calculated from the real propensity towards amino acids in the proteins examined using the equation = sqrt [(propensity of amino acid A)x(propensity of amino acid B)]. Frequencies of co-locating AA-pairs were measured from crystallographic structures using the SeqX tool. Two linear correlation coefficients were calculated, one excluding and the other () including dG = 0 values.

This chain of correlations indicates that synonymous codon frequency has a strong positive effect on the accumulated affinity of codons in the nucleic acid (mRNA structure) and the relative frequencies of amino acids in the coded protein generally and co-locating amino acids especially (protein structure). I interpret this result as supporting the view that the effects of synonymous codons on nucleic acid structure and protein composition are additive and they are not interchangeable with each other in these respects.

An additional chain of evidence that suggests the non-equality of synonymous codons arises from studies on single-nucleotide polymorphisms (SNPs). Synonymous SNPs do not produce altered coding sequences, so they are not expected to change the function of the protein in which they occur. However, they often do. Synonymous SNPs in the Multi-drug Resistance 1 (MDR1) gene alter the conformation of the protein product, the P-glycoprotein (P-gp), while the mRNA and protein levels remain similar in wild-type and polymorphic P-gp [24, 25]. Synonymous mutations in the human dopamine receptor D2 (DRD2) affect mRNA stability and synthesis of the receptor [26]. Synonymous mutations affect splicing and are not neutral in evolution [27, 28]. After all, SNPs are not silent and not invisible in many cases [29, 30]. It might be hypothesized that the presence of a rare codon, marked by the synonymous polymorphism, affects the timing of co-translational folding and thereby alters the structure and function of expressed protein. My recent study provides additional evidence in that direction.

There are numerous theories and speculations regarding the Genetic Code because of its redundancy. The 64>20 canonical code contains 3-times more information than it is necessary for determining amino acids. The excess information is stored in the wobble bases. A useful property of the Genetic Code is the minimization of the effects of frame-shift translation errors. However it is more and more accepted that coding sequences convey, in addition to the protein-coding information, several different biologically meaningful signals at the same time. These "parallel codes" include binding sequences for regulatory [3134]] and structural proteins [35], signals for splicing [36], and RNA secondary structure [16, 3739] It was recently noted [40] that the universal genetic code can allow arbitrary sequences of nucleotides within coding regions much better than the vast majority of other possible genetic codes. This ability to support parallel codes is strongly correlated with an additional property–minimization of the effects of frame-shift translation errors.

My study at hand is a specific case where the folding free energy of an mRNA is used as a measure for adapting structural features and the suggestions that wobble bases are important in carrying parallel (structural) codes are not entirely novel. However it is an additional indication that the interchangeability of synonymous codons is rather restricted: the universal genetic code is optimal as it is.

The reported correlations are not backed up with an evolutionary model and doesn't show that the RNA secondary structure has exerted a selection pressure on the evolution of the canonical genetic code. Therefore these correlations are not necessarily genuine and they may simply be secondary by-product of other optimization processes. However, the existence of these correlations per se provide strong support for the presence and importance of parallel codes in the universal, canonical codon.

Conclusion

The genetic code is redundant in regard to the meaning of codons in translation. However, this redundancy seems to have its own meaning. Synonymous codons are interchangeable in regard to amino acid coding, but their effect is individual and additive in respect of their role in determining the FEs of coding sequences and the relative frequencies of amino acids in the translated protein sequences and co-locating amino acid pairs. This might explain why wobble base modifications often have a large effect on the efficiency of gene expression and folding of the protein product [4148].