Background

Messenger RNA (mRNA) sequences encode the amino acid sequence of the protein but may also bear additional information. For example, certain synonymous codons may improve translation [13] and a variety of motifs may regulate expression at the level of translation, cellular localization, decay or splicing [49]. Many of these motifs are secondary structures, and eukaryotic mRNAs contain regulatory structures in their 5' and 3' UTRs [1015], or introns [16, 17]. However, it remains unclear whether secondary structures in the coding regions are of functional importance. Laboratory studies suggest that local secondary structures within coding regions can interfere with translation [18, 19], and one may therefore expect selection against structures that are too stable. Surprisingly, however, several bioinformatic studies find that RNA structures within the protein coding regions are more stable than expected by chance [2023] (but see [24] for opposing result). These studies used various algorithms to predict the secondary structures of mRNA sequences, and then compared the free energy values of these structures to the values for randomized sequences.

Here, I test for selection on mRNA secondary structure using another approach. Instead of testing for unexpected structural stabilities, I directly compare the selective constraint of sites that differ in their importance for the predicted secondary structure. I.e. I predict the secondary structure of coding yeast sequences using different algorithms, and for each nucleotide, I identify whether it is paired with another nucleotide, or unpaired. I assume paired sites are more important for the predicted secondary structure than unpaired sites. If there is selection for secondary structures, one might expect higher structural constraint at paired than at unpaired sites. Such constraint would affect synonymous codon use and substitution rates. In S. cerevisiae a relationship between codon use, tRNA abundance and expression level indicates that codon use is affected by selection for translationally optimal codons [1]. If there is selection for mRNA structure, structurally important sites may be under conflicting selection pressures: a codon might support the preferred mRNA structure that is translationally non-optimal. Under structural selection, one might expect lower numbers of optimal codons at paired than at unpaired sites. If mRNA structure is conserved across species, one might further expect lower numbers of synonymous substitutions at paired than at unpaired sites; possible compensatory substitutions however may make the latter test predictions less clear-cut. When a mutation occurs at a paired site and disrupts the pairing ability, a second compensatory mutation on the corresponding paired site may restore the pairing ability [25, 26]. Compensatory mutations may increase substitution numbers at paired sites. Innan and Stephan [27] show however, that unless selection against deleterious intermediates is very small, substitutions should occur only very slowly in paired regions [27].

Accurate structure prediction is obviously crucial for these tests. In several studies [2022], mRNA structures are predicted by thermodynamic properties using the minimum free energy (MFE) algorithm [28] only although taking the whole ensemble of possible structures and comparative information into account is known to increase predictive accuracy [2932]. I therefore predict the secondary structures by thermodynamic and comparative information (RNA- and ALIfold [33]), using the minimum free energy (MFE) algorithm and McCaskill's partition function of the thermodynamic equilibrium [34].

Results of this study are consistent with selection upon mRNA structures: numbers of conserved optimal codons and synonymous substitutions are reduced at structurally important sites.

Methods

Choice of study organism & data

I focus on Saccharomyces cerevisiae, as this model eukaryote is well studied, with genome sequences available for it and several related species. Importantly yeast allows using optimal codon numbers to investigate alternative selective constraints while controlling for effects of base composition. This is because (i) translational selection has been investigated extensively and supported in yeast [13]: certain translationally "optimal" codons increase in frequency with expression level and correspond to the most abundant tRNAs in the cell or to the tRNA with which they form the strongest binding. (ii) Crucially, translationally optimal codons in yeast are not biased towards GC-ending codons, as in many other Eukaryotic organisms. In yeast 12 optimal codons end with G or C (-GC), 12 with A or T (-AT). To control for base composition is important as RNA secondary structure predictions are – at least partly- based on thermodynamic properties and will therefore be affected by GC content: GC nucleotides form the most stable binding with three hydrogen bonds and will consequently more likely be paired in the structure. From the yeast alignments provided by Kellis et al. [35] comparing Saccharomyces cerevisiae with S. paradoxus, S. mikatae and S. bayanus, I use 492 genes that have start and stop codons but no premature stop codons or frame-shifting indels in all four species.

Secondary structure

I predict the secondary structure of the coding sequences using the below methods and identify for each nucleotide whether it is paired with another nucleotide, or unpaired. I assume paired sites are more important for the predicted secondary structure than unpaired sites. Note however, that unpaired sites may well be important for maintaining the mRNA's tertiary structure.

Secondary structure prediction methods

The thermodynamic stability of a secondary structure is measured as the amount of free energy released or used by forming base pairs. Positive free energy requires work to form a structure, negative free energy releases stored work. Free energy parameters are estimated from chemical melting experiments. The widely used Minimum Free Energy (MFE) algorithm [28] computes the one single structure with the most negative energy value, that thermodynamically is hence the most likely to be formed. The MFE algorithm seems fairly accurate for short RNA sequences, for which ~73% of paired sites are accurately predicted. mRNAs however are likely to be present in a population of structures [36, 37]. Often 5–10% of structures share very similar free energy values [38], and the predicted MFE structure might just be one out of many thermodynamically similar structures. Taking all possible secondary structures of the thermodynamic equilibrium into account, McCaskill's algorithm [34] computes the most probable structure and calculates the probability that each site is paired. When taking base pairings with high probabilities, the accuracy of the prediction increases [29]. Another benefit of McCaskill's algorithm is that it is less affected by small but reasonable variations in the underlying energy parameters – while the MFE prediction is very sensitive [39, 40]. I used the RNAfold (Vienna RNA Secondary Structure[33, 41]) package to predict structures of the four yeasts separately using the MFE and McCaskill's algorithms. When using McCaskill's algorithm, I consider sites to be paired that pair with high probability (>2/3) across the structure ensemble; all other sites are considered as unpaired. With increasing sequence lengths predictive accuracy decreases presumably because of the enormous increase in the number of potential base pairings that can be made as sequence length increases [42]. I therefore look at both the complete set of genes, and at the subset of genes shorter than 800 bp.

To predict the secondary structure, one can also assume structural conservation, and compute the one consensus structure that allows the largest amount of structural conservation across homologous sequences. Especially supportive of structural conservation are sites that vary at the sequence level but retain potential of Watson-Crick pairings in the structure (co-variations). Structures predicted with the aid of comparative data appear to be more accurate than those based on thermodynamic properties alone [3032]. I use the ALIfold package [33, 43] that integrates comparative information in the prediction made with either MFE or McCaskill's algorithm and predict the consensus structures of the four yeasts together using the ALIfold default settings for co-variation weight (Φ1 = 1, and Φ2 = 1).

Optimal codon use

Codon identification is based on the S. cerevisiae sequence. Optimal codons are defined as in Kliman et al. (2003) [44]. The relative frequency of optimal codons (Fop[45]) is the ratio of optimal codons to synonymous codons. I compute the relative frequency of optimal codons for each amino acid and gene separately. For amino acids with both one AT- as well as one GC-ending optimal codon (thr, val, ile, ser), I compute the relative optimal codon frequencies of the two optimal codons per amino acid separately. Throughout the paper, the terms "optimal" and "suboptimal" will refer to translational selection.

Tests

If there is selection for secondary structures, one may expect higher constraint at structurally important (paired) than at structurally less important (unpaired) sites.

  1. (1)

    Under translational selection one may expect lower numbers of translationally optimal codons at paired compared to unpaired sites. Note that the analysis is restricted to those codons that are conserved across the four yeast species and are likely to experience stronger selection pressures. Restricting the analysis to conserved sites is crucial for the ALIfold measure, as it incorporates substitutions in its prediction: ALIfold may tend to pair conserved sites, and under translational selection conserved sites tend to have higher optimal codon use than non-optimal sites. This could generate an artificial positive correlation between optimal codon numbers and structure when considering all codons. As GC-ending optimal codons are more likely to be paired, I look at GC- and AT-ending optimal codons separately. I do this for the four yeast species separately (using RNAfold) as well as for their consensus structure (using ALIfold), using MFE as well as McCaskill's algorithm for both methods.

  2. (2)

    If mRNA structures are conserved across species one may further expect lower numbers of substitutions at paired compared to unpaired sites. As ALIfold incorporates comparative information, this test is only meaningful for structures predicted by RNAfold. Codons experiencing non-synonymous substitutions are excluded from this analysis as one may expect possible selection on mRNA structure will mainly affect synonymous substitutions, while non-synonymous substitutions will be more constrained for other reasons. To check the structural similarity and potential conservation of predicted structures across species, I first compute the relative number of base pairings per gene that are consistently, i.e. unambiguously, predicted to be paired or unpaired across species. To estimate structural constraint at synonymous sites, I count for each synonymous optimal and non-optimal codon how often the respective third codon position is paired and unpaired in the S. cerevisiae structure (RNAfold) and how often the codon is conserved or experiences a synonymous substitution compared to S. paravensis. Note that translational selection and structural selection may be counter-balancing with respect to synonymous substitution numbers. I.e. unpaired sites with high numbers of optimal codons may experience reduced synonymous substitution numbers due to translational selection while paired sites with high numbers of non-optimal codons may experience reduced synonymous substitution numbers due to structural selection. To disentangle structural selection from translational selection, I look at optimal and non-optimal codons separately as. I further look at GC- and AT-ending codons separately as mutational processes and gene conversion events may be compositionally biased [46].

Statistics

Each of our analyses generates a set of 2 × 2 contingency tables per gene and per amino acid or codon. These are divided according to whether the site is paired or unpaired in the predicted secondary structure, and whether (1) the codon is optimal or non-optimal, and whether (2) the codon is conserved or synonymous polymorphic across the four species. To combine these independent 2 × 2 tables, I use the Mantel-Haenszel Z statistic according to Sokal and Rohlf [47]. I compute joint probabilities for all tables or certain subsets. To disentangle an effect of GC content on synonymous codon use at paired sites, I combine amino acids with AT-ending ending and amino acids with GC-ending optimal codons. I exclude contingency tables when expected values were zero, tested for homogeneity and computed the joint odds ratio (WMH) and its significance, including the continuity correction. I orient the odds ratio such that selection in favour of mRNA secondary structure is indicated by WMH <1: i.e. lower numbers of optimal codons, and lower numbers of synonymous substitutions at paired sites.

Results

  1. 1)

    Conserved optimal codon numbers are significantly lower at paired compared to unpaired sites, irrespective of the method (RNA- and ALIfold) and algorithm (MFE and McCaskill's) used to predict secondary structure. Crucially, the tendency remains whether I consider amino acids with AT- or GC-ending optimal codons (Tables 1, 2).

Table 1 Comparison of conserved optimal codon numbers at paired and unpaired sites.
Table 2 Comparison of conserved optimal codon numbers at paired and unpaired sites.

The tendency is true for most amino acids separately; even GC-ending optimal codons that are more likely to be paired for thermodynamic reasons tend to be less frequent at paired sites (Table 2). Notable exceptions however are leu, lys, ile, and for RNAfold additionally phe (Table 2). One explanation for these exceptions may be that selection strength for translationally optimal codons is stronger in these amino acids, for example translational errors may be more likely or more costly. Considering prediction accuracy may decrease with gene length, I first restricted the data to genes shorter than 800 bp; including all genes however does not change the result.

  1. (2)

    I first check the similarity and potential conservation of structures predicted by RNAfold. The major parts of mRNAs do not seem conserved in structure across species or prediction accuracy is low: 75% of sites are ambiguous, i.e. predicted to be paired in one or more species, but predicted to be unpaired in the remaining species (Table 3). When looking pairwise on average 41% of sites are ambiguous; number of ambiguous sites is only slightly lower for short genes. The ambiguity of predicted structural status will introduce considerable noise and may cause non-significant results. Despite high ambiguity in structure prediction the numbers of synonymous substitutions are consistently lower (WMH<1) at paired sites (Table 4). The tendencies remain when restricting the data to genes shorter than 800 bp. Other species comparisons lead to similar results (data not presented). Results become significant for GC-ending (optimal and non-optimal) codons (when structure is predicted using McCaskill's algorithm). G and C nucleotides do not only form stronger bonds and are more likely to be paired and structurally important, they are also more likely to be unambiguously predicted paired than A and T nucleotides (means GC: 0.263, AT: 0.153, t = 29.8409, df = 866.05, ***). This could reduce the level of noise and cause the significance of results for GC-ending codons.

Table 3 Similarity of predicted structures for species pairs.
Table 4 Comparison of synonymous substitution numbers at paired and unpaired sites.

Discussion

I tested for evidence of selective constraint acting on mRNA secondary structures in protein coding yeast genes. Predicted secondary structures differ greatly according to the prediction method used and between species. Nevertheless, there are significantly fewer conserved optimal codons and consistently fewer synonymous substitutions at paired sites for all predicted secondary structures. The results of this study are consistent with purifying selection on mRNA secondary structures.

Similar tendencies of codon use have been reported for Drosophila and humans: mRNA stability seems high when optimal codon use is low in Drosophila [48] and paired sites contain an excess of rare codons in humans [49]. Note that in this study, the comparison of optimal codon use is restricted to conserved sites. Besides the methodological need for ALIfold (see Material and Methods), the restriction to conserved sites restricts the analysis to sites potentially under considerable strong selection. For RNAfold structures, results become non-significant when not restricting the data to these conserved sites (data not presented). Strong conflicting selection pressures seem to act on certain sites while the remaining sites seem less constrained for structure. Selection on local and not global structures may explain these results and contribute to the low structural similarity across species. Selection on local mRNA structures in coding regions of eukaryotic genes has been suggested before [49]. Beside the low structural similarity also compensatory substitutions may contribute to the non-significant results when comparing substitution numbers at paired and unpaired sites.

Previous bioinformatic studies that focussed on whether or not the thermodynamic stability of mRNA structures of various organisms is selected for or against [2023, 49, 55] lead to partly inconsistent results and controversies about the accurate randomization procedure. In these studies, the observed MFE is compared to the expected MFE, which is estimated by taking the mean MFE of randomized versions of the same sequence, and a significant deviation is taken as evidence for selection for or against thermodynamic stability of the structure. The randomization of sequences can be performed in a number of different ways holding various properties of the sequence constant, while randomizing others. The properties are of biological importance; variables that are affected by forces other than selection for mRNA structure – for example the amino acid sequence – should be fixed. Which variables should remain free to vary however may not always be obvious, while the results are very sensitive to them. Di-nucleotide content for example might be selected for its effect on stability and should be allowed to vary for randomized sequences argue Chamary and Hurst [22]. However, di-nucleotides might well be affected by mutation bias, or selected for some other reason [21], in which case, di-nucleotide content should be kept fixed. The control of di-nucleotides in fact renders significant results non-significant [2023, 55].

Note that in contrast to comparing observed and expected MFE values, the comparison of constraint at paired and unpaired sites does not indicate that selection acts for or against the thermodynamic stability of the structure, but that the very predicted structure is under selection. With respect to selection for or against stability of structures, ALIfold results indicate that the thermodynamically most stable global structure is not conserved across the four yeasts: ALIfold consensus energy value is much higher i.e. less stable compared to the average energy value of the single sequences [see also Washietl et al. [50] for approach]. This is conform with results of Babak et al. [24] which support selection against stability of structures in coding regions. It is reasonable to expect selection on mRNA structures may act against too stable structures because too stable and un-flexible mRNA structures may interfere for instance with translation [18] and some mRNAs flexibility may allow their specific and dynamic complexes with other factors. mRNAs lead a complex life [51] and besides thermodynamic stability, selection on mRNA structure may also exist to maintain specific local or global mRNA structures that allow binding and interaction with other factors and thus effect biological functioning. Not only structural targets may be of effect, also accessibility of sequence targets may depend on global or local mRNA structures.

While results of this study are consistent with selection upon mRNA structures in coding regions and support laboratory studies that report synonymous substitutions are functionally important with respect to mRNA structure and translation in humans [5254], two considerations should be made. First, we do not know whether thermodynamic mRNA structure predictions predict the mRNA structures that are formed in the cell. mRNAs are generally associated with other factors [51], and effects of mRNA-associated microRNAs and proteins on the structure are hard to predict. Also, kinetics of mRNA folding and pseudo-knots are not considered here. Even with the comparative method, mRNA structures may remain at best approximations of the real mRNA structures in the cell. Secondly, the predicted and also the real structure will be affected by certain DNA patterns – however whether or not the respective DNA patterns are selected for mRNA structure or another reason may be hard to judge. There are several DNA patterns one may consider. (i) Di-nucleotide content of naturally occurring sequences leads to higher than expected thermodynamic stability [e.g. [21, 23, 55]]. Di-nucleotide content may be selected for its effect on mRNA structure but it may also be affected by mutation bias, or selected for some other reason, for example for nucleosome positioning [5658] or transcription pause sites [59]. (ii) Frequency of polypurine tracts is increased in exons and may affect thermodynamic structure. Again, polypurine tracts may be selected with respect to mRNA structure but also for other reasons such as enhancing splicing [60]. (iii) Translational protein folding into alpha-helix and beta sheets may affect synonymous codon use [61] and periodic DNA patterns may affect mRNA structure. If thermodynamic predictions correspond to any other force such as selection on nucleosome positioning and transcription or co-translational pause sites, the observed patterns may be a consequence of that and inference of selection acting directly upon on mRNA structure may be incorrect.

Alternative selection upon mRNA structures (or any other selective target) may counterbalance translational selection and explain why the bias towards translationally optimal codon is never complete and even in highly expressed genes non-optimal codons are used. Alternative selection may also contribute to the discrepancy between expected and observed codon bias [58], and may lead to systematic underestimates of selection strength for optimal codons. As selection for mRNA structures may be acting stronger on GC-ending codons, in organisms in which potential translationally optimal codons are biased towards GC, such as Drosophila, mammals, C. elegans, estimates of selective strength for optimal codons may also be overestimated. It will be worth considering effects of alternative selection and disentangling the different targets of selection.

Conclusion

I tested for evidence of selective constraint acting on mRNA secondary structures in yeast. Predicted structures differ greatly according to the prediction method used and between species. Nevertheless, there are significantly fewer conserved optimal codons and consistently fewer synonymous substitutions at paired sites for all predicted secondary structures. These results are consistent with purifying selection on mRNA secondary structures in protein coding yeast sequences and suggest their biological importance. One should consider however that accuracy of structure prediction is unknown for mRNAs and interrelated selective forces may contribute. Selective pressures alternative to translational selection seem affect synonymous and optimal codon use in yeast. Depending on strength and direction of translational selection in an organism, such alternative selective forces may lead to under- or over-estimates of selective strength on optimal codon use.