Background

There is growing evidence that local mRNA folding (i.e., short-range secondary structure) inside the coding region is often stronger or weaker than expected, but the explanation for this phenomenon is yet to be fully understood. mRNA folding strength affects many central cellular processes, including the transcription rate and termination [1,2,3], translation initiation [4,5,6,7,8,9,10,11,12,13,14], translation elongation and ribosomal traffic jams [15,16,17,18], co-translational folding [19,20,21], mRNA aggregation [22], mRNA stability [23, 24], and mRNA splicing [10, 25] (reviewed in [26,27,28]). Many of these effects are mediated by interactions of mRNA within the CDS (protein-coding sequence) with proteins and other RNAs and may include structure-specific or non-structure-specific interactions.

In recent years, several studies showed evidence for selection acting directly to affect mRNA folding strength within the CDS (Fig. 1a). Studies looking at the CDS as a whole found selection for strong mRNA folding in most species [22, 29,30,31,32]. Studies focusing on the beginning of the coding region (i.e., the first 40–50 nucleotides) found evidence for the inverse, with selection acting to weaken mRNA folding in that region [30, 32,33,34]. In addition, there is some evidence for specifically strong folding in nucleotides 30–70, which may slow down translation elongation near the 5′ end of the mRNA, possibly to prevent ribosomal traffic jams [18, 35, 36]. Finally, it has been suggested that folding is weakened in the region leading to the stop codon [32,33,34], but not in a way that attributes this weakening to direct selection on folding strength rather than a side effect of some other bias in this region. These results are generally in agreement with available small-scale (e.g., [13, 14]) and large-scale [10,11,12, 24, 37,38,39] experimental validation performed in model organisms. Some of these characteristic regions were found to be correlated with genomic GC-content and to be stronger in highly expressed genes [29, 36, 40,41,42]. However, the previous studies cited did not systematically examine how the selection on folding strength changes along the coding sequence and how this phenomenon varies across the tree of life. In this study, using high-resolution analysis of the folding selection profiles in over 500 organisms from the three domains of life, we examine all data under a common framework and under more stringent controls (including accounting for the evolutionary distances between species), to determine which correlations are likely to stem from causal relationships involved in maintaining mRNA folding. We show that the previously proposed patterns of local selection on mRNA folding are not universal and examine their association with genomic and environmental factors in different taxonomic groups to better understand the underlying evolutionary processes.

Fig. 1
figure 1

Common regions of folding bias (∆LFE) are present across the tree of life, but are not universal. There is correlation between the strengths of these regions in different species, indicating there are factors influencing the bias throughout the coding sequence. a Summary of profile features with the fraction of species in which each feature appears in each domain (based on model 1 rules; see “Analysis” under the “Methods” section for details). The results based on the less restrictive model 2 rules (with weaker ∆LFE near the CDS edges not required to be positive; see “Analysis” under the “Methods” section) are shown in bright blue below each bar. References shown here are based on comparison to randomized sequences (i.e., equivalent to ∆LFE). b Scheme illustrating profile features reported separately in previous studies within the CDS, showing features [A]–[D] from a. c Observed distribution of ∆LFE profile values at different positions relative to CDS start (left) and end (right). d The distances (in nt) from the start codon where ∆LFE transitions from positive to negative, for species belonging to different domain. The lengths of the initial weak folding region range up to 150 nt in some bacteria. e Spearman’s correlations between mean ∆LFE profile values in regions [A], [C], and [D]. White dots indicate significant correlation (p value < 0.01)

Results

To test different hypotheses related to direct selection acting on the local folding energy (LFE) in different regions of the coding sequence, we measured the mean deviation in LFE between the native and randomized sequences (maintaining the amino acid sequence of all CDSs as well as codon and nucleotide composition including the GC-content, see “Analysis” under the “Methods” section for more details). The resulting deviation values, denoted ∆LFE, measure the increase or decrease in local mRNA folding energy relative to what we expect based on the encoded protein and codon frequencies. Any significant deviation from random can be attributed to a specific arrangement of codons that supports increased or decreased base-pairing and folding strength along the mRNA strand (Fig. 2a).

Fig. 2
figure 2

Overview of the computational analysis to measure ∆LFE while controlling for other factors known to be under selection at different regions of the coding sequence and find factors correlated with it. a The variables and concepts involved in determining local folding strength and calculating ∆LFE. The effects of the compositional factors on the left side are removed in order to specifically measure the contribution of codon arrangements to the native folding energy. Blue arrows indicate possible selection forces. b The different steps in the computational pipeline used to estimate ∆LFE and the factors affecting it (see “Analysis” under the “Methods” section). For each genome, the CDSs are randomized based on each null model (CDS-wide and position specific), to calculate a mean ∆LFE profile based on that null model. At the next step, based on GLS, correlations between features of the ∆LFE profile and genomic/environmental features are computed. Input data sources (native CDS sequences, species trait values, species tree) are shown in green. c The distributions of some genomic properties within the dataset—CDS count, genomic GC-content, and genomic ENc′ (measure of CUB). The dataset was designed to represent a wide range of values (among other considerations; see “Species selection and sequence filtering” under the “Methods” section)

Specifically, if the null hypothesis used to generate the randomized sequences holds for the native sequences at some position, we expect ∆LFE to be 0. Otherwise, a significant deviation from ∆LFE = 0 indicates that the local folding energy values cannot be explained by selection on amino acid content, codon bias, or GC-content alone and serves as evidence for direct selection on local folding energy (Fig. 2a). Positive ∆LFE indicates putative selection for weaker secondary structure, while negative ∆LFE corresponds with selection for stronger secondary structure. We specifically aimed at finding nearly universal patterns in ∆LFE, as well as groups of organisms and specific organisms with profiles deviating from such patterns. The resulting ∆LFE profiles were subsequently used with the evolutionary tree of the analyzed organisms to detect association between ∆LFE and genomic and environmental traits that cannot be explained by taxonomic relatedness alone and therefore may hint at underlying causal relations. We discuss the influence of genomic features such as codon usage bias (see the “Correlation between codon usage bias and ∆LFE” section) and GC-content (see the “Correlation between GC-content and ∆LFE” section), and of environmental features like intracellular life (see the “Weak ∆LFE in endosymbionts and intracellular organisms” section) and growth temperature (see the “Weak ∆LFE in hyperthermophiles” section).

Conserved regions of folding bias (∆LFE)

We observed that significant ∆LFE is present in most species and in most regions of the CDS (Fig. 3, Fig. 1a, c). The mean ∆LFE profiles of most species share the same structure (Fig. 3a, Fig. 1b, c), as follows. The region immediately following the CDS start (typically extending through the windows starting at positions 0–20 nt (Fig. 1a, region A), with a median of 20 nt/10 nt/20 nt in bacteria/archaea/eukaryotes, respectively) has positive mean ∆LFE (evidence of selection for weak folding), usually followed by a transition to negative mean ∆LFE (indicating selection for strong folding) within the first 50 nt and maintained throughout most of the CDS (Fig. 1a region C, Fig. 1c, d). The negative ∆LFE tends to weaken in the area immediately preceding the last codon (typically nucleotides 50–0 nt with median of 50/90/40 nt in bacteria/archaea/eukaryotes, respectively, Fig. 1d) in 83% of the species, and ∆LFE becomes positive there (indicating weaker-than-expected folding) in 37% of the species (including 68% of eukaryotes). This evidence of selection for weak mRNA folding near the stop codon in many organisms across the tree of life is reported here for the first time; two previous studies [18, 32] reported that the local folding energy (LFE) is weak near the start codon in three organisms and without showing that it cannot be explained by direct selection on the amino acid sequence (e.g., using computation of ∆LFE as was done here).

Fig. 3
figure 3

Two summaries of the ∆LFE profiles demonstrate the consistency and diversity found. a Characteristic ∆LFE profiles for species belonging to different taxons. The format of the plots appears in the upper left corner: ∆LFE bias is shown (by color) for windows starting in the range 0–150 nt relative to the CDS start, on the left, and CDS end, on the right; red denotes negative ∆LFE (stronger-than-expected folding) while blue denotes positive ∆LFE (weaker-than-expected folding; see the scale at the upper-right corner). The characteristic profiles for each taxon were calculated using clustering analysis, by grouping similar species according to the correlation between their profiles (see “Visualization” under the “Methods” section for details). The bars (in turquoise) appearing to the right of each characteristic profile indicate the relative number of species it represents. The full ∆LFE profiles for all species appear in Additional file 1: Figure S7. b. Summary of ∆LFE profile diversity for all species using dimensionality reduction to 2 dimensions with PCA (see explanations about PCA in the main text), with similar values (profiles) mapped to nearby positions. Background shading (blue) indicates density (see “Visualization” under the “Methods” section for details). This shows most species have similar profiles (located near the center), but different kinds of less typical profiles are also represented. Top, CDS start; bottom, CDS end. Short species names are listed in Additional file 1: Table S3

To measure how frequently these elements appear together within the same species, we tested them against a model, based on two variants. The stricter variant, model 1, counts species in which the regions of weak folding at the beginning and end of the CDS have, on average, weaker than expected folding, i.e., significantly positive ∆LFE. The less restrictive model 2 requires folding in these regions to be significantly weaker than in the middle of the CDS, but not necessarily significantly weaker than random (see “Analysis” under the “Methods” section for details). Since the models are applied to the mean ∆LFE of a population of genes which may vary greatly in their individual values, both estimates of the adherence to the model are informative. The combined models (composed of the three regions described) are found in 23% (model 1) and 69% (model 2) of the species analyzed (Fig. 1a), appearing very frequently in bacteria but also commonly in archaea and eukaryotes. The conservation of the ∆LFE profile structure in species across the tree of life is evidence of its biological significance.

GC-content and LFE both change during evolution, and it is worthwhile to compare their level of conservation in related species. LFE is to a large degree determined by GC-content (as evident by the almost perfect correlations found between GC-content and native or randomized LFE, Additional file 1: Figure S1), so one might argue the observed ∆LFE is a side effect of selection acting on GC-content. However, we found that the ∆LFE profile is more conserved than genomic GC-content at any phylogenetic distance within the same domain (Additional file 1: Figure S2). We also found that the profile does not consistently correlate with local variation in CUB (Additional file 1: Figure S3), demonstrating that the results reported here are not side effects of selection on codon bias (e.g., due to adaptation to the tRNA pool).

Additional tests also support direct selection acting to maintain folding strength. ∆LFE profile features are also preserved when calculated using a null distribution that maintains the codon distribution at any position in the CDS relative to the CDS start; thus, local (position-specific) genomic amino acid or codon distributions are not enough to explain the ∆LFE profile (Additional file 1: Figure S4). These features appear in many cases to be stronger in highly expressed genes, genes coding for highly abundant proteins, and genes with a strong codon adaptation to translation elongation, I_TE [43] (see Additional file 1: Figure S5). Finally, these results remain after controlling for the strength of the Shine-Dalgarno binding in the 5′-UTR [44] (Bahiri Elitzur S, Cohen-Kupiec R, Fine L, Yacobi D, Apt B, Diament A, et al.: Prokaryotic rRNA-mRNA interactions are involved in all translation steps and shape bacterial transcripts, Manuscript submitted for publication 2020) and for genes with short or overlapping 5′-UTRs (see, for example, [45]). Together, these results show that the ∆LFE profiles are unlikely to be explained as side effects of selection for a genomic or CDS position-dependent compositional bias in nucleotide, codon, or amino acids acting alone, although many such biases have been reported and are believed to have important biological effects [36].

Note that the randomized LFE profiles also are not always flat, revealing some residual influence on LFE, caused by the amino acid frequencies at different regions, remains even after randomization. ∆LFE controls for this by separately measuring the folding energy biases found in each position.

The different elements making up the model profile structure have functions associated with them. The weak folding region at the beginning of the coding region may improve access to the regulatory signals in this region (e.g., the start codon) [5, 36]. The region of positive ∆LFE preceding the CDS end may help recognition of the stop codon and ribosomal dissociation from the mRNA and prevent ribosomal read-trough. Strong folding in the middle of the coding sequence may assist co-translational folding [19,20,21] by slowing down translation in specific positions to allow protein folding or other co-translational processes to take place, as well as regulate mRNA stability [23] or prevent mRNA aggregation [22].

The division of the profile into the three regions described here is also apparent when the data is analyzed in an unsupervised manner via principal component analysis (PCA) [46] (Fig. 3b and Additional file 1: Figure S6). This arranges species on a two-dimensional plane according to their ∆LFE profiles, so species with more similar ∆LFE profiles are placed closer together. The resulting plots (for the beginning and end of the coding sequence) show the majority of species have similar ∆LFE profiles (located very close to each other near the center of the plot), with positive ∆LFE near the ends of the coding sequence and negative ∆LFE in the middle of the coding sequence. Groups of species containing other types of profiles are arranged around them on the plots. At either end of the coding sequence, 2 variables (principal components) are sufficient to describe at least 85% of the variability between all ∆LFE profiles, supporting the division of the ∆LFE into three regions (since the mid-CDS region appears in both analyses, see Fig. 1e).

In 45% of the organisms, we found an additional feature: a peak of selection for strong mRNA folding around 30–70 nt downstream of the start codon (Fig. 1a, region B). It was suggested ([34, 35], based solely on evidence in Eschericia coli and Saccharomyces cerevisiae) that this peak is responsible for increasing translation throughput, by minimizing ribosomal traffic jams occurring because of uneven translation elongation rates throughout the CDS. There is also some evidence [4, 9] that strong secondary structure downstream of the start codon can enhance translation. Whatever the mechanism responsible for it, the results here show that this feature is common across the tree of life. This feature was also shown previously to be stronger in highly expressed genes in 3 species [45], and our results extend this claim (see Additional file 1: Figure S5).

The ∆LFE profiles of eukaryotes are much more diverse than those found in prokaryotes. One striking observation is that significant positive ∆LFE throughout the mid-CDS region, present in 13% of the eukaryotes tested, is not observed in any of the 371 bacterial species tested except in Deinococcus puniceus (Additional file1: Figure S8, see also Fig. 1a). This seemingly universal rule hints at a constraint on bacterial CDSs not obeyed in eukaryotes and is one of two major differences observed between the domains (along with the correlation with genomic-GC, see the “Correlation between GC-content and ∆LFE” section).

Despite these general trends, there is also significant variation in the ∆LFE profiles across and within taxonomic groups. In the subsequent sections, we discuss genomic and environmental factors that explain some of the variation between mean ∆LFE profiles in different species.

Correlations between ∆LFE regions

The strengths of the three major regions of the ∆LFE profile described above are strongly correlated (Fig. 1e): organisms with relatively stronger ∆LFE (in absolute value) in one model region appear to also have stronger ∆LFE in other regions. For example, the 0–20-nt region has a strong negative correlation with the 150–300-nt region (Spearman’s ρ = − 0.46; p value < 1e−8). This correlation remains highly significant for different ranges and when testing using GLS (Additional file 1: Fig. S9). The two mid-CDS regions (relative to CDS start and end) are positively correlated (ρ = 0.84, p value < 1e−8), as are the CDS start and end regions (ρ = 0.52, p value < 1e−8). These correlations indicate ∆LFE profiles of different species can generally be ordered by magnitude from species having strong (positive or negative) ∆LFE features throughout the CDS to those showing weak or no ∆LFE. In eukaryotes, the negative correlation between the CDS start and mid-CDS regions is not present (results not shown), but in this case, neither do the ∆LFE profiles generally follow the structure of positive start ∆LFE and negative mid-CDS ∆LFE and the profile values may continue to change farther away from the CDS edges.

Together, these results suggest that the different elements making up the typical profile structure are influenced at the genome level by a factor or combination of factors acting jointly on all regions and strengthening or weakening |∆LFE|, as well as distinct factors acting on each region differently. Some factors contributing to this scaling effect are discussed in the following sections.

Correlation between codon usage bias and ∆LFE

Codon usage bias is generally correlated with adaptation to translation efficiency [47,48,49,50]. If ∆LFE is also related to selection for translation efficiency, it is reasonable to expect it would correlate with CUB. To test this hypothesis, we used ENc′ (ENc prime, [51, 52]), a measure of codon usage bias (CUB) that compensates for the influence of extreme GC-content values that skew standard ENc (effective number of codons) scores. Indeed, such a correlation is found (Fig. 4, Additional file 1: Figure S10b)—∆LFE tends to be stronger (in absolute value) in species having strong CUB (low ENc′), and this holds both near the CDS edges and in the mid-CDS regions. Similar results were obtained when using other measures of CUB (CAI [53] and DCBS [49], Additional file 1: Figure S11), and these correlations persist within many individual taxons (Fig. 9, Additional file 1: Figure S10b). In addition, species with strong CUB tend to have ∆LFE profiles that closely match the model elements (Fig. 4b, c), and further analysis shows the correlation of CUB with the ∆LFE profiles is due to correlation with the magnitude of the profiles and not due to specific profile regions (Additional file 1: Figure S12). Since ∆LFE is computed while controlling for the CUB of each sequence, the reported results suggest that organisms with higher selection on CUB also have, “independently” from a statistical point of view, higher selection on ∆LFE.

Fig. 4
figure 4

Folding bias (∆LFE) is positively correlated with genomic CUB (measured as ENc′) throughout the CDS. This correlation indicates stronger folding bias exists in species with stronger CUB at all regions of the CDS. a Correlation strength (R2, measured using GLS regression) between genomic ENc′ and ∆LFE at different positions relative to the CDS start (left) and end (right). R2 values below the X-axis indicate negative regression slope (i.e., negative correlation with ∆LFE). The regression slope generally has the opposite sign as ∆LFE, indicating strong ∆LFE is correlated with strong codon bias throughout the CDS. Major taxonomic groups are plotted as different colored lines. White dots indicate regression p value < 0.01. b Comparison of ∆LFE profile values in species with strong vs. weak CUB. Species with strong CUB (yellow, ENc′ ≤ 56.5) tend to have more extreme ∆LFE and show the conserved ∆LFE regions more clearly, while species with weak CUB (blue, ENc′ > 56.6) tend to also have weak ∆LFE. c Genomic ENc′ values plotted using coordinates determined by ∆LFE profiles. Species with strong CUB (left plot, lower left quadrant and right plot, right side) have stronger ∆LFE profiles that more strongly adhere to the conserved ∆LFE regions. Coordinates are based on PCA for profile positions 0–300 nt relative to CDS start (left) and end (right). The PCA coordinates are the same as those in Fig. 3b

Using genomic CUB as measure of optimization for efficient translation elongation, we found that it is also a good predictor of the strength of ∆LFE. One interpretation of this is that the genomic variation in ∆LFE can largely be explained not by different species having distinct “target” ∆LFE levels, but by different species having varying “abilities” to maintain ∆LFE in the presence of mutations and drift because the selection pressure is insufficient under their effective population size (either because the selection pressure is low or because the effective population size is low).

Correlation between GC-content and ∆LFE

GC-content is a fundamental genomic feature and is correlated with many other genomic traits and environmental aspects [54, 55]. It might be a trait maintained under direct selection, or merely a statistical measure of the genome that other traits evolve in response to because of its biological and thermodynamic consequences. GC-content is also the strongest factor determining the native LFE (Additional file 1: Figure S1a), since G-C base pairs are more stable than A-T pairs (due to the increase in the number of hydrogen bonds and more stable base stacking). Selection on folding strength (measured by ∆LFE) also influences folding strength, and we would like to measure the correlation between these two factors that influence the folding strength (namely, GC-content and ∆LFE). This is made possible since ∆LFE is calculated relative to the baseline maintaining the GC-content of the original coding regions in the randomized ones (see “Randomization procedures” under the “Methods” section for a description of the null models). This controls for the direct effect of GC-content, allowing us to directly study the interaction between ∆LFE and GC-content (see also Additional file 1: Figure S1a).

The correlations (expressed as R2) between genomic GC-content and ∆LFE at different points near the CDS start and end are shown in Fig. 5a. This dependence shows a similar pattern to that seen in the ∆LFE profiles themselves (Fig. 1c, Fig. 5a, and for the correlation with CUB, see the “Correlation between codon usage bias and ∆LFE” section), with significant correlations appearing in roughly the same CDS regions described for the ∆LFE profiles. The correlation takes the opposite directions in the CDS edges than that maintained throughout the inner CDS region, which means GC-content is positively correlated with the strength of ∆LFE (in absolute value) throughout the CDS (like CUB is).

Fig. 5
figure 5

Folding bias (∆LFE) is positively correlated with genomic GC-content throughout the CDS. a The effect of genomic-GC on ∆LFE at each position along the CDS start (left) and end (right), measured using GLS regression R2 values. R2 values above the X-axis indicate positive regression slope (indicating moderating effect of GC-content); R2 values below the X-axis indicate negative regression slope (i.e., reinforcing effect of GC-content). Near the CDS edges (where ∆LFE is usually positive), genomic-GC generally has a moderating effect on ∆LFE. In the mid-CDS region (where ∆LFE is usually negative), genomic-GC generally has a reinforcing effect on ∆LFE. Major taxonomic groups are plotted as different colored lines. White dots indicate regression p value < 0.01. b Comparison of ∆LFE profile values in species with high vs. low genomic GC-content. Species with high GC-content (blue, genomic-GC > 45%) tend to have more extreme ∆LFE and show the conserved ∆LFE regions more clearly, while species with low GC-content (yellow, genomic-GC ≤ 45%) tend to also have weak ∆LFE. c. Genomic GC-content for all species plotted on the PCA coordinates of their ∆LFE profiles (same coordinates as in Fig. 3b. N = 513) for CDS start (left) and end (right). Low-GC species are generally clustered in a small region, indicating they have similar ∆LFE profiles, and that region is characterized by weak ∆LFE. d Qualitative summary of ∆LFE in relation to GC-content in the mid-CDS

Near the CDS start, positive correlation (indicating a moderating effect) exists in the windows starting at 0–60 nt (Fig. 5a, Additional file 1: Figure S10a). This effect appears in almost all taxons analyzed, with R2 values between 0.2 and 0.9 and significant p values in most taxons, and may be explained as counteracting the strengthening influence of GC-content on secondary structures to prevent them from hindering the translation initiation process.

The opposite effect exists in the mid-CDS: negative (reinforcing) dependence on genomic GC-content appears in the region at 70–300 nt after CDS start in most bacterial and archaeal taxons (Fig. 5a–c, Fig. 9, Additional file 1: Figure S10a) and is generally maintained throughout the length of the CDS (excluding the edge regions). As mentioned above, selection for strong mRNA folding and mRNA structures inside the coding may be related to transcription elongation [2], co-translational folding [19,20,21, 26], and mRNA stability [23]. The observed ∆LFE in this region is indeed negative in nearly all bacterial and archaeal species; it is possible that the folding is further reinforced in species higher GC-content since they are under stronger selection for these processes. Note that the effects of genomic GC-content and CUB (see the “Correlation between codon usage bias and ∆LFE” section) are somewhat overlapping, but each factor significantly contributes to the total observed effect (Additional file 1: Figure S13).

In eukaryotes, we observed a wider variation in mid-CDS ∆LFEs (which is not found in other groups), from strongly positive to strongly negative, with a non-linear dependence on genomic-GC (Fig. 6, Fig. 9). Low-GC eukaryotes tend to have weak ∆LFE in the mid-CDS region, while high-GC eukaryotes tend to have strong positive or negative ∆LFE in the same region. To evaluate this relation, which is not linear, we used maximal information coefficient (MIC) [56, 57], a measure that can capture any statistical dependence including non-linear dependencies. We found that this relation is quite significant (MIC = 0.54, p value ≤ 2e−5; see “Analysis” under the “Methods” section). Fungi, however, show a strong positive (moderating) correlation between genomic-GC and ∆LFE (Fig. 5a, Fig. 6a; Eremothecium gossypii, GC% = 51.7, is the only observed fungus with GC% > 45 and negative ∆LFE in the mid-CDS region). There are also clear internal disparities in ∆LFE among fungi families (Additional file 1: Figure S7). Note that in some species (e.g., Zymoseptoria tritici), the positive ∆LFE seems to extend throughout the CDS. In other species, there is a transition to negative ∆LFE further downstream (as much as 500 nt from CDS start, results not shown).

Fig. 6
figure 6

Genomic-GC effect on ∆LFE in eukaryotes shows divergence in high GC-content species that is not observed in other domains, while low GC-content species have weak ∆LFE. a mean ∆LFE values for eukaryotes in the range 100–300 nt from CDS start, plotted against genomic-GC. Fungi are highlighted in blue. There is no linear relation between the variables (R2 = 0.01), but there is strong statistical dependence nevertheless (MIC = 0.582, p value < 2e−5, N = 78); see some explanation on MIC in the main text. b PCA plot for the same species shows the same two classes of ∆LFE profiles in eukaryotes having high genomic-GC (top and bottom regions; see “Visualization” under the “Methods” section for details). On the left, ∆LFE profiles are plotted in the positions given by their first 2 PCA components. On the right, genomic-GC values for the profiles plotted at the same coordinates. Short species names are listed in Additional file 1: Table S3

The group of fungi and other eukaryotes having strong selection for weak local mRNA folding in the mid-CDS region (all of which have high genomic GC-content) runs counter to the general trend in prokaryotes. It is possible that these species are under selection for higher translation elongation speeds, which tend to be hindered by stronger mRNA folding [15,16,17,18]; however, it is not clear why such cases are not observed in other groups like bacteria. The correlation with GC-content reported here may also be partially explained by the fact that both GC-content and ∆LFE are affected by common factors such as the ability to maintain the selected sequences under the effective population size. The wide range of ∆LFE values for eukaryotic species and the absence of linear correlation with GC-content (in general) reveal additional factors are involved in this aspect of gene expression.

Weak ∆LFE in endosymbionts and intracellular organisms

Many endosymbionts and other species with intracellular life stages have low effective population sizes, because their life cycle includes recurring population bottlenecks [58, 59] or has lower selective pressure due to reliance on the host [60]. These species generally have weaker ∆LFE compared to their relatives, as can be clearly seen from their ∆LFE profiles (Fig. 7, also see Additional file 1: Figure S7, e.g., Richelia intracellularis, Blattabacterium sp.). The apparent disparity between endosymbionts and their relatives is strongest near the CDS start. Taken as a whole, the difference in ∆LFE is small (Fig. 7a), but when comparing within smaller taxons, the difference is much more noticeable (e.g., gammaproteobacteria in Fig. 7b–d). Endosymbionts also tend to have lower GC-content and CUB [60], but the results are still generally significant after considering this at least in proteobacteria, where we have a sufficient sample size (Additional file 1: Figure S14). The dichotomic grouping of species as endosymbionts is an oversimplification and ignores the variety of species with intracellular stages, including obligate and facultative intracellular parasites (and our annotation of species as endosymbionts, based on the literature, may not be complete). Indeed, some species we classify as endosymbionts (e.g., Halobacteriovorax marinus SJ) nevertheless have low genomic ENc′ and strong ∆LFE.

Fig. 7
figure 7

Endosymbionts and other intracellular species have generally weak ∆LFE. a Comparison of ∆LFE values at different CDS positions between endosymbionts (green) and other species (pink). The ∆LFE values are less extreme in endosymbionts, indicating lower selection on local folding strength. b Comparison of ∆LFE distributions at different CDS positions between endosymbionts (green) and other species (pink) within gammaproteobacteria (N = 44). c ∆LFE for species included in the tree within gammaproteobacteria; the endosymbionts and intracellular species (marked) have weaker ∆LFE bias compared to their relatives. d PCA plot for ∆LFE profiles (left, see “Visualization” under the “Methods” section) and the intracellular classification (right) for the species in gammaproteobacteria (N = 44). For clarity, overlapping profiles are hidden on the left (as in all PCA plots for ∆LFE profiles); all species are plotted on the right. Short species names in the PCA plot on the left panel are listed in Additional file 1: Table S3

Weak ∆LFE in hyperthermophiles

In temperatures approaching the RNA melting temperature, base-pairing is destabilized and it is likely that codon arrangement and ∆LFE can no longer significantly affect the secondary structure. We found hyperthermophilic archaea and bacteria to have weaker (closer to 0) ∆LFE in the mid-CDS region (Fig. 8). This effect is not apparent at lower temperatures (below 65 °C) or across all temperatures, with temperature having no significant correlation with ∆LFE (Fig. 8e, Fig. 9) when controlling for species relatedness. Our results are consistent with [40], which argued for negative correlation with growth temperature, but that paper only analyzed the beginning of the coding region and did not control for the evolutionary relations among organisms. Based on our analysis, the linear relation between temperature and ∆LFE is not generally supported by GLS (Fig. 8e, Fig. 9, Additional file 1: Figure S10c); however, since species tend to have similar temperature requirements as their close relatives, it is hard to conclusively decide if any similarity in ∆LFE is derived from association with temperature or the evolutionary relationship without having considerably more data. In hyperthermophiles (species with optimum growth temperature above 75 °C), however, there is a significant decrease in ∆LFE (even when the folding strengths are predicted at room temperature, Additional file 1: Figure S15). These results suggest that mRNA folding is not effective in higher temperatures (in general), and consequently, ∆LFE is not preserved. In moderate thermophiles, ∆LFE may follow the precedence of genomic GC-content, which previous studied concluded is not an adaptation to high temperatures at the genomic level, but may still be part of such an adaptation at specific rRNA and tRNA sites where secondary RNA structure is particularly important [61, 62].

Fig. 8
figure 8

Hyperthermophiles have weak ∆LFE. a ∆LFE profiles (for CDS beginning and end) for members of euryarchaeota covered by the phylogenetic tree (N = 28) and their annotated optimum growth temperature classification (mesophile—green, moderate thermophile—orange, hyperthermophile—red) and genomic GC-contents. Hyperthermophiles have weak ∆LFE that cannot be explained by the tree topology or their genomic GC-contents. b ∆LFE profiles (left) and optimum growth temperatures (right) for all members of euryarchaeota having annotated optimum growth temperatures (N = 25), plotted using their PCA coordinates (see “Visualization” under the “Methods” section). Hyperthermophiles seem to be clustered in a small region characterized by weak ∆LFE. c ∆LFE profiles (left) and optimum growth temperature (right) for all species having annotated optimum growth temperature (N = 173), plotted using their PCA coordinates (see “Visualization” under the “Methods” section). Short species names from PCA plots are listed in Additional file 1: Table S3. d Comparison of ∆LFE values for species having optimum temperature above (blue) or below 75 °C (yellow), for positions relative to CDS start (left) or end (right). e Regression for optimum growth temperature vs. mean ∆LFE (average for positions 100–300 nt after CDS start) using GLS (green regression line, N = 96, R2 = 0.004, p value = 0.6) and OLS (red regression line, N = 173, R2 = 0.45). The apparent linear relation is no longer significant when controlling for the phylogenetic relationships. Points plotted in red are included only in OLS

Fig. 9
figure 9

Summary of trait correlations with ∆LFE in the mid-CDS region for different taxonomic groups. Many of these correlations are discussed in the previous sections. For each taxonomic group and trait combination, correlations are measured using R2 with GLS (phylogenetically corrected, green bars) and OLS (uncorrected linear relationship, red bars). Significant correlations are marked with “∗” (p value < 0.05) or “∗∗” (p value< 0.001). Correlations with genomic-GC% and genomic-ENc′ are robust in prokaryotes, whereas other traits do not have consistent linear relationships. All correlations are for the region 100–300 nt after CDS start. (a) No linear dependence, but a significant relationship does exist (see Fig. 6). (b) Linear dependence appears in GLS but not in OLS. Small sample size exists in some taxons. (c) No significant linear relationship found over the entire range of values, but hyperthermophiles have significantly lower ∆LFE (see the “Weak ∆LFE in hyperthermophiles” section)

Discussion

The results we presented here provide a wide integrated view on the way evolution shapes local mRNA secondary structures in the coding regions of organism across the tree of life. In addition, the results include novel attempts to tie this phenomenon to genomic, evolutionary, and environmental variables in the hope of further clarifying the processes involved. In this section, we will summarize and discuss key results.

First, we show that selection on mRNA folding strength in most (but not all) species follows a conserved structure with three distinct regions (Fig. 1)—decreased local folding strength at the beginning and end of the coding region and increased folding strength in mid-CDS. The fact that this structure is more conserved than other genomic traits like GC-content (Additional file 1: Figure S2), as well as its alignment to the coding regions, suggests these features are related, at least in part, to translation regulation. Our statistical tests demonstrate that these features cannot be merely side effects of factors known to be under selection like codon usage bias and amino acid composition.

In general, the model features for the beginning and mid-CDS appear much more frequently in the analyzed organisms (appearing in around 80% of the organisms), while selection for weak folding near the stop codon, first demonstrated here, is comparatively rare (it appears in around 37% of the organisms). This may suggest that generally, the first two features tend to be under stronger selection (possibly since they tend to contribute more significantly to organism fitness).

Conformance to different model elements varies significantly between the three domains: weak folding at the beginning of the coding regions appears in the great majority of bacterial species (88%) but only in 56%/60% of eukaryotes/archaea, respectively (Fig. 1a, Fig. 3a). These differences may be related to polycistronic gene expression (see Additional file 1: Figure S16) or to generally higher effective population sizes and selection for high growth rate in bacteria; they may also indicate complementary constraints imposed by eukaryotic gene expression mechanisms (e.g., Cap-dependent translation initiation) and unique environmental constrains in archaea. On the other hand, selection for weak mRNA folding at the end of coding region (first conclusively shown here) is much more frequent in eukaryotes (appearing in 68% of the analyzed organism) than in the prokaryotes (20% in archaea and 33% in bacteria). This may be related to alternative mechanisms for efficient translation termination fidelity in prokaryotes (including mRNA folding outside the boundaries of the CDS) and/or to translation of polycistronic transcripts (see [63] for related observations in the 3′-UTR).

Second, we found that in some eukaryotes (in 13% of the analyzed eukaryotes and in one bacteria: D. puniceus), there is significant positive ∆LFE throughout the mid-CDS region (i.e., opposite to the general trend in prokaryotes, Fig. 1a, Fig. 6, Additional file 1: Figure S8). This phenomenon, more widespread than previously reported, may be related to selection improving elongation speed [18]. It is currently not clear why this type of selection appears only in these eukaryotes and is extremely rare in the other domains.

Third, we show that the “transition peak,” a region of selection for strong mRNA folding beginning around 30–70 nt downstream of the start codon that was reported elsewhere to be associated with translation efficiency [18, 35, 36, 45], appears frequently (45%) in the analyzed organisms, indicating this mechanism is common (Fig. 1a, c). This feature appears much more frequently in eukaryotes (73%) than in prokaryotes (22% in archaea and 43% in bacteria). Here, too, it is possible the lower frequency in prokaryotes hints at a complementary mechanism for translation initiation and elongation efficiency and fidelity in prokaryotes.

Forth, despite these differences, we found strong correlation between the strengths of three profile elements (found at the beginning, middle, and end of the coding regions, Fig. 1e) across the analyzed organisms. This supports the conjecture that much of the variation in their strength among organisms is caused by common factors acting jointly on the level of ∆LFE at all regions of the CDS.

Fifth, we discussed several variables that correlate with ∆LFE (and account for much of the variation mentioned above). The variables showing the strongest correlation are genomic GC-content (despite being explicitly controlled for by our randomizations as explained above, Fig. 5) and CUB (measured using ENc′, Fig. 4). Strong CUB and higher GC-content tend to be associated with more efficient selection on translation efficiency (see, for example, [64, 65]), and the fact that ∆LFE is correlated with them suggests the same underlying mechanism (or mechanisms) contributes to their selection.

The influence on ∆LFE of all traits analyzed in the mid-CDS region can be compared in Fig. 9. Other genomic and environmental traits analyzed (including genome size and growth time) were not found to have significant linear interaction with ∆LFE at the domain level. In many cases, there appears to be potential interaction with ∆LFE in smaller taxons (which may or may not be due to real interactions specific to those taxons, Additional file 1: Figure S10).

Sixth, we proposed four specific characteristics of species having weak ∆LFE (separately and together), demonstrating the conditions in which ∆LFE cannot be effectively maintained (or does not yield sufficient benefit to be maintained). The first two characteristics are based on the correlated traits described above: low GC-content and low CUB. Another characteristic is optimum growth temperature, since in higher temperatures, base-pairing is weakened, and consequently, the influence of codon arrangement and composition must also be reduced, and so is any possible effect of ∆LFE. The last disrupting factor, an intracellular life phase, stems from the fact that such organisms generally have lower effective population size (due to recurring population bottlenecks) and lower selection pressure on gene expression (because they partly rely on the host, [58, 59]). A binary classification model based on these four features has precision 0.66 and recall 0.82 in classification of ∆LFE strength (see “Analysis” under the “Methods” section and Fig. 10). Note that this binary classification discriminates species with very weak ∆LFE and has weak predictive value for ∆LFE strength in species where none of the factors hold, giving R2 = 0.2 (p value = 5e−25, OLS, all species) against mean |∆LFE| in the 150–300-nt region relative to CDS start. These conditions support the proposed mechanism of ∆LFE being the result of selection on secondary structure strength related to gene expression regulation and efficiency.

Fig. 10
figure 10

Classification model for weak ∆LFE based on four species traits. a PCA plot of ∆LFE profiles relative to CDS start (see “Visualization” under the “Methods” section). Short species names are listed in Additional file 1: Table S3. b ∆LFE profile strength, measured using standard deviation, for profile positions 0–300 nt relative to CDS start. c Predicted ∆LFE strength for each species using binary model for weak ∆LFE (precision = 0.66, recall = 0.82, N = 513, see “Binary classifier for ∆LFE strength” under the “Methods” section)

Our results point to cases where evolutionarily close organisms exhibit very different ∆LFE patterns and selection levels. For example, in fungi, members of Pezizomycotina (such as Aspergilus niger or Zymoseptoria brevis) have much more positive ∆LFE compared to members of Saccharomycotina (including Eremothecium gossypii and Candida albicans). Notably, a few eukaryotic species (e.g., the unrelated species Fonticula alba and Saprolegnia parasitica) have a ∆LFE profile that looks typical for bacteria (Additional file 1: Figure S7). This highlights the variety of gene expression mechanisms in eukaryotes, as well as the risk in generalizing about disparate groups based on observations on model organisms.

We would also like to emphasize the fact that ∆LFE has been considered a direct result of selection by previous studies cited here; we believe our results further support this hypothesis, for example, by showing ∆LFE is more conserved than genomic GC-content and demonstrating biologically reasonable trait interactions that may indicate a (direct or indirect) causal link. We should note however that our methodology does not assume any specific evolutionary process at work to produce the measured ∆LFE and this is an additional topic for further research.

Finally, we should note our analysis is based on average values over entire genomes. This provides important statistical power and reduces the random effects of other factors on specific genes. It is important to remember, however, that some of the gene-level factors filtered this way are nevertheless important and there is considerable variation between genes. This means that the reported features should be further analyzed in higher resolution, as well as validated experimentally to understand their origin. For example, ∆LFE in the mid-CDS region was suggested to be influence by both global factors like mRNA aggregation and local factors, like co-translational folding [29, 31], which may cause non-uniform selection pressure across the CDS. These differences may allow the effect of each factor to be experimentally validated separately. In addition, in future studies, it will be helpful and challenging to study the relation between ∆LFE and the position of genes in the operon (see [63]), and the influence of ∆LFE on the outcomes of translation initiation, termination, and splicing.

Conclusions

  1. 1.

    The previously proposed regions of selection on local mRNA folding strength are widespread and appear in many species across domains. For two such regions (strong folding downstream of the beginning of the CDS and weak folding near the CDS end), this is first conclusively demonstrated here. However, none of these regions is universal and exceptions, which sometimes run opposite to the common trend, are quite common. Nevertheless, the CDS in most species does contain consistent regions of tendency for increased or decreased secondary structure strength. These regions coincide with parts of the CDS involved in different gene expression processes and in particular different stages of mRNA translation (initiation, elongation, and termination), supporting the conjecture that mRNA folding strength has a role in these stages of mRNA translation. In addition, stark differences in the prevalence of the regions suggest interactions with domain-specific regulatory mechanisms: For example, the selection for weak folding at the end of the coding region seems to be more common in eukaryotes while the selection for weak folding at the beginning of the coding region appears more commonly in prokaryotes.

  2. 2.

    The tendencies for increased or decreased secondary structure strength in different parts of the coding sequence are correlated among species across the tree of life, indicating common factors are affecting them throughout the coding sequence. We present four factors that predict the strength of local mRNA folding selection within the coding sequence—GC-content, CUB, intracellular life stage, and a hyperthermophilic environment. These factors are characteristic of species with strong optimization for gene expression efficiency or fidelity, suggesting mRNA folding strength also contributes to this optimization.

  3. 3.

    A “transition peak” of selection for strong mRNA folding around 30–70 nt downstream of the start codon appears in ~ 50% of the analyzed organisms, showing this phenomenon (suspected of being linked to optimization of translation elongation) is widespread.

  4. 4.

    The statistical framework we proposed for studying position-specific selection effects on traits like local mRNA folding across taxonomic groups, while controlling for confounding factors such as amino acid bias, codon, and evolutionary distance, enables inferring factors that may directly affect these traits.

Methods

Analysis

Species selection and sequence filtering

The set of species included in the dataset (Additional file 1: Table S1, Additional file 2) was chosen to maximize taxonomic coverage, include closely related species which differ in GC-contents and other traits (Fig. 2c), and take advantage of the limited overlap between available annotated genomes, NCBI environmental traits data, and the phylogenetic tree (see below). To prevent under-representation of taxons in the dataset, included species were tabulated by phylum and species from missing phyla and classes were added if possible (Additional file 1: Table S2). Over-representation of closely related species is controlled by GLS (see below).

CDS sequences and gene annotations for all species were obtained from Ensembl genomes [66], NCBI [67], JGI [68], and SGD [69] (Additional file 1: Table S3). CDS sequences were matched with their GFF3 annotations to filter suspect sequences, as follows. The dataset excludes CDSs marked as pseudo-genes or suspected pseudo-genes, incomplete CDSs, and those with sequencing ambiguities, as well as CDSs of length < 150 nt. If multiple isoforms were available, only the primary (or first) transcript was included. Genes annotated as belonging to organelle genomes were also excluded. Genomic GC-content, optimum growth temperatures, and translation tables were extracted from NCBI Entrez automatically, using a combination of Entrez and E-utilities requests (Additional file 1: Table S3). A few general characteristics of the included CDSs are shown in Fig. 2c.

The taxonomic hierarchy and classifications used to analyze and present the data were obtained from NCBI Taxonomy. Endosymbionts were annotated using a literature survey (Additional file 1: Table S3). Growth rates were extracted from [52] (Supplementary Table A1).

Randomization procedures

To test different hypotheses regarding local folding energy (LFE), native sequences were compared against randomized sequences preserving attributes as defined by each null hypothesis, as follows (Fig. 2a, b):

To test the hypothesis that the native arrangement of synonymous codons causes a significant bias in LFE, synonymous codons were randomly permuted within each CDS (i.e., all codons encoding for the same amino acid within a given CDS are randomly rearranged). This “CDS-wide” randomization preserves the encoded protein sequence, nucleotide frequencies (including GC-content), and codon frequencies of each CDS (but generally disrupts longer-range dependencies). Synonymous codons were determined according to the nuclear genetic code annotated for each species in NCBI genomes.

To test the contribution of position-specific biases in amino acid composition, nucleotide frequencies, and codon frequencies including CUB (factors that are equalized at the CDS level by the CDS-wide randomization) on the observed LFE, a second “position-specific” randomization was used. In this randomization, synonymous codons were randomly permuted between codons found at the same position (relative to the CDS start) across all CDSs in each genome. This randomization preserves the amino acid sequence of each CDS, while nucleotide (including GC-content) and codon frequencies are preserved at each position across a genome.

LFE profile calculation

Local folding energy (LFE) profiles were created by calculating the folding energy of all 40-nt-long windows, at 10-nt intervals, relative to the CDS start and end, on each native and randomized sequence. This measure estimates local secondary structure strength (ignoring the specific structures) and reflects (among other considerations) the structure of mRNA during translation, which prevents long-range structures but allows formation of local secondary structure and generally agrees with existing large-scale experimental validation results [37]. Previous studies (e.g., [35]) showed that this measure is robust to changes in the window size. The coordinates shown always refer to the window start position relative to the CDS start (e.g., window 0 includes the first 40 nt in the CDS) or to the window end position relative to the CDS end. Estimated folding energies were calculated for each window using RNAfold from the ViennaRNA package 2.3.0 [70], with the default settings. All folding energies were estimated at 37 °C so as to compare equivalent quantities between all genomes (but see below under native-temperature profiles). The ∆LFE profile for each protein, defined as the estimated excess local folding energy caused by the arrangement of synonymous codons at any CDS position, was created by subtracting the average profile of 20 randomized sequences for that protein from the native LFE profile:

$$ {\Delta \mathrm{LFE}}_i={\mathrm{nativeLFE}}_i-\frac{1}{N}\sum \limits_{1\le n\le N}{\mathrm{randomizedLFE}}_i(n) $$

(i—CDS position, N—number of randomized sequences)

The mean ∆LFE profile for each species was created by averaging each position i over all proteins of sufficient length (so a different number of sequences may be averaged at each position). Note that while the native LFE of different CDSs within each genome varies considerably, the LFE of each native CDS is compared to its own set of randomized sequences.

To determine if the mean ∆LFE for a species in position i (relative to CDS start or end) is significantly different than 0, the differences di(p, n) between LFE of the native and randomized sequences for each CDS p at position i were collected:

$$ {d}_i\left(p,n\right)={\mathrm{nativeLFE}}_i(p)-{\mathrm{randomizedLFE}}_i\left(p,n\right) $$

(p—CDS index, 1 ≤ n ≤ N = 20—number of randomized sequences, i—CDS position)

The Wilcoxon signed-rank test was used on all values di(p, n) (with the null hypothesis implying that the distribution is symmetrical).

Native-temperature profiles

The predicted folding energy calculations for native and randomized sequences for a sample of N = 71 bacterial and archaeal species were repeated using the same procedure but with folding predicted at the optimal growth temperature specified for that species (instead of 37 °C).

Phylogenetic tree preparation

To study the relation between ∆LFE profiles and other traits, the profiles were analyzed using a phylogenetic tree as follows. The phylogenetic tree is based on [71] (Supplementary Dataset 2 and Supplementary Table 1) and contains species from our dataset across the three domains of life. Since there are slight discrepancies in some node identifiers between the tree ([71] Supplementary Dataset 2) and accession table ([71] Supplementary Table 1), species names were matched by hand. Tree nodes and profiles were then matched by NCBI tax-ID at the species or lower level between the available genomes and phylogenetic tree nodes (e.g., when the tree species a species, and there is only one genome available for a specific strain of this species). The tree distances were converted to approximate relative ultrametric distances using PATHd8 [72] version 1.9.8 with the default settings. Finally, the tree was pruned to the set of leaf nodes found in the dataset (or a subset of them which has data for both variables being correlated), by removing unused inner and leaf nodes and merging single-child inner nodes by summing distances. The resulting ultrametric tree (Additional file 3) was used to create a covariance matrix using a Brownian process (to reflect the null hypothesis that a trait is not under selection), using the ape package [73] in R.

Phylogenetically controlled regression

To test for correlations between traits among species while controlling for the similarity expected to exist between related species even in the absence of selection on either trait, generalized least-squared (GLS) regression was performed [74, 75] with the nlme package [76] in R and using REML optimization. Each regression included the subset of species for which data for both correlated traits was available, and which were also included in the tree. Regression p values are based on the null hypothesis that the slope of the explanatory variable is 0 (i.e., that the variables are independent), and estimated using the t test. Coefficient of determination (R2) values were calculated according to [75, 77]:

$$ {R}^2=1-\frac{\hat{u}^{\prime }{V}^{-1}\hat{u}}{{\left(Y-\overline{Y}e\right)}^{\prime }{V}^{-1}\left(Y-\overline{Y}e\right)} $$

\( \hat{u} \)—residuals, V—variance-covariance matrix, Y—observations, \( \overline{Y} \)—intercept of equivalent intercept-only model, and e—first column of design matrix.

For continuous traits, regression formulas included an intercept term. Discrete traits were represented by ordered or unordered factors, and the intercept term was omitted from the regression formula. For discrete traits, values of the explained variable (such as ∆LFE) were centered to have mean 0 (so regression is based on a null hypothesis that all levels have the same mean).

Regression robustness verification

To test the robustness of a correlation between traits at different CDS regions, the regression was repeated at all profile positions starting between 0 and 300 nt (relative to CDS start and end) and all contiguous subranges (using the mean ∆LFE value in each range) and reported only if consistent over the relevant range of positions (Additional file 1: Figure S17).

To test for specific trait correlations in individual taxons, the regression procedure was repeated for each taxonomic group (at any rank) containing at least 9 species (Additional file 1: Figure S10). For each taxonomic group, the value shown is the median R2 value for positions within the relevant range. The significance p value threshold was determined by applying FDR correction according to the number of taxonomic groups (treating them as independent to get a “worst-case” result).

Model element definition rules

Elements of the ∆LFE profile model were formalized as follows to allow estimation of their prevalence (Fig. 1a). Significance for all rules is defined using the Wilcoxon signed-rank test (see above) having p value < 0.05 at all positions within the range specified.

Model 1 (positive ends)

  1. A.

    Positive start: ∆LFE value at positions 0–10 nt relative to CDS start is positive and significant.

  2. B.

    Transition peak: the position of the minimum ∆LFE value in the range 0–300 nt, i*, is located in the range 20–80 nt relative to CDS start, and is significantly lower compared to all points in the ranges 0–10 nt and 100–200 nt relative to CDS start.

    To determine if the mean ∆LFE for a species in a given position i is significantly higher than the minimum (i*), the differences wi(p, n) between ∆LFE at the peak position and ∆LFE at the tested position were collected:

    $$ {w}_i\left(p,n\right)={d}_{i\ast}\left(p,n\right)-{d}_i\left(p,n\right) $$

    (p—CDS index, N ≤ 20—number of randomized sequences, i—position in CDS relative to start)

The Wilcoxon signed-rank test was used on all values wi(p, n).

  1. C.

    Negative mid: ∆LFE values at each position in the range 200–300 nt relative to CDS start and in the range 300–200 nt relative to CDS end are all negative and significant.

  2. D.

    Positive end: ∆LFE value at positions 10–0 nt relative to CDS end is positive and significant.

  3. E.

    Model structure: A + C + D

Model 2 (weak ends)

  1. A.

    Weak start: ∆LFE value at position 0 nt relative to CDS start is significantly higher than at positions 200–300 nt.

  2. B.

    Same as in model 1.

  3. C.

    Same as in model 1.

  4. D.

    Weak end: ∆LFE value at position 0 nt relative to CDS end is significantly higher than at positions 200–300 nt.

  5. E.

    Model structure: A + C + D

Binary classifier for ∆LFE strength

To measure the performances of several criteria in predicting ∆LFE strength, the following simple model was used. ∆LFE values for all species were divided into weak and strong groups based on the standard deviation of the mean ∆LFE at positions 0–300 nt. Species with standard deviation < 0.14 were included in the “weak ∆LFE” group. The binary classification of each species is based on 4 species traits as inputs, using the following rule (optimized using grid search):

$$ \mathrm{PredictedWeakLFE}=\left(\mathrm{Endosymbiont}=\mathrm{True}\right)\ \mathrm{or}\ \left(\mathrm{Genomic}\ \mathrm{GC}<38\%\right)\ \mathrm{or}\ \left(\mathrm{Genomic}\ {\mathrm{ENc}}^{\prime }>56.5\right)\ \mathrm{or}\ \left(\mathrm{Optimum}\ \mathrm{temp}>{58}^{\circ}\mathrm{C}\right) $$

Maximal information coefficient

Maximal information coefficient (MIC, [56, 57]) is a statistical measure of general (not necessarily linear) dependence between two variables. Informally, it is a generalization of R2 and also has values in the range 0.0–1.0, with high values indicating knowing the value of one variable allows inferring the value of the other. MIC was calculated using the minerva [78] package in R. p values were estimated using 10,000 random samples.

Correlogram plot

Correlogram plot (Additional file 1: Figure S2) was prepared using the phylosignal package in R.

Codon-bias metrics

Codon-bias metrics (CAI, CBI, Nc, Fop) were calculated for each genome using codonW [79] version 1.4.4. ENc′ [80] was calculated using ENCprime (github user jnovembre, commit 0ead568, October 2016) using the default settings. I_TE [43] was calculated using DAMBE7 [81], based on the included codon frequency tables for each species. DCBS was calculated according to [49].

Shine-Dalgarno binding strength

The Shine-Dalgarno (SD) strength for each gene was calculated according to, based on the minimal anti-SD hybridization energy found in the 20-nt region upstream of the start codon.

Visualization

Taxon characteristic profile chart

The mean ∆LFE profiles for CDS positions 0–300 nt relative to the CDS start and end within each taxon were summarized (Fig. 3a) by grouping species with similar profiles and plotting one profile representing each group. The grouping was achieved by clustering the ∆LFE profiles (as vectors of length 31) using K-nearest neighbors agglomerative clustering with correlation distances, using SciKit Learn [82]. The profile plotted to represent each group is the centroid (mean) of each cluster. To allow easy viewing of the region of interest, only positions 0–150 nt are shown for each cluster. K, the number of clusters for each taxon, was chosen (separately for the start and end profiles) to be the smallest value for which the maximum distance of any profile to the centroid cluster mean (i.e., the profile shown) was smaller than 0.8 for the start-referenced profiles and 1.3 for the end-referenced profiles. The full ∆LFE profiles for all species appear in Additional file 1: Figure S7.

PCA display for ∆LFE profiles

To summarize ∆LFE profiles and show how different values related to different profile types, we used PCA to obtain a two-dimensional arrangement in which similar ∆LFE profiles are mapped to nearby positions (see, for example, Fig. 3b). Also shown are the amounts of variance explained by each of the first two principal components.

PCA for the ∆LFE profiles (treated as vectors of length 31) was performed using SciKit Learn [82]. Analysis was limited to the first 3 components, and only the first two components are displayed (Additional file 1: Figure S6a,b). To verify the robustness of the PCA results, they were repeated using 500 samples with replacement from the same PCA input vectors and of the same size, and the angles between the component were verified to be approximately equal (Additional file 1: Figure S6c). To reduce clutter, overlapping profiles are hidden and the relative density at each position is shown in the background as blue shading (estimated as bivariate KDE with bandwidth determined by Scott’s rule using seaborn [83]) and also plotted on the axes.

Evolutionary and taxonomic trees were plotted using the ETE toolkit [84].