Journal of Molecular Evolution, Volume 63, Issue 3, pp 393–400

Specific Selection Pressure at the Third Codon Positions: Contribution to 10- to 11-Base Periodicity in Prokaryotic Genomes

Authors

  • Amir B. Cohanim
    • Department of Biotechnology and Food Engineering
  • Edward N. Trifonov
    • Genome Diversity Center, Institute of Evolution, University of Haifa
    • Department of Biotechnology and Food Engineering

DOI: 10.1007/s00239-005-0258-1

Cite this article as:
Cohanim, A.B., Trifonov, E.N. & Kashi, Y. J Mol Evol (2006) 63: 393. doi:10.1007/s00239-005-0258-1

Abstract

Prokaryotic sequences are responsible for more than just protein coding. There are two 10- to 11-base periodical patterns superimposed on the protein coding message within the same sequence. Positional auto- and cross-correlation analysis of the sequences shows that these two patterns are a short-range counter-phase oscillation of AA and TT dinucleotides and a medium-range in-phase oscillation of the same dinucleotides, spanning distances of up to ∼30 and ∼100 bases, respectively. The short-range oscillation is encoded by the amino acid sequences themselves, apparently, due to the presence of amphipathic α-helices in the proteins. The medium-range oscillation, related to DNA folding in the cell, is created largely by a special choice of the bases in the third positions of the codons. Interestingly, the amino acid sequences do contribute to that signal as well. That is, the very amino acid sequences are, to some extent, degenerate to serve the same oscillating pattern that is associated with the degenerate third codon positions.

Keywords

Prokaryotic genomes, DNA periodicity, Dinucleotides, Codon bias, Codon usage, Third codon positions, Supercoiling

Introduction

It is known that the choices of the third bases of codons are species-specific (Grantham et al. 1980; Aota and Ikemura 1986; Murray et al. 1989; Sharp and Li 1987; Shields et al. 1988; D’Onofrio et al. 1991). The third positions may influence the intensity of gene expression (Ikemura 1985; Sharp and Li 1987; Duret and Mouchiroud 1999; Duret 2000) and modulate overall (Andersson and Kurland 1990) and local (Guisez et al. 1993; Goldman et al. 1995; Komar and Jaenicke 1995) rates of translation. It appears, thus, that the third positions do carry some information not necessarily related to the protein, which is largely encoded in the first two codon positions (Trifonov 1989).

In the studies by Herzel et al. (1998, 1999) a strong 10- to 11-base periodicity of A and T is detected in prokaryotic genomes. In the yeast genome a similar periodicity is displayed, and the dinucleotides AA and TT are found to be responsible for the periodicity (Cohanim et al. 2005). There are two ranges of the dinucleotide oscillations. The short-range oscillation, up to ∼30 bases, where AA and TT are counter-phase, is due to the amphipathic α-helices (Herzel et al. 1998, 1999). The medium-range periodicity, with AA and TT in-phase, extends to ∼100 bases and is related to DNA folding (Herzel et al. 1999; Rodriguez 2002). The question arises how the medium-range periodicity is formed. In principle, all three codon positions may participate in the manifestation of the periodicity. The possible involvement of the first and second positions would mean that the protein sequence itself is designed to carry the medium-range periodicity. It also may be confined to the third positions of the codons. This study addresses the question, Which of the three codon positions is responsible for the medium-range periodicity, and to what degree? A firm conclusion is arrived at: the medium-range periodicity is carried mostly by the third coding positions. The protein sequence (first and second codon positions) plays some role as well, thus being partially degenerate to accommodate the periodicity.

Materials and Methods

Sequences

The sequences analyzed in this work were downloaded from NCBI (http://www.ncbi.nlm.nih.gov/). Names of the organisms and NCBI accession numbers are summarized in Table 1. These organisms display clear nucleotide/dinucleotide periodicity (data not shown; see also Schieg and Herzel 2004). Calculations and analysis were carried out separately for each organism. Each genome was found to exhibit the same sequence properties. In order to improve the signal-to-noise ratio, and for presentation clarity, combined calculation data for two large groups of sequences (Eubacteria and Archaea) are presented.
Table 1

List of complete genomes analyzed in this work

Organism          NCBI accession no.
E. coli           NC_000913
H. influenzae     NC_000907
P. gingivalis     NC_002950
V. cholerae       NC_002505
W. succinogenes   NC_005090
C. glutamicum     NC_003450
E. faecalis       NC_004668
L. innocua        NC_003212
L. plantarum      NC_004567
S. pneumoniae     NC_003028
A. fulgidus       NC_000917
M. jannaschii     NC_000909
M. thermo.        NC_000916

Positional Correlations

The periodical patterns in the sequences are normally detected by checking at what distances certain sequence elements follow one another. For example, the bases A and T in the work of Herzel et al. (1998, 1999) follow one another at preferred distances of 10, 20, 30, ... bases (in Archaea) or 11, 22, 33, ... bases (in Eubacteria). If only one sequence element is being analyzed, then the distance analysis is called positional auto-correlation. In cross-correlation, distances are calculated between two different sequence elements. To detect periodical patterns and characteristic distances between various dinucleotides, we used this technique without normalization of the occurrences between two motifs at discrete distances from one another, to allow for direct estimation of the statistical validity of the observations (Cohanim et al. 2005).
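The distance counting described above can be illustrated with a minimal Python sketch. It is not the code used in this work: the function names and the default range of 100 bases are assumptions made for illustration. The counts are left unnormalized, as in the text, and auto-correlation is simply the case in which both motifs are the same.

```python
from collections import Counter


def find_positions(seq, motif):
    """Start indices of every (possibly overlapping) occurrence of motif."""
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] == motif]


def distance_histogram(seq, motif_a, motif_b, max_dist=100):
    """Raw counts of motif_b starting d bases downstream of motif_a.

    Entry 0 of the returned list corresponds to distance 1, entry 1 to
    distance 2, and so on up to max_dist. No normalization is applied.
    """
    starts_b = set(find_positions(seq, motif_b))
    counts = Counter()
    for i in find_positions(seq, motif_a):
        for d in range(1, max_dist + 1):
            if i + d in starts_b:
                counts[d] += 1
    return [counts[d] for d in range(1, max_dist + 1)]
```

For the combined "AA or TT" histograms, the four lists (AA vs. AA, TT vs. TT, AA vs. TT, TT vs. AA) would be summed entry by entry.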

Filtering the Oscillations

The resulting positional frequency distributions (distance histograms) show, first, a very strong 3-base periodicity (e.g., in Fig. 1) typical for all protein coding sequences (Trifonov 1987). To observe any other periodicity this strong 3-base periodical component has to be filtered out. For this purpose the histograms were smoothed with a moving average of 3 bases, that is, a moving average of 3 for distance x was taken to be equal to the average of the positional frequencies for distances x – 1, x, and x + 1 (e.g., Herzel et al. 1999). Similarly, to get rid of low-frequency components (general trends in the histograms, like the overall slope in Fig. 1) a second filtering was used. First, the general trend was derived by smoothing the original plots with a moving average of 11 bases (Cohanim et al. 2005) to eliminate the 10- to 11-base periodicity (moving average of 11 for distance x was taken to be equal to the average of the values at distances from x – 5 to x + 5). The resulting smooth curve was then subtracted from the original nonfiltered histogram. This procedure results in the distributions that oscillate around the zero line, such that the 10- to 11-base component (if present) is revealed in pure form (Figs. 2–4).
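The two filtering steps can be sketched in Python as follows. This is an illustrative reading of the procedure, not the original code: the function names are hypothetical, and applying the 3-base smoothing after the trend subtraction is an assumption (the text states only that the 11-base trend is subtracted from the nonfiltered histogram). Bins near the edges, where a full window is not available, are dropped.

```python
def moving_average(values, window):
    """Centered moving average over an odd window.

    Only full windows are kept, so the result is shorter than the input by
    window - 1 entries; entry j is centered on input index j + window // 2.
    """
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def oscillating_component(hist, trend_window=11, codon_window=3):
    """Subtract the slow trend, then smooth out the 3-base oscillation.

    Assumption: the 3-base smoothing is applied to the detrended histogram.
    Entry k of the result corresponds to input index
    k + trend_window // 2 + codon_window // 2 (k + 6 with the defaults).
    """
    half = trend_window // 2
    trend = moving_average(hist, trend_window)
    # Histogram minus its 11-base moving average, on bins with a full window.
    detrended = [hist[j + half] - t for j, t in enumerate(trend)]
    return moving_average(detrended, codon_window)
```

A histogram that contains only a linear slope is reduced to zero by this procedure, which is the intended behavior for the general trend visible in Fig. 1.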
https://static-content.springer.com/image/art%3A10.1007%2Fs00239-005-0258-1/MediaObjects/239_2005_258_f1.gif
Fig. 1

Distance histogram for dinucleotides AA or TT in eubacterial genomes. The histograms were generated by summing the occurrences for AA vs. AA, TT vs. TT, AA vs. TT, and TT vs. AA. The thick line was obtained by smoothing the oscillations with a moving average of 3, to eliminate the background 3-base oscillations characteristic for the coding regions.

https://static-content.springer.com/image/art%3A10.1007%2Fs00239-005-0258-1/MediaObjects/239_2005_258_f2.gif
Fig. 2

Oscillating components of the distance histograms for AA and TT dinucleotides in eubacterial coding regions. The low-frequency components of the original distance histograms were subtracted (see Materials and Methods). Similar histograms for codon shuffled sequences (see Materials and Methods) are shown for comparison.

https://static-content.springer.com/image/art%3A10.1007%2Fs00239-005-0258-1/MediaObjects/239_2005_258_f3.gif
Fig. 3

Oscillating components of the distance histograms for AA or TT dinucleotides in the three possible reading frames of eubacterial coding regions. Positional correlation was calculated separately for each of the three reading frames. Similar histograms for the sequences with codon shuffled are shown for comparison.

https://static-content.springer.com/image/art%3A10.1007%2Fs00239-005-0258-1/MediaObjects/239_2005_258_f4.gif
Fig. 4

Oscillating components of the distance histograms for AA or TT dinucleotides in coding regions of Archaea. Positional correlations were calculated separately for each of the three reading frames. The upper graphs were calculated for all AA and TT occurrences. Similar histograms for codon shuffled sequences are shown for comparison.

Frame-Specific Dinucleotides

To analyze contributions of different codon positions to the 10- to 11-base periodicity, we considered separately the dinucleotides occupying the first and second positions, the second and third positions, and the third and first positions of the codons. The coding regions were located in the genome sequences using the gene coordinates in the ptt files from NCBI.
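A minimal sketch of this partitioning is given below. The function name and the labels "1-2", "2-3", and "3-1" are chosen here for illustration, and the input is assumed to be a single coding sequence that starts at the first base of its first codon.

```python
def frame_specific_positions(gene, dinuc):
    """Group the start indices of a dinucleotide by the codon positions it spans.

    gene is assumed to begin at the first base of its first codon, so index i
    falls on codon position (i % 3) + 1.
    """
    labels = {0: "1-2", 1: "2-3", 2: "3-1"}
    groups = {"1-2": [], "2-3": [], "3-1": []}
    for i in range(len(gene) - 1):
        if gene[i:i + 2] == dinuc:
            groups[labels[i % 3]].append(i)
    return groups
```

Positional correlations can then be computed separately for each of the three groups of positions.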

Dinucleotide Shuffling

To evaluate the statistical significance of the oscillating components, shuffled sequences were formed under the constraint of maintaining the dinucleotide frequencies the same as in the natural sequences. This procedure separately shuffles all 16 dinucleotides. For a given dinucleotide, for example, AT, the sequence is dissected into fragments with the cut at every dinucleotide AT so that each fragment starts with T and ends with A except for just two fragments at the ends of the sequence. The fragments are then shuffled. As a result the total number of AT dinucleotides after random reconnection of the fragments is strictly preserved.

Codon Shuffling

In order to determine whether the periodical oscillations of dinucleotides AA and TT are associated with their specific positions within the codons, respective control sequences were designed. The “codon shuffled sequence” was generated by shuffling the codons separately for each gene under the constraint of maintaining the amino acid order. This procedure, essentially, keeps the first and second positions of the codons unchanged, while the third positions are randomized. Neither protein sequence nor codon usage is affected. For each natural sequence many codon shuffled sequences were generated. The averages of the codon shuffled oscillating components and the noise due to relative freedom of the codon positions were estimated. The distance histograms calculated for codon shuffled sequences (Figs. 2–4; “Shuffled”) are the averaged oscillating components.
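One way to realize such a control is to permute, within each gene, the codons that encode the same amino acid. The Python sketch below follows that reading. It is an assumption-laden illustration, not the original code: the function names are hypothetical, the standard genetic code is assumed, codons containing other symbols are left in place, and any trailing incomplete codon is appended unchanged.

```python
import random
from collections import defaultdict

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order (assumed here).
CODE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES),
                AMINO_ACIDS))


def split_codons(gene):
    """Complete codons of a coding sequence that starts in frame."""
    return [gene[i:i + 3] for i in range(0, len(gene) - len(gene) % 3, 3)]


def translate(gene):
    """Amino acid string; codons outside the table are shown as '?'."""
    return "".join(CODE.get(c, "?") for c in split_codons(gene))


def codon_shuffle(gene, rng):
    """Permute synonymous codons within one gene.

    The amino acid order and the codon counts of the gene are preserved.
    """
    codons = split_codons(gene)
    groups = defaultdict(list)
    for idx, codon in enumerate(codons):
        # Unknown codons form their own group, so they never move.
        groups[CODE.get(codon, codon)].append(idx)
    shuffled = list(codons)
    for indices in groups.values():
        pool = [codons[i] for i in indices]
        rng.shuffle(pool)
        for i, codon in zip(indices, pool):
            shuffled[i] = codon
    return "".join(shuffled) + gene[len(codons) * 3:]
```

Calling `codon_shuffle(gene, random.Random(seed))` with many different seeds yields the ensemble of codon shuffled sequences whose oscillating components are averaged.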

Cosine Wave Fitting

A simple cosine curve fitting for “oscillating component histograms” was used to estimate the amplitudes of periodical components. A least-mean squares score was calculated while varying the fitting cosine wave parameters (amplitude, period, phase). Although the observed oscillations show a decaying pattern, the cosine fitting gives the value close to the average amplitude of the oscillation in the whole range.
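A least-squares cosine fit of this kind can be sketched as follows. The sketch is not the original procedure: it scans a user-supplied grid of candidate periods and, for each period, obtains the best amplitude and phase in closed form instead of varying them numerically. The model y ≈ A·cos(2πx/P − φ) without a constant offset, the function name, and the period grid are assumptions; the absence of an offset relies on the oscillating component being centered on zero.

```python
import math


def fit_cosine(xs, ys, periods):
    """Least-squares fit of y = A * cos(2*pi*x/P - phi) over candidate periods.

    For a fixed period the model is linear in a = A*cos(phi) and
    b = A*sin(phi), so the 2x2 normal equations are solved directly.
    Returns (amplitude, period, phase, sum_of_squared_errors).
    """
    best = None
    for period in periods:
        w = 2 * math.pi / period
        c = [math.cos(w * x) for x in xs]
        s = [math.sin(w * x) for x in xs]
        scc = sum(u * u for u in c)
        sss = sum(v * v for v in s)
        scs = sum(u * v for u, v in zip(c, s))
        scy = sum(u * y for u, y in zip(c, ys))
        ssy = sum(v * y for v, y in zip(s, ys))
        det = scc * sss - scs * scs
        if abs(det) < 1e-12:
            continue  # degenerate period for these x values
        a = (scy * sss - ssy * scs) / det
        b = (ssy * scc - scy * scs) / det
        sse = sum((y - a * u - b * v) ** 2 for y, u, v in zip(ys, c, s))
        if best is None or sse < best[0]:
            best = (sse, math.hypot(a, b), period, math.atan2(b, a))
    if best is None:
        raise ValueError("no usable period in the supplied grid")
    sse, amplitude, period, phase = best
    return amplitude, period, phase, sse
```

Because a single cosine is fitted to a decaying oscillation, the returned amplitude should be read as an average over the fitted range, as noted above.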

Results and Discussion

Eubacteria

The “AA or TT” positional frequency distribution (distance histogram) for eubacterial genome sequences, shown in Fig. 1, was obtained by summing the occurrences calculated for the dinucleotides AA vs. AA, TT vs. TT, AA vs. TT, and TT vs. AA, for all 10 genomes. The oscillations of the “AA or TT” positional frequency distribution (thin line) are smoothed with a moving average of 3 (thick line) to eliminate the background 3-base oscillations, characteristic for the coding regions (see Materials and Methods). A strong periodicity of the AA and TT dinucleotides is observed, with the period ∼11.1 bases. The significance of the observed oscillations was assessed for each eubacterial genome by comparing them with respective plots for shuffled sequences (see Materials and Methods). The amplitudes of the ∼11.1-base period oscillations of “AA or TT” in individual genomes range from 11 to 25 STD above noise. Compared to other dinucleotides, AA and TT are found to be the major contributors to the overall dinucleotide periodicity (data not shown).

Earlier studies have revealed that the alternation of hydrophobic and hydrophilic amino acids characteristic of α-helices has a periodicity of about 3.5 amino acids (Garnier et al. 1978; Kanehisa and Tsong 1980). Since codons TTX are coding for hydrophobic amino acids (phenylalanine and leucine) while codons AAX are coding for hydrophilic amino acids (asparagine and lysine), one would expect, respectively, the 10- to 11-base (3.5 × 3) periodicity of AA and TT in the protein coding nucleotide sequences. Importantly, the oscillations of the dinucleotides AA and TT in this case should be counter-phase to one another, reflecting the alternation of the hydrophobic and hydrophilic residues.

When all auto- and cross-correlations between AA and TT dinucleotides are considered (Fig. 2; “Natural”), the distance histograms show similar periodical behavior with the exception of the distributions AA vs. TT and TT vs. AA within the first ∼30 bases. Here, indeed, the expected counter-phase pattern is observed. The 30-base span roughly corresponds to the typical size of the α-helices, 10–12 amino acid residues (Penel et al. 1999; Pal et al. 2003; Engel and DeGrado 2004). Separate analysis of positional distributions of (all) hydrophobic and hydrophilic residues in crystallographically solved structures confirms that, indeed, the 10- to 12-residue span of the counter-phase oscillations is characteristic of α-helical regions only (data not shown). This is in good accord with the literature (Engel and DeGrado 2004).

The overall in-phase oscillation dominates. That is seen in the combined “AA or TT” curve (Fig. 2; “Natural”). It thus appears that there are two different periodical signals for AA and TT dinucleotides with similar periods: the in-phase medium-range (up to ∼100 base) oscillation and the short-range (∼30 base) counter-phase oscillation. While the short-range periodicity is apparently due to the α-helices, the medium-range one is likely to be dictated by the characteristic 100- to 200-base size of the amino acid periodical regions in prokaryotic genomes (Hosid et al. 2004; Tolstorukov et al. 2005).

The protein coding sequences that essentially make up the bulk of the prokaryotic genome not only encode the proteins, but also contain some other messages, superimposed on the same sequence (Trifonov 1989, 1997). An example of another message is the translation pausing message (Makhoul and Trifonov 2002), a special positional distribution of rarely used codons. In this study we observe three different overlapping messages carried by the eubacterial sequences: protein coding message, short-range counter-phase periodical pattern, and medium-range in-phase periodical pattern. We are, thus, confronted with the question, How can these three different messages coexist, and how do they interact with one another, being harbored by the same sequence? In particular, what are the contributions of the first, second, and third positions of the protein coding triplets to the two periodical signals?

In order to single out the elements contributing to the AA and TT periodicities, we first compared the positional frequency distributions for dinucleotides AA and TT calculated for the coding regions with those calculated for the sequences in which the codons were shuffled under the constraint of maintaining the amino acid order (see Materials and Methods). Thus, all third positions were, essentially, randomized, as well as a small proportion of the first and second positions (for leucine, serine, and arginine). This procedure, as one would expect, should largely eliminate possible contributions of the third positions to the AA/TT periodicities.

As mentioned before, the total “AA or TT” positional frequency distribution was obtained by summing the positional correlation values calculated for AA vs. AA, TT vs. TT, AA vs. TT, and TT vs. AA. By comparing the distributions of “AA vs. AA plus TT vs. TT” (Fig. 2; upper graphs) and “AA vs. TT plus TT vs. AA” (Fig. 2; middle graphs), one sees clearly that the (in-phase) oscillations beyond the third period are substantially weaker for the codon shuffled sequences (Fig. 2; “Shuffled”) compared to the natural sequences (Fig. 2; “Natural”). That is, the third codon positions, indeed, contribute to the medium-range periodical signal. Remarkably, however, the shuffling in the third positions did not fully eliminate the oscillation. This means that the amino acid sequence itself makes its own contribution to the medium-range periodicity. The counter-phase oscillation of the dinucleotides AA and TT is observed only within the first ∼30 bases, for both natural and codon shuffled sequences (Fig. 2; middle graphs), with essentially no difference in the amplitudes.

From the overall picture of ∼11-base in-phase periodicity (Fig. 2; bottom graphs), it is clear that the contribution from the amino acid sequences (Fig. 2; “Shuffled”) is about three times smaller than the oscillation observed in natural sequences (Fig. 2; “Natural”), an estimate that the calculations confirm. In other words, the medium-range periodicity is largely due to the third codon positions. By repeating the codon shuffling procedure (see Materials and Methods), we were able to estimate the statistical difference between observed periodicity (“selection”) and randomized third positions (“neutrality”). The difference is 67 STD, which is firm evidence that the third positions of the codons are, indeed, under selection.
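A difference expressed in STD units can be obtained by comparing the amplitude in the natural sequence with the distribution of amplitudes over the codon shuffled replicates. The sketch below shows one plausible form of that calculation. It is an assumption, not the documented procedure: the function name is hypothetical, and the use of the sample standard deviation is a choice made here.

```python
from statistics import mean, stdev


def deviation_in_std(natural_amplitude, shuffled_amplitudes):
    """Distance of the natural amplitude from the shuffled mean, in STD units.

    Uses the sample standard deviation of the shuffled replicates (assumed).
    """
    return (natural_amplitude - mean(shuffled_amplitudes)) / stdev(shuffled_amplitudes)
```

The more shuffled replicates are generated, the more reliable the estimate of their mean and spread becomes.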

The partitioning of the medium-range dinucleotide periodicity signal among three codon positions is also illustrated by the oscillations calculated separately for dinucleotides in positions 1 and 2, positions 2 and 3, and positions 3 and 1 of the codons. The distance histograms for “AA or TT” at positions 1 and 2 (Fig. 3; upper graphs), “natural” vs. “codon shuffled,” are only slightly different. Indeed, when codons are shuffled under the constraint of maintaining the amino acid order, the distribution of dinucleotides “AA” occupying positions 1 and 2 (from lysine and asparagine codons) stays unchanged. Dinucleotides “TT” in positions 1 and 2 (from leucine and phenylalanine codons) are partially replaced by “CT” from CTN codons for leucine.

Dinucleotides AA and TT occupying positions 2 and 3 and positions 3 and 1 include the third position. The medium-range oscillation, as Fig. 2 demonstrates, is, indeed, largely due to the third positions. The shuffling almost completely eliminates the signal. Some residual periodicity after the shuffling, shown in the right-middle and right-bottom graphs in Fig. 3, has a 3× and a 4× lower amplitude for positions 2 and 3 and positions 3 and 1, respectively, compared to oscillations in natural sequences (Fig. 3; at the left). This result clearly shows that the third codon positions are very much loaded. Nucleotides A and T of the third codon positions, apparently, are placed in such a way that dinucleotides AA and TT are formed (when the neighboring bases are suitable) to fit the observed periodical pattern.

Archaea

As in eubacteria, the archaeal genomes A. fulgidus, M. jannaschii, and M. thermo. display both short- and medium-range periodicities of AA and TT. The medium-range period in Archaea is about 10 bases, unlike the eubacterial one, which is ∼11 bases (Herzel et al. 1998b; Schieg and Herzel 2004). The distance histograms for “AA or TT,” “natural” vs. “codon shuffled” (Fig. 4, upper graphs), clearly show that, similarly to eubacteria, part of the medium-range periodicity is due to the amino acid sequences, which are thus adjusted to the overall ∼10-base periodicity. The medium-range oscillation, as the rest of Fig. 4 demonstrates, is largely due to the third positions in Archaea as well. The shuffling almost completely eliminates the signal coming from the third positions (see plots for positions 2 and 3 and positions 3 and 1).

Concluding Remarks

The above analysis of contributions of three frames of the codons to the AA and TT periodicities thus shows that the short-range (up to ∼30-base) AA and TT counter-phase oscillation originates from the amino acid sequences, apparently due to the alternation of hydrophobic and hydrophilic amino acids characteristic of amphipathic α-helices. The in-phase medium-range (up to ∼100-base) oscillation mainly comes from the third positions. Interestingly, the protein sequence also contributes to the medium-range oscillation. Both in eubacteria and in archaea, the AA and TT periodicities provided by the amino acid sequences match the overall medium-range oscillations and follow the periods of ∼11 and ∼10 bases, respectively. In other words, the protein sequences are adjusted to these additional medium-range oscillation signals, apparently without compromising the respective protein functions. This also means that the protein sequences are flexible, i.e., degenerate to a certain degree.

In the works of Herzel et al. (1998, 1999) the medium-range oscillation was associated with a superhelical structure of prokaryotic DNA (Vologodsky 1992). From Crick’s (1976) formula for superhelical DNA trajectories, it would follow that in eubacteria the observed ∼11-base periodicity would facilitate the negative supercoiling (Herzel et al. 1999). Similarly, the 10-base periodicity of archaeal DNA was interpreted as a reflection of positive supercoiling (Herzel et al. 1999; Rodriguez 2002). Noncoding sequences do display the same 10- to 11-base periodicity of AA and TT dinucleotides (data not shown), suggesting that, indeed, the described sequence periodicity is a property of prokaryotic DNA in general rather than of coding sequences only. The protein coding sequences, thus, have to accommodate this pattern dictated by DNA, and this is achieved by respective biases both in the third codon positions and in the encoded protein sequence.

Any pattern emerging in the sequences may well be generated by elementary random events under some specific selection pressures. So are the periodical patterns observed. The most likely reasons for the pattern formation in this case are the importance of amphipathic α-helices for protein stability (short-range signal) and maintenance of superhelical and curvature features in DNA, intrinsically linked to the periodical occurrence of the respective dinucleotides.

Given a whole repertoire of various involvements of the third codon positions (translation pausing, rate of translation, species-specific pressures, short- and medium-range 10- to 11-base periodicities), it thus appears that the third positions are under selection. The resulting choice of the third base at any given location along the sequence has to yield to these local and global pressures. Correspondingly, an overall codon usage is not reducible to any one specific factor or role. In order to fully explain the nonuniform usage of codons and codon usage variations, one should instead study the various sequence codes that utilize the degenerate third positions and look for possible additional codes which have not been discovered yet.

Acknowledgments

We are grateful to anonymous reviewers for raising several important issues.

Copyright information

© Springer Science+Business Media, Inc. 2006