Background

The nucleotide sequences of all contemporary genomes are the result of a compromise between mutational pressure and selection [1]. Many mutations which took place in the past have been eliminated by genetic death. Even so, the nucleotide composition of protein coding sequences differs from that of intergenic sequences, and it is very difficult to discriminate between the effects of selection and of mutation on their composition. Furthermore, many prokaryotic genomes have a very asymmetric nucleotide composition of their chromosomes [e.g. 2–9]. The composition of a DNA strand depends on the role the strand plays in replication: leading or lagging. Usually the leading strand is richer in guanine (G) than in cytosine (C) and richer in thymine (T) than in adenine (A). Replication-associated mutational pressure is thought to be the most probable cause of this asymmetry [10–12].

Analyses of long-range correlations in DNA sequences revealed that a very strong triplet signal can be detected in intergenic sequences [13, 14]. This signal can be created by fragments of coding sequences transferred into intergenic space by recombination mechanisms. Since the nucleotide compositions of the first, second and third positions in codons are strongly correlated, these correlations are seen even in some noncoding intergenic sequences. We have assumed that some intergenic sequences are derived from coding sequences and could freely accumulate mutations with frequencies determined by the replication-associated mutational pressure. If the time of divergence has not been very long, the homology between the intergenic sequences and their original protein coding sequences can still be found (we call these original coding sequences the reference sequences).

We have assumed that mutations accumulated only in the intergenic sequences and not in the reference sequences. This is not exactly true, but it made our studies feasible and should give a good approximation of the mutational pressure exerted on intergenic sequences. Many other authors who have constructed substitution matrices from the mutations accumulated in pseudogene sequences have made the same assumption [15, 16]. Such an assumption could give a higher estimated mutation rate than the real one. Nevertheless, the substitution rates in the matrices are expressed as relative values, so the assumption should not change the values in the matrix.

We chose the B. burgdorferi genome for our analyses because there are many indications that this genome is in a steady state. The B. burgdorferi genome is very asymmetric, which suggests its structural conservation [17]. Either there have not been many inversions of genes between the leading and lagging strands, or the mutational pressure has had enough time to make the inverted genes resemble the genes of the new strand [18]. The nucleotide composition of the third positions in codons testifies to the very conserved structure of the chromosome. These positions follow precisely the sign of the asymmetry of intergenic sequences, and the third positions of Open Reading Frames (ORFs) situated on the leading and lagging strands have precisely mirrored asymmetry, which is even stronger than that of intergenic sequences [17]. This paradox could be explained by assuming that the highly degenerate third positions have accumulated more neutral or nearly neutral mutations introduced by replication-associated processes because they stay at their positions longer than intergenic sequences do. There are constraints on inversions of coding sequences but no constraints on inversions of intergenic sequences. Thus, some newly inverted intergenic sequences could complement the asymmetry of the "new host" strand.

Results and Discussion

Testing the table of substitutions and verifying the assumptions

Having experimentally found the rates of all types of substitutions (Table 1, Borrelia burgdorferi Table of Substitutions (BbTS)), we were able to test these data and to verify our previous assumptions. In equilibrium, the number of nucleotides of a given type substituted by other nucleotides should be balanced by the number of nucleotides of that type substituting the others. The following four equations should be fulfilled:

Table 1 Tables of substitutions, DNA composition in the equilibrium with the mutational pressure and half times of nucleotide substitutions.

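$$N_{A>G} + N_{A>C} + N_{A>T} = N_{G>A} + N_{C>A} + N_{T>A} \qquad (1)$$

$$N_{G>A} + N_{G>C} + N_{G>T} = N_{A>G} + N_{C>G} + N_{T>G} \qquad (2)$$

$$N_{C>A} + N_{C>G} + N_{C>T} = N_{A>C} + N_{G>C} + N_{T>C} \qquad (3)$$

$$N_{T>A} + N_{T>G} + N_{T>C} = N_{A>T} + N_{G>T} + N_{C>T} \qquad (4)$$

where $N_{A>G} = N_A \cdot p(A>G)$, $N_A$ is the number of A nucleotides in the sequence, and $p(A>G)$ is the probability of substitution of A by G, taken from the BbTS (the other symbols are defined analogously).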

Note that the equations contain numbers, not frequencies. Fulfilling these equations means that the nucleotide composition of sequences subjected to the mutational pressure determined by the parameters of BbTS is in equilibrium. We have assumed that in the case of the B. burgdorferi genome the best approximation of such sequences is the composition of the third positions of codons of ORFs, as argued in the Background section. Thus, the nucleotide composition of this set of nucleotides should not change significantly under such mutational pressure. To test this, we simulated the mutational pressure on a sequence of the same composition, as described previously [19], and after 10,000 Monte Carlo Steps (MCS), when the sequence was in equilibrium, we compared it to the sequence before the simulation. The ratios of nucleotide numbers after and before the simulation were 0.994, 1.008, 0.992 and 0.988 for A, T, G, and C, respectively (note that the ratios do not sum to 1 because they are not weighted). There are no significant changes in the nucleotide composition of the third positions after prolonged exposure to the mutational pressure described by BbTS (Chi square test, p = 0.99987). Thus, BbTS generates a DNA sequence with a nucleotide composition corresponding precisely to the nucleotide composition of the third codon positions. In Fig. 1 we show the evolution of two DNA sequences, one of which originally had an equimolar nucleotide composition and the other the nucleotide composition of the third codon positions of ORFs from the leading strand. Both sequences reach the same final nucleotide composition. Furthermore, a sequence obtained after long evolution in the computer has very similar asymmetry, in terms of GC skew and AT skew, to the sequence of the third codon positions before evolution. GC skew is (G − C)/(G + C) and AT skew is (A − T)/(A + T). The AT skew is -0.23 and -0.22 for the sequences before and after simulation, respectively. The GC skew is 0.34 for the sequences both before and after simulation. Note that the most frequent substitution is the C>T transition, which is in agreement with the cytosine deamination theory (see [10] and references therein), and that the average transition frequency is twofold higher than the transversion frequency.
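The skew definitions given above can be written as a short Python sketch. The function name `skews` and the toy input string are illustrative assumptions; the input is not B. burgdorferi data.

```python
from collections import Counter


def skews(sequence):
    """Return (GC skew, AT skew) of a DNA string.

    GC skew = (G - C) / (G + C), AT skew = (A - T) / (A + T).
    A skew is reported as 0.0 when its denominator is zero.
    """
    counts = Counter(sequence.upper())
    g, c, a, t = (counts[n] for n in "GCAT")
    gc_skew = (g - c) / (g + c) if g + c else 0.0
    at_skew = (a - t) / (a + t) if a + t else 0.0
    return gc_skew, at_skew


# Hypothetical toy sequence: 3 G, 1 C, 2 A, 3 T.
gc, at = skews("GGGCATTTA")
print(gc, at)
```

A positive GC skew together with a negative AT skew is the pattern the text reports for the leading strand.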

Figure 1

Evolution of DNA sequences under the mutational pressure described by the "real BbTS". Light lines indicate the fractions of nucleotides in the sequence which initially has been composed of equal numbers of each nucleotide. Bold lines show the fractions of nucleotides in a sequence of nucleotide composition of the third positions in codons of the B. burgdorferi coding sequences from the leading strand. x-axis – the number of Monte Carlo Steps (MCS), y-axis – fraction of nucleotides in the evolving strand.

Properties of the substitution matrices

Let us consider only the nucleotides present in the original sequence, which is already in the steady state. The substitution of each of the four nucleotides follows the same rules as the decay of radioactive isotopes, with a "half time of substitutions" characteristic of each nucleotide ($\tau_A$, $\tau_G$, $\tau_T$, $\tau_C$ for A, G, T, and C, respectively) determined by the sum of the probabilities of substitution of the given nucleotide by the other three nucleotides. In more formal language:

$$\tau_A = \frac{\ln 2}{p_{mut}\,\bigl(p(A>G) + p(A>T) + p(A>C)\bigr)}$$

(and analogously for the nucleotides other than A), where $p_{mut}$ is a parameter which denotes the overall rate of mutations and does not influence the ratios between $\tau$ for different nucleotides.
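As a minimal sketch of the half-time formula above, the following Python snippet computes the half time for each nucleotide and the corresponding first-order decay. The table `TABLE` and the value of `P_MUT` are hypothetical placeholders chosen for illustration; they are not the BbTS values of Table 1. The exponential form is the continuous approximation of the per-step process.

```python
import math

# Hypothetical substitution probabilities p(X>Y); these are NOT the BbTS values.
TABLE = {
    "A": {"G": 0.10, "T": 0.06, "C": 0.04},
    "T": {"A": 0.08, "G": 0.03, "C": 0.07},
    "G": {"A": 0.15, "T": 0.09, "C": 0.04},
    "C": {"A": 0.07, "T": 0.22, "G": 0.05},
}
P_MUT = 0.01  # assumed overall mutation rate per step


def half_time(nucleotide, table, p_mut):
    """Half time of substitution: ln 2 divided by the total exit rate."""
    return math.log(2) / (p_mut * sum(table[nucleotide].values()))


def surviving_fraction(nucleotide, table, p_mut, steps):
    """Fraction of the original nucleotides not yet substituted after `steps`
    (first-order decay, continuous approximation)."""
    return math.exp(-p_mut * sum(table[nucleotide].values()) * steps)


for n in "ATGC":
    print(n, round(half_time(n, TABLE, P_MUT), 1))
```

Changing `P_MUT` rescales all four half times by the same factor, so their ratios stay the same, as stated in the text.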

It also follows that in equilibrium the fraction of a nucleotide which has been substituted is exactly the same as the fraction of this very nucleotide substituting the other ones (the two sides of equations 1–4). Thus, after the half time of substitutions the ratio between the "old" nucleotides and the "new" nucleotides is 1:1 (see Fig. 2A and Fig. 2C). This is a general property of any table of substitutions in the equilibrium state. But BbTS has another property: the half time of substitutions is precisely correlated with the frequency of the given nucleotide in the sequence in equilibrium, with a correlation coefficient of 0.999 (p = 0.0007, Fig. 2B). This is not a feature of every matrix of substitutions. We have tried to find analytically a table of substitutions which would generate a DNA sequence with the nucleotide composition of the analysed sequence of B. burgdorferi (see also [19]). One such table is presented in Table 1. This "artificial table" generates a DNA sequence of the same nucleotide composition as BbTS does, but the correlation coefficient between the half time of substitutions and the fraction of nucleotides in the sequence is close to zero (Fig. 2C and 2D).
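The equilibrium property described above can be checked numerically. The sketch below iterates a composition under a substitution table until it stops changing and then verifies that, for every nucleotide, the outflow equals the inflow (the balance expressed by equations 1–4, here in fractions rather than numbers). The table is a hypothetical placeholder, not the BbTS, and the helper names `step`, `equilibrium` and `net_flux` are assumptions made for this illustration.

```python
NUCS = "ATGC"

# Hypothetical substitution probabilities p(X>Y); these are NOT the BbTS values.
TABLE = {
    "A": {"G": 0.10, "T": 0.06, "C": 0.04},
    "T": {"A": 0.08, "G": 0.03, "C": 0.07},
    "G": {"A": 0.15, "T": 0.09, "C": 0.04},
    "C": {"A": 0.07, "T": 0.22, "G": 0.05},
}


def step(comp, table):
    """Apply one round of substitutions to a composition (fractions summing to 1)."""
    new = {}
    for n in NUCS:
        lost = comp[n] * sum(table[n].values())
        gained = sum(comp[m] * table[m][n] for m in NUCS if m != n)
        new[n] = comp[n] - lost + gained
    return new


def equilibrium(table, tol=1e-13, max_iter=100_000):
    """Iterate from an equimolar composition until the composition is stable."""
    comp = {n: 0.25 for n in NUCS}
    for _ in range(max_iter):
        new = step(comp, table)
        if max(abs(new[n] - comp[n]) for n in NUCS) < tol:
            return new
        comp = new
    raise RuntimeError("composition did not converge")


def net_flux(comp, table):
    """Outflow minus inflow for each nucleotide; close to zero at equilibrium."""
    after = step(comp, table)
    return {n: comp[n] - after[n] for n in NUCS}


eq = equilibrium(TABLE)
print({n: round(f, 4) for n, f in eq.items()})
print(net_flux(eq, TABLE))
```

The equilibrium composition does not depend on the overall mutation rate, which is why no `p_mut` parameter appears here; it only sets how quickly the equilibrium is approached.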

Figure 2

The rate of substitution of nucleotides in the DNA sequence in equilibrium under the mutational pressure described by: A – the "real BbTS" and, C – the "artificial BbTS". Bold lines show the fractions of nucleotides which have not been substituted yet, light lines indicate the fractions of nucleotides which appeared when substituting other nucleotides. Plots B and D represent the relations between half time of substitutions and the sizes of nucleotide fractions for sequences in equilibrium under the mutational pressure of "real BbTS" and "artificial BbTS" respectively. See also text and description for Fig. 1.

We claim that the mutational pressure leading to nucleotide substitutions is very highly correlated with the DNA composition of the genome, in such a way that a higher substitution turnover of a nucleotide determines a lower fraction of this nucleotide in the DNA sequence.
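The relation claimed above can be quantified with a Pearson correlation between the half times and the equilibrium fractions. The sketch below uses a deliberately simple toy model, which is an assumption made only for illustration: the probability of a substitution depends solely on the target nucleotide (values in `Q`), so the equilibrium fraction of each nucleotide is proportional to its `Q` value. Neither `Q` nor `P_MUT` comes from Table 1.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical toy model: p(X>Y) = Q[Y] for every X different from Y.
Q = {"A": 0.3, "T": 0.4, "G": 0.2, "C": 0.1}
P_MUT = 0.01  # assumed overall mutation rate per step

total = sum(Q.values())
# Equilibrium fraction of each nucleotide in this toy model.
fractions = [Q[n] / total for n in "ATGC"]
# Half time: ln 2 over the total rate of leaving the nucleotide.
half_times = [math.log(2) / (P_MUT * (total - Q[n])) for n in "ATGC"]

r = pearson(fractions, half_times)
print(round(r, 3))
```

With real data, `fractions` would be the equilibrium composition generated by a published table and `half_times` the values of τ computed from the same table.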

It seems very unlikely that such a correlation in the B. burgdorferi genome has arisen accidentally. We have tested many other tables of substitutions which had been published for different genomes and different sequences (data collected in Table 1). As long as such matrices describe the substitutions for sequences which are not under selection pressure (i.e. pseudogenes or the third positions in codons), they follow the same rule, with an extremely high correlation between $\tau_N$ and the fraction of the nucleotide N in the DNA sequence in equilibrium (all correlations were statistically significant). This rule holds for asymmetric DNA, as in the B. burgdorferi genome, and for much less biased eukaryotic DNA. Matrices found for the third positions in the four-fold degenerate codons in Drosophila mitochondrial DNA [20] fulfil this rule more precisely than those for all third positions in codons in that organelle's genome (the same results were obtained for matrices of primate mtDNA published in [21], data not shown). These differences could be expected if some mutations in the third positions, those leading to amino acid substitutions, are not neutral. Furthermore, in some instances, for example for the table describing substitution rates in sequences under strong selection [22], we have not found a correlation between $\tau_N$ and the fraction of nucleotide N (see the last column in Table 1). That supports the hypothesis that the rule is a specific property of pure mutational pressure. One can also notice that matrices found by analysis of substitutions in different pseudogenes in the same organism, or in very closely related organisms, give different DNA compositions in equilibrium, which supports the thesis that the mutational pressure varies between different regions of the same eukaryotic genome [23–25].

We have no clear answer to the question of what selection forces have tuned the mutational pressure in such a way that it follows strict rules for sequences released from selection. It is logical that nucleotides with higher turnover destabilise the genetic information and that selection would tend to eliminate them from the DNA molecule. On the other hand, a lower frequency of a nucleotide gives it a higher informative value, while at the same time the deviation from equimolar fractions of nucleotides in DNA diminishes the coding capacity of the whole molecule. Perhaps mathematical analysis of this phenomenon, taking into consideration the properties of the universal genetic code, will show that the optimum for information transfer by the DNA molecule lies just at such points. Further studies may show other properties of these strategic points, where τ determines a very specific balance between the DNA composition and the mutational pressure.

Such evolutionarily established relations between the DNA composition and the turnover rates of nucleotides would have a great impact on the understanding of genome evolution itself. They make it possible to estimate analytically the relations between the mutational pressures exerted on specific nucleotides of each genome, i.e. by simply computing the nucleotide composition of sequences which are not subject to selection pressure. Having the mutational pressure in terms of nucleotide turnover, one can estimate the selection pressure exerted on any sequence or position in codons. For example, Fig. 3 shows what the fate of the first codon positions of ORFs from the leading strand of the B. burgdorferi genome would be under the BbTS mutational pressure, without selection. Note that the half time of substitutions of each nucleotide is the same as for other sequences under such mutational pressure, but the rate at which new nucleotides appear is different and the composition of the sequence would change non-linearly during evolution. It is also simple to calculate, from the results of computer simulations, the corrections for multiple substitutions and reversions, which is important for estimating the real divergence time. It is clear that such corrections should take into account the different contributions of the turnover of each nucleotide to the overall frequency of multiple substitutions. Knowing the mutational pressure precisely, one can not only predict the selection pressure but also reconstruct the history of the sequence.

Figure 3

Changes in the nucleotide composition of the first positions in codons of B. burgdorferi coding sequences from the leading strand under BbTS mutational pressure. Descriptions as for Fig. 1 and 2. Note that the fractions of nucleotides which have not been substituted are exactly as in Fig. 2A but the fractions of nucleotides which substituted other nucleotides are far from being symmetrical to the first ones.

Keeping in mind the precise relations between the fraction of a nucleotide and its turnover time, symmetric DNA (with A=T and G=C) is a specific case where the turnover times of the nucleotides in each pair equal each other. Asking which type of substitution should be blamed for DNA asymmetry makes sense for the mutational pressure exerted on DNA released from selection pressure. Now a simple test for such a mutational pressure is available: it should generate DNA in equilibrium whose nucleotide composition fulfils the rule of linear interdependence between the sizes of the nucleotide fractions and their turnover times.

Conclusions

Substitution matrices enable counting the DNA composition in equilibrium with a given mutational pressure. It is possible to test if a given substitution matrix is the pure mutational matrix or if it is "contaminated" with the effects of selection. The difference between the DNA composition in equilibrium with mutational pressure and a DNA sequence under both mutational and selection pressures allows for estimation of the effect of selection pressure exerted on the particular sequence.

Materials and Methods

Construction of the substitution table

To estimate the frequency of substitutions, we analysed the differences between coding sequences of the B. burgdorferi genome and sequences homologous to them found in the intergenic regions. For the data, see Additional file 1. The sequence of the B. burgdorferi genome [26] was downloaded from http://www.ncbi.nlm.nih.gov. For our analysis, we extracted all intergenic sequences longer than 90 nucleotides. We translated them into amino acid sequences in all six reading frames. The amber and ochre stop codons were translated as tyrosine residues and opal as tryptophan. Then we searched databases for homology with the B. burgdorferi protein sequences using the FASTA program [27]. For detailed amino acid alignment data see Additional file 2. After selecting homologues (with E < 0.05) whose formerly (presumably) coding strands were duplicated on the leading strand, we aligned the nucleotide sequences of these intergenic sequences with the sequences of the reference ORFs using the CLUSTAL X program [28] and counted the nucleotide substitutions. The number of analysed alignment sites was 3737 and the average number of substitutions per site was 0.46. For detailed nucleotide alignment data see Additional file 3. The observed numbers of nucleotide substitutions from nucleotide i to j (where i and j stand for A, T, G or C, and i ≠ j) were converted to relative substitution frequencies according to Gojobori, Li, and Graur [29] and Francino and Ochman [30]. That allowed us to calculate the frequency of each of the twelve possible substitutions on the leading strand. Since the observed substitution rates were different for each of the four nucleotides, we introduced corrections for multiple substitutions and reversions for each type of substituted nucleotide instead of one general correction. That is, for each type of nucleotide we counted the fraction substituted (the observed number) and used it to estimate the corrected substitution number according to Kimura's formula [31]. The frequencies of substitutions, normalised in such a way that the sum of all 12 frequencies equals 1, are shown in Table 1 (Table of Substitutions – BbTS).
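The final normalisation step can be sketched as follows. This is one plausible reading of the procedure, stated here as an assumption: each observed count is divided by the number of reference sites occupied by the substituted nucleotide, and the twelve resulting rates are scaled to sum to 1. The correction for multiple substitutions and reversions is not included, and the counts and site numbers are invented for illustration.

```python
NUCS = "ATGC"


def relative_frequencies(counts, sites):
    """Convert observed substitution counts into 12 relative frequencies.

    counts[i][j]: observed number of i>j substitutions.
    sites[i]:     number of aligned reference sites occupied by nucleotide i.
    The returned frequencies sum to 1.
    """
    rates = {
        (i, j): counts[i][j] / sites[i]
        for i in NUCS
        for j in NUCS
        if i != j
    }
    total = sum(rates.values())
    return {pair: rate / total for pair, rate in rates.items()}


# Hypothetical counts and site numbers; NOT the data from the additional files.
counts = {
    "A": {"T": 30, "G": 60, "C": 20},
    "T": {"A": 40, "G": 15, "C": 35},
    "G": {"A": 45, "T": 25, "C": 10},
    "C": {"A": 15, "T": 50, "G": 10},
}
sites = {"A": 1000, "T": 1200, "G": 500, "C": 300}

freqs = relative_frequencies(counts, sites)
for (i, j), f in sorted(freqs.items()):
    print(f"{i}>{j}: {f:.4f}")
```

Dividing by the number of sites per nucleotide matters in a biased genome: a rare nucleotide can have a high substitution rate even when its raw substitution count is low.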

Computer simulations

Computer simulations were performed on DNA sequences corresponding to the real protein coding sequences of the B. burgdorferi leading strand or on random DNA sequences generated by computer. In the first case, all ORFs longer than 100 codons situated on the leading strand of the Borrelia genome were spliced together. In the second case, the DNA sequence was constructed by computer by drawing consecutive nucleotides with probabilities given by the assumed composition of this artificial sequence.
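Drawing an artificial sequence with a prescribed composition, as described above, might look like the following sketch. The function name `random_sequence`, the `seed` parameter and the example composition are assumptions made for illustration.

```python
import random


def random_sequence(length, composition, seed=None):
    """Draw `length` nucleotides independently, with probabilities
    proportional to the values in `composition`."""
    rng = random.Random(seed)
    nucleotides = list(composition)
    weights = [composition[n] for n in nucleotides]
    return "".join(rng.choices(nucleotides, weights=weights, k=length))


# Equimolar sequence and a hypothetical AT-rich composition.
equimolar = random_sequence(1000, {"A": 1, "T": 1, "G": 1, "C": 1}, seed=1)
at_rich = random_sequence(
    100_000, {"A": 0.3, "T": 0.4, "G": 0.2, "C": 0.1}, seed=1
)
print(at_rich.count("T") / len(at_rich))
```

Because each position is drawn independently, the realised composition only approximates the target; the deviation shrinks as the sequence gets longer.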

During the simulation of mutational pressure, in one Monte Carlo Step (MCS) each nucleotide in the sequence was drawn with a probability equal to pmut and, if drawn, was substituted with the probability described by the substitution matrix. Note that a nucleotide drawn for substitution is not necessarily substituted. After each MCS, each substitution was counted under its specific type of substitution; additionally, the evolving sequence was compared with the original sequence and the accumulated substitutions were counted. This allowed us to measure not only the divergence rate but also the mutation rate subdivided into 12 different classes. The dynamics of substitution were also calculated analytically using the equations describing a first-order reaction rate. For more details on the computing methods see Kowalczuk et al. [19, 32].
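One Monte Carlo Step as described above can be sketched in a few lines of Python. This is an illustrative reconstruction under stated assumptions, not the authors' program: the table `TABLE` is a hypothetical placeholder rather than the BbTS, and the function names and the value of `p_mut` used in the example are invented.

```python
import random

# Hypothetical substitution probabilities p(X>Y); these are NOT the BbTS values.
TABLE = {
    "A": {"G": 0.10, "T": 0.06, "C": 0.04},
    "T": {"A": 0.08, "G": 0.03, "C": 0.07},
    "G": {"A": 0.15, "T": 0.09, "C": 0.04},
    "C": {"A": 0.07, "T": 0.22, "G": 0.05},
}


def monte_carlo_step(seq, table, p_mut, rng):
    """Run one MCS on `seq` (a list of nucleotides, modified in place).

    Each position is drawn with probability p_mut; a drawn nucleotide X is
    replaced by Y with probability table[X][Y] and otherwise left unchanged.
    Returns the number of substitutions of each type made in this step.
    """
    counts = {}
    for pos, old in enumerate(seq):
        if rng.random() >= p_mut:
            continue  # position not drawn in this step
        r = rng.random()
        cumulative = 0.0
        for new, prob in table[old].items():
            cumulative += prob
            if r < cumulative:
                seq[pos] = new
                counts[(old, new)] = counts.get((old, new), 0) + 1
                break
    return counts


def divergence(a, b):
    """Fraction of positions at which two equally long sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
original = list("ATGC" * 250)
evolving = list(original)
first_step = monte_carlo_step(evolving, TABLE, 0.5, rng)
after_one = divergence(original, evolving)
for _ in range(99):
    monte_carlo_step(evolving, TABLE, 0.5, rng)
print(after_one, divergence(original, evolving))
```

After a single step the number of counted substitutions equals the number of differing sites; over many steps the two diverge because of multiple substitutions and reversions, which is why the text tracks both quantities.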