Background

Point mutations in coding regions of genes can be classified into two categories: synonymous and nonsynonymous. In mammals, synonymous mutations are largely neutral, though they may sometimes experience weak selection [1]. Nonsynonymous mutations, in contrast, are frequently subject to strong selection. The rate of fixed synonymous (or nonsynonymous) substitutions is often denoted as Ks (or Ka). Ks is typically treated as a reasonable proxy for neutral mutation rate, and the Ka/Ks ratio (abbreviated ω) is often used as a measure for the fixation probability of nonsynonymous mutations scaled to neutral expectation.

Recently, we discovered a strong positive correlation between ω and Ks in several mammalian lineages, arguing that the fixation probability of nonsynonymous mutations in a gene is positively corrected with the gene's neutral mutation rate [2]. This finding is highly unexpected under the classical neutral theory of molecular evolution, which argues that the fixation probability of nonsynonymous mutations is determined by (and serves as an estimator for) selective pressure, and as such, should be independent of neutral mutation rate [35]. Conventionally, therefore, ω should not show any positive correlation with Ks, and in fact, simulations demonstrate that only an inverse correlation should in theory exist between ω and Ks due to a mathematical artifact [2] (see below). The observation of a strong positive correction between ω and Ks therefore challenges the current paradigm of molecular evolution, and necessitates a reexamination of the widely held assumption that ω is determined primarily by selective pressure independent of neutral mutation rate.

There is, however, an important caveat to this finding. The positive correlation between ω and Ks was observed in mammalian lineages with relatively large sequence divergence, specifically, the human-mouse, human-rabbit, and mouse-rat lineages [2] (the average Ks of these lineages being 0.48, 0.38 and 0.18, respectively), but not found in the human-chimpanzee or the human-macaque lineages, which have much lower levels of sequence divergence (the average Ks being 0.012 and 0.063, respectively). Instead, an inverse correlation between ω and Ks was seen in these latter short lineages [2]. This is troubling because it leaves open the possibility that the positive correlation between ω and Ks observed in long lineages may actually be an artifact associated with the imprecise calculation of Ka or Ks. In particular, multiple hits at any given nucleotide position can occur with high probability in long lineages. Such multiple hits, difficult to correct since only one or zero changes can be observed at a given nucleotide position, can constitute a major source of error in the calculation of Ka and, even more so, Ks. It is formally possible, therefore, that errors stemming from the improper correction of multiple hits along with some other unknown factors have systematically biased the calculation of Ka and Ks in long lineages, in a manner that creates an artifactual positive correlation between ω and Ks.

We have argued previously that this inverse correlation may be due to the effect of stochastic noise on Ks, which affects short lineages more profoundly than long lineages. Specifically, stochastic deviation of Ks from the true underlying neutral mutation rate (μ) represents a much greater fraction of μ in short lineages as compared to long lineages, and our simulations showed that this could indeed lead to a strong inverse correlation between ω and Ks in very short lineages such as the human-chimpanzee lineage because Ks is also used as the denominator in the calculation of ω (which equals Ka/Ks) (see Supplementary Material in [2]).

If the failure to observe a positive correlation between ω and Ks in short lineages is indeed due to the inherently high levels of stochastic noise affecting Ks in short lineages, then a reasonable correction of noise in Ks might be able to bring out the positive correlation again. Here, we show that by introducing a method that corrects for stochastic noise in Ks, a strong positive correlation between ω and Ks indeed becomes observable in short lineages. This result strengthens the authenticity of this highly unexpected correlation, and argues that a biological mechanism (rather than a methodological artifact) is relating the fixation probability of nonsynonymous mutations to mutation rate.

Results

We generated a set of 6,779 genes each with orthologs from five primate and rodent species for which large-scale genome sequences are currently available: human, chimpanzee, rhesus macaque, mouse, and rat. A subset of 5,831 genes also included orthologs from dog. To ensure that the current data set was consistent with our earlier results based on a separately derived data set [2], we sought to recapitulate the positive correlation between ω and Ks in human-mouse and mouse-rat comparisons. Indeed, for both lineages, a highly robust correlation exists and binning of genes creates a visually striking representation of the correlation (Additional file 1). We also plotted human-dog and mouse-dog values and observed correlations similar to that seen in the human-mouse lineage (data not shown). To ensure that the correlation is not restricted to genes with orthologs in all the species sampled, we also obtained data sets containing only pairwise orthologs. Again, we observed correlations that are essentially the same as that seen in the five- or six-way ortholog sets (data not shown).

We then considered the human-chimpanzee and human-macaque ortholog pairs within the complete data set. As expected for these short lineages, plotting ω against Ks showed that not only is there not a positive correlation, but there is a strong inverse correlation between these two parameters for the human-chimpanzee lineage (Additional file 2). We have speculated previously that this inverse correlation is due to the inherently high levels of stochastic variation in Ks when it is used as a proxy for neutral mutation rate in short lineages [2]. If this is correct, then the strength of the inverse correlation should be stronger in the human-chimpanzee lineage than in the human-macaque lineage because the former is shorter and therefore suffers from an even higher level of stochastic noise in Ks. This is precisely what we observed (Additional file 2). Indeed, the general finding is that as evolutionary distance of a lineage decreases (and hence the stochastic noise associated with Ks relative to neutral mutation rate increases), the relationship between ω and Ks goes from a strong positive correction in the case of long lineages progressively toward a strong inverse correlation in the case of very short lineages.

In molecular evolutionary studies, Ks is frequently used as a proxy for neutral mutation rate (μ). It is often forgotten, however, that mutational events are a discreet process and as such are subject to stochastic variation. Over short periods of evolutionary time, this stochastic variation is often large relative to the true underlying neutral mutation rate. As evolutionary time lengthens and the number of mutational events increases, stochastic variation decreases relative to neutral mutation rate. This is clearly demonstrated by computer simulations in our previous study [2].

High levels of stochastic noise in Ks can lead to an artifactual inverse correlation is because Ks is in both parameters being corrected, and it is the denominator of ω. Mathematically, by just correcting for noise on the x-axis (Ks), the artifactual inverse correlation should go away even if there is still considerable noise on the y-axis. This prediction was clearly borne out by our previous simulation studies [2]. We therefore decided to focus on devising a means to correct for noise in Ks only, and to examine if such correction can eliminate the artifactual inverse correlation between ω and Ks in short lineages. This may reveal the underlying positive correlation, if any, that has been obscured.

One way to accomplish this is to use the Ks value from orthologs of the same gene but in a longer lineage (i.e., Ks between a pair of species with greater divergence time). Indeed, when we plotted ω of the human-chimpanzee lineage against Ks of the corresponding human-mouse lineage, not only did the inverse correlation disappear, but a positive correlation as typically observed in long lineages was seen (Figure 1). The same result should, and did, occur when human-macaque ω was plotted against human-mouse Ks (Figure 1). We note that an implicit assumption in this approach is that the neutral mutation rate of a gene is correlated across different mammalian lineages [69]. Although the existence of such a correlation has been contentious [6, 7, 10], we were able to verify it in our data set independent of GC content, including not only Ks but also K4 (Additional file 3).

Figure 1
figure 1

Positive correlation between ω of the short human-chimpanzee or human-macaque lineage and Ks of the long human-mouse lineage. In cases where genes are binned, all the genes used in the analysis are divided into bins based on ascending Ks (i.e., the first bin contains genes with the lowest Ks values and the last bin contains genes with the highest Ks values, etc.). There are 200 genes per bin except for the last bin, which contains whatever number of remaining genes that is 200 or less.

In a similar manner, it is possible to correct for variation in ω apart from Ks. By using the Ks value from the human-chimpanzee lineage but ω from human-mouse, we were also able to observe the positive correlation (Figure 2). It is interesting to note the differences between these two approaches. When Ks is corrected, the positive correlation appears more robust than when ω is corrected. This is likely due to a the fact that Ks is used as the denominator in the calculation of ω.

Figure 2
figure 2

Positive correlation between ω of the long human-mouse lineage and Ks of the short human-chimpanzee or human-macaque lineage. The binning of genes follows the convention in Figure 1.

In the above analyses, the lineage used to calculate ω is entirely subsumed by the lineage used to calculate Ks (e.g., when human-chimp ω was plotted against human-mouse Ks). This may introduce confounding effects. We therefore also plotted human-chimp or human-macaque ω against mouse-dog Ks. We found that a rather similar positive correlation between ω and Ks exists despite a complete lack of shared descent between the lineage used to calculate ω and the lineage used to calculate Ks (Figure 3)

Figure 3
figure 3

Positive correlation between ω of the short human-chimpanzee or human-macaque lineage and Ks of the long mouse-dog lineage. The binning of genes follows the convention in Figure 1.

In the initial analysis, ω of a short lineage is plotted against Ks of a long lineage. This reveals a positive correction presumably because Ks of the long lineage is a more accurate estimator of neutral mutation rate than Ks of the short lineage even for the short lineage. The ideal situation, however, is to plot ω of a short lineage against Ks of the same short lineage, and do so in a manner that corrects for the stochastic noise in Ks. One approach to correct for the noise in Ks affecting individual genes is to bin genes and plot the bin-average ω against bin-average Ks. Given that Ks of the long lineage is a more accurate proxy for neutral mutation rate, we binned genes based on Ks in the long lineage even though bin-average ω and Ks values were all derived from the short lineage. Using this approach, we first plotted bin-average ω against bin-average Ks in the human-chimpanzee lineage, using human-mouse Ks to bin genes. This revealed a robust positive correlation between ω and Ks (Figure 4). A similar positive correlation was seen between bin-average ω and Ks in the human-macaque lineage when human-mouse Ks was again used to bin genes (Figure 4). Comparable results were also obtained for either the human-chimpanzee or human-macaque lineage when mouse-dog Ks was used to bin genes.

Figure 4
figure 4

Positive correlation between bin-average ω of a short lineage and bin-average Ks of that same short lineage when genes are binned by Ks of a long lineage. The binning of genes follows the convention in Figure 1.

We note that the accuracy of Ka and Ks can be compromised by polymorphisms and that the effect is stronger for short lineages. This is because in short lineages, a considerable fraction of observed sequence differences between two reference genomes may actually be polymorphisms rather than fixed divergence. As such, Ka and Ks values calculated from two closely related reference genomes are inflated. This is especially true for Ka because a significant fraction of nonsynonymous polymorphisms can be slightly deleterious mutations destined for elimination before they can reach fixation [11]. This may contribute to the poor quality of Ks in approximating μ and Ka/Ks in approximating selective pressure, and thus the erosion of the positive correlation in short lineages. What remains clear, however, is that polymorphisms are an unlikely source of the positive correlation between ω and Ks. This is because in long lineages, the effect of polymorphisms on Ka and Ks is negligible given that only a tiny fraction of the observed sequence differences between two reference genomes are due to polymorphisms, and yet that a robust positive correlation between ω and Ks can been seen in long lineages. This effectively rules out polymorphisms as a major contributing factor to the correlation.

Discussion

In this study, we show that the positive correlation between ω and Ks is not restricted to specific mammalian lineages. Rather, the correlation can be observed across all mammalian species for which large-scale genome sequence data are available. It is particularly interesting that the correlation can be observed even in very short lineages once measures are taken to correct for the stochastic noise in Ks. This argues that the failure to observe a positive correlation between ω and Ks in short lineages in our previous study is indeed due to noise in Ks as we had speculated [2]. We note that the calculation of Ka and Ks is minimally confounded by the occurrence of multiple hits in short lineages. The ability to observe a strong positive correlation between ω and Ks in very short lineages therefore argues that the correlation is not an artifact stemming from the improper correction of multiple hits in the calculation of Ka and Ks. Our current study thus bolsters the authenticity of this correlation while demonstrating its broad applicability across the mammalian tree.

Another important message from the study is that Ks in long lineages may provide a much better estimator of neutral mutation rate than Ks in short lineage. This occurs because of the large amount of stochastic variation in Ks relative to the true underlying neutral mutation rate in short lineages than in long lineages. Indeed, when estimating neutral mutation rate of a gene in a short lineage, it may be more accurate to use the observed Ks of that gene in a long lineage (scaled down proportional to the genome-average differential in sequence divergence of the two lineages) than to use the observed Ks from the short lineage. This study and our unpublished data confirm that neutral mutation rate can remain relatively stable across different mammalian lineages [8, 9].

Neither the previous long-lineage study nor the current short-lineage study alone is sufficient to eliminate methodological artifacts from consideration. In the study of long lineages, the correlation is complicated by possible difficulties in correcting for multiple hits, though the stochastic noise in Ks and the effect of polymorphisms are not major issues. In the study of short lineages, the converse situation is true. It is thus gratifying that a strong positive correlation between ω and Ks can be observed not only for long lineages, but also for short lineages after noise in Ks is corrected. Together, these results argue that the correlation is the result of a biological mechanism rather than a methodological artifact.

It is as yet unclear why there should exist such a strong positive correlation between ω and Ks. Neither this study nor our previous study provides a definitive mechanism by which this correlation may occur, though several models have been proposed. Perhaps ω is reflective of some combined effect of selective pressure and neutral mutation rate, or perhaps selective pressure and neutral mutation rate influence each other in some way [2]. It was speculated that the occurrence of intragenic compensatory mutations, which is supported by some theoretical and empirical studies [1214], may contribute to the former scenario [2]. It was also suggested that modulated mutability may contribute to the latter scenario [2], that is, the correlation may be partly due to the fact that genes with conserved functions have evolved lower neutral mutation rates over evolutionary time [1517]. These speculations notwithstanding, the biological mechanism responsible for the correlation remains unclear.

Given the presence of the correlation in all the mammalian lineages for which there is sufficient genome sequence data, including very short lineages, it now seems clear that the correlation is indeed the result of a biologically meaningful process. The stage is set for future studies to identify the mechanisms underpinning this enigmatic correlation.

Methods

Using data from Ensembl v36 [18], sequences were obtained for: human, Homo sapiens (NCBI 35); chimpanzee, Pan troglodytes (PanTro 1.0); rhesus macaque, Macaca mulatta (Mmul 1.0); mouse, Mus musculus (NCBI m34); rat, Rattus norvegicus (RGSC 3.4); and dog, Canis familiaris (CanFam 1.0). Genes were clustered into orthologous groups using reciprocal best BLAST hits following established methods [1921]. We further attempted to ensure proper alignments by imposing maximum acceptable Ks cut-offs at roughly three standard deviations above average for all alignments. Sequences were curated for length differences and for poor alignment. This resulted in 5,831 orthologous groups containing a member from each of the six species with reasonable alignment and 6,779 orthologous groups containing all five primate and rodent species. Ortholog groups, alignments, and evolutionary estimates used in this study can be obtained through the SPEED database [21].

In-frame alignments of orthologs were performed using The Wisconsin Package v10.2 http://www.accelrys.com/. Evolutionary parameters were estimated using the Li method [22], though the results obtained were comparable (positive linear correlations between ω and Ks) when other methods such as PAML [23, 24] were used.