Background

In an early attempt to relate theoretical concepts and empirical results bearing on the molecular basis of evolution, Kimura and Ohta listed five principles of molecular evolution [1,2]. The first of these deals with the concept of the molecular clock and predicts that, as long as a protein's function remains unaltered, the protein's rate of evolution will be approximately constant within different phylogenetic lineages. From this principle, it follows that detection of changes in a protein's rate of evolution may reveal functional changes [3]. This rate-constancy prediction is tested here by studying the relative rates of protein evolution in a simple phylogenetic context. The test relies on the simultaneous analysis of numerous protein sequences that represent complete protein sets from sequenced genomes or partial protein sets from genomes of which a significant fraction has been sequenced. Consistent with the rate-constancy prediction, given a phylogenetic tree topology, the ratio of evolutionary distances along different branches of the tree should be approximately constant for all proteins from an analyzed set of species.

Clusters of orthologous proteins from bacterial, archaeal and eukaryotic species sets of size three were analyzed with respect to this prediction. Specifically, for each protein phylogeny, the length of the longest branch was compared to the sum of the two shorter branches (Figure 1). Such a genome-scale test allowed for a robust assessment of the rate constancy of protein evolution, as well as the detection of multiple cases of accelerated evolutionary change and a quantitative assessment of the fraction of genes that undergo such acceleration.

Figure 1
figure 1

Schematic of an orthologous protein phylogeny and the null hypothesis (Ho) of a constant relative rate of protein evolution. Comparison of multiple orthologous proteins is predicted to reveal a constant relative rate of evolution. This should be manifest as an approximately constant ratio of phylogenetic branch lengths (bold lines) bA/(bB + bC). The branch lengths were calculated using the evolutionary distances (dashed lines) between the proteins from three species (dAB, dAC, dBC) as described in Materials and methods.

Results and discussion

Genomic-scale relative rates test

Three genomes of intracellular pathogenic bacteria of the family Chlamydiaceae (Chlamydophila pneumoniae, Chlamydia trachomatis and Chlamydia muridarum) and three genomes of archaeal hyperthermophiles of the genus Pyrococcus (Pyrococcus abyssi, P. horikoshii and P. furiosus) were chosen for analysis. Orthologs were identified using the rationale and procedure that underlie the construction of clusters of orthologous groups of proteins (COGs) [4,5]. This approach has been designed to maximize the likelihood of identification of orthologs when complete genomes are compared. A modified approach involving two additional steps was used to identify probable orthologs among the encoded proteins from three partially completed mammalian genomes (Homo sapiens (human), Mus musculus (mouse) and Rattus norvegicus (rat).

These procedures resulted in a total of 703 Chlamydiaceae, 1,305 Pyrococcus and 2,112 mammalian orthologous protein sets. Each orthologous protein-sequence set was aligned and the evolutionary distances between sequences within the sets were computed. These distances were used in phylogenetic reconstructions for all orthologous protein sets. For each species group, branch lengths were averaged to produce the trees shown (Figure 2). Each of the trees contains two relatively short branches of virtually identical length and one long branch, forming two main phylogenetic partitions. The rate-constancy prediction holds that the evolutionary rates along these phylogenetic branches should be correlated. In other words, the length of branch A (bA) should be correlated with the length of branches B and C (bB + bC) (Figures 1,2) resulting, in an ideal case, in a constant bA/(bB+bC) ratio.

Figure 2
figure 2

Phylogenies of the three analyzed species groups. Branch lengths are the average of all branch lengths for a given species group. The different branches bA, bB and bC are indicated.

This prediction was tested by evaluating the branch lengths estimated from alignments of orthologous protein sets of the three species groups. Consistent with the prediction, there is a highly significant positive correlation between bA and bB + bC for all three species groups (Figure 3). Thus, the vast majority of proteins in each analyzed group of species seem to evolve in accord with the rate-constancy prediction, suggesting that the functional constraints among orthologous proteins remain largely unchanged. However, in each group of species there are outliers (Figure 3) that do not conform to the expected pattern of variation. For each of the three groups of orthologous protein sets, upper and lower limits of the distance from the linear trend line that correspond to an expectation value of 0.05 were determined. All points above and below these lines (Figure 3) represent statistically significant deviants from the linear trend, given the standard error and the number of orthologous protein sets analyzed for each group of species. Points above the upper limit line show more variation between the phylogenetic partitions (that is, longer bA) than expected by chance, and points below the lower limit line show more variation within one partition (that is, longer bB and/or bC) than expected. The majority of the deviants fall above the upper limit. These points represent proteins with an accelerated rate of amino-acid substitution between phylogenetic partitions. The less frequent outlier points below the lower line represent proteins that have undergone accelerated evolution in one or both of the two more closely related species in the corresponding group. All of the orthologous protein sets that showed significant deviations from the linear trend were re-analyzed using BLAST similarity searches followed by phylogenetic analysis. This was done to verify that these sets contain orthologous proteins. For orthologous proteins, it is expected that they will form monophyletic clades without any intervening paralogs or proteins from distantly related species. For all Chlamydiaceae and mammalian data, the phylogenetic analysis revealed that these sets do in fact consist of orthologous proteins. However, the Pyrococcus analysis yielded five cases where orthology could not be ascertained. These ambiguous Pyrococcus sets were removed and the linear relationship between bA and bB +bC was re-assessed. The total number of orthologous protein sets that show evidence of accelerated evolution suggestive of functional diversification is between 0.76% and 1% of the total number of such sets analyzed for each species (Chlamydiaceae = 7/703, Pyrococcus = 11/1305, human-mouse-rat= 16/2112).

Figure 3
figure 3

Correlation between the branch lengths in different phylogenetic partitions. Linear regression where the bA branch length (y-axis) is plotted against the sum of bB + bC branch lengths (x-axis) for all orthologous protein sets of each species group. (a) Chlamydiaceae species group including C. pneumoniae, C. muridarum and C. trachomatis. (b) Pyrococcus species group including P. furiosus, P. abyssi and P. horikoshii. (c) Human-mouse-rat species group including H. sapiens, M. musculus and R. norvegicus. The equation for the linear regression trend line (y = mx + b), the correlation coefficient (r) and the level of significance for the correlation (P) are shown on each plot. The linear regression trend line is shown in bold black and the upper and lower limits, corresponding to an expectation value of 0.05, are shown in light gray. For each plot, only 0.05 points are expected to fall outside of these limits by chance. All values are shown as diamonds and the values outside the upper and lower limits that represent functionally diversified orthologous protein sets are indicated by larger squares.

Nucleotide-substitution rates

A number of cases of positive selection at the molecular level have been supported by comparison of the rates of synonymous (Ks) and non-synonymous (Ka) substitution [3]. Ka > Ks is considered unequivocal evidence of positive selection [2]. With this in mind, comparisons of Ks and Ka were carried out for the orthologous protein sets that showed a pattern of variation consistent with functional diversification in the whole-genome test (Figure 3). For the bacterial and archaeal species groups, the vast majority of these comparisons were inconclusive because the evolutionary distance between phylogenetic partitions resulted in a saturation of Ks. The more closely related human-mouse-rat group did not show saturation of Ks. Comparisons of Ks and Ka for functionally diversified mammalian protein sets did not reveal any cases with Ka > Ks. This is probably due to the fact that, if directional selection led to the acceleration of the amino-acid substitution rate in these proteins, it acted in an episodic manner [6,7] and on a minority of residues against a constant background of purifying selection [8,9]. This is consistent with previous work suggesting that adaptive changes are difficult to find using the strict Ka > Kscriterion and that even documented cases of positive selection can be missed using this technique alone [6,10,11,12,13]. The average Ka is, however, significantly higher for the putative functionally diversified proteins than for the total dataset (Table 1, Figure 4). The mammalian proteins with an increased rate of amino-acid substitution between the human and rodent lineages also show a level of Ks significantly higher than that of the total dataset (Table 1). This can be explained by the fact that, consistent with previous reports [10,14,15,16,17], we observe a positive correlation between Ks and Ka (r = 0.556, P < 10-10). This correlation could reflect a mechanistic bias in mutation [10,18,19] or synonymous sites may be subject to some degree of selection [20,21,22,23] (or both). However, the Ka/Ks ratio is also significantly higher for the sets of putative functionally diversified proteins (Table 1), indicating a relative acceleration of amino-acid substitution, which could be due to a relaxation of functional constraints and/or directional selection.

Figure 4
figure 4

Non-synonymous (Ka) versus synonymous (Ks) substitution rates for the human-mouse-rat orthologous protein sets. (a) Average Ks and Ka for the human-mouse and human-rat pairwise comparisons. (b) Ks and Ka for the mouse-rat pairwise comparisons. Thick diagonal line, Ks = Ka; thin horizontal line, average Ka. All values are shown with circles and the values corresponding to the functionally diversified proteins are indicated by larger squares.

Table 1 Synonymous (Ks) and non-synonymous (Ka) substitution rates for the human-mouse-rat orthologous protein sets

Functional prediction and phyletic distribution

There are some notable similarities in the (predicted) functional identities and the phyletic distributions of the putative functionally diversified proteins (Table 2). The sets from all three species groups are enriched for proteins that appear to function at the periphery of the cells and mediate their interaction with the environment. Specifically, 5 of the 7 cases in Chlamydiaceae, 7 of the 11 Pyrococcus cases and 13 of the 16 mammalian cases are (predicted) membrane or secreted proteins (Table 2). Some of the remaining proteins are completely uncharacterized and could associate with membranes by other mechanisms that do not involve recognizable structural features such as signal peptides or transmembrane segments. The mammalian set of functionally diversified proteins, for which much more experimental information is available than for the bacterial and archaeal sets, is particularly enriched for secreted proteins, including several proteins involved in immune surveillance and defense (Table 2). An example of a protein that is implicated in a different class of functions, but is still related to defense mechanisms is the predicted restriction endonuclease from Pyrococcus. Generally, the functions of proteins for which accelerated evolution was detected and functional diversification inferred are compatible with the previously reported cases of accelerated evolution which is particularly prominent, for example, in the immune system [3].

Table 2 Domain architecture and functional predictions for functionally diversified proteins

The phyletic distribution of the functionally diversified proteins is typically narrow, particularly in Chlamydiaceae (six of the seven proteins have no homologs outside this bacterial family). Although the selected Pyrococcus proteins showed more evolutionary conservation, the majority had orthologs only in other archaeal genomes, if not only in Pyrococcus. Similarly, 13 of the 16 selected mammalian proteins were either mammal-specific or vertebrate-specific (Table 2). The only two clear exceptions to this pattern of poorly conserved proteins primarily involved in defense and extracellular signaling were the mammalian enzymes hydroxysteroid sulfotransferase and hydroxy acid oxidase. Accelerated evolution in proteins with a narrow phyletic distribution may represent species-specific adaptations in how each organism responds to and processes unique environmental cues.

Conclusions

The present results show that the mode of evolution of the vast majority of the analyzed proteins from all three domains of life - bacteria, archaea and eukaryotes is compatible with the rate-constancy prediction of molecular evolution. A recent analysis of the genome-wide distribution of the evolutionary rates of prokaryotic genes suggested that the shape of this distribution does not significantly change during evolution, although the rates themselves may change [24]. This is consistent with the present finding of a strong correlation between evolutionary rates in different phylogenetic partitions. However, with remarkable consistency, for all three species groups analyzed here, approximately 1% of orthologous genes seem to have experienced accelerated evolution, suggesting functional and potentially adaptive diversification that is conspicuous enough to be detected at a statistically significant level.

It is also a formal possibility that some of the genes identified as functionally diversified may be losing their function. As this would entail a change in function, it can be considered as a special type of functional diversification. However, two points argue against this possibility. First, the Ks/Ka values for mammalian genes are all < 1 (including the putative diversified genes) and thus do not show evidence of loss of function or of pseudogenes. Second, amino-acid sequence alignments of the diversified proteins show that these proteins still retain the identity of critical domains, such as signal peptides and transmembrane domains, as well as critical residues in the active sites of enzymes (Table 2 and data not shown).

Accelerated evolution and functional diversification between paralogs is thought to be common in evolution and is posited as a major mechanism for the generation of new genes and new biochemical functions [25]. Functional diversification between orthologs is considered to be much less common, however. In fact, directional selection leading to functional diversification between orthologs is the least common form of natural selection observed at the molecular level [3]. In the present study, the implementation of a simple test designed specifically to exploit the availability of numerous complete and partially complete genome sequences facilitated the detection of multiple cases of probable functional diversification between orthologs. Application of this method to the growing set of complete genomes is expected to yield similar results and may reveal numerous additional cases of functionally diversified orthologs.

Materials and methods

Sequence analysis

Three closely related Chlamydiaceae genomes and three closely related Pyrococcus genomes were chosen for analysis. These genomes were chosen on the basis of three criteria. First, there were three or more closely related complete genome sequences available; second, the genomes were closely enough related to avoid saturation of amino-acid changes and to ensure unambiguous identification of orthologs; and third, the genomes were distantly enough related to provide meaningful information on amino-acid variation. Predicted proteomes for three Chlamydiaceae genomes (Chlamydophila pneumoniae - CWL029, Chlamydia muridarum and Chlamydia trachomatis) and two Pyrococcus genomes (P. abyssi and P. horikoshii) were obtained from the National Center for Biotechnology Information (NCBI) public server [26]. The predicted proteome for the P. furiosus genome was obtained from the sequencer's site [27]. The method for identifying probable orthologs is based on the genome-specific best hits (BeTs) rationale that underlies the construction of the COGs database [4,5]. For each group of genomes, all-against-all BLAST [28] comparisons were done using the predicted protein sequences. For each query sequence, the BeT (highest score or lowest expectation value) to each of the other proteomes was recorded. Triangles of mutually consistent symmetrical BeTs were detected to form COGs. Only COGs with one representative (ortholog) from each genome (COGs of size 3) were considered for further analysis. This sequence-similarity-based approach is generally not dependent on constant rates of evolution and thus is not misled by rate variation. However, it is a formal possibility that misidentification of orthologs could result from cases of extreme rate variations that would make the orthologs less similar to each other than to some paralogs. This possibility was examined using phylogenetic analysis, as described below.

Messenger RNA sequences from R. norvegicus, Mus musculus and Homo sapiens were obtained using NCBI's batch Entrez server [29]. The encoded amino-acid sequences were clustered at 99% similarity threshold using BLASTCLUST [30] to remove redundant sequences. The BeT procedure was used to identify perfectly symmetrical COGs of size 3 for this species group. Because these mammalian genomes are not complete, a formal possibility exists that the COGs procedure will isolate non-orthologous protein sets. For this reason, two additional steps were added to the ortholog identification procedure. Subsequent to the construction of human-mouse-rat COGs and phylogenetic trees, the 'triplet test' [17] was used to further ensure that the COGs of size 3 contained only orthologs. The triplet test verifies that the mouse and rat members of the COGs are approximately equidistant from the human member. In addition to the triplet test, the distributions of Ks between human and rodents and Ks between mouse and rat was used to eliminate potentially non-orthologous sequence sets. In both cases, the distribution of Ks was approximately normal, with the exception of long and sparsely populated tails. Sequence sets that fell in the tails of the Ks distribution (Ks > 0.35 for mouse-rat and Ks > 1.0 for human-rodent) were removed before analysis to eliminate potentially non-orthologous protein sets. This manual selection process does not appear to bias the results in any way (for example, by eliminating a certain functional class of genes). Omitting this step resulted in the inclusion of a number of distantly related and poorly aligned sequence sets that may not actually include orthologs. Thus, the relative rates test as implemented for the mammalian dataset is a conservative one. Furthermore, the most rapidly evolving genes will not be included in the analysis as many of them will not have detectable orthologs even in closely related genomes.

Orthologous protein sequence sets were aligned using CLUSTALW [31] with default options. Alignments that resulted in any branch lengths (based on the corrected gamma distance, see below) longer than 2 in the case of the Chlamydiaceae and Pyrococcus species groups or longer than 1 in the case of the human-mouse-rat species group were visually inspected and adjusted manually. Manual adjustment included removal of poorly aligning and/or low-sequence-complexity regions identified by BLAST. These adjustments did not result in any qualitative differences in the results of subsequent analyses.

For each orthologous protein set of size 3, evolutionary distances between the sequences were calculated using 3BRANCH program (Y.I.W., unpublished data, available from the author on request) that uses the distance correction based on the gamma distribution of site rate variation [32,33] with an α parameter of 1.0. Phylogenetic branch lengths were calculated from distances by solving the following set of equations (Figure 1):

distance from species A to species B (dAB) = branch length A (bA) + bB,

dAC = bA + bC,

dBC = bB + bC.

For example, solving for bA yields bA = (dAB + dAC - dBC)/2.

For each orthologous gene set, pairwise estimates of the synonymous (Ks) and non-synonymous (Ka) substitution rates were obtained from nucleotide-coding sequence alignments using the Pamilo-Bianchi-Li method [34,35] implemented in the PBLtest program (I.B.R., unpublished data, available from the author on request). Pairwise estimates were averaged to obtain the data presented in Table 1.

Domain architectures of the functionally diversified proteins were assessed by using the SMART server [36,37] and the NCBI's CD-Search server [28,38]. Domain architectures together with the results of PSI-BLAST [28] searches that were run for three iterations, Entrez protein annotation [39] and protein annotations in the COG database [5,40], were used to assign (probable) functions to the selected proteins. The phyletic distribution of these proteins was determined from the PSI-BLAST search results using the TAX_COLLECTOR program of the SEALS package [41]. Additional case-by-case inspection of these results was conducted and neighbor-joining phylogenetic trees [42,43] were constructed to verify the orthologous relationships between the selected proteins. This was done by ensuring that putative orthologous proteins did not show a greater similarity to any paralogs or proteins from more distant species than they did to the proteins included in the analyzed COGs. As a result of this inspection, five Pyrococcus COGs were removed from the original set of selected proteins.

Statistical analysis

For each species set, the standard error of the predicted y-value for each x-value in the regression (Figure 3) was determined. The number of standard errors away from the linear trend line that corresponds to an expectation value of 0.05 was calculated using the inverse of the standard normal cumulative distribution for a probability equal to 1 - (0.05/number of values in plot/2).