Background

The growth of gene and genomic databases provides motivation for developing tools to extract information about the function of a protein from sequence data, with the ultimate goal of understanding the collection of functions represented in an organism's genome. Work on molecular evolution over 30 years has shown that such questions must be phrased carefully, and always with cognizance of the Darwinian paradigm that insists that the only way of obtaining functional behavior in living systems is through natural selection superimposed on random variation in structure [1]. A behavior is functional if the organism would be less able to survive and reproduce if that behavior were different. An amino-acid residue is functional if, upon mutation, the organism is less able to survive and reproduce.

A long literature has sought to interpret the evolutionary behavior of protein sequences, in the hope of drawing inferences about the relationship between fitness and sequence [2]. What has emerged is the recognition that a family of orthologous proteins displays a diversity of structure and a corresponding diversity in behavior, where some of the behavioral differences have a strong impact on fitness (are functional), and others are neutral (or nearly so). Without resolving, in a general way, questions regarding the relationship (neutrality versus selection) between fitness and protein sequence, we can build interpretive tools that capture information from patterns of evolution of genomic sequences that is informative about function - in particular, events that are characterized by the biological scientist as a change in function.

For a protein to change its function, it must change its behavior; this in turn requires that it change its amino-acid sequence. A protein being recruited for a different function over a very short time (geologically speaking) frequently experiences an episode of rapid sequence evolution, an episode where the number of amino-acid substitutions per unit time is large. Therefore, molecular evolutionists have long been interested in the rates at which substitutions accumulate in protein sequences. These rates are known to vary widely in different protein families.

Calculating rates in the units of substitutions/time requires knowledge of the geological dates of divergence of protein sequences. Because geological times are frequently not known (and almost never known precisely), alternative approaches for identifying episodes of rapid sequence evolution have been sought. One of these examines nucleotide substitutions. It divides the number of nucleotide substitutions that change the sequence of the encoded protein (nonsynonymous substitution) by the number of nucleotide substitutions that do not change the sequence of the encoded protein (synonymous substitution), and then normalizes these for the number of nonsynonymous and synonymous sites. This is the KA/KS rate ratio [3,4,5]. High KA/KS rate ratios for reconstructed ancestral episodes of sequence evolution are hypothesized to be signatures of positive adaptation, and have been associated with significant change in function [6,7].
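
The ratio described above can be sketched in a few lines. This is a minimal illustration only, assuming the substitution counts and the numbers of sites for a branch have already been estimated. The function name `ka_ks_ratio` and its parameters are hypothetical, and the sketch omits the treatment of degeneracy classes and multiple hits used by the published methods [3,4,5].

```python
def ka_ks_ratio(nonsyn_subs, syn_subs, nonsyn_sites, syn_sites):
    """Uncorrected KA/KS for one branch (illustrative sketch, not the published method).

    nonsyn_subs / syn_subs   : substitutions counted on the branch
    nonsyn_sites / syn_sites : numbers of sites used for normalization
    """
    if nonsyn_sites <= 0 or syn_sites <= 0:
        raise ValueError("site counts must be positive")
    ka = nonsyn_subs / nonsyn_sites  # nonsynonymous substitutions per nonsynonymous site
    ks = syn_subs / syn_sites        # synonymous substitutions per synonymous site
    if ks == 0:
        # No synonymous change observed: the ratio is undefined or unbounded.
        return float("inf") if ka > 0 else float("nan")
    return ka / ks


# Made-up counts for a conservative branch; the ratio is roughly 0.2.
example = ka_ks_ratio(nonsyn_subs=6, syn_subs=10, nonsyn_sites=300, syn_sites=100)
```

Normalizing by sites matters because a typical coding sequence has several times more nonsynonymous than synonymous sites, so raw substitution counts alone would overstate KA relative to KS.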

In general, KA/KS values are low. For example, the average KA/KS value in proteins between rodents and primates is 0.2 [8]. This is taken to indicate that most of these proteins, selected over millions of years, attained an optimum function prior to the divergence of rodents and primates. This implies that subsequent evolution was conservative; most nonsynonymous mutations were detrimental to the fitness of the organism.

Functional change can be defined as mutation that alters organismal fitness and is subject to selective pressure. As an example of intraspecific variation, phosphoglucose isomerase in montane beetles shows adaptation to local temperature variations [9]. Orthologous proteins also undergo positive selection. For example, the hemoglobin in the bar-headed goose has undergone adaptive change relative to the hemoglobin from the closely related greylag goose in response to a reduced partial pressure of oxygen at high altitudes [10]. Adaptive evolution is also believed to be displayed by paralogous mammalian MHC class I genes and to relate to a birth-and-death model of gene duplication [11].

Traditionally, positive selection is defined by a KA/KS rate ratio significantly greater than unity. While 0.6 < KA/KS < 1 can occur by relaxation of functional constraint, the theoretical cut-off of 1 is well known to miss significant functional changes in proteins for several reasons [12]. Long branches can dilute an episode of positive adaptation (with KA/KS > 1) with episodes of conservative evolution. KA/KS values can miss positive selective pressures on individual amino acids because they average events over the entire protein sequence. Behavior in a protein can change significantly if only a few amino acids change while the remainder of the sequence is conserved in order to retain core behaviors of the old and new functions (for example, the protein fold). These adaptive events will only be detected on sufficiently short branches which pinpoint the adaptive change.

Alternative ways of identifying KA/KS values below unity that are suggestive of adaptive evolution involve comparison of these values for an individual branch of a tree with those values for branches in the tree generally. If one branch has a KA/KS value far outside of the norm for the family (but still below 1), we can guess that this branch represents an episode of positive selection. This will work for gene families that generally display conservative evolution (such as the SH2 (Src homology 2) domains) [13], but not for others. For example, many immune-system genes show a much more continuous distribution of values, which may indicate that they are perpetually under different amounts of positive selective pressure [11]. In this case, the designation of a cut-off value of KA/KS, below which two homologous genes have the same function, and above which they have different functions, is arbitrary. Ultimately, this level should be determined by benchmarking adaptivity with specific functions and specific protein folds.
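
One way to make "far outside of the norm for the family" concrete is sketched below. The text does not specify a statistic, so the use of the median and the median absolute deviation (MAD), the function name `flag_outlier_branches`, and the threshold `n_mads` are all assumptions made for illustration.

```python
from statistics import median


def flag_outlier_branches(branch_ratios, n_mads=5.0):
    """Return names of branches whose KA/KS lies far above the family norm.

    branch_ratios : dict mapping a branch name to its KA/KS value
    n_mads        : hypothetical threshold, in units of the family's MAD
    """
    values = list(branch_ratios.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # All branches look alike; no basis for calling any of them unusual.
        return []
    return sorted(
        name for name, v in branch_ratios.items() if v - center > n_mads * mad
    )


# Made-up family that is conservative except for one branch (still below 1).
family = {"b1": 0.05, "b2": 0.08, "b3": 0.06, "b4": 0.07, "b5": 0.55}
flagged = flag_outlier_branches(family)
```

A rule of this kind only makes sense for families with a tight distribution of values. For families with a continuous spread of KA/KS values, such as the immune-system genes mentioned above, any threshold is arbitrary.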

KA/KS rate ratios are well known to be useful starting points for generating stories about the interaction between protein sequences and the Darwinian processes that shape these sequences. These stories help us understand how these sequences contribute to the fitness of the host. A comprehensive database of examples where KA/KS values are high would therefore be useful to biologists. Most useful would be a database that presents families where KA/KS is greater than 1 and, separately, families where KA/KS is greater than some arbitrary cut-off that is less than 1 but still high relative to the average value for the average protein.

We report here such a database, The Adaptive Evolution Database (TAED). TAED is designed to provide, in raw form, evolutionary episodes in specific chordate and embryophyte (flowering plants, conifers, ferns, mosses and liverworts) protein families that might be candidates for adaptive evolution. TAED contains a collection of protein families where at least one branch in the reconstructed molecular record has a KA/KS value greater than unity, or greater than 0.6. The second cut-off is arbitrary, chosen to be high relative to the average KA/KS value for the average episode of evolution in a protein family. Empirically, the lower cut-off seems to admit additional examples of gene families that might have undergone adaptive evolution.

TAED should be used as a raw list of potentially adaptively evolving genes for experimentalists seeking gene families to study in further detail, and for bioinformaticists interested in studying large datasets of examples of genes with high KA/KS rate ratios.

Results

The Master Catalog [14] is a database of 26,843 families of protein modules generated from an all-against-all search of GenBank release 113. A protein is broken into independently evolving modules on the criterion of the presence of a subsection of a gene as a complete open reading frame in another species. Pairs that were within 180 PAM (point accepted mutation) units with a minimum length requirement were grouped into the same family. Each family contains an evolutionary tree and a multiple sequence alignment. This database was the starting point for the exhaustive calculation of KA/KS rate ratios.
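
The grouping step can be illustrated with a small union-find sketch. It assumes single-linkage grouping, in which any qualifying pair joins two groups; the text does not state that the Master Catalog works this way. The function name, the input layout, and the `min_length` default are hypothetical, and the source does not give the actual minimum length.

```python
def group_into_families(pairs, max_pam=180.0, min_length=50):
    """Group modules into families by single linkage (illustrative assumption).

    pairs      : iterable of (id_a, id_b, pam_distance, aligned_length)
    max_pam    : pairs at or below this PAM distance may be joined
    min_length : hypothetical minimum aligned length (not given in the text)
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b, pam, aligned_len in pairs:
        root_a, root_b = find(a), find(b)  # also registers both modules
        if pam <= max_pam and aligned_len >= min_length and root_a != root_b:
            parent[root_a] = root_b

    families = {}
    for member in list(parent):
        families.setdefault(find(member), set()).add(member)
    return sorted(sorted(f) for f in families.values())


# Made-up pairs: C-D is too divergent and D-E is too short to be joined.
pairs = [("A", "B", 40, 120), ("B", "C", 150, 90),
         ("C", "D", 220, 200), ("D", "E", 30, 20)]
families = group_into_families(pairs)
```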

The Master Catalog is different, both in concept and execution, from other resources (for example, Hovergen [15], Pfam [16], and COGs) that offer databases of protein families. The Master Catalog incorporates reconstructed ancestral states within its data structure, in addition to multiple sequence alignments (MSAs) and evolutionary trees. Having these reconstructed ancestral states provides a value to the database, especially for functional interpretation, that is not offered by databases that contain only trees, or only multiple sequence alignments, or only trees and multiple sequence alignments. Further, because the Master Catalog is explicitly developed as a tool for doing functional genomics relying on reconstructed intermediates, and as the information about function is extracted from analysis of patterns of variation and conservation in genes and proteins within a family, it emphasizes the generation of high-quality trees, MSAs, and reconstructed ancestral states. For this reason, the Master Catalog does not attempt to build superfamilies (as Pfam does). Instead, it constructs nuclear families, where the trees, MSAs, and ancestral states are not compromised by poor gap placement, a common problem in Clustal-based multiple sequence alignments of sets of highly divergent protein sequences.

Alternative methods were considered for reconstructing ancestral sequences. Whereas maximum likelihood methodologies perform better in some situations, they are too computationally intensive to apply exhaustively. Further, they are based upon an explicit model of evolution that may not be appropriate along all branches analyzed, a situation where maximum parsimony may outperform maximum likelihood on some branches [17]. Therefore, to generate the initial version of this database, more computationally simple methods were used. As improved methodologies are developed, these will undoubtedly be applied to recalculate this database.

Two issues concerned the scope of the KA/KS analysis. First, we were concerned that silent positions would be 'saturated' with substitutions, rendering the KS measurements meaningless. Whereas reconstruction back to the last common ancestor of chordates or embryophytes with no intermediates frequently bears the signature of synonymous position equilibration, synonymous position saturation can be avoided if individual branches are shorter than the period required for saturation to occur (t1/2 to saturation of approximately 120 million years). Saturation was measured through the examination of the extent to which twofold redundant codon systems had reached equilibration [18]. Branches that showed equilibration greater than five half-lives towards saturation were excluded from TAED on the basis of differences between reconstructed ancestral sequences at the beginning of branches and sequences at the end.
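
The relationship between the extent of equilibration and the number of half-lives can be sketched as follows. This assumes a simple first-order approach to equilibrium, so that the fraction equilibrated after n half-lives is 1 - 2^(-n). That model and the function names are assumptions for illustration; the actual measurement from twofold redundant codon systems [18] is not reproduced here.

```python
import math

HALF_LIFE_MY = 120.0  # approximate t1/2 to saturation quoted in the text (million years)


def half_lives_elapsed(fraction_equilibrated):
    """Half-lives needed to reach a given fraction of equilibration.

    Assumes first-order kinetics: fraction = 1 - 2 ** (-n).
    """
    if not 0.0 <= fraction_equilibrated < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return -math.log2(1.0 - fraction_equilibrated)


def is_saturated(fraction_equilibrated, max_half_lives=5.0):
    """True if the branch exceeds the five-half-life exclusion threshold."""
    return half_lives_elapsed(fraction_equilibrated) > max_half_lives


# Under this model, five half-lives correspond to 1 - 1/32 of full equilibration.
threshold_fraction = 1.0 - 2.0 ** -5
approx_age_my = half_lives_elapsed(0.5) * HALF_LIFE_MY  # one half-life, about 120 My
```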

A second significant problem is that of short branches bearing fractional mutations. These are known to generate KA/KS values with large errors. To prevent these errors from biasing the database, a new simple robustness test was implemented to ensure that an 'interesting' KA/KS value (one above the cut-off) was not recorded in the database if it became 'uninteresting' (below the cut-off) through the shift of a single mutation reconstructed in the branch. The test modified the KA/KS calculation in a simple way, as described below:

modified KA/KS = KAmod/KSmod

where

KAmod = (number of nonsynonymous substitutions - 1)/total nonsynonymous sites

KSmod = (number of synonymous substitutions + 1)/total synonymous sites

In general, the smaller the difference between KA/KS and KAmod/KSmod, the more significant or robust the branch. To exclude short branches with fractional mutations (arising through ambiguous ancestral sequence reconstruction) without excluding other short branches, branches with KAmod/KSmod values below 0.5 were excluded from TAED.
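
The robustness test can be written out directly from the definitions above. The sketch below follows those formulas and the 0.5 exclusion threshold. The function names and example counts are illustrative, and treating a value of exactly 0.5 as passing is an assumption, since the text only says that values below 0.5 were excluded.

```python
def modified_ka_ks(nonsyn_subs, syn_subs, nonsyn_sites, syn_sites):
    """KAmod/KSmod: shift one substitution from nonsynonymous to synonymous."""
    ka_mod = (nonsyn_subs - 1) / nonsyn_sites
    ks_mod = (syn_subs + 1) / syn_sites
    return ka_mod / ks_mod


def passes_robustness(nonsyn_subs, syn_subs, nonsyn_sites, syn_sites,
                      min_modified=0.5):
    """True if the branch survives the single-mutation shift."""
    ratio = modified_ka_ks(nonsyn_subs, syn_subs, nonsyn_sites, syn_sites)
    return ratio >= min_modified


# A well-supported branch: many substitutions, so one shift changes little.
robust = passes_robustness(12, 2, 300, 100)

# A short branch with fractional counts: the unmodified ratio is high (2.5),
# but it collapses once a single mutation is shifted.
fragile = passes_robustness(1.5, 0.2, 300, 100)
```

Because the synonymous count is always incremented by one, KSmod is never zero, so the modified ratio stays finite even on branches with no reconstructed synonymous substitutions.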

Of 5,305 families of modules containing chordate proteins, 280 contained at least one branch with a KA/KS value greater than 1, representing 643 branches emanating from 63 different nodes of the tree of life. Some 778 families had at least one branch with a KA/KS value greater than 0.6, totaling 2,232 branches emanating from 92 nodes of the tree of life. Thus 15% of all families of chordate modules are likely to have modified their function at least once during the course of evolution.

Of 3,385 families of modules representing embryophyte proteins, 123 have at least one branch with a KA/KS value greater than 1, representing 228 branches emanating from 25 nodes. Some 407 families had at least one branch with a KA/KS value greater than 0.6, totaling 1,105 branches from 43 nodes. Here, perhaps 12% of all embryophyte families have modified their function along at least one branch.
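
The quoted percentages follow directly from the family counts given above. The short check below uses only those counts; the helper name `percent` is illustrative.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole


# Families with at least one branch above the 0.6 cut-off.
chordate_pct = percent(778, 5305)      # about 14.7, reported as 15%
embryophyte_pct = percent(407, 3385)   # about 12.0, reported as 12%

# Families with at least one branch above the cut-off of 1.
chordate_strict_pct = percent(280, 5305)
embryophyte_strict_pct = percent(123, 3385)
```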

This result based on ancestral sequence reconstruction contrasts greatly with the result of Endo, Ikeo and Gojobori, where the search for gene families undergoing adaptive evolution yielded only two families [19]. They compared extant sequences rather than reconstructed evolutionary intermediates, counted families only where a majority of the pairs were at high KA/KS values, and used a smaller database.

A list of candidate protein module families that have undergone modification of function is available at [20]. The version described here is designated TAED 2.1 and will remain available at this site. As more sophisticated methods are developed and applied, as correlations with functional and structural databases are pursued, and as data from other types of evolution beyond coding sequence evolution is added, links to these datasets will be provided. TAED 2.1 contains two image-mapped trees (for chordates and embryophytes), where the node that an adaptive branch emanates from can be clicked on to obtain a list and Master Catalog reference number. Multiple sequence alignments and phylogenetic trees corresponding to these entries can be obtained from EraGen Biosciences [21].

Discussion

This study represents the first comprehensive analysis of KA/KS rate ratios throughout the Chordata and the Embryophyta. Although the methods utilized were rough and designed to give a quick snapshot of a global picture of evolution, TAED, as a raw resource, should be valuable in the analysis of much of chordate evolution. Functional genomics analyses of many of the protein families that have undergone recruitment and functional change within the past 500 million years will soon emerge. Many of the episodes of functional change recorded in TAED can be correlated with events in the geological or paleontological record, in response to changing environments, evolving paleoecology or the development of new physiology.

Gene families may display evolutionary episodes with high KA/KS values, and therefore appear within TAED, for several possible reasons. For example, branches resulting from gene duplication events that give rise to paralogs with very different behaviors will presumably have high KA/KS values, as will orthologous pairs from species that place very different demands on their function. This search was done without distinguishing paralogs from orthologs, and the user of TAED should be careful in the analysis of specific families in recognition of this fact.

Because there is no reliable true set of protein families 'known' to have suffered functional adaptation, it is not possible to score the results of this tool. It is important to remember that a Darwinian definition of function differs from the functional annotation of genomes, and it is possible for a protein to alter or change its function while retaining the same annotation. To examine this dataset, specific proteins must be examined individually.

Individual examination is likely to be productive, however. Many protein families already believed to be candidates for functional recruitment appear on the list. These include plasminogen activator in vampire bats, which is expressed in saliva and involved in blood clotting [22]; phospholipase A2 in snakes, which is expressed in venom and involved in tissue damage [23]; and MHC proteins in mammals, which are involved in the immune system as part of the host-parasite arms race [24]. Each has an obvious explanation for why it may have undergone functional change. Several families are newly identified as being candidates for functional change, such as the previously proposed obesity protein leptin in primates.

A third category of discovery in TAED is in the detection of episodes of adaptive change at new points in the divergent evolution of proteins, for example myostatin in the Bovidae [25]. Table 1 is a sample table from TAED representing bovids. These are the candidate genes that were identified as showing rapid sequence evolution emanating from this node in the tree of life. They potentially include orthologs between two species of bovids, paralogs, alternatively spliced transcripts and intraspecific evolution. The genes on the list have roles in the immune system, body musculature and reproduction, traits frequently under selective pressure. These examples and many others are candidates for further experimental study through cloning from additional species and through functional study in laboratories expert in the particular protein.

Table 1 A sample listing from TAED indicating candidate adaptively evolving genes detected that emanated from the Bovidae node

Conclusions

From a phylogenetic perspective, the knowledge of candidate genes evolving at the same time in the same organism can allow one to begin to ask if entire pathways or phenotypic functions are under selective pressure at particular points in evolutionary history. Where tertiary structures for the proteins exist, mutations along branches can be mapped onto three-dimensional structures first to evaluate the validity of specific examples, and second, to understand the nature of adaptive evolution at a structural level.

One analysis of TAED indicates that among branches with KA/KS rate ratios > 1, only 3% of synonymous sites had mutated compared with 10% on the average branch in the database. This is consistent with the notion that episodes of adaptive evolution can be lost in long branches, as these are combined with prior and/or subsequent episodes characterized by lower KA/KS rate ratios characteristic of functional constancy. As more genes are sequenced from more species, the greater articulation of trees will not only increase the accuracy of sequence reconstructions, but will also allow us to detect new examples of functional change that are buried in long branches.

At a biological level, the dataset generated here can be mined to provide global pictures of how evolution has occurred. Correlation of data in this database with that in other functional databases will enable a leap from genotype to organismal phenotype. Further, the dataset provides a resource for experimentalists interested in specific genes. The high KA/KS rate ratio in leptin in a branch connecting primates with rodents may have been a useful predictor of change of function for pharmaceutical companies interested in the mouse model of leptin for human obesity. For the experimentalist, mutations occurring along putatively adaptive branches can be assayed for functional importance in systems of interest.

Finally, this database represents a growing framework for the study of adaptive evolution. As datasets become available, changes in gene expression, alternative splicing patterns, imprinting patterns, recombination events and other molecular mechanisms of adaptation will be added to this database in a phylogenetic perspective. The ultimate goal is a dynamic resource depicting candidate molecular events that are responsible for phenotypic differences between closely related species.

Materials and methods

Starting with the Master Catalog [14] (version 1.1 derived from GenBank release 113; free of charge for academic users through [21]), KA/KS rate ratios were reconstructed database-wide for each ancestral branch in every evolutionary tree containing genes from the Chordata and the Embryophyta. This analysis was restricted to these organisms because there is less evidence in them for codon and GC-content biases, which complicate the accurate calculation of KS. The Master Catalog uses multiple sequence alignments generated from Clustal W and neighbor-joining trees, both derived from protein sequences. Because the Master Catalog is based on an analysis of nuclear families, rather than extended families, these inexpensive tools generate acceptable multiple sequence alignments.

KA/KS values were calculated for branches on an evolutionary tree between nodes using the methods of Li and of Pamilo and Bianchi [3,4,5], modified to allow full treatment of probabilistic ancestral sequences [26]. Reconstruction of ancestral sequences was done using the Fitch maximum parsimony methodology [27]. While reconstructed ancestral sequences contain ambiguities, using probabilistic ancestral sequences takes this into account (by weighting ambiguous positions according to their probabilities) and allows us to construct a model of evolutionary history that is robust. Two cut-offs were used to identify 'interesting' values for the KA/KS rate ratio, 1 and 0.6. Separate databases were constructed for each cut-off. The resulting dataset is freely available for further analysis at [20].
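
For readers unfamiliar with Fitch parsimony, the bottom-up pass for a single alignment column on a rooted binary tree can be sketched as below. This is a textbook-style illustration, not the TAED implementation. The nested-tuple tree encoding and the function names are hypothetical, and giving equal weight to each ambiguous state is an assumption rather than the weighting scheme of [26].

```python
def fitch_sets(tree):
    """Bottom-up Fitch pass for one site.

    tree : a leaf (single-character string) or a pair (left, right) of subtrees
    Returns (candidate_states_at_this_node, minimum_number_of_changes_below_it).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_set, left_changes = fitch_sets(left)
    right_set, right_changes = fitch_sets(right)
    shared = left_set & right_set
    if shared:
        # The children agree on at least one state: no extra change is needed.
        return shared, left_changes + right_changes
    # The children disagree: keep all candidates and count one change.
    return left_set | right_set, left_changes + right_changes + 1


def uniform_state_probabilities(state_set):
    """Weight ambiguous ancestral states equally (illustrative assumption)."""
    return {state: 1.0 / len(state_set) for state in sorted(state_set)}


# One made-up alignment column on a four-taxon tree.
root_states, n_changes = fitch_sets((("A", "G"), ("G", "G")))
ambiguous_states, _ = fitch_sets(("A", "G"))
weights = uniform_state_probabilities(ambiguous_states)
```

An ambiguous node such as the ancestor of the first two leaves contributes fractional substitution counts to the branches around it, which is why fractional mutations can appear on short branches.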