A number of mechanisms can generate duplicate copies of genes, ranging from single gene duplications to regional and whole genome duplications [13]. Large increases in gene number have been coupled to increases in organismal complexity and radiative divergence at several points in the history of metazoans including during the chordate/vertebrate transition and during the teleost fish divergence [1, 4, 5].

Metazoans differ from prokaryotes in their much smaller effective population sizes, where theory predicts that neutral stochastic processes will be relatively more important than adaptive processes in the expected case that adaptive mutations are rarer than nearly neutral mutations [6]. Large scale analyses, based upon the ratio of nonsynonymous to synonymous nucleotide substitution rates [7] or MacDonald-Kreitman statistics [8] have indicated small to intermediate degrees of positive selection (adaptive substitutions) in mammals, but these clearly do not represent the majority of substitutions. In such studies, it appears to be specific positions in protein-encoding genes, rather than the genes as a whole that are under positive selection [7]. Even examining substitution as a neutral walk through sequence in a folded protein (ignoring positive selection) has shown such a process to have fairly complex dynamics [9]. From this, it is relevant to examine population genomic phenomena, like the fates of duplicated genes, in the context of physical models of proteins. Further, it is not possible to systematically identify fates of real genes (subfunctionalization to the exclusion of neofunctionalization or vice-versa), so modeling under increasingly realistic conditions is likely to be the best way to understand evolutionary mechanisms.

Pseudogenization or nonfunctionalization is a purely neutral process that ultimately eliminates one of the duplicated copies as a functional gene and is the most common fate. Subfunctionalization, is an alternative neutral process that leads to an increase in organismal gene number for genes or functions that show modularity (one representative type of modularity is modeled here, but other types are also possible). Neofunctionalization is an alternative process leading to an increase in organismal gene number, but dependent upon rarer adaptive mutations. Neofunctionalization can include the evolution of a completely new binding capability (as modeled here) or modification/improvement of existing binding capabilities under positive selection after removal of pleiotropic constraint. These alternative fates are presented in the context of a lattice model in Figure 1.

Figure 1
figure 1

Eight different fates are possible in the two simulations, nonfunctionalization (pseduogenization) without cellular death, subfunctionalization, pleiotropic neofunctionalization plus either nonfunctionalization of the other copy or subfunctionalization, non-pleiotropic neofunctionalization, non-pleiotropic neofunctionalization also involving subfunctionalization, redundancy, and cellular death. Some of the fates have additional combinations of activity that are not represented in this figure. It should also be noted that some of the characterizations overlap. For example, pleiotropic neofunctionalization occurs in combination with nonfunctionalization, subfunctionalization, or is redundant.

Lattices are models of folded proteins in square or cubic shapes (a cubic lattice was employed here). The folding of a lattice is dictated by the contacts from amino acids that are not adjacent in the primary sequence (these contacts are present in the folded and unfolded states). Because lattices are small and the folding rules are simple, they can be used for evolving populations of proteins to study their structural properties.

Lattice models have previously been used to make important predictions about the behavior of proteins in evolutionary contexts, including their metastability [10] and the evolvability of new folds [11]. Lattices that bind to peptides [12] and small hydrophobic molecules [13] have been described and the latter used to show that subfunctionalization can lead to an increase in duplicate gene retention rates. Here, model genes that fold into lattices and bind peptides were duplicated, with neofunctionalization and subfunctionalization (simulation A) or just neofunctionalization (simulation B) as possible events that would preserve duplicated copies in a genome, with nonfunctionalization (pseudogenization) as an alternative fate (see Figure 1 and Table 1). The relative levels of duplicate gene preservation and the importance of both neofunctionalization and subfunctionalization were assessed.

Table 1 The initial amino acid sequences including binding sites and folding energies (RT units) are shown for the 10 proteins. Additionally, the three ligands for each protein are shown with their initial binding energies (also RT units) at the two sites (A in italics and B in bold). It should be noted that the two binding sites are adjacent in three dimensions and overlap by one amino acid.

Results and discussion

A set of 10 stably folded lattices was designed to each bind to 2 different ligands at overlapping sites. A third ligand was present in the cell, but did not bind at either site at the start of the simulation. The lattice was duplicated in a constant population of 1000 cells, where those cells that bound the third ligand were 5% more likely to appear in the next generation (a selection coefficient of 5% is arbitrary, but only serves as a scaler of the results). In each generation, 10% of molecules became nonfunctional at random through transcriptional knock-out. The fitness function required molecules to fold and genomes to have binding capabilities for the first two ligands. Cells were selected under the constraint that the first and second ligands needed to be bound, but could be bound by either molecule (subfunctionalization possible) in simulation A. The second simulation (simulation B) tightened this constraint and required both ligands to be bound to the same molecule (subfunctionalization not possible). In simulation B, only neofunctionalization is possible as a mechanism to preserve both copies non-redundantly. Neofunctionalization can occur through two mechanisms, a pleiotropic mechanism where the third ligand binds at a site that is also capable of binding one of the other two ligands and a non-pleiotropic mechanism, where the third ligand binds to an inactive site. The average values of each fate (from 10 different lattices) in each of the two simulations are shown in Tables 2, 3.

Table 2 For simulation A the final average (over 10 different peptides each repeated 10 times) frequency of each fate across generations is reported. Error bars are reported as the standard error of the mean.
Table 3 For simulation B the final average (over 10 different peptides each repeated 10 times) frequency of each fate across generations is reported. Error bars are reported as the standard error of the mean.

While initially, both models generate similar levels of neofunctionalization, with time model A begins to show significantly more neofunctionalization. In model A, the total number of subfunctionalized genes, including those that have also neofunctionalized increased initially, but then reached a plateau. These results are shown in Figures 2 (neofunctionalization), 3 (subfunctionalization), and 4 (nonfunctionalization, including those that have also neofunctionalized on the other copy). It is clear that allowing subfunctionalization results in a greater retention rate of duplicate genes with less nonfunctionalization, although subfunctionalization without neofunctionalization never accounts for a large fraction of the duplicate genes at any point in evolutionary time (total terminal preservation of both duplicates is shown in Figure 5). Figure 5 indicates that the retention profile is completely different when subfunctionalization occurs compared to when it does not. It is also clear in these simulations that there is not a strong selective pressure to retain robustness through redundancy, as seen in Figure 6.

Figure 2
figure 2

The total frequency of cells with neofunctionalized molecules (capable of utilizing ligand C) is shown for simulation A in green and simulation B in red. With time, the total level of neofunctionalization in simulation A surpasses that of simulation B.

Figure 3
figure 3

The frequency of subfunctionalization without neofunctionalization is shown for simulation A, where subfunctionalization is allowed. Subfunctionalization alone never represents a large fraction of cellular genomic fates.

Figure 4
figure 4

The total rate of nonfunctionalization is shown for simulation A in green and simulation B in red. Simulation B shows a much higher rate of nonfunctionalization at all evolutionary times.

Figure 5
figure 5

The total rate of duplicate copy retention without redundancy through either neofunctionalization or subfunctionalization is shown for simulation A in green and for simulation B in red. Allowing subfunctionalization to occur results in a different retention profile, with a much higher rate of duplicate copy retention.

Figure 6
figure 6

The frequency of redundant copies decreases with time in both simulation A (green) and simulation B (red). This occurs faster in simulation B, due to the faster rate of nonfunctionalization.

The role of subfunctionalization as a transition is based upon increasing the mutational space accessible to duplicates to neofunctionalize with removal of selective constraint at a binding site. This walk will differ for different lattices (and for different real proteins), modulating the importance of the effect of subfunctionalization. The rate of neofunctionalization in the absence of gene duplication (the emergence of new function in orthologs) is also related to the accessibility of this pleiotropic walk, but is expected to be even slower than that of neofunctionalization in the absence of subfunctionalization.

While it is not possible to systematically analyze duplicated fates and classify duplicated proteins as neofunctionalized, subfunctionalized, or redundant, this study has implications for our understanding of the role of duplication in the evolution of genomes. Protein segments that have lost function but are stably maintained in an expressed form will drift through sequence space until they achieve a function that makes them the targets of selection. Looking back to the origin of chordates, there is little doubt that gene duplication and the evolution of new function (as evidenced by annotation) went hand in hand. However, it may be that subfunctionalization initially played an important role in preserving copies that subsequently neofunctionalized over the past hundreds of millions of years.


Subfunctionalization has previously been shown to increase the retention rate of duplicate genes using a similar approach [13]. However, when neofunctionalization is included as a possible fate for duplicate genes, subfunctionalization is still important in short time frames after duplication. However, with increasing time, subfunctionalization decreases in importance and its role seems to be to preserve duplicate copies for eventual neofunctionalization, a role as a transition state. Subfunctionalization can still play an important role with larger finite population sizes, but the importance of neofunctionalization as a terminal fate becomes even more dramatic with increasing population size.


Lattice model for protein sequences

We considered a simplified model of evolving proteins. Our model consisted of a chain of 64 random (codon derived) amino acid monomers on a three dimensional 4 × 4 × 4 cubic lattice, simulating a folded protein. Gene sequences were selected randomly and lattices folded as below. Two adjacent binding sites were randomly selected on each lattice. Three peptides were then designed: two that bound specifically at each site and a third that was bound by neither site, as shown by the binding energies in Table 1.

Lattice folding and selection

Each amino acid was embedded at a single lattice point with distinct amino acids correspond to a distinct lattice point. All amino acids were considered to be of uniform size connected with covalent bonds of uniform lengths. A protein fold corresponded to a self avoiding walk over the embedding. The walking algorithm tracks the sites visited to avoid visiting them again. A contact was assumed to exist between two residues if they were not adjacently covalently connected but were on adjacent lattice points. The energy of the protein in a particular conformation was calculated according to the formula,

where γ(Ai, Aj) is the contact potential between residue type Ai at position i and residue type Aj at position j, and Uij is equal to one if residues i and j are not adjacent in sequence but are on adjacent lattice sites, and zero otherwise. The value of γ(Ai, Aj) is obtained from the symmetric interaction matrix given by Miyazawa & Jernigan [14].

Evolution of lattice proteins

We have simulated two evolution models. Model A, corresponding to the evolution of a set of ten protein sequences (shown in Table 1) evolving to the alternative fates shown in Figure 1 and Model B corresponding to the evolution of the same set of proteins evolving without allowing subfunctionalization. Cells that did not bind ligands A and B (Model A) and ligands A and B in the same molecule (Model B) died. All molecules also needed to fold to be active.

In each model, we considered 1000 haploid cells that did not recombine between copies (independently for all ten gene duplicate pairs), with a protein molecule evolving according to a Poisson distribution with an average of 1 DNA mutation per gene per generation after the duplication event with a transition to transversion ratio of 2. After every generation, 10% of genes were knocked out at random to simulate mutations to transcriptional regulatory sequences and the cells were subsequently divided into the different fates shown in Figure 1. The next generation of cells was picked randomly from the living cells of the previous generation to keep a constant population size, with a 5% selective advantage to the neofunctionalized cells according to the Wright-Fisher selection model [15]. At the start of each generational round, each cell was viable with two gene copies to bind each set of ligands, as shown in Table 1.