Gene duplication has long been viewed as a key driver of biological complexity in Eukaryotes [14]. Duplicate genes mainly arise via small-scale tandem or segmental duplication events or via large-scale whole genome duplications (WGDs). The latter are especially common in plants [5, 6]. Indeed, comparative genomic studies indicate that all extant seed and flowering plants have experienced one or more WGDs in their evolutionary history [712].

Following gene duplication (whether via tandem, segmental or WGD events), most duplicate copies become pseudogenes (i.e. lose their function) or are lost entirely due to deletions [13]. This is expected because of relaxed purifying selection due to functional redundancy. Large-scale deletions are especially common following WGDs, as the neopolyploid returns back to its ancestral diploid condition, a process referred to as diploidization. Nevertheless, some gene duplicates are retained, and these surviving duplicates appear to contribute importantly to the evolution of biological complexity and phenotypic novelty, in part because such genes are less constrained evolutionarily than are single copy genes [1416].

Several models have been put forward to explain how duplicate genes avoid pseudogenization, as well as to account for why some duplicate genes are retained and others are not [17]. These include (1) neofunctionalization, in which one of the duplicates (i.e. paralogs) acquires a new function; (2) subfunctionalization, in which ancestral function is partitioned among paralogs [1]; (3) relative dosage, in which duplicate genes are retained (or lost) to avoid dosage imbalances [18, 19]; and (4) absolute dosage, in which the fixation of duplicate genes is due to selection favoring an increase in gene dosage [20] or metabolic flux [21].

In this paper, we focus on the predictions of the relative dosage model, also known as the Gene Balance Hypothesis (GBH) [18, 22], as this hypothesis has garnered the most support from real data [19, 2326]. According to the GBH, genes with a large number of interactions (i.e., “connected genes”) should be retained disproportionately following WGD events thereby maintaining optimal stoichiometry among their products; when a WGD event occurs, all genes are duplicated simultaneously and so relative gene dosage should not change. In small-scale duplications (e.g., tandem events), the increased dosage of a single, highly connected gene could result in decreased fitness, or even in lethality. Therefore, connected genes are expected to be differentially lost following small-scale duplications. Conversely, genes that work alone or have few interactions, such as those involved in disease resistance, are more likely to be retained following tandem duplications.

Patterns of gene retention in Arabidopsis thaliana are largely consistent with GBH predictions. For example, highly connected genes such as transcription factors have been preferentially retained after WGDs in A. thaliana, but disproportionately lost following small-scale duplications [23, 24]. The bias towards regulatory genes chiefly derives from duplicates of intermediate age (circa 50–70 mya), which are mainly WGD-associated [13]. Similar findings have been reported for poplar [26] and rice [25]. In contrast, paleologs (paralogs arising from WGD events) in the Compositae family are enriched for genes annotated to structural components or cellular organization gene ontology (GO) categories, while genes involved with transcription appear to be significantly under-represented [10]. In A. thaliana and Sorghum bicolor, both WGD and tandem mechanisms of duplication are associated with paralogs involved in high metabolic flux networks [21], an observation best predicted by the absolute dosage model.

In addition to mechanism of duplication, the fate of paralogs may be influenced by genetic background, various environmental factors, epigenetic effects, genetic drift, and the mechanism of gene dosage-compensation [15, 21, 27]. Another potential issue concerns the faster turnover rates of tandem paralogs relative to those originating via WGDs [7, 14, 28, 29]. As a consequence, the sampling of tandem paralogs is biased towards young gene duplicates whereas that of WGD paralogs is skewed towards old duplications. As far as we are aware, this bias has not previously been accounted for when inferring patterns of duplicate gene retention.

Here we investigate the impact of duplication mechanism, gene function, and age of duplication in the retention of duplicate genes. Our analyses consider both WGD and tandem duplications, as these are the two most frequently invoked mechanisms to explain how paralogous gene pairs are generated in plant genomes [3, 23, 24, 30]. We have targeted 25 plant species with fully sequenced genomes that include the basal land plants, Physcomitrella and Selaginella, the basal flowering plant Amborella, and as well as 14 flowering plant orders. This diverse array of taxa enables comparisons of taxa with highly contrasting histories of polyploidy, including at least one species with no known WGD in its evolutionary history (Selaginella). This is critical, because it allows us to control for potential biases caused by unequal duplicate gene turnover rates. Our focus is on genes annotated as transcription factors, since differential retention of duplicated transcription factors provides the main support for the GBH. We specifically address the following questions: (1) Is the turnover rate of WGD paralogs persistently lower than that of tandem paralogs? (2) Which functional gene categories are consistently overrepresented among WGD and/or tandem paralogs? (3) Does variation in duplicate gene retention depend significantly upon the age of WGD paralogs? and (4) To what extent do our results support for the Gene Balance Hypothesis?


Origin and turnover rate of paralogs

For each of the 25 study species, we calculated Ks time divergence between pairs of paralogs and used a synteny-based approach to categorize members of all gene families as derived from WGD or tandem duplications. Duplicate pairs whose origins were uncertain based on available data were classified as “undefined”. Across the 25 target genomes, the majority of paralogs detected had Ks ≤ 2 (Table 1) including 79% of paralogs in A. thaliana, 86% in Glycine max, and 92% in Malus domestica. Paralogs with Ks > 2 were excluded from our analyses due to concerns that Ks saturation could impair reliable inferences [31]. Most species displayed clear prominent peaks in their Ks age histograms, which is illustrated by histograms for five species with contrasting histories of polyploidy (Fig. 1). Histograms for the remaining 20 species are depicted elsewhere (Additional file 1: Figure S1). In the K-S goodness of fit test, all histograms for all species except Carica deviated significantly (P < 0.05) from the null model of constant duplicate gene birth and death (Additional file 1: Table S1). SiZer maps identified a significantly increasing gradient in the Ks age histograms of WGD-derived paralogs of most species, which provides support for polyploid signals being well distinguished from background duplications.

Table 1 Distribution of paralogous gene pairs for 25 plant species targeted by this study
Fig. 1
figure 1

Ks age distributions (a and b) and SiZer maps (c to e) of five plant species. a Brown bars, all paralogs (background); black bars, WGD-derived paralogs predicted by DAGchainer; yellow bars, paralogs annotated as transcription factor activity (GO:0003700). b Brown bars, background; gray bars, tandem-derived paralogs predicted by DAGchainer. SiZer maps for c All paralogs; d WGD-derived paralogs; e Transcription factor paralogs

The Ks age histograms of WGD-derived paralogs (Fig. 1a, depicted in black) were clearly distinct from those of the tandem-derived paralogs (Fig. 1b, depicted in gray). While tandem histograms exhibited a descending slope (similar to a half-parabola) for most of the species, WGD-derived paralog histogams had peaks that overlapped with peaks from histograms of all paralogs (Fig. 1a, depicted in brown). SiZer maps also confirmed the presence of peaks for WGD-derived paralogs histograms (Fig. 1d).

Because of our focus on transcription factor paralogs, their Ks age histograms are shown (Fig. 1 and Additional file 1: Figure S1; depicted in yellow) along with the Ks histograms of WGD- and tandem-derived paralogs. The SiZer maps (Fig. 1e) showed increasing gradients for transcription factor paralogs that overlapped with the slopes of WGD-derived paralogs.

Biased retention of paralogs after large- and small-scale duplications

To assess the universality of the GBH across land plants, we identified the most strongly overrepresented GO functional categories in both predicted WGD- and tandem-derived paralogs in these 25 genomes. We found that WGD- and tandem-derived paralogs did not share the top 10 most frequently overrepresented GO categories (Fig. 2a and c). While the most overrepresented categories of WGD-derived paralogs fell under macromolecular complexes (GO:0032991), internal to cell (GO:0005622), and cytoplasm (GO:0005737) functional GO categories; those of tandem-derived paralogs grouped into programmed cell death (GO:0012501), defense response (GO:0006952), and apoptotic process (GO:0006915) GO categories.

Fig. 2
figure 2

Heat maps of GO categories across 25 plant species. a The 10 most frequent GO categories overrepresented among WGD-derived paralogs. b Transcription factor activity category (GO:0003700) enrichment analysis for WGD-derived paralogs. c The 10 most frequent GO categories overrepresented among tandem-derived paralogs. Color gradient represents the Corrected P value calculated by the ErmineJ software: brown colors, significant over-representation (P < 0.05); yellow colors, reduced or non-significant enrichment; and gray color, no enrichment

In six species, WGD-derived paralogs were not enriched for the overrepresented GO categories found in the remaining plant species. Five of them—Cariaca, Ricinus, Populus, Selaginella, and Physcomitrella—have few WGD-derived paralogs predicted by DAGchainer (Table 1), consistent with possible under-estimation or misidentification of WGD-derived paralogs in these species (see Discussion below). For another five taxa—A. thaliana, Medicago, both Oryza subspecies, and Populus—WGD-derived transcription factor paralogs were overrepresented (Fig. 2b). Surprisingly, WGD-derived transcription factor paralogs were not significantly overrepresented in Arabidopsis lyrata, which shares the same WGD events as A. thaliana, although there was a trend in the expected direction.

Unexpectedly, transcription factor activity (GO:0003700) WGD-derived paralogs were not significantly overrepresented in 20 plant species, ten of which exhibit evidence of recent WGDs in their evolutionary history, with a significantly increasing gradient in SiZer (Fig. 1 and Additional file 1: Figure S1) within Ks range < 1 (and consistent with previous reports—see below). Finally, results from our analyses of tandem duplications showed tandem-derived transcription factor paralogs were significantly underrepresented across the 25 focal genomes.

Biased retention toward ancient transcription factors

We analyzed the biased retention of transcription factor paralogs based on Ks time divergence as opposed to mechanism of duplication. This was accomplished by mapping known WGD events onto a phylogeny for the 25 species targeted by this study (Fig. 3, Additional file 1: Table S2).

Fig. 3
figure 3

Phylogenetic distribution of transcription factor retention biases among 25 plant species. The phylogenetic tree was adapted from PLAZA 3.0. Symbol code: Black circles on the tree branches, all known WGD events we also identified in this study; Open circles, suggested ancient WGD events we did not examine; triangles, species with WGD-derived transcription factor paralogs significantly overrepresented; pentagons and stars, species with transcription factor paralogs significantly overrepresented in range 1.5 < Ks ≤ 2 and range 1 < Ks ≤ 2, respectively

In general, transcription factor (GO:003700) paralogs tend to be overrepresented amongst ancient (Ks > 1) duplication regardless of mechanism of duplication (Fig. 3). Eleven of the 25 focal species exhibited significant enrichment at Ks range > 1.5 (Fig. 3, pentagons), but no such retention bias at lower Ks ranges (≤1.5). When we compared transcription factor paralog enrichment at Ks > 1.0 versus < 1.0, 17 species showed significant enrichment for the older transcription factor paralogs (Fig. 3, stars). For four of these, A. thaliana, Medicago, and the two Oryza subspecies, the overrepresented transcription factors originated from WGD events (Fig. 2b). However, for the remaining 13 species, the ancient paralogs are not obviously associated with a WGD event. Although A. thaliana, Oryza sativa ssp. indica, and Solanum exhibited significant signals of polyploidy in the Ks range < 1 (Fig. 1; Additional file 1: Figure S1), their transcription factor paralogs were only significantly overrepresented in the Ks range > 1 (Fig. 3f).

In genomes of only four taxa (Carica, Malus, Manihot, and Populus) were recent transcription factor paralogs overrepresented, and only for Populus were WGD-derived transcription factor paralogs significantly overrepresented (Fig. 2b).

In addition to analyzing the retention of transcription factor paralogs, we submitted our data to enrichment analysis aiming to find additional GO categories that could have experienced biased retention patterns. A number of GO categories, including those involved in transcription, regulation, transport, and response to stimulus were frequently overrepresented among ancient paralogs (Ks > 1) and not exclusively associated to WGDs (Additional file 1: Figure S2). While three of these functional GO categories—cell periphery (GO:0071944), plasma membrane (GO:0005886), and response to abiotic stimulus (GO:0009628)—were overrepresented among WGD-derived paralogs; two categories—response to stimulus (GO:0050896) and catalytic activity (GO:0003824)—were overrepresented among tandem-derived paralogs.


Tandem paralogs have faster turnover rate

Our synteny-based approach identified pairs of WGD-derived genes similar to those that have been reported in previous studies. In A. thaliana, for example, circa 80% of the 2740 duplicate gene pairs we classified as WGD-derived are in common with the list of polyploidy-derived paralogs published by [23]. Differences among studies may be due to new gene annotation tools that recently became available. In some instances, the number of paralogs predicted as having their origin in WGD events can be underestimated due to widespread genomic changes (e.g., gene loss and/or chromosomal rearrangements) after polyploidization events [19]. Such processes are particularly problematic for ancient polyploidization events, which may explain the low number of WGD paralogs we predicted in the basal plants, Amborella and Physcomitrella, as well as for Lotus, Carica, and Ricinus (Table 1). On the other hand, our approach indicates the presence of a small number WGD-derived paralogs in Selaginella, which is not thought to have a WGD in its evolutionary history (Table 1). This result could be evidence for Selaginella as ancient polyploidy. Alternatively, it suggests that Selaginella has had an ancient large segmental duplication or some fraction of the identified WGD derived paralogs are false positives. However, we selected WGD pairs using a syteny based approach, which is the most conservative method presently available.

Tandem paralogs were similarly identified based on the genomic coordinates of genes. In Eucalyptus, 32% of its 36,449 protein-coding genes originated via tandem events, which is the largest proportion of tandem-derived paralogs amongst the 25 plant species we investigated. Physcomitrella exhibited the smallest proportion (~1%) of tandem-derived paralogs. These findings are very similar to those previously reported for Eucalyptus [32] and Physcomitrella [33], respectively.

We identified peaks in the Ks age histograms; based on SiZer maps, these peaks likely result from WGDs (Fig. 1 and Additional file 1: Figure S1). Previous studies have also identified these WGD events using data that span across several families [34], or from a given plant species [9, 10, 32]. In the Ks histogram of A. thaliana, for example, there were two prominent peaks (Fig. 1), which coincided with the α and β polyploid events reported by early investigations [3436]. In our analysis, the tail of the most recent duplication masked the second peak; thus, a single, significantly increasing slope was identified by SiZer. In A. lyrata, SiZer identified two significant peaks as expected given the recent history of polyploidy in Arabidopsis [36].

Differences in the Ks age histograms from WGD- and tandem-derived paralogs indicates that the turnover rate of tandem paralogs is faster than that of WGD paralogs, as previously suggested by others [7, 14, 29, 33]. The pattern we uncovered suggests lower turnover rates of transcription factor paralogs than those observed for tandem paralogs. Furthermore, it appears that the origin and biased retention of transcription factor paralogs are not restricted to large-scale duplication events.

Patterns of transcription factor retention following WGDs

Consistent with the expectations of the GBH, WGD- and tandem-derived paralogs did not share the top 10 most frequently overrepresented GO categories. Six species—Malus, Cariaca, Ricinus, Populus, Selaginella, and Physcomitrella—were exceptions and did not share the most frequent GO categories, which is consistent with the possible under-estimation or misidentification of WGD-derived paralogs in these species. In Malus, for example, the GO categories that were overrepresented include: plasma membrane (GO:0005886), response to abiotic stimulus (GO:0009628), response to biotic stimulus (GO:0009607), and response to endogenous stimulus (GO:0009719). Analyses of an EST library of Malus domestica also found that these categories were overrepresented [37]. Consistent with the GBH, we did not found tandem-derived transcription factor paralogs overrepresented in any of 25 focal genomes.

Other findings were inconsistent with the predictions of the GBH. In plants, the genome of A. thaliana has been frequently used to support dosage-constraints of transcription factors [23, 24]. Unexpectedly, our findings reveal transcription factor activity (GO:0003700) WGD-derived paralogs to be significantly overrepresented in only five plant taxa—A. thaliana, Medicago, the two Oryza subspecies, and Populus. Ten of the 20 remaining study species exhibited evidence for recent WGDs. Other studies have also reported a downward bias in the retention of transcription factor paralogs following WGD events. In Compositae paleologs, for example, it has been observed that genes involved with structural components or cellular organization were significantly overrepresented; whereas transcription factors were significantly underrepresented [10]. These authors argued that patterns of intrinsic selection on different gene categories may vary across higher taxonomic categories. The fate of paralogs originated by either WGD or small-scale events would depend on intrinsic properties, such as gene function and the environment in which the new polyploid was born [21].

Age of duplications contribute to biased gene retention

Regardless the mechanism of duplication, we showed that ancient paralogs of transcription factors were preferentially retained over paralogs of more recent origin. In agreement to our findings, a previous study in A. thaliana reported that genes involved in transcriptional regulation showed greater retention after the later (β) genome duplication than after the youngest (α) duplication [24]. Likewise, transcription factors not directly associated with WGDs were overrepresented among genes of ancient origin in A. thaliana [13]. Again, our results indicate that out of 25 plant species with very different histories of polyploidy, such as A. thaliana which has two recent WGD events [35] and Vitis which has no known recent WGD events [9], 17 share this pattern of biased retention of ancient transcription factor paralogs. Although transcription factor paralogs with recent origin were over-represented in four species (Carica, Malus, Manihot, and Populus), we could only clearly determine that those of Populus were WGD-derived paralogs. The over-representation of young (Ks < 0.5) transcription factor paralogs in Carica is intriguing, given that no WGD events likely took place in its recent evolutionary history [38] and that DAGchainer only predicted tandem-derived transcription factor paralogs for Carica within the Ks range ≤ 1.0 (Additional file 1: Table S2). Given that Carica lacks recent WGD events [38] and we did not identified transcription factors paralogs originated from WGD events within Ks < 1, the many transcription factor paralogs of Carica appear to derive from small-scale duplications within its genome.

Our findings differ from a recent study of core gene families in 37 angiosperm genomes [13], which reported remarkable consistency in the rate at which genes return to a single copy state, as well as in the gene families that are retained as multi-copy. The findings were related to differences in gene function and the authors concluded that similar selection pressures within and between lineages are largely responsible for the non-random patterns observed, at least for core genes [13]. The apparent differences between the two studies derive partly from the fact that core gene families represent a fairly small fraction (13%) of all gene families and that single copy genes in were included their analysis, which drive many of the reported patterns. In contrast, we restricted our analyses to duplicate genes.


Our analyses imply that both the fixation and retention of duplicated genes are context-dependent events. Thus, while the mechanism of duplication is clearly important, so are the characteristics of the particular lineage in which the duplication arises, as well as timing of duplication. Although our results show that many transcription factor paralogs do indeed derive from large-scale duplication events, this is not conclusive evidence for the GBH. Observations seemingly inconsistent with the GBH include, for example, the preferential retention of transcription factor paralogs in taxa with no apparent history of polyploidy or following tandem duplications in Carica, as well as the absence of such retention biases following some recent WGDs (e.g. Glycine, Helianthus, and Zea). Nonetheless, the most important observation in this paper—the strong bias toward ancient transcription factor duplicates seen in most plant genomes—may be interpreted in a manner consistent with the GBH. Possibly, all plant lineages are the product of multiple ancient WGDs, the earliest of which are no longer detectable. Under the GBH, the duplicates from the first polyploidization would be most likely to be retained to maintain optimal stoichiometry among gene products. The number of paralogs is expected to grow rapidly with each polyploidization event. With so many paralogs, changes in the amount of the gene product might be tolerated and a copy of the gene can be lost or diverge. This could lead to the pattern we see—biased retention toward ancient transcription factor paralogs—and also might account for the weaker signal we see among recent transcription factor paralogs. It even could account, in part, for the greater tolerance of recent tandem transcription paralogs seen in Carica.


Data collection and selection of paralogs

Full genome annotations, protein-gene codes, DNA sequences, gene families, and Gene Ontology (GO) annotations from the 25 focal species were retrieved from PLAZA 2.5 and 3.0 Dicots [39], with the exception of sunflower (Helianthus annuus), as detailed in Additional file 1: Table S3. Protein-gene code files with alternative transcripts removed were used to identify paralogous gene pairs using BLASTp all-against-all, with an e-value cutoff of e−20, with a minimum 50% identity, alignment length > 300 bp, number of mismatches < 550, and number of gap opens < 30. Self hits were removed and only paralogous gene pairs with both copies belonging to the same gene family were maintained for further analysis. For the selection of paralog pairs for H. annuus, CDS sequences and BLASTn all-against-all were used based on HA412.v1.1 version of the genome (

Determining paralog duplication mechanism

The DAGchainer software package [40] was used to predict the mechanism of by which paralogs originated based on their genomic coordinates. WGD-derived paralog pairs were predicted by running DAGchainer to find syntenic/collinear regions among chromosomes, in the same species, using default parameters and ignoring tandem duplication alignments (-s and -I options). Tandem-derived paralogs were predicted by using the accessory segmental duplication tool, also made available by DAGchainer, to find collinear sets of homologous genes, with the ‘max intervening genes value’ set to 10. All the other paralog pairs, not predicted as WGD or tandem-derived, were marked as undefined (UD), as these paralogs may have been originated by either large- or small-scale duplications.

Age of duplication events

We calculated relative divergence times for each paralog pair in terms of synonymous substitutions per synonymous site (Ks). First, we aligned the nucleotide sequences of gene pairs using TranslatorX [41], based on protein alignments performed by MUSCLE v3.8.31 [42]. Divergence times (Ks) were calculated with the yn00 software from the PAML v4.1 package [43]. This method assumes the F3x4 codon frequency model and accounts for transition/transversion rate bias and codon usage bias, which is an approximation of the maximum likelihood method recommended for pairwise comparisons in the manual of PAML. Because of issues associated with Ks saturation and stochasticity [31], only paralogs with Ks ≤ 2 and Standard Error (SE) < 0.5 were used in further analyses.

Custom python scripts were used to parse the BLAST all-against-all output in order to identify the closest paralog gene pairs. First, self hits were removed. Then, paralogs were organized into a single gene list and then used to select the corresponding paralog pair(s) for each of these genes based on the following three rules: (I) if a single gene was predicted as WGD-derived by DAGchainer, keep the duplicate pair with the lowest Ks value, while still allowing pairing with tandem-derived or undefined genes; (II) if not predicted as WGD-derived, but predicted as tandem-derived, keep the gene pair with the lowest Ks value; and (III) if the single gene was not predicted as WGD- or tandem-derived, keep the undefined paralog pair with the lowest Ks value.

GO annotation and over-representation analysis

Functional Gene ontology GO terms (categories) were determined for each gene and paralog pair and then evaluated for enrichment by the ErmineJ v3.0.2 software [44]. All the three GO domains (Biological Process, Molecular Function and Cellular Component) were included in the Over-Representation Analysis (ORA), with a minimum gene set size equal to 10 and the Best Scoring gene replicate treatment. Eight different groups of paralogs were analyzed: WGD-derived, tandem-derived, and paralogs representing the following Ks ranges: (A) 0 < Ks ≤ 0.5, (B) 0.5 < Ks ≤ 1, (C) 1 < Ks ≤ 1.5, (D) 1.5 < Ks ≤ 2, (E) 0 < Ks ≤ 1 and (F) 1 < Ks ≤ 2. The GO categories were considered overrepresented if Corrected P < 0.05, as calculated by the ErmineJ software.


K-S goodness of fit test

The Kolmogorov-Smirnov test [45] was used to evaluate if the age distribution (Ks) of all duplicates (background) deviated significantly (P < 0.05) from a simulated null hypothesis of constant duplicate gene birth and death.

SiZer maps: identifying significant peaks in Ks histograms

Significant peaks in the Ks histograms were found by SiZer [46] implemented on R software, with the following command line: SiZer.1 < − SiZer(x, y, h = c(.05,5), degree = 1, derv = 1). A SiZer map is a way of examining when the p-th derivative of a scatterplot-smoother is significantly negative, possibly zero or significantly positive across a range of smoothing bandwidths. In a SiZer map, blue indicates a significantly increasing gradient, red is a significantly decreasing gradient, purple is a non-significant gradient and gray indicates that data are too sparse for reliable estimation.