Expression and regulatory asymmetry of retained Arabidopsis thaliana transcription factor genes derived from whole genome duplication
Transcription factors (TFs) play a key role in regulating plant development and response to environmental stimuli. While most genes revert to single copy after a whole genome duplication (WGD) event, transcription factors are retained at a significantly higher rate. Little is known about how TF duplicates have diverged in their expression and regulation, the answer to which may contribute to a better understanding of the elevated retention rate among TFs.
Here we assessed what features may explain differences in the retention of TF duplicates and other genes using Arabidopsis thaliana as a model. We integrated 34 expression, sequence, and conservation features to build a linear model for predicting the extent of duplicate retention following WGD events among TFs and 19 groups of genes with other functions. We found that TFs were the least well predicted, demonstrating that the features of TFs deviate substantially from those of duplicate genes in other function groups. Consistent with this, the evolution of TF expression patterns and cis-regulatory sites favors the partitioning of ancestral states among the resulting duplicates: one “ancestral” TF duplicate retains most ancestral expression and cis-regulatory sites, while the “non-ancestral” duplicate is enriched for novel regulatory sites. By modeling the retention of ancestral expression and cis-regulatory states in duplicate pairs using a system of differential equations, we found that TF duplicate pairs in a partitioned state are preferentially maintained.
These TF duplicates with asymmetrically partitioned ancestral states are likely maintained because one copy retains ancestral functions while the other, at least in some cases, acquires novel cis-regulatory sites that may be important for novel, adaptive traits.
Keywords: Expression divergence, cis-regulatory evolution, Duplicate retention
Area Under Curve-Receiver Operating Characteristic
Whole genome duplication
Plant genomes are replete with paralogous genes derived from a variety of duplication events and mechanisms, particularly whole genome duplication (WGD) [14, 15, 20, 21, 52, 56, 61, 72, 73, 77, 78]. Two ancient WGD events took place prior to the divergence of angiosperms. Subsequently, more than a dozen WGD events have occurred across a variety of angiosperm lineages [33, 41, 47, 55, 65, 75], including three in the lineage leading to Arabidopsis thaliana. Because the last known WGD events in the Saccharomyces cerevisiae [30, 79] and human [12, 53] lineages occurred prior to the radiation of angiosperms, WGD appears to occur much more frequently in plants than in other eukaryotic lineages.
WGD accounts for ~ 90% of the expansion of TF families across plant lineages, and TFs are consistently enriched among WGD duplicates across divergent plant species [10, 36, 63]. In addition, plant TF duplicates derived from WGD are retained at higher rates than most plant genes with other functions [62, 63]. These duplicate TFs contribute significantly to plant adaptation, agricultural traits, and domestication. The expansion of several TF families coincides with major events in the evolution of plants, such as the migration to land and the expansion of flowering plants [11, 64, 76]. TF duplication is also central to the evolution of flowering time, floral structures, and fruit development [38, 43].
Because WGD results in duplication of all genes in a genome, the differences in the degrees of expansion of different gene families [7, 24, 37, 62] must result from differential rates of gene retention. Previously, a collection of features including sequence properties (e.g. gene length), biochemical activities (e.g. expression level), evolutionary characteristics (e.g. substitution rates), and annotated functions have been used to assess the properties of retained duplicates in general [26, 45]. It remains an open question how well these properties may explain the retention rates of genes duplicated via different WGD events and in specific groups of genes, such as TFs and genes with other functions. It is also unknown how these properties differ between TFs and other functional groups of genes.
In this study, we first modeled the percent retention of TFs as a group and 19 other function groups of genes using 34 gene features in three broad categories (expression, sequence, and conservation). Then, to assess how the extant functions of duplicate pairs have diverged relative to their ancestral function, we determined how gene expression and cis-regulatory sites of TF duplicates have likely evolved post WGD by inferring the ancestral expression and cis-regulatory states of extant TF duplicates. Finally, we modeled the evolution of TF WGD duplicates as a system of differential equations that tracks the change in frequency of duplicate pairs retaining the ancestral state in both, one, or neither copy, to assess whether the partitioning of TF duplicate pairs is maintained by a bias against losing the ancestral state in the second duplicate copy.
Results & discussion
Retention of duplicate genes in different function groups following WGD
Statistics for the best fitting model for the odds ratio of duplicate retention for each WGD-event
Features explaining degrees of retention across function groups and WGD events
The importance of all features used in the linear models of duplicate retention in function groups across each WGD event. Features: Expression Mean (AtGenExpress), Expression Maximum (RNASeq), Number of Domains, Nucleotide Diversity (Pi), Expression Correlation (AtGenExpress), Expression MAD/Median (AtGenExpress), Protein Length (in Amino Acids), Maximum Percent Identity.
However, certain parameters in our model do show sensitivity to which functional groups are used. To test the robustness of our models of duplicate retention, we made new, truncated data sets by leaving out one functional group per set and performed our optimization procedure again on each truncated set (see Additional file 1: Tables S3-S5). While most parameters show small deviation in response to the removal of individual functional groups (5–11% relative to the mean), we observed cases where the standard deviation was > 15%: the number of domains (15.7%) and nucleotide diversity (26.2%) in the α WGD model, as well as nucleotide diversity (15.1%) and maximum percent identity (16.5%) in the γ WGD model. The elevated variance in these parameters across the truncated data sets is primarily driven by the removal of three functional groups: defense response, TFs, and translation. Of these, the TF group stands out because, without TFs, nucleotide diversity is dispensable in the model of α WGD retention (F-statistic = 12.73 without nucleotide diversity, F-statistic = 12.74 with nucleotide diversity). In addition, our model fit of γ WGD retention is no longer significant after leaving TFs out (p = 0.12). This is expected given that TF retention is underestimated in the α and γ models more than that of any other functional group (Fig. 1c) and thus the retention of TFs likely represents an extreme relative to most functional groups (Additional file 3: Figure S2).
Although the degree of retention predicted by the models closely aligns with the actual values for each function group across each event (R2, α = 0.87, β = 0.83, γ = 0.65; see Fig. 1c), the estimates of different parameters are affected by the choice of functional groups being considered. The presence or absence of TFs in particular is highly influential, which is to be expected given that TFs have such a high degree of duplicate retention. This is further demonstrated by the fact that the Rd of TFs was underestimated in all three models (black arrows, Fig. 1c, Additional file 3: Figure S2), and the net underestimation of TF retention summed across all models (Rd = 1.25) is 45.5% higher than that of the next nearest functional group (ubiquitin transferase, Rd = 0.86). For this reason, we chose to examine the feature distributions among TF duplicates relative to those from other functional groups. Specifically, the retention of duplicates from the γ event is correlated with maximum percent identity, but the magnitude of the parameter associated with the feature is reduced by 37.9% if TFs are excluded (Additional file 1: Table S5). The identities of TF WGD-singletons (only one copy retained) to their best matches (66.9%) are significantly higher than the genome-wide WGD-singleton average (61.3%, Welch’s t-test, p = 1.9e-223), although identities of WGD-duplicates to their best matches are similar between TFs (71.3%) and the genome-wide average (72.5%). The higher than average identity of TFs explains why the removal of the TF functional group has such an impact on the γ model, and on the estimation of the effect of maximum percent identity in particular. In spite of this, the error in the γ model for TF retention was 0.802 even when TFs were included, the largest underestimation among all of our model predictions. However, if we assume TF WGD-singletons had a more typical distribution of maximum percent identity (i.e. duplicates are 10% higher on average rather than 5.6%), the predicted degree of TF retention for the γ event becomes 2.94 (red dot, Fig. 1c), reducing the error in our original model by almost half.
In addition to the linear models for predicting degrees of retention at the function group level, we have established machine learning models incorporating the same features to predict whether a gene is likely to have a retained duplicate or not (Additional file 4: File S1). Similar to the linear model, the machine learning model performed the poorest when predicting TFs (Area Under Curve-Receiver Operating Characteristic = 0.75) compared to predicting all genes (0.88, Additional file 4: File S1). Taken together, we demonstrated that the degree of retention for genes in different function groups is related to multiple features that are impacted by the timing of WGD events. However, while these features are useful for predicting the degree of retention for some function groups, they systematically underestimated the degree of retention for TFs. The behavior of TFs departs from the norm in part because of underlying differences between the features of TFs and the genome average.
Partitioning of ancestral expression states following TF duplication
The “partitioned” state of TF WGD-duplicate pairs is over-represented to a lesser degree for the more ancient β and γ WGD events (Fig. 3d). We confirmed that there is indeed a significant interaction between the expression state of a TF WGD-duplicate pair and the timing of the WGD event (ANOVA, p < 2e-16), indicating that partitioning occurred relatively quickly after the most recent WGD, but that these partitioned patterns were not necessarily maintained as the duplicates age. Next, using each expression data subset, we asked whether TF duplicate expression levels tend to increase or decrease when they deviate from the ancestral state. For the LightDev (left panel, Fig. 3d), Ctrl, and Stress expression level subsets (Additional file 6: Figure S4), deviation from ancestral expression states among duplicates tends to be small (i.e. mostly by one quartile) and negative. In contrast, we found that TFs were equally likely to increase or decrease differential expression in response to stress compared to the ancestral state (Fig. 3d, Additional file 6: Figure S4). We also modeled the transition from ancestral expression (O) to higher (+) and lower (−) expression level states following WGD (see Methods). The results of these models can be found in Additional file 7: Figure S5. In the two-parameter model (in which the rates from O to + and from O to − were allowed to differ), the rate of evolution from O to − was 1.9 to 3.1 times higher than that from O to +. For the Diff subset, the O to − rate was 1.2 times higher, but the difference was not significant (p = 0.43). These results further suggest that the evolution of TF duplicates favors decreasing expression levels relative to the ancestral expression state. However, when looking at differential expression in response to stress, TF duplicates can evolve in either direction with similar likelihood.
Thus, following duplication, TF duplicates may have increased or decreased responses to stress, rather than losing the response altogether, in sharp contrast to the patterns when all duplicate genes were considered [81, 82].
Asymmetry in the partitioning of ancestral expression
Asymmetry in the partitioning of ancestral cis-regulatory sites
The ancestral copy is likely retained due to selection on inherited ancestral states. What about the non-ancestral copy? One possibility is that, despite the extreme asymmetry, some non-ancestral copies may still retain some ancestral functions that are subject to selection. Another hypothesis is that the non-ancestral copy is retained because it has acquired a novel function in the form of new expression or regulatory states. To test this, we applied our model of ancestral-state partitioning to cis-regulatory sites. Using putative binding sites of 345 A. thaliana TFs, we inferred ancestral cis-regulatory sites of ancestral TFs (see Methods). Loss of an ancestral cis-regulatory site in only one TF copy (57%) occurs more often than expected (42.3%; t-test, p < 1e-323). In contrast, observed retention (10.5%, expected = 24.0%) and loss (16.2%, expected = 18.5%) of ancestral cis-regulatory sites in both WGD-duplicates were significantly less frequent than expected (p < 1e-323). In addition, the partitioning patterns of ancestral cis-regulatory sites were highly asymmetric (Kolmogorov–Smirnov test, p < 2.2e-16; Fig. 4c). Thus, much like what we observed for expression, TF WGD-duplicates can be classified into ancestral and non-ancestral copies with regard to cis-regulatory sites.
Most importantly, in 177 of the 249 duplicate pairs with ≥1 novel regulatory sites (71.0%), the non-ancestral copy tends to have more novel cis-regulatory sites (Fig. 4d), a proportion significantly higher than random expectation (50%, p < 3.8e-12). In addition, the novel cis-regulatory sites are found only in the non-ancestral copies in 61.8% of duplicate pairs, compared to 14% of pairs where all of the novel sites are in the ancestral copies. Novel cis-regulatory sites are also over-represented (odds ratio = 3.54) in the promoters of putative non-ancestral genes compared to ancestral ones (Fisher’s Exact Test, p < 2.2e-16). These patterns suggest that the acquisition of novel cis-regulatory sites likely contributes to the retention of the non-ancestral TF duplicate copies. This conclusion is likely to be similar if we consider novel expression states, because the ancestral and non-ancestral designations defined according to expression levels tend to agree with the designations based on cis-regulatory sites (59.8%, compared to 24.6% expected by random association, p = 1.8e-20).
Patterns of WGD-duplicate divergence and partitioning result from evolutionary bias
We found the two-parameter model to be significantly better at explaining the observed difference in WGD-duplicate states over time (Likelihood Ratio Test, p < 2e-14). Considering expression states, the O➔I transition rates were 7 to 13 times higher than the I➔II transition rates (Fig. 6b). Thus, the number of partitioned WGD-duplicates accumulated rapidly post WGD, followed by a relatively slow accumulation of cases where ancestral expression states had been lost in both duplicates. We also assessed a four-parameter model (O➔I, I➔II, II➔I, I➔O) of expression state evolution, which was not better than the two-parameter model. In contrast, applying this same approach to model regulatory site evolution revealed that the four-parameter model is significantly better (p of 4.8e-13 and 1.2e-11 vs. one and two-parameter models, respectively; Fig. 6c). The rates governing the O➔I transition (x) are two orders of magnitude higher than those governing the I➔II transition (w, Fig. 6d). Importantly, in the four-parameter model for cis-regulatory sites, a high rate of O➔I transition was estimated at the early stage after WGD (blue curve, Fig. 6c). In addition, an appreciable proportion of partitioned duplicates lost ancestral regulatory sites in the second copy (green curve, Fig. 6c). This is in sharp contrast to the transition rate estimates over time for expression, where second copies tend not to lose the ancestral expression state (Fig. 6b), indicating that regulatory sites are faster evolving and more labile than expression states.
In this study, we used linear models to assess how expression, conservation, and sequence structural features of genes in different functional groups may explain differences in their retention rates. The value distributions of TF features differ significantly from those of genes in the rest of the genome, resulting in lower predictability. When considering each TF WGD-duplicate pair, one TF duplicate tends to have a reduced expression level relative to the inferred ancestral level. In addition, we found that ancestral expression and cis-regulatory sites tend to be partitioned between TF duplicates asymmetrically such that there are distinct ancestral and non-ancestral duplicates. Interestingly, the non-ancestral TF duplicates tend to gain novel cis-regulatory sites that likely contribute to new expression patterns. Finally, we demonstrate a preference for maintaining partitioned expression and cis-regulatory site states between TF WGD-duplicate pairs.
Multiple mechanisms have been proposed to explain why duplicate genes are retained. The gene balance hypothesis [4, 5, 6] has been proposed to specifically explain the retention of duplicates of TFs and other genes with larger numbers of interactions/functions [1, 42, 62, 63]. The hypothesis stipulates that duplicate genes with products that form multimeric complexes will tend to be retained to maintain the stoichiometry [5, 6], which enables future sub- and/or neofunctionalization. We found that 7.5 and 13.9% of duplicate TF pairs have retained > 80% of ancestral expression in both copies in the Stress and LightDev data sets, respectively; these pairs may still be retained due to dosage balance. Nonetheless, most duplicates have substantially diverged expression patterns. For example, in > 60% of cases (a case refers to a TF WGD-duplicate pair expressed in one of the expression data subsets), ≥1 ancestral expression states are found uniquely in each duplicate. This partitioning of ancestral subfunctions between both duplicate copies is a hallmark of subfunctionalization, in which both duplicate copies are selected to maintain the full set of ancestral functions.
However, the partition of ancestral expression states is highly asymmetric in most cases. Although such pairs can still be maintained by subfunctionalization, this asymmetry suggests that, if we assume that expression patterns can be treated as proxies of gene function, some TF WGD-duplicates take on only a small part of their ancestral functions and are thus defined as non-ancestral. We found that the non-ancestral copies tend to have more novel cis-regulatory sites (Fig. 4d), suggesting that the gain of these novel sites may lead to neofunctionalization or to escape from adaptive conflict, both of which involve the evolution of a new or improved function that is selected for. The above observations are consistent with the suggestion that subfunctionalization may be a transition state to neofunctionalization. The asymmetry may also suggest that the non-ancestral TF duplicate copies are decaying functionally and are on their way to becoming pseudogenes, as suggested in a case study. This can be due to genome fractionation/dominance, where one genome loses duplicates at a significantly higher frequency following WGD [59, 68].
Our understanding of the roles these mechanisms play in TF duplicate retention will benefit from more detailed modeling of TF evolution. In this study, the linear models for retention prediction and the ODE models of ancestral expression and regulatory site evolution are based on WGD events that are > 50 million years old. It will be crucial to consider data from other species with more recent WGD events to elucidate the early dynamics of TF evolution. In addition, we demonstrate that non-ancestral duplicates, which inherited fewer ancestral cis-regulatory sites, tend to gain novel sites. It remains to be determined experimentally whether these novel sites control new expression patterns and, most importantly, whether they are selected for rather than neutrally evolving. Finally, our study focuses on the overall pattern of TF evolution. It is anticipated that different TF families will evolve differently from each other. In future studies, it will be important to assess factors influencing retention for individual TF families.
Genome sequences, gene annotation, and expression data
Genome sequences, protein sequences, and gene annotation information for A. thaliana were obtained from Phytozome v10 (https://phytozome.jgi.doe.gov/pz/portal.html). WGDs were defined according to Bowers et al., who used BLAST to identify candidate duplicate genes in A. thaliana, with a hard Expect value cutoff of 1e-10. Duplicate pairs were used to identify syntenic regions, and these regions were dated by comparing duplicate pairs to orthologs from other species for which the time of divergence from A. thaliana had been estimated. Dating only employed pairs where matches between duplicates and orthologs spanned > 35 amino acids. Additionally, tandem genes in A. thaliana were defined as pairs of reciprocal best BLAST hits with an e-value < 1e-10 and a threshold based on the number of annotated, non-homologous genes between the putative tandem duplicates (≤ 5 intervening genes). Expression microarray data for this study were taken from AtGenExpress [22, 31, 58] and normalized using RMA in R as performed previously. The array data were divided into four groups: control conditions (in environmental condition experiments, Ctrl), light and development set (LightDev), abiotic and biotic stress treatments (Stress), and differential expression between stress treatments and controls (Diff) (Additional file 1: Table S9). The Diff data contain the log2 normalized difference between data sets for each stress condition/treatment/duration and its corresponding controls. In addition to microarray data, we have included a set of 214 RNA-sequencing samples (Additional file 1: Table S10) from A. thaliana Col1 wildtype from the Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra) as of September 30, 2014. Raw sequence reads were processed using Trimmomatic, with a quality threshold of 20, window size of 4, and hard-clipping length of 3 for leading and trailing bases. Processed reads were then mapped to the A. thaliana genome using Tophat2 and expression levels were calculated with Cufflinks, both with a maximum intron length of 5000 bp.
Defining TFs and other groups of genes in A. thaliana
TFs were defined according to the criteria used by the Plant Transcription Factor Database, with 1717 annotated TF loci in A. thaliana. To assess the degrees of TF duplicate retention after each WGD event, we defined a set of “functional groups” for comparison following the procedure used in Maere et al. To compare among genes with divergent functions and to ensure that the log odds indicative of the degrees of retention could be defined for each group, function groups were defined using Gene Ontology (GO) terms in the molecular function and biological process categories from The Arabidopsis Information Resource (https://www.arabidopsis.org/), and only groups containing 100–2000 genes and ≥ 20 WGD-duplicate pairs were kept. We excluded GO:0006355 (regulation of transcription, DNA-templated) due to its substantial overlap with the TF group defined above. The remaining 19 function groups are: ATP binding (GO:0005524), catalytic activity (GO:0003824), defense response (GO:0006952), DNA endoreduplication (GO:0042023), hydrolase activity, hydrolyzing O-glycosyl compounds (GO:0004553), kinase activity (GO:0016301), lipid binding (GO:0008289), oxidoreductase activity (GO:0016491), oxygen binding (GO:0019825), protein binding (GO:0005515), proteolysis (GO:0006508), response to auxin (GO:0009733), response to chitin (GO:0010200), RNA binding (GO:0003723), transferase activity, transferring glycosyl groups (GO:0016757), translation (GO:0006412), transporter activity (GO:0005215), ubiquitin-protein transferase activity (GO:0004842), and zinc ion binding (GO:0008270). A list of genes in each group can be found in Additional file 1: Table S1.
Fitting odds ratio of duplicate retention within each group of genes for each WGD event using linear models
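The retention measure R for function group g and WGD event w is described as an odds ratio. A form consistent with the definitions that follow is given here as a reconstruction under that assumption, not as the exact published expression (the published statistic may also have been log-transformed):

```latex
R_{g,w} = \frac{D_{g,w} / S_{g,w}}{D_{\neg g,w} / S_{\neg g,w}}
```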
Where Dg,w and D¬g,w are the numbers of WGD-duplicate genes in group g and those not in group g (¬g), respectively. Sg,w and S¬g,w are the numbers of WGD-singleton genes in group g and those not in group g (¬g), respectively. The 95% confidence interval around the point estimate Rg,w was defined using the “fisher.exact” function in R, the details of which can be found in Fay. For each WGD event, we established a general linear model with the glm function in the R environment, which relates Rg,w to a set of features of each gene group. The 34 features (predictor variables, Additional file 1: Table S2) were filtered with the following procedure to prevent over-fitting, because we have only 20 function groups. We calculated the correlation between all features to find all cases where the absolute value of the correlation was > 0.7. The considerations for which features to keep included: (1) how well each feature correlated with Rg,w on its own, (2) whether the feature was derived from a subset of another feature, and (3) the number of other features with a correlation > 0.7 (favoring the elimination of more features). In addition to the above criteria, one data set (protein-protein interactions) was eliminated because of a high frequency of missing values (88%). The synonymous substitution rate (dS) feature and any feature using dS in its calculation were also excluded because they would be highly correlated with WGD timing and would confound our analyses comparing the three WGD events. The filtering step left 11 features for building the general linear model. After fitting the model with the glm function, features were ranked according to their p values from the smallest to the largest, and the feature with the largest p value was dropped. The model was then fit to the reduced feature set and features were once again ranked.
This process was repeated until the F-statistic (a measure of goodness of fit of the given model against a null model where all coefficients are set to zero) of the model was maximized, and the final p value was calculated based on the maximal F-statistic. To evaluate the robustness of our models, we generated truncated versions of our data sets by leaving out one functional group and refitting the model, eliminating additional parameters if necessary to obtain the F-statistic-maximizing models. Parameter estimates for the final model and each leave-one-out model can be found in Additional file 1: Tables S3-S5.
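The first two steps of this procedure can be illustrated with a minimal Python sketch. It assumes the standard 2×2 odds ratio for the retention statistic and Pearson correlation for the feature screen; neither choice is confirmed by the text, and the function names `retention_odds_ratio` and `correlated_pairs` are hypothetical. The analysis itself was carried out in R.

```python
from math import sqrt


def retention_odds_ratio(dup_in, single_in, dup_out, single_out):
    """Odds of being a WGD-duplicate inside a group relative to outside it.

    Assumes the standard 2x2 odds ratio; all four counts must be non-zero.
    """
    return (dup_in / single_in) / (dup_out / single_out)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlated_pairs(features, cutoff=0.7):
    """List feature-name pairs whose absolute correlation exceeds the cutoff.

    `features` maps a feature name to its per-group values.
    """
    names = sorted(features)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(features[a], features[b])) > cutoff
    ]
```

Pairs flagged by `correlated_pairs` would then be resolved by hand using the three considerations listed above; the sketch does not automate that choice.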
Inferring ancestral expression levels and cis-regulatory sites
DNA-binding domains were identified in TF protein coding sequences using hmmscan via HMMER3 based on the Pfam-A version 29.0 HMMs with a threshold e-value of 1e-5. TFs were classified into families according to their DNA-binding domains, and 44 of 59 TF families with ≥4 members were used for further analysis (Additional file 1: Table S11). For each TF family, full-length protein sequences were aligned using MAFFT with default parameters. The phylogeny of each TF family was obtained using RAxML with the following settings: rapid bootstrapping algorithm, 100 runs, GAMMA rate heterogeneity, and the JTT amino-acid substitution model. These trees were then mid-point rooted with retree in PHYLIP. Given the prevalence of duplication events and the tendency for TF duplicates to be retained in the plant lineages, homologs from other plants will be interlaced with TFs from A. thaliana in the phylogenies. This makes it challenging to hypothesize proper outgroup sequences. As such, we determined that midpoint rooting, while less than optimal, was the most consistent method we could apply across all TF family trees.
The mid-point rooted trees were used to infer the ancestral gene expression states and the cis-regulatory sites of WGD-duplicate TF pairs with BayesTraits, as was done in our earlier study. BayesTraits randomly assigns an evolutionary rate to the transition between possible states and uses these rates to determine the most probable state of a given ancestral node. The likelihood of the observed states is then calculated and used to evaluate the current tree model and adjust evolutionary rates. This process is repeated iteratively to maximize the likelihood until either a maximum number of iterations or convergence is reached. This process was performed 100 times for each tree in order to evaluate the robustness of the inferred state, and we only used ancestral states that were present in > 50 trees, which is a non-trivial threshold as there are five possible states for each expression condition (each quartile and the ambiguous state). Further detail can be found at http://www.evolution.rdg.ac.uk/BayesTraitsV2.0Files/TraitsV2Manual.pdf.
The expression data sets used are described in Additional file 1: Table S9. The discretized gene expression state (0,1,2,3) was based on the quartiles of gene expression levels within each experiment. Thus the inferred ancestral expression state was also discretized. For cis-regulatory sites, the binding targets of 345 A. thaliana TFs were defined based on DNA Affinity Purification-Sequencing data from the Plant Cistrome Database (http://neomorph.salk.edu/dap_web/pages/index.php), where at least 5% of the reads associated with a site were found to be in the 200 bp peak region. We inferred whether a site was present or absent (0,1) in the common ancestor of a duplicate pair. For both expression and regulatory site data, in cases where there was a missing value, it was explicitly included as an ambiguous state. To call the ancestral state from the expression or cis-regulatory site data, we required a posterior probability > 0.5. Cases where the called state was ambiguous or no majority existed were excluded from further analysis.
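The support threshold across repeated runs can be sketched as a simple majority call. This is a minimal illustration, not the BayesTraits workflow itself: the function name `consensus_state` is hypothetical, the input is assumed to be one called state per run, and the ambiguous state is treated like any other label.

```python
from collections import Counter


def consensus_state(run_states, min_support=50):
    """Return the state called in more than `min_support` runs, else None.

    `run_states` holds one inferred ancestral state per repeated run
    (100 runs in the procedure described above).
    """
    if not run_states:
        return None
    state, count = Counter(run_states).most_common(1)[0]
    return state if count > min_support else None
```

With five possible states per condition, a state called in more than 50 of 100 runs is well above the 20 expected if calls were spread evenly, which is the sense in which the threshold is non-trivial.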
Asymmetry of the retention of ancestral expression and regulatory sites
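The degree of asymmetry Y for a duplicate pair (A, B) is bounded by the conditions stated next. One simple form that satisfies them is given here as an assumption, not as the exact published expression:

```latex
Y_{A,B} = \left| F_A - F_B \right|
```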
Where FA and FB are the frequency with which ancestral expression was retained for duplicates A and B, respectively. By definition, FA + FB = 1, such that YA,B has value between 0 (when FA = FB, no asymmetry) and 1 (when either FA or FB = 1, maximum asymmetry).
With the asymmetry values for each TF pair, an average asymmetry value of all TF pairs was calculated for each expression dataset, as well as for the union of all TF duplicates from all datasets (1239 values total) to assess how the observed degree of asymmetry compared to what would be expected from if every partitioned state was independent (i.e. each gene has an equal chance of retaining the ancestral state regardless of the outcome of previous partitioning events). We also defined two subsets of the LightDev, Stress, and Diff data sets using the first and last element of each times series respectively because the expression of genes at different points of a time series are potentially correlated. The number of genes with > 5 partitioned conditions genes decreased in the subsets of LightDev (all = 334, first = 327, last = 325), Stress (all = 347, first = 265, last = 272), and Diff (all = 351, first = 277, last = 269) data sets. We excluded the Ctrl data set because it is composed of only four series, mean that no genes could pass the > 5 partitioned condition cutoff.
The expected distribution of asymmetry values for the expression states of TF WGD-duplicates (under the assumption of independent of partitioning events) was determined by conducting a series of Bernoulli trials equal to the total number of partitioned states amongst TF-WGD duplicates. In each of these trials there was an equal probability that either the first or second duplicate receive the ancestral state. The results of these trials were then grouped according the exact per gene distribution of partitioned states in TF-WGD duplicates and an asymmetry value was calculated for each group. This procedure was repeated 1000 times using an independent set of trials and subsequent groupings.
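The null simulation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the asymmetry value is taken to be |FA - FB| (one form consistent with the stated bounds, not confirmed by the text), each replicate is summarized by its mean asymmetry, and the names `asymmetry` and `simulate_null` are hypothetical.

```python
import random


def asymmetry(kept_by_a, kept_by_b):
    """Assumed asymmetry |F_A - F_B| for one pair, from counts of
    partitioned states retained by copy A and by copy B."""
    total = kept_by_a + kept_by_b
    return abs(kept_by_a - kept_by_b) / total


def simulate_null(partitioned_counts, n_reps=1000, seed=0):
    """Mean asymmetry per replicate under independent partitioning.

    `partitioned_counts` holds, for each duplicate pair, its observed
    number of partitioned states; each state is assigned to copy A or
    copy B with equal probability (one Bernoulli trial per state).
    """
    rng = random.Random(seed)
    replicate_means = []
    for _ in range(n_reps):
        values = []
        for n_states in partitioned_counts:
            to_a = sum(rng.random() < 0.5 for _ in range(n_states))
            values.append(asymmetry(to_a, n_states - to_a))
        replicate_means.append(sum(values) / len(values))
    return replicate_means
```

Grouping the trials by the observed per-pair counts, as done here, keeps the null distribution comparable to the data, because pairs with few partitioned states show high asymmetry by chance alone.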
For assessing cis-regulatory site asymmetry, only TF WGD-duplicates with ≥ 5 inferred ancestral cis-regulatory sites were considered (402 WGD-duplicate pairs total). As with expression state asymmetry, the ancestral and non-ancestral duplicates in each pair were defined according to the number of inherited ancestral sites. For each WGD-duplicate pair, the degree of cis-regulatory site asymmetry was defined analogously to that for expression. The expected distribution of asymmetry values for the cis-regulatory sites of TF WGD-duplicates was determined using the same procedure as for expression states.
Ordinary differential equation models of TF state evolution
Where O, +, and - are the frequencies of TF WGD-duplicate genes retaining the ancestral expression state, having a higher-than-ancestral expression level, and having a lower-than-ancestral expression level, respectively. The parameters x, y, w, and z define the transition rates between these states. This system of equations was solved in Maxima (http://maxima.sourceforge.net/index.html), and the best-fit parameters for the observed distribution of duplicate pairs were determined using maximum likelihood estimates calculated with the bbmle package in R (https://cran.r-project.org/web/packages/bbmle/index.html). Non-linear minimization was used to approximate an initial guess, although the actual initial parameters often needed to be adjusted to reach a convergent solution. The best-fit parameters for this single-duplicate expression state evolution model can be found in Additional file 1: Table S12.
Where O, I, and II are the frequencies of TF WGD-duplicate pairs in which both, one, or neither duplicate retained the ancestral expression state. The parameters x, y, w, and z define the transition rates between these states. This system of equations was solved, and the initial and best-fit parameters were estimated, in the same fashion as above. The best-fit parameters for this pairwise expression state evolution model can be found in Additional file 1: Table S12. The same model was also applied to ancestral regulatory sites, with O, I, and II representing the frequencies of TF WGD-duplicate pairs in which both, one, or neither duplicate retained the ancestral regulatory site.
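To make the structure of such a pairwise model concrete, the following sketch integrates a three-state system numerically and evaluates a multinomial log-likelihood for observed pair counts. It rests on several assumptions that are not confirmed by the text above: the states are taken to form a linear chain in which x is the rate O to I, y the rate I to O, w the rate I to II, and z the rate II to I; all pairs are assumed to start in state O; and fourth-order Runge-Kutta integration is substituted for the symbolic solution obtained in Maxima. All function names and numerical values are hypothetical.

```python
import math


def derivatives(state, x, y, w, z):
    """Right-hand side of the ASSUMED chain O <-> I <-> II.

    x: O -> I, y: I -> O, w: I -> II, z: II -> I (assumed assignment).
    """
    o, i, ii = state
    d_o = -x * o + y * i
    d_i = x * o - (y + w) * i + z * ii
    d_ii = w * i - z * ii
    return (d_o, d_i, d_ii)


def rk4_step(state, dt, rates):
    """Advance the state by one fourth-order Runge-Kutta step."""
    def shifted(k, scale):
        return tuple(s + scale * dk for s, dk in zip(state, k))

    k1 = derivatives(state, *rates)
    k2 = derivatives(shifted(k1, dt / 2), *rates)
    k3 = derivatives(shifted(k2, dt / 2), *rates)
    k4 = derivatives(shifted(k3, dt), *rates)
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )


def integrate(rates, t_end, steps=1000, start=(1.0, 0.0, 0.0)):
    """Frequencies (O, I, II) at time t_end, starting with all pairs in O."""
    state = start
    dt = t_end / steps
    for _ in range(steps):
        state = rk4_step(state, dt, rates)
    return state


def log_likelihood(counts, rates, t_end):
    """Multinomial log-likelihood (up to a constant) of observed counts of
    pairs in states O, I, and II given the model frequencies."""
    freqs = integrate(rates, t_end)
    return sum(n * math.log(max(f, 1e-300)) for n, f in zip(counts, freqs))


if __name__ == "__main__":
    rates = (0.8, 0.1, 0.2, 0.05)   # hypothetical x, y, w, z
    counts = (120, 400, 80)         # hypothetical observed pair counts
    print(integrate(rates, t_end=1.0))
    print(log_likelihood(counts, rates, t_end=1.0))
```

In a fitting procedure, a log-likelihood of this kind would be maximized over x, y, w, and z by a numerical optimizer; under the assumed chain, a low rate of leaving state I relative to the rate of entering it corresponds to preferential maintenance of the partitioned state.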
We thank Johnny Lloyd and Zing Tsung-Yeh Tsai for their advice regarding modeling duplicate retention and analyzing the importance of predictive features.
This work was supported in part by the National Science Foundation (IOS-1546617 and DEB-1655386) and the Department of Energy Great Lakes Bioenergy Research Center (DOE Office of Science BER DE-SC0018409) to S.-H.S., and an NSF Graduate Research Fellowship (Fellow ID: 2015196719) and Graduate Research Opportunities Worldwide Fellowship to C.B.A.
Availability of data and materials
AtGenExpress expression data are available through The Arabidopsis Information Resource (http://www.arabidopsis.org/portals/expression/microarray/ATGenExpress.jsp). All RNA-Seq data are available through the SRA at the accessions listed in Additional file 1: Table S7. DNA Affinity Purification-Sequencing data are available at http://neomorph.salk.edu/dap_web/pages/index.php. Software is available at the following sites: MAFFT (https://mafft.cbrc.jp/alignment/software/), HMMER (http://hmmer.org/), RAxML (https://sco.h-its.org/exelixis/software.html), PHYLIP (http://evolution.genetics.washington.edu/phylip.html), and BayesTraits (http://www.evolution.rdg.ac.uk/BayesTraitsV3.0.1/BayesTraitsV3.0.1.html).
NLP and SHS designed the study. NLP, CBA, and EFW performed the analyses. All authors wrote the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.