Background

Given the pervasiveness of whole genome doubling (WGD) and tripling (WGT) in the ancestral lineages of plant species, a widespread feature of plant genome publications is the display of the distribution of duplicate gene identities (or similarities, distances, Ks, ...). This is illustrated in Fig. 1, which represents the distribution of similarities between syntenically aligned duplicate genes [1, 2] in the durian (Durio zibethinus) genome [3]. The two visually distinguishable but overlapping parts of the distribution are the legacy of two successive polyploidization events. The mean (t1 and t2), variance and proportion of the total sample for each component of the distribution can be estimated by mixture model techniques such as EMMIX [4].

Fig. 1

Distribution of gene pair similarities. Pairs in the Durio zibethinus (COGE ID 51764) genome after two rounds of whole genome tripling. Discrimination point H=85.2%. Cut-off for pairs not originating in polyploidization > 98%

These distributions can be explained and generated by a discrete-time branching process model of polyploidization and fractionation (not time-homogeneous). Mathematically, the model is the product of successive r-nomial distributions, the output of one being the input of the next, whose parameters, the r-nomial probabilities, express resistance to gene loss through fractionation [5–9].

These parameters, which are key to understanding the cycle of polyploidization and fractionation, can be estimated using the information inferred from the distribution of gene pair similarities.

One aspect of the branching process that cannot be inferred from the study of pairs only is the ploidy r of the various events. Thus, durian is known to have undergone two whole genome triplings [9, 10], the γ tripling almost 120 million years ago [11], ancestral to most flowering plants, and a more recent tripling (10-20 Mya) not shared by even closely related species, like cacao. By just looking at the distribution of similarities in Fig. 1 engendered by these events, however, there is no direct way of knowing whether one or both of the two component distributions represent WGD, WGT, or other polyploid events. To resolve this problem, the main point of this presentation, we propose to add triples of similar (> 50 %) genes to the study of gene pairs. Our technique, based on the branching process, responds to our concern in previous ad hoc treatments [7, 12] of how to use triples rigorously from a statistical point of view.

In the next section, we summarize the general branching process approach to analyzing the distribution of gene pair similarities. We then focus on four competing two-event models involving WGD and/or WGT. We define four types of gene triplet according to whether the gene pairs within them were created by the first event, the second event, or both. Within each model, we calculate the expected number of triplets of each type. This creates an “underlying” profile of triplet distribution to compare to the “observed” profile of triplets in the data. Because of the way the two components of the pair similarity distribution overlap, however, the origin of each triplet in the data is not always obvious. Thus we create a “predicted” profile of triplet distribution by grafting a paralog divergence model onto the branching process, making use of a maximum likelihood dividing point between the two components. We apply this analysis to the genomes of durian, poplar (Populus trichocarpa) [13] and cabbage (Brassica oleracea) [14], each of which has a different sequence of polyploidization events. These histories are captured correctly for the first two, but the results for B. oleracea prompt an extension to three-event models, which we carry out, and suggest further work on higher numbers of events.

The branching process and two-event models

Denote by mi the total number of individuals (genes) at time ti, i=1,…,n. Set m1=1. At time ti, i=1,…,n−1, each of the mi genes is replaced by ri≥2 progeny, but only j≥1 of them survive until time ti+1, with probability \(u_{j}^{(i)}\).

Of the total of mi genes at time ti, let \(a_{j}^{(i)}\) be the number for which j progeny survive until time ti+1, so that

$$ m_{i}=\sum_{j=1}^{r_{i}}{a_{j}^{(i)}},\ \ \ \ \ m_{i+1}=\sum_{j=1}^{r_{i}}j{a_{j}^{(i)}}. $$
(1)

The probability distribution of the evolutionary histories represented by the given \(\mathbf {r}=\{r_{i}\}_{i=1}^{n-1}\) and the variable \(\mathbf {a}=\left \{a_{j}^{(i)}\right \}_{j=1\dots r_{i}}^{i=1\dots n-1}\) is

$$ P(\mathbf{r;a})= \prod_{i=1}^{n-1}\Bigg[\binom{m_{i}}{a_{1}^{(i)},\dots,a_{r_{i}}^{(i)}}\prod_{j=1}^{r_{i}} \left(u_{j}^{(i)}\right)^{a_{j}^{(i)}}\Bigg]. $$
(2)

The expected number of genes at time tn is

$$ \mathbf{E}(m_{n})=\sum_{\mathbf{a}}P(\mathbf{r;a}){m_{n}}. $$
(3)

This is illustrated by the sample trajectory in Fig. 2, in which a WGT at time t1, with all three progeny surviving (the 3-nomial sample has value 3), is followed by another independent WGT at time t2 where the three lineages show one, two or all three offspring surviving, i.e., the independent 3-nomial samples have values 1, 2 and 3, respectively. We will study the case of two successive polyploidy events, with r1 and r2 taking on values 2 or 3, i.e., WGD or WGT, in all four combinations. These form the set \(\mathscr {M}\) of “models”, denoted (3,3), (3,2), (2,3) and (2,2). Because of the limited information that can be inferred about each component of the distribution of similarities, we can only infer the probabilities of samples of value 1, 2, or 3, so that we are limited to 2-nomials and, with some assumptions, 3-nomials, by far the biologically most important cases.

Fig. 2

Sample trajectory, starting from a single gene, of branching process based on two whole genome triplings
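The expectation in Expression (3) can be checked numerically on small cases. The following is a minimal sketch, not the authors' implementation: it propagates the probability distribution of mi from one event to the next by convolution, which is equivalent to summing over the histories a because mn depends on a only through the survivor counts. The function names and the survival probabilities in the usage note are hypothetical, chosen only for illustration.

```python
def convolve(p, q):
    """Distribution of the sum of two independent counts given as {value: prob}."""
    out = {}
    for a, pa in p.items():
        for b, pb in q.items():
            out[a + b] = out.get(a + b, 0.0) + pa * pb
    return out


def next_generation(pmf_m, u):
    """Distribution of m_{i+1}, given the distribution of m_i and the
    survival probabilities u[j] = P(j of the r_i progeny survive), j >= 1."""
    # totals[k] is the distribution of the number of survivors from k genes.
    totals = [{0: 1.0}]
    for _ in range(max(pmf_m)):
        totals.append(convolve(totals[-1], u))
    out = {}
    for m, pm in pmf_m.items():
        for total, pt in totals[m].items():
            out[total] = out.get(total, 0.0) + pm * pt
    return out


def expected_genes(survival_probs):
    """E(m_n) for a process started from a single gene (m_1 = 1)."""
    pmf = {1: 1.0}
    for u in survival_probs:
        pmf = next_generation(pmf, u)
    return sum(m * p for m, p in pmf.items())


# Hypothetical (3,2) model: a WGT followed by a WGD.
u_first = {1: 0.2, 2: 0.3, 3: 0.5}   # illustrative values only
u_second = {1: 0.5, 2: 0.5}          # illustrative values only
print(expected_genes([u_first, u_second]))
```

Because the progeny of different genes survive independently, the result should agree with the product of the mean numbers of survivors per event, here 2.3 × 1.5.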

To infer parameters like fractionation rate in the polyploidization history of a genome, based on the distribution of gene pair similarities, we need to know the ploidies ri of the various events. This motivates us to extend our study from gene pairs only to also include gene triplets.

Methods

Triplet probabilities in four models

With no loss of generality, we study triplets of similarities among three genes, rather than the triplets of genes themselves. A triplet is a (multi-)set {ti,tj,tk}, where each of i, j and k may be 1 or 2. Let \(\mathscr {T}=\{\{t_{1},t_{1},t_{1}\}, \{t_{1},t_{1},t_{2}\}, \{t_{1},t_{2},t_{2}\},\{t_{2},t_{2},t_{2}\}\}\). We classify each triplet according to whether each of the three paralogies among the three pairs of genes originates in the first event or the second event. Thus in the branching process illustration in Fig. 2 the blue dots represent genes that form a {t2,t2,t2} triplet and the red dots form a {t1,t1,t2} triplet. The single red dot combines with the three pairs of blue genes to form three additional {t1,t1,t2} triplets, and there are a further nine {t1,t1,t2} triplets in the sample. We can calculate the expected number of triplets of each type by enumerating the triplets of each type in each possible trajectory of the process, and multiplying by the probability of this trajectory from Expression (2). The enumeration within a trajectory is easily done by considering every triple of genes at time tn and identifying the last common ancestor of each pair. The probabilities of the trajectories have previously been calculated [5–7]. We then sum these results over all trajectories. The resulting formulae are listed in Table 1.
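The enumeration within a single trajectory can be sketched as follows. This is an illustrative sketch under the assumption of a two-event process, in which two genes are paralogs from the second event exactly when they descend from the same survivor of the first event; the function name and the encoding of a trajectory as a list of survivor counts are hypothetical.

```python
from itertools import combinations
from collections import Counter


def triplet_profile(survivors_per_lineage):
    """Count triplet types among the genes present at the end of a
    two-event trajectory.

    survivors_per_lineage[k] is the number of second-event survivors in the
    k-th lineage that survived the first event. The key of the returned
    Counter is the number of t2 similarities in the triplet (0 to 3).
    """
    # A gene is identified by (first-event lineage, copy within lineage).
    genes = [(lineage, copy)
             for lineage, n in enumerate(survivors_per_lineage)
             for copy in range(n)]
    profile = Counter()
    for trio in combinations(genes, 3):
        # A pair diverged at t2 if both genes share a first-event lineage,
        # otherwise its last common ancestor predates the first event (t1).
        n_t2 = sum(1 for a, b in combinations(trio, 2) if a[0] == b[0])
        profile[n_t2] += 1
    return profile


# Trajectory of Fig. 2: three lineages with 1, 2 and 3 survivors.
print(triplet_profile([1, 2, 3]))
```

For the trajectory of Fig. 2 this counts one {t2,t2,t2} triplet and 1+3+9=13 {t1,t1,t2} triplets, in agreement with the text; the remaining triplets of the sample take one gene from each lineage and are of type {t1,t1,t1}.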

Table 1 Formulae for the expected numbers of triplets WM(Δ) of each type Δ, by branching model M

This table provides the expected number of triplets WM(Δ) of each type \(\Delta \in \mathscr {T}\) produced by the branching process for each model \(M \in \mathscr {M}\). For a given genome, the four numerical values of WM(·) constitute the underlying profile of the model M. The underlying profiles for each model based on maximum likelihood values of u and v are given in the top half of Tables 2, 3 and 4 below. Because of the limited number of parameters that can be inferred from the distribution of similarities, we assume u=u2 and v=v2.

Table 2 Durian model predictions before (underlying) and after imposition of mutational divergence
Table 3 Populus model predictions before (underlying) and after imposition of mutational divergence
Table 4 Brassica oleracea model predictions before (underlying) and after imposition of mutational divergence

It can be seen in Table 1 that the profiles of triplet types produced by the different models are very different. If we could observe the triplet profile produced by the branching process underlying a given set of data, we could easily identify which model was responsible. However, we only see the data after mutational processes have acted. When a mutational divergence model is applied to the similarities, a single trajectory of the branching process, producing one ideal type of triplet, can produce many very different data triplets.

We can try to categorize the set of triplets in a set of data by how closely they resemble one of the four basic types. If the two components of the similarity distribution were completely separate, this would also be an easy matter. But the usual large overlap between the components means that we cannot automatically ascribe any data triplet to any particular underlying triplet.

A statistical approach

As a solution to this problem, we first try to find the best transition, or cutoff, point H somewhere between the peaks of the two components. For this we compute the product of the probability density at each similarity value less than or equal to H, according to the component with mean at t1, and the density at each similarity value greater than H, according to the component with mean at t2, and take H to be the value that maximizes this product. That is,

$$ {\begin{aligned} H=\operatorname*{arg\,max}_{h\in(0,1)}\prod_{x\le h}\frac{1}{\sqrt{\sigma_{1}^{2}}}\exp\left[-\frac{(x-\mu_{1})^{2}}{2\sigma_{1}^{2}}\right]\prod_{x> h}\frac{1}{\sqrt{\sigma_{2}^{2}}}\exp\left[-\frac{(x-\mu_{2})^{2}}{2\sigma_{2}^{2}}\right] \end{aligned}} $$
(4)
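A minimal sketch of the search in Expression (4) follows, assuming the component means and standard deviations have already been estimated (for instance by a mixture analysis). It works with logarithms to avoid numerical underflow and only considers cut points lying between two distinct neighbouring similarity values, since the product changes only when h crosses a data point. The function names and the toy data are hypothetical.

```python
import math


def log_kernel(x, mu, sigma):
    """Logarithm of one factor of Expression (4)."""
    return -math.log(sigma) - (x - mu) ** 2 / (2 * sigma ** 2)


def best_cutoff(similarities, mu1, sigma1, mu2, sigma2):
    """Return the cut point H maximizing the product in Expression (4).

    Candidate cuts are midpoints between consecutive distinct sorted
    similarities, so both sides of the cut are non-empty.
    Returns None if there are fewer than two distinct values.
    """
    xs = sorted(similarities)
    ll_low = 0.0                                            # points <= h
    ll_high = sum(log_kernel(x, mu2, sigma2) for x in xs)   # points > h
    best_h, best_ll = None, -math.inf
    for k in range(1, len(xs)):
        x = xs[k - 1]
        # Move x from the recent (t2) component to the older (t1) one.
        ll_low += log_kernel(x, mu1, sigma1)
        ll_high -= log_kernel(x, mu2, sigma2)
        if xs[k] == x:
            continue  # no cut point separates equal values
        if ll_low + ll_high > best_ll:
            best_h, best_ll = (x + xs[k]) / 2, ll_low + ll_high
    return best_h


# Toy similarities (illustrative only), with two well-separated groups.
toy = [0.90, 0.72, 0.94, 0.70, 0.88, 0.76, 0.74, 0.92]
print(best_cutoff(toy, mu1=0.73, sigma1=0.03, mu2=0.91, sigma2=0.03))
```

With real data the two components overlap, so the maximizing cut point falls inside the overlap region rather than in an empty gap as in this toy example.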

We then categorize the triplets in the data according to the transition value H. If a similarity x is less than or equal to H, we classify it as being produced at time t1 and write x∈I. If it is greater than H, we classify it as being created at time t2 and write x∈J. This creates eight “octants”, defined by the \(8=2^{3}\) combinations of the three triplet similarities, which in turn are collapsed into the four types of triplet in \(\mathscr {T}\) tabulated in Table 1: {t1,t1,t1}, {t1,t1,t2} (representing three octants), {t1,t2,t2} (representing three octants) and {t2,t2,t2}. We call the numbers of triplets of the four types the observed profile.
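The collapse of the eight octants into four triplet types amounts to counting how many of the three similarities exceed H. A short sketch, with hypothetical names and toy similarity values, assuming each data triplet is given as three pairwise similarities:

```python
TYPE_NAMES = ("{t1,t1,t1}", "{t1,t1,t2}", "{t1,t2,t2}", "{t2,t2,t2}")


def observed_profile(triplets, H):
    """Count data triplets by type, given the transition value H.

    Each triplet is a sequence of three pairwise similarities. A similarity
    greater than H is attributed to the recent event (t2), otherwise to the
    older event (t1); the type depends only on how many exceed H.
    """
    counts = {name: 0 for name in TYPE_NAMES}
    for sims in triplets:
        n_recent = sum(1 for x in sims if x > H)
        counts[TYPE_NAMES[n_recent]] += 1
    return counts


# Toy triplets of similarities (illustrative values only).
toy_triplets = [
    (0.70, 0.72, 0.75),
    (0.71, 0.74, 0.90),
    (0.80, 0.88, 0.91),
    (0.90, 0.92, 0.95),
]
print(observed_profile(toy_triplets, H=0.852))
```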

Although we can compare the observed profile with the underlying profile, this comparison is not very meaningful. It neglects the fact that many of the data triplets classified as one type may be generated by a different underlying type, not as an error, but simply as a result of the normal process of duplicate gene sequence divergence, which is clearly operative in the more or less dispersed and overlapping components of the distribution of gene pair similarities.

We can, however, take this process into account in producing a predicted profile for each model. We first calculate the variance-covariance matrix Σ of the t1 similarities in triplets containing at least two of them and of the t2 similarities in triplets containing at least two of these. We fix covar(t1,t2)=0, in accordance with the Markov nature of the branching process.

For each model \(M\in \mathscr {M}\), we construct the predicted profile of triplet types by integrating over the trivariate normal with means drawn from the EMMIX analysis or identified by eye with the distribution component peaks, and covariance estimated as above, restricted to the domains defined by the transition point. Thus our prediction of {t1,t1,t1} triplets would involve a restriction to the domain (I,I,I) where all three coordinates are less than or equal to H. Our prediction of apparently {t1,t1,t2} triplets would be confined to the three octants where two coordinates are less than or equal to H and one is greater. The integrals are weighted by WM(Δ), the expected number of triplets. For example, the predicted number of {t1,t1,t2} triplets would be

$$\begin{array}{@{}rcl@{}} P_{M}&=&\sum_{\Delta\in \mathscr{T}}W_{M}(\Delta)\int_{(I,I,J)\cup(I,J,I)\cup(J,I,I)} N({\mathbf X; \mu,\Sigma})d\mathbf{X}, \end{array} $$
(5)

where μ is the vector of component means. To summarize, we have defined three types of triplet profile:

  • the observed profile, based on triples of genes all having high similarity scores with each other, which can be compiled from the list of gene pairs produced by the SYNMAP function of COGE [1, 2],

  • the underlying profile for each model \(M\in \mathscr {M}\), produced by the combinatorial probabilities of the branching process,

  • the predicted profile for each model \(M\in \mathscr {M}\), produced by grafting a gene pair divergence model on the underlying profile.

For comparative purposes we normalize the underlying and predicted profiles so that the total number of triplets is the same as in the observed profile.
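The construction of a predicted profile from an underlying profile can be sketched as follows. This is a simplified illustration, not the computation used for the tables: it assumes a diagonal Σ (independent similarities within a triplet), so that each octant integral in Expression (5) factors into univariate normal probabilities, whereas the method described above estimates the covariances among similarities of the same component. All names and numerical values are hypothetical.

```python
from itertools import product
from math import erf, sqrt


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))


def predicted_profile(underlying, mu, sigma, H):
    """Predicted triplet counts under an independence assumption.

    underlying[k] is the expected number W of underlying triplets having
    k similarities from the recent event (k = 0..3). mu and sigma are
    (older, recent) component means and standard deviations.
    Returns a list indexed by the apparent number of similarities above H.
    """
    # Probability that a similarity from each component falls above H.
    p_above = [1.0 - normal_cdf(H, mu[c], sigma[c]) for c in (0, 1)]
    predicted = [0.0] * 4
    for n_recent, weight in underlying.items():
        components = [1] * n_recent + [0] * (3 - n_recent)
        # Sum over the eight octants: each coordinate is below or above H.
        for octant in product((0, 1), repeat=3):
            prob = 1.0
            for comp, above in zip(components, octant):
                prob *= p_above[comp] if above else 1.0 - p_above[comp]
            predicted[sum(octant)] += weight * prob
    return predicted


def normalize(profile, total):
    """Rescale a profile so that its entries sum to the observed total."""
    scale = total / sum(profile)
    return [scale * value for value in profile]


# Hypothetical underlying profile and component parameters.
underlying = {0: 6.0, 1: 13.0, 3: 1.0}
pred = predicted_profile(underlying, mu=(0.73, 0.91), sigma=(0.05, 0.04), H=0.82)
print(normalize(pred, total=200))
```

When the components barely overlap (small standard deviations), the predicted profile approaches the underlying one; as the overlap grows, counts leak between neighbouring triplet types, which is the effect the predicted profile is meant to capture.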

Results

We compare the three profiles for three well-studied flowering plant genomes that are known to have undergone multiple polyploidizations in the last 120 million years, to see if our method predicts the right combination of WGT and WGD.

Durian

Starting with the durian genome, the (3,3) model, known to represent the true evolutionary history, is the only one with a credible predicted profile in Table 2, that is, the only one that has reasonable values for all four triplet types. The three others all fail to predict one or both of the {t1,t1,t1} and {t2,t2,t2} triplets. This indicates the potential of our statistical method, since the original durian sequence article [3] did not recognize the second event as a tripling.

Poplar

The predicted profile of the (3,2) model in Table 3 summarizes the true history of the Populus trichocarpa genome (COGE ID 25127), whose gene pair similarity distribution is displayed in Fig. 3. In addition to γ, this genome carries the ancient “salicoid” WGD, which it shares with other members of the Salicaceae family [15]. (3,2) is the only model that correctly identifies both the γ event as a WGT and the more recent event as a WGD.

Fig. 3

Distribution of syntenic gene pair similarities in Populus trichocarpa. Discrimination point H=84.5%. Cut-off for pairs not originating in polyploidization > 97.5%

Cabbage

A recent ancestor of the Brassica oleracea genome (COGE ID 26018) underwent a WGT that gave rise not only to the crucifers and mustard genera, but also radishes and other related genera. Earlier than that, a WGD called the α doubling is apparent in the whole range of genera of the family Brassicaceae, including Arabidopsis. A still earlier WGD, the β doubling, occurred in the order Brassicales lineage that includes the Brassicaceae. Thus the cabbage genome counts γ, β, α and a Brassica WGT in its evolutionary history [16, 17]. In Fig. 4, we see that at least the two recent components are clearly distinguishable, so we first carried out an analysis excluding gene pairs of similarity less than 76%.

Fig. 4

Distribution of syntenic gene pair similarities in Brassica oleracea. Discrimination point H=83.3%. Cut-off for pairs not originating in the two recent polyploidizations ≤76%

This analysis in Table 4 does not give satisfactory results. Indeed, the (3,3) profile matches the observed profile much better than the (2,3) profile does. This can be partially attributed to the substantial number of similarities greater than 76% generated by the β and even γ events.

We can partially correct this by adding a third event to our branching process. This leads to eight models instead of four, and ten kinds of triplet, summarized in Table 5.

Table 5 Formulae for the expected numbers of triplets after three events

We fix the mean of the first component at 71% to account for the γ event and that of the second component, representing the α event, at 79.5%, and we find two ML discrimination points, as in Fig. 5.

Fig. 5

Distribution decomposed into three events. Discrimination points H1=73%, H2=85%

The results of this are shown in Table 6. Here the (3,2,3) model is just as close to the observed profile as the competing (3,3,3) model, with the notable exception of {t2,t2,t2} triplets. The absence of a distinction between the α and β events means that the similarities they generate are all conflated to yield an excess of t2, and consequently an excess of t2 triples, so that a WGT is inferred rather than a WGD.

Table 6 Brassica oleracea three-event model predictions before (underlying) and after imposition of mutational divergence

The obvious remedy for this would be to construct four-event models (sixteen of them), with profiles consisting of 20 different triplets. We leave this for further work. In general the number of models is exponential, \(2^{m}\) for m events, while the number of triples follows the polynomial (i.e., cubic) tetrahedral sequence (A000292 in [18]) \(\frac {1}{6}m(m+1)(m+2)\), so that eventually there would not be enough data to discriminate among the models. Choosing among models with different numbers of events would require some standard for model selection such as the Akaike or Bayesian information criteria.
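The growth of the two counts can be tabulated directly. The sketch below assumes, as in the text, that each event is either a WGD or a WGT, and that a triplet type is a multiset of three event labels; the function names are hypothetical.

```python
def n_models(m):
    """Number of models when each of m events is either a WGD or a WGT."""
    return 2 ** m


def n_triplet_types(m):
    """Number of multisets of size three drawn from m event labels,
    i.e., the tetrahedral number m(m+1)(m+2)/6."""
    return m * (m + 1) * (m + 2) // 6


for m in range(2, 9):
    print(m, n_models(m), n_triplet_types(m))
```

For m=2, 3 and 4 this gives 4, 8 and 16 models against 4, 10 and 20 triplet types, matching the counts quoted for the two-, three- and four-event cases.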

Conclusions

We model the process of fractionation to account for the distribution of gene pair similarities after a number of whole genome doublings, triplings, etc., each followed by a period of duplicate gene loss. The model is a discrete-time branching process, with synchronous birth number ri≥2 across the ith generation population and deaths determined by an ri-nomial law conditioned on at least one survivor.

The observations of gene pair similarities consist of a mixture of normals, each component generated by one event, with the event time estimated by the sequence divergence from the event to the present. Despite the overlapping distributions, we can estimate the mean (via a local mode), standard deviation and proportion of the sample.

Statistics on gene pairs alone do not allow us to infer ri, so we introduce the study of gene triplets. We find formulae for the expected number of each kind of triplet, categorized as to which events produced the similarities among the three pairs of genes.

We develop a way of grafting a gene divergence model on this underlying profile of triplets to produce a predicted profile of the number of triplets of each kind. This can then be compared with the observed number of triplets.

Further work

Distinguishing among the four models combining tetraploidization or hexaploidization in two successive events is the simplest example of a more general problem. The theoretical path is clear for extending these ideas to include, for example, octoploidization through the extraction of quadruplets instead of triplets. In addition, there is a straightforward extension to the case of three or more successive polyploidization events, which we have undertaken in the study of Brassica oleracea. Here, three events, three normal components and two transition points are estimated from the distribution of similarities. The combinatorial probabilities have been worked out for this case and many others, and the methodology is available to complete this.

It is true that in some cases, such as that we presented in [9], concerning Durio zibethinus, the ploidy is evident from the clear presence in the SYNMAP self-comparison dotplots of sets of r regions covering a large proportion of the genome, each set represented by exactly \(\binom {r}{2}\) synteny blocks showing synteny among all r regions. (r=3 in the case of Durio.) Clear cases like this are rare, however, especially for genomes where the last polyploidization is more remote in time.

In previous work [7], we used additional information, beyond that contained in the similarity distribution, to confirm the recent hexaploidization of Brassica rapa against the alternative of tetraploidization. This kind of data, however, namely speculation about the number of single-copy genes in the current genome, was extremely subjective in that report, and is unreliable even when assessed by the best available methods on well-assembled and annotated genomes.

A distribution of gene pair similarities is generated in the comparison of two related genomes as well as in the self-comparison of a single genome. The number of orthologous gene pairs available when comparing two related genomes is generally much greater than the number of paralogous pairs identified in the self-comparison of a single genome, simply because the loss by fractionation of one copy of a duplicated gene does not eliminate all related orthology pairs: the other remaining copy and its orthology pairs remain intact. This suggests an avenue to improved accuracy in the inference of polyploidy levels. The larger amount of data, however, may not always compensate for the fact that the speciation component of the similarity distribution is always the most recent one [9], so that the more remote (earlier) components associated with polyploidy become statistically less clear and informative.