Abstract
Background
The loss of duplicate genes  fractionation  after whole genome doubling (WGD) is the subject to a debate as to whether it proceeds gene by gene or through deletion of multigene chromosomal segments.
Results
WGD produces two copies of every chromosome, namely two identical copies of a sequence of genes. We assume deletion events excise a geometrically distributed number of consecutive genes with mean µ ≥ 1, and these events can combine to produce singlecopy runs of length l. If µ = 1, the process is genebygene. If µ > 1, the process at least occasionally excises more than one gene at a time. In the latter case if deletions overlap, the later one simply extends the existing run of singlecopy genes. We explore aspects of the predicted distribution of the lengths of singlecopy regions analytically, but resort to simulations to show how observing run lengths l allows us to discriminate between the two hypotheses.
Conclusions
Deletion run length distributions can discriminate between genebygene fractionation and deletion of segments of geometrically distributed length, even if µ is only slightly larger than 1, as long as the genome is large enough and fractionation has not proceeded too far towards completion.
Similar content being viewed by others
Background
The process of whole genome doubling (WGD) gives rise to two copies of each chromosome in a genome, containing the same genes in the same order. Through an attrition mechanism known as fractionation, one of each pair of duplicate genes is lost over evolutionary time. The rare deletion of both copies of a gene can be excluded from our considerations of the interleaving patterns of deletions from duplicated regions first discovered by Wolfe and Shields [1]. The retention of one copy of each pair is what differentiates the WGD/fractionation model from approaches to gene duplication, insertion and deletion in the study of comparative genomics, pioneered by ElMabrouk [2].
An important biological controversy arises from the question of whether duplicated genes are deleted through random excision  elimination of excess DNA  namely the deletion of chromosomal segments containing one or more genes [3], which we term the "structural" mechanism, or through geneby gene events such as epigenetic silencing and pseudogenization [4], which are "functional" mechanisms. This question is important to evolutionary theory because it speaks directly to the role of WGD, and gene duplication in general, in disrupting gene order, in creating functional innovation, and in the radiation of new species. It is a question of whether selection operates on the level of simply permitting nonlethal deletions or whether more subtle effects are in play, such as dosage balance of interacting genes.
This debate may be formulated in terms of deletion events removing a number X of contiguous genes, where X is drawn from a geometric distribution γ with mean µ. Here the oneatatime deletion model is represented by µ = 1, while the random number of deletions at a time holds if µ > 1. In the latter case, the possibility of two overlapping events is handled by a biologically realistic additive runlength assumption.
In this paper, we investigate the discrimination problem of choosing between the two models based on deletion runlength statistics (resulting from overlapping deletion events). This involves comparing an observed genome containing singlecopy genes, originally members of duplicate pairs, to the predictions of the models for µ = 1 and for µ > 1. This requires knowledge of the runlength distribution, given a total number of deleted genes and remaining duplicate pairs. While this is easily calculated for the case µ = 1, the the distribution for the opposing scenario µ > 1 is not known.
In the first part of this paper, we analyze aspects of the deletion runlength distribution ψ when µ > 1 for the deletionlength distribution γ, including some new and surprising analytical results, the clearest of which pertain to a continuous analog of the problem. We then show why it is difficult to describe ψ in closed form or other easily computable format. In the second part, we simulate the distribution and carry out a study of the discrimination problem for various values of µ, genome size N and θ, the proportion of undeleted genes at time t. We conclude with a discussion of the remaining mathematical problems to be solved before the method can be applied to data from real WGD descendants.
Results
The models
For modeling purposes, we consider a doubled genome made up, at the outset, of a pair of identical linear chromosomes each containing genes g_{1}, . . . , g_{ N }, where N is large enough so that we can neglect end effects  particular behaviors near g_{ 1 } and g_{ N } . At each time t = 1, 2, . . ., one such doubled gene g_{i} is chosen at random, and a value X is chosen from a geometric distribution γ with mean µ. If X = a, then g_{i}, g_{i+1}, . . . , g_{i+a−1}are deleted from one of the genomes  they become singlecopy genes  unless some of these are already singlecopy. In the latter case, we skip existing singlecopy genes and proceed to convert the next doublecopy genes we encounter until a total of a doublecopy genes have been converted to singlecopy. Note that this overlapping of deletion events never occurs if µ = 1 since, in this case, by definition, exactly one doublecopy gene is selected and deleted in each step. For simplicity, we assume all deletions take place from one and the same genome. In a more complete model, deletion events would occur on one or the other chromosome, with probabilities φ and 1 − φ [5].
The "skipping" procedure, introduced in [6], is a natural way to model the deletion process, since deletion of part of a chromosome and the subsequent rejoining of the chromosome directly before and directly after the deleted fragment means that this fragment is no longer "visible" to the deletion process. As observers, however, we have a record of the deleted genes, as one copy of each gene must be retained in the genome.
Overlapping deletion events and skipping result in the creation of runs of singlecopy genes whose length is the sum of a number of geometric variables. The sum of r identical geometric variables produces a negative binomial distribution with parameter r, but the skipping process does not involve the sum of identical random variables, since a deletion with a large value of a is more likely to overlap an existing single copy region than a deletion with small a. Thus, at any point of time t > 0, the distribution ψ_{t} of singlecopy run lengths will tend to contain a higher frequency of runs of length 1, and of very long runs, than would be generated by the negative binomial. On the other hand, the distribution of run lengths of the remaining doublecopy genes is geometrically distributed with a probability distribution ρ_{t}, where the mean ν_{ t } decreases with t [5, 6].
Analysis of overlap probabilities
An attempt to determine ψ_{ t }analytically starts with the calculation of how many deletion events have overlapped to form a run of singlecopy genes at time t. In [6], we derived a formula to predict whether a deletion event would create a new run of singlecopy genes, probability p_{0}; overlap exactly one existing run, thus extending it without changing the total number of runs, probability p_{1}; overlap two runs, producing one larger combined run in place of the two preexisting ones, probability p_{2}; and so on. Other probabilities deal with the events that a run "touches" a preexisting run without overlapping it. These probabilities all depend solely on γ and ρ_{ t }. For example, we examine the case of p_{ 0 }. The other probabilities are all formulated in analogous ways.
The proportion of genes in singlecopy runs of length l is l{\rho}_{\text{t}}\left(l\right)/{\nu}_{\text{t}}, where {\nu}_{t}={\sum}_{{}_{l>0}}l{\rho}_{t}\left(l\right). The probability p_{0} that a deletion event falls within a run of doublecopy genes without deleting the terms at either end is
where j indexes the starting position of the deletion within a run of length l, and a is the number of genes deleted in the event.
This formula requires quadratic computing time, but the p_{ i } for higher i, require polynomial time of degree i + 2. Here we exemplify with p_{0} to show that these probabilities can in fact be reduced to closed form, so that computing time is a negligible constant. The formula in (1), when expanded, consists of a number of partial sums of the geometric distributions γ and ρ_{ t }and means of these distributions, all of which are readily reduced to closed form, plus sums of terms of the form {\left[\left(\text{1}\frac{1}{\mu}\right)\left(\text{1}\frac{1}{{v}_{t}}\right)\right]}^{l} and l{\left[\left(\text{1}\frac{1}{\mu}\right)\left(\text{1}\frac{1}{{v}_{t}}\right)\right]}^{l}, which themselves can be considered in terms of a geometric distribution with mean ζ, where
Then (1) reduces to:
For large ν_{ t }, i.e., during the early stage of the process,
Typically, µ is somewhere between 1 and 2, [3, 4], and ν_{ t } of the order of 10^{3} or 10^{4}. Thus p_{0} is initially only slightly less than 1 but declines rapidly as ν_{ t } decreases exponentially.
We proceed in an analogous way to derive closed forms for p_{ 1 }, p_{ 2 }, . . ., but it is perhaps more instructive here to present the continuous version of the deletion process. Here the two identical chromosomes at time t = 0 are linear segments, long enough in comparison with the other parameters of the model so that end effects can be ignored. At each time t = 1, 2, . . . , a random point g is chosen on the chromosome, and a value X is chosen from an exponential distribution
with mean µ. If X = a, then the segment [g, g +a] is deleted from one of the genomes  [g, g + a] becomes a singlecopy region  unless part of it is already singlecopy. In the latter case, we skip existing singlecopy regions and proceed to convert the next doublecopy region we encounter until a total measure a of doublecopy regions have been converted to singlecopy.
In analogy with the discrete model, the combined length of the remaining doublecopy segments is exponentially distributed according to probability distribution σ_{ t }, with a mean ν_{ t } that decreases with t.
The proportion of undeleted regions accounted for by segments of length ldl is \frac{l\sigma \left(l\right)}{{v}_{t}}dl, where {\nu}_{t}=\underset{0}{\overset{\infty}{\int}}l\sigma \left(l\right)dl. Then the probability p_{0} that a deletion event falls completely within an undeleted segment is
Carrying out the integrations, we find
which is reminiscent of the relation (4) in the discrete case with large ν_{ t }.
The probability p_{1} that a deletion event overlaps exactly one existing run of deletions is:
It can be proved by induction that the probability a deletion event overlaps exactly q existing runs of deletions is:
Thus we have the surprisingly uncomplicated result that the number q of preexisting runs of singlecopy regions overlapped by a new deletion event is geometrically distributed on q = 0, 1, . . . with parameter µ/(µ + ν_{ t }).
On the runlength distribution
Although having a closed form for p_{ q }constitutes progress towards the computation of the runlength distribution ψ_{ t }, or eventually towards some analytical results on it, how to find this distribution remains a difficult question. As mentioned in the previous section, long deletion events will be involved in more skipping than small ones. This is illustrated in Figure 1, where runs built from a small number of events tend to be composed of shorter deletions, especially when µ is large. Had we just added independent samples from a geometric distribution, the curves in the figure would have been horizontal lines.
How to account for the distorting effect of skipping on the runlength distribution will require additional insight and research. In the interim, we may use simulations to study the discrimination problem.
Simulations
We first simulated the fractionation process for all combinations of the following parameter values:

gene number N = 100 to 900, in steps of 100.

µ = 1.0 to 2.4, in steps of 0.1.

Proportion of the genes deleted, 1 − θ = 0.1 to 0.9, in steps of 0.1.
For each combination of the parameters µ, N and θ, we calculated the distribution of run lengths l for singlecopy regions, and similarly for doublecopy regions. The simulation was repeated 1000 times and the frequencies of length (l = 1,2,3,...) of runs of deleted genes were averaged over the 1000 trials to get a reasonably accurate estimate of the cumulative F_{ µ,N,1−θ }. Similarly we estimated the cumulative G_{ µ,N,1−θ }for runs of remaining doublecopy genes.
Once the cumulative distributions were established, we then carried out the actual discrimination study. For each value of µ and N, we sampled 1000 new individual trajectories of the deletion process until 1 − θ = 90 % of the genes were deleted. For each value of 1 − θ = 0.1, 0.2, . . . , 0.9, we created "bins" corresponding to the fifteen values of µ for which we had constructed cumulatives. Then for each sample S_{ i }, at each 1 − θ = 0.1, . . . , 0.9 we counted the frequency of runs of deleted genes of length = 1, 2, . . . and constructed a cumulative distribution. We calculated the KolmogorovSmirnov statistic {D}_{\mathit{\mu},N,1\theta}^{\left(Si\right)}between the sample cumulative and the previously established distribution F_{ µ,N,1−θ }for each fifteen values of µ and assigned the sample to the bin corresponding to the minimal value of this statistic, which was our estimate \widehat{\mu} for that sample.
Figure 2 shows the distributions of \widehat{\mu} for the 1000 samples S_{1}, . . . , S_{1000}. The four panels are the results of N = 900, 300, 200 and 100. A separate distribution is drawn for each of the trial values of µ used to generate the samples. For N = 900 (top left), there is a clear pattern of the mode of the distribution to occur at the same value of µ that generated the data, though the distributions become more spread out for higher values of µ. The same pattern may be seen for N = 300 (top right), though considerably degraded. This loss of accuracy of \widehat{\mu} continues through N = 200 (bottom left) and N = 100 (bottom right), where the modes for \widehat{\mu} when µ = 1.1 are in the µ = 1.0 bin.
With all four values of N in Figure 2, the most accurate inference is made for µ = 1, the genebygene model. This brings us back to the original problem of discriminating between the genebygene "functional model" (µ = 1) and the random excision "structural" model (µ > 1). Figure 3 shows the frequency with which we estimate \widehat{\mu}=\text{1}, for various values of µ and N = 200 or 900, as a function of 1 − θ, the proportion of genes deleted. The upper curves in the figure show that we can correctly identify the µ = 1 model around 7085% of the time; more for N = 900 and less for N = 200, as long as 1 − θ < 50%. In other words, the type I error of a test of H_{0} : µ = 1 against H_{1} : µ > 1 with these parameters and procedures, is about 1530%. The lower curves show that incorrectly inferring \widehat{\mu}=\text{1} occurs around 20% of the time when µ = 1.2, but very rarely for µ = 1.9 or even µ = 1.5, until 1 − θ begins to exceed 50%. In other words, if now H_{ m } : µ = m, for some constant m > 1, is the null hypothesis and H_{0} is the alternative, then the type I error is very small unless m is very close to 1 (e.g., m = 1.2) or 1 − θ is large (e.g., >50% if m = 1.5).
Up to now, we have examined only runs of singlecopy genes. What of the runs of remaining doublecopy genes? Figure 4 compares some of the results from the same simulations as Figure 3, but using the cumulative G_{ µ,N,1−θ } for runs of doublecopy genes as well as F_{ µ,N,1−θ }for runs of singlecopy genes. The main observation is that the doublecopy approach systematically infers µ = 1 with higher frequency for small values of 1 − θ, whether or not this inference corresponds to the generating µ. It systematically infers µ = 1 with lower frequency for large values of 1 − θ, again whether or not this is correct. These simulations establish ranges of values of N, µ and 1 − θ for which we can and cannot discriminate between the two models.
Conclusions
In this work we have made some progress in deriving the runlength distribution ψ_{ t } for singlecopy regions, although this problem is still not completely resolved. From an analytical point of view, it is unexpected and interesting that in the continuous version of the problem, the number of preexisting runs overlapped by a deletion event follows a geometric distribution.
The simulation study showed the much greater difficulty in distinguishing between the structural and functional models when the mean µ of the deletion size distribution is 1.1 rather than 1.9, when N is 100 rather than 900, and when the proportion of genes deleted is bigger than 50% rather than less than 40%. The latter effect is also apparent in empirical studies [7].
Our simulation results are based on a "binning" strategy for determining \widehat{\mu} for the purposes of discrimination, rather than an asymmetrical testing approach comparing the hypotheses µ = 1 and µ > 1. This is justified by the lack of any biological significance, and high rates of error, in comparing \mu =\text{1}+\in and µ = 1 for very small \phantom{\rule{0.5em}{0ex}}\in, as well as the global picture it offers of the degradation of discriminatory power as a function of µ, N and θ.
This work has for the first time enabled the systematic discrimination between the two models of duplicate deletion following WGD. Future research will continue on the analytical determination of ψ_{ t }as well as extension to the "twosided" deletion models proposed in [5]. Eventually, we will have to allow processes of genome rearrangement to disrupt runs of singlecopy genes or doublecopy genes, as in [7]. It is these kinds of model that will eventually be useful for analyzing data from real genomes.
References
Wolfe KH, Shields DC: Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997, 387: 70813.
ElMabrouk N: Genome rearrangement by reversals and insertions/deletions of contiguous segments. Combinatorial Pattern Matching, 11th Annual Symposium Lecture Notes in Computer Science. Edited by: Giancarlo, R., Sankoff, D. 2000, Springer, 1848: 222234.
van Hoek MJ, Hogeweg P: The role of mutational dynamics in genome shrinkage. Molecular Biology and Evolution. 2007, 24: 24852494.
Byrnes JK, Morris GP, Li WH: Reorganization of adjacent gene relationships in yeast genomes by wholegenome duplication and gene deletion. Molecular Biology and Evolution. 2006, 23: 11361143.
Sankoff D, Zheng C, Wang B: A model for biased fractionation after whole genome duplication. BMC Genomics. 2012, 13: S1, S8
Wang B, Zheng C, Sankoff D: Fractionation statistics. BMC Bioinformatics. 2011, 12: S9, S5
Sankoff D, Zheng C, Zhu Q: The collapse of gene complement following whole genome duplication. BMC Genomics. 2010, 11: 313
Acknowledgements
Research supported in part by grants to DS from the Natural Sciences and Engineering Research Council of Canada (NSERC). CFBAN was a Mitacs Globalink intern. DS holds the Canada Research Chair in Mathematical Genomics.
Declarations
Publication of this article was funded by NSERC Discovery grant DG 88672008 RGPIN.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 17, 2015: Selected articles from the Fourth IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S17.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
All authors participated in the research, wrote the paper, read and approved the manuscript.
Rights and permissions
This article is published under an open access license. Please check the 'Copyright Information' section either on this page or in the PDF for details of this license and what reuse is permitted. If your intended use exceeds what is permitted by the license or if you are unable to locate the licence and reuse information, please contact the Rights and Permissions team.
About this article
Cite this article
Sankoff, D., Zheng, C., Wang, B. et al. Structural vs. functional mechanisms of duplicate gene loss following whole genome doubling. BMC Bioinformatics 16 (Suppl 17), S9 (2015). https://doi.org/10.1186/1471210516S17S9
Published:
DOI: https://doi.org/10.1186/1471210516S17S9