Hypothesis testing of meiotic recombination rates from population genetic data
Abstract
Background
Meiotic recombination, one of the central biological processes studied in population genetics, comes in two known forms: crossovers and gene conversions. A number of previous studies have shown that when one of these two events is nonexistent in the genealogical model, the point estimation of the corresponding recombination rate by population genetic methods tends to be inflated. Therefore, it has become necessary to obtain statistical evidence from population genetic data about whether one of the two recombination events is absent.
Results
In this paper, we formulate this problem in a hypothesis testing framework and devise a testing procedure based on the likelihood ratio test (LRT). However, because the null value (i.e., zero) lies on the boundary of the parameter space, the regularity conditions for the large-sample approximation to the distribution of the LRT statistic do not apply. In turn, the standard chi-squared approximation is inaccurate. To address this critical issue, we propose a parametric bootstrap procedure to obtain an approximate p-value for the observed test statistic. Coalescent simulations are conducted to show that our approach yields accurate null p-values that closely follow the theoretical prediction while the estimated alternative p-values tend to concentrate closer to zero. Finally, the method is demonstrated on a real biological data set from the telomere of the X chromosome of African Drosophila melanogaster.
Conclusions
Our methodology provides a necessary complement to the existing procedures of estimating meiotic recombination rates from population genetic data.
Keywords
Recombination rates Gene conversion Hypothesis testingBackground
Meiotic recombination is one of the essential evolutionary factors responsible for promoting genetic diversity within species. There are two major types of meiotic recombination events: crossovers and gene conversions. Unlike crossover, which is a reciprocal event, gene conversion is a unidirectional event that involves the transfer of a short segment of one parental chromosome (called a ‘conversion tract’) to the other parental chromosome. Crossovers and gene conversions play different roles in shaping the pattern of linkage disequilibrium (LD) observed in natural populations: “Recombination between pairs of markers that are far apart are almost exclusively crossovers, whereas pairs of markers that are close together are affected by both crossovers and gene conversion events” [1]. Thus, studying these two biological processes and characterizing their basic parameters has direct implications for population genetic studies.
There is a growing body of work on coalescent-based statistical approaches to jointly estimating the crossover rate, the gene conversion rate, and the mean conversion tract length from population genetic data. Building on a popular framework called the “Product of Approximate Conditionals” (PAC) model [2], Gay et al. [3] have proposed a likelihood-based method to incorporate gene conversion events. Yin et al. [4] have extended and improved the model further by explicitly modeling overlapping gene conversion events. On the flip side of these two frequentist approaches, Bayesian Markov chain Monte Carlo (MCMC) techniques have also been developed to estimate recombination rates from population genetic data [5],[6].
Despite the marked progress in the joint estimation of the aforementioned three parameters, these methods are less suitable when one of the two recombination events is absent in the genealogical model. The corresponding population parameter, especially the gene conversion rate when the gene conversion event is nonexistent, tends to be overestimated by the maximum likelihood (or maximum a posteriori) point estimation. This is unfortunately inevitable because the true parameter value (i.e., zero) lies on the boundary of the parameter space. The use of inaccurate parameters may limit the efficacy of these approaches, and can also hinder population genetic analyses based on these estimators. Therefore, it has become necessary to obtain statistical evidence from population genetic data about whether one of the two recombination events is absent.
The goal of this article is to propose a rigorous procedure to perform hypothesis testing for this problem. Our approach is based on the likelihood ratio test (LRT). One of the classical regularity conditions for the asymptotic distribution of the LRT statistic requires the null value to be an interior point in the parameter space. However, because this condition is not satisfied, it is invalid to apply the standard chi-squared approximation in this setting. We thus develop a parametric bootstrap procedure to obtain an approximate p-value of the observed test statistic. Coalescent-based simulations are conducted to demonstrate the soundness and effectiveness of our approach. The bootstrap estimates of the null p-values closely follow the theoretical prediction, while the estimated alternative p-values tend to concentrate closer to zero. Finally, we apply the method to a real biological data set from the telomere of the X chromosome of African D. melanogaster. The result suggests that while gene conversion is likely to play a leading role in shaping the observed polymorphism in these regions, crossover may not have been greatly suppressed in a short segment of s u(w ^{ a }) locus.
Methods
We begin by reviewing some previous statistical models used for point estimation of recombination parameters from population genetic data. In developing our hypothesis testing procedure based on the likelihood ratio test (LRT), we adopt the likelihood function of the OVERPAINT model that offers greater flexibility by allowing for overlapping gene conversions [4],[7]. Throughout this paper, ρ and γ are used to refer to the population-scaled crossover and gene conversion rates (per kb), respectively. The mean length of gene conversion tracts (kb) is denoted by λ.
The PAC model and the GenCo model
Instead of maximizing the true but intractable likelihood function l, the idea of the PAC model is to use the approximate likelihood l _{PAC} as a surrogate function to estimate recombination parameters from the sampled haplotypes. The original PAC model [2] has only considered the estimation of the crossover rate ρ, in which case l _{PAC} becomes a one dimensional function. Gay et al. [3] have extended the model by incorporating gene conversion events, and their model GenCo can be used to jointly estimate the crossover rate ρ, the gene conversion rate γ, and the mean conversion tract λ.
The choice of the approximate conditional probabilities Open image in new window in the GenCo model assumes that h _{k+1} is an imperfect mosaic copy of h _{1},...,h _{ k }. In particular, h _{k+1} is considered to consist of a mixture of segments from h _{1},...,h _{ k } with a small number of mutations, and its mosaic structure is the result of a joint effort by the crossover and gene conversion events. To capture this imperfect copying process, Gay et al. [3] have designed a factorial hidden Markov model (HMM) [12],[13] with two independent hidden chains. The crossover chain is modeled as a Poisson process with rate ρ along the sequence; for the gene conversion chain, both initiation and termination of a conversion tract are modeled as Poisson processes, with rates γ and 1/λ respectively. The joint configuration of the states in these two chains determines the index of the haplotype from which the copying is performed. See [3] and Figure two(a) in [4] for more details.
The OVERPAINTmodel
If site j+1 is not in a conversion tract (G _{j+1} is in the null state ∅), the crossover chain evolves according to the same Poisson process as defined in the GenCo model [3]. Otherwise, if site j+1 is in a conversion tract (G _{j+1}=∅), the crossover chain keeps track of the state in the previous site, i.e., the indicator function sets X _{j+1}=X _{ j }.
In addition to constructing coupled hidden chains to capture the alternating pattern of gene conversion, another key feature of the OVERPAINT model is to allow for overlapping gene conversion events in the copying process. This is motivated by the observation that it is possible for the coalescent model with gene conversion to generate genealogies in which the gene conversion tracts partially overlap or are completely nested within each other. See [4] and [7] for details of the OVERPAINT model, including the exact form of the initial and transition probabilities of hidden chains as well as the forward-backward algorithm to compute the approximate conditional probabilities Open image in new window .
Here, we use L _{OVERPAINT}(l,l,l) to refer to the likelihood function of the OVERPAINT model and f(λ) to denote the density of λ that corresponds to (1). The prior can also be interpreted as a regularizer to penalize very small or very large values of λ, and hence can yield more stable numerical results [7].
Motivation examples
In the settings of nonzero crossover and nonzero gene conversion rates, the studies in [4],[7] have shown that the OVERPAINT model provides a substantial improvement over the GenCo model in the accuracy of point estimation. However, as we will show below, the point estimators tend to be inflated and thus become unreliable when one of the recombination rates lies on the boundary of the parameter space, i.e., ρ=0 or γ=0. In conducting the simulation, 100 data sets with gene conversions only (ρ=0) and crossovers only (γ=0 and λ=0), respectively, are independently generated by the coalescent simulation program MS [17]. In each simulation, we generate a 20 kb region using θ=1.0/kb for the mutation rate and λ=0.5 kb for the mean tract length if the gene conversion rate γ=0, then we obtain the point estimation of all three parameters ρ,γ and λ by maximizing (2).
Summary of parameter estimates on simulated data sets with gene conversions only ( ρ =0)
ρ | |||||
---|---|---|---|---|---|
0.5 | 0.03(0.05) | 1.50(1.21) | 0.56(0.23) | 60 | 74 |
1.0 | 0.03(0.05) | 1.81(2.01) | 0.59(0.22) | 77 | 90 |
2.5 | 0.05(0.06) | 3.08(1.77) | 0.54(0.19) | 90 | 99 |
5.0 | 0.05(0.07) | 4.55(1.69) | 0.52(0.14) | 96 | 99 |
10.0 | 0.12(0.15) | 9.31(4.18) | 0.48(0.15) | 97 | 100 |
Summary of parameter estimates on simulated data sets with crossovers only ( γ =0)
γ | |||||
---|---|---|---|---|---|
0.5 | 0.45(0.22) | 0.71(0.62) | 0.66(0.25) | 6 | 11 |
1.0 | 0.75(0.29) | 0.71(0.60) | 0.73(0.28) | 4 | 10 |
2.5 | 1.54(0.68) | 0.78(0.61) | 0.81(0.25) | 14 | 19 |
5.0 | 2.59(0.96) | 1.21(0.79) | 0.79(0.22) | 7 | 20 |
10.0 | 5.24(8.94) | 2.89(2.81) | 0.75(0.29) | 4 | 13 |
In what follows, we will mainly focus on testing the null hypothesis H _{0}:ρ=0 (no gene conversion), but our testing procedure as outlined in Algorithm 1 can also be easily modified to testing H _{0}:ρ=0, as we will demonstrate in the section of “Results and discussion”.
Parametric bootstrap
where L _{OVERPAINT}(ρ,0,0 H) denotes the function in (2) computed with crossover rate ρ only (i.e., the original PAC model in [2]).
As Figure 2 and Additional file 1: Figure S1 show, the null distribution of the LRT statistic ρ(H) and its critical value (the 95% quantile) depends on the crossover rate ρ, which is an unknown nuisance parameter under the null hypothesis H _{0}. This observation motivates us to develop a parametric bootstrap procedure [19] to obtain an approximate p-value for the observed test statistic ρ(H), as outlined in Algorithm 1. Instead of constructing the whole null distribution of the LRT statistic, we draw B samples of size n from the null hypothesis with a crossover rate of Open image in new window , which is the parametric estimate of the nuisance parameter ρ under H _{0}. We then evaluate the test statistic on each bootstrap sample, and count the proportion that exceed the observed statistic.
Results and discussion
Simulation study
To evaluate the performance of our testing procedure, we use the same parameter settings as in the section “Motivation examples” to conduct the simulation. All reported p-values are based on B=200 bootstrap samples.
p-values under the null hypothesis
Summary of the estimated nuisance parameter γ under the null hypothesis H _{ 0 }: γ =0
γ | n =20 | n =35 | ||||
---|---|---|---|---|---|---|
0.5 | 0.65(0.26) | 87 | 100 | 0.71(0.22) | 91 | 100 |
1.0 | 1.04(0.37) | 94 | 100 | 1.09(0.31) | 99 | 100 |
2.5 | 2.00(0.58) | 89 | 100 | 2.22(0.47) | 99 | 100 |
5.0 | 3.33(0.75) | 90 | 100 | 3.72(0.64) | 97 | 100 |
10.0 | 7.52(1.40) | 75 | 100 | 8.19(1.01) | 88 | 100 |
p-values under the alternative hypothesis
A real biological application
Bootstrap p -values for segments of the su(s) locus in D. melanogaster
Segment | s1 | s2 | s3 | s4 | s5 | All |
---|---|---|---|---|---|---|
Length (kb) | 1.8 | 1.8 | 1.6 | 2.4 | 2.3 | 4.1 |
H _{0}:ρ=0 | 0.32 | 0.66 | 0.01 | 0.15 | 0.54 | 0.03 |
H _{0}:ρ=0 | 0.86 | 0.46 | 0.92 | 0.64 | 0.36 | 0.45 |
Bootstrap p -values for segments of the s u ( w ^{ a } ) locus in D. melanogaster
Segment | s1 | s2 | s3 | s4 | s5 | s6 | All |
---|---|---|---|---|---|---|---|
Length (kb) | 0.4 | 1.0 | 1.1 | 1.8 | 1.2 | 1.5 | 2.5 |
H _{0}:ρ=0 | 0.01 | 0.17 | 0.31 | 0.19 | 0.22 | 0.30 | 0.0 |
H _{0}:ρ=0 | 0.03 | 0.71 | 0.49 | 0.55 | 0.88 | 0.89 | 0.31 |
For the s u(s) locus, the p-values against H _{0}:ρ=0 for all the segments (including the whole locus) show no evidence of detecting crossover. However, a small p-value (0.01) against H _{0}:ρ=0 is observed for the shortest segment s3, and the overall effect is to provide a strong evidence of gene conversion for the whole locus (p-value = 0.03). This is consistent with the conclusion that gene conversion is likely to play a leading role in shaping the observed polymorphism in this region [20].
A similar pattern of the p-values holds for the s u(w ^{ a }) locus, except that the p-values against H _{0}:ρ=0 and H _{0}:ρ=0 for the shortest segment s1 are both significant at the 5% level: 0.01 and 0.03, respectively. This could imply that while gene conversion rate is high in this short segment, crossover may not have been greatly suppressed. It could also suggest a higher proportion of gene conversions that are accompanied by crossover events.
Conclusion
In this work, we have introduced a hypothesis testing procedure that can provide statistical evidence from population genetic data about whether one of the two recombination events is absent. By extensive coalescent simulation studies, we have shown that our parametric bootstrap approach is able to yield accurate estimates of the null p-values that closely follow the theoretical prediction. On the other hand, the bootstrap estimates of the alternative p-values tend to concentrate closer to zero. Our results on real SNP data sets from the s u(s) and s u(w ^{ a }) loci of African D. melanogaster indicate a strong evidence of detecting gene conversion in short segments of these regions. Moreover, crossover may also play an important role in a short segment of the s u(w ^{ a }) locus. We believe that our method provides a necessary complement to the existing procedures of estimating meiotic recombination rates from population genetic data, and expect it to be applied to other data sets.
Additional files
Notes
Acknowledgements
The author would like to acknowledge BioMed Central for a waiver of the article processing charge. I would also like to thank Prof. Yun S. Song, Prof. Michael I. Jordan, and Dr. Danping Liu for helpful suggestions and discussions. An allocation of computer time from the UA Research Computing High Performance Computing (HPC) and High Throughput Computing (HTC) at the University of Arizona is gratefully acknowledged.
Supplementary material
References
- 1.Wall JD: Close look at gene conversion hot spots . Nat Genet. 2004, 36 (2): 114-115. 10.1038/ng0204-114.PubMedCrossRefGoogle Scholar
- 2.Li N, Stephens M: Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data . Genetics. 2003, 165 (4): 2213-2233.PubMedPubMedCentralGoogle Scholar
- 3.Gay JC, Myers S, McVean G: Estimating meiotic gene conversion rates from population genetic data . Genetics. 2007, 177 (2): 881-894. 10.1534/genetics.107.078907.PubMedPubMedCentralCrossRefGoogle Scholar
- 4.Yin J, Jordan MI, Song YS: Joint estimation of gene conversion rates and mean conversion tract lengths from population SNP data . Bioinformatics. 2009, 25 (12): 231-239. 10.1093/bioinformatics/btp229.CrossRefGoogle Scholar
- 5.Wang Y, Rannala B: Population genomic inference of recombination rates and hotspots . Proc Natl Acad Sci. 2009, 106 (15): 6215-6219. 10.1073/pnas.0900418106.PubMedPubMedCentralCrossRefGoogle Scholar
- 6.Padhukasahasram B, Rannala B: Bayesian population genomic inference of crossing over and gene conversion . Genetics. 2011, 189 (2): 607-619. 10.1534/genetics.111.130195.PubMedPubMedCentralCrossRefGoogle Scholar
- 7.Yin J: Computational methods for meiotic recombination inference. PhD thesis, University of California, Berkeley, Berkeley, CA, 2010.Google Scholar
- 8.Kingman JFC: The coalescent . Stochastic Processes Appl. 1982, 13 (3): 235-248. 10.1016/0304-4149(82)90011-4.CrossRefGoogle Scholar
- 9.Wiuf C, Hein J: The coalescent with gene conversion . Genetics. 2000, 155 (1): 451-462.PubMedPubMedCentralGoogle Scholar
- 10.Wiuf C: A coalescence approach to gene conversion . Theor Popul Biol. 2000, 57 (4): 357-367. 10.1006/tpbi.2000.1462.PubMedCrossRefGoogle Scholar
- 11.Song YS, Lyngsø R, Hein J: Counting all possible ancestral configurations of sample sequences in population genetics . IEEE/ACM Trans Comput Biol Bioinform. 2006, 3 (3): 239-251. 10.1109/TCBB.2006.31.PubMedCrossRefGoogle Scholar
- 12.Rabiner L: A tutorial on HMM and selected applications in speech recognition . Proc IEEE. 1989, 77 (2): 257-286. 10.1109/5.18626.CrossRefGoogle Scholar
- 13.Ghahramani Z, Jordan MI: Factorial hidden markov models . Mach Learn. 1997, 29: 245-273. 10.1023/A:1007425814087.CrossRefGoogle Scholar
- 14.Hilliker AJ, Harauz G, Reaume AG, Gray M, Clark SH, Chovnick A: Meiotic gene conversion tract length distribution within the rosy locus of drosophila melanogaster . Genetics. 1994, 137 (4): 1019-1026.PubMedPubMedCentralGoogle Scholar
- 15.Jeffreys AJ, May CA: Intense and highly localized gene conversion activity in human meiotic crossover hot spots . Nat Genet. 2004, 36 (2): 151-156. 10.1038/ng1287.PubMedCrossRefGoogle Scholar
- 16.Nocedal J, Wright SJ: Numerical Optimization . 2000, Springer, New YorkGoogle Scholar
- 17.Hudson RR: Generating samples under the Wright-Fisher neutral model of genetic variation . Bioinformatics. 2002, 18 (2): 337-338. 10.1093/bioinformatics/18.2.337.PubMedCrossRefGoogle Scholar
- 18.Ferguson T: A Course in Large Sample Theory. Chapman & Hall/CRC Texts in Statistical Science . 1996, Chapman and Hall/CRC, United KingdomCrossRefGoogle Scholar
- 19.Efron B, Tibshirani RJ: An Introduction to the Bootstrap. Chapman & Hall/CRC Monographs on Statistics & Applied Probability . 1994, Chapman and Hall/CRC, United KingdomGoogle Scholar
- 20.Langley CH, Lazzaro BP, Phillips W, Heikkinen E, Braverman JM: Linkage disequilibria and the site frequency spectra in the su(s) and su ( w ^{a}) regions of the Drosophila melanogaster X chromosome. Genetics2000, 156:1837-1852.,Google Scholar
Copyright information
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.