Background

Thanks to the rapid development of sequencing technology and the lowering of sequencing costs in the last decade, the availability of large quantities of sequence data provides an unprecedented opportunity for researchers to investigate the role of rare variants in complex traits [1,2,3,4]. But due to the low minor allele frequency (MAF < 5%) and thus resulting in weak linkage disequilibrium (LD) with nearby markers, detecting rare variant (RV) association with complex traits faces great challenges [5,6,7,8]. One challenge is that detection of rare causal variants with traditional designs usually requires a large sample, which will be the high cost [3, 6]. Thus cost-effective design should be considered to reduce sample size. Another challenge is that the statistical power with test statistics of single-marker tests is generally low in genetic association studies of rare variants with more moderate or weak genetic effects [8,9,10]. To date many statistical methods have been developed for rare variant association analysis, including burden tests [11,12,13], variance-component tests [14, 15], series of sequence kernel association tests [10, 16, 17]. Any of these methods has relative perfect performance in special scenario, but none of them can overwhelm others in all scenarios [8, 9], especially for quantitative traits.

In fact, rare variant association analysis in the past several years mainly focused on the qualitative trait. Only a few statistical methods have been developed for the quantitative trait [13, 18,19,20,21]. One approach for rare variant association analysis of quantitative traits is the linear regression model. However, most regression-based methods rely on the normality assumption of the phenotype [8, 21]. Another commonly used approach adopts phenotype extreme selection design where one can transform the quantitative trait association study into case-control association study of qualitative traits by treating the upper extreme as cases and the lower extreme as controls in a strategy using extreme phenotype [22,23,24,25]. Extreme phenotypes of a quantitative trait are generally considered to be more informative. Moreover, a smaller sample size for extreme-phenotype sampling than that for random sampling is needed to achieve similar power [23, 24].

In this report, we use phenotype extreme selection design and Kullback-Leibler distance (KL-distance) [26] to propose a simple statistic method to identify rare variants for quantitative traits. Two stages strategies are adopted in our analysis where association analysis and fine mapping will be done in the first stage and the second stage, respectively. This method will compare the frequency distributions of rare variant in two extreme phenotypes based on KL-distance. Our method has three features: (1) it has increasing power with the stringency of the sample selection, (2) it is affected slightly by non-causal variants and the opposite effect variants.in the first stage for association analysis, and (3) it is not depend on the frequencies of non-causal variants and just dependent on the frequencies of causal variants in the second stage for fine mapping. Through simulation studies, we investigate the performance of the proposed method and compare it with three commonly used methods of the burden test [12], the sequence kernel association test (SKAT) [17], and the optimal test that combines SKAT and the burden test (SKAT-O) [10].

Results

Type I error rate and power for association analysis

Table 1 exhibits the estimated type I error rates of the statistic TKL for the extreme sample with sample-selection threshold value of 20, 10, and 5% and with sample size of 1000 and 1500. It can be seen that, under various genetic parameters, type I error rates of TKL are not appreciably different from the nominal alpha levels, which indicates the validity of the statistic TKL.

Table 1 Estimated type I error rates of the statistic TKL

Figure 1 shows the results of power for 9 scenarios when sample sizes are 1000 and 1500. It is found that the power of the statistic TKL with the sample size of 1500 is nearly 0.20 larger than that with the sample size of 1000, indicating that the power of the statistic TKL significantly increase with the increasing of the sample size. It can be seen that, under the same sample size, the powers of the statistic TKL with the low 5% samples and the up 5% samples are highest and the powers with the low 20% samples and the up 20% samples are lowest, which indicates that the powers of the statistic TKL increase with the stringency of the sample selection. It is observed from scenarios {1, 2, 3} that, when rare variant effects are in the same direction, the powers of the statistic TKL increase with the increasing of the number of causal variants. The same above conclusions are observed when 80% causal variants have positive effect and 20% causal variants have negative effect (scenarios {4, 5, 6}) and when there is the same number of causal variants with positive effects and negative effects (scenarios {7, 8, 9}). By comparing the powers under scenarios {1, 4, 7} with 10 causal variants, the powers under scenarios {2, 5, 8} with 20 causal variants, and the powers under scenarios {3, 6, 9} with 50 causal variants, we found that, when the number of causal variants with negative effect increases, the power of the statistic TKL decreases slightly. From Fig. 1, we can observe that, among four statistics of the TKL, the burden test, the SKAT, and the SKAT-O, the power of TKL is higher than that of other three statistics. The burden test, the SKAT, and the SKAT-O are severely affected by the number of non-causal variants and the opposite effect variants, especially when there are the same number of opposite effect variants. Although non-causal variants and the opposite effect variants affect the power of the TKL, the impact is slight. For example, when the sample size is 1500 and the number of causal variants is 50 for 10% sample-selection threshold value (B2), as the number of variants with negative effect increases from zero to 25, the powers of the burden test, the SKAT, and the SKAT-O decrease from ~ 0.80, ~ 0.79, and ~ 0.84 to ~ 0.23, ~ 0.63, and ~ 0.74, respectively, with the decline rate of 71.2, 20.2, and 12.0%. Nevertheless, when the number of variants with negative effect is 25, the TKL still achieves ~ 0.83 power, with the decline rate of just 7% comparing to ~ 0.90 when the number of variants with negative effect is zero.

Fig. 1
figure 1

Empirical power of four statistics from the extreme samples with 20% threshold value a, 10% threshold value b, and 5% threshold value c when the sample sizes are 1000 (a1, b1, c1) and 1500 (a2, b2, c2) at a 0.05 significance level

Power for fine mapping

In fine mapping study, the QTL can be located by the maximum value of the measure lKL. So we sample 10 times from each of 100 simulation populations where each sample includes 750 individuals with the up-extreme phenotype of Y > U and 750 individuals with the lower-extreme phenotype with Y < L. For each sample, we calculate the value of the measure lKL for each variant. In order to guard against noisy distributions of the measure lKL, we adopt the 5-point moving-average method to determine the maximum value. We count the number (here, we denote it B) of the maximum values that locate at variant 10 or variant 11. Then the probability that the maximum values of lKL locate at variant 10 or variant 11 is B/1000. We refer this value as the power of lKL, which measure the likelihood of fine mapping the QTL. Table 2 shows the results of the power for lKL. It can be seen that the power of lKL for fine mapping under dominant model is highest and the power of lKL for fine mapping under recessive model is lowest. The power of lKL increases with increasing of the heritability h2 of the causal variant and the stringency of the sample selection. For example, power of lKL under dominant model with the heritability h2 of 0.01 is 0.52, 0.62, and 0.67 at 20, 10, and 5% sample-selection threshold value, respectively; power of lKL under dominant at 5% sample-selection threshold value increase from 0.67 to 0.83 with the heritability h2 of the causal variant increasing from 0.01 to 0.10. We also investigate the effect of different sample sizes (e.g., 2n =1000, 1500, and 2000). As expected, power of lKL increases with the increasing sample size (data not shown). In order to assess the performance of lKL, we compare it with two LD measures l [27] and pexcess [28] with case-control design using extreme samples. Table 2 also lists the powers for l and pexcess. We found that the powers of lKL and l are nearly the same and higher than those of pexcess.

Table 2 The power of the QTL fine mapping for three LD measures by use of five-point moving average

Discussion

In this report, we present a robust approach to identify rare variant of quantitative traits. The proposed approach adopts phenotype extreme selection design and KL-distance method. We use a two-stage strategy in our analysis where the first stage is association analysis and the second stage is fine mapping of QTL if the first stage is positive result. We propose a statistic TKL for association analysis and a LD measure lKL for fine mapping. Simulation studies present the performance of the proposed method. We found that the power of the TKL increases with the stringency of the sample selection and the increasing of the number of causal variants. The TKL here has higher power for association analysis than three existing statistics. Meanwhile, the impact of non-causal variants and the opposite effect variants on the TKL is slight. The LD measure lKL for fine mapping in the second stage has a good feature of not dependence on the frequencies of non-causal variants and just dependence on the frequencies of causal variants. These results show that our method can be used to detect rare variant associated with quantitative traits. At the same time, we found that the proposed method can be easily extended to case-control study by treating cases and controls as samples with upper extreme phenotype and lower extreme phenotype, respectively.

In rare variant association analysis, in order to achieve high statistical power of tests, usually a large sample with high sequencing costs is needed. Thus less costly sequencing design is preferred in rare variant association study. For quantitative traits, extreme phenotypes are generally considered to be more informative because of rare causal variants enriched among them. One can use a smaller sample size for extreme-phenotype sampling to achieve similar power as that for random sampling [23, 24]. Moreover, because extreme phenotypes of quantitative traits relative to human health are of primary clinical significance and thus data set can be obtained easily for subjects with extreme phenotypic values, using extreme phenotype samples in association analysis will make our study useful and practical. Here we use KL-distance to construct the statistics TKL to measure the difference between two probability distributions of rare variants in two extreme populations. Based on the principle that the larger TKL value is, the more dissimilar two probability distributions of rare variants, the statistics TKL can be used as a test statistic to quantify the magnitude of association between the variants and the quantitative trait in the first stage of association analysis. We found that the statistic TKL here for association analysis has higher power than three existing statistics of the burden test, the SKAT, and the SKAT-O. Moreover, whereas increasing the number of non-causal variants and the opposite effect variants result in decreasing severely the powers of the burden test, the SKAT, and the SKAT-O, non-causal variants and the opposite effect variants affect slightly on the TKL. The TKL has relatively stable power with small change range under various parameters set.

In the second stage of fine mapping, lKL is essentially a measure of LD between the variant and the QTL. Although LD between rare variant and QTL maybe weak [24], the maximum value across all rare variants can be usually found to identify the causal variant (QTL). The measure lKL here has a good performance of just dependence on the frequency of the causal variant. In practice, not dependence on the frequency of the non-causal variant can eliminate“noise” and even bias introduced by varying frequencies of non-causal variants. In our early works, we proposed the LD measure l for mapping common variant of the QTL [27]. The performance of the measure l for mapping rare variant is unknown. We found from theory analysis that the two LD measures lKL and l are parallel and have the same performance, that is, both of them can quantify LD between the variant and the QTL and do not depend on frequencies of non-causal variants. The difference between them is that the measure lKL here is based on KL-distance and the measure l is based on entropy theory. Another LD measure for fine mapping is pexcess [28]. The pexcess is originally developed for fine mapping common variant of qualitative trait. We compare the performance of these three LD measures for fine mapping rare variant of quantitative traits using extreme samples. We found from theory analysis and simulation study that lKL is superior to pexcess.

It is noted that, in practice, we do not know how many causal variants there are in the region established through association analysis at first stage. Although we considered a region having only a single causal variant, our method works for the general case with a region consisting of multiple causal variants. In fact, when there is a region linked to a quantitative trait has multiple causal variants, we can detect all causal variants using following steps: (1) lKL is used to mapping a causal variant with the maximum value of lKL; (2) TKL is used to do association analysis for all variants except the causal variant detected in (1). If the association analysis result is positive, then return to (1). All causal variants will be found when the association analysis result is negative. It should be noted that we use the permutation procedure to assess the statistical significance of the statistic TKL for association analysis. Permutation procedure may need more computing time to conduct simulation. But with the development of high-performance computing, computing time may not be a problem in our study. In addition, it can be seen that our method involves only rare variants. A phenotype may affected by common variants or both common variants and rare variants. So our further work will involve extensive field for common variants or both common variants and rare variants.

Conclusions

The statistic TKL is affected slightly by non-causal variants and the opposite effect variants. The power of the TKL for association analysis of rare variants increases with the stringency of the sample selection for quantitative traits. Extreme phenotypes allow TKL to achieve higher power than three commonly used methods. The LD measure lKL for fine mapping is independent of the frequencies of non-causal variants and just dependent on the frequencies of causal variants.

Methods

In this study, all datasets were publically available and no research requiring ethics approval was conducted.

We consider an interesting gene region with k biallelic variants and assume that each variant has a minor allele m with the MAF Pm and a normal allele M with the allele frequency PM (Pm + PM = 1). The variants are indexed by i (i = 1, ..., k). The index i may or may not correspond to the variant orders. Let Xi be minor allele count at ith variant carried by a subject. Assume that there is a quantitative trait Y: Y = β0 + G + ε, where \( G=\underset{i=1}{\overset{k}{\Sigma}}{\beta}_i{X}_i \), β0 is the mean baseline value, and ε is residual due to random environmental effects. Without loss of generality, we assume β0 =0 and ε~N(0, σ2). To simplify our presentation, we use a measure with a superscript “U” to indicate a measure in the upper extreme population that has phenotypic values of the quantitative trait Y > U (U is an upper-threshold value, chosen from the continuous distribution of the study quantitative trait). We also use a measure with a superscript “L” to indicate a measure in the low extreme population that has phenotypic values of the quantitative trait Y < L (L is an low-threshold value, chosen from the continuous distribution of the study quantitative trait). Assume NU and NL subjects are sequenced with k variants in the upper extreme population and in the low extreme population, respectively, which are indexed by j (j = 1,..., NU/ NL). Denote \( {X}_{ij}^{\mathrm{U}} \) and \( {X}_{ij}^{\mathrm{L}} \) as the number of copies “m” for jth subject at ith variant in the upper extreme population and in the low extreme population, respectively. Then the frequencies of Pm and PM at ith variant in the upper extreme population and in the low extreme population, denoted as \( {p}_{mi}^{\mathrm{U}} \), \( {p}_{Mi}^{\mathrm{U}} \), and \( {p}_{mi}^{\mathrm{L}} \), \( {p}_{Mi}^{\mathrm{L}} \), respectively, are estimated as follows:

\( {p}_{mi}^{\mathrm{U}}=\frac{\underset{j=1}{\overset{N^{\mathrm{U}}}{\Sigma}}{X}_{ij}^{\mathrm{U}}}{2{N}^{\mathrm{U}}} \), \( {p}_{Mi}^{\mathrm{U}}=1-{p}_{mi}^{\mathrm{U}} \), \( {p}_{mi}^{\mathrm{L}}=\frac{\underset{j=1}{\overset{N^{\mathrm{L}}}{\Sigma}}{X}_{ij}^{\mathrm{L}}}{2{N}^{\mathrm{L}}} \), and \( {p}_{Mi}^{\mathrm{L}}=1-{p}_{mi}^{\mathrm{L}} \).

A statistic test for association analysis in the first stage

In the first stage, we propose a statistic test for association analysis. We define a k-dimensional random vector \( {\tilde{p}}_m={\left({\tilde{p}}_{m1},\cdots, {\tilde{p}}_{mk}\right)}^T \) as the proportion of the minor allele m among all k variants, where \( {\tilde{p}}_{mi}=\frac{\underset{j}{\Sigma}{X}_{ij}}{\underset{i=1}{\overset{k}{\Sigma}}\underset{j}{\Sigma}{X}_{ij}} \) and Xij is the number of copies “m” for jth subject at ith variant. In the upper extreme population and in the low extreme population, the k-dimensional random vector of the proportion of the minor allele m are denoted as \( {\tilde{p}}_m^{\mathrm{U}}={\left({\tilde{p}}_{m1}^{\mathrm{U}},\cdots, {\tilde{p}}_{mk}^{\mathrm{U}}\right)}^T \) and \( {\tilde{p}}_m^{\mathrm{L}}={\left({\tilde{p}}_{m1}^{\mathrm{L}},\cdots, {\tilde{p}}_{mk}^{\mathrm{L}}\right)}^T \), respectively, where \( {\tilde{p}}_{mi}^{\mathrm{U}}=\frac{\underset{j=1}{\overset{N^{\mathrm{U}}}{\Sigma}}{X}_{ij}^{\mathrm{U}}}{\underset{i=1}{\overset{k}{\Sigma}}\underset{j=1}{\overset{N^{\mathrm{U}}}{\Sigma}}{X}_{ij}^{\mathrm{U}}} \) and \( {\tilde{p}}_{mi}^{\mathrm{L}}=\frac{\underset{j=1}{\overset{N^{\mathrm{L}}}{\Sigma}}{X}_{ij}^{\mathrm{L}}}{\underset{i=1}{\overset{k}{\Sigma}}\underset{j=1}{\overset{N^{\mathrm{L}}}{\Sigma}}{X}_{ij}^{\mathrm{L}}} \) (i = 1, 2, ⋯, k). We compare the two probability distributions \( {\tilde{p}}_m^{\mathrm{U}} \) and \( {\tilde{p}}_m^{\mathrm{L}} \) using the KL-distance which is defined as in Kullback & Leibler [26], here, we denote it the statistic TKL:

$$ {\mathrm{T}}_{KL}=H\left({\tilde{p}}_m^{\mathrm{U}},{\tilde{p}}_m^{\mathrm{L}}\right)=\frac{1}{2}\left(\underset{i=1}{\overset{k}{\Sigma}}{\tilde{p}}_{mi}^{\mathrm{U}}\cdot \log \frac{{\tilde{p}}_{mi}^{\mathrm{U}}}{{\tilde{p}}_{mi}^{\mathrm{L}}}+\underset{i=1}{\overset{k}{\Sigma}}{\tilde{p}}_{mi}^{\mathrm{L}}\cdot \log \frac{{\tilde{p}}_{mi}^{\mathrm{L}}}{{\tilde{p}}_{mi}^{\mathrm{U}}}\right) $$
(1)

It is easy to find the relationship between TKL and the frequencies of Pm and PM as follows:

$$ {\mathrm{T}}_{\mathrm{KL}}=\frac{1}{2}\left(\underset{i=1}{\overset{k}{\Sigma}}\frac{p_{mi}^{\mathrm{U}}}{\underset{i=1}{\overset{k}{\Sigma}}{p}_{mi}^{\mathrm{U}}}\cdot \log \left(\frac{p_{mi}^{\mathrm{U}}}{p_{mi}^{\mathrm{L}}}\cdot \frac{\underset{i=1}{\overset{k}{\Sigma}}{p}_{mi}^{\mathrm{L}}}{\underset{i=1}{\overset{k}{\Sigma}}{p}_{mi}^{\mathrm{U}}}\right)+\underset{i=1}{\overset{k}{\Sigma}}\frac{p_{mi}^{\mathrm{U}}}{\underset{i=1}{\overset{k}{\Sigma}}{p}_{mi}^{\mathrm{U}}}\cdot \log \left(\frac{p_{mi}^{\mathrm{L}}}{p_{mi}^{\mathrm{U}}}\cdot \frac{\underset{i=1}{\overset{k}{\Sigma}}{p}_{mi}^{\mathrm{U}}}{\underset{i=1}{\overset{k}{\Sigma}}{p}_{mi}^{\mathrm{L}}}\right)\right) $$
(2)

TKL is the mean between two KL-distances where one is the KL-distance between \( {\tilde{p}}_m^{\mathrm{U}} \) and \( {\tilde{p}}_m^{\mathrm{L}} \) and the other is the KL-distance between \( {\tilde{p}}_m^{\mathrm{L}} \) and \( {\tilde{p}}_m^{\mathrm{U}} \). KL-distance provides a non-symmetric measure of how big of the difference between two probability distributions are. The KL-distance is always non-negative and equal to 0 only if two distributions are identical. It can be seen that TKL is a non-negative and symmetric measure of the two probability distributions \( {\tilde{p}}_m^{\mathrm{U}} \) and \( {\tilde{p}}_m^{\mathrm{L}} \). So, TKL can be used as a statistic to quantify the magnitude of association between the variants and the quantitative trait: a much larger TKL value will be observed under the alternative hypothesis of association compared to that under the null hypothesis of no association.

A KL-distance index for fine mapping of QTL in the second stage

Assume that a region linked to a quantitative trait has already been established through association analysis at first stage. In order to simplify our presentation, we assume that this region contains only a causal variant with a minor allele a (with frequency pa) and a normal allele A (with frequency pA = 1 − pa), here, we call it the quantitative trait locus (QTL). We consider the quantitative trait Y = GQ + ε, where GQ is the genotypic value at the QTL and ε~N(0, σ2). We hope to fine map this region by calculating the linkage disequilibrium (LD) measure between the QTL and a variant. We still use KL-distance to construct this measure through comparing the probability distributions of allele m and M at a variant in the upper extreme population and in the low extreme population. Following the previous symbols, let Pm and PM be the frequencies of allele m and M at a variant. From Eq. (1), we have

$$ H\left(\left\{{p}_m^{\mathrm{U}},{p}_M^{\mathrm{U}}\right\},\left\{{p}_m^{\mathrm{L}},{p}_M^{\mathrm{L}}\right\}\right)=\frac{1}{2}\left({p}_m^{\mathrm{U}}\cdot \log \frac{p_m^{\mathrm{U}}}{p_m^{\mathrm{L}}}+{p}_M^{\mathrm{U}}\cdot \log \frac{p_M^{\mathrm{U}}}{p_M^{\mathrm{L}}}+{p}_m^{\mathrm{L}}\cdot \log \frac{p_m^{\mathrm{L}}}{p_m^{\mathrm{U}}}+{p}_M^{\mathrm{L}}\cdot \log \frac{p_M^{\mathrm{L}}}{p_M^{\mathrm{U}}}\right) $$
(3)

From Appendix, \( H\left(\left\{{p}_m^{\mathrm{U}},{p}_M^{\mathrm{U}}\right\},\left\{{p}_m^{\mathrm{L}},{p}_M^{\mathrm{L}}\right\}\right) \) can be asymptotically expressed as a function of LD (δam) between the QTL and the variant:

$$ H\left(\left\{{p}_m^{\mathrm{U}},{p}_M^{\mathrm{U}}\right\},\left\{{p}_m^{\mathrm{L}},{p}_M^{\mathrm{L}}\right\}\right)\approx \frac{\delta_{ma}^2\cdot {\left({b}_U-{b}_L\right)}^2}{2{p}_m\cdot {p}_M} $$
(4)

Assume that there is an initial complete association between the variant allele m and the QTL allele a, at the 0th generation when the allele a is initially introduced into the study population. Let \( {\delta}_{ma}^{(0)} \) be the initial complete LD between the allele a and m at the 0th generation, \( {\delta}_{ma}^{(0)}={p}_M\cdot {p}_a \). After n generations, the LD between the allele m and a is \( {\delta}_{ma}^{(n)}={\left(1-\theta \right)}^n{\delta}_{ma}^{(0)}={\left(1-\theta \right)}^n\cdot {p}_M\cdot {p}_a \) [29], where, θ is the recombination between the QTL and the variant. Then we have

$$ H\left(\left\{{p}_m^{\mathrm{U}},{p}_M^{\mathrm{U}}\right\},\left\{{p}_m^{\mathrm{L}},{p}_M^{\mathrm{L}}\right\}\right)\approx \frac{\delta_{am}^2\cdot {\left({b}_U-{b}_L\right)}^2}{2{p}_M\cdot {p}_m}=\frac{{\left(1-\theta \right)}^{2n}{p}_a^2\cdot {p}_M^2\cdot {\left({b}_U-{b}_L\right)}^2}{2{p}_M\cdot {p}_m} $$
(5)

Now we define a LD measure, here, we denote it as lKL, as follows:

$$ {l}_{KL}=\frac{p_m}{p_M}H\ \left(\left\{{p_m}^{\mathrm{U}},{p_M}^{\mathrm{U}}\right\},\left\{{p_m}^{\mathrm{L}},{p_M}^{\mathrm{L}}\right\}\right)\approx \frac{1}{2}{\left(1-\theta \right)}^{2n}{p}_a^2\cdot {b}^2 $$
(6)

where b = bU − bL. It can be seen that lKL is a decreasing function of the recombination θ and reaches its maximum at θ = 0. So we can use lKL to find the variant closest to the QTL and thus fine map the QTL. Notice from Eq. (6) that lKL is independent of the frequency of the variant, just only dependent on the frequency of the QTL.

Simulation

Simulation for association analysis

To evaluate the performance of the test statistic TKL, we perform a series of simulation studies. We consider k = 100 variants with MAF values of causal variants determined by a uniform distribution U (0.001, 0.01) and MAF values of non-causal variants determined by a uniform distribution U (0.001, 0.05). The genotype data are simulated similar to those in Wang and Elston [30]. We first generate haplotypes for k variants based on a latent variable Z = (Z1, ⋯, Zk) from a multivariate normal distribution with covariance structure cov(Zi, Zj) = 0.4i − j between any two latent components. Then we combine two haplotypes to obtain the genotype value for each individual Xi = (Xi1, ⋯, Xik). A phenotype Y under the null hypothesis of no association is generated using the model Y = ε with ε~N(0, 1) (β1 = ⋯ = βk = 0). Under the alternative hypothesis of association, we randomly chose s variants as causal variants while other k-s variants as non-causal variants having βj = 0. Here, we let s = 10, 20, 50 in which 10, 20%, or 50% of rare variants were causal. For causal variants under the alternative hypothesis, we set βj = c ⋅ log10(pmi) as used in Lee et al. [10], where c is 0.6, 0.3, and 0.2 for different values of s and different direction of the effects of causal variants. We consider 9 scenarios in the simulation study with the parameter values detailed in Table 3. We conduct 1000 simulations for each scenario. In each simulation, we select three extreme sample strategies, the low 20% and the up 20%, the low 10% and the up 10%, and, the low 5% and the up 5%, each of which consists of 2 N individuals including N individuals in an upper sample and N individuals in a lower sample. The statistical significance is assessed by a permutation procedure. We first calculate the value of the data-based statistic TKL for each simulation. Then we permute the “upper sample” and “lower sample” labels with equal probability and recalculate the statistic TKL for 1000 times. The estimated P value is then the proportion of permutation-based statistics that are larger than the data-based statistic in 1000 permutations for each simulation. For a given significance level α, the power/type I error rate is estimated as the proportion of rejecting the null hypothesis when p-value ≤ α in 1000 simulations. In order to compare the performance of the test statistic TKL with the existing methods, we also obtain the power for the burden, SKAT, and SKAT-O tests using case-control design with the same samples as for the test statistic TKL.

Table 3 The parameter values for power study

Simulation for fine mapping

To assess the performance of the LD measure lKL in fine mapping rare causal variants of quantitative traits, we conduct a simulation study using the method similar to those described in our early work [27, 28]. We consider a genetic region that has 21 variants, where only a variant locating at the middle of variant 10 and variant 11 is causal variant (that is, the QTL). The MAF of the causal variant is set to be 0.01(pa =0.01) and the MAFs for 20 other variants are uniformly determined with values ranging from 0.001 to 0.05. Other parameters in simulation include the ratio d/v (here, v and d are the genotypic values for individuals with genotypes aa and Aa, respectively), the thresholds L and U, the heritability (h2) of the causal variant, and the sample size (2 N) [31]. We let the ratio d/v be − 1, 0, and 1 which correspond to recessive, additive, and dominant models, respectively. Once the parameter values are chosen, a population with the effective size of 15,000 is simulated starting from the 0th generation, with an initial complete association between the minor allele a at the causal variant and m at other variants [P(m|a) = 1]. The population then evolved for 50 generations under random mating and genetic drift. A hundred populations are simulated for analyses.