Background

Over the past decade, genome-wide association studies (GWAS) have successfully identified many variants significantly associated with diseases and complex traits. Unfortunately, those variants only explain an extremely small proportion of phenotypic variation [1, 2] and there are many more variants with even smaller effects that we have yet to identify [1, 35]. Detecting all loci that harbor associated risk loci can help elucidate the biological mechanisms of diseases and complex traits. All biological follow-up studies have been performed on loci that harbor at least one significant variant. The standard association test used in GWAS examines one variant at a time to identify associated variants; we refer to this method as univariate testing.

Previous works have shown that many loci in the genome harbor more than one causal variant for a given disease or a trait [615]. The phenomenon is known as allelic heterogeneity, which is very common in Mendelian traits [16] and recent works have demonstrated widespread allelic heterogeneity in expression quantitative trait loci (eQTLs) and complex traits [17, 18]. The univariate test may be underpowered for a locus containing multiple causal variants with small effect sizes. Alternatively, an approach that considers the effects of multiple causal variants simultaneously may have increased statistical power to detect signals for the locus by aggregating the effects of causal variants.

In this paper, we propose a new model-based method for identifying the association between multiple variants in a locus and a trait we call Model-based Association test Reflecting causal Status (MARS). Our approach builds upon recent progress in fine-mapping approaches that attempt to identify causal variants in a locus. Causal variants are responsible for the association signal at a locus; however, each locus often contains tens to hundreds of variants tightly linked (linkage disequilibrium, LD) to the reported associated single nucleotide polymorphism (SNP). Therefore, the LD hinders the identification of causal variants at the risk locus. CAVIAR [13] is a recent fine-mapping approach that estimates the probability of each variant being causal, thus allowing an arbitrary number of causal variants by jointly modeling the association statistics at all variants. We extend the likelihood model of CAVIAR to explicitly incorporate the LD structure of data utilizing multivariate normal (MVN) distribution conditional on the causal status of the variants. MARS computes a likelihood ratio of a null model, where none of the variants are causal against an alternate model, where at least one variant is causal. Then, an efficient re-sampling approach is applied for the significance test.

Our method does not require individual-level data, which is often not provided in GWAS. MARS only requires summary statistics such as Z-scores and the LD of variants in a locus, which can be obtained from a reference dataset such as HapMap [19, 20] or the 1000 Genome project [21], and reports a p-value that indicates the significance of the association between the locus and the corresponding trait. This approach is related to set-based association tests that examine an association between a set of variants and a trait [2224]. MARS outperforms these previous methods because its underlying model, which builds upon the model of CAVIAR, explicitly models the joint distribution of observed statistics given multiple signals of associations. Furthermore, MARS uses a significance level that corresponds to the standard GWAS significance level, thus facilitating interpretation.

When applied to several simulated data sets, we show that MARS robustly controls type I errors and has improved statistical power compared to the univariate test and widely utilized set-based association tests, a fast and flexible set-Based Association Test (fastBAT) [25], Deterministic Approximation of Posteriors (DAP-G) [26], and Sequence Kernel Association Test (SKAT) [27]. In addition, to show the performance of MARS on both eQTL studies and GWAS, we have applied MARS to representative eQTL and GWAS datasets; Genotype-Tissue Expression (GTEx) data and Northern Finland Birth Cohort (NFBC) data, respectively. Applied to the data of 44 tissues provided by the GTEx consortium [28, 29], MARS identified more eGenes, which are genes with at least one variant significantly associated with cis compared to those reported by the GTEx consortium in most tissues, e.g., in the Whole Blood data, MARS identified 29% more eGenes than the consortium; 57% of the extra eGenes that had only been identified by MARS, i.e., not by consortium, were reported in studies elsewhere. To demonstrate the increased power of MARS on real data, we followed a strategy of applying MARS to an older data set and validated the additionally discovered loci using current datasets that have higher statistical power because they are much larger. Applied to the 2009 NFBC data, we show that MARS effectively identifies more association loci than the univariate test and show that many of the new loci have since been discovered in recent GWAS studies.

Results

Overview of MARS

Causal variants are those that are responsible for the association signal at a locus. The ultimate goal of the standard association test, which examines the association between each variant and a trait, is to find causal variants; we refer to this method as univariate testing. However, multiple causal variants with small effect sizes often exist in a locus. For these cases, the univariate test may not detect those associations due to its low statistical power. Alternatively, we can examine the aggregated effect of multiple variants simultaneously on the trait to increase statistical power.

We developed a novel statistical method referred to as Model-based Association test Reflecting causal Status (MARS). MARS examines the association between a set of variants and a trait. MARS requires summary statistics estimated for variants (e.g.,z-score) for a locus of interest and a correlation structure, LD, between the variants, which can be readily obtained from a reference dataset. To test the association between a set of variants of a locus and a trait, MARS estimates a likelihood ratio to compute a test statistic, which is referred to as Likelihood Ratio Test (LRT) statistic; LRTstat. We consider the likelihood of a null model (L0) and the alternative model (L1). Note that we are computing the likelihood ratio of a null model against the alternative model, not the full model, which is the standard form of the “Likelihood ratio test” uses. The null model assumes that there is no causal variant to the trait while the alternative model assumes that there is at least one causal variant to the trait. Then, we compute the LRTstat as L1/L0. Suppose that we test the association between m number of variants and a trait. Given the observed summary statistics, we can compute the LRTstat as follows:

$${LRT_{{stat}}} = \frac{\sum_{C\in \zeta} p(S|C)p(C)}{p(S|C_{0})p(C_{0})} $$

Here, S=[s1,⋯,sm]T indicates summary statistics of m variants and C indicates the causal status of m variants. C is a binary vector of length m, where 0 indicates that a variant is non-causal and 1 indicates that a variant is causal. Specifically, C0 indicates the causal status where none of the variants are causal and ζ is a set that contains all possible casual statuses except for C0. Since there are m number of variants, there are 2m possible causal statuses. In practice, we limited the number of allowed causal SNPs as well as the number of variants considered for a region to reduce the running time in experiments throughout the paper. We find that considering up to 3 causal variants and use 50 variants in a region to be reasonable in respect to both the running time and the accuracy in our experiments. However, a user may increase the numbers, which is amendable in high computing servers and may provide more accurate results in expense of the running time. For details, see the “Methods” section. To assess the association significance for a locus, we utilize the re-sampling approach, where we sample null statistics from a MVN distribution with a corresponding LD and estimate the LRTstats for the null statistics to generate a null panel of LRTstats;\({LRT}_{{stats}}^{NULL}\). From the null panel, we estimate the significance of \({LRT}_{{stat}}^{DATA}\) computed from the data. Figure 1 shows the basic overall process of MARS. The “Methods” section describes the details and techniques to make this process computationally feasible for big genomic data.

Fig. 1
figure 1

Overview of MARS. Here, we assume that we are testing an association between a locus of m variants and a trait. The leftmost panel shows the input of MARS; m number of summary statistics for the locus and an n×m matrix that contains genotypes of m SNPs for n samples. The next two panels on the bottom show the re-sampling process in which we sample the null statistics K times from an MVN distribution with a variance-covariance matrix of Σ that contains LD of the genotypes X. The rightmost panel shows the process by which we estimate LRTstats for the null panel from which we can compute a p-value for the data

MARS controls type I error while improving power in simulation studies

We demonstrate that MARS controls type I errors through simulations of null panels utilizing the GTEx data as a starting point and consider the SNPs ±1Mb around the transcription start site (TSS) of 10 genes of Whole Blood data from the GTEx consortium [28, 29]. Half of the genes are randomly selected from those reported as eGenes by the GTEx consortium and the other half are randomly selected from other genes, i.e., non-eGenes, of the GTEx consortium [28, 29]. For each gene locus, we simulate 107 null summary statistics according to the generative model described in the Materials and Methods section, which uses the LD structure estimated from the genotypes of the SNPs in the locus and applies MARS to compute the LRTstats.

To show that MARS controls type I error, the false-positive rates are estimated for different thresholds of α=5×10−6 to 5×10−2. Half of the simulated data is used to compute a threshold of LRTstats for the corresponding α; \(\phantom {\dot {i}\!} {LRT}_{threshold^{\alpha }}\) and the other half of the simulated data is used to compute a quantile of LRTstats smaller than the \(\phantom {\dot {i}\!}{LRT}_{threshold^{\alpha }}\). Figure 2a shows that MARS robustly controls type I error for all examined gene loci as the false-positive rates for different gene loci are very close to the corresponding α=5×10−6 to 5×10−2, respectively.

Fig. 2
figure 2

Comparison of eGenes identified by MARS and eGenes reported by GTEx consortium. a The plot shows that MARS controls type I error. The points represent 10 different gene loci used for the test where the five gene loci are from eGenes as reported by the GTEx consortium. b Box plot showing the percentage of power improvement of MARS over the univariate test for different effect sizes of two causal SNPs that exist in the data. The X-axis shows five different effect sizes of λ=4,4.5,5,5.5, and 6 used for the test. The Y-axis shows the percentage of power improvement. The red and black bars show the power improvement when loci not reported and reported as eGenes by the GTEx consortium are used for the test, respectively. c Plot comparing the percentage of power improvement of MARS over the univariate test when two and three causal variants implanted in the simulation data. Each variant has an effect size of λ=4.5.d Box plot showing the power of MARS, DAP-G, and fastBAT for different effect sizes. The X-axis shows the effect sizes of λ=2,2.5,3,3.5,4,4.5,5,5.5, and 6 used for the test. The Y-axis shows the power in percentages. The red, green, and blue boxplots show the power of MARS, DAP-G, and fastBAT, respectively

To show that MARS increases the statistical power, we performed extensive simulation studies for various scenarios and compared the power of MARS with those of the univariate test. Here, we defined the univariate test as a set-based association test that uses the maximum summary statistic among SNPs in the locus we are testing (for details see the “Methods” section). The same gene loci from the previous section are used for the test and we estimate power of each gene locus for cases with two causal variants implanted with different effect sizes of λ=4,4.5,5,5.5, and 6. For a fair comparison of the powers between the univariate test and MARS, we utilized the standard GWAS p-value threshold of 5×10−8. We simulated 108 summary statistics under the null model of no effect to generate a null panel and 108 summary statistics under the alternative model of effect size λ for two causal variants to examine the power. To set a threshold for computing the power, we utilized the concept of the univariate test (see “The standard univariate test, fastBAT, and DAP-G” sub-section of the “Methods” section). For each null statistic, we selected the maximum p-value to get 108 maximum p-value from the null panel. Then, we ordered the maximum p-values to get the quantile, q, where the maximum p-values corresponds to 5×10−8; q is used for setting the LRT thresholds. Then, the power is estimated as a quantile of alternative cases that show LRTstats greater than the LRT threshold. The details of the whole processes of computing the threshold and estimated the power is described in the “Power estimation” sub-section of the “Methods” section. The percentage of power improvement is defined as (power of MARS – power of the univariate test)/(power of the univariate test) ×100. Figure 2b shows that MARS has increased statistical power compared to the univariate test. While the extent of power improvements differs between the gene loci as LD structures differ between loci, it is clear for all cases that the powers are improved over the univariate test. Depending on the effect size λ implanted in the simulated data, the power has improved from 5.2 to 41.18% in our experiments and as expected, the smaller the effect size, the better MARS performs over the univariate test. The results do not show noticeable differences between the loci of the eGenes and of the non-eGenes used for the simulations.

In addition, we examine the cases where two and three causal variants, each with an effect size of λ=4.5, are implanted in the simulated data. As the number of causal variants increases from two to three, MARS shows a bigger power improvement over the univariate test (Fig. 2c). The result shows that the more causal variants that exist in a locus, the better MARS performs over the univariate test.

Besides the univariate test, we compared MARS with the widely used set-based association test methods, fastBAT [25], DAP-G [26], and SKAT [27]. Due to the heavy I/O of fastBAT and DAP-G (an extended version of DAP), we used 105 simulations and a threshold of 10−5. We computed the power of MARS, DAP-G, and fastBAT for the different effect sizes of λ=2,2.5,3,3.5,4,4.5,5,5.5,6 and show that MARS outperforms fastBAT and DAP-G for all cases by improving the power from 0.07 to 23.38% and from 0.6 to 10.61%, respectively, depending on the effect sizes in the experiments (Fig. 2d). In addition, we compared MARS with SKAT, optimal unified SKAT (SKAT-O) [27] and Meta-SKAT [30], which is designed for meta-analysis but applicable for a single study as well. As SKAT does not allow z-scores that we have used for our simulated studies as its input, we generated phenotypes with a range of effect sizes β for the comparison. The results show that MARS outperforms SKAT, SKAT-O, and Meta-SKAT. Additional file 1: Fig. S1 describes the experiments and results in detail.

MARS detects novel eGenes in GTEx data

Recently, a larger number of expression quantitative trait loci (eQTLs) studies have been reported. In particular, numerous cis-eQTLs, which are eQTLs that map to the approximate location of their gene-of-origin, have been identified. As part of this effort, the GTEx consortium reported eGenes, which are genes with at least one cis-eQTL. We applied MARS to GTEx data to show that it can detect more eGenes than those reported by the GTEx consortium. Among the 44 tissues provided by the GTEx consortium, we first applied MARS to the Whole Blood data for evaluation as this data contains the largest number of samples among all tissues.

For simulation studies and GWAS, we applied a threshold that corresponds to the p-values threshold of 5×108 utilizing the univariate test. However, for a fair comparison of the MARS results with those reported by the GTEx consortium, a different strategy has been used. We used 10,000 simulations, which is the number of simulations used by the GTEx consortium to compute their “empirical p-values” and select eGenes. To identify eGenes for MARS, we set the threshold as the border of empirical p-values between eGenes and other genes, referred to as non-eGenes, reported by the GTEx consortium. Figure 3 is a Venn diagram that compares the identifications of eGenes by MARS and those reported by the GTEx consortium. MARS identified 2043 extra eGenes that were not reported by the consortium, while MARS missed only 98 eGenes that were reported by the consortium [28, 29]. MARS and the GTEx consortium detected 6686 eGenes in common.

Fig. 3
figure 3

Comparison of the eGenes identified by MARS and those reported by GTEx consortium. a The red circle shows the eGenes identified by MARS and the blue circle shows the eGenes detected by GTEx version 6. Whole blood data was used for the analysis. b A Venn diagram comparing eGenes identified by GTEx version 6, GTEx version 7, FHS, and MARS. Whole blood data was used for all three studies. The blue, purple, green, and red circles show the eGenes identified by GTEx version 7, the Framingham Heart Study (FHS), GTEx version 7, and MARS, respectively. Note that MARS used data from GTEx version 6

To verify that the eGenes identified by MARS are true associations, we compared the extra eGenes with those reported by other studies that have larger sample sizes. Note that the results throughout the paper used data from GTEx version 6. GTEx version 7 has recently been published with more samples and improved technology in experiments. We expect more eGenes are detected in the newer version of the data as the power increases with the number of samples and etc. We compared the extra eGenes with those reported by GTEx version 7. In addition, we utilized a Whole Blood data of Framingham Heart Study (FHS) [31], which is independent of GTEx data but contains a larger sample size (5257 samples), to validate the extra eGenes. Figure 3b is a Venn diagram that compares the identifications of eGenes from four studies, GTEx version 6, GTEx version 7, FHS, and MARS. Among the 2043 extra eGenes, about 57% (1160 genes) were reported in either GTEx version 7 or FHS; 775 genes were reported as eGenes in GTEx version 7, 763 genes were reported as eGenes in FHS, and 378 genes were identified as eGenes by both GTEx version 7 and FHS. Even with the older version of the data, MARS still found more eGenes than GTEx version 7 and MARS is expected to identify even more eGenes when using data with a larger number of samples in further studies. Moreover, some of the 883 genes (Fig. 3b) that were only identified by MARS and not by GTEx version 6, GTEx version 7, or FHS, have biological evidence of being eGenes based on many studies in the literature. Variants of the SP140 (ENSG00000079263) gene are known to be related to multiple sclerosis (MS) [32] and chronic lymphocytic leukemia [33]. Sille et al. have demonstrated that the expression level of SP140 is regulated by cis-eQTLs in lymphoblastoid cell lines [34]. Besides, the SP140 protein levels are shown to be downregulated by a cis-acting mechanism in peripheral blood mononuclear cells (PBMCs) from MS patients [35]. HSPB8 (ENSG00000152137) has been recently identified as an eGene using PBMCs and the expression level of HSPB8 is known to be regulated by several SNPs [36]. Surfactant protein D encoded by SFTPD (ENSG00000133661) gene is known to be regulated by in a cis-acting manner in human blood [37], and CD83 (ENSG00000112149) has been recently identified as cis-eQTLs gene in CD19+ B lymphocyte [38]. Additionally, in Additional file 1: Fig. S2, we thoroughly analyzed 100 randomly selected genes and compared the p-values for MARS, the univariate test, and those reported by the GTEx consortium to show that MARS could identify more eGenes with better p-values. These results show that MARS is capable of identifying novel eGenes that cannot be detected using standard association test approaches. Additional file 2: Table S1 lists the 2048 extra eGenes and their identifications in GTEx version 7 and FHS.

One advantage of MARS is that once the null panel of LRTstats for each gene has been established, this can be applied to the gene in any other tissues. Utilizing the null panel of LRTstats estimated from the Whole Blood data of the GTEx consortium, we computed the p-values of the genes in all 44 tissues of GTEx using their summary statistics and LD structures. Figure 4 shows that MARS identifies comparable or more eGenes than the univariate test in addition to those reported by the GTEx consortium in all tissues. As expected, the number of eGenes identified by the univariate test and those reported by the GTEx consortium are very close to each other in all of the tissues. The numbers of genes differ between tissues due to factors such as sample size differences and we only used the common genes in each tissue and the Whole Blood data of the GTEx consortium because the null panel of LRTstats was estimated for genes in the Whole Blood data. Additional file 3: Table S2 provides a comparison of eGenes identified by MARS, the univariate test, and the GTEx consortium for each tissue. In addition, we compared eGenes identified by MARS, GTEx version 6, and GTEx version 7 and provided Venn diagrams as in Fig. 3 for all tissues (Additional file 1: Fig. S3).

Fig. 4
figure 4

Number of eGenes identified by MARS, the univariate test, and those reported by the GTEx consortium. The x-axis shows the 44 tissues provided by the GTEx consortium and the y-axis shows the number of eGenes identified by each method. The black circle shows the number of eGenes identified by MARS, the red cross shows the number of eGenes identified by the univariate test, and the green triangle shows the number of eGenes reported by the GTEx consortium

MARS detects more set-based associations in GWAS

We show the effectiveness of our method on GWAS by applying MARS to the Northern Finland Birth Cohort (NFBC) data [39]. The NFBC data consist of 10 traits collected from 5327 individuals, namely triglycerides (TG), high-density lipoproteins (HDL), low-density lipoproteins (LDL), glucose (GLU), insulin (INS), body mass index (BMI), and C-reactive protein (CRP) as a measure of inflammation, systolic blood pressure (SBP), diastolic blood pressure (DBP), and height. For NFBC data, we examined 51,762 loci, where each locus is defined as ±1 Mb of TSS of genes provided by the GTEx consortium.

MARS requires a lot of sampling to apply the standard GWAS p-value threshold of 5×10−8. To reduce the running time, we apply the idea of importance sampling [40] on MARS for the GWAS data, which well approximates the p-value estimated from the original sampling approach while reducing the sampling number dramatically, from 108 to 104 (Additional file 1: Fig. S4). For further details, see the “Fast and space-efficient sampling for MARS” sub-section of the “Methods” section. Figure 5a shows that, for all traits, MARS identifies more or comparable loci that are likely significantly associated with the traits. A total of 471 loci were only identified by MARS and not by the univariate test. Additional file 4: Table S3 lists significantly associated loci identified by MARS and by the univariate test and Additional file 1: Fig. S5 provides Venn diagrams that compare the identifications of MARS and the univariate test. To verify those extra loci, we searched the loci from other GWAS by utilizing the GWAS catalog [41]. As a result, several variants associated with 311 loci among those 471 extra loci have been previously reported [7, 4266]. For example, the rs6060369 locus associated with height was reported by large GWAS [4951]. The rs1800961 locus related to HDL was previously reported by large GWAS and meta-analysis GWAS [5961, 65, 66]. The rs6511720 locus related to LDL was discovered by several previous studies [45, 55, 56, 60, 61, 65]. A Venn diagram in Fig. 5b compares the number of loci identified by MARS and the univariate test as well as showing the number of loci for which at least one associated variant has been reported by the GWAS catalog. The list of SNPs, the corresponding loci found by previous studies, and their detailed information including PubMed id and SNP position are provided in Additional file 5: Table S4. Note that loci are defined based on the gene map of GTEx (±1Mb of TSS), so some loci may overlap (Additional file 1: Fig. S6). For the height, the univariate test found no associations while MARS found 53 associations. To verify the 53 identifications of MARS on the height, we performed the univariate test on the genetic investigation of anthropometric traits (GIANT) consortium data set [67] that contains 131,547 samples and is thus much larger than the NFBC data set and expected to have greater power on the association test. As a result, we found 5788 associations where all 53 associations that MARS found are included to show that MARS’s identifications on height are true-positive signals. These results demonstrate that MARS can efficiently identify novel associations in GWAS.

Fig. 5
figure 5

Significant associations identified by MARS and the univariate test in NFBC data. a Number of significant associations identified by MARS and the univariate test. The x-axis shows the 10 traits of NFBC data and the y-axis shows the number of set-based associations that will likely be associated with the traits. The blue bars show associations identified by MARS and the orange bars show associations identified in the univariate test. The 10 phenotypes are triglycerides (TG), high-density lipoproteins (HDL), low-density lipoproteins (LDL), glucose (GLU), insulin (INS), body mass index (BMI), C-reactive protein (CRP) as a measure of inflammation, systolic blood pressure (SBP), diastolic blood pressure (DBP), and height. b A Venn diagram showing the number of loci was found by the GWAS catalog for 10 traits. The red circle shows the number of loci identified by MARS, the blue circle shows the number of loci identified by the univariate test, and the green circle shows the number of identifications reported by the GWAS catalog

Discussion

Great efforts have been spent on finding the hidden heritability and many studies suspect that single-level variant tests miss signals due to the small effect sizes and power problems. Approaches that examine multiple variants together may have increased statistical power to detect risk loci with small effect sizes. Moreover, interpreting Genome-wide Association Studies (GWAS) at the gene level is an important step toward understanding the molecular processes that lead to disease [68, 69]. Several statistical approaches have been proposed that test the association between a set of variants and a trait or disease status; however, they simply use naïve statistics such as the mean or sum of χ2 for statistics in the risk loci [24, 25, 70, 71].

Our method examines the association between a set of variants and a trait considering the causal status and LD between variants, utilizing the model used in a recent fine-mapping approach [13, 72]. One of the advantages of MARS is that it may be applicable to data with only summary statistics, utilizing LD estimated from a global reference dataset such as 1000 genome data [73]. Note that a special attention may be required for the data from multiancestry or human leukocyte antigen (HLA) region, for which a reference dataset may not provide an accurate LD estimate due to population stratification [73] or complex LD patterns [74]. Another advantage of our method is that once a null panel of test statistics has been established for a locus, it can be applied to the locus in other studies, only if the analyzed variants (thus, the LD structures) of the locus are the same. For example, in our GTEx analysis, the null panel statistics for genes were established only once and applied to all of the 44 tissues. This may reduce the running time significantly as most of the running burden comes from the null panel generation.

Applied to extensive simulated data sets with different effect sizes and the number of causal variants, our method shows improved power compared to previous approaches including the widely used set-based association test, fastBAT and DAP-G, while successfully controlling type I errors. Especially, when there are many causal variants with small effect sizes, our method shows superior performance to the standard univariate association test approach. Applied to Genotype-Tissue Expression (GTEx) data, our method identifies more or comparable eGenes compared to the standard univariate approach as well as those reported by the GTEx consortium in all tissues. In addition, using Whole Blood data, we show that a large portion of the eGenes only identified by MARS have been reported by other larger studies and some of them have biological evidence of being eGenes based on previous literature. Lastly, utilizing the Northern Finland Birth Cohort (NFBC) data, we show the effectiveness of our method to the GWAS in that our method effectively identifies more association loci in GWAS compared to the standard association test approach. In the experiments, we have defined each locus as a gene; however, it could be defined as any set of variants that a user wants to apply to.

We note some limitations in our work. First, MARS is computationally costly compared to the standard GWAS method as MARS tests the significance of an association based on the re-sampling approach. However, in practice, we introduce fast and space-efficient sampling techniques including importance sampling to dramatically reduce the sampling time, which closely approximates the original result, while we were able to successfully handle big eQTL data sets that contain tens of thousands of genes and GWAS data sets with thousands of samples. Second, we limited the number of causal variants in a locus up to three in the simulated studies to reduce the running time in experiments. We believe this is a reasonable assumption, as it has been reported that a relatively small number of variants exist in a region [13] and MARS showed better results in practice, utilizing real eQTL and GWAS datasets: GTEx data and NFBC data. However, this may not be a general assumption and further investigations may require for the cases with a larger number of causal variants (>3), comparing with previous methods such as DAP-G and fastBAT, which may give better results for the cases. Third, our model is based on a linear model and can only be applied to common variants, not rare variants. For data that does not follow a normal distribution, we recommend to fit the data into a normal distribution using techniques such as inverse normal transformation. Also, we assume the summary statistics are corrected for population stratification. We can extend our likelihood ratio model using CAVIAR-gene [15] instead of CAVIAR to consider the population stratification in the future study. Lastly, MARS does not utilize existing functional data; some current methods utilize functional data to detect more eGenes [7577]. We can extend the statistical framework of MARS to utilize functional data in future work. Despite these limitations, MARS is a novel statistical method that can detect newly associated loci and increase the number of loci in follow-up studies. Through this, MARS can increase our biological understanding of diseases and complex traits.

Methods

In this section, we assume that phenotypic values are continuous values to ease some of mathematical derivations of computing the summary statistics. MARS only require the joint distribution of summary statistics to follow multivariate normal distribution. It has been shown that for binary phenotypes the joint distribution of summary statistics follows a multivariate normal distribution [15, 78, 79]; thus, summary statistics obtained from case/control phenotypes are applicable for MARS as well.

GWAS statistics

Consider GWAS on a quantitative trait where we genotype n individuals and collect a phenotype for them. Let Xi be a vector of length n with the standardized genotypic values (i.e., mean zero and variance one) of the ith marker that we are testing and Y be a vector of length n with the phenotypic values. We assume that the data-generation model follows the following linear additive model:

$$Y=\mu \mathbf{1_{n}}+X_{i}\beta_{i}+\mathbf{e} $$

Here, μ is the mean of the phenotypic values, 1n is a vector of n ones, βi is their coefficients, and e is a vector of length n sampled from \(\mathcal {N}(0, \sigma ^{2}\mathbf {I})\) accounting for the residual errors, where I is an n×n identity matrix.

Under this model, the phenotype follows a MVN with the following mean and variance:

$$ Y \sim \mathcal{N}\left(\mu\mathbf{1_{n}}+X_{i}\beta_{i}, \sigma_{e}^{2}\mathbf{I}\right) $$

By maximizing the likelihood of the model, we can estimate βi as follows:

$$ \hat{\beta}_{i} = \frac{X_{i}^{T}Y}{X_{i}^{T}X_{i}},\hat{\beta}_{i}\sim \mathcal{N}\left(\beta_{i}, \frac{\sigma_{e}^{2}}{(X_{i}^{T}X)^{-1}}\right) $$

and the summary statistic is computed as follows:

$$ S_{i}=\frac{\hat{\beta_{i}}}{\hat{\sigma_{e}}}\sqrt{X_{i}^{T}X_{i}}, S_{i} \sim \mathcal{N}\left(\lambda_{i},1\right) $$

, where λi is non-centrality parameter (NCP) and is equal to \(\frac {\beta _{i}}{\sigma _{e}}\sqrt {X_{i}^{T}X_{i}}\). We obtained the estimated values for μ,e, and σe as \(\hat {\mu }=\frac {\mu \mathbf {1_{n}}^{T}X_{i}}{n}, \hat {e}=Y-\mathbf {1_{n}}\hat {\mu }-\hat {\beta _{i}}X_{i}\), and \(\hat {\sigma _{e}}=\sqrt {\frac {\hat {e}^{T}\hat {e}}{n-2}}\), respectively.

The effect of linkage disequilibrium on the statistics

Consider the case where that the ith SNP is causal to a phenotype and the jth SNP is non-causal but in LD with the ith SNP. The correlation between the two variants is r, which is approximated by \(\frac {1}{n}X_{j}^{T}X_{i}\). The effect size of the jth SNP is computed as follows:

$$ \hat{\beta}_{j} = \frac{X_{j}^{T}Y}{X_{j}^{T}X_{j}},\hat{\beta}_{j}\sim N\left(r\beta_{i}, \frac{\sigma_{e}^{2}}{(X_{j}^{T}X)^{-1}}\right) $$

and the statistics for the jth SNP are computed as follows:

$$S_{j}=\frac{\hat{\beta_{j}}}{\hat{\sigma_{e}}}\sqrt{X_{j}^{T}X_{j}}, S_{j} \sim N\left(r\lambda_{i},1\right) $$

We can show that the covariance between the statistics is equal to the correlation of the genotypes as follows:

$$\begin{array}{*{20}l} \texttt{Cov}(S_{i}, S_{j})=&\frac{X^{T}_{i}X_{j}}{\sqrt{X^{T}_{i}X_{i}}\sqrt{X^{T}_{j}X_{j}}}\\ =&\texttt{Cor}(X_{i}, X_{j})\equiv r_{{ij}} \end{array} $$

Then, the joint distribution of the summary statistics for the two variants given their NCPs, λi and λj, follows a multivariate normal distribution as follow:

$$\left(\left[\begin{array}{ll}S_{i}\\S_{j}\end{array}\right] {\mid}{ \left[\begin{array}{ll}\lambda_{i}\\ \lambda_{j}\end{array}\right]}\right) \sim \mathcal{N} \left(\left[\begin{array}{ll}\lambda_{i}\\ \lambda_{j} \end{array}\right], \left[\begin{array}{ll}1&r_{{ij}}\\r_{{ij}}&1\end{array}\right]\right) $$

CAVIAR generative model

Now we consider the case with m SNPs. Given the true effect sizes of m SNPs, Λ=[Λ1,Λ2,⋯,Λm], the summary statistics of m SNPs, S=[S1,⋯,Sm]T, is as follows:

$$ (S|\Lambda) \sim \mathcal{N}(\Sigma\Lambda, \Sigma) $$
(1)

Here, Σ is a correlation matrix, where Σ{i,j}=rij. We utilize Fisher’s polygenic model and assume that the effect sizes follow a normal distribution. Let C be a binary vector of length m that indicates the causal status of m SNPs; 0 indicates that a SNP is non-causal and 1 indicates that a SNP is causal. Given the causal status C, we assume that the true effect size is as follows:

$$ (\Lambda|C) \sim \mathcal{N}\left(0,\Delta\right) $$
(2)

where Δ is a diagonal matrix, where Δ{i,i}=σ2 if the ith SNP is causal and Δ{i,j}=ε, otherwise. From Eqs. (1) and (2), the likelihood of summary statistics follows a multivariate normal distribution as follows:

$$ (S|C) \sim \mathcal{N}\left(0,\Sigma+\Sigma\Delta\Sigma\right) $$
(3)

Then the likelihood function is given as follows:

$$ \mathcal{L}(\Sigma, \Delta|S)=\frac{1}{\sqrt{(2\pi)^{m}|\Sigma+\Sigma\Delta\Sigma|}}\texttt{exp}\left(-\frac{1}{2}S^{T}(\Sigma+\Sigma\Delta\Sigma)^{-1}S\right) $$
(4)

We use a simple model where the probability that an SNP begins causal is γ, which is independent from other SNPs. To compute the prior of causal status, we use the same assumptions that are widely used in fine-mapping methods, and γ is set to 0.01 [13, 78, 8083]. We have shown that the choice of γ does not make big differences on the results (for details, see Additional file 1: Fig. S7) We assume each SNP is independent and that the probability of a SNP to be causal is equal to 0.01 [81, 82]. Therefore, we compute the prior probability as follows:

$$ p(C)=\prod^{m}_{i=1}\gamma^{|c_{i}|}(1-\gamma)^{1-|c_{i}|} $$
(5)

Here, |ci|=1 if the ith SNP is causal and |ci|=0, otherwise. Although we use a simple prior, we can incorporate external information by using the SNP-specific prior γi, which is the prior for the ith SNP, and then the prior probability for a more general case is \(\phantom {\dot {i}\!}p(C|\gamma =[\gamma _{1}, \cdots \gamma _{m} ])=\prod ^{m}_{i=1}\gamma _{i}^{|c_{i}|}(1-\gamma _{i})^{1-|c_{i}|}\).

Model-based Association test Reflecting causal Status (MARS)

MARS examines the association between a set of SNPs and a phenotype of interest. For the test statistic, we utilize a likelihood ratio test (LRT). We consider the likelihoods of two models: that of the null model (L0) and that of the alternative model (L1). The null model assumes that there is no causal SNP to the phenotype while the alternative model assumes that there is at least one causal SNP for the phenotype. Then, we can compute the test statistic as LRTstat=L1/L0. Given the observed marginal association statistics S and correlation matrix Σ, we can compute the LRTstat as follows:

$$ {LRT_{{stat}}} = \frac{\sum_{C\in \zeta} p(S|C)p(C)}{p(S|C_{0})p(C_{0})} $$
(6)

Here, we can compute the prior using Eq. (5) and the likelihood using Eq. (4). Since there are m SNPs, there are 2m potential causal statuses. In practice, we limit the number of allowed causal SNPs to two or three as which is consistent with reports from previous studies that a relatively small number of causal SNPs exist in a region. In addition, as the size of genes are often very large—many genes contain more than 10,000 SNPs within ±1Mb of TSS for the GTEx data—we order the SNPs by the value of its summary statistics and only used the top 50 SNPs for computing the LRTstats to reduce the running time and the space. Figure 6a shows this practical implementation of MARS used for the experiments. This strategy dramatically reduces the running time while well approximating the results using all the SNPs in the loci (Additional file 1: Fig. S8) because the causal SNPs are expected to be included in the top 50 SNPs. When limiting the number of causal SNPs up to three and using care \(\sum ^{3}_{i=1}\left (\begin {array}{c}50\\i\end {array}\right)\)casual statuses to consider and ζ becomes a set that contains all the possible casual statuses with 1, 2, or 3 causal SNPs. However, depending on the available computational power and size and properties of the data, the number of possible causal variants for running MARS may increase using options in the MARS program.

Fig. 6
figure 6

Practical implementation of MARS and fast and efficient sampling strategy for MARS. a To reduce the running time and space requirement, MARS uses the top 50 statistics instead of all the SNPs in its analysis. b For GWAS, we introduce a fast and efficient sampling strategy

eGene detection in GTEx data

To identify an eGene, we examine the association between the gene expression levels and SNPs within ±1Mb of TSS of the gene, which can be the candidates of cis-eQTLs for the gene. To assess the significance of a gene, we sample summary statistics from an MVN distribution under the null hypothesis, SN(0,Σ). Here, Σ is a variance-covariance matrix estimated from the SNPs within ±1Mb of TSS of the gene. Based on the simulation data, we order the SNPs by values of its summary statistics and used only top 50 SNPs for computing the LRTstats; \({LRT}_{{stats}}^{NULL}\) using equation (6). Then, we also select the top 50 SNPs of summary statistics to compute the LRTstat of the gene; \({LRT}_{{stat}}^{DATA}\), using the equation (6). The p-value of the gene is estimated as the quantile of \({LRT}_{{stat}}^{DATA}\) among \({LRT}_{{stats}}^{NULL}\). One of the advantages of MARS is that once the null panel, \({LRT}_{{stats}}^{NULL}\), has been estimated for a locus, the panel can be rapidly applied to the locus in any other tissues or traits to compute a p-value. We use the Whole Blood data, which contains the greatest number of samples among the 44 tissues, to estimate the null panels of 23,163 genes and applied the panels to all the other tissues. To compare the MARS results with GTEx’s results, we use 104 simulations, the number used by GTEx Consortium to compute their “empirical p-values” to select eGenes. To identify eGenes for MARS, we set the threshold as the border of empirical p-value between eGenes and genes other than those eGenes, referred to as non-eGenes, reported by the GTEx consortium, which is differ by tissues as GTEx used the FDR approach to find their eGenes. A similar process can be applied to detect eGenes in the univariate test except using the maximum summary statistic as the test statistic instead of LRTstat.

Power estimation

To show MARS increases statistical power over the univariate test, we compare the power between MARS and the univariate test. For a fair comparison, we utilized the standard GWAS p-value threshold of 5×10−8. We sampled 108 number of summary statistics under the null hypothesis, SNULLN(0,Σ) and 108 number of summary statistics under the alternative hypothesis, SALTN(ΣΛ,Σ). Here, Λ is a vector of length m, where m is the number of SNPs that contain zeros except for the causal SNPs. For example, for a simulation, in which two SNPs (e.g., SNP 1 and SNP 2) with effect size λ are implanted in the data, Λ is [λ,λ,0,...,0]. We examined the power for cases with two causal or three causal SNPs implanted in the simulated data, where the causal SNPs are randomly selected for each simulation. Then, we computed the p-value of SNULL using the univariate test, UNI pNULL, and found the quantile q, where the p-value equals to the standard GWAS p-value threshold of 5×10−8 as follows:

$$ q=\text{Number of (UNI}p^{NULL} < 5\times 10^{-8}) $$

We compute the LRTstats of SNULL as \({LRT}_{{stats}}^{NULL}\), using MARS and set the LRTstat at the quantile q as the threshold of LRTstats as \({LRT}_{{stat}}^{THR}\), which satisfies the following equation:

$$ \text{Number of} ({LRT}_{{stat}}^{NULL} > {LRT}_{{stat}}^{THR}) = q $$

Here, \({LRT}_{{stat}}^{THR}\) corresponds to the standard GWAS p-value threshold of 5×10−8. Now, we compute the LRTstats of SALT as \({LRT}_{{stats}}^{ALT}\), and the power of MARS is defined as the number of \({LRT}_{{stats}}^{ALT}\) that are greater than the \({LRT}_{{stat}}^{THR}\) as follows:

$$ \text{Power of MARS} = \frac{\text{Number of} ({LRT}_{{stat}}^{ALT} > {LRT}_{{stat}}^{THR})}{10^{8}} \times 100 $$

Similarly, the power of the univariate test is defined similarly by computing the p-value of SALT using the univariate test; UNI pALT, as follows:

$$ \text{Power of the univariate test} = \frac{\text{Number of (UNI}p^{ALT}< 5\times 10^{-8})}{10^{8}} \times 100 $$

In the power comparison of MARS, fastBAT and DAP-G, the power estimation process is the same as that described above except for that 105 simulations and a threshold of 10−5 is used instead of 108 simulations and a threshold of 5×10−8, respectively.

Fast and space-efficient sampling for MARS

To access the significance of associations, MARS uses a re-sampling approach that requires a lot of sampling from MVN distribution. There are two main obstructions to this standard re-sampling approach. One is that a locus may contain many SNPs; for example, many genes in the GTEx data contain >10,000 SNPs within ±1Mb of their TSS. When the number of SNPs m is very large, the standard re-sampling approach, SN(0,Σm×m), using the Cholesky decomposition [84] is impractical. This takes a lot of time and space as Σm×m itself often requires a few gigabytes of space. We can reduce the space and time complexity dramatically by utilizing the fact that Σm×m is a covariance matrix of X; Σm×m=XTX/n, where n is the number of samples. Instead of sampling statistics from MVN with the variance-covariance matrix of Σm×m; SN(0,Σm×m), we sample statistics from MVN with the variance-covariance matrix of In×n; SN(0,In×n). This neither takes time nor space because in general n<<m and n is not large. Then we multiply S by \(X^{T}/ \sqrt {N}\) to compute the statistics \(S=\frac {X^{T}}{\sqrt {N}}S^{*}\).

The other main obstruction of the standard sampling approach is that the number of sampling required to find a proper threshold for MARS may be very large. For the GTEx data, we compared the eGenes to those reported by the GTEx consortium and performed 10,000 samplings to determine the number of samples used for computing their empirical p-values. However, for the GWAS analysis, MARS needs to perform a lot of samplings to find a LRT threshold that corresponds to the standard GWAS p-value threshold of 5×10−8. Thus, for GWAS, we have applied importance sampling, which is an approximation method of standard sampling. The main idea of importance sampling is that it draws the sample from a distribution with thicker tails than a target distribution. Then, it uses importance weights so that the correct distribution is targeted [40]. The procedure is as follows. Instead of sampling from MVN with the variance-covariance matrix of In×n; SN(0,In×n), we sample statistics from MVN with the variance-covariance matrix of \(\sqrt {2}I_{n\times n}\); \(S^{*}_{{imp}} \sim N(0, \sqrt {2}I_{n\times n})\). Then, the new statistics from importance sampling become \(S_{{imp}}=\frac {X^{T}}{\sqrt {N}}S^{*}_{{imp}}\). We record an additional information, referred to as the importance weight, which defined as follows:

$$ W=\frac{f(S^{*}_{}|0,I_{n\times n})}{f(S^{*}_{{imp}}|0,\sqrt{2}I_{n\times n})} $$

Here, f indicates the probability density function of MVN. We repeat the process of sampling statistics \(S_{{imp}}^{*}\) and computing Simp and W, K times. We call each process as a run and after K runs we have a set of statistics \(\{S_{imp_{1}}, S_{imp_{2}}, \cdots, S_{imp_{K}}\}\) and a set of weights {W1,W2,⋯,WK}. Then, we estimate a univariate p-value from each Simp and compute the p-value threshold as the ratio of the sum of weights that have the univariate p-value <5×10−8 over the sum of all the weights as follows:

$$ \frac{\sum_{i}^{C}{W_{i}}}{\sum_{i}^{K}{W_{i}}} $$
(7)

Here, i indicates the index of a run and C is a set containing the indices of runs where the univariate p-value <5×10−8. Given summary statistics of a locus, we can access the significance of the locus by computing LRTstat of the summary statistics as \({LRT}_{{stat}}^{DATA}\). In addition, we compute the K number of LRTstats for the top 50 SNPs of the Simp as \({LRT}_{{stats}}^{NULL}\), as well. Then we compute the p-value of the locus as the ratio of the sum of weights where \({LRT}_{{stats}}^{NULL}>{LRT}_{{stats}}^{DATA}\) over the sum of all the weights as follows:

$$ \frac{\sum_{i}^{D}{W_{i}}}{\sum_{i}^{K}{W_{i}}} $$
(8)

Here, D is a set containing indices of runs with \({LRT}_{{stats}}^{NULL}>{LRT}_{{stats}}^{DATA}\). The association is significant if the p-value estimated from the Eq. (8) is smaller than the p-value threshold estimated from the Eq. (7). Applied to 10 randomly selected genes, we find that the p-value estimated from the 104 number of importance sampling well approximates the p-values estimated from the 108 number of the original re-sampling (Additional file 1: Fig. S4). Utilizing the importance sampling, we can reduce the number of samplings dramatically from 108 to 104 in GWAS experiments. Figure 6 shows an overview of the fast and efficient association strategy for MARS (Fig. 6b).

For the GTEx data analysis, we used MARS as described in Fig. 6a, where 104 number of samplings were performed and up to two causal variants were considered. In this case, MARS took approximately 3.5 min to test the significance for an average-size gene with 7522 SNPs for 338 samples in our system. Using parallel processing, we were able to run the 23,163 genes over several hours, which was approximately 3 h for sampling and computing LRTstats and some extra time for pre-processing and post-processing the data. For the GWAS data analysis, we applied the fast and efficient strategy of MARS as described in Fig. 6b, where 104 number of importance sampling was performed and up to two causal variants were considered. In this case, MARS took approximately 50 min to test a significance for an average-sized locus with 299 SNPs and 5326 samples in our system. Using parallel processing, we were able to run the 56,319 number of genes in approximately tow days.

The standard univariate test, fastBAT, and DAP-G

To compare MARS with the standard approach of the set-based association test, we defined a univariate test that uses a maximum summary statistic among the SNPs in the analysis locus. In addition, the widely used set-based association tests fastBAT [25] and DAP-G [26] were used for the comparison. A Genome-wide Complex Trait Analysis (GCTA) [85] program was downloaded from the GCTA website (http://gcta.freeforums.net/thread/309/gcta-fastbat-based-association-analysis) and the “fastBAT” option was used to run GCTA-fastBAT. The DAP-G program was downloaded from the appropriate website (https://github.com/xqwen/dap/tree/master/dap_src), and summary statistics were used from the run option.

GTEx data

The summary statistics and genotypes of 44 tissues of GTEx data version 6 were downloaded from dbGap (https://www.ncbi.nlm.nih.gov/gap); these were used to generate all results throughout this paper. The eGene list of GTEx data version 7 was downloaded from dbGap and only used to validate eGenes that had been identified by MARS applied on GTEx data version 6. In total, 23,163 gene loci selected from the Whole Blood data were used for the analysis; these contain at least 50 SNPs in their ± Mb of TSS. We generated the null panel of LRTstats using Whole Blood data that contains the most samples, 338. The numbers of genes differ between tissues due to factors such as sample size differences; therefore, for eGene detection in 44 tissues, we used common gene regions in each tissue and the Whole Blood data.

Northern Finland Birth Cohort dataset

The genotypes and 10 phenotype values of triglycerides (TG), high-density lipoproteins (HDL), low-density lipoproteins (LDL), glucose (GLU), insulin (INS), body mass index (BMI), C-reactive protein (CRP) as a measure of inflammation, systolic blood pressure (SBP), diastolic blood pressure (DBP), and height of 5326 samples from the Northern Finland Birth Cohort (NFBC) dataset were downloaded from dbGap. PLINK, a whole-genome association analysis toolset (http://zzz.bwh.harvard.edu/plink/index.shtml), was used to compute the statistics. For the set-based association test, the gene map of the GTEx data that contains 56,319 gene positions was used to define the loci for analysis. SNPs ±1 Mb around the transcription start site (TSS) of the genes were searched in the NFBC genotype data and 51,762 regions with >50 SNPs were used for the analysis. 104 importance samplings were performed to generate the null panel to estimate the p-values of MARS and the univariate test.

GIANT consortium height dataset

To evaluate the identifications of MARS on the NFBC height data, we performed univariate tests on the GIANT consortium height dataset, which contains 131,547 samples. [67].