Main

Estimation of the variance explained by all SNPs used in a population-based GWAS was initially motivated by the 'missing heritability' problem1. The problem was that the estimated variance explained by genome-wide significant (GWS) SNPs discovered in GWAS (denoted ĥ2GWS) was only a fraction of the estimated heritability (ĥ2) from family or twin studies2, where ĥ2GWS was estimated in a multi-SNP model to account for linkage disequilibrium (LD) among SNPs and in an independent sample to avoid overestimation due to winner's curse3. Taking human height as an example, ĥ2GWS was 5% before 2010 (ref. 4), which is much smaller than a frequently quoted ĥ2 of 80% from family or twin studies5,6,7. This raised concerns about the cost-effectiveness of GWAS as an experimental design for the discovery of associated genes8. Several explanations of the missing heritability were proposed, including the presence of a large number of common variants of small effect yet to be discovered, rare variants of large effect not tagged by common SNPs on genotyping arrays, and inflation in pedigree-based ĥ2 due to shared environmental effects, non-additive genetic variation and/or epigenetic factors2,9. The missing heritability question also reignited the debate about the 'common disease, common variant' hypothesis10, that is, whether the proportion of heritability for common disease not explained by GWS loci is due to rare variants of large effect not tagged by the current generation of SNP arrays or undetected common variants of small effect2,11. It is therefore important to quantify the proportion of variance attributable to all common SNPs (defined here as those with minor allele frequency, MAF ≥ 0.01) used in GWAS. If common SNPs are the major contributor to heritability, then the concern about missing heritability is premature because the extent to which heritability is missing depends on the experimental sample size of GWAS12.

Estimation of the SNP-based heritability—the GREML approach

SNP-based heritability (or h2SNP) was initially defined as the proportion of phenotypic variance explained by all SNPs on a genotyping array13 and is therefore dependent on the number of SNPs on a SNP array. The concept has now been expanded to refer to the variance explained by any set of SNPs, for example, all genetic variants from in-depth whole-genome sequencing (WGS) or imputed from a reference14. Yang et al.13 used a mixed linear model (MLM) approach to estimate h2SNP in a GWAS data set of unrelated individuals and demonstrated that common SNPs on a genotyping array explain a large proportion (45%) of variance in height. Here 'unrelated individuals' means distantly related individuals rather than individuals with no genetic relatedness, as even random pairs of individuals drawn from a general population would share distant ancestors. Given the small ĥ2GWS (5%) and relatively large ĥ2SNP (45%), it was concluded that, for complex traits like height, there are likely a large number of common variants with effect sizes too small to pass the stringent GWS threshold (P < 5 × 10−8) in GWAS, even with sample sizes that were considered large at that time (n = 1,000 to 10,000 samples before 2010), consistent with a model of polygenic inheritance. It was subsequently predicted that more associated genetic variants could be discovered with larger sample sizes while keeping the same experimental design of GWAS. This prediction has been realized by recent studies with n > 100,000 for height, body mass index (BMI), schizophrenia, and many other traits and diseases15,16,17,18,19,20. Under a polygenic model, the amount of heritability unexplained by GWS loci depends on sample size12. The aforementioned comparison of 5% versus 80% for height in 2009 (ref. 4) became 16% versus 80% only five years later15. Given the nearly linear relationship between the number of GWS loci and the logarithm of sample size (log(n)) observed in published GWAS12 and the highly polygenic nature of most complex traits21,22, we predict that the shrinking of the gap between ĥ2GWS and ĥ2SNP will be less than linear with log(n) because the variance explained by SNPs discovered in studies with larger sample sizes tends to be smaller.

The approach of Yang et al.13 was subsequently termed genomic relatedness matrix (GRM) restricted maximum likelihood (GREML)23 and implemented in the GCTA software tool24 (Box 1). GREML shares features with a pedigree-based analysis (part 1 of the Supplementary Note) but is usually applied to a sample of unrelated individuals (note that this is also the usual experimental design for GWAS), and hence is unlikely to be confounded by common environmental effects (Fig. 1). For pairs of distantly related individuals, the amount of the genome shared is small and highly variable, and it is unlikely that pairs who share slightly more of the genome than average will also have greater sharing of environment in a relatively homogeneous population. The use of unrelated individuals also means that ĥ2SNP is unlikely to be contaminated with contributions from non-additive genetic effects, as the correlation between the additive and non-additive genetic relationships is tiny, whereas such contamination could be a problem in ĥ2 estimated from families depending on the study design. In addition, GREML can be applied to family data, but the resulting estimates should be interpreted with caution (part 3 of the Supplementary Note).

Figure 1: Interpretation of estimated genetic variance depends on ascertainment of the sample.

Shown in red are the pedigree-based heritability estimate (ĥ2ped) for height from 2,824 pairs of full siblings in the UK Biobank data61 (“5k related”; sibling correlation = 0.520), ĥ2SNP from a GREML analysis of 35,000 unrelated UK Biobank individuals using all the imputed SNPs in common with HapMap 3 (“35k unrelated”) and the estimates in between from GREML analyses in a mixed sample of unrelated individuals and close relatives (part 2 of the Supplementary Note). The difference between ĥ2ped and ĥ2SNP demonstrates the genetic variation (due to rare variants in particular) not tagged by common HapMap 3 SNPs and/or confounding in ĥ2ped from common environmental effects and non-additive genetic variation. Shown in blue are the results from the same analyses for a simulated phenotype based on a common environment model without genetic effect (part 2 of the Supplementary Note). Each bar is a single estimate, and each error bar indicates the SE of the estimate.

The GREML estimate directly quantifies the proportion of phenotypic variance explained by all SNPs used in GWAS and therefore provides the upper limit of h2GWS given the same experimental design. The information to estimate h2SNP comes from very small coefficients of genetic relationship for pairs of individuals, but small standard error (SE) for ĥ2SNP (part 4 of the Supplementary Note) can be achieved because of the large number of pairwise relationships (for example, 50 million for a study using 10,000 individuals), although these pairs are not independent. Subsequent work has extended the method to estimate h2SNP in disease data25 (part 5 of the Supplementary Note) and genetic correlation (rg) between traits26,27 (part 6 of the Supplementary Note). There are several caveats to estimating h2SNP using data from case–control studies (part 5 of the Supplementary Note) and interpreting the estimates on different scales (Fig. 2).
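The pairwise relationships that carry the information for GREML can be illustrated with a minimal sketch. It assumes the standardized-genotype form of the GRM, A = WW′/m, and uses simulated genotypes; the function names `grm` and `n_pairs` and all simulation settings are hypothetical and are not taken from GCTA.

```python
import numpy as np

def grm(genotypes: np.ndarray) -> np.ndarray:
    """Return A = W W' / m from an n x m matrix of 0/1/2 allele counts."""
    p = genotypes.mean(axis=0) / 2.0             # sample allele frequency per SNP
    keep = (p > 0) & (p < 1)                     # drop monomorphic SNPs
    x, p = genotypes[:, keep], p[keep]
    w = (x - 2 * p) / np.sqrt(2 * p * (1 - p))   # standardized genotypes w_ij
    return w @ w.T / w.shape[1]

def n_pairs(n: int) -> int:
    """Number of distinct pairs among n individuals."""
    return n * (n - 1) // 2

# Toy data: 200 unrelated individuals, 2,000 independent SNPs (illustrative only).
rng = np.random.default_rng(0)
freqs = rng.uniform(0.05, 0.5, size=2000)
geno = rng.binomial(2, freqs, size=(200, 2000)).astype(float)
A = grm(geno)

off_diag = A[np.triu_indices_from(A, k=1)]
print("pairs for n = 10,000:", n_pairs(10_000))
print("mean diagonal:", A.diagonal().mean())
print("variance of off-diagonal relationships:", off_diag.var())
```

For n = 10,000 the pair count is 49,995,000, which is the "50 million" quoted above. The off-diagonal relationships are small and centred near zero, which is why many pairs are needed for a precise estimate.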

Figure 2: Relationship between SNP-based heritability on the liability scale (h2SNP(l)) and SNP-based heritability estimated from case–control samples.

(a–d) The plots show that the same estimate of h2SNP(l) of 0.1 (a), 0.2 (b), 0.4 (c) or 0.6 (d) on the liability scale can correspond to a wide range of SNP-based heritability estimates from case–control samples on the observed 0–1 scale (part 5 of the Supplementary Note), depending on the proportion of cases in the sample (P) and the assumed lifetime risk of disease (K) used to transform the estimates to the liability scale. For each plotted line, the minimum value assumes a population sample with P = K. In real application, we advise investigating the sensitivity of estimates of h2SNP(l) to choice of K, but we find that the impact is small when K < 0.05. As shown in c and d, for a rare disease with high h2SNP(l), h2SNP(O) is expected to be larger than 1 because of the nonlinear relationship between genetic variance and phenotypic variance on the observed 0–1 scale.
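The dependence on P and K described in this legend can be sketched numerically. The code assumes the standard liability-threshold transformation, h2(l) = h2(O) × K²(1 − K)² / [z² P(1 − P)], where z is the standard normal density at the threshold for lifetime risk K. This form is an assumption here, since part 5 of the Supplementary Note is not reproduced, and the function names are hypothetical.

```python
from statistics import NormalDist

def observed_to_liability(h2_obs: float, K: float, P: float) -> float:
    """Convert observed 0-1 scale h2 to the liability scale (assumed formula)."""
    t = NormalDist().inv_cdf(1.0 - K)   # liability threshold for lifetime risk K
    z = NormalDist().pdf(t)             # normal density at the threshold
    return h2_obs * (K * (1.0 - K)) ** 2 / (z ** 2 * P * (1.0 - P))

def liability_to_observed(h2_liab: float, K: float, P: float) -> float:
    """Inverse transformation: liability scale back to the observed 0-1 scale."""
    t = NormalDist().inv_cdf(1.0 - K)
    z = NormalDist().pdf(t)
    return h2_liab * z ** 2 * P * (1.0 - P) / (K * (1.0 - K)) ** 2

# Same liability-scale value, different sample case proportions P.
for P in (0.01, 0.1, 0.3, 0.5):
    print(P, round(liability_to_observed(0.6, K=0.01, P=P), 3))
```

Under this formula, a rare disease (K = 0.01) with h2SNP(l) = 0.6 and a balanced case–control sample (P = 0.5) gives an observed-scale value above 1, and the smallest observed-scale value occurs at P = K, in line with the legend.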

Multiple terms and notations have been used to describe the parameter estimated by GREML. We recommend using the term 'SNP-based heritability' and the notation h2SNP. Unlike h2, which is a population-level parameter irrespective of experimental design, h2SNP is a parameter given a set of SNPs. We likewise believe that it is also necessary to use a specific notation, h2ped, to represent h2 estimated from pedigrees (including twins) because of the potential biases in pedigree-based ĥ2 due to confounding factors such as common environmental effects. We have shown above that h2SNP is by definition smaller than h2 because not all causal variants, in particular those with low frequency, can be perfectly tagged by SNPs used in GWAS (Fig. 3a and part 1 of the Supplementary Note). Here, by 'causal variant', we mean a genetic mutation that causes a different cascade of events in biological pathways and consequent phenotypic change rather than an associated variant identified from GWAS. In the particular case where h2SNP is defined as the variance explained by all such causal variants, h2SNP = h2. In reality, however, causal variants are unknown. An unbiased estimate of h2 might be achieved by estimating h2SNP from in-depth WGS data assuming that all causal variants have been sequenced and that there is no difference in LD between causal variants and other sequence variants14 (see below for more discussion).

Figure 3: Estimation of genetic variance depends on ascertainment of SNPs and genetic architecture.

(a) Estimates of h2SNP using SNPs on six different SNP panels for a simulated trait under two scenarios: (i) causal variants are random, with both common and rare variants (red), and (ii) causal variants are rare (blue) (see part 7 of the Supplementary Note for details of the simulation). The six SNP panels are the Affymetrix 6.0 array (Affy6), Affymetrix Axiom array (AffyAxiom), HapMap 3 Project (HM3), Illumina OmniExpress array (Illu1M), Illumina Omni2.5 array (Illu2M) and Illumina CoreExome array (IlluCoreE). (b) Effect of LD pruning on ĥ2SNP and the likelihood-ratio test (LRT) statistic. LD pruning was performed on the basis of HapMap 3 SNPs in PLINK (--indep-pairwise 50 5 r2) with the LD r2 threshold shown on the x axis. The last column with an r2 threshold of 1 represents the result without LD pruning (with all HapMap 3 SNPs). GREML analyses were performed using common SNPs on the HapMap 3 panel. (c) Distribution of MAFs of HapMap 3 variants after LD pruning with different r2 thresholds (no pruning for the r2 threshold of 1.0). In the box plots shown in a and b, the band inside the box is the median; the bottom and top of the box are the first and third quartiles, respectively (Q1 and Q3); the lower and upper whiskers are Q1 − 1.5 IQR and Q3 + 1.5 IQR, respectively, where IQR = Q3 − Q1; and the dots are the data not included between the whiskers.

Both GWAS and estimation of h2SNP by GREML use LD

GWAS relies, by design, on genotyped common SNPs tagging unknown causal variants in the same chromosomal region. Estimating how much trait variation is tagged when fitting all SNPs simultaneously also makes use of the LD between SNPs and unobserved causal variants. A sparse SNP array that does not cover common variation in the genome well is less likely to lead to the discovery of trait-associated variants (even with a large sample size), and fitting those SNPs together in a GREML analysis will result in a smaller proportion of phenotypic variance explained than with a denser SNP array (Fig. 3a). Because the maximum possible LD correlation between two genetic variants declines as their difference in MAF increases28, genetic variation at rare variants (MAF < 0.01) is unlikely to be well tagged by common SNPs on genotyping arrays (Fig. 3a). If causal variants are located in genomic regions with a different LD property from the rest of the genome, this can lead to bias in ĥ2SNP (refs. 14,29,30; see below for more discussion).

Interpretation and misinterpretation of the GREML model

There are several circumstances where the principle of GREML is misinterpreted and the method is misapplied, and this could potentially lead to misleading or confusing inference. GREML is based on a random-effect model (Box 1). If the number of SNPs (m) is smaller than the sample size (n), this model is similar to a linear regression analysis (fixed-effect model) in terms of estimating h2SNP (note that the adjusted R2 from multiple regression is an unbiased estimate of variance explained in a fixed-effect model). Such a hypothetical experiment would not rely on selecting SNPs to be individually GWS, nor would it rely on assumptions about the genetic architecture. In either a linear regression or random-effect model, the effect sizes of SNPs are fitted jointly (therefore accounting for LD among SNPs), meaning that the effect of any SNP is interpreted as the effect size of this SNP conditioning on the joint effects of all other SNPs. In GWAS, m is normally larger than n, in which case there is no unique solution to the fixed-effect model, a well-known overfitting problem in statistics. In a random-effect model, there is an additional assumption that the joint SNP effects u = {u1, u2, ..., um} follow a normal distribution with mean 0 and variance σ2u (see Box 1 for notations) so that the model parameters are estimable even when m is larger than n, where σ2u is interpreted as per-SNP genetic variance when all SNPs are fitted jointly, hence accounting for LD31. Therefore, σ2u is not consistent across models having different numbers of SNPs. There is a misunderstanding that GREML does not account for LD because it does not have a covariance matrix for u (ref. 32). This is incorrect. In fact, the LD correlations among SNPs have been modeled by fitting the SNP genotype matrix W, similar to that in linear regression analysis31. Because σ2u is the variance of a SNP effect conditioning on the joint effects of all other SNPs and wij is the standardized SNP genotype, the additive genetic variance captured by all SNPs is σ2g = mσ2u (Box 1).
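For reference, the random-effect model discussed in this paragraph can be written out as below. This is a sketch of the model as it is usually stated, with fixed effects omitted for brevity; Box 1 is not reproduced here, so the exact notation is an assumption.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{W}\mathbf{u} + \mathbf{e}, \qquad
  \mathbf{u} \sim N(\mathbf{0}, \mathbf{I}\sigma_u^2), \quad
  \mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2) \\
\operatorname{var}(\mathbf{y}) &= \mathbf{W}\mathbf{W}'\sigma_u^2 + \mathbf{I}\sigma_e^2
  = \mathbf{A}\sigma_g^2 + \mathbf{I}\sigma_e^2, \qquad
  \mathbf{A} = \mathbf{W}\mathbf{W}'/m, \quad \sigma_g^2 = m\sigma_u^2 \\
h^2_{SNP} &= \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}
\end{aligned}
```

Because the covariance between individuals enters only through WW′, LD among SNPs is carried by the genotype matrix itself, and no separate covariance matrix for u is required.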

In part 8 of the Supplementary Note, we list five scenarios where GREML (or the GCTA tool) is misused, resulting in potentially misleading results. In addition, there is often a question about whether the SNPs included in GREML analysis need to be pruned for LD. As discussed above, GREML accounts for LD so that LD pruning is not necessary (but see the later discussion on bias due to the nonrandom distribution of causal variants with respect to LD). LD pruning using a high r2 threshold might increase the estimate, but the likelihood of the model is not improved as compared to that without LD pruning (Fig. 3b). Caution is needed in interpreting the GREML estimate from pruned SNPs because of the change in the MAF spectrum of SNPs resulting from LD pruning (Fig. 3c). Changing the set of SNPs means that the underlying parameter being estimated (h2SNP for a set of LD-pruned SNPs) is different from the original parameter (h2SNP for all SNPs).

Bias due to the nonrandom distribution of causal variants with respect to LD

We have mentioned above that ĥ2SNP from WGS data could be a biased estimate of h2 if the LD property of causal variants is different from that of the other variants14,29,30,33. The unbiasedness of GREML in estimating h2 using WGS data depends on the ratio of the mean LD r2 between causal and non-causal variants to the mean LD r2 between non-causal variants14. Note that, because r2 is a function of MAF, a difference in MAF spectrum between causal and non-causal variants will lead to a difference in LD (MAF-mediated LD bias), resulting in a bias in ĥ2SNP. One solution is to stratify SNPs by MAF (MAF-stratified GREML, GREML-MS)14,34,35, which reduces bias in the estimate due to MAF-mediated LD bias. However, a more general approach is to not rely on a specific model of the interplay between allele frequency, effect size and LD, but instead stratify SNPs by MAF and LD jointly and estimate genetic variance with MAF–LD subsets. This approach, termed GREML-LDMS, appears to provide unbiased estimates of h2 as well as the contributions of common and rare variants to h2 in simulations based on WGS data, regardless of the underlying genetic architecture and distribution of causal variants with respect to MAF and LD14,36. We recommend the use of GREML-LDMS to estimate h2SNP in imputed data (part 9 of the Supplementary Note). The applications of GREML-LDMS to WGS data sets with rich phenotypes in the future will be able to provide nearly unbiased estimates of h2 in unrelated individuals and quantify the variance explained by all rare variants for a range of complex traits. However, large sample sizes are required to estimate h2SNP with useful precision because var(ĥ2SNP) depends on sample size and variant density37 (part 4 of the Supplementary Note); for example, a sample size of ∼33,000 is needed to obtain an SE of 0.02 for WGS data.
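The joint stratification behind GREML-LDMS can be sketched as a simple binning step. The MAF bin edges, the number of LD bins, the per-SNP LD measure and the function name `ldms_groups` are all illustrative assumptions rather than the settings used in GCTA or in ref. 14.

```python
import numpy as np

def ldms_groups(maf, ld_score, maf_edges=(0.01, 0.1, 0.2, 0.3, 0.4), n_ld_bins=4):
    """Assign each SNP to one MAF-by-LD group (hypothetical bin edges)."""
    maf = np.asarray(maf)
    ld_score = np.asarray(ld_score)
    maf_bin = np.digitize(maf, maf_edges)              # 0 .. len(maf_edges)
    # Interior quantile cut points split the LD measure into equal-sized bins.
    cuts = np.quantile(ld_score, np.linspace(0, 1, n_ld_bins + 1)[1:-1])
    ld_bin = np.digitize(ld_score, cuts)               # 0 .. n_ld_bins - 1
    return maf_bin * n_ld_bins + ld_bin                # one integer label per SNP

# Toy inputs: 1,000 SNPs with simulated MAF and a simulated per-SNP LD measure.
rng = np.random.default_rng(3)
maf = rng.uniform(0.001, 0.5, size=1000)
ld = rng.gamma(shape=2.0, scale=50.0, size=1000)
groups = ldms_groups(maf, ld)
print(np.bincount(groups, minlength=24))
```

In an LDMS-style analysis, one GRM would then be built from the SNPs in each group and all GRMs fitted jointly as separate variance components, so that each MAF–LD stratum receives its own variance estimate.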

Speed et al.29 proposed a method called LDAK to correct for the LD bias. The basic idea is to weight each SNP by a factor inversely proportional to its LD with SNPs nearby. This weighting strategy can introduce MAF bias because it gives more weight to SNPs with lower MAF (Supplementary Fig. 2 of Yang et al.14), as LD is a function of MAF28. The LDAK model implicitly assumed that the variance explained by a rare variant (for example, 0.001 < MAF < 0.01) is more than ten times larger than that explained by a common variant (for example, 0.1 < MAF < 0.5) (based on the LDAK weights calculated from a sequenced reference set14). This is an unrealistic model because it predicts that the power to detect rare variants would be orders of magnitude higher than that to detect common variants, a prediction not consistent with empirical results in the cases of human height15,38, schizophrenia17,39 and type 2 diabetes40. The LDAK-induced MAF bias can be substantial, especially when there is a large number of rare variants (as in a WGS data set), leading to an inflated estimate of h2SNP (ref. 14).

The LDAK model has recently been changed substantially41. Two new parameters have been added: one is a weighting according to MAF and the other is a weighting according to imputation accuracy. Although it is not the justification for these two new parameters, both give more weight to common variants than the original LDAK model41. The revised LDAK model is now more similar to GREML-LDMS14, but not identical, as Speed et al.41 estimate a higher SNP-based heritability from their empirical analyses on a range of traits. In simulation studies to compare the methods, the results depend on the model used to simulate the data. Unfortunately, we cannot be sure which is the closer-fitting model for any given trait. GREML-LDMS makes fewer assumptions about the relationship between causal variants, LD and MAF and thereby appears to be more robust than the revised LDAK method36, although at the expense of estimating more parameters. On balance, we conclude that this topic merits further investigation36, as the relationship between local LD, locus heterozygosity and additive genetic variance for complex traits has not yet been resolved, and indeed may differ across the genome and between traits.

Assumptions about the relationship between effect sizes and allele frequencies

Under an evolutionarily neutral model, the proportion of variance in a polygenic trait explained by all variants in a MAF bin is linearly proportional to the width of the MAF bin14 (the variance explained by a rare variant, on average, is tiny, but there are a large number of them). Therefore, a significant deviation of the observed variance explained in a MAF bin from the expected value is evidence that the trait has been under natural selection14,42. In GCTA-GREML, we standardize the SNP genotypes and assume that the effect size per standardized genotype (ui) follows a normal distribution. This implicitly assumes a larger per-allele effect (bi) for a SNP with lower MAF, consistent with a model of purifying selection where variants with larger effect sizes tend to be under higher selection and therefore are more likely to be at lower frequencies (for example, MAF < 0.1). There is an option in GCTA to run GREML assuming that effect size is independent of MAF (neutral model). However, the difference between the two models is trivial in GREML-MS analysis14. Moreover, GREML-MS allows the data to reveal the relationship between variance explained and MAF. One of the important extensions of GREML in the future is to estimate directly from the data a parameter to quantify the relationship between bi and allele frequency while fitting a mixture distribution to the joint effects of SNPs43 (part 10 of the Supplementary Note).
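The link between the per-allele effect bi and the effect per standardized genotype ui can be made explicit. The derivation below is a sketch that assumes Hardy–Weinberg equilibrium, with xij the allele count (0, 1 or 2) and pi the allele frequency of SNP i; σ2b is a symbol introduced here for illustration.

```latex
\begin{aligned}
w_{ij} &= \frac{x_{ij} - 2p_i}{\sqrt{2p_i(1-p_i)}}, \qquad
  x_{ij} b_i = w_{ij} u_i + 2p_i b_i
  \;\Rightarrow\; u_i = b_i \sqrt{2p_i(1-p_i)} \\
u_i \sim N(0, \sigma_u^2) &\;\Rightarrow\;
  \operatorname{var}(b_i) = \frac{\sigma_u^2}{2p_i(1-p_i)}, \qquad
  E\!\left[2p_i(1-p_i)\, b_i^2\right] = \sigma_u^2 \\
b_i \sim N(0, \sigma_b^2) &\;\Rightarrow\;
  E\!\left[2p_i(1-p_i)\, b_i^2\right] = 2p_i(1-p_i)\,\sigma_b^2
\end{aligned}
```

The second line is the default standardized model: every SNP is expected to explain the same variance, so per-allele effects must be larger at lower MAF. The third line is the alternative in which effect size is independent of MAF, so the expected variance explained by a SNP scales with its heterozygosity.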

Comparison with HE regression

As described in Box 1, the GREML analysis is based on an MLM that is equivalent to fitting the additive genetic values of all individuals, that is, y = g + e with var(g) = Aσ2g. The variance components in this model are usually estimated using the REML approach. However, the REML algorithm is computationally intensive (part 11 of the Supplementary Note). Alternatively, h2SNP can be estimated from Haseman–Elston (HE) regression37,44, that is, yi yj = b0 + b1 Aij + eij, where the slope b1 estimates h2SNP when the phenotypes are standardized to mean 0 and variance 1. The performance of GREML and HE regression has been compared using extensive simulations in Golan et al.45 in ascertained case–control studies where GREML estimates can be biased, especially when m / n is small and disease prevalence is low. We also performed simulation to compare the two methods with an emphasis on the SE under a polygenic model (part 12 of the Supplementary Note). HE regression is computationally much more efficient but slightly less powerful than REML, as the SE of ĥ2SNP from HE regression is larger than that from REML (Supplementary Table 1 and part 12 of the Supplementary Note). The small difference in SE between the methods might not be important when the sample size becomes very large. For example, given ĥ2SNP > 0.1, whether the SE is 0.01 (REML) or 0.015 (HE regression) does not make any difference in statistical inference of whether h2SNP = 0. HE regression can also be used to estimate multiple genetic components, for example, multiple sets of SNPs stratified by MAF or chromosome (Fig. 4), or to estimate genetic correlations between traits (Supplementary Table 2). These analyses have been implemented in the latest version of GCTA (see URLs). In addition, phenotype correlation–genotype correlation (PCGC) regression is an implementation of HE regression designed for disease data to attenuate the biases in ascertained case–control studies22,45 (see URLs).
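HE regression as written above, yi yj = b0 + b1 Aij + eij, can be sketched on simulated data. The example assumes standardized phenotypes, so that the slope b1 is read as the estimate of h2SNP, and uses unweighted ordinary least squares over all distinct pairs; the function name and simulation settings are hypothetical and this is not the GCTA implementation.

```python
import numpy as np

def he_regression(y: np.ndarray, A: np.ndarray):
    """Regress phenotype cross-products on off-diagonal GRM entries."""
    y = (y - y.mean()) / y.std()                # standardize the phenotype
    iu = np.triu_indices_from(A, k=1)           # each pair i < j counted once
    prod = np.outer(y, y)[iu]                   # y_i * y_j
    b1, b0 = np.polyfit(A[iu], prod, 1)         # slope first, then intercept
    return b0, b1

# Simulated polygenic trait: every SNP has a small effect (illustrative only).
rng = np.random.default_rng(1)
n, m, h2 = 1000, 500, 0.5
p = rng.uniform(0.1, 0.5, size=m)
x = rng.binomial(2, p, size=(n, m)).astype(float)
w = (x - x.mean(axis=0)) / x.std(axis=0)        # standardize by sample mean and s.d.
A = w @ w.T / m                                 # GRM
u = rng.normal(0.0, np.sqrt(h2 / m), size=m)    # per-SNP effects, variance h2 / m
y = w @ u + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)

b0, b1 = he_regression(y, A)
print("HE estimate of h2SNP:", b1)
```

The estimate requires only a single pass over the pairs, which is why HE regression is much cheaper than iterating REML. A jackknife over individuals would be one way to attach an SE to b1, but that step is not shown.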

Figure 4: Multiple-component GREML or HE regression for sets of SNPs stratified by MAF.

Results are shown as ĥ2SNP with SE (error bar) in each MAF group averaged over 200 simulation replicates using ∼11,500 unrelated individuals (SNP-based relatedness < 0.05) and ∼550,000 genotyped SNPs after standard quality controls. In each simulation replicate, 1,000 SNPs were selected at random as causal variants with their effects sampled from a standard normal distribution with mean 0 and variance 1. The true heritability was 0.5 (roughly 0.1 per MAF bin). The SE of the estimate from HE regression was calculated using the jackknife approach where one individual was left out at a time.

Non-additive genetic variation

The GREML approach has been extended to estimate dominance genetic variance tagged by SNPs in unrelated individuals on the basis of a classical quantitative genetics model46. Similar to the additive GREML method, the dominance GREML model fits the additive and dominance effects of all SNPs as two sets of random effects in an MLM. This is an orthogonal model because the additive and dominance genotype variables and, thereby, the additive and dominance GRMs are independent. On average, across 79 quantitative traits, additive genetic variation explained ∼15% of the phenotypic variance and dominance genetic variation explained ∼3% of the variance46. The ratio of additive to dominance variance is consistent with what is expected from theory47. The method can be further extended to estimate genetic variance attributable to epistasis48 on the basis of the classical quantitative genetics model49, y = gA + gD + gAA + gAD + gDD + e, where gA, and gD are the additive and dominance genetic values of an individual and gAA, gAD and gDD are additive-by-additive, additive-by-dominance and dominance-by-dominance epistatic genetic values, respectively. However, the sample size will need to be very large to get a precise estimate of epistatic variance because the variance in the epistatic genetic relationship between unrelated individuals is very small. For instance, the genetic relationship for gAA is A2ij, which has a variance of 2[var(Aij)]2 (ref. 49). For HapMap 3 SNPs, var(Aij) ≈ 2.0 × 10−5 so that the variance in genetic relationship for gAA is ∼1.0 × 10−9, meaning that over 1 million unrelated individuals will be needed to estimate the variance explained by gAA with SE <0.05 (>4 million unrelated individuals to get SE <0.01). The variance in the dominance genetic relationship is smaller than for the additive genetic relationship. Therefore, it will be even more difficult to estimate variance for gAD or gDD.
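The sample-size arithmetic in this paragraph can be checked with a short sketch. It assumes the approximation var(ĥ2) ≈ 2 / [n² var(relationship)] for unrelated individuals, which we take to be the approximation theory referred to elsewhere in the text (ref. 37); the formula and the function names are assumptions, since the Supplementary Note is not reproduced here.

```python
import math

def se_from_n(n: float, var_rel: float) -> float:
    """Approximate SE of a variance-component estimate (assumed formula)."""
    return math.sqrt(2.0 / (n ** 2 * var_rel))

def n_for_se(se: float, var_rel: float) -> float:
    """Sample size needed to reach a target SE under the same approximation."""
    return math.sqrt(2.0 / (se ** 2 * var_rel))

var_A = 2.0e-5               # var(Aij) for HapMap 3 SNPs, as quoted in the text
var_AA = 2.0 * var_A ** 2    # variance of the gAA relationship, 2[var(Aij)]^2

print("var of gAA relationship:", var_AA)                       # 8e-10, i.e. ~1e-9
print("additive SE at n = 10,000:", se_from_n(10_000, var_A))
print("n for SE 0.05 (gAA):", n_for_se(0.05, var_AA))
print("n for SE 0.01 (gAA, rounded variance):", n_for_se(0.01, 1.0e-9))
```

Under this approximation the required sample sizes come out at roughly one million for an SE of 0.05 and several million for an SE of 0.01, the same order as the figures quoted above; the exact values depend on whether the unrounded (8 × 10−10) or rounded (1.0 × 10−9) variance is used.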

Estimating h2SNP and rg from GWAS summary data

We have discussed above the MLM-based approaches to estimate h2SNP using individual-level GWAS data. There are other methods that are able to estimate h2SNP from GWAS summary data (estimated SNP effects and their SE for all SNPs analyzed in a study)50. For example, the AVENGEME method uses maximum likelihood to estimate the genetic variance of a trait, the proportion of genetic variants affecting the trait and the genetic covariance (and therefore genetic correlation) between traits from the test statistic for association between phenotype and polygenic risk score (PRS)51,52. We can also estimate h2SNP directly from summary data using the deviation of the observed χ2 test statistic for a SNP from its expected value under the null hypothesis of no association53 (part 13 of the Supplementary Note). This is the basic principle of the recently developed LD score regression approach (LDSC)54. This approach requires only summary data from GWAS because LD scores can be estimated from a reference sample (for example, the 1000 Genomes Project). LDSC has been extended to estimate rg between traits using summary data55, which allows the traits to be measured on different samples regardless of whether there is an overlap between samples, and to partition h2SNP by functional annotation56. This method provides great flexibility for researchers to estimate rg between any two GWAS data sets. Both GREML and LDSC aim to estimate the variance explained by all SNPs used in GWAS. However, there are distinct differences between the two methods. LDSC is orders of magnitude faster than GREML, and the computing time for LDSC does not scale up with sample size. LDSC only requires summary data, which allows the reanalysis of summary data available from published meta-analyses. There are also limitations for LDSC. LDSC is not applicable in estimating the variance explained by rare variants (for example, MAF < 0.01) using either imputed or WGS data36 nor the variance explained by SNPs in small genomic regions (although the latter has been overcome by the HESS method developed recently53), and it is more sensitive to the genetic architecture of a trait (Supplementary Table 3). A previous study showed that estimates from LDSC are consistently smaller than those from GREML in the same data set57, which is likely owing to errors in LD scores estimated from the reference (by default, LDSC uses LD scores from HapMap 3 SNPs in the 1000 Genomes Project). We therefore advise using LD scores from the data used to generate the GWAS summary statistics. Although this may not be possible for published summary statistics, it should be possible for large cohorts such as the UK Biobank. It is noteworthy that LDSC will suffer bias in a similar way to GREML if causal variants are not randomly distributed with respect to LD. The estimate of rg from bivariate LDSC is consistent with that from bivariate GREML, but the jackknife SE of r̂g from LDSC is larger than that expected from the approximation theory37,55,57.
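The principle behind LD score regression can be sketched with simulated summary statistics. The example assumes the usual expectation E[χ2j] = 1 + Na + (N h2 / M) ℓj, where ℓj is the LD score of SNP j, N the sample size, M the number of SNPs and a a confounding term (set to zero here). It fits an unweighted least-squares line, whereas the LDSC software uses regression weights and a block jackknife, so this is an illustration of the idea rather than of the tool; all names and settings are hypothetical.

```python
import numpy as np

def ldsc_h2(chi2: np.ndarray, ld_score: np.ndarray, n: int, m: int):
    """Estimate h2 from the slope of chi-square statistics on LD scores."""
    slope, intercept = np.polyfit(ld_score, chi2, 1)   # slope first, then intercept
    return slope * m / n, intercept

# Toy summary statistics generated under the assumed model (no confounding).
rng = np.random.default_rng(2)
m_snps, n_samples, true_h2 = 200_000, 100_000, 0.4
ld_score = 1.0 + rng.gamma(shape=2.0, scale=50.0, size=m_snps)
expected_chi2 = 1.0 + n_samples * true_h2 * ld_score / m_snps
chi2 = expected_chi2 * rng.chisquare(1, size=m_snps)   # z ~ N(0, E[chi2]), chi2 = z^2

h2_hat, intercept = ldsc_h2(chi2, ld_score, n_samples, m_snps)
print("estimated h2:", h2_hat, "intercept:", intercept)
```

Only per-SNP test statistics and LD scores enter the calculation, so the cost depends on the number of SNPs rather than on the GWAS sample size. Any error in the LD scores feeds directly into the slope, which is one way to see why LD scores from a mismatched reference could bias the estimate.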

Summary

We have provided a perspective of the methods for estimating SNP-based heritability in unrelated individuals using GWAS data. We emphasized that the GREML approach accounts for LD when estimating h2SNP and actually uses LD to tag causal variants if they are not observed. We discussed the concepts and assumptions of the methods and scenarios under which the estimates could be biased, the methods could be misused and the results could be misinterpreted. We further discussed the extensions and applications of the methods in large data sets in the future (Box 2). These future directions could expand understanding of the genetic architecture for human complex traits and inform the design of future experiments to fully dissect genetic variation and genetic correlations.

URLs. GCTA, http://cnsgenomics.com/software/gcta/; PCGC, https://www.hsph.harvard.edu/alkes-price/software/; LDSC, https://github.com/bulik/ldsc.

Author contributions

All authors conceived and designed the project. J.Y., J.Z. and N.R.W. performed the analyses. All authors wrote the manuscript.