Genetic Basis of Complex Genetic Disease: The Contribution of Disease Heterogeneity to Missing Heritability
The genetic basis of complex genetic disease can be quantified by heritability, which is an estimate of the relative importance of genetic and non-genetic factors in contributing to differences between individuals for any given trait. Heritability is estimated from phenotypic records in data sets of families and represents contributions from genetic variants across the frequency spectrum and of any kind and function. Advances in technology allow direct interrogation of some kinds of DNA variants. Specific DNA variants identified in the era of genome-wide association studies explain only a fraction of the heritability estimated from family studies, as do less common variants identified through whole exome sequencing. If true effect sizes of risk variants are small, studies to date may be underpowered to detect individual risk variants; but the studies may be well-powered to detect the total contribution from common risk variants, and this has explained some of the missing heritability. Here we review explanations for the so-called “still-missing heritability” and focus particularly on the issue of genetic heterogeneity.
Keywords: Genome-wide association studies; Genetic heterogeneity; Whole exome sequencing; Genetic architecture; De novo mutations
Complex genetic diseases are those that tend to “run” in families yet show no clear pattern of inheritance. Most common diseases are complex genetic diseases, including cancers, heart disease, immune disorders, and psychiatric disorders. Our understanding of the causes of these diseases is limited, and this lack of knowledge has contributed to the limited progress made in the development of new treatments. Traditionally, the genetic basis of disease has been quantified by measuring the increased risk of disease in relatives of those affected. Evidence for a genetic risk shared between relatives implies that DNA risk variants are passed from parent to child. This knowledge has underpinned the philosophy that identification of genetic risk variants is a worthy goal that may open new doors towards understanding the causality of disease, which in turn may lead to new treatments. Strategies to identify DNA risk variants have been dictated by available genotyping technologies. Advances in technology over the last decade have delivered methodologies, notably genome-wide association studies (GWAS) and whole exome sequencing (WES), that have started to deliver DNA risk variants associated with disease. Here we review the portfolio of strategies used to understand the genetic contribution to complex disease. We close by focussing on the issue of genetic heterogeneity of disease.
Evidence for a genetic contribution to disease comes from measurement of an increased risk of the disorder in relatives of those affected. However, such increased risks need to be interpreted with care, since close relatives share a common family environment, so that recurrence risk in relatives may also reflect non-genetic factors. Estimates of risks of disease in different types of relatives (e.g., monozygotic and dizygotic twins, first and second degree relatives) are needed to disentangle genetic from non-genetic factors. These risks to relatives are used to estimate heritability on the liability scale [1, 2]. Liability to disease is a non-observable (latent) continuous variable, with those ranking highest on liability being affected. Heritability on the liability scale, h2, quantifies the proportion of variance of liability to disease attributable to inherited genetic factors. Comparison of the relative importance of genetic factors for different disorders is more intuitive on this scale, particularly when comparing diseases of different lifetime risk. Heritability accounts for genetic factors that are additive on the liability scale; these genetic factors combine non-additively on the disease scale, so that the probability of disease is many times higher for individuals carrying a high number of risk alleles compared to those carrying only half the number. Non-genetic factors include identifiable (but perhaps not recorded) environmental factors and measurement error, but also unidentifiable factors which form an intrinsic stochastic noise. Estimates of heritability may vary between populations and across ages, and may depend on whether non-genetic factors have been recorded and included in the analysis. They depend on the baseline risk of disease in the population, and the degree of sampling variance is often overlooked. Hence, in reality, heritability estimates should be viewed as pragmatic benchmarks representing evidence for low, moderate or high contributions of genetic effects.
While heritability on the liability scale expresses the proportion of the variance in liability that is attributable to genetic factors, it tells us nothing about the underlying genetic architecture of the disease in terms of number, frequency, and effect sizes of individual causal variants, nor about the mode of action of causal loci (i.e., additive or non-additive). Lack of evidence that complex disease cases represented single gene disorders generated theories of polygenicity. Empirical results of the last decade have provided support for a polygenic model. Under a polygenic model, the liability to disease reflects multiple genetic and non-genetic effects acting additively. Hence, liabilities are assumed to be normally distributed, because such a distribution results from many additively acting effects. All individuals in the population carry some genetic risk variants and likely experience some non-genetic risk factors, but most individuals in the population are not affected. Disease status results when the cumulative load exceeds a burden of risk threshold.
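The threshold idea can be made concrete with a small sketch. It assumes liability is standard normal in the population, that a fraction equal to the lifetime prevalence lies above the threshold, and that the non-genetic part of liability is normal with variance 1 - h2. The function names and the example values (prevalence 1 %, h2 of 0.8) are illustrative choices, not taken from the studies reviewed here.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumption: liability is standard normal in the population


def liability_threshold(prevalence):
    """Threshold t such that a fraction `prevalence` of liabilities exceed t."""
    return STD_NORMAL.inv_cdf(1.0 - prevalence)


def risk_given_genetic_liability(g, prevalence, h2):
    """Probability of disease for a person whose genetic liability is g.

    Assumes the non-genetic part of liability is normal with variance 1 - h2.
    """
    t = liability_threshold(prevalence)
    residual_sd = (1.0 - h2) ** 0.5
    return 1.0 - STD_NORMAL.cdf((t - g) / residual_sd)


if __name__ == "__main__":
    K, H2 = 0.01, 0.8  # illustrative values, not estimates for any real disease
    for g in (0.0, 0.5, 1.0, 1.5):
        risk = risk_given_genetic_liability(g, K, H2)
        print(f"genetic liability {g:.1f}: risk {risk:.2e}")
```

Under these assumptions, doubling the genetic liability raises the risk by far more than a factor of two, which is one way to see why effects that are additive on the liability scale are non-additive on the disease scale.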
De Novo Mutations
De novo mutations are genetic variants present in the DNA of a child but not in that of their parents. Genotyping of parents and their child is used to identify de novo mutations. Whole exome sequencing has shown that de novo mutations play an important role in Mendelian diseases. Effect sizes of de novo mutations, that is, their contributions to the risk of disease, are expected to span both small and large values. This is not inconsistent with the expectation that genetic variants of large effect size are more likely to be de novo, as they have not been subject to selection. Sequencing studies of the last decade have demonstrated that de novo mutations play an important causal role in some complex diseases and disorders for some individuals (for example, mental retardation and autism). For other diseases and disorders there is evidence of an increased burden of de novo mutations in cases compared to controls, without it being possible to distinguish the de novo mutations that are individually causal and increase risk of disease from those that are benign. In rare instances, somatic de novo mutations have been shown to be causal. De novo mutations are not shared between relatives (except possibly between identical twins, or between siblings as a result of germ line mutations in sperm) and so rarely contribute to explaining heritability.
Familial vs Sporadic
It is not uncommon for cases to be referred to as either “familial” or “sporadic”, reflecting whether there is a known family history for the disease. In childhood disorders, cases are similarly referred to as multiplex or simplex depending on the presence or the absence of other affected children. In common parlance, the terms tend to be interpreted as implying a genetic or non-genetic etiology of disease, but this can be misleading. On the one hand, knowledge of family history can be used in optimal experimental design. For example, genetic studies designed to identify de novo mutations would be optimised by genotyping of cases with no family history of disease. In contrast, genetic studies designed to identify common genetic risk variants are optimised by prioritising selection of cases with family history and controls with no family history of disease. On the other hand, it is frequently overlooked that under a polygenic genetic architecture the majority of cases are not expected to report family history. For example, for a disease with lifetime prevalence of 1 % and heritability of 80 %, less than half of cases are expected to report family history when considering all first, second, and third generation relatives. Likewise, for the same disease more than 60 % of monozygotic twins are expected to be discordant for disease status.
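The twin figure can be checked approximately under a liability threshold model. The sketch below assumes that the liabilities of monozygotic twins are bivariate normal with correlation equal to h2 (purely additive genetic effects, no shared environment) and that "discordant" means the co-twin of an affected proband is unaffected. Both assumptions, and the function names, are ours and are not stated in the text above.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_both_affected(prevalence, liability_corr, steps=4000, upper=10.0):
    """P(both relatives exceed the threshold) for bivariate normal liabilities.

    Integrates phi(x) * P(relative exceeds t | proband liability x) over x
    from the threshold t to `upper` using the trapezoid rule.
    Requires -1 < liability_corr < 1.
    """
    t = STD_NORMAL.inv_cdf(1.0 - prevalence)
    cond_sd = (1.0 - liability_corr ** 2) ** 0.5
    width = (upper - t) / steps

    def integrand(x):
        tail = 1.0 - STD_NORMAL.cdf((t - liability_corr * x) / cond_sd)
        return STD_NORMAL.pdf(x) * tail

    total = 0.5 * (integrand(t) + integrand(upper))
    for i in range(1, steps):
        total += integrand(t + i * width)
    return total * width


def mz_probandwise_discordance(prevalence, h2):
    """Share of affected probands whose MZ co-twin is unaffected.

    Assumption: the MZ liability correlation equals h2.
    """
    concordance = prob_both_affected(prevalence, h2) / prevalence
    return 1.0 - concordance


if __name__ == "__main__":
    print(f"{mz_probandwise_discordance(0.01, 0.8):.2f}")
```

With prevalence 1 % and h2 of 80 %, this calculation gives a discordance above 60 %, in line with the statement above, although the exact value depends on the definitions assumed here.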
Advances in genotyping technology allow cheap genome-wide interrogation of single nucleotide polymorphisms (SNPs). GWAS identify associations between SNPs and disease. Reported results from association analyses include risk allele frequency (RAF), effect size (expressed for disease as the odds ratio, OR) and p-value of association. The contribution of these associated genetic variants to variance can be calculated on the liability scale, which allows the contribution of each locus to risk to be compared directly on the same scale as that on which heritability is reported. Assuming independence (and ignoring potential overestimation of effect size due to winner’s curse), the contributions of the genome-wide significant (GWS) loci can be summed to determine the proportion of variance in liability explained by these loci together, thus quantifying the effects of all genome-wide significant SNPs (hGWS2).
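One way to carry out such a calculation is sketched below. It is an illustration under stated assumptions rather than the specific method cited above: genotypes are in Hardy-Weinberg proportions, the odds ratio acts multiplicatively per risk allele, each genotype class has its own risk, and liability within each genotype class is normal with unit variance. The function names and example loci are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def genotype_risks(prevalence, raf, odds_ratio):
    """Genotype frequencies and disease risks for 0, 1 and 2 risk alleles.

    The baseline odds are found by bisection so that the frequency-weighted
    average risk equals the population prevalence.
    """
    freqs = ((1 - raf) ** 2, 2 * raf * (1 - raf), raf ** 2)  # Hardy-Weinberg

    def risks(base_odds):
        odds = [base_odds * odds_ratio ** k for k in range(3)]
        return [o / (1.0 + o) for o in odds]

    lo, hi = 0.0, 1e6
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if sum(f * r for f, r in zip(freqs, risks(mid))) < prevalence:
            lo = mid
        else:
            hi = mid
    return freqs, risks((lo + hi) / 2.0)


def liability_variance_explained(prevalence, raf, odds_ratio):
    """Proportion of liability variance explained by one locus (a sketch)."""
    freqs, risks = genotype_risks(prevalence, raf, odds_ratio)
    # Mean liability of each genotype class, up to a shared constant,
    # assuming unit residual variance within each class.
    means = [-STD_NORMAL.inv_cdf(1.0 - r) for r in risks]
    grand_mean = sum(f * m for f, m in zip(freqs, means))
    var_locus = sum(f * (m - grand_mean) ** 2 for f, m in zip(freqs, means))
    return var_locus / (1.0 + var_locus)


if __name__ == "__main__":
    # Hypothetical (RAF, OR) pairs for independent GWS loci; not real results.
    loci = [(0.3, 1.2), (0.1, 1.3), (0.5, 1.1)]
    h2_gws = sum(liability_variance_explained(0.01, p, o) for p, o in loci)
    print(f"summed liability variance explained: {h2_gws:.4f}")
```

Summing over loci as in the usage example relies on the independence assumption noted above, and it inherits any winner's curse bias in the reported odds ratios.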
Given the stringent significance threshold applied, the ability to detect risk loci (i.e., the power) depends on whether the sample size is sufficient given the true effect sizes. When the first GWAS were planned, the distribution of expected effect sizes was unknown and sample sizes were powered to detect OR > ~1.3. The first generation of GWAS yielded few GWS results, with hGWS2 much less than h2. This difference has been termed “missing heritability”. As sample sizes have increased, the number of GWS variants has increased for both quantitative traits and diseases (see Figure 2 in Visscher et al.), providing empirical evidence that common variants do play a role in complex genetic disease. Nonetheless, substantial missing heritability remains.
The observed increase in the number of significant association results as sample sizes have increased implies that the earlier studies were underpowered to detect the variants given their effect sizes. However, given that collection of larger samples is time consuming and expensive, can we be sure that the same will be true for other diseases? Statistical methods that combine quantitative and population genetic concepts to evaluate the contribution to variance of common SNPs across the whole genome without identifying them individually have been developed [19, 20, 21, 22, 23, 24]. These methods use people who are unrelated in the conventional sense of the word but who, given the finite global population size, share a proportion of their DNA by descent. The proportion of sharing between pairs of individuals can be estimated using genome-wide marker data, and that genomic similarity can be correlated with disease status to estimate genetic variation [20, 21, 25, 26]. By using distantly related individuals, a significant heritability tagged by common SNPs, hSNP2, is detected if case-case pairs and control-control pairs have higher genomic similarity than case-control pairs. For most disease traits studied, significant SNP heritabilities have been estimated, demonstrating that contributions from common variants exist, although the data sets analysed may have been underpowered to detect the individual small effects as GWS. Larger sample sizes are needed for individual detection. Hence, the polygenic analyses have been successful in identifying “hidden heritability”, i.e., the increase from hGWS2 to hSNP2. In theory, with sufficiently large sample size, hGWS2 can become as large as hSNP2.
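The genomic similarity between pairs of individuals can be sketched as follows. This shows only the first step, a genomic relationship matrix built from standardised allele counts with allele frequencies estimated from the sample. The standardisation used is a common convention that we assume here; the cited methods may differ in detail, and the variance estimation step that relates this matrix to disease status is not shown.

```python
def genomic_relationship_matrix(genotypes):
    """Pairwise genomic similarity from SNP genotypes.

    genotypes[j][i] is the count (0, 1 or 2) of a reference allele for
    person j at SNP i. Each SNP is centred by twice its sample allele
    frequency p and scaled by sqrt(2p(1-p)); the similarity of two people
    is the average product of their standardised genotypes over SNPs.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    standardised = [[] for _ in range(n)]
    for i in range(m):
        p = sum(row[i] for row in genotypes) / (2 * n)  # sample allele frequency
        if p <= 0.0 or p >= 1.0:
            continue  # monomorphic SNPs carry no information
        sd = (2 * p * (1 - p)) ** 0.5
        for j in range(n):
            standardised[j].append((genotypes[j][i] - 2 * p) / sd)
    used = len(standardised[0])
    if used == 0:
        raise ValueError("no polymorphic SNPs in the data")
    return [
        [sum(a * b for a, b in zip(standardised[j], standardised[k])) / used
         for k in range(n)]
        for j in range(n)
    ]


if __name__ == "__main__":
    # Toy data: four people, four SNPs (hypothetical values).
    toy = [[0, 1, 2, 1], [0, 1, 2, 0], [2, 1, 0, 1], [1, 2, 0, 2]]
    for row in genomic_relationship_matrix(toy):
        print(" ".join(f"{value:6.2f}" for value in row))
```

In real analyses the off-diagonal values between conventionally unrelated people are very small, which is why large samples and genome-wide marker sets are needed before they carry usable information.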
Explanations for the Still-Missing Heritability
Over-estimation of heritability from family studies
Variants not tagged by common SNPs
Disease heterogeneity is a possible explanation for still-missing heritability. We have previously noted, for psychiatric disorders at least, that heritabilities estimated from large population samples are lower than those estimated from twin studies. We argued that this may reflect greater diagnostic heterogeneity in large cohorts compared to the carefully collected twin samples, but that the large cohorts may be more representative of the samples currently brought together for analysis in genetic studies.
Exploring the Impact of Disease Heterogeneity
Impact on h2
Impact on hGWS2
Impact on hSNP2
The impact of analysing a disease composite to estimate hSNP2 can also be considered in terms of disease misclassification. The estimated hSNP2 is a weighted average of the true hSNP2 parameters of each underlying disease and the SNP-covariance (counted twice). So if the two contributing diseases have equal true hSNP2 and are independent, the estimated value from the composite disease will be 0.5 hSNP2. We conclude that disease heterogeneity can generate underestimates of hSNP2 compared to when disease classes are genetically homogeneous.
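The weighted average can be written out as a short calculation. The weights used below (squared case proportions for the two diseases and twice their product for the covariance) are our assumed reading of the statement above. We chose them because they sum to one and reproduce the factor of 0.5 for two equally represented, independent diseases; the exact form in the cited work may differ.

```python
def composite_snp_h2(frac_disease_1, h2_1, h2_2, snp_covariance=0.0):
    """Expected SNP heritability estimate for a composite of two diseases.

    frac_disease_1 is the proportion of composite cases with disease 1.
    Assumed weights: w1^2 and w2^2 on the two SNP heritabilities and
    2 * w1 * w2 on the SNP covariance (the covariance "counted twice").
    """
    w1 = frac_disease_1
    w2 = 1.0 - frac_disease_1
    return w1 ** 2 * h2_1 + w2 ** 2 * h2_2 + 2.0 * w1 * w2 * snp_covariance


if __name__ == "__main__":
    # Two independent diseases, equally represented, each with hSNP2 of 0.3
    # (an illustrative value): the composite estimate is halved.
    print(composite_snp_h2(0.5, 0.3, 0.3, 0.0))
    # Same diseases with SNP covariance equal to hSNP2 (genetic correlation 1):
    # the composite behaves as a single disease.
    print(composite_snp_h2(0.5, 0.3, 0.3, 0.3))
```

Under this reading, the shortfall shrinks as the genetic correlation between the two diseases rises and disappears when they are genetically the same disease.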
The genetic basis of complex genetic disease can be quantified by heritability, which is an estimate of the relative importance of genetic and non-genetic factors in contributing to differences between individuals for any given trait. Heritability is estimated from phenotypic records in data sets of families and represents contributions from genetic variants across the frequency spectrum and genetic variants of any kind and function. Advances in technology allow direct interrogation of some kinds of DNA variants. Specific DNA variants identified in the era of genome-wide association studies explain only a fraction of the heritability estimated from family studies (hGWS2), as do less common variants identified through whole exome sequencing. If true effect sizes of risk variants are small, then studies to date may be underpowered to detect individual risk variants, but they may be well-powered to detect the total contribution from common risk variants (hSNP2), and such analysis has helped to explain some of the missing heritability. Here we have reviewed explanations for the so-called “still-missing heritability” and focused particularly on the issue of disease heterogeneity.
To explore the impact of disease heterogeneity on estimates of h2, hGWS2 and hSNP2, we considered an extreme example of two independent, indistinguishable but equally genetic diseases being lumped together as a disease composite. We have shown that under this scenario the estimates of h2 from family data are nearly as high as the heritabilities of the contributing individual diseases, yet the estimates of hGWS2 and hSNP2 are severely compromised. In reality, this toy example may be too extreme, as real presentations of composite diseases may reflect diseases that are genetically correlated rather than totally independent. For example, Crohn’s Disease and ulcerative colitis are estimated to have a genetic correlation based on SNP data of 0.6; the vast majority of SNPs identified in GWAS affect both diseases, but a handful of them have effects in the opposite direction. Clearly, as the genetic correlation between the two contributing diseases approaches 1, the two diseases merge into a single genetic disease entity. For genetically correlated diseases, the power to detect associated loci may be increased by considering the disease composite for loci contributing to both diseases and decreased for other loci. Consideration of these factors can quickly lead to philosophical musings on the definition of disease, since even for a single genetic disease under a polygenic model, each individual could carry a unique portfolio of risk loci. In the genomics era, a disease definition may be at the pathway level, whereby a single genetic disease considers different portfolios of risk loci impacting the same pathway, or, more practically, the class of individuals who respond to the same treatment.
NR Wray is funded by the Australian National Health and Medical Research Council grants 61602 and 1050218.
Compliance with Ethics Guidelines
Conflict of Interest
NR Wray and R Maier both declare no conflicts of interest.
Human and Animal Rights and Informed Consent
This article does not contain any studies with human or animal subjects performed by any of the authors.
- 17. Witte JS, Visscher PM, Wray NR. The contribution of genetic variants to disease depends on the ruler. Nat Rev Genet. 2014. doi:10.1038/nrg3786.
- 19. Deary IJ, Yang J, Davies G, Harris SE, Tenesa A, Liewald D, et al. Genetic contributions to stability and change in intelligence from childhood to old age. Nature. 2012.
- 44. Chen G-B, Lee SH, Montgomery GW, Wray NR, Radford-Smith GL, Visscher PM. Estimation and partitioning of (co)heritability of inflammatory bowel disease from GWAS and immunochip data. Submitted.