Introduction

Complex genetic diseases are those that tend to “run” in families yet show no clear pattern of inheritance. Most common diseases are complex genetic diseases, including cancers, heart disease, immune disorders, and psychiatric disorders. Our understanding of the causality of these diseases is limited, and this lack of knowledge has contributed to the limited progress made in the development of new treatments. Traditionally, the genetic basis of disease has been quantified by measuring the increased risk of disease in relatives of those affected. Evidence for a genetic risk shared between relatives implies that DNA risk variants are passed from parent to child. This knowledge has underpinned the philosophy that identification of genetic risk variants is a worthy goal that may open new doors towards understanding the causality of disease, which in turn may lead to new treatments. Strategies to identify DNA risk variants have been dictated by available genotyping technologies. Advances in technology of the last decade have delivered methodologies, notably genome-wide association studies (GWAS) and whole exome sequencing (WES), that have started to deliver DNA risk variants associated with disease. Here we review the portfolio of strategies used to understand the genetic contribution to complex disease. We close by focussing on the issue of genetic heterogeneity of disease.

Heritability

Evidence for a genetic contribution to disease comes from measurement of an increased risk of the disorder in relatives of those affected. However, such increased risks need to be interpreted with care, since close relatives share a common family environment so that recurrence risk in relatives may also reflect non-genetic factors. Estimates of risks of disease in different types of relatives (e.g., monozygotic and dizygotic twins, first and second degree relatives) are needed to disentangle genetic from non-genetic factors. These risks to relatives are used to estimate heritability on the liability scale [1, 2]. Liability to disease is a non-observable or latent, continuous variable with those ranking highest on liability being affected. Heritability on the liability scale, h², quantifies the proportion of variance of liability to disease attributable to inherited genetic factors. Comparison of the relative importance of genetic factors for different disorders is more intuitive on this scale, particularly when comparing diseases with different lifetime risks. Heritability accounts for genetic factors that are additive on the liability scale; these genetic factors combine non-additively on the disease scale [3], so that the probability of disease is many times higher for individuals carrying a high number of risk alleles compared to those carrying only half the number. Non-genetic factors include identifiable (but perhaps not recorded) environmental factors or measurement error, but also unidentifiable factors which form an intrinsic stochastic noise. Estimates of heritability may vary between populations and across ages, and may depend on whether non-genetic factors have been recorded and included in the analysis [4]. They depend on the baseline risk of disease in the population, and their sampling variance is often overlooked. Hence, in reality, heritability estimates should be viewed as pragmatic benchmarks representing evidence for low, moderate or high contributions of genetic effects.

Genetic Architecture

While heritability on the liability scale expresses the proportion of the variance in liability that is attributable to genetic factors, it tells nothing about the underlying genetic architecture of the disease in terms of number, frequency, and effect sizes of individual causal variants, nor of the mode of action of causal loci (i.e., additive or non-additive). Lack of evidence that complex disease cases represented single gene disorders generated theories of polygenicity [5]. Empirical results of the last decade provided support for a polygenic model [6]. Under a polygenic model, the liability to disease reflects multiple genetic and non-genetic effects acting additively. Hence, liabilities are assumed to be normally distributed, because such a distribution results from many additively acting effects. All individuals in the population carry some genetic risk variants and likely experience some non-genetic risk factors, but most individuals in the population are not affected. Disease status results when the cumulative load exceeds a burden of risk threshold.
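The liability threshold model described above can be written compactly. The block below is a sketch of the standard formulation only; the symbols (liability ℓ, genetic and non-genetic components g and e, threshold T, lifetime prevalence K, standard normal distribution function Φ) are notation introduced here for illustration and are not taken from the original text.

```latex
\begin{align}
  \ell &= g + e, \qquad g \sim N(0,\, h^2), \quad e \sim N(0,\, 1 - h^2), \\
  T &= \Phi^{-1}(1 - K), \qquad \text{affected} \iff \ell > T .
\end{align}
```

Liability is scaled to unit variance, so h² is directly the proportion of variance in liability attributable to additive genetic factors, and the threshold T is fixed by the lifetime prevalence.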

De Novo Mutations

De novo mutations are genetic variants present in the DNA of a child but not of their parents. Genotyping of parents and their child is used to identify de novo mutations. Whole exome sequencing has identified that de novo mutations play an important role in Mendelian diseases [7]. Effect sizes of de novo mutations, that is, their contributions to the risk of disease, are expected to range from small to large. This is not inconsistent with the expectation that genetic variants of large effect size are more likely to be de novo, as they have not been subject to selection. Sequencing studies of the last decade have demonstrated that de novo mutations play an important causal role in some complex diseases and disorders for some individuals [8] (for example, mental retardation [9] and autism [10]). For other diseases and disorders there is evidence of an increased burden of de novo mutations in cases compared to controls [11], although it is not yet possible to identify which of the de novo mutations are individually causal and increase risk of disease and which are benign [12]. In rare instances, somatic de novo mutations have been shown to be causal [13]. De novo mutations are not shared between relatives (except possibly between identical twins, or between siblings as a result of germ line mutations in sperm) and so rarely contribute to explaining heritability [14].

Familial vs Sporadic

It is not uncommon for cases to be referred to as either “familial” or “sporadic”, reflecting whether there is a known family history for the disease. In childhood disorders, cases are similarly referred to as multiplex or simplex depending on the presence or the absence of other affected children. In common parlance, the terms tend to be interpreted as implying a genetic or non-genetic etiology of disease, but this can be misleading. On the one hand, knowledge of family history can be used in optimal experimental design. For example, genetic studies designed to identify de novo mutations would be optimised by genotyping of cases with no family history of disease. In contrast, genetic studies designed to identify common genetic risk variants are optimised by prioritising selection of cases with family history and controls with no family history of disease. On the other hand, it is frequently overlooked that under a polygenic genetic architecture the majority of cases are not expected to report family history. For example, for a disease with lifetime prevalence of 1 % and heritability of 80 %, less than half of cases are expected to report family history when considering all first, second, and third generation relatives [15]. Likewise, for the same disease more than 60 % of monozygotic twins are expected to be discordant for disease status [16].
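The expectation for monozygotic twins can be illustrated with a short calculation under the liability threshold model. This is a minimal sketch, not the calculation of [16]: it assumes that the liabilities of monozygotic twins are bivariate normal with correlation equal to h² (no shared environment or non-additive effects), and it reports probandwise concordance. The function name `joint_tail` and the numerical integration settings are choices made here for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def joint_tail(threshold, rho, steps=2000, span=10.0):
    """P(X > t, Y > t) for a standard bivariate normal with correlation rho.

    Integrates pdf(x) * P(Y > t | X = x) over x in [t, t + span] with
    Simpson's rule (steps must be even).
    """
    resid_sd = sqrt(1.0 - rho * rho)
    width = span / steps
    total = 0.0
    for i in range(steps + 1):
        x = threshold + i * width
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        cond_tail = 1.0 - STD_NORMAL.cdf((threshold - rho * x) / resid_sd)
        total += weight * STD_NORMAL.pdf(x) * cond_tail
    return total * width / 3.0


prevalence = 0.01   # lifetime prevalence of 1 %
h2 = 0.80           # heritability of liability; assumed equal to the MZ liability correlation

threshold = STD_NORMAL.inv_cdf(1.0 - prevalence)
both_affected = joint_tail(threshold, rho=h2)

# Probability that the co-twin of an affected twin is also affected.
concordance = both_affected / prevalence
print(f"MZ probandwise concordance: {concordance:.2f}")
print(f"MZ discordance given an affected twin: {1.0 - concordance:.2f}")
```

Even with a heritability of 80 %, the co-twin's liability is only partly determined by the shared genome, so a substantial proportion of affected twins are expected to have an unaffected co-twin.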

Missing Heritability

Advances in genotyping technology allow cheap genome-wide interrogation of single nucleotide polymorphisms (SNPs). GWAS identify associations between SNPs and disease. Reported results from association analyses include risk allele frequency (RAF), effect size (expressed for disease as the odds ratio, OR) and p-value of association. The contribution of these associated genetic variants to variance can be calculated on the liability scale [17], which allows the contribution of each locus to risk to be compared directly on the same scale as that on which heritability is reported. Assuming independence (and ignoring potential overestimation of effect size due to winner’s curse), the contributions of the genome-wide significant (GWS) loci can be summed to determine the proportion of variance in liability explained by these loci together, thus quantifying the effects of all genome-wide significant SNPs (h²_GWS).

Given the stringent significance threshold applied, the ability to detect risk loci (i.e., the power) depends on whether the sample size is sufficient given the true effect sizes. When the first GWAS were planned, the distribution of expected effect sizes was unknown and sample sizes were powered to detect OR > ~1.3. The first generation of GWAS yielded few GWS results with h²_GWS much less than h². This difference has been termed “missing heritability” [18]. As sample sizes have increased, the number of GWS variants has increased for both quantitative traits and diseases (see Figure 2 in Visscher et al. [6]), providing empirical evidence that common variants do play a role in complex genetic disease. Nonetheless, substantial missing heritability remains.

Hiding Heritability

The observed increase in the number of significant association results as sample sizes have increased [6] implies that the earlier studies were underpowered to detect the variants given their effect sizes. However, given that collection of larger samples is time consuming and expensive, can we be sure that the same will be true for other diseases? Statistical methods that combine quantitative and population genetic concepts to evaluate the contribution to variance of common SNPs across the whole genome without identifying them individually have been developed [19–24]. These methods use people who are unrelated in the conventional sense of the word but who, given the finite global population size, share a proportion of their DNA by descent. The proportion of sharing between pairs of individuals can be estimated using genome-wide marker data, and that genomic similarity can be correlated with disease status to estimate genetic variation [20, 21, 25, 26]. By using distantly related individuals, a significant heritability tagged by common SNPs, h²_SNP, is detected if case-case pairs and control-control pairs have higher genomic similarity than case-control pairs [26]. For most disease traits studied, significant SNP heritabilities have been estimated, demonstrating that contributions from common variants exist even though the data sets analysed may have been underpowered to detect the individual small effects as GWS. Larger sample sizes are needed for individual detection. Hence, the polygenic analyses have been successful in identifying “hidden heritability”, i.e., the increase from h²_GWS to h²_SNP. In theory, with sufficiently large sample size, h²_GWS can become as large as h²_SNP.
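For concreteness, one widely used estimator of the genomic similarity between individuals j and k from m SNPs is shown below. It is given as an illustration of the idea rather than as the specific estimator used in each of the cited studies; the symbols (x for the count of the reference allele, p for its frequency, A for the resulting relationship coefficient) are notation introduced here.

```latex
\begin{equation}
  A_{jk} \;=\; \frac{1}{m} \sum_{i=1}^{m}
  \frac{(x_{ij} - 2p_i)\,(x_{ik} - 2p_i)}{2\,p_i\,(1 - p_i)}
\end{equation}
```

Here x_{ij} ∈ {0, 1, 2} is the genotype of individual j at SNP i. Pairs with larger A_{jk} share more of their genome than the population average, and it is the relationship between these coefficients and similarity in disease status that carries the information about h²_SNP.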

Explanations for the Still-Missing Heritability

For most diseases the “still-missing” heritability, i.e., the difference between h²_SNP and h², remains substantial at approximately half of the heritability estimated from family data. It is important to note that it is not necessary to explain all heritability when the goal is to open new biological research doors that may impact treatment, and, indeed, it is likely to be impossible to do so. Nonetheless, seeking further insight into the still-missing heritability may also provide important guidance for future research directions. A number of explanations have been proposed [19–28], which include the following.

a) Over-estimation of heritability from family studies

In human populations, part of the still-missing heritability may simply reflect overestimation of h², since typical study designs for estimation of heritability use very close relatives (e.g., full siblings and twins) who share non-additive gene combinations and a common environment, and these confounding factors can be difficult to separate [4, 18]. The difference between estimates of h² from family data and the “true” h² has been termed “phantom heritability” [29] when the difference is attributable to non-additive genetic variance, but our ability to quantify this based on realistically collectable data is limited. Others have argued that the contribution from non-additive genetic variance to complex traits is likely limited [30, 31], and the presence of important epistasis and small epistatic variance are not inconsistent [32]. The extent to which gene-environment interaction (GxE) or G and E correlation inflate estimates of heritability from twin and family studies is unknown. Nonetheless, it seems intuitive that exposure to environmental risk factors increases risk of disease only in those who are already genetically susceptible and, hence, SNP effect sizes may differ in cases stratified by an environmental exposure. However, GxE studies to date are limited by a dearth of samples that are informative for G and have consistently recorded E, and are notoriously underpowered [33]. For this reason, studies of candidate GxE interactions have generally lacked replication, and the field is plagued by publication bias towards studies with positive results [34].

b) Variants not tagged by common SNPs

Part of the still-missing heritability must reflect genomic variants not well tagged by SNPs [21, 27]. Since the SNPs on SNP chips are generally chosen because both their alleles are common, they cannot be in high linkage disequilibrium (r²) with rare causal variants. For many diseases, copy number variants or other rare variants have been identified, usually through WES studies. To have been detected, these rare variants must have relatively large effect sizes; but because they are rare, their contribution to risk in the population is small. A very large number of rare variants would be needed to explain the still-missing heritability. For example, a locus with risk allele frequency 0.001 and heterozygous relative risk (RR) of 2.1 explains approximately the same proportion of variance in liability as a locus with allele frequency 0.5 and RR 1.05. It is notable that estimation of h²_SNP using SNPs imputed to the 1000 Genomes reference panel does not tend to generate higher estimates compared to imputation to the HapMap3 panel [35, 36]. The relative importance of small structural variants to genomic variation is currently not well documented, and such variants may not be well represented in the sequenced reference panels used for imputation. Since recurrent tandem repeat polymorphisms are known to modulate a range of biological functions [37, 38], these may represent an example of an important, but as yet unprobed, source of disease-associated variation. Estimation of h²_SNP based on haplotypes constructed from SNPs is a field of active research, since haplotypes have the opportunity to tag uncommon structural variants not present in imputation reference panels. In practice, such methods may be difficult to apply since they are likely to be very sensitive to genotyping error.
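The rare-versus-common comparison above can be checked with a short calculation. The following is a sketch only, in the spirit of liability-scale variance calculations such as [17] but not claimed to reproduce that method exactly. It assumes Hardy-Weinberg genotype frequencies, a multiplicative relative risk model, a normally distributed residual liability with unit variance within each genotype class, and a lifetime prevalence of 1 % (the text does not state the prevalence used). The function name `liability_variance_explained` is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def liability_variance_explained(risk_allele_freq, relative_risk, prevalence):
    """Proportion of variance in liability explained by one biallelic locus."""
    p, q = risk_allele_freq, 1.0 - risk_allele_freq
    genotype_freqs = [q * q, 2.0 * p * q, p * p]            # 0, 1, 2 risk alleles
    relative_risks = [1.0, relative_risk, relative_risk ** 2]

    # Penetrance of each genotype, scaled so the population risk equals prevalence.
    baseline = prevalence / sum(f * r for f, r in zip(genotype_freqs, relative_risks))
    penetrances = [baseline * r for r in relative_risks]

    # Mean liability of each genotype relative to the non-carrier genotype,
    # assuming unit residual variance within genotype (an assumption of this sketch).
    reference = STD_NORMAL.inv_cdf(1.0 - penetrances[0])
    shifts = [reference - STD_NORMAL.inv_cdf(1.0 - pen) for pen in penetrances]

    mean_shift = sum(f * s for f, s in zip(genotype_freqs, shifts))
    locus_var = sum(f * (s - mean_shift) ** 2 for f, s in zip(genotype_freqs, shifts))
    return locus_var / (1.0 + locus_var)


assumed_prevalence = 0.01  # assumption: not stated in the text
rare = liability_variance_explained(0.001, 2.1, assumed_prevalence)
common = liability_variance_explained(0.5, 1.05, assumed_prevalence)
print(f"rare locus   (RAF 0.001, RR 2.10): {rare:.6f}")
print(f"common locus (RAF 0.5,   RR 1.05): {common:.6f}")
print(f"ratio rare/common: {rare / common:.2f}")
```

Both loci explain only a very small fraction of the variance in liability, which is why many such loci are needed to account for a meaningful share of heritability.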

c) Disease heterogeneity

Disease heterogeneity is a possible explanation for still-missing heritability. We have previously noted, for psychiatric disorders at least, that heritabilities estimated from large population samples are lower than those estimated from twin studies. We argued [39] that this may reflect greater diagnostic heterogeneity in large cohorts compared to the carefully collected twin samples, but that the large cohorts may be more representative of the samples currently brought together for analysis in genetic studies.

Disease heterogeneity can have several interpretations, but at its most tangible there are multiple examples of complex genetic diseases that are now recognised to have biologically determined subtypes reflecting independent, or more likely correlated, diseases that may have different optimal treatment strategies. For example, decades ago, based on clinical symptoms alone, the inflammatory bowel diseases ulcerative colitis and Crohn’s Disease would have been indistinguishable and given the same diagnosis. More recently, it has been recognised that diagnosis and treatment of rheumatoid arthritis should consider the presence or absence of anti-citrullinated protein autoantibodies [40]. The genomics era has allowed good progress in subtyping of cancers (e.g., ER+ve/ER−ve and overexpression of HER2 as a breast-cancer subtype [41, 42], or K-ras mutations in colorectal cancer and EGFR mutations in lung cancer, as reviewed in [43]). However, other branches of medicine are less able to supply measures of phenotypic heterogeneity in the tissue of relevance for mapping onto the genetic heterogeneity. Given the known examples, it seems likely that other diseases currently treated as a single disease entity may in fact be a diagnostic aggregation of subtypes. How could this impact missing heritability? We consider the impact of disease heterogeneity on estimates of the different parameters of variance explained by genetic factors and demonstrate that it could make an important contribution to still-missing heritability (Fig. 1).

Fig. 1. A schematic of heritabilities

Exploring the Impact of Disease Heterogeneity

To consider the impact of disease heterogeneity on genetic interpretation of disease, we consider an extreme example of two diseases, each of lifetime prevalence 0.5 % and heritability 80 %, that are phenotypically and genetically independent, but that have such similar clinical presentation that they are indistinguishable and are considered a single disease. Under this composite disease etiology what would be the impact on estimates of h², h²_GWS and h²_SNP?

a) Impact on h²

The composite disease would have a lifetime prevalence of 0.005 × (2 − 0.005) ≈ 0.00998, i.e., approximately 1 %, and the heritability estimated from the two-disease composite would be greater than 65 % from a twin design (see Appendix). In fact, for the composite disease the estimates of heritability using the liability threshold model are expected to be slightly inconsistent when estimated from the relative risks of disease from different types of relatives (Fig. 2), but such inconsistencies are expected to be difficult to detect given the sampling error on estimates, especially since most studies to estimate heritability use relatively small samples of only twins or first degree relatives [4]. We conclude that high estimates of heritability are possible for a composite disease.

Fig. 2. Estimates of heritability under a liability threshold model calculated from lifetime risk of disease and lifetime risk of disease in relatives of affected individuals for a composite disease that comprises two independent diseases each of lifetime risk either 5 % or 0.5 % and each of heritability 80 %
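The calculation behind the composite-disease heritability can be sketched as follows. This is an illustration under stated assumptions and is not the Appendix derivation: it considers monozygotic twins only, assumes their liability correlation for each underlying disease equals h² = 0.8, treats the two diseases as fully independent, and then asks which single liability correlation would reproduce the composite twin concordance at the composite prevalence. The helper names `joint_tail` and `implied_correlation` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def joint_tail(threshold, rho, steps=2000, span=10.0):
    """P(X > t, Y > t) for a standard bivariate normal (Simpson's rule)."""
    resid_sd = sqrt(1.0 - rho * rho)
    width = span / steps
    total = 0.0
    for i in range(steps + 1):
        x = threshold + i * width
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        cond_tail = 1.0 - STD_NORMAL.cdf((threshold - rho * x) / resid_sd)
        total += weight * STD_NORMAL.pdf(x) * cond_tail
    return total * width / 3.0


def implied_correlation(prevalence, both_affected, tol=1e-6):
    """Liability correlation that gives P(both affected) at this prevalence (bisection)."""
    threshold = STD_NORMAL.inv_cdf(1.0 - prevalence)
    low, high = 0.0, 0.999
    while high - low > tol:
        mid = (low + high) / 2.0
        if joint_tail(threshold, mid) < both_affected:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0


k_single, h2_single = 0.005, 0.80

# Probability that both MZ twins have the same single underlying disease.
t_single = STD_NORMAL.inv_cdf(1.0 - k_single)
pair_single = joint_tail(t_single, h2_single)

# Composite prevalence: affected by disease 1 or disease 2 (independent).
k_composite = 1.0 - (1.0 - k_single) ** 2

# P(neither twin has a given disease) = 1 - 2K + P(both); the diseases are independent,
# so P(neither twin has either disease) is the square of that quantity.
neither_single = 1.0 - 2.0 * k_single + pair_single
pair_composite = 1.0 - 2.0 * (1.0 - k_single) ** 2 + neither_single ** 2

h2_composite = implied_correlation(k_composite, pair_composite)
print(f"composite prevalence: {k_composite:.6f}")
print(f"heritability implied by MZ concordance of the composite: {h2_composite:.2f}")
```

Under these assumptions the implied heritability of the composite is lower than the 80 % of each contributing disease but remains high, which is the qualitative point made in the text.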

b) Impact on h²_GWS

We have previously provided theory to estimate the power of association studies in the context of misdiagnosis [20] (see Appendix), which is analogous to the scenario here of a disease composite. In Fig. 3 we show the power of an association study to detect risk alleles of a spectrum of frequencies that have effect size under a multiplicative model of heterozygote relative risk 1.15. For a sample of 10,000 cases of a single genetic disease and 10,000 controls we have >75 % power to detect risk alleles of frequencies 0.2–0.8 at a genome-wide significance of 5 × 10⁻⁸ (line A). However, for our composite disease (for which we expect risk alleles to be associated with only one of the underlying diseases) an association study of 10,000 cases, of which only half are from the disease impacted by the risk allele, is totally underpowered to detect risk alleles (line B). To demonstrate that this reflects the impact of contamination by the phenocopy disease rather than the reduced sample size of the associated disease, we also show the power of an association study of 5,000 cases and 10,000 controls (line C). To consider a range of disease composite scenarios, when the proportion of disease 2 cases in the disease composite sample is 0 %, 5 %, 10 %, 20 %, and 50 %, the power to detect a disease 1 risk variant of frequency 0.4 and relative risk 1.15 at the genome-wide significance threshold of p < 5 × 10⁻⁸ is 93 %, 87 %, 79 %, 55 %, and 3 %, respectively (assuming 10,000 composite disease cases and 10,000 controls and 0.5 % lifetime risk of disease 1).

Fig. 3. Power of a genome-wide association study to detect risk variants with heterozygous relative risk of 1.15. A) 10,000 cases of a homogeneous genetic disease of prevalence 1 % and 10,000 screened controls. B) 10,000 cases of a composite disease and 10,000 screened controls; the composite disease has prevalence 1 % but comprises two equally represented genetically independent diseases each of prevalence 0.5 %. C) 5,000 cases of a homogeneous genetic disease of prevalence 0.5 % and 10,000 screened controls

We conclude that disease heterogeneity can severely compromise the power of association studies and, hence, estimation of h²_GWS.
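The loss of power with increasing contamination can be illustrated with a simple allelic test. This is a minimal sketch rather than the theory of [20] or the Appendix: it assumes a multiplicative risk model, a two-sided normal-approximation test comparing allele frequencies in cases and screened controls, and that cases of the second (phenocopy) disease carry the risk allele at the population frequency. The function name `gwas_power` and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gwas_power(freq, rel_risk, prevalence, n_cases, n_controls,
               phenocopy_fraction=0.0, alpha=5e-8):
    """Approximate power of an allelic case-control test for one risk variant."""
    # Risk allele frequency among true cases of the associated disease.
    true_case_freq = freq * rel_risk / (1.0 + freq * (rel_risk - 1.0))
    # Screened controls: the population with cases of the associated disease removed.
    control_freq = (freq - prevalence * true_case_freq) / (1.0 - prevalence)
    # Assumption: phenocopy cases have the population allele frequency.
    case_freq = ((1.0 - phenocopy_fraction) * true_case_freq
                 + phenocopy_fraction * freq)

    pooled = (n_cases * case_freq + n_controls * control_freq) / (n_cases + n_controls)
    std_err = sqrt(pooled * (1.0 - pooled)
                   * (1.0 / (2 * n_cases) + 1.0 / (2 * n_controls)))
    expected_z = (case_freq - control_freq) / std_err
    critical_z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    # Power in the expected direction; the opposite tail is negligible here.
    return STD_NORMAL.cdf(expected_z - critical_z)


fractions = [0.0, 0.05, 0.10, 0.20, 0.50]
powers = [gwas_power(freq=0.4, rel_risk=1.15, prevalence=0.005,
                     n_cases=10_000, n_controls=10_000,
                     phenocopy_fraction=f) for f in fractions]
for f, pw in zip(fractions, powers):
    print(f"disease 2 cases: {f:4.0%}  power: {pw:.2f}")
```

Under these assumptions the sketch should give values close to the power figures quoted in the text. The design point it makes is that contamination dilutes the allele frequency difference itself, which is more damaging than simply halving the number of informative cases.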

c) Impact on h²_SNP

The impact of analysing a disease composite to estimate h²_SNP can also be considered in terms of disease misclassification [20]. The estimated h²_SNP is a weighted average of the true h²_SNP parameters of each underlying disease and the SNP-covariance (counted twice). So if the two contributing diseases have equal true h²_SNP and are independent, the estimated value from the composite disease will be 0.5 h²_SNP. We conclude that disease heterogeneity can generate underestimates of h²_SNP compared to when disease classes are genetically homogeneous.
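One way to write this relationship, consistent with the result of 0.5 h²_SNP stated above, is given below. It is offered as a sketch: the symbols (w for the proportion of composite cases that have disease 1, σ₁₂ for the SNP-covariance between the two diseases) are notation introduced here, and the exact form used in [20] may differ in detail.

```latex
\begin{align}
  \hat{h}^2_{\mathrm{SNP}}
    &= w^2\, h^2_{\mathrm{SNP},1} + (1 - w)^2\, h^2_{\mathrm{SNP},2}
       + 2\,w\,(1 - w)\,\sigma_{12}, \\
  \intertext{and with $w = \tfrac{1}{2}$, $h^2_{\mathrm{SNP},1} = h^2_{\mathrm{SNP},2} = h^2_{\mathrm{SNP}}$ and $\sigma_{12} = 0$:}
  \hat{h}^2_{\mathrm{SNP}}
    &= \tfrac{1}{4}\, h^2_{\mathrm{SNP}} + \tfrac{1}{4}\, h^2_{\mathrm{SNP}}
     = \tfrac{1}{2}\, h^2_{\mathrm{SNP}} .
\end{align}
```

As the covariance term grows, that is, as the two diseases become more genetically correlated, the estimate from the composite approaches the value for a single homogeneous disease.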

Summary

The genetic basis of complex genetic disease can be quantified by heritability, which is an estimate of the relative importance of genetic and non-genetic factors in contributing to differences between individuals for any given trait. Heritability is estimated from phenotypic records in data sets of families and represents contributions from genetic variants across the frequency spectrum and genetic variants of any kind and function. Advances in technology allow direct interrogation of some kinds of DNA variants. Specific DNA variants identified in the era of genome-wide association studies explain only a fraction of the heritability estimated from family studies (h²_GWS), as do less common variants identified through whole exome sequencing. If true effect sizes of risk variants are small then studies to date may be underpowered to detect individual risk variants, but they may be well-powered to detect the total contribution from common risk variants (h²_SNP), and such analysis has helped to explain some of the missing heritability. Here we reviewed explanations for the so-called “still-missing heritability” and focused particularly on the issue of disease heterogeneity.

To explore the impact of disease heterogeneity on estimates of h², h²_GWS and h²_SNP we considered an extreme example of two independent, indistinguishable but equally genetic diseases being lumped together as a disease composite. We have shown that under this scenario the estimates of h² from family data are nearly as high as the heritabilities of the contributing individual diseases, yet the estimates of h²_GWS and h²_SNP are severely compromised. In reality, this toy example may be too extreme, as real presentations of composite diseases may reflect diseases that are genetically correlated rather than totally independent. For example, Crohn’s Disease and ulcerative colitis are estimated to have a genetic correlation based on SNP data of 0.6 [44]; the vast majority of SNPs identified in GWAS affect both diseases, but a handful of them have effects in opposite directions [45]. Clearly, as the genetic correlation between the two contributing diseases approaches 1, the two diseases merge into a single genetic disease entity. For genetically correlated diseases, the power to detect associated loci may be increased by considering the disease composite for loci contributing to both diseases and decreased for other loci. Consideration of these factors can quickly lead to philosophical musings on the definition of disease, since even for a single genetic disease under a polygenic model, each individual could carry a unique portfolio of risk loci. In the genomics era, a disease definition may be at the pathway level, whereby a single genetic disease considers different portfolios of risk loci impacting the same pathway, or, more practically, the class of individuals who respond to the same treatment.