Introduction

Type 2 diabetes is a chronic metabolic disease affecting the lives of millions and creating major healthcare problems worldwide. The huge personal and financial costs, together with good evidence for a genetic component of the disease, have justified a substantial investment into the search for novel type 2 diabetes genes [1–83]. Linkage studies have had limited success to date in identifying causative genes. Genome-wide association (GWA) studies have identified multiple novel, apparently causative genes, but the contribution of these to disease risk and predictive value is small [84, 85]. Pathway-specific candidate gene searches have also had little success. Therefore, most of the genetic contribution to type 2 diabetes remains unknown and the pressure for solutions to the disease remains intense. The environmental contributions to type 2 diabetes may partly be explained by a gene–environment interaction whereby a particular environment triggers the disease in those with an underlying genetic predisposition. The search for diabetes disease genes is therefore as critical as the current efforts to modify the environmental and lifestyle factors that contribute to the disease.

Family-based studies of the genetic determinants of type 2 diabetes and related precursor quantitative traits (QTs, e.g. plasma insulin and glucose levels) [1–83] and GWA studies have now provided an abundance of evidence for potentially causative genes. These results have been drawn together onto a single map of the human genome sequence [86]. The goal is to look for genomic locations where the presence of a potential underlying type 2 diabetes gene has been attested to repeatedly (diabetes genetic ‘hot spots’). Such replication increases our confidence in the presence of an underlying gene. While GWA studies look for diabetes genes using a different approach to linkage analysis, the ultimate goal is the same: to find the genetic determinants of the disease. Therefore, the results of linkage and association must eventually match each other. The current analysis identifies multiple linkage locations that differ from those found in the recent GWA studies [87–89], and suggests the location of additional major type 2 diabetes susceptibility genes.

Methods

Linkage studies of type 2 diabetes and related QTs were identified through Medline or other literature searches. In total, 52 genetic linkage projects reported in 83 publications were included in our analysis. Participants originated from multiple ethnic backgrounds. We were aware of some overlap in study populations, mainly the Genetics of NIDDM (GENNID) study [18, 22, 23, 25, 31, 44], but also other studies [4, 17, 30, 59]. The number of individual results removed to avoid the resulting redundancy was small. Pima Indian QT studies from clinical research [65] or field studies [33] were treated separately. For studies that updated their analyses we have endeavoured to use the most recent results.

The X chromosome was excluded because of limited results. There were 450 different genetic marker names representing 439 loci (11 markers had two aliases). The physical location of the marker reported for the linkage peak or the two-point score was identified on build 36.1 of the Human Genome from UCSC Genome Bioinformatics (http://genome.ucsc.edu/, accessed July 2008) [86]. Where the linkage location fell between several markers, the p-terminal marker was used. Inclusion of results was based on logarithm of odds (LOD) score, not p value. For clarity, linkage peaks have been plotted as a single point/line. To be included, the marker name needed to be explicitly provided in the publication (for study [78] the marker names were provided by the senior investigator).

To avoid mapping multiple results from the same linkage signal, linkage results were sorted by study population and phenotype and then ranked by descending LOD score. Any result with a lower LOD score for the same population/phenotype within 30 Mb in either direction of the signal was deleted. A physical distance of 30 Mb is approximately 33 cM. Using the Kosambi map function, this corresponds to a recombination fraction of 0.29 (maximum is 0.5) [90]. Some studies had both subgroup and combined group analyses. Since the aim of this study was to identify replicated findings, we preferentially used the subgroups (e.g. for the Finland–United States Investigation of NIDDM Genetics [FUSION], select FUSION1 and FUSION2 and delete FUSION1+2 [73, 78] if co-located).
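The distance conversion and the filtering rule described above can be sketched as follows. This is a minimal illustration, not the authors' SAS code: the dictionary keys (`population`, `phenotype`, `chrom`, `pos_mb`, `lod`) are hypothetical names, and the filter implements one plausible reading of the rule, in which only results that were themselves retained can suppress a weaker neighbour.

```python
import math


def kosambi_recombination_fraction(distance_cm):
    """Kosambi map function: map distance (cM) -> recombination fraction."""
    d_morgans = distance_cm / 100.0
    return 0.5 * math.tanh(2.0 * d_morgans)


def filter_linkage_results(results, window_mb=30.0):
    """Keep the highest-LOD result per population/phenotype within a window.

    `results` is a list of dicts with the hypothetical keys
    'population', 'phenotype', 'chrom', 'pos_mb' and 'lod'.
    """
    kept = []
    # Visit results from highest to lowest LOD so stronger signals win.
    for r in sorted(results, key=lambda x: x["lod"], reverse=True):
        clash = any(
            k["population"] == r["population"]
            and k["phenotype"] == r["phenotype"]
            and k["chrom"] == r["chrom"]
            and abs(k["pos_mb"] - r["pos_mb"]) <= window_mb
            for k in kept
        )
        if not clash:
            kept.append(r)
    return kept


# 33 cM corresponds to a recombination fraction of about 0.29.
theta = kosambi_recombination_fraction(33.0)
```

With three results from the same population and phenotype at 100, 120 and 140 Mb (LOD 3.0, 2.0 and 1.5), this reading keeps the results at 100 Mb and 140 Mb and drops the one at 120 Mb.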

The following simplified phenotypes were created prior to filtering: (1) diabetes (diabetes, diabetes and impaired glucose tolerance combined, diabetes age of onset), (2) glucose (fasting glucose, OGTT glucose levels, HbA1c), (3) insulin (any insulin, proinsulin, C-peptide, some bivariate results), (4) acute insulin response (AIR), (5) minimal model insulin sensitivity (Si), (6) minimal model glucose effectiveness (Sg), and (7) euglycaemic–hyperinsulinaemic clamp (glucose infusion rate, M value). Phenotypes constructed from two categories were filtered against the results in both categories. While obesity also contributes to type 2 diabetes susceptibility, it is a distinct genetic disorder that is not always associated with the disease, and was therefore not included.

For several studies we converted p values to LOD scores using the assumption that \(\text{LOD} \times 2\log_e 10\) has a \(\chi^2\) distribution [91]. We assumed a two-sided test for [1], according to the authors' instructions, and a one-sided test for [43, 46] (i.e. p value doubled before further calculation). This may introduce some error, but the location estimates are not influenced by the calculation. A number of authors supplied additional data for their studies on request, but we did not systematically request data from all authors.
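A minimal sketch of this conversion is given below. It assumes one degree of freedom for the χ² distribution (the text does not state the degrees of freedom), and the function name `p_to_lod` is illustrative. For one degree of freedom the χ² quantile is the square of a standard normal quantile, so only the Python standard library is needed.

```python
import math
from statistics import NormalDist


def p_to_lod(p_value, one_sided=True):
    """Convert a point-wise p value to an approximate LOD score.

    Assumes LOD * 2 * ln(10) follows a chi-square distribution with 1 df.
    A one-sided p value is doubled before the chi-square lookup.
    """
    tail = 2.0 * p_value if one_sided else p_value
    if not 0.0 < tail < 1.0:
        raise ValueError("p value out of range for this conversion")
    # For 1 df, a chi-square quantile is the square of a normal quantile.
    z = NormalDist().inv_cdf(1.0 - tail / 2.0)
    chi_square = z * z
    return chi_square / (2.0 * math.log(10.0))


lod_threshold = p_to_lod(0.01)  # one-sided p = 0.01
```

Under these assumptions a one-sided p value of 0.01 gives an LOD of about 1.18, which agrees with the inclusion threshold quoted in the Results.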

The SAS statistical package, version 9.1 (SAS Institute, Cary, NC, USA) was used for the analyses. Linkage results for any particular underlying gene are expected to lie close to the genomic location of the gene. Since these results may be scattered locally, rather than falling in exactly the same spot, and to make the identification of any potential clusters of replicated results as objective as possible, the refined results were analysed using MODECLUS (method 1), a SAS statistical clustering procedure. It is not possible to assign a significance value to an individual cluster. The initial seeding radius for clusters provided to MODECLUS was 12 Mb, but lower if there were results not unambiguously assigned to a cluster. Large seeds will put all observations into one large cluster, which is clearly unhelpful. Three markers fell in the same cluster as a related result (but >30 Mb away). These were deleted from further analyses after the goodness-of-fit tests. The goodness-of-fit tests evaluated the distribution of location results against a uniform distribution (Anderson–Darling statistic).
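The goodness-of-fit idea can be illustrated with a short sketch of the Anderson–Darling statistic for marker positions tested against a uniform distribution along a chromosome. This is not the SAS procedure used in the study, and it returns only the statistic. Converting it to a p value needs a reference distribution, which is omitted here. The function and argument names are hypothetical.

```python
import math


def anderson_darling_uniform(positions_mb, length_mb):
    """Anderson-Darling A^2 statistic for positions against Uniform(0, length)."""
    n = len(positions_mb)
    if n == 0:
        raise ValueError("need at least one position")
    eps = 1e-12  # keeps log() finite for markers at the very ends
    u = sorted(min(max(x / length_mb, eps), 1.0 - eps) for x in positions_mb)
    total = 0.0
    for i in range(1, n + 1):
        # Pair the i-th smallest value with the i-th largest value.
        total += (2 * i - 1) * (math.log(u[i - 1]) + math.log(1.0 - u[n - i]))
    return -n - total / n
```

Evenly spread positions give a statistic close to zero, whereas positions bunched into one region give a large value, which is the pattern the tests in the text are designed to detect.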

A quality score was developed and applied to each cluster. Clusters were first ranked by three properties: the sum of all the LOD scores, the density of the cluster (number of studies per 10 Mb of genome), and the number of studies in the cluster with an LOD score of ≥3.0. The sum of these ranks was used to calculate a final ranking, which we use as a quality score (the 15 best are in Table 1).
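The rank-sum scoring can be sketched as below. The handling of ties and the data layout are assumptions, since the text does not specify them, and the cluster names and values in the usage note are invented purely for illustration.

```python
def quality_ranking(clusters):
    """Rank clusters by the sum of their ranks on three properties.

    `clusters` maps a cluster name to a tuple
    (sum_of_lod, studies_per_10mb, n_lod_ge_3); larger is better for each.
    """
    names = list(clusters)
    rank_sum = {name: 0 for name in names}
    for prop in range(3):
        ordered = sorted(names, key=lambda nm: clusters[nm][prop], reverse=True)
        for rank, name in enumerate(ordered, start=1):
            rank_sum[name] += rank  # ties are broken by input order here
    # Lowest rank sum = best quality score.
    return sorted(names, key=lambda nm: rank_sum[nm])
```

For example, `quality_ranking({"a": (30.0, 4.0, 3), "b": (20.0, 2.0, 1), "c": (25.0, 5.0, 2)})` orders the hypothetical clusters as `a`, `c`, `b`, because `a` ranks first on two of the three properties.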

Table 1 Genome locations of type 2 diabetes linkage replication (best 15 locations ranked by quality score)

Calculations

The significance of the relationship between the GWA results and clusters was calculated as: \(P(k) = \frac{n!}{k!\,(n - k)!}\, p^{k} (1 - p)^{n - k}\), where n is the number of trials (i.e. 20 for GWA results), k is the number of hits (i.e. GWA markers falling into the clusters) and p is the probability of success in one trial (i.e. amount of genome covered by clusters). The overall probability is the sum of P(j) over j = k, k + 1, k + 2, …, n, i.e. the probability of observing at least k hits.
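As a worked check, the binomial tail can be evaluated directly. The sketch below uses the inputs quoted in the Results (15 of 20 GWA results with 56.1% genome coverage, and 5 of 20 with 15.6% coverage). The function name is illustrative.

```python
from math import comb


def binomial_tail(n, k, p):
    """Probability of at least k successes in n independent trials."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k, n + 1))


p_near_cluster = binomial_tail(20, 15, 0.561)  # in or within 5 Mb of a cluster
p_near_mean = binomial_tail(20, 5, 0.156)      # within 4 Mb of a cluster mean
```

These evaluate to approximately 0.067 and 0.19, respectively, matching the significance values reported in the Results.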

The \(\lambda_s\) value (sibling recurrence risk ratio, which is the risk to a sibling relative to the general population; Table 3) is calculated for an additive genetic effect as follows [90, 92]:

  • p = risk allele frequency, \(q = \left( {1 - p} \right)\)

  • g = odds per allele (for two alleles, \(g_2 = 2g - 1\); one allele, \(g_1 = g\); no alleles, \(g_0 = 1\))

Note: Since \(f_0\), the disease penetrance in those without the disease allele, eventually cancels out in these equations, only the genotype relative risks (odds ratios) are included.

$$\begin{aligned}
V_A &= 2pq\left\{ \left[ p\left( g_2 - g_1 \right) \right] + \left[ q\left( g_1 - g_0 \right) \right] \right\}^2 \\
V_D &= p^2 q^2 \left( g_2 - 2g_1 + g_0 \right)^2 \\
K &= p^2 g_2 + 2pq\,g_1 + q^2 \\
C &= \left( V_A \times 0.5 \right) + \left( V_D \times 0.25 \right) \\
K_s &= K + C/K \\
\lambda_s &= K_s/K
\end{aligned}$$

where \(V_A\) is the additive variance, \(V_D\) is the dominance variance, K is the population prevalence of the disease, C is the genetic covariance between two full siblings (epistatic effects ignored) and \(K_s\) is the sibling recurrence risk.
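The equations above translate directly into a short function. The sketch below follows them term by term. The input values in the usage line are hypothetical and are not taken from Table 3.

```python
def sibling_recurrence_risk_ratio(p, g):
    """lambda_s for one locus under the additive model in the text.

    p: risk allele frequency; g: odds (relative risk) per allele.
    """
    q = 1.0 - p
    g0, g1, g2 = 1.0, g, 2.0 * g - 1.0
    v_a = 2.0 * p * q * (p * (g2 - g1) + q * (g1 - g0)) ** 2  # additive variance
    v_d = (p ** 2) * (q ** 2) * (g2 - 2.0 * g1 + g0) ** 2     # dominance variance
    k = p ** 2 * g2 + 2.0 * p * q * g1 + q ** 2               # prevalence (relative to f0)
    c = 0.5 * v_a + 0.25 * v_d                                # sibling covariance
    k_s = k + c / k                                           # sibling recurrence risk
    return k_s / k


example = sibling_recurrence_risk_ratio(p=0.3, g=1.37)  # hypothetical inputs
```

Because \(g_2 - 2g_1 + g_0 = 0\) under the additive coding, the dominance variance vanishes and \(\lambda_s\) reduces to \(1 + V_A/(2K^2)\). For the hypothetical inputs shown, this gives a value of about 1.02.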

Results

We identified 560 linkage results, on 22 autosomes, representing 439 different marker locations that met the following criteria: an LOD score of ≥1.18 (equivalent to p ≤ 0.01 in a single-point analysis or the likelihood of about two such results occurring by chance in a genome-wide analysis [93]); a marker with a known genomic location; and each result independent of any other result lying either within 30 Mb or in the same cluster (Tables 1 and 2, Fig. 1, Electronic supplementary material [ESM] Figs 1 and 2). There were 264 linkage results for the type 2 diabetes phenotype, and 296 linkage results for QTs. The numbers of linkage reports by racial group were: Europeans 266, Native/Mexican American 124, Chinese/Japanese 78, African 55, Other/mixed/uncertain 37. Among the Europeans there were almost equal numbers of results for type 2 diabetes and QTs (124 and 142, respectively), but QT studies dominated in the Native/Mexican American and African groups (>70% of results), and type 2 diabetes analyses predominated among the East Asians (83% of results). LOD scores of ≥2.2, ≥3.0 and ≥3.6 were present in 188 (34%), 80 (14%) and 41 (7%) of results, respectively (genome-wide p values ∼0.2, ∼0.05, ∼0.01 [93]). By other assessments, an LOD of 2.2 is expected to occur by chance once in every genome-wide scan [91]. Within each major racial group a similar proportion of results (13–17%) had LOD scores ≥3.0.

Fig. 1

The linkage results for type 2 diabetes and related quantitative traits for chromosomes 1 and 22. The results for all 22 chromosomes plus the reference for the source of the data are given in ESM Fig. 1. The linkage results are plotted as a single line, each line representing a linkage peak or two-point score, and positioned by the location of the genetic marker. Each result is independent of any other result by population and phenotype, within 30 Mb in either direction, and independent of any results in the same cluster if the cluster exceeds 30 Mb in size. The y-axis shows the LOD score of the results, with values ≥6 plotted as 6. The x-axis shows the location of the linkage result in megabase pairs from the p-terminus (according to http://genome.ucsc.edu/, accessed July 2008). Diabetes/impaired glucose tolerance results are in blue, QT results are in green. A thick solid horizontal bar at the baseline demarcates each cluster. This is coloured deep purple for the best 15 clusters and pale purple for the remainder. A solid red arrow at the baseline identifies the location of a GWA SNP result (Table 3). For chromosome 1 the locations of the GWA results were adjusted to make the two closely located results visible separately. A number of linkage results were close to, or in an identical position to, another result. Therefore, for plotting purposes only, some results were moved in increments of 0.75 Mb until all values could be seen. The maximum move was 4.5 Mb (chromosome 6). The cluster range (i.e. start–end) given in the tables is the actual position, whereas the plotted value may be adjusted to make the results visible. ESM Fig. 2 shows in more detail the clusters in chromosomes 1, 6, 18 and 22 that contained the most superimposed lines.

Table 2 Genome locations of type 2 diabetes linkage replication (remaining clusters listed by chromosome)

By goodness-of-fit tests against a uniform distribution, the genomic locations of the linkage results were non-random in a number of assessments. Testing all results (chromosomes 1–22) as a continuous block suggested a non-uniform distribution (p = 0.045). Chromosomes 1, 8, 10, 17 and 18 show a non-uniform distribution (p ≤ 0.02), as do 2, 5 and 6 (p ≤ 0.05), while on chromosomes 3 and 21, the evidence is only suggestive (p ≤ 0.075). These tests may be too crude to truly assess clustering and may obscure finer detail.

Cluster analysis by individual chromosome identified 56 clusters, each containing at least five linkage results. The selection criterion of five members to define a cluster is arbitrary but provides a compromise between showing strong support for a replication locus on the one hand and discarding too many results of modest evidence for replication on the other. The 56 clusters contained 471 results (84% of total), had a mean size of 20.3 Mb (±11.7 Mb), and covered 39.7% of the autosomes (Tables 1 and 2, Fig. 1, ESM Figs 1 and 2). The clusters contained a mixture of type 2 diabetes and QT results, except for three small clusters (QT only). There were 89 (16%) results not assigned to a cluster, with type 2 diabetes and QT equally represented (17% and 15%, respectively). The mean LOD score was not significantly different between these orphaned results and clustered results (ANOVA and Wilcoxon tests). All clusters contained studies from several different racial groups (Tables 1 and 2).

The clusters were given a quality rank (Tables 1 and 2), and according to this, the top three clusters were located on chromosome 6q (very dense cluster), 1q (larger number of results and high LOD scores) and 18p (combination of properties). Linkage clusters were compared with results from recently published large GWA studies [87–89] (Table 3, Fig. 1, ESM Figs 1 and 2). There was no close relationship between the GWA and linkage results. Although 15 out of 20 GWA results were in a cluster or within 5 Mb of the cluster edge (56.1% of autosomal genome), this relationship was not significant (p = 0.067). Only five GWA results fell within 4 Mb of the mean of a cluster (15.6% of autosomal genome), and this is also not significant (p = 0.19). The best linkage replication clusters had no associated GWA result.

Table 3 GWA study results for type 2 diabetes

Discussion

The search for genetic determinants of type 2 diabetes is yet to provide the anticipated insights into the cause of the disease. An abundance of results has not provided a consistent picture or satisfactory set of causative alterations. The current compilation of linkage results for type 2 diabetes and its precursors suggests locations for type 2 diabetes susceptibility genes based on clusters of replicated results. It suggests that there are major genes for type 2 diabetes on 6q, 1q, 18p, 2q, 20q, 17pq, 8p, 19q and 9q, and possibly elsewhere, that are yet to be identified. Because of the limitations of available genetic methods and the likely complexity of the underlying genetic architecture of type 2 diabetes, these genetic locations are currently only broadly defined but do provide promise of further major gene identification.

In this study, we combined the results of type 2 diabetes as a discrete trait and the results of QTs. This is justified under the assumption that heritable precursors of type 2 diabetes will share the same genetic determinants as type 2 diabetes itself. Diabetes-related QTs are heritable in multiple ethnic groups (for examples, see [94–96]), and many also predict the subsequent development of diabetes (for examples, see [97–99]). In addition, genetic correlation has been demonstrated between type 2 diabetes and some QTs [100] and between various QTs [14, 37, 96, 100–102]. Moreover, by providing twice the reservoir of data to analyse, the combination of evidence from both type 2 diabetes and precursor QTs should improve the likelihood of identifying type 2 diabetes gene locations. Almost all replication clusters contained both types of results, supporting this approach (Tables 1 and 2).

A recent meta-analysis [103] evaluated 23 type 2 diabetes linkage (but not QT) studies with the aim of determining gene locations. The genome was divided into bins based on genetic distance, and LOD scores in each bin were evaluated using three different scoring schemes. The study identified 24 linkage loci, with three of these loci roughly consistent across scoring schemes. Each of these three loci corresponded to one of our clusters: chromosome 4 ∼176 cM/176 Mb, chromosome 6 ∼125–150 cM/125–150 Mb, and chromosome 10 ∼138 cM/119 Mb. The three other results given special comment by the authors do not correspond to a cluster in our results. Our study is more qualitative than that of Guan et al. [103], but it does allow for a visual evaluation of the raw data and is not constrained by selection of bin size or location (we used cluster analysis instead), and the inclusion of QTs doubles the number of results that can inform the gene location evaluation.

The individual linkage results show a wide scatter of locations (Fig. 1, ESM Figs 1 and 2). There are several potential explanations for this. Firstly, the data may merely be randomly distributed without any underlying genetic loci (see goodness-of-fit testing results). Alternatively, the scatter could reflect the limitations inherent in location estimates from linkage studies and the density of markers. Finally, the scatter could reflect the presence of multiple co-located independent genes.

We expect that both the linkage location estimate and the LOD score will be subject to error. This location error will determine the likely size of a cluster of results around a locus that is creating linkage signals. Genetic loci are unlinked when the recombination rate at meiosis between two loci reaches the maximum 50%, a distance of approximately 100 cM, about 100 Mb [90, 104]. Accurately defined linkage signals this far apart are not linked, suggesting an upper limit for replication cluster sizes. The multipoint linkage peaks observed in genetic analyses are typically broad, with the width of a peak easily reaching 40 cM or more (see also [105]). This indicates the uncertainty in the location estimate for the underlying gene. Confidence intervals for the location of the susceptibility gene have been examined by modelling for affected sib-pairs [106, 107] and for family studies [108–110]. These intervals are often surprisingly large, i.e. tens of centimorgans. For a \(\lambda_s\) of 1.24 (i.e. tenfold larger than that for TCF7L2 [Table 3]) and 400 sib-pairs, modelling suggested the standard deviation of the location estimate to be 13.11 cM, i.e. a 95% confidence interval of 51 cM (1.96 × 2 × 13.11) [106]. Indeed, the chance of finding the susceptibility gene under the region of maximum allele sharing is quite small [111]. While the use of a 1 LOD distance on either side of the linkage peak might provide tight confidence limits, such limits can be deceptively narrow and will often completely miss the true location [107]. Hence, we should reasonably expect a scatter of results around a susceptibility locus, with a larger scatter observed when the gene has a small effect or the study population is small [108–110, 112]. Cluster analysis tries to take this scatter into account. Studies with larger samples might serve to better define the susceptibility gene locations by genuinely narrowing the confidence intervals.

Results with lower LOD scores are included here to incorporate as much corroborative evidence as possible. Including lower LOD scores may, however, increase the scatter of the data and increase the number of false-positive results. The orphaned (non-clustered) results did not have lower LOD scores, suggesting that they are not false-positives and instead reflect random error in the location estimates. Because of the large number of tests typically performed in a genome-wide scan, the cutoff values for genome-wide significance for linkage are stricter than the point-wise significance limits. An LOD of ≥3.6 (theoretical) [91] or ≥3.0 (modelling) [93] suggests genome-wide significance (p < 0.05). Linkage results were centred around LOD scores of ≥3.0 in 40 (71%) of the clusters, and around LOD scores of ≥3.6 in 27 (48%) of the clusters. Inspection of the results can help the reader evaluate the linkage signals in these locations (Tables 1 and 2, Fig. 1, ESM Figs 1 and 2). When defining the correct position of the locus underlying a cluster, for positional cloning or bioinformatics purposes, it may be beneficial to weight each result.

In the current analysis, we excluded results within 30 Mb in either direction of the selected result if they were derived from the same study population and phenotype, because it was necessary to set limits to avoid spurious replication. This may have excluded a genuine second linkage signal, lying nearby, although it is doubtful that signals in such close proximity can be adequately resolved in most studies. However, it is possible that the wide clusters represent multiple co-located genes, as seen in prostate cancer (8q24) [113] and several other diseases. For the moment, we invoke Occam’s razor in assuming that, generally, a single cluster represents a single underlying gene, with the linkage signal picked up over a substantial length of the genome. The scatter also indicates that positional cloning efforts based on the location estimates from a single study alone may not correctly target the underlying gene.

The mean number of linkage results in a cluster was 8.4, and it is clear that the linkage signal for each putative underlying gene is not identified by every study (most, but not all, studies were genome-wide scans). The absence of a linkage signal could reflect the different population structures in the studies, the genetic heterogeneity of type 2 diabetes or the lack of power in individual studies.

If only a subset of type 2 diabetes susceptibility genes was required for the disease in any individual and the frequencies of these susceptibility genes were different in each population, linkage results would be variable. This might easily arise if hyperglycaemia was a collection of subtly different phenotypes, each resulting from different subsets of underlying genes. Heterogeneity for diabetes as a broad phenotype is already apparent in the distinct features of type 1 diabetes, type 2 diabetes and MODY/monogenic diabetes [114]. The non-monogenic form of type 2 diabetes is likely to feature further levels of heterogeneity. Phenotypic heterogeneity may be largely independent of ethnic background, however, since there was a mixture of racial groups in all replication clusters (Tables 1 and 2). Even though association studies [88, 115] suggest that there will be some differences in the frequency of individual type 2 diabetes genes between ethnic backgrounds, many type 2 diabetes genes may be shared between individuals of different continents of origin.

Studies with an insufficient number of participants may also fail to detect a linkage signal for locations where the genetic effect is weak. Given that >2,500 families may be needed to detect loci conferring a genotype risk ratio of less than 2 ([116] compare with Table 3), it is expected that many studies will not be able to identify the location of some genes. The lack of support for a particular location in some studies does not discount the potential importance of the existing data, but does emphasise the importance of replication and the need for larger studies.

Fifteen GWA results were in or near a cluster. Given the size of clusters and the amount of the genome they cover, we were unable to demonstrate that this was a significant correlation. The GWA type 2 diabetes loci may be the source of the adjacent linkage signals in some cases, but demonstrating this would require a specific re-analysis of the linkage data. Our nine best replication clusters are not associated with a GWA result, and these clusters quite likely reflect genes with even bigger effects than do the current GWA candidates. This is because the 20 loci identified by GWA studies have small \(\lambda_s\) values (Table 3), with a \(\lambda_s\) value in combination of only about 1.08–1.10. The expected \(\lambda_s\) value for type 2 diabetes appears to be somewhere between 1.2 and 6 [96, 117–120]. Therefore, we must conclude that much of the genetic cause of type 2 diabetes remains unidentified and the GWA results should be seen as complementing, not replacing, the linkage analyses.

While MODY genes are not considered responsible for typical type 2 diabetes, five of the nine known MODY genes (http://www.ncbi.nlm.nih.gov/omim/, accessed December 2008) fall into the following replication clusters: MODY1/HNF4A (chromosome 20/42.4 Mb); MODY3/HNF1A (chromosome 12/119.9 Mb); MODY7/KLF11 (chromosome 2/10.1 Mb); MODY8/CEL (chromosome 9/134.9 Mb); and MODY9/PAX4 (chromosome 7/127.0 Mb). Since some loci may contain both weak and strong alleles [121], it remains possible that variants in these genes could be influencing typical type 2 diabetes [122].

In addition to phenotypic heterogeneity, we should expect to see allelic heterogeneity in type 2 diabetes. There are ten million SNPs in the human genome, and in the two individuals whose whole genomes have been sequenced, over 3 million SNP differences were detected between each individual and the reference human genome sequence [123]. The majority of these SNPs will be rare in the whole population [124, 125]. Individual genes can harbour hundreds of genetic variants. As an illustration, the MODY3 gene (HNF1A, the commonest monogenic cause of diabetes) has 200–300 known mutations (missense, nonsense, splicing defects, insertions and deletions) [126]. Therefore, we should expect the genetic architecture of typical type 2 diabetes to be much more complex still, and this will make it harder to identify genetic causes of the disease.

Association studies and linkage analysis both look for new causative genes, but use different principles [92, 121, 127–129]. Association analysis is considered more powerful than linkage analysis [92, 116], though the difference may have been inflated [128]. In linkage analysis, gene transmission is assessed one family at a time. If every family studied is affected by a different rare disease allele elsewhere in the same gene (allelic heterogeneity), a linkage signal should still be apparent [92]. Disease mutations need not be identical, only close enough to a marker that recombination over one or two generations rarely separates them in any particular family. A GWA study, however, would need to either type the same rare SNP directly in each individual, which is unlikely with current markers, or to indirectly type one or more rare SNPs by typing a representative or ‘tag’ SNP [125] on the same haplotype block. The disease allele and the typed tag SNP also need to occur at about the same frequency for a study to identify the relationship between them [127]. The gene typing arrays currently available are largely based on the HapMap results [124, 125], which specifically targeted common SNPs. Investigators made the assumption that common diseases would be due to common variants [128]. It is not surprising therefore that the current set of GWA results are associated with SNPs with high-risk allele frequencies. As a corollary, the future discovery of new rare disease variants may be difficult using this methodology. Current GWA studies may have also missed copy number variants [130]. Therefore, GWA studies may be identifying genes that linkage studies have been insufficiently powered to detect, while linkage studies may be detecting genes with multiple rare variants within the same gene, variants rare enough individually to go undetected by current GWA tools.

Linkage studies have helped in the discovery of monogenic forms of type 2 diabetes (MODY), and most of these have probably now been identified. GWA studies have identified a number of common genes with low penetrance, but current methods may soon reach a limit here, too. This leaves the moderately rare genetic variants with modest penetrance still to be identified (see box 7 in [129]). Since one gene could harbour multiple, different, rare, modestly penetrant variants, we suggest that the most likely interpretation of strong linkage clusters with no associated GWA result is the presence of variants of this sort. The discovery of genes for type 2 diabetes now appears to require an increased catalogue of rare variants or large-scale re-sequencing of well-defined diabetes linkage ‘hotspots’ [131].