Background

The genetic architecture of type 2 diabetes (T2D) appears to be composed of several genes, each of which has a modest impact on disease risk. Despite significant advances in our understanding of the genetic determinants of the monogenic forms of diabetes, the definitive identification of genes that increase risk of common T2D in the general population has been far more elusive. However, a string of recent genome-wide association studies (GWAS) has given promising clues to additional genes involved in common T2D risk.

GWAS offer an approach to gene discovery that is unbiased with regard to presumed functions or locations in the genome. The common method of control selection in many GWAS is to form a single pool of potential controls consisting of subjects who had not become cases by the end of the study period. However, Greenland and Thomas [1] and Lubin and Gail [2] have shown that this method leads to biased estimates of the rate ratio, a bias termed "case-exclusion bias". Moreover, cases and controls can differ in population origin if the two groups are recruited independently or under different inclusion criteria, and the resulting population stratification can lead to a greater-than-nominal type I error rate.

Another method of control selection, termed "incidence density sampling", uses subjects who survived to the time of case occurrence to make a pool of potential controls for each case. The pool of potential controls may include subjects who later become cases and subjects who develop other diseases. This nested case-control design can be a very efficient approach to obtain unbiased estimates of relative risks associated with genetic variants.

In this study, we use the GWAS data from the Framingham Heart Study (FHS, Genetic Analysis Workshop 16 Problem 2) to compare the influence of control selection on the results for T2D.

Methods

FHS

The FHS is a community-based, multigenerational, longitudinal study of cardiovascular disease and its risk factors, including diabetes. The FHS began in 1948 to investigate the causes of heart disease. Men and women between the ages of 28 and 62 years were recruited and followed prospectively over time. Beginning in 1971, offspring of the Original Cohort were recruited as part of the Framingham Offspring Study. There are a total of 6752 subjects. There are 765 pedigrees with 2 to 301 genotyped subjects: 134 pedigrees with 2, 123 with 3, 98 with 4, 85 with 5, 177 with 6 to 10, 72 with 11 to 15, 30 with 16 to 20, and 46 with more than 20.

Genotyping

FHS GWAS data were generated on the Affymetrix 250 k Sty, 250 k Nsp, and supplemental 50 k platforms. Single-nucleotide polymorphisms (SNPs) were selected for analysis based on previously reported GWAS [3, 4]. The genes and SNPs used in this study to represent the 18 most significant T2D SNPs from these GWAS are: peroxisome proliferator-activated receptor gamma (PPARG; rs1801282); insulin-like growth factor 2 binding protein 2 (IGF2BP2; rs4402960); cyclin-dependent kinase 5 regulatory subunit-associated protein 1-like 1 (CDKAL1; rs7754840 and rs10946398); variants near cyclin-dependent kinase inhibitor 2A/2B (CDKN2A/2B; rs10811661 and rs564398); hematopoietically expressed homeobox (HHEX; rs5015480 (r² = 1 with rs1111875)); transcription factor-7-like 2 (TCF7L2; rs10885409 and rs7901695 (r² = 0.8 with rs7903146)); potassium inwardly rectifying channel subfamily J member 11 (KCNJ11; rs5215 (r² = 0.89 with rs5219)); fat mass and obesity-associated gene (FTO; rs9939609 and rs8050136); tetraspanin 8/leucine-rich repeat-containing G protein-coupled receptor 5 (TSPAN8/LGR5; rs7961581); cell division cycle 123 (CDC123; rs4747969 (r² = 0.83 with rs12779790)); Wolfram syndrome 1 (WFS1; rs4689394 (r² = 1 with rs10010131)); ADAM metallopeptidase with thrombospondin type 1 motif, 9 (ADAMTS9; rs4607103); thyroid adenoma associated (THADA; rs13431070 (r² = 1 with rs7578597)); and JAZF zinc finger 1 (JAZF1; rs864745).

Case-control definitions

Cases were defined as participants who received a diagnosis of T2D during follow-up of the FHS cohort. Cases came from the first, second, or third generation of the FHS. Age at diagnosis among the 231 unrelated male and female cases ranged from 20 to 80 years.

In our nested incidence density case-control approach, 10 individually matched controls per case were randomly selected with replacement from members of the cohort who did not have a T2D diagnosis at the time the case was identified. Age is a strong risk factor for T2D, so controls were always selected from participants whose age at enrollment was within ±5 years of the case's. Controls were additionally matched on sex and on body mass index (BMI) at enrollment (±2 kg/m²). Cases and controls were not members of the same family. In our case-exclusion approach, controls were selected from members of the FHS who never received a T2D diagnosis during any of the recorded follow-up. We then adjusted for age, sex, and BMI, the matching criteria used in our nested case-control approach.
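The control-selection rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analysis: the `Subject` record, its field names, and the helper functions are hypothetical, and we assume that "with replacement" means a subject may serve as a control for more than one case (and may later become a case), while controls within a single matched set are distinct.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Subject:
    # Hypothetical record layout; field names are assumptions for illustration.
    subject_id: int
    family_id: int
    sex: str
    age_enroll: float          # age at enrollment (years)
    bmi_enroll: float          # BMI at enrollment (kg/m^2)
    dx_time: Optional[float]   # time of T2D diagnosis; None if never diagnosed


def eligible_controls(case: Subject, cohort: List[Subject],
                      age_caliper: float = 5.0,
                      bmi_caliper: float = 2.0) -> List[Subject]:
    """Risk set for one case: subjects undiagnosed at the case's diagnosis
    time who meet the matching criteria and come from a different family."""
    return [
        s for s in cohort
        if s.subject_id != case.subject_id
        and s.family_id != case.family_id
        # Still at risk: never diagnosed, or diagnosed later than the case.
        and (s.dx_time is None or s.dx_time > case.dx_time)
        and s.sex == case.sex
        and abs(s.age_enroll - case.age_enroll) <= age_caliper
        and abs(s.bmi_enroll - case.bmi_enroll) <= bmi_caliper
    ]


def sample_controls(case: Subject, cohort: List[Subject],
                    n_controls: int = 10,
                    rng: Optional[random.Random] = None) -> List[Subject]:
    """Draw up to n_controls distinct controls from the case's risk set.
    The cohort is not depleted, so a subject can be reused for other cases."""
    rng = rng or random.Random()
    pool = eligible_controls(case, cohort)
    return rng.sample(pool, min(n_controls, len(pool)))
```

Calling `sample_controls` once per case, in any order, yields the matched sets; a case-exclusion pool would instead keep only subjects with `dx_time is None`.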

Statistical analyses

As a quality control measure, we tested for Hardy-Weinberg disequilibrium in controls using an exact test. All markers were in Hardy-Weinberg equilibrium in the observed FHS data and in all simulated samples. All individuals had complete data for sex, age, BMI, and diabetes, except for 15 controls in the incidence density (ID) sample and 28 in the case-exclusion (CE) sample for whom BMI at enrollment was not available. No SNP had more than 10.4% missing data, which we judged to be acceptable.
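For readers unfamiliar with the exact test, the following Python sketch shows one common formulation: conditional on the observed allele counts, it enumerates every possible heterozygote count and sums the probabilities of configurations no more likely than the observed one. The function name is ours, and we do not know which implementation of the exact test was used in the analysis.

```python
from math import exp, lgamma, log


def _log_factorial(n: int) -> float:
    return lgamma(n + 1)


def hwe_exact_pvalue(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Two-sided exact test of Hardy-Weinberg equilibrium for one biallelic
    SNP, given genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped individuals")
    n_a = 2 * n_aa + n_ab   # copies of allele A
    n_b = 2 * n_bb + n_ab   # copies of allele B

    def log_prob(hets: int) -> float:
        # P(hets | n, n_a) under HWE, computed on the log scale.
        homs_a = (n_a - hets) // 2
        homs_b = (n_b - hets) // 2
        return (_log_factorial(n)
                - _log_factorial(homs_a)
                - _log_factorial(hets)
                - _log_factorial(homs_b)
                + hets * log(2.0)
                + _log_factorial(n_a) + _log_factorial(n_b)
                - _log_factorial(2 * n))

    observed = log_prob(n_ab)
    p_value = 0.0
    # The heterozygote count must have the same parity as the allele count.
    for hets in range(n_a % 2, min(n_a, n_b) + 1, 2):
        lp = log_prob(hets)
        if lp <= observed + 1e-9:   # small tolerance for floating-point ties
            p_value += exp(lp)
    return min(p_value, 1.0)
```

Applied to the control genotype counts of each SNP, a small p-value would flag departure from equilibrium.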

Genetic associations with T2D (odds ratios, confidence intervals, and statistical tests) were estimated and tested using a conditional logistic regression under the additive model for the ID sampling approach and using logistic regression, adjusted for matching variables, in the case-exclusion approach. These analyses were carried out in SAS software using the PHREG procedure.
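To make the conditional likelihood concrete, the sketch below fits a single-covariate conditional logistic model by Newton-Raphson in plain Python. It is an illustration under stated assumptions rather than a reimplementation of the SAS analysis: the covariate is the risk-allele count (0, 1, or 2) under the additive model, each matched set contains exactly one case, and the function and argument names are hypothetical.

```python
from math import exp, sqrt
from typing import List, Sequence, Tuple


def fit_conditional_logit(
    matched_sets: Sequence[Tuple[float, List[float]]],
    max_iter: int = 50,
    tol: float = 1e-10,
) -> Tuple[float, float]:
    """Fit log L(beta) = sum_i [beta*x_case_i - log(sum_j exp(beta*x_ij))],
    where j runs over the case and controls of matched set i.
    matched_sets holds (case_x, [control_x, ...]) pairs.
    Returns (beta, standard error); exp(beta) is the per-allele odds ratio."""
    beta = 0.0
    info = 0.0
    for _ in range(max_iter):
        score = 0.0
        info = 0.0
        for case_x, control_xs in matched_sets:
            xs = [case_x] + list(control_xs)
            weights = [exp(beta * x) for x in xs]
            total = sum(weights)
            mean_x = sum(w * x for w, x in zip(weights, xs)) / total
            mean_x2 = sum(w * x * x for w, x in zip(weights, xs)) / total
            score += case_x - mean_x          # first derivative
            info += mean_x2 - mean_x ** 2     # minus the second derivative
        if info <= 0.0:
            raise ValueError("no informative matched sets")
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta, 1.0 / sqrt(info)
```

With 1:1 matching and a binary covariate, this estimator reduces to the ratio of discordant pairs, which gives a quick sanity check: 30 pairs with only the case exposed and 10 pairs with only the control exposed yield an odds ratio of 3.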

Simulations

We used simulations to investigate the effects of control selection in a larger sample than the observed FHS data and with SNPs of larger effect. We simulated 11 scenarios of 100 replicates each, varying the minor allele frequency and the generating hazard ratio. These simulations were used to compare bias and power between the control sampling designs. A SAS program was used to simulate diabetes as a function of SNP genotype. We generated data sets of 10,000 individuals with SNP genotypes assigned probabilistically according to allele frequencies of 0.10, 0.30, or 0.50. We then assigned diabetes status and time of onset using an exponential model based on SNP hazard ratios from 1.3 to 3 (see Table 1). We selected five controls for each case according to the ID (risk set) sampling scheme and set a 5:1 control:case ratio for the CE sampling at the end of follow-up. We then estimated odds ratios (ORs) and confidence intervals (CIs) and performed tests of association for each SNP. We repeated this 100 times per scenario and report the average bias and the estimated power for each SNP, with power defined as the proportion of replicates showing a statistically significant association (p < 0.05). Bias for the ID and CE methods was expressed as the ratio of the OR estimated under each method to the generating hazard ratio.
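The logic of this simulation can be sketched in Python as follows. This is a simplified illustration, not a translation of the SAS program, and it rests on several assumptions of ours: exposure is reduced to carrier status (at least one risk allele) rather than the additive genotype, the baseline rate and follow-up length are arbitrary, the ID estimate uses a Mantel-Haenszel estimator over matched sets in place of conditional logistic regression, and the CE estimate is a crude 2x2 odds ratio. Power is not computed here.

```python
import random
from math import exp
from typing import List, Tuple

Person = Tuple[int, float, bool]   # (carrier, observed time, is_case)


def simulate_cohort(n: int, maf: float, hazard_ratio: float,
                    baseline_rate: float, follow_up: float,
                    rng: random.Random) -> List[Person]:
    """Exponential onset times; returns the cohort sorted by observed time."""
    cohort = []
    for _ in range(n):
        alleles = (rng.random() < maf) + (rng.random() < maf)
        carrier = 1 if alleles > 0 else 0           # binary simplification
        onset = rng.expovariate(baseline_rate * hazard_ratio ** carrier)
        cohort.append((carrier, min(onset, follow_up), onset < follow_up))
    cohort.sort(key=lambda person: person[1])
    return cohort


def id_odds_ratio(cohort: List[Person], controls_per_case: int,
                  rng: random.Random) -> float:
    """Risk-set sampling with a Mantel-Haenszel OR across matched sets."""
    num = den = 0.0
    n = len(cohort)
    for i, (carrier, _, is_case) in enumerate(cohort):
        if not is_case:
            continue
        at_risk = range(i + 1, n)     # everyone with a later observed time
        k = min(controls_per_case, len(at_risk))
        if k == 0:
            continue
        exposed = sum(cohort[j][0] for j in rng.sample(at_risk, k))
        if carrier:
            num += (k - exposed) / (k + 1)
        else:
            den += exposed / (k + 1)
    return num / den


def ce_odds_ratio(cohort: List[Person], controls_per_case: int,
                  rng: random.Random) -> float:
    """Controls drawn only from subjects still disease-free at end of follow-up."""
    cases = [p for p in cohort if p[2]]
    never = [p for p in cohort if not p[2]]
    controls = rng.sample(never, min(len(never), controls_per_case * len(cases)))
    a = sum(p[0] for p in cases)
    b = len(cases) - a
    c = sum(p[0] for p in controls)
    d = len(controls) - c
    return (a * d) / (b * c)


def expected_ce_odds_ratio(hazard_ratio: float,
                           cumulative_baseline_hazard: float) -> float:
    """Large-sample CE odds ratio under the exponential model."""
    risk0 = 1.0 - exp(-cumulative_baseline_hazard)
    risk1 = 1.0 - exp(-hazard_ratio * cumulative_baseline_hazard)
    return (risk1 / (1.0 - risk1)) / (risk0 / (1.0 - risk0))
```

Dividing each estimate by the generating hazard ratio gives the bias ratio for one replicate, and averaging over replicates gives the summary described above. The closed-form function shows why the CE estimate drifts upward as cumulative incidence grows: with a hazard ratio of 2 and a cumulative baseline hazard of 0.5, the expected CE odds ratio is about 2.65 rather than 2.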

Table 1 OR and bias from simulated cohorts under different control sampling designs based on 100 replicates for each design

Results and discussion

To maximize precision, we chose a ratio of 10 controls per case for both sampling strategies in the FHS data. Because we did not have exact dates and BMI at onset of diabetes, we used the age at enrollment, i.e., the age at Visit 1, and BMI at enrollment to match cases and controls. To account for the randomness of ID control selection, we repeated the random sampling and conditional logistic regression 10 times. The OR estimates varied widely across these replicates, with a coefficient of variation of 14% to 20% per SNP. We report the average OR from these 10 replicates in Table 2, along with confidence limits based on the method of Rubin [5], which takes within-replicate and across-replicate variation into account. We also show the average p-value per SNP to indicate whether statistical significance was achieved in any replicate.
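The combination of replicate estimates can be illustrated with a short Python sketch of Rubin's rules. It is a simplified version under assumptions of ours: estimates are pooled on the log-OR scale, and the interval uses a normal quantile rather than the t reference distribution that Rubin's method prescribes, so it will be slightly too narrow when the number of replicates is small. The function name and return structure are hypothetical.

```python
from math import exp, sqrt
from statistics import NormalDist, mean, variance
from typing import Dict, Sequence


def pool_replicates(log_ors: Sequence[float],
                    std_errors: Sequence[float],
                    level: float = 0.95) -> Dict[str, object]:
    """Combine m replicate estimates: total variance = W + (1 + 1/m) * B,
    where W is the mean within-replicate variance and B is the
    between-replicate variance of the log odds ratios."""
    m = len(log_ors)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two replicates with matching lengths")
    pooled = mean(log_ors)
    within = mean([se ** 2 for se in std_errors])
    between = variance(log_ors)
    total = within + (1.0 + 1.0 / m) * between
    # Normal approximation (assumption); Rubin's method uses a t quantile.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sqrt(total)
    return {
        "or": exp(pooled),
        "ci": (exp(pooled - half_width), exp(pooled + half_width)),
        "total_var": total,
    }
```

For a given SNP, the inputs would be the 10 log-OR estimates and their standard errors from the repeated ID samples.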

Table 2 OR for T2D SNPs from previous meta-analysis and in the FHS data under different control sampling designs

We found no significant association with any of the 18 previously reported SNPs using either ID sampling or CE sampling in the FHS (Table 2). We included 18 SNPs with convincing evidence of association; however, two important SNPs (rs757210 in TCF2 and rs13266634 in SLC30A8) were missing from our genotyping data and could not be considered in the FHS. One drawback of our study is the limited number of T2D cases, despite the very large database. With only 189 incident cases and 231 total cases, our study had low power to detect genetic association between SNPs and T2D, especially considering the expected magnitudes of association based on previous reports. Owing to the wide CIs of the ORs in our two scenarios, our results are less conclusive than those of the previous studies conducted in larger samples (>1000 cases). An additional explanation for the low power is that we considered each SNP separately rather than a combination of variants acting additively on risk, which may have a larger effect.

Because the empirical data are hard to interpret owing to the small number of cases and small effect sizes, we further examined differences between control sampling methods via simulation with larger sample sizes and effect sizes. For each simulation scenario, we simulated 100 cohort data sets, each with approximately 500 cases, as described in the Methods section (Table 1). These simulations show that when more precision can be obtained and larger effect sizes are considered, ID sampling does indeed have less bias, while CE methods have a slight upward bias, leading to the appearance of increased power. We suggest that this apparent increase in power be interpreted with caution given the bias, and we recommend ID sampling as the appropriate strategy for case-control analyses nested in cohorts.