Transethnic insight into the genetics of glycaemic traits: fine-mapping results from the Population Architecture using Genomics and Epidemiology (PAGE) consortium

Aims/hypothesis Elevated levels of fasting glucose and fasting insulin in non-diabetic individuals are markers of dysregulation of glucose metabolism and are strong risk factors for type 2 diabetes. Genome-wide association studies have discovered over 50 SNPs associated with these traits. Most of these loci were discovered in European populations and have not been tested in a well-powered multiethnic study. We hypothesised that a large, ancestrally diverse, fine-mapping genetic study of glycaemic traits would identify novel and population-specific associations that were previously undetectable by European-centric studies.
Methods In a multiethnic study, up to 26,760 unrelated individuals without diabetes, of predominantly Hispanic/Latino and African ancestries, were genotyped using the Metabochip. Transethnic meta-analysis of racial/ethnic-specific linear regression analyses was performed for fasting glucose and fasting insulin. We attempted to replicate 39 fasting glucose and 17 fasting insulin loci. Genetic fine-mapping was performed through sequential conditional analyses in 15 regions that included both the initially reported SNP association(s) and denser coverage of SNP markers. In addition, Metabochip-wide analyses were performed to discover novel fasting glucose and fasting insulin loci. The most significant SNP associations were further examined using bioinformatic functional annotation.
Results Previously reported SNP associations were significantly replicated (p ≤ 0.05) in 31/39 fasting glucose loci and 14/17 fasting insulin loci. Eleven glycaemic trait loci were refined to a smaller list of potentially causal variants through transethnic meta-analysis. Stepwise conditional analysis identified two loci with independent secondary signals (G6PC2-rs477224 and GCK-rs2908290), which had not previously been reported. Population-specific conditional analyses identified an independent signal in G6PC2 tagged by the rare variant rs77719485 in African ancestry. Further Metabochip-wide analysis uncovered one novel fasting insulin locus at SLC17A2-rs75862513.
Conclusions/interpretation These findings suggest that while glycaemic trait loci often have generalisable effects across the studied populations, transethnic genetic studies help to prioritise likely functional SNPs and identify novel associations that may be population-specific, and in turn have the potential to influence screening efforts or therapeutic discoveries.
Data availability The summary statistics from each of the ancestry-specific and transethnic (combined ancestry) results can be found under the PAGE study on dbGaP here: https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000356.v1.p1
Electronic supplementary material The online version of this article (doi:10.1007/s00125-017-4405-1) contains peer-reviewed but unedited supplementary material, which is available to authorised users.

The Hispanic Community Health Study / Study of Latinos (HCHS/SOL) is a population-based cohort study of 16,415 self-identified Hispanic/Latino individuals aged 18-74 years, randomly selected from households in four U.S. field centers (Chicago, IL; Miami, FL; Bronx, NY; San Diego, CA) [4]. The cohort includes participants who self-identified as having a Hispanic/Latino background, the largest groups being Central American (n = 1,730), Cuban (n = 2,348), Dominican (n = 1,460), Mexican (n = 6,471), Puerto-Rican (n = 2,728), and South American (n = 1,068). The baseline examination, conducted between 2008 and 2011, included a clinical visit with comprehensive biological, behavioral, and sociodemographic assessments. Smoking status was measured by self-report and categorised into three groups: current, former, and never smokers. Two questions were used: "Have you ever smoked at least 100 cigarettes in your entire life?" and "Do you now smoke daily, some days or not at all?" Participants who had smoked at least 100 cigarettes in their entire life and reported smoking daily or some days were considered current smokers; those who had smoked at least 100 cigarettes in their entire life but did not report smoking daily or some days were considered former smokers; and those who had not smoked at least 100 cigarettes in their entire life were considered never smokers.
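The three-way smoking classification described above can be written as a small decision rule. The following is a minimal sketch, not the study's own code; the function name and the answer strings ("daily", "some days", "not at all") are assumptions chosen to mirror the two questionnaire items.

```python
def classify_smoking(ever_100_cigarettes: bool, current_frequency: str) -> str:
    """Classify smoking status from the two HCHS/SOL questionnaire items.

    ever_100_cigarettes: answer to "Have you ever smoked at least 100
        cigarettes in your entire life?"
    current_frequency: answer to "Do you now smoke daily, some days or not
        at all?" (hypothetical coding: "daily", "some days", "not at all")
    """
    if not ever_100_cigarettes:
        # Fewer than 100 lifetime cigarettes: never smoker.
        return "never"
    if current_frequency in {"daily", "some days"}:
        # At least 100 lifetime cigarettes and smoking now: current smoker.
        return "current"
    # At least 100 lifetime cigarettes but not smoking now: former smoker.
    return "former"
```

Under this coding, a participant who answers "yes" to the first question and "not at all" to the second is classified as a former smoker.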
The Women's Health Initiative (WHI) is a prospective study investigating post-menopausal women's health [5]. A total of 161,808 women aged 50-79 years old were recruited from 40 U.S. clinical centers between 1993 and 1998. WHI consists of two parts: randomised clinical trials of hormone therapy, dietary modification, and calcium/Vitamin D supplementation, and an observational cohort study. Socio-demographic characteristics, lifestyle factors (e.g. smoking), medical history, medication use and physical measures of height and weight were collected at the baseline visit.
Data on lifetime active and passive smoking were collected. Women were initially classified by active smoking status into current, former or never smokers (participants that had not smoked 100 cigarettes in their life). A subset of African American participants was genotyped through the WHI SNP Health Association Resource (SHARe). All African American, Hispanic/Latino, Asian, and Native American/American Indian individuals who provided informed consent to submit their genotype data to dbGaP were either directly genotyped on the Metabochip or had genome-wide data on the Affymetrix 6.0 array available to impute Metabochip SNPs.
The Multiethnic Cohort (MEC) is a population-based prospective cohort study of over 215,000 men and women in Hawaii and California aged 45-75 at baseline (1993-1996) and primarily of five ethnic/racial groups: African Americans, Native Hawaiians, Whites, Latinos, and Japanese Americans [6]. MEC was funded by the National Cancer Institute in 1993 to examine lifestyle risk factors and genetic susceptibility to cancer. All eligible cohort members completed baseline and follow-up questionnaires. Each participant completed a mailed, epidemiologic self-administered questionnaire regarding demographic, dietary, and lifestyle traits. This questionnaire included history of daily cigarette smoking during the past two weeks, smoking duration, and a record of current medications. Subjects were selected for Metabochip genotyping based on availability of biomarker data or from a pool of controls for a study of type 2 diabetes.
IRB Statement All participants included in these analyses have given consent for genetic studies and data sharing.

Continuous BMI measurement
In all CALiCo studies and in WHI, BMI was calculated from height and weight measured in a clinic setting at the time of study enrolment. In WHI only, measurements collected 1 or 3 years after enrolment were substituted for 140 participants missing enrolment height and/or weight. In MEC and BioMe, self-reported height and weight were used to calculate baseline BMI.

Smoking status measurement Smoking status was harmonised across studies as current versus former/never. See PAGE cohort descriptions above for study-specific details on ascertainment.
Racial/ethnic grouping In all studies, self-reported racial/ethnic group was collected via epidemiological questionnaires at baseline.
Glucose/insulin measurement Analyses were performed for fasting glucose (mmol/l) and natural log-transformed fasting insulin (pmol/l). Individuals were excluded from the analysis if they were on diabetes treatment (oral or insulin), had a fasting plasma glucose equal to or greater than 7 mmol/l (126 mg/dl), or had fasted for less than 8 hours. Individuals with BMI < 16.5 kg/m² or BMI > 70 kg/m² were also excluded, on the assumption that these extremes could be attributable to data coding errors, an underlying illness, or possibly a familial syndrome. In BioMe, oral type 2 diabetes medications used for exclusion included Acetohexamide, Tolazamide, Chlorpropamide, Glipizide, Glyburide, Glimepiride, Repaglinide, Nateglinide, Metformin, Rosiglitazone, Pioglitazone, Troglitazone, Acarbose, Miglitol, Sitagliptin, and Exenatide.
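The exclusion criteria and the insulin transformation above can be expressed as a short eligibility check. This is an illustrative sketch only; the `Participant` record, its field names, and the helper names are hypothetical and do not come from the study's pipeline.

```python
import math
from dataclasses import dataclass


@dataclass
class Participant:
    # Hypothetical record layout for illustration.
    on_diabetes_treatment: bool      # oral medication or insulin
    fasting_glucose_mmol_l: float
    fasting_hours: float
    bmi: float                       # kg/m^2


def is_eligible(p: Participant) -> bool:
    """Apply the exclusion criteria stated in the text."""
    if p.on_diabetes_treatment:
        return False
    if p.fasting_glucose_mmol_l >= 7.0:   # 7 mmol/l, about 126 mg/dl
        return False
    if p.fasting_hours < 8:
        return False
    if p.bmi < 16.5 or p.bmi > 70:
        return False
    return True


def ln_insulin(fasting_insulin_pmol_l: float) -> float:
    """Natural log transform used for the fasting insulin outcome."""
    return math.log(fasting_insulin_pmol_l)
```

The boundary handling follows the wording of the criteria: a fasting glucose of exactly 7 mmol/l is excluded, whereas a fast of exactly 8 hours or a BMI of exactly 16.5 or 70 kg/m² is retained.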

Genotyping and quality control
DNA Extraction In the MEC and WHI, DNA was purified from buffy coat samples. A subset of MEC DNA samples were whole-genome amplified by Molecular Staging Inc. following their standard protocol. For CALiCo and BioMe studies, DNA was extracted from blood samples drawn at baseline.
Genotyping Study-specific details are summarised in ESM Table 1. Genotypes were called separately for each study using GenomeStudio with the GenCall 2.0 algorithm. Samples were called using study-specific cluster definitions (based on samples with call rate >95%; ARIC, CARDIA, MEC, WHI, BioMe) and kept in the analysis if call rate was >95%. We excluded SNPs with GenTrain score <0.6 (ARIC, CARDIA, MEC, WHI), Proper info score ≥ 0.4 (BioMe), cluster separation score <0.4, call rate <0.95, and Hardy-Weinberg equilibrium p value < 1 × 10⁻⁶. We utilised the common 90 YRI samples and excluded any SNP that had more than one Mendelian error (in 30 YRI trios), any SNP that had more than two replication errors with discordant calls when comparisons were made across studies in the 90 YRI samples, and any SNP that had more than three discordant calls for the 90 YRI samples genotyped in PAGE versus the HapMap database.
SNPs were excluded from the meta-analyses if they were present in fewer than three studies.
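The SNP-level filters above amount to a list of threshold checks. The sketch below shows one way to encode them, assuming a hypothetical `SnpQc` record; it covers only the thresholds whose direction is unambiguous in the text and leaves out the BioMe-specific info score criterion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SnpQc:
    # Hypothetical per-SNP QC summary; field names are illustrative.
    gentrain: float
    cluster_separation: float
    call_rate: float
    hwe_p: float
    mendelian_errors: int       # in the 30 YRI trios
    replication_errors: int     # across studies in the 90 YRI samples
    hapmap_discordant: int      # PAGE versus HapMap calls, 90 YRI samples
    n_studies: int              # studies in which the SNP is present


def exclusion_reasons(snp: SnpQc) -> List[str]:
    """Return every stated QC criterion the SNP fails (empty list = pass)."""
    reasons = []
    if snp.gentrain < 0.6:
        reasons.append("GenTrain score < 0.6")
    if snp.cluster_separation < 0.4:
        reasons.append("cluster separation score < 0.4")
    if snp.call_rate < 0.95:
        reasons.append("call rate < 0.95")
    if snp.hwe_p < 1e-6:
        reasons.append("HWE p < 1e-6")
    if snp.mendelian_errors > 1:
        reasons.append("more than 1 Mendelian error")
    if snp.replication_errors > 2:
        reasons.append("more than 2 replication errors")
    if snp.hapmap_discordant > 3:
        reasons.append("more than 3 discordant calls versus HapMap")
    if snp.n_studies < 3:
        reasons.append("present in fewer than 3 studies")
    return reasons
```

Returning the list of failed criteria, rather than a single boolean, makes it easier to tabulate how many SNPs each filter removes.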

Replication and Fine-mapping of known glycaemic trait loci
For replication of index SNPs from GWAS, we used a nominal significance level (α = 0.05). For transethnic signal fine-mapping we used a locus-specific significance level (ranging from α = 1.41 × 10⁻⁵ to α = 4.1 × 10⁻⁴), which is 0.05 divided by the number of variants passing quality control at each locus.
To fine-map previously identified loci, we investigated the patterns of association at 15 known fasting glucose and fasting insulin loci using meta-analysis results from 13,613 and 2,406 SNPs genotyped on the Metabochip in up to 26,760 and 22,674 individuals for fasting glucose and fasting insulin, respectively. At loci that exhibited evidence of regional significance (0.05/number of SNPs in the region), we performed a series of sequential conditional analyses, adding the most significant lead SNP to the regression model as an additional covariate and testing all remaining regional SNPs not already in the model for association. Lead SNPs were added to the model sequentially until the strongest remaining SNP association showed a conditional p value greater than the regional significance level.
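The stepwise procedure described above can be sketched as a simple loop. This is an illustration under stated assumptions, not the authors' implementation: the association test is passed in as a hypothetical callable `p_value(snp, conditioned_on)`, which in practice would refit the regression (or meta-analysis) with the already selected lead SNPs as covariates.

```python
from typing import Callable, List, Sequence


def sequential_conditional(
    snps: Sequence[str],
    p_value: Callable[[str, List[str]], float],
    family_alpha: float = 0.05,
) -> List[str]:
    """Return the lead SNPs selected by sequential conditional analysis.

    snps: variants passing QC at the locus.
    p_value: hypothetical callable giving the p value for `snp` when the
        SNPs in `conditioned_on` are included as covariates.
    """
    # Regional significance level: 0.05 / number of SNPs at the locus.
    alpha = family_alpha / len(snps)
    selected: List[str] = []
    remaining = list(snps)
    while remaining:
        pvals = {s: p_value(s, selected) for s in remaining}
        lead = min(pvals, key=pvals.get)
        if pvals[lead] > alpha:
            # Strongest remaining signal is no longer regionally significant.
            break
        selected.append(lead)
        remaining.remove(lead)
    return selected
```

Keeping the association test as an injected function separates the selection logic from the choice of model, so the same loop could be driven by study-level regressions or by meta-analysed conditional results.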

Strategy for selecting novel associations
The Metabochip selected genotyping content for type 2 diabetes, 2-hour glucose, glycated hemoglobin, fasting glucose, fasting insulin, myocardial infarction and coronary artery disease, high-density lipoprotein/low-density lipoprotein/triglycerides/total cholesterol, BMI, waist to hip ratio, body fat percentage, height, waist circumference, diastolic/systolic blood pressure, QT interval, mean platelet volume, platelet count, and white blood cell count. Given that Metabochip SNPs were included specifically because of prior evidence for association with these traits, the Metabochip-wide analyses were defined as testing for pleiotropy with any of these cardiometabolic traits. Novel trait associations were subsequently investigated in the GWAS catalog to identify the trait(s) for which the locus had previously been reported. Metabochip-wide results were considered statistically significant if they reached a threshold of p value ≤ 2.5 × 10⁻⁷ (0.05/196975), were not in LD with a reported index SNP (r² < 0.2), and were not within 500 kb of a reported index SNP. Secondary independent or population-specific associations in the 15 fine-mapping regions are reported as significant if they met the region-specific significance level of 0.05/number of SNPs in the locus.
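The three criteria for a novel Metabochip-wide association can be combined into one check. The sketch below is illustrative; the function and argument names are hypothetical, and it uses the unrounded value of 0.05/196975 (about 2.54 × 10⁻⁷) rather than the rounded 2.5 × 10⁻⁷ quoted in the text, which is an assumption about how the threshold was applied.

```python
FAMILY_ALPHA = 0.05
N_TESTS = 196975                       # divisor quoted in the text
METABOCHIP_ALPHA = FAMILY_ALPHA / N_TESTS


def is_novel_candidate(
    p_value: float,
    max_r2_with_index: float,
    nearest_index_distance_bp: int,
) -> bool:
    """Check the three stated criteria for a novel association.

    max_r2_with_index: largest r^2 between the SNP and any reported index SNP.
    nearest_index_distance_bp: distance to the closest reported index SNP.
    """
    return (
        p_value <= METABOCHIP_ALPHA
        and max_r2_with_index < 0.2
        and nearest_index_distance_bp > 500_000
    )
```

A SNP that is highly significant but lies 400 kb from a reported index SNP would therefore be treated as part of the known locus rather than as a novel finding.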

Statistical analysis
All analyses were adjusted for age, sex, smoking status (current versus former/never), BMI, and ancestry principal components (PCs) in each study. Some studies also adjusted for center when applicable.
Each study performed race/ethnic specific analyses to test the association between continuous fasting glucose or natural log-transformed fasting insulin levels with genotypes or imputed dosages assuming an additive mode of inheritance. For studies of unrelated individuals, we applied multiple linear regression including age, sex, center site (as applicable), smoking status (current versus former/never), continuous BMI, and ancestry PCs (number varied by study) as model covariates.
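For unrelated individuals, the per-SNP model above is an ordinary least squares regression of the trait on allele dosage (additive coding, 0 to 2) plus covariates. The following dependency-free sketch illustrates the calculation; it is not the software used by the studies, the function names are hypothetical, and the p value uses a normal approximation to the t distribution, which is an assumption that is reasonable only for large samples.

```python
import math
from typing import List, Sequence, Tuple


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def snp_association(
    y: Sequence[float],
    dosage: Sequence[float],
    covariates: Sequence[Sequence[float]],
) -> Tuple[float, float, float]:
    """Return (beta, se, p) for dosage in y ~ 1 + dosage + covariates."""
    n = len(y)
    x = [[1.0, float(dosage[i])] + [float(c) for c in covariates[i]]
         for i in range(n)]
    k = len(x[0])
    xtx = [[sum(r[a] * r[b] for r in x) for b in range(k)] for a in range(k)]
    xty = [sum(r[a] * y[i] for i, r in enumerate(x)) for a in range(k)]
    coef = solve(xtx, xty)
    rss = sum((y[i] - sum(c * v for c, v in zip(coef, x[i]))) ** 2
              for i in range(n))
    sigma2 = rss / (n - k)                       # residual variance
    unit = [1.0 if j == 1 else 0.0 for j in range(k)]
    var_beta = sigma2 * solve(xtx, unit)[1]      # [(X'X)^-1]_11 * sigma^2
    se = math.sqrt(var_beta)
    z = coef[1] / se
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided, normal approx.
    return coef[1], se, p
```

Each row of `covariates` would hold that participant's age, sex, center indicators, smoking status, BMI, and ancestry PCs; the returned beta and SE are the quantities carried forward to meta-analysis.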
As in previous studies, primary analyses adjusted for BMI because it is a major risk factor for type 2 diabetes and is correlated with glycaemic traits. For sensitivity analysis, all models were also run without BMI as a covariate. Adjustment for smoking was decided a priori given the racial/ethnic differences in smoking patterns and incident type 2 diabetes in the US, as well as the association of cotinine with glycaemic-related traits in non-diabetic individuals [10]. HCHS/SOL includes approximately 2000 related individuals, and a complex sampling design was used for recruitment. Therefore, in this study we employed the W-PS method (http://dlin.web.unc.edu/software/SUGEN/) of Lin et al., which is a weighted version of generalised estimating equations, to account for unequal inclusion probabilities and family relationships [11].
We combined SNP effect estimates and their standard errors across studies for Hispanic/Latino (H/L), African American (AA), Native American/American Indian (NA/AI) and Asian (ASN) participants by inverse-variance weighted fixed-effects meta-analysis using METAL [12]. Results from these ethnic/race-specific meta-analyses are presented. Quantile-quantile plots for the Metabochip-wide analysis of fasting glucose and fasting insulin are shown in ESM Figure 4. Consistency of effect sizes between studies and races/ethnicities was assessed using the Q test (chi-squared p value) and the I² metric, where a low I² suggests little between-study/race-ethnicity variability. A two-stage conventional fixed-effects analysis approach was chosen to enable comparison of effects between races/ethnicities and because this strategy has been shown to provide well-controlled type 1 error [13,14]. Furthermore, I² metrics indicated little heterogeneity between races/ethnicities for the significant transethnic results. Results for all analyses were reported as betas and standard errors (SE).
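The inverse-variance weighted fixed-effects estimate and the heterogeneity statistics mentioned above follow standard formulas, sketched here for a single SNP. This is a minimal illustration rather than METAL's implementation; the function name is hypothetical. The Q statistic would be compared against a chi-squared distribution with k - 1 degrees of freedom, where k is the number of studies; that p value is not computed here.

```python
import math
from typing import Sequence, Tuple


def fixed_effects_meta(
    betas: Sequence[float], ses: Sequence[float]
) -> Tuple[float, float, float, float]:
    """Return (pooled beta, pooled SE, Cochran's Q, I^2 in percent)."""
    weights = [1.0 / se ** 2 for se in ses]        # inverse-variance weights
    wsum = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / wsum
    se = math.sqrt(1.0 / wsum)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: share of variability beyond chance, truncated at zero.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return beta, se, q, i2
```

For two studies with identical effect estimates, Q is zero and I² is 0%, matching the interpretation that a low I² indicates little between-study variability.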

Functional Annotation
We interrogated each of the fine-mapped loci to determine whether the identified non-coding variants were positioned within regulatory regions such as enhancers, promoters, insulators and/or silencers, which can potentially modulate transcript levels and thereby explain the underlying biology of the glucose/insulin association. To identify variants that overlapped putative regulatory elements, we aligned a custom track containing a list of correlated variants (r² ≥ 0.2 in 1000 Genomes Project AMR and AFR) at each locus with several relevant tracks. Bedfiles were obtained using http://raggr.usc.edu/. Given that our traits of interest were glycaemic traits, we particularly utilised DNase I hypersensitivity, histone modification and transcription factor occupancy data assayed in tissues and cell types derived from pancreatic, pancreatic islet, brain or other metabolically relevant tissues highlighted by expression patterns of strong candidate genes. By integrating the signal from several epigenomics tracks in the aforementioned tissues/cell types, a genomic element was annotated to be a putative enhancer if it showed enrichment for signal in DNASe  ESM Table 19. The epigenomics datasets were sourced from publicly available projects including ENCODE, Roadmap Epigenomics, and FANTOM5. Furthermore, to identify the motifs disrupted by putative risk alleles, we utilised Haploreg (v5) and the JASPAR motif database. To query GTEx eQTL data we utilised Haploreg (v5).
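Checking whether a variant falls inside an annotated regulatory element is an interval-overlap test. The sketch below illustrates this for BED-style intervals (0-based start, exclusive end) and 1-based variant positions; the `Element` type, the function name, and any coordinates used with it are hypothetical and are not taken from the study's annotation tracks.

```python
from typing import List, NamedTuple, Sequence


class Element(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (BED convention)
    end: int     # 0-based, exclusive (BED convention)
    label: str   # e.g. an enhancer or promoter annotation


def overlapping_labels(
    chrom: str, pos_1based: int, elements: Sequence[Element]
) -> List[str]:
    """Return labels of all elements that contain the variant position."""
    pos0 = pos_1based - 1    # convert the 1-based SNP position to 0-based
    return [
        e.label
        for e in elements
        if e.chrom == chrom and e.start <= pos0 < e.end
    ]
```

Applying this to every correlated variant at a locus gives the list of putative regulatory elements that the LD set overlaps; for genome-scale tracks, a sorted or indexed interval structure would be a more efficient choice than this linear scan.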
In addition to the use of epigenomic data, we also included summary functional scores from in silico prediction algorithms including RegulomeDB, the Combined Annotation Dependent Depletion   Table 7. In the PROX1 locus (a), rs10494973 overlaps an intronic pancreatic enhancer and a binding site for FOXA2, which is an essential activator of genes governing insulin secretion. In the G6PC2 locus (b), rs560887 is a known functional splicing variant and bioinformatic annotation supports this variant as the strongest functional candidate. In the ADCY5 locus (c), nine LD SNPs overlap four pancreatic enhancers and were shown to be associated with ADCY5 and SEC22A expression in multiple tissues. In the DGKB locus (d), seven SNPs in LD with the lead SNP, rs62448618, overlap a strong intergenic pancreatic islet enhancer. In the 7p13 locus (e), seven variants in LD with the lead SNP are positioned in four pancreatic islet enhancer/promoter regions and were also eQTLs for GCK in thyroid tissue. In the GLIS3 locus (f), the lead SNP, rs10974438, is associated with expression of GLIS3 in brain tissue and is in LD with three SNPs positioned in two pancreatic islet enhancers. In the FADS3-11q12.2 region (g), the lead SNP, rs174547, was in LD with six variants spanning three enhancer regions. In the GCKR region (h), the lead SNP, rs1260326, is a missense variant in exon 15 of GCKR that is predicted by SPANR to disrupt splicing and is associated with expression of GCKR in brain tissue. Additionally, 13 variants in LD with the lead SNP span six predicted enhancers. In the IGF1 locus (i), rs10860845 tagged two putative functional variants positioned within two different pancreatic islet intronic enhancers. The lead SNP in the MTNR1B locus (j), rs10830963, falls within brain and pancreatic islet/pancreas enhancers in the intronic region of MTNR1B that also bind the transcription factors FOXA1 and FOXA2.
In the SLC2A2 locus (k), SNPs in LD with the lead SNP, rs1604038, are associated with SLC2A2 and EIF5A2 expression in brain tissue and overlap seven pancreatic islet/pancreas enhancers; one LD SNP was a missense variant predicted by SPANR to misregulate splicing of SLC2A2.