Background

Methylation plays a major role in gene regulation through epigenetic modifications at specific cytosine-phosphate-guanine (CpG) residues within the regulatory regions of genes and, consequently, may influence transcriptional activity [1]. In brief, methylation occurs when a methyl group is transferred to the DNA via a family of DNA methyltransferases. The majority of DNA methylation occurs on cytosines that immediately precede a guanine nucleotide (ie, a CpG site). These CpG sites occur frequently throughout the genome and have been linked to both single-nucleotide polymorphisms (SNPs) and epigenetic changes [2]. In particular, DNA methylation may influence gene activity differently depending on the surrounding genetic sequence [3]. Because SNPs near the CpG site may alter methylation levels, the statistical interaction between SNPs and CpG sites may explain varying gene expression across individuals. Prior research shows that DNA methylation in the interleukin-4 receptor is associated with asthma, but this association is further explained by the presence or absence of a nearby SNP [4]. A study focusing on obesity found that CpG sites in an enhancer region interact with CpG-creating SNPs in an obesity-risk haplotype, which helps explain obesity and type 2 diabetes [5].

As part of GAW20, we were provided access to a data set of methylation, SNPs, and triglyceride levels over 2 time periods, along with numerous related covariates. In particular, the study measured triglyceride levels before and after pharmaceutical intervention. Given the well-known relationship between triglycerides and many different cardiometabolic diseases, including cardiovascular disease [6], we chose to look for evidence of methylation at CpG sites that potentially modulate the impact of nearby SNPs on changes in triglyceride levels.

Methods

Sample population and variables

The sample consisted of 670 individuals from a pedigree sample provided as part of GAW20 for whom all analyzed variables were available. We considered 6 covariates (age, observation center, smoking status, mass spectrometry DX client [MSDX] International Diabetes Federation [IDF] score, fasting time at baseline, and high-density lipoprotein [HDL] at baseline), the majority of which were significantly associated with baseline triglyceride (TG) in this sample. The primary response variable of interest was TG level at baseline (visit 1 or 2). For variables with up to 2 measurements at baseline (HDL [baseline], TG [baseline]), we used the average value if both measurements were available, or the single available measurement otherwise.

Models

The modeling process was done in 2 stages. The first-stage model resulted in a single residual TG value for each person, while the second stage resulted in approximately 700,000 models (one for each SNP that passed standard genome-wide association study [GWAS] quality control [QC] criteria: Hardy-Weinberg equilibrium p value > 1 × 10−6, minor allele frequency > 1%, SNP missing data rate < 5%).
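The three QC criteria above can be sketched for a single SNP as follows. This is a minimal illustration, not the pipeline used in the study: the function name `snp_qc`, the 0/1/2 coding with NaN for missing calls, and the use of a 1-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium are all assumptions (the text does not say which HWE test was applied).

```python
import math
import numpy as np

def snp_qc(genotypes, hwe_min_p=1e-6, maf_min=0.01, missing_max=0.05):
    """Apply the three QC thresholds to one SNP coded 0/1/2 (NaN = missing).

    Hypothetical helper; returns (passes, stats).
    """
    g = np.asarray(genotypes, dtype=float)
    missing_rate = float(np.mean(np.isnan(g)))
    obs = g[~np.isnan(g)]
    n = obs.size
    if n == 0:
        return False, {"maf": 0.0, "missing_rate": missing_rate, "hwe_p": 1.0}

    counts = np.array([np.sum(obs == k) for k in (0, 1, 2)], dtype=float)
    q = float(counts[1] + 2 * counts[2]) / (2 * n)  # frequency of the counted allele
    maf = min(q, 1 - q)

    if maf == 0:
        hwe_p = 1.0  # monomorphic SNP: the HWE test is undefined, MAF filter removes it
    else:
        p = 1 - q
        expected = np.array([p * p, 2 * p * q, q * q]) * n
        chi2 = float(np.sum((counts - expected) ** 2 / expected))
        # Upper tail of a chi-square distribution with 1 df (assumed test)
        hwe_p = math.erfc(math.sqrt(chi2 / 2))

    passes = hwe_p > hwe_min_p and maf > maf_min and missing_rate < missing_max
    return passes, {"maf": maf, "missing_rate": missing_rate, "hwe_p": hwe_p}
```

Applying such a filter to every genotyped SNP and keeping those that pass would yield the set of SNPs carried into the second stage.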

In the first stage, we used the lmekin function from the coxme package in R [7] to model natural-log (ln)-transformed baseline TG levels [y = ln(baseline)]. In cases where 2 separate TG measurements were available for the baseline, we ln-transformed each measurement before averaging. Baseline ln-transformed TG levels were predicted from the 6 covariates listed earlier, and familial relationships were accounted for through the kinship matrix. We then saved the resulting “residual” value (\( {r}_i={\widehat{y}}_i-{y}_i \)) for each of the i = 1,…, 670 individuals in our analysis.
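The first-stage bookkeeping can be sketched as below. This is a simplified stand-in, not the lmekin model: it uses ordinary least squares and therefore omits the kinship random effect that the study included. The function names and the NaN-for-missing convention are assumptions made for illustration; the sign of the residual follows the definition given in the text (fitted minus observed).

```python
import numpy as np

def baseline_ln_tg(tg_visit1, tg_visit2):
    """ln-transform each TG measurement, then average whichever are available.

    Missing measurements are assumed to be coded as NaN.
    """
    logs = np.log(np.column_stack([tg_visit1, tg_visit2]).astype(float))
    return np.nanmean(logs, axis=1)

def stage1_residuals(y, covariates):
    """Residuals r_i = y_hat_i - y_i from a covariate-only regression.

    OLS stand-in: the kinship random effect fitted by lmekin is NOT modeled here.
    """
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(covariates, dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    return y_hat - y  # sign convention as stated in the text
```

Averaging after the ln transform gives the ln of the geometric mean of the two measurements, which is generally not equal to the ln of their arithmetic mean.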

The second stage predicted the residuals (\( {r}_i \)) from stage 1 based on the number of minor alleles (\( {SNP}_j=0,1,2 \)) and methylation scores (\( {CPG}_j\in \left[0,1\right] \)), with a separate model for each \( {SNP}_j \), \( {CPG}_j \) pair, using the lm function in R [7]. In particular, the second-stage model for the \( {SNP}_j \), \( {CPG}_j \) pair was:

$$ {r}_j={\beta}_{S_j}{SNP}_j+{\beta}_{C_j}{CPG}_j+{\beta}_{S_j{C}_j}{SNP}_j{CPG}_j $$
(1)

where \( {\beta}_{S_j{C}_j} \) is the estimated interaction effect between \( {SNP}_j \) and \( {CPG}_j \). The \( {SNP}_j \), \( {CPG}_j \) pairs were formed by assigning each SNP passing QC to its nearest CpG site, resulting in approximately 700,000 pairs, with some CpG sites assigned to multiple SNPs.
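The nearest-site pairing can be sketched as follows for SNPs and CpG sites on a single chromosome. The function name `nearest_cpg`, the use of base-pair positions as the distance measure, and the tie-breaking rule (the lower-position site wins) are assumptions for illustration; the text does not specify them.

```python
import numpy as np

def nearest_cpg(snp_pos, cpg_pos):
    """For each SNP position, return the index into cpg_pos of the nearest CpG site.

    Positions are base-pair coordinates on one chromosome (assumed).
    Ties go to the CpG site with the lower position (assumed).
    """
    snp_pos = np.asarray(snp_pos, dtype=np.int64)
    cpg_pos = np.asarray(cpg_pos, dtype=np.int64)
    order = np.argsort(cpg_pos)          # allow unsorted CpG positions
    sorted_pos = cpg_pos[order]

    right = np.searchsorted(sorted_pos, snp_pos)       # first CpG at or above the SNP
    left = np.clip(right - 1, 0, len(sorted_pos) - 1)  # neighbor below
    right = np.clip(right, 0, len(sorted_pos) - 1)

    pick_right = np.abs(sorted_pos[right] - snp_pos) < np.abs(snp_pos - sorted_pos[left])
    return order[np.where(pick_right, right, left)]
```

Because each SNP independently selects its nearest site, several SNPs can map to the same CpG index, which matches the many-to-one pairing described above.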

Statistical analysis

Statistical significance of the interaction term in Eq. 1 was assessed using an F test, which tests whether including the statistical interaction provides significantly more evidence of association with changes in TG levels than a model with only main-effects terms. Versions of Eq. 1 without the interaction term were also run. We started with the generally accepted, conservative genome-wide significance level of 5 × 10−8. We followed up this analysis by using a more liberal and exploratory significance level of 1 × 10−4 in our genome-wide interaction analysis.
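The nested-model comparison behind this F test can be sketched as below for one SNP-CpG pair. This is an illustration under stated assumptions, not the study's code: both models include an intercept (the default behavior of lm in R, although Eq. 1 as written does not show one), and the function names are hypothetical.

```python
import numpy as np

def _rss(X, y):
    """Residual sum of squares from a least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def interaction_f(r, snp, cpg):
    """F statistic comparing the main-effects model with the model adding SNP x CpG.

    Returns (f_stat, df_numerator, df_denominator). Intercept included by assumption.
    """
    r, snp, cpg = (np.asarray(a, dtype=float) for a in (r, snp, cpg))
    ones = np.ones_like(r)
    X_main = np.column_stack([ones, snp, cpg])
    X_full = np.column_stack([ones, snp, cpg, snp * cpg])

    rss_main, rss_full = _rss(X_main, r), _rss(X_full, r)
    df_full = len(r) - X_full.shape[1]
    # One extra parameter in the full model, so the numerator has 1 df
    f_stat = (rss_main - rss_full) / (rss_full / df_full)
    return f_stat, 1, df_full

# The p value is the upper tail of the F(1, df_full) distribution, for example
# scipy.stats.f.sf(f_stat, 1, df_full) if SciPy is available.
```

With a single added term, this F statistic equals the square of the t statistic for the interaction coefficient in the full model, so the two tests give the same p value.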

We followed this genome-wide analysis with a candidate gene study focusing on 18 gene regions (containing 423 unique SNP-CpG sites) that have been shown to be associated with TGs in previous genome-wide association studies, identified via searches at http://www.ebi.ac.uk/gwas. Throughout the candidate gene analysis, we used a significance level of 0.05. As part of the candidate gene analysis, we also collapsed all the CpG sites within each gene region (50 kb on either side of the gene) by using 5 different methods (mean, minimum, maximum, median, and sum of squares of the CpG values as the CPG value in the model) to evaluate the potential impact of different ways of summarizing methylation evidence for each SNP. For the SNPs that demonstrated a significant interaction for more than one of the collapsing methods used, we then looked at the interactions between all CpG sites within the region and those SNPs.
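The five collapsing methods can be sketched as follows. The function names, the array layout (individuals in rows, CpG sites in columns), and the reading of "sum-squares" as the uncentered sum of squared methylation values are assumptions for illustration.

```python
import numpy as np

def in_region(cpg_pos, gene_start, gene_end, flank=50_000):
    """Boolean mask of CpG sites within `flank` bp on either side of the gene."""
    cpg_pos = np.asarray(cpg_pos)
    return (cpg_pos >= gene_start - flank) & (cpg_pos <= gene_end + flank)

def collapse_cpg(region_cpg):
    """Summarize a region's methylation per individual in five ways.

    region_cpg: array of shape (n_individuals, n_cpg_sites), values in [0, 1].
    """
    m = np.asarray(region_cpg, dtype=float)
    return {
        "mean": m.mean(axis=1),
        "minimum": m.min(axis=1),
        "maximum": m.max(axis=1),
        "median": np.median(m, axis=1),
        "sum_squares": (m ** 2).sum(axis=1),  # uncentered sum of squares (assumed)
    }
```

Each of the five summaries would then replace the CPG term in Eq. 1, giving one interaction test per SNP and collapsing method.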

Results

Genome-wide approach

No interaction term p values were significant when using the conservative 5 × 10−8 threshold. However, 58 SNP-CpG pairs showed significant interactions using the more liberal 1 × 10−4 significance level. Table 1 summarizes 25 loci that include regions of SNPs that are colocalized and within genes (total of 44 interactions). The median p value of the interaction term across all sites was 0.504, and the lambda value was 1.02, showing no inflation of test statistics.
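For readers unfamiliar with lambda, one common definition of the genomic inflation factor is sketched below: the median observed test statistic, expressed on the 1-degree-of-freedom chi-square scale, divided by its expected median under the null. The text does not state how its lambda was computed, so this particular formula and the function name are assumptions, and the sketch is not expected to reproduce the reported value.

```python
from statistics import NormalDist

import numpy as np

def genomic_lambda(p_values):
    """Genomic inflation factor from p values (chi-square 1-df convention, assumed).

    Converts the median p value to a chi-square quantile and divides by the
    null median of a 1-df chi-square (about 0.455).
    """
    nd = NormalDist()
    med_p = float(np.median(p_values))
    observed = nd.inv_cdf(1 - med_p / 2) ** 2   # chi-square quantile at the median p
    expected = nd.inv_cdf(0.75) ** 2            # null median of chi-square with 1 df
    return observed / expected
```

Under this definition, values near 1 indicate that the bulk of the test statistics behave as expected under the null, while values well above 1 suggest systematic inflation.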

Table 1 Summary of 25 loci with significant interactions between SNP and CpG site at the 1 × 10−4 significance level

Candidate gene approach

In our data, there are 18 genes (containing 423 SNPs for which data were available) previously shown to be associated with TG levels. Table 2 summarizes the results of fitting Eq. 1 with an interaction term, as well as a version of Eq. 1 without the interaction term.

Table 2 Summary of 18 genes with previous evidence of association with triglyceride levels

Thirteen of the 18 candidate genes show at least modest (p < 0.05) evidence of statistical interaction between nearby methylation values and SNPs within the gene. The most significant SNP is in FADS3 (rs1675102) and has a minor allele frequency of 0.28. The interaction is such that additional copies of the minor allele lead to a decreased impact of methylation on changes in TG levels.
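To make the direction of such an interaction concrete, Eq. 1 can be regrouped so that the methylation slope depends on genotype. This is only a restatement of Eq. 1, not an additional result:

$$ {r}_j={\beta}_{S_j}{SNP}_j+\left({\beta}_{C_j}+{\beta}_{S_j{C}_j}{SNP}_j\right){CPG}_j $$

The slope of the residual with respect to methylation is therefore \( {\beta}_{C_j} \) for 0 copies of the minor allele, \( {\beta}_{C_j}+{\beta}_{S_j{C}_j} \) for 1 copy, and \( {\beta}_{C_j}+2{\beta}_{S_j{C}_j} \) for 2 copies. A decreasing impact of methylation with each additional copy corresponds to \( {\beta}_{S_j{C}_j} \) having the opposite sign to \( {\beta}_{C_j} \), provided the slope does not change sign across genotypes.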

Table 3 shows the results of collapsing all the CpG sites within each gene region through the minimum method, which uses the minimum CpG value of all CpG sites within 50 kb of the gene. The minimum method resulted in more significant interactions (44) than did the other 4 collapsing methods, which on average yielded only 23 significant interactions (detailed results not shown).

Table 3 Summary of CpG results after collapsing using the minimum method

We identified 176 unique SNPs in significant interactions for more than 1 of the 5 different CpG collapsing methods as found in Table 4. In total, there are 176 unique significant SNP × CpG interactions. GALNT2 had the largest number of significant results with 69 interactions, where 1 of the 69 interactions is the most significant with a p value of 0.000142. The SNP in this interaction (rs6677241) has a minor allele frequency of 0.026. The interaction results in an increased impact of methylation on TG levels for every additional allele.

Table 4 Summary of 176 interaction pairs

Discussion

Although no significant SNP-CpG interactions were identified at the strict genome-wide significance threshold (5 × 10−8), a more exploratory threshold (1 × 10−4) identified many genes previously shown to be associated with cardiometabolic traits. A candidate gene approach, using a significance level of 0.05, identified loci in 13 genes with modest evidence for SNP-CpG interactions on baseline TG levels. Furthermore, by using the collapsing methods, we were able to identify potentially interesting SNPs for additional exploration. Using only these SNPs, our examination of all CpG sites within each gene region resulted in 176 significant unique SNP-CpG pairs. In every case, the SNP-CpG interaction p value was smaller than both the SNP and CpG p values from the noninteraction model. This suggests that using SNP-CpG pairs may identify SNPs that would not be identified by traditional GWAS techniques. GALNT2 had the largest number of significant interactions (69). SNPs in GALNT2 were previously identified as associated with TG levels and with high- and low-density lipoprotein cholesterol [8]. One study shows that promoter methylation of GALNT2 is associated with a higher risk of coronary heart disease [9].

There are some limitations to our analysis. First, to manage computational resources, we began by predicting baseline TG levels from kinship and covariates, yielding residuals for each individual, which we then used to assess the impact of methylation and genetic variation. Alternatives to this 2-stage methodology may exist. Second, we used an exploratory significance threshold for the genome-wide analysis that is more liberal than that of the vast majority of GWAS-type analyses published today. Although this can lead to more false-positive results, we did find a number of “subthreshold” loci of potential interest, suggesting the need for studies with larger sample sizes and more sensitive statistical methods to draw out these loci. The minimum method of summarizing methylation in the region near a gene showed promise, although further work is needed to more fully evaluate the many options. Regardless, leveraging prior biological evidence (eg, via the candidate gene approach) may be beneficial when testing for SNP-CpG interactions.

Conclusions

Even with “subthreshold” significance, our results support the need for statistical models that leverage prior biological information. Our study shows that a mediated effect of SNPs on methylation is a possible explanation for changes in TG levels. With this knowledge, studies with larger sample sizes, as well as wet-lab experiments, can be performed to confirm the relationship. As we learn more about the effect an individual’s genotype has on their health, there is greater opportunity for personalized medicine to be an effective treatment strategy.