Background

Genetic variants, such as single-nucleotide polymorphisms (SNPs), have been found to influence risk for human diseases. Recent studies show that epigenetics affect SNPs in genes and subsequently influence the gene function and disease trait [1]. Epigenetic mechanisms consist of DNA methylation, histone modifications, and noncoding RNAs, all of which represent the patterns of chemical and structural modifications to DNA [2]. There are an increasing number of laboratory experiments that provide evidence of DNA methylation and gene expression regulation [3,4,5]. Only a few studies, however, have evaluated the genome–epigenome interactions through statistical means, which may potentially provide novel findings for the joint effects of SNPs and cytosine-phosphate-guanine (CpG) sites [6,7,8,9]. The search for SNP-CpG epistasis is usually conducted through multistage or integrated analyses, where the genome and methylation data are first analyzed separately and the results then combined [10, 11]. Some studies apply existing interaction-effect methods, such as regressions, to perform the joint analysis of methylation and genome data. The advantages of the W-test method are data set adaptive probability distributions and robustness for complicated genetic architectures, such as moderate data sparsity and population stratifications [12]. By applying the W-test to gene–methylation data directly, epistasis can be measured without a preselection of biomarkers, while also relying less on significant main effects for detecting important CpG–SNP interactions. The GAW20 provided an opportunity to study methylation and genome-wide association study (GWAS) data from participants who have undergone lipid control treatment. The W-test was applied in the detection of gene–methylation interactions, resulting in interesting findings with biological implications.

Methods

GAW20 experimental and simulated data sets

GAW20 provided the study data. The study participants were patients with diabetes who had undergone lipid-control treatments with the drug fenofibrate and were recruited from the Genetics of Lipid Lowering Drugs and Diet Network (GOLDN) clinical trial project. The analyzed data sets consisted of a simulated and experimental data sets. The triglyceride (TG) levels were collected at 4 clinical visits, with 2 measurements before treatment and 2 measurements after treatment. Age, sex, smoking status, and location were recorded. Genome-wide association data were sequenced with the Affymetrix Genome-wide Human SNP array 6.0, and DNA methylation profiling was performed with the Illumina Infinium Human Methylation 450 K Bead Chip Array, using the buffy coat harvest from blood samples collected at the second and fourth visits. In the simulated data, the phenotype of the simulated data set was generated using experimental genetic data under a hypothetical model [13]. The TG levels were generated from 5 SNPs with major effects and 5 CpG sites in physical proximity. A set of 5 SNP-CpG pairs with relatively high heritability but not related to TG levels was given as noise for testing the statistical methods. The simulated data contained 680 subjects after excluding individuals with missing phenotypes. For simulated data, the 84th replicate was used as suggested by GAW20. In the experimental data set, a total of 523 participants had complete genomic and clinical measurements. Participants with missing values were removed during the quality-control process, resulting in a remaining sample size of 476. The method was applied to chromosome 11 of the experimental data.

Defining drug response

The TG levels can be used as a measure of drug response. Because common clinical standards regard a 30% decrease in TG levels as an effective control of lipids [14], we adopted the same criteria in this study. First, the average pretreatment TG levels (TG_pre) were calculated by averaging the measurements from the first and second visits. The average posttreatment TG levels (TG_post) were calculated by averaging the measurements from the third and fourth visits. Next, a percentage of change was calculated as: ΔTG% = (TG_pre–TG_post)/TG_pre. If the percentage of change was greater than 30%, then the drug treatment was defined as effective; if less than 30%, treatment was defined as ineffective. The effectiveness of the drug response was the outcome variable for both the simulated and experimental data.

The epistasis measure: The W-test

The W-test measures the probability distributional differences for a set of biomarkers between the 2 groups of participants such as the 2 drug-response groups [12]. Under an additive genetic model, a SNP variable can be coded into 3 levels with the counts of the minor alleles. The quantitative CpG variable can be divided into high and low methylation levels by two-mean clustering. A SNP-CpG pair can form a genetic combination of 6 categories. The empirical distributions are compared through a sum of the square of the log odds ratio by:

$$ W=h\sum \limits_{i=1}^k{\left[\log \frac{{\widehat{p}}_{1i}/\left(1-{\widehat{p}}_{1i}\right)}{{\widehat{p}}_{0i}/\left(1-{\widehat{p}}_{0i}\right)}/{SE}_i\right]}^2\sim {\chi}_f^2 $$
(1)

where \( {\widehat{p}}_{1i} \) and \( {\widehat{p}}_{0i} \) are the proportion of cases and controls in the ith category out of total cases or controls, respectively. SEi is the standard error of the log of odds ratios. The test statistics follows a chi-squared distribution with f degrees of freedom. Two parameters, h and f, are estimated using large-sample approximation by drawing smaller bootstrap samples under a null hypothesis. Consequently, the testing distribution is robust for complicated genetic architectures, as it adaptively adjusts to the data structure of the working data [12]. For detecting the cis-regulation patterns in the gene-methylation data, the SNPs and CpG sites located within a 10-kb genome distance on chromosome 11 were evaluated exhaustively [1].

Two types of logistic-regression models were applied as accompanying benchmarks to the W-test. The first logistic-regression model, LR-m1, considered the CpG site as a binary variable like the W-test, and the second logistic-regression model, LR-m2, included the CpG sites as a continuous variable using the original methylation values. Both logistic-regression models incorporated the main and interaction effects of SNP and CpG sites. In short, we denote:

  • LR-m1: Y = SNP + CpG + SNP × CpG, where CpG is a binary variable;

  • LR-m2: Y = SNP + CpG + SNP × CpG, where CpG is a continuous variable.

The type I error rate is an average false-positive proportion using a permuted phenotype on a pair of gene–methylation markers in 2000 replicates. A total of 140,501 epistatic pairs were tested, and a Bonferroni correction resulted in a significance level of 3.56E-7 at a family-wise error rate of 5%.

Results

Performance of the W-test with simulated data

In the simulated data set, the W-test, LR-m1, and LR-m2 were applied to the given causal and noise pairs. Table 1 displays the p values obtained from alternative methods. Generally, the W-test gave smaller p values than LR-m1 in most answer pairs, and also had comparable p values to LR-m2. The top 3 answer pairs were all identified to be significant by the 3 methods. The W-test also found the fourth answer pair (cg00045910, rs10828412) with a p value = 0.0475, which was slightly smaller than the p values from the LR-m1 (p value = 0.0532) and LR-m2 (p value = 0.0597). The results suggested that the W-test could be sensitive to small signals with lower heritability. In terms of the performance for the noise pairs, all methods yielded noise p values greater than 0.05. The Type I error rate of the W-test was 2.95%, less than the family-wise error rate of 5%. Meanwhile, the Type I error rates of LR-m1 was 5.40% and of LR-m2 was 5.43%. The results showed that the W-test was able to distinguish between signal and noise in the simulated data set.

Table 1 p Values of 5 answers and 5 noises by the W-test and the logistic regression models LR-m1 and LR-m2 in simulated data

Computing time

Computing time was calculated on a laptop computer with a 1.6 GHz chipset and 4 GB of random access memory using 2000 replicates on 1 pair of markers. The W-test was 4 times faster than logistic regression on a general laptop (2.28 s by the W-test, 10.12 s by the LR-m1, and 9.37 s by the LR-m2).

Identification of gene–methylation interaction in experimental data

The W-test was applied to test the gene–methylation interactions for GAW20 experimental data on chromosome 11. No significant interaction pair passed the Bonferroni correction significance level of 3.56E-07 (Table 2). We checked the functions of the top 15 identified epistatic pairs and found interesting biological implications. The top 3 SNP-CpG pairs all resided in the gene MPPED2 (11p14.1; p value = 8E-06), which encoded the protein metallophosphoesterase and was reported to be related to neuronal function [15]. Previous GWAS studies and biomedical experiments reported that MPPED2 was associated with chronic kidney disease, and knockdown of this gene in zebrafish embryos suggested a role for it in renal function [16]. GUCY2E was ranked fourth and has been reported to function in the central nervous system and retinal [17, 18]. NAV2 (ranked 6th; p value = 1.78E-05) is a neuron navigator that induces neurite outgrowth for all-trans retinoic acid, and plays an essential role in the development of the cranial nerve and the regulation of blood pressure in humans [19]. ZBTB16 at 11q23.2 (ranked 15th; p value = 7.04E-05) also has been reported as an inhibitor of neurite outgrowth in the adult central nervous system [20]. Other genes in the top 15 identified pairs include TRIM5, TRIM6-TRIM34, and TRIM3 (smallest p value = 5.22E-05), which were highly correlated with TG levels in mice [21]. The quantile–quantile (Q-Q) plot of the gene–methylation tests showed no inflation in spurious relations for the experimental data (Fig. 1).

Table 2 Top 15 gene–methylation pairs identified by the W-test in experimental dataa
Fig. 1
figure 1

Q-Q plot of gene–methylation interaction using experimental data

Discussion and conclusions

There has been increasing evidence for the contribution of epigenetics in regulating gene expressions implicated in diseases. Previous studies were mainly focused on experimentally studying gene–methylation interactions. In this study, we demonstrated that the W-test can be used as an effective method to identify the epistatic interactions between SNPs and CpG sites in the GAW20 simulated and experimental data sets. One common obstacle in the analysis of epistasis in the genome and epigenome is the large number of pairwise tests, the volume of which is determined by the size of the cis-regulatory region. Existing methods solve the challenge by using stage-wise and integrated analyses, in which the SNPs are separately selected and then the epistatic interactions with CpG sites are jointly evaluated in regression-based approaches [10, 11]. The stage-wise analysis may potentially miss the markers that have weak main effects but strong epistasis effects. Previous studies also made a linear assumption about the relationship between the epistatic pairs and a transformed form of the response variable, while having the advantages of covariate and population structure control. Some nonparametric methods, such as the Mann-Whitney U-test, have been applied for the analysis of methylation data [22]. However, these nonparametric tests cannot handle the potential complicated genetic architectures such as sparse data or population stratification. The W-test has the advantage of being model-free and does not assume any form of interaction effect. It also follows a chi-squared distribution in which the degrees of freedom is estimated from the working data by bootstrapped sampling. In this way, the W-test is able to correct potential bias of the probability distribution caused by complicated data structures. This method is very efficient such that it can be applied directly on SNP-CpG evaluations without prior filtering with the main effect.

Application of this method on the experimental data from patients who had undergone treatment for managing TG levels via fenofibrate identified genes that played roles in renal function, the central nervous system, and retinal functions. The enriched signals found in neuronal-related genes suggest that the blood lipid levels could be related to the neurological dysfunction in the brain, which is the most cholesterol-rich organ in the body. By performing an epistatic evaluation between SNPs and CpG sites, we identified MPPED2, GUCY2E, NAV2, and ZBTB16 as associated with hyperlipidemia. Among these 4 genes, MPPED2 was the most significant; it plays a role in neural development, and genetic variations in this gene are reported to be related to migraines, a common disease of the neural system disease [23]. Furthermore, mutations of CUCY2E are reported to be related to retinal disorders [24, 25]. ZBTB16 encodes a protein that is highly expressed in the brain, and polymorphisms in this gene are used as a marker for attention deficit hyperactivity disorder, a neuropsychiatric condition [26]. It is intriguing to note that the enriched signaling in neuronal and retinal genes are identified through epistasis evaluation between SNPs and CpG sites, but not through separate analysis of the main effect in those data sets. This shines light on the importance of integrated analysis of omics data: considering multiple facets or measurement of a common object may improve the chance of catching the underlining signal. Further studies on these threads are necessary to discover the underlying biological mechanism.