Background

The etiology of complex disorders involves an interplay of polygenic biomarkers, lifestyle and environmental factors [1]. Robust and efficient statistical tools are needed to perform interaction analysis on high-volume genomic data. Besides SNP-SNP interactions, the analysis of interactions between SNPs and cytosine-phosphate-guanine (CpG) sites might provide novel insight into the regulatory mechanisms of DNA methylation and gene expression underlying complex diseases.

Here we introduce a software package that provides estimates for different types of genetic associations, including main effects, second or higher order interactions, and gene-methylation interactions. The package is built upon the W-test [2] to perform epistasis testing. The statistic compares the distributions of a set of biomarkers in cases and controls, and follows a chi-squared distribution with data-set adaptive degrees of freedom. The method has the advantage of correcting p-value bias caused by complicated genetic architectures. Flexible implementation options are provided. The package can calculate SNP-CpG epistasis for biomarkers located in physical proximity in the input genome and epigenome data. A number of post-test diagnostic, visualization and statistical genetic analysis functions are also provided. This is the first statistical software providing functions for direct gene-methylation interaction and high-order interaction evaluation in genome and epigenome datasets.

Implementation

Design

The wtest package uses the W-test [2] to measure the association between a binary phenotype and categorical genetic data. To test the association of a marker subset, a k by 2 contingency table is formed, where k is the number of non-empty category combinations formed by the SNP-set, and 2 is the number of phenotype classes. The statistic tests whether the distribution of the subset in the case group differs from that in the control group, and takes the following form:

$$ W=h\sum_{i=1}^{k}\left[\log\frac{\hat{p}_{1i}/(1-\hat{p}_{1i})}{\hat{p}_{0i}/(1-\hat{p}_{0i})}\Big/ SE_{i}\right]^{2} \sim\chi_{f}^{2} \tag{1} $$

where \(n_{1i}\) and \(n_{0i}\) are the numbers of cases and controls in the ith cell of the contingency table; \(N_{1}\) and \(N_{0}\) are the total numbers of cases and controls; \(\hat {p}_{1i} = n_{1i}/N_{1}\) and \(\hat {p}_{0i} = n_{0i}/N_{0}\) are the conditional cell probabilities of the ith cell of the contingency table; and \(SE_{i}\) is the standard error of the ith log odds ratio. The W-test follows a chi-squared distribution with f degrees of freedom. The scalar h and the degrees of freedom f are expressed in terms of the covariance matrix of the log odds ratios, and are estimated from bootstrapped samples under the null hypothesis using large sample theory. The W-test therefore has data-set adaptive degrees of freedom that absorb the genetic variation not attributable to the phenotype, which makes it robust to complicated genetic architectures. In this software, we further extend the test to evaluate high-order interaction effects and gene-methylation interaction effects. For gene-methylation interaction, methylation data are clustered into two categories, high and low methylation levels, by a two-means clustering algorithm. We also use a novel triangular network diagram to display interaction effects up to the third order. Extensive simulation studies testing the power and type I error of the W-test can be found in Wang, Sun et al. (2016) [2] and Sun et al. (2017) [3].
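As an illustration of Eq. (1), the following minimal Python sketch computes W for a single k by 2 table. It is not the package's implementation: the function name `w_statistic` is hypothetical, h is passed in explicitly, and the standard error is assumed to be the usual large-sample (Woolf) standard error of a log odds ratio.

```python
import math


def w_statistic(cases, controls, h):
    """W of Eq. (1) for a k-by-2 table.

    cases[i] and controls[i] are the cell counts n_1i and n_0i.
    h is the scalar of Eq. (1), supplied by the caller.
    Assumption: SE_i is the Woolf standard error of the i-th log odds ratio.
    """
    if len(cases) != len(controls):
        raise ValueError("cases and controls must have the same length")
    n1_total = sum(cases)      # N_1
    n0_total = sum(controls)   # N_0
    total = 0.0
    for n1, n0 in zip(cases, controls):
        if n1 <= 0 or n0 <= 0 or n1 >= n1_total or n0 >= n0_total:
            raise ValueError("every cell must be non-empty in both groups")
        p1 = n1 / n1_total
        p0 = n0 / n0_total
        log_or = math.log((p1 / (1 - p1)) / (p0 / (1 - p0)))
        se = math.sqrt(1 / n1 + 1 / (n1_total - n1)
                       + 1 / n0 + 1 / (n0_total - n0))
        total += (log_or / se) ** 2
    return h * total


# Hypothetical genotype counts (AA, Aa, aa) in cases and controls.
w_three = w_statistic([120, 90, 40], [150, 80, 20], h=2 / 3)

# For a 2-by-2 table the two terms of the sum are equal, so h = 1/2
# gives the squared z-score of the single log odds ratio.
w_two = w_statistic([30, 70], [20, 80], h=0.5)
```

The p-value is then the upper tail probability of a chi-squared distribution with f degrees of freedom evaluated at W.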

Implementation

Figure 1 shows the major functions in the package and illustrates the implementation step by step using the example data in the package. The implementation is performed in two steps: (1) estimation of parameters h and f; (2) testing by the W-test.

Step 1. Estimation of parameters h and f. For genotype data, the hf() function is called; for combined genotype and methylation data, the hf.snps.meth() function is called. Parameter h is the scalar in Eq. (1) and f is the degrees of freedom of the chi-squared distribution of the W-test. The two parameters are estimated from B bootstrap samples with permuted phenotypes (the null hypothesis). Simulations suggest that the estimation converges at B>400 when the number of variables is 1000 and the number of subjects is 1000 (Additional file 1). If step 1 is not performed, the p-value of the W-test is calculated with the default values h=(k−1)/k and f=k−1, where k is the number of categorical combinations formed by the marker set. When k=2, the two terms of the sum in Eq. (1) are equal, so the default h=1/2 recovers the squared standardized log odds ratio and the W-test is equivalent to the odds ratio test for a 2-by-2 table.
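The idea behind step 1 can be sketched in Python as follows. This is a simplified stand-in for hf(), not the package's estimator: it assumes that h and f are obtained by matching the mean and variance of the unscaled statistic under permuted phenotypes to those of a chi-squared distribution (h·mean = f and h²·variance = 2f), it handles a single marker with a fixed k, and the function names, the 0.5 continuity correction and the simulated data are all hypothetical.

```python
import math
import random
import statistics
from collections import Counter


def unscaled_w(genotypes, phenotypes):
    """Sum of squared standardised log odds ratios (Eq. (1) without h)."""
    n1 = Counter(g for g, y in zip(genotypes, phenotypes) if y == 1)
    n0 = Counter(g for g, y in zip(genotypes, phenotypes) if y == 0)
    cells = sorted(set(n1) | set(n0))  # non-empty categories
    if len(cells) < 2:
        raise ValueError("need at least two non-empty categories")
    # Assumption: add 0.5 to every cell if a category is empty in one group.
    shift = 0.5 if any(n1[c] == 0 or n0[c] == 0 for c in cells) else 0.0
    a = [n1[c] + shift for c in cells]
    b = [n0[c] + shift for c in cells]
    total_a, total_b = sum(a), sum(b)
    total = 0.0
    for x, y in zip(a, b):
        log_or = math.log(x / (total_a - x)) - math.log(y / (total_b - y))
        se = math.sqrt(1 / x + 1 / (total_a - x) + 1 / y + 1 / (total_b - y))
        total += (log_or / se) ** 2
    return total


def estimate_hf(genotypes, phenotypes, B=200, n_sample=None, seed=0):
    """Moment-matching estimate of (h, f) from B null bootstrap samples."""
    rng = random.Random(seed)
    n = len(genotypes)
    n_sample = n_sample or n
    null_stats = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n_sample)]  # resample subjects
        g = [genotypes[i] for i in idx]
        y = [phenotypes[i] for i in idx]
        rng.shuffle(y)  # permuting the phenotype enforces the null hypothesis
        null_stats.append(unscaled_w(g, y))
    mean = statistics.mean(null_stats)
    var = statistics.variance(null_stats)
    return 2 * mean / var, 2 * mean ** 2 / var


# Simulated example: one SNP with three genotypes, 300 cases and 300 controls.
data_rng = random.Random(1)
genotypes = [data_rng.choices([0, 1, 2], weights=[0.49, 0.42, 0.09])[0]
             for _ in range(600)]
phenotypes = [1] * 300 + [0] * 300
h, f = estimate_hf(genotypes, phenotypes, B=300)
```

For a marker with k=3 categories, the estimates are expected to lie in the neighbourhood of the defaults h=(k−1)/k and f=k−1, with deviations reflecting the structure of the data.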

Fig. 1 Integrated genetic epistasis testing and functions

Step 2. Testing by the W-test. The wtest() function evaluates main effects and second order interactions, and wtest.high() evaluates third or higher order interactions in genotype data. The wtest.snps.meth() function calculates SNP-CpG interactions for genome and epigenome data. Users are often interested in exploring the interactions among biomarkers that show a certain level of main effect signal. The input.pval option can be used to screen candidate SNPs according to their p-values to form interaction sets, while the output.pval option conveniently restricts the output to interaction sets reaching a p-value threshold. In wtest.snps.meth(), the positions of the biomarkers are input alongside the genome and epigenome data sets, and the window size used to define cis-regulation relationships can be specified. The methylation.recode() function transforms the methylation data into high and low methylation levels. For high order interaction calculation, a simple check for sample size can be done by estimating the average cell count formed by a set; a high order is feasible if this number is at least two. A reference table with suggested sample sizes for various orders of interaction can be found in Additional file 2.
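The two preprocessing ideas used for SNP-CpG testing, recoding methylation levels into two classes and pairing markers within a window, can be sketched in Python as below. This is an illustration only and does not reproduce methylation.recode() or wtest.snps.meth(): the function names, the marker names and positions, and the rule that a pair qualifies when the two markers lie on the same chromosome at a distance of at most the window size are all assumptions.

```python
def two_means_recode(values, max_iter=100):
    """Recode 1-D methylation levels as low (0) or high (1) by two-means."""
    lo, hi = min(values), max(values)  # initial cluster centres
    if lo == hi:
        return [0] * len(values)
    for _ in range(max_iter):
        cut = (lo + hi) / 2  # boundary between the two centres
        low = [v for v in values if v <= cut]
        high = [v for v in values if v > cut]
        new_lo = sum(low) / len(low)
        new_hi = sum(high) / len(high)
        if (new_lo, new_hi) == (lo, hi):
            break  # centres no longer move
        lo, hi = new_lo, new_hi
    cut = (lo + hi) / 2
    return [1 if v > cut else 0 for v in values]


def cis_pairs(snp_pos, cpg_pos, window_size):
    """SNP-CpG pairs on the same chromosome within window_size base pairs."""
    pairs = []
    for snp, (snp_chr, snp_bp) in snp_pos.items():
        for cpg, (cpg_chr, cpg_bp) in cpg_pos.items():
            if snp_chr == cpg_chr and abs(snp_bp - cpg_bp) <= window_size:
                pairs.append((snp, cpg))
    return pairs


# Hypothetical methylation levels of one CpG site in six subjects.
levels = [0.05, 0.10, 0.12, 0.80, 0.85, 0.90]
recoded = two_means_recode(levels)

# Hypothetical marker positions as (chromosome, base-pair position).
snps = {"snp1": (6, 1_000_000), "snp2": (6, 2_000_000)}
cpgs = {"cg1": (6, 1_004_000), "cg2": (6, 1_020_000), "cg3": (7, 1_000_500)}
pairs = cis_pairs(snps, cpgs, window_size=10_000)
```

Each retained pair combines a three-level genotype with a two-level methylation class, so its contingency table has at most k=6 rows.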

Diagnostic checking of the test statistic distribution can be performed by w.diagnosis(), which plots histograms of the W-test statistics from the observed data together with the curve of the chi-squared distribution under the estimated parameters, indexed by the number of categorical combinations k. Close overlap of the two densities indicates a good fit of the estimated parameters. An example is shown in the real data application section. The w.qqplot() function assists in diagnosing the probability distribution and the degree of population stratification.
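The comparison that the diagnostic plot makes by eye can also be expressed numerically. The Python sketch below is not part of the package: it bins a set of W statistics for one value of k, evaluates the chi-squared density with f degrees of freedom at the bin midpoints, and reports the largest gap between the two. The function names, the bin width and the simulated statistics are hypothetical.

```python
import math
import random


def chi2_pdf(x, f):
    """Density of a chi-squared distribution with f (possibly non-integer) df."""
    if x <= 0:
        return 0.0
    log_pdf = ((f / 2 - 1) * math.log(x) - x / 2
               - (f / 2) * math.log(2) - math.lgamma(f / 2))
    return math.exp(log_pdf)


def histogram_density(values, bin_width):
    """Map each bin midpoint to the empirical density in that bin."""
    counts = {}
    for v in values:
        b = int(v // bin_width)
        counts[b] = counts.get(b, 0) + 1
    n = len(values)
    return {b * bin_width + bin_width / 2: c / (n * bin_width)
            for b, c in sorted(counts.items())}


def max_density_gap(w_values, f, bin_width=0.5):
    """Largest absolute gap between histogram and chi-squared density."""
    density = histogram_density(w_values, bin_width)
    return max(abs(d - chi2_pdf(mid, f)) for mid, d in density.items())


# Simulated statistics that follow a chi-squared distribution with 2 df.
rng = random.Random(0)
w_values = [-2.0 * math.log(1.0 - rng.random()) for _ in range(20000)]
gap = max_density_gap(w_values, f=2.0)
```

A small gap corresponds to the close overlap of histogram and curve described above, whereas a large gap suggests that h and f should be re-estimated.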

Results

Real data example

The software has been applied in a number of real data analyses with novel biomarker findings and interesting implications [2–9]. Here we demonstrate its usage with two data sets: a genotype dataset for bipolar disorder from the Genetic Association Information Network (GAIN) project, and a gene-methylation dataset for lipid control treatment.

Application I. GAIN bipolar disorder dataset

This dataset contains 653 bipolar disorder patients and 1767 healthy controls, genotyped at 46,181 SNPs on chromosome 6 [10]. The estimates of h and f can be found in Additional file 3. For second order interaction (order=2), setting input.pval=0.001 and output.pval=0.001 makes the function output second order epistasis marker pairs with p-value <0.001. Figure 2 is the diagnostic plot for this estimation, produced by the w.diagnosis() function. The estimated chi-squared curve (red) closely follows the histogram of the test statistics calculated from the observed data, indicating that the parameters are well estimated.
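The screening logic of the two thresholds can be illustrated with a short Python sketch. It is not the package's code: the function names, SNP names and p-values are hypothetical, and a strict inequality is assumed for both thresholds.

```python
from itertools import combinations


def screen_and_pair(main_pvals, input_pval):
    """Keep SNPs whose main-effect p-value is below input_pval, then pair them."""
    kept = [snp for snp, p in main_pvals.items() if p < input_pval]
    return list(combinations(kept, 2))


def filter_output(pair_pvals, output_pval):
    """Return pairs with interaction p-value below output_pval, smallest first."""
    hits = [(pair, p) for pair, p in pair_pvals.items() if p < output_pval]
    return sorted(hits, key=lambda item: item[1])


# Hypothetical main-effect p-values.
main_pvals = {"snpA": 2e-4, "snpB": 0.03, "snpC": 9e-4, "snpD": 5e-5}
candidate_pairs = screen_and_pair(main_pvals, input_pval=0.001)

# Hypothetical interaction p-values for the candidate pairs.
pair_pvals = {("snpA", "snpC"): 4e-6,
              ("snpA", "snpD"): 0.2,
              ("snpC", "snpD"): 7e-4}
reported = filter_output(pair_pvals, output_pval=0.001)
```

Screening reduces the number of pairs to be tested from all pairs of SNPs to the pairs formed by the retained SNPs, which is what makes pairwise testing on large marker panels tractable.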

Fig. 2 Diagnostic plot by w.diagnosis(). At each combination size k, the estimated chi-squared curve (red) closely follows the histogram of the W-test statistics calculated from the observed data, indicating that the parameters are well estimated

Data analysis identified one SNP with a significant main effect: rs2495982 near GRM4, p-value = \(2.06\times 10^{-7}\). GRM4 encodes a metabotropic receptor for glutamate, the major excitatory neurotransmitter in the central nervous system, and it is a susceptibility gene for bipolar disorder and schizophrenia [11, 12]. For interaction effects, a number of SNP sets surpassed the Bonferroni corrected significance level. The top SNPs identified at different orders of interaction are listed in Additional file 4, and the interaction network up to the third order is plotted as a triangular network in Fig. 3. Each colored triangle in the network indicates a significant third order interaction, and each bold edge shows a significant second order interaction. It can be seen from the plot that the strongest interaction is formed by the gene set (SYNE1, BTBD9, RPL12P2) in the middle of the plot, in which BTBD9 plays a key role and extends to form significant combinations with FGD2 and CDKAL1. BTBD9 is reported to be associated with neuropsychiatric disorders such as restless legs syndrome in schizophrenia and Tourette syndrome [13, 14]. The gene encodes a BTB/POZ domain-containing protein that is involved in protein-protein interactions [15], and is highly expressed in brain tissues [16]. It is very encouraging to discover this gene, which has a known physical protein interaction function, from a purely computational and statistical perspective.
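For orientation, the scale of a Bonferroni correction on this dataset can be worked out in a few lines of Python. The numbers of tests used here are assumptions rather than values reported above: a family-wise level of 0.05, one main-effect test per SNP, and, as an upper bound, one second order test for every pair of SNPs (screening with input.pval would test far fewer pairs).

```python
import math

ALPHA = 0.05     # assumed family-wise significance level
N_SNPS = 46181   # SNPs on chromosome 6 in the GAIN dataset

# Main effects: one test per SNP.
main_threshold = ALPHA / N_SNPS

# Second order: upper bound on the number of tests if every pair is tested.
n_pairs = math.comb(N_SNPS, 2)
pair_threshold = ALPHA / n_pairs

# Reported main-effect p-value of rs2495982.
p_rs2495982 = 2.06e-7
passes_main = p_rs2495982 < main_threshold
```

Under these assumptions the reported main-effect p-value of rs2495982 falls below the main-effect threshold.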

Fig. 3 Triangular network for third order genetic interactions

Application II. Gene-methylation interaction analysis for lipid control data

This application was originally reported in Sun et al. (2018) [3]. The data set contains 476 diabetic patients who had undergone lipid control treatment, and 150,000 candidate SNP-CpG pairs within a 10 kb genomic distance (window.size=10,000). The phenotype is whether or not a subject responded to the treatment, determined by comparing the triglyceride levels before and after treatment [3]. The parameters h and f are estimated by hf.snps.meth(), and the gene-methylation interactions are calculated by wtest.snps.meth(). Table 1 summarizes the top 5 markers identified by gene-methylation interaction association. The identified genes are involved in neuronal and retinal functions, and include MPPED2 [17] and GUCY2E [18].

Table 1 Gene-methylation interaction in lipid control data

Performance

The speed of the wtest package was evaluated on a laptop computer with a 1.6 GHz Intel Core i5 processor and 4 GB RAM. Simulated data were used to compare the speed of different methods. On a data set consisting of 5000 subjects and 100 SNPs, with B=200 and n.sample=1000, estimating h and f took 40.5 s. After h and f were calculated, or when the default values were assumed, evaluating main effects took 0.04 s and evaluating second order interactions took 1.69 s. In the same environment, existing interaction tests took 36.41 s (chi-squared test) and 130.56 s (logistic regression). On real data, the genome-wide main effect calculation on 5000 subjects and 500,000 SNPs took around 5 min, and the second order interaction calculation on 8000 SNPs took around 3.5 h.
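The two second order timings can be put on a common scale by converting them to pairs evaluated per second. The Python sketch below assumes that every pair of SNPs was tested in both runs, which is not stated explicitly above; the function name is hypothetical.

```python
import math


def pairs_per_second(n_snps, seconds):
    """Throughput if all n_snps-choose-2 pairs are evaluated in `seconds`."""
    return math.comb(n_snps, 2) / seconds


# Simulated data: 100 SNPs, 1.69 s for second order interactions.
simulated_rate = pairs_per_second(100, 1.69)

# Real data: 8000 SNPs, around 3.5 h for second order interactions.
real_rate = pairs_per_second(8000, 3.5 * 3600)
```

Under this assumption both runs correspond to a few thousand pairs per second, so the reported running time grows roughly in proportion to the number of pairs, that is, quadratically in the number of SNPs.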

Conclusions

Genetic epistasis testing is important for making sense of massive genomic data, and it also provides a way to explore the relationship between diseases and various types of biomarkers. This package offers an integrated toolset to analyse the association of genetic signals at all levels, from main effects and high order interactions to gene-methylation interactions. The software is available on CRAN at https://CRAN.R-project.org/package=wtest under the GPL-2.0 license.

Availability and requirements

Project name: wtest

Project home page: https://CRAN.R-project.org/package=wtest

Operating systems: Platform independent

Programming language: R (>= 3.1), C++

License: GPL (>= 2)

Restrictions to use by non-academics: None