Background

Over the last decade, Genome-wide Association Studies (GWAS) have identified genetic loci associated for a variety of diseases [1]-[5]. Most studies aim to identify single nucleotide polymorphisms (SNPs) individually using univariate analysis methods [6]. Although current GWAS analyses have provided valuable insights into genetic risks of common diseases, the genetic variants identified by GWAS generally only account for a small proportion of the total heritability of complex diseases, illustrating a problem commonly referred to as “missing heritability” [7],[8]. Potential explanations for this problem include the underestimation of the effects of alleles identified, the existence of gene-gene joint effects, the contribution of rare variation, the possibility that inherited epigenetic factors lead to correlated phenotypes between relatives, and the possible overestimation of heritability of the complex traits [7],[9],[10]. Many diseases or phenotypes are likely caused by or associated with multiple SNPs, each having small effects individually, but collectively contributing a more significant genetic effect [11]. Therefore, multi-locus SNP models would offer one appealing solution in capturing the underlying genotypic-phenotypic relationship [12]-[14].

A typical GWAS study measures thousands or millions of SNPs, but the number of subjects is usually much smaller. This is known as the P > > N problem [15],[16]. One solution to this problem resorts to dimension reduction by identification of the optimal subset of predictors associated with the outcome variable. Determining the best model or selecting a subset of variables becomes an important statistical task for this method. Bayesian variable selection (BVS) provides a natural approach to solve this problem [17],[18]. Unlike penalized regression approaches, BVS naturally produces easily interpretable measures of confidence for variable selection, i.e., posterior selection probabilities. This is an appealing advantage in GWAS because the primary goal of the analysis is to identify the joint effect of SNPs, and to utilize these identifications to learn about underlying biology. BVS has been successfully applied to GWAS data; see the discussion by Guan [19]. The flexibility of the BVS approach allows for straightforward extensions to analyze both quantitative and qualitative data [20]-[23]. Alternative techniques such as the single-SNP test, Stepwise regression, and LASSO (Least Absolute Shrinkage and Selection Operator) were developed to address this statistical challenge. LASSO is a regression method that involves penalizing the absolute size of the regression coefficients, and Stepwise is a classic scheme for sequentially adding to or removing variables from the model. Many studies show that BVS has better performance than these alternative methods in other contexts [19],[23]-[25].

Diseases with complex inheritance may be influenced by multiple genes that interplay within genetic networks or pathways [26],[27]. Gene products interact with one another and work collaboratively within interconnected pathways explaining or associating with certain diseases. This idea led to the concept of network-based molecular biomarkers. Stingo et al. [28] proposed a Bayesian modeling strategy that addressed this concept by incorporating biological information, which was based on the structure or topology of regulatory gene-gene networks in the analysis of DNA microarray data. The method was further generalized into a 2-step framework, STS (screening, then selection) by our research team [29] where standard methods were applied in the screening step to identify a set of candidate genes which were further explored in the selection step using the BVS strategy. In addition to these two coherent steps, our strategy involves the mapping of genotype data to gene-gene networks constructed from various sources such as protein-protein interaction networks [30],[31]. We call this strategy of Bayesian biomarker discovery “integrated BVS” or iBVS. A partial least squares (PLS) g-prior for regression coefficients is also incorporated to solve the problem of non-positive deterministic covariance matrix when the sample size is smaller than the number of genes selected.

In this paper, we develop the strategy of iBVS for analyzing high dimensional GWAS data sets. The strategy is built upon a three-level hierarchical model as seen in Figure 1, where at the top level the PLS method is used to summarize the joint effect of selected SNPs within each gene. In the middle level, Markov Random Field (MRF) is employed to model the selection of genes in prediction of association with disease status. A focus of this article is on discovering SNPs within specific genes incorporating gene network information in GWAS under case–control design. Identification and prediction performance of this iBVS approach are then compared with those of LASSO and Stepwise selection strategies through simulation studies. We then apply iBVS to a GWAS data set for the prediction of leprosy, a skin disease, among Han Chinese.

Figure 1
figure 1

Hierarchical model structure and relationships among the stochastic nodes.

Methods

We denote Y = (Y 1, ⋯ Y n ) ' as independently observed binary outcomes in a GWAS data set, where n is the number of subjects and Y i = 1 0 if i t h subject has target disease otherwise . Each outcome is associated with a set of predictor variables, which correspond to the genotype data. We denote x ijk as the genotype of the k th SNP of the j th gene on the i th subject, for i = 1, ⋯, n, j = 1, ⋯, J, k = 1, ⋯, P j , where P j is the number of SNPs mapped to the j th gene, and P= j = 1 J P j denotes the total number of SNPs in the GWAS data set. Let A and a be the major and minor alleles at a SNP. The genotype may be coded according to different types of genetic effects: additive with 0, 1, 2 coding for the genotypes AA, Aa/aA, aa; dominant (with respect to the minor allele) with 0, 1, 1 coding for AA, Aa/aA, aa; Recessive (with respect to the minor allele) with 0, 0, 1 coding for AA, Aa/aA, aa.

iBVS with hierarchical modeling for GWAS data

Figure 1 shows the proposed three-level hierarchical model structure. iBVS for binary phenotypes is accomplished by introducing the latent variable vector Z to the linear regression model. Each component Z i  ~ N(0, 1) is defined such that

Y i = 0 , 1 , if Z i 0 if Z i > 0 .

In order to select genetic variants in both gene and SNP level simultaneously, we introduce two binary vectors, ξ = (ξ 1, ⋯, ξ J ) and γ = (γ 1, ⋯, γ P ), to indicate the selection of genes and SNPs respectively into a model for predicting Z i , i. e,

ξ j = 1 0 if j t h gene is selected otherwise j = 1 , , J

and

γ p = 1 0 if p t h S N P is selected otherwise p = 1 , , P .

For GWAS data with iBVS analysis, we propose the following hierarchical model,

Z i = T ξ , γ β ξ , γ i + ε i , ε i ~N 0 , 1 ,

where β ξ , γ = β J 1 , γ , , β J ξ , γ , T ξ , γ = T J 1 , γ , , T J ξ , γ , ξ denotes the number of selected genes in predicting Z i , T J l , γ denotes the vector of the first PLS component of X J l , γ , and J l (l = 1, ⋯, |ξ|) indexes the selected gene. Note that X J l , γ is a sub-matrix of X, consisting of only the columns that correspond to selected SNPs in the selected genes.

Prior specification

The indicator γ for SNP selection is assumed to follow an independent Bernoulli prior distributions with the same parameter π over all the γ i values.

p γ = p = 1 P π γ p 1 π 1 γ p 0π1

Choice of π reflects a user’s prior belief in terms of the numbers of causal SNPs out of P candidates.

Zellner’s g-prior is commonly used for the regression coefficient parameters β [32]. Yang and Song [33] generalized the g-prior by modifying the matrix inverse to the Moore-Penrose generalized matrix inverse. Since the multicollinearity problem is commonly encountered in GWAS data because of strong linkage disequilibrium between SNPs, we took a similar prior as that of Yang and Song,

β ξ , γ |ξ,γ~N 0 , c T ξ , γ T ξ , γ + ,

where T ξ , γ T ξ , γ + denotes the Moore-Penrose generalized inverse of T ξ , γ T ξ , γ . This idea was first adopted by our research team for microarray gene expression data [29].

To take into account the pathway membership information for each gene as well as the biological relationships between genes within pathways, we follow Li and Zhang [34] and Stingo et al. [28] to use an MRF to describe the prior distribution on each component of the gene selection indicator, i.e.,

p ξ j | ξ i , i N b j exp ξ j μ + η i N b j ξ i

where μ and η are tuning parameters, and Nb j is the set of neighbors of gene j within the selected pathway. The corresponding multivariate form is given by:

P ξ | μ , η exp μ 1 J ξ + η ξ R ξ ,

where 1 J is the vector consisting of J 1’s. We denote matrix R to reflect gene-gene network topological structure, where elements R ij  = 1 if there is a direct edge between the i th and j th genes, and R ij  = 0 otherwise.

Posterior distributions

The joint posterior distribution of θ = (γ, ξ, β (ξ,γ), Z) given (Y, X) is

P γ , ξ , β ξ , γ , Z | Y , X i = 1 N I A i ×exp Z T ξ , γ β ξ , γ Z T ξ , γ β ξ , γ 2 ×exp β ξ , γ T ξ , γ T ξ , γ β ξ , γ 2 c i = 1 m ξ λ i 1 2 × p = 1 P π p γ p 1 π p 1 γ p ×exp μ 1 J ξ + η ξ R ξ

where I(A i ) is the indicator function and A i is either equal to {Z i  : Z i  > 0} or {Z i  : Z i  ≤ 0} corresponding to Y i  = 1 or Y i  = 0, respectively, and λ 1 , λ m ξ m ξ J are the nonzero eigenvalues of T ξ , γ T ξ , γ + .

Since β is a nuisance parameter, we can integrate it out to obtain the joint posterior distribution of (Z, ξ, γ|Y, X) as follows:

P Z , ξ , γ | Y , X i = 1 n I A i × 1 Σ ξ , γ 1 / 2 exp Z Σ ξ , γ 1 Z 2 × p = 1 P π p γ p 1 π p 1 γ p ×exp μ 1 J ξ + η ξ R ξ

with Σ ξ , γ 1 =c T ξ , γ T ξ , γ T ξ , γ + T ξ , γ + I n . From this posterior joint distribution, we can derive the posterior conditional distributions

P Z | ξ , γ , Y , T N 0 , Σ ξ , γ i = 1 n I A i ,

which is a multivariate truncated normal distribution, and

P ξ , γ | T , Z N 0 , Σ ξ , γ × p = 1 P π p γ p 1 π p 1 γ p ×exp μ 1 J ξ + η ξ R ξ .

Posterior inference via MCMC sampling

Markov chain Monte Carlo (MCMC) sampling is used to generate samples for the posterior distribution of the model parameters. The MCMC sampling procedure consists of the following two steps:

  1. I.

    Sample ξ and γ from P(ξ, γ|Y, T, Z): the selection parameters (ξ, γ) are updated using a Metropolis-Hastings algorithm which is modified from Stingo’s method [28] (details of the MCMC moves to update (ξ, γ) are given in Additional file 1). The method consists of randomly picking one of the following moves: (1) change the inclusion status of SNP and gene by randomly choosing from adding or removing a gene and a SNP at the same time; (2) change the inclusion status of SNP only by randomly choosing from adding or removing a SNP.

  2. II.

    Sample Z from P(Z|Y, T, γ, ξ): directly sampling from this distribution is known to be difficult. In this article, we follow the method given in Devroye [35] to simulate samples from the univariate truncated normal distribution P(Z i |Z − i, Y, T, γ, ξ), where Z − i is the vector of Z without the i th element.

Simulation

Simulation studies were conducted to assess the performances of iBVS, LASSO regression, and Stepwise regression using a proprietary set of MatLab codes, an R package glmnet, and the R package lars. We simulated three scenarios of varying different proportion of variance of Z explained by the genetic factors, labeled as follows: (1) H70: genes with network, 70% explained variance; (2) H50: genes with network, 50% explained variance; and (3) H30: genes with network, 30% explained variance.

For each scenario, 50 sets of genotypes without disease status were simulated using software Hapgen2 [36] based on the genotype data from Hapmap project (http://hapmap.ncbi.nlm.nih.gov/). We subsequently generated phenotypes corresponding to each scenario, with 400 individuals and 300 SNPs assorted into 22 genes for each data set. The first 200 individuals were used to fit the iBVS hierarchical model and to evaluate the performance of identifying causal SNPs of the three methods, while the other 200 individuals were used to assess the prediction performance of each method. All the simulations were run under the additive and dominant genetic model respectively to check the model flexibility of the proposed iBVS.

We simulated sets of phenotypes in the following way: First, we specified 10 SNPs x j (j = 1 … 10) as causal SNPs, which were positioned within 5 genes. In order to add network information to gene levels, a network was simulated between the genes. We then conducted a precision matrix Ω which contains the network relationship between the genes. If there is an edge between p th gene and q th gene in the network, ω pq and ω qp would be assigned with a non-zero value, otherwise 0. The vector t i was generated from a multivariate normal distribution with zero mean vector and covariance matrix Σ = Ω − 1. Then the causal gene score T k (k = 1…5) was calculated by considering both genotype data and gene network information, T i k = S N P j gene k b j x k j + t i k , where b j ’s were carefully chosen to indicate different PLS configurations. We subsequently simulated the latent phenotypes score for each individual using z i  = ∑ϕ j T ik  + ε i , ε i  ~ N(0, 1). Finally, binary phenotypes Y i for each individual was generated using Y i = 1 0 if z i 0 if z i < 0 .

Application

We applied iBVS, LASSO and Stepwise approaches to analyze a GWAS data set designed to identify genetic variants associated with leprosy [37]. The genotype data consisted of 492,109 SNPs from 706 cases and 514 controls after removing genetically unmatched controls, to obviate the need for correction for population stratification. All subjects were Han Chinese from eastern China. In order to select variables and assess the performance of the three variable selection strategies, we randomly divided the data set into two parts: a training set and a testing set, each with 610 samples. The training set was used for SNP selection, while the testing set for validating the selection results and comparing the three methods. The genotype was first coded under the additive genetic model, and we re-coded the genotypes following dominant genetic components and re-analyzed this real data set to check the model flexibility under different genetic effects.

Results

Simulation studies

Performance of variable selection

The average area under the curve (AUC) was calculated to evaluate the performance of casual SNP identification in each scenario. For SSVS, the AUC is calculated using the following formula [38],[39]. AUC= 1 n D n C i D , j C I γ i > γ j , where D is the set of the causal SNPs and C is the set of the non-causal SNPs; n D and n C are the number of causal and non-causal SNPs, respectively.

For the LASSO method, a simple modification of the Least Angle Regression (LAR) algorithm was implemented that calculates the entire LASSO path, which is an efficient way of computing the solution to any LASSO problem especially when P ≫ N [40]. Using the modified LARS algorithm, one may generate all LASSO solutions corresponding to different values of the penalty parameter. Selecting the active model at a given iteration would give one LASSO solution corresponding to a particular value of the penalty parameter. Hence, one can control the penalty parameter using a cutoff for the number of iterations [38]. For LASSO and Stepwise approaches, the following formula was used to calculate the AUC: AUC= 1 n D n C i D , j C I s i < s j , where s is the iteration at which the i th marker enters the model [38].

Table 1 shows the averaged AUC of the three methods in variable selection under the additive genetic model. It can be seen that the AUC of the three methods all increase monotonically by the proportion of variance of Z explained by the genetic factors. Obviously, under scenario H70 and H50, the iBVS has the highest averaged AUC (0.911 and 0.894) followed by Lasso (0.891 and 0.882), while the AUC of Stepwise is relative low (0.869 and 0.853). The AUC of iBVS drops to 0.792 with low explained variance (H30), with LASSO and Stepwise both approximating 0.77. The results revealed that iBVS has superior performance compared with that of LASSO and Stepwise regression. Similar trends can also be found under the dominant genetic model (Additional file 1: Table S2).

Table 1 Average AUC values of iBVS, LASSO and Stepwise

Performance of outcome prediction

We subsequently assessed the prediction performance of the three methods using the remaining 200 individuals. Prediction performances were evaluated by considering correctly/incorrectly predicted positive/negative outcomes. We calculated specificity and sensitivity for thresholds from 0.01 to 0.99, with steps of 0.01. Then the ROC was plotted using mean specificity and sensitivity for a given threshold. For iBVS, we use a two-stage strategy. First the posterior probabilities of all the SNPs were estimated by iBVS. The top i SNPs (i = 1…300) were subsequently fitted into a PLS logistic regression model, and 10-fold cross-validation was conducted to choose the optimal number of predictors i. For LASSO and Stepwise approaches, the optimal model was determined by a 10-fold cross-validation.

Figure 2 demonstrates the ROC of the three methods in scenarios H70 and H50. One can see that the prediction performance of iBVS was slightly greater than that of LASSO, and had an obvious superiority to Stepwise regression.

Figure 2
figure 2

ROC curves of the three SNP selection strategies on the simulated data. Figure (a) depicts the ROC curves of the simulated data in scenario H70, and Figure (b) depicts the ROC curves of the simulated data in scenario H50.

Application

We first conducted screening of all SNPs, one by one by fitting the single-SNP logistic regression model with additive coding. By sorting on the p-values from the univariate analysis, we identified 100 genes, each containing at least one SNP with P-value smaller than 0.0004. We subsequently extracted all of the SNPs in each significant gene to comprise the joint effect of SNPs per one gene. The 100 genes selected above contained a total of 3,388 SNPs.

iBVS was applied using the above 3,388 SNPs. First, we constructed the R matrix using the KEGG database (details please see the Additional file 1). We subsequently specified prior distributions as described in the Methods Section, with hyper parameters set as: π = 0.01, μ = − 2, η = 0.8. Finally, the MCMC was conducted to make posterior inferences with 10,000 iterations as burn-in and 50,000 additional runs. Figure 3a shows the posterior SNP selection probabilities via our iBVS strategy with integrated biological priors.

Figure 3
figure 3

Posterior selection probabilities of SNPs and result of cross-validation in leprosy training Data. Figure (a) depicts the posterior SNP selection probabilities for the 3388 SNPs from the leprosy training data set. Figure (b) depicts the classification error in the conduct of cross-validation on the training data set using the PLS logistic regression model with different selection of top SNPs.

A ten-fold cross-validation approach was employed to set a cut-off for determining the optimal prediction model. The top i SNPs were added into a PLS logistic model one by one, to estimate the cross-validation error, shown in Figure 3b. It can be seen that the smallest classification error appeared when the top 94 SNPs were selected as predictors. The classification error stabilizes after 94 SNPs were selected into the PLS logistic model, with some slight increase shown. Therefore the top 94 SNPs were selected as ‘significant’ predictors whose information was listed in (see the Additional file 1: Table S1).

In addition, we ran LASSO and Stepwise regression on this leprosy GWAS dataset, with the optimal model determined by 10-fold cross-validation. The LASSO selected 100 SNPs, which only included 24 SNPs selected by iBVS. The Stepwise regression approach yielded a more parsimonious model with only 3 SNPs. Table 2 shows the detailed information of 24 common SNPs selected by both iBVS and Lasso. Specifically, the three SNPs selected by Stepwise also had high corresponding posterior probability in iBVS and relative large coefficient in LASSO, and SNP rs9270984 is most significant in all three methods. Finally, we assessed the ability of the three methods to correctly predict binary responses (case versus control) of the test data set. Figure 4a shows the ROC curves of the three selected models under the additive genetic model, while Figure 4b demonstrates that under the dominant model. This indicates that iBVS has comparable performance to the LASSO model, but has an performance advantage over the Stepwise regression method no matter what the genetic model is.

Table 2 Information of 24 common SNPs selected by both iBVS and Lasso
Figure 4
figure 4

ROC curves from leprosy testing data set under additive and dominant genetic model. Figure (a) depicts the ROC curves of testing data set with different SNP selection strategies (iBVS, LASSO, and Stepwise) under additive genetic model, and Figure (b) under dominant genetic model.

Discussion

GWAS analyses typically approach data as a list of single SNPs, a strategy which has yielded a catalog of susceptibility loci for complex diseases. However, the statistically significant variants detected so far account for only a small proportion of the total phenotypic variation. Gene-based tests for association are increasingly being seen as useful complements to GWASs, demonstrating several attractive features compared with traditional SNP-based analysis [12]-[14]. Beyond gene-based methods, there is an increasing recognition of the potential contributions of pathway-based analyses, in which variants in groups of genes within specific pathways are considered together to predict the phenotype [41]-[43]. In this paper, we followed an integrative biomarker identification scheme to conduct a novel hierarchical model incorporating a gene-gene network or pathway information for SNP identification in GWAS via the Bayesian inference paradigm.

Three scenarios of data sets were simulated, each considering different proportions of variance of outcome explained by genetic factors. Simulation results show our iBVS method outperformed the LASSO and Stepwise methods in identifying causal SNPs in each scenario (Table 1). In addition, we also evaluate the prediction ability of the three methods using additional data sets, and show that iBVS had advantageous performance (Figure 2). The advantages of the proposed iBVS strategy is verified when the real network is known and explicitly employed through a prior specification with the MRF distribution.

After applying iBVS to an actual GWAS dataset, a panel of 94 SNPs were selected as predictors of leprosy. The LASSO method selected 100 SNPs, which included only 24 SNPs in common with iBVS. The Stepwise regression yielded a very parsimonious model with only 3 SNPs. The results indicate that the iBVS method has comparable prediction performance with LASSO, and advantageous with Stepwise. Moreover, all the results are quite stable under the different genetic models. Stepwise regression searches the model space by adding or removing one SNP at a time and therefore the searching is partial, leading to convergence at a local optimum. The reason iBVS does not outperform LASSO may be due to deficiencies in pathway information from existing databases that do not reflect complete signaling pathways. The performance of iBVS can be improved by developing a stochastic inference of the gene-gene networks from the data and merging it into the current iBVS MCMC algorithm, which remains a future goal. Compared with LASSO and other penalized regression methods, which lack appropriate interpretation, iBVS has an additional advantage in that it produces posterior probabilities for each variable. This is a particularly important advantage in GWAS because the primary goal of the analysis is to identify the effect of SNPs. Comparing to metrics like p-values, posterior probabilities have clear interpretation. A reviewer of this article suggested that many important traits are generally quantitative and are often controlled by multiple genes in shared biology pathways; our model could be naturally extended to analyze quantitative data by removing the latent variable Z from the hierarchical model.

Efficiency of stochastic algorithms often diminishes as the total number of variables increases [19],[21]. It would be appealing to remove noisy data points or those with lower quality before using a stochastic search. Therefore, we first screened the total number of SNPs using a conventional SNP-based model to filter the number of SNPs included in the Bayesian hierarchical model. This set of candidate SNPs and their associated genes is called the ‘signature set’ in the sense that they are possibly signaling SNPs/genes (in other words, causal or marker SNPs/genes). We extracted all the SNPs in each significant gene to comprise the joint effect of a gene, leaving the weaker candidate genes out of the iBVS. The screening step should be viewed as a general step not only for dimension reduction but for constructing a functional context before conducting BVS.

A few issues regarding our model choices and computation should be highlighted. We mainly adopted the perspective of subjective Bayesian analysis due to the fact that we want to incorporate informative priors from relevant scientific sources. Choosing an objective prior that satisfies some fundamental principles as summarized in Bayarri et al. [44] would be theoretically appealing in future work. Another issue concerns computational burden. With a large number of parameters in the model, the inference is mainly based on Monte Carlo simulation, which may take a prolonged time. Running over a single computer with 3.3GH CPU computer and 8GB memory, 6 days were required to finish the leprosy data analysis. With the advent of high-speed cluster computers and the existence of cloud computing technologies, it is becoming increasingly feasible to apply full iBVS methods for biomarker identification.

Conclusions

We proposed an iBVS method to analyze high dimensional GWAS data sets based on a hierarchical model that incorporates an informative prior on networked gene interrelationships. Simulation studies showed that our iBVS method had better performance in both biomarker identification and disease prediction than LASSO and Stepwise models. A leprosy GWAS analysis showed iBVS method demonstrated a comparable performance with LASSO, and better than Stepwise methods. iBVS did not outperform LASSO, which may be due to deficiencies in existing signaling pathway databases that are likely to be improved as the knowledge base increases. In summary, the proposed iBVS strategy is a valid method for GWAS, having an additional advantage in the production of posterior probabilities for each variable that are again subject to continued refinement.

Additional file