# Integrative Bayesian variable selection with gene-based informative priors for genome-wide association studies

- 2.4k Downloads
- 4 Citations

**Part of the following topical collections:**

## Abstract

### Background

Genome-wide Association Studies (GWAS) are typically designed to identify phenotype-associated single nucleotide polymorphisms (SNPs) *individually* using univariate analysis methods. Though providing valuable insights into genetic risks of common diseases, the genetic variants identified by GWAS generally account for only a small proportion of the total heritability for complex diseases. To solve this “missing heritability” problem, we implemented a strategy called integrative Bayesian Variable Selection (iBVS), which is based on a hierarchical model that incorporates an informative prior by considering the gene interrelationship as a network. It was applied here to both simulated and real data sets.

### Results

Simulation studies indicated that the iBVS method was advantageous in its performance with highest AUC in both variable selection and outcome prediction, when compared to Stepwise and LASSO based strategies. In an analysis of a leprosy case–control study, iBVS selected 94 SNPs as predictors, while LASSO selected 100 SNPs. The Stepwise regression yielded a more parsimonious model with only 3 SNPs. The prediction results demonstrated that the iBVS method had comparable performance with that of LASSO, but better than Stepwise strategies.

### Conclusions

The proposed iBVS strategy is a novel and valid method for Genome-wide Association Studies, with the additional advantage in that it produces more interpretable posterior probabilities for each variable unlike LASSO and other penalized regression methods.

## Keywords

Biomarker discovery Bayesian hierarchical modeling Gene-based biomarkers Bayesian variable selection Integrative biomarker identification## Abbreviations

- iBVS
Integrative Bayesian variable selection

- LASSO
Least absolute shrinkage and selection operator

- STS
Screening, then selection

- AUC
The average area under the curve

## Background

Over the last decade, Genome-wide Association Studies (GWAS) have identified genetic loci associated for a variety of diseases [1]-[5]. Most studies aim to identify single nucleotide polymorphisms (SNPs) individually using univariate analysis methods [6]. Although current GWAS analyses have provided valuable insights into genetic risks of common diseases, the genetic variants identified by GWAS generally only account for a small proportion of the total heritability of complex diseases, illustrating a problem commonly referred to as “missing heritability” [7],[8]. Potential explanations for this problem include the underestimation of the effects of alleles identified, the existence of gene-gene joint effects, the contribution of rare variation, the possibility that inherited epigenetic factors lead to correlated phenotypes between relatives, and the possible overestimation of heritability of the complex traits [7],[9],[10]. Many diseases or phenotypes are likely caused by or associated with multiple SNPs, each having small effects individually, but collectively contributing a more significant genetic effect [11]. Therefore, multi-locus SNP models would offer one appealing solution in capturing the underlying genotypic-phenotypic relationship [12]-[14].

A typical GWAS study measures thousands or millions of SNPs, but the number of subjects is usually much smaller. This is known as the *P* > > *N* problem [15],[16]. One solution to this problem resorts to dimension reduction by identification of the optimal subset of predictors associated with the outcome variable. Determining the best model or selecting a subset of variables becomes an important statistical task for this method. Bayesian variable selection (BVS) provides a natural approach to solve this problem [17],[18]. Unlike penalized regression approaches, BVS naturally produces easily interpretable measures of confidence for variable selection, *i.e*., posterior selection probabilities. This is an appealing advantage in GWAS because the primary goal of the analysis is to identify the joint effect of SNPs, and to utilize these identifications to learn about underlying biology. BVS has been successfully applied to GWAS data; see the discussion by Guan [19]. The flexibility of the BVS approach allows for straightforward extensions to analyze both quantitative and qualitative data [20]-[23]. Alternative techniques such as the single-SNP test, Stepwise regression, and LASSO (Least Absolute Shrinkage and Selection Operator) were developed to address this statistical challenge. LASSO is a regression method that involves penalizing the absolute size of the regression coefficients, and Stepwise is a classic scheme for sequentially adding to or removing variables from the model. Many studies show that BVS has better performance than these alternative methods in other contexts [19],[23]-[25].

Diseases with complex inheritance may be influenced by multiple genes that interplay within genetic networks or pathways [26],[27]. Gene products interact with one another and work collaboratively within interconnected pathways explaining or associating with certain diseases. This idea led to the concept of network-based molecular biomarkers. Stingo et al. [28] proposed a Bayesian modeling strategy that addressed this concept by incorporating biological information, which was based on the structure or topology of regulatory gene-gene networks in the analysis of DNA microarray data. The method was further generalized into a 2-step framework, STS (screening, then selection) by our research team [29] where standard methods were applied in the screening step to identify a set of candidate genes which were further explored in the selection step using the BVS strategy. In addition to these two coherent steps, our strategy involves the mapping of genotype data to gene-gene networks constructed from various sources such as protein-protein interaction networks [30],[31]. We call this strategy of Bayesian biomarker discovery “integrated BVS” or iBVS. A partial least squares (PLS) g-prior for regression coefficients is also incorporated to solve the problem of non-positive deterministic covariance matrix when the sample size is smaller than the number of genes selected.

## Methods

We denote *Y* = (*Y* _{1}, ⋯ *Y* _{ n }) ' as independently observed binary outcomes in a GWAS data set, where *n* is the number of subjects and ${Y}_{i}=\left\{\begin{array}{c}\hfill 1\hfill \\ \hfill 0\hfill \end{array}\right.\begin{array}{c}\hfill \mathrm{if}\phantom{\rule{0.2em}{0ex}}i\mathrm{t}\mathrm{h}\phantom{\rule{0.2em}{0ex}}\mathrm{subject}\phantom{\rule{0.2em}{0ex}}\mathrm{has}\phantom{\rule{0.25em}{0ex}}\mathrm{target}\phantom{\rule{0.25em}{0ex}}\mathrm{disease}\phantom{\rule{0.1em}{0ex}}\hfill \\ \hfill \mathrm{otherwise}\hfill \end{array}$. Each outcome is associated with a set of predictor variables, which correspond to the genotype data. We denote *x* _{ ijk } as the genotype of the *k* ^{th} SNP of the *j* ^{th} gene on the *i* ^{th} subject, for *i* = 1, ⋯, *n*, *j* = 1, ⋯, *J*, *k* = 1, ⋯, *P* _{ j }, where *P* _{ j } is the number of SNPs mapped to the *j* ^{th} gene, and $P={\displaystyle \sum _{j=1}^{J}{P}_{j}}$ denotes the total number of SNPs in the GWAS data set. Let *A* and *a* be the major and minor alleles at a SNP. The genotype may be coded according to different types of genetic effects: additive with 0, 1, 2 coding for the genotypes *AA*, *Aa*/*aA*, *aa*; dominant (with respect to the minor allele) with 0, 1, 1 coding for *AA*, *Aa*/*aA*, *aa*; Recessive (with respect to the minor allele) with 0, 0, 1 coding for *AA*, *Aa*/*aA*, *aa*.

### iBVS with hierarchical modeling for GWAS data

*Z*to the linear regression model. Each component

*Z*

_{ i }~

*N*(0, 1) is defined such that

*ξ*= (

*ξ*

_{1}, ⋯,

*ξ*

_{ J }) and

*γ*= (

*γ*

_{1}, ⋯,

*γ*

_{ P }), to indicate the selection of genes and SNPs respectively into a model for predicting

*Z*

_{ i }, i. e,

where ${\beta}_{\left(\xi ,\gamma \right)}=\left({\beta}_{{J}_{1},\gamma},\cdots ,{\beta}_{{J}_{\left|\xi \right|},\gamma}\right),{T}_{\left(\xi ,\gamma \right)}=\left({T}_{\left({J}_{1},\gamma \right)},\cdots ,{T}_{\left({J}_{\left|\xi \right|},\gamma \right)}\right),\left|\xi \right|$ denotes the number of selected genes in predicting ${Z}_{i},{T}_{\left({J}_{l},\gamma \right)}$ denotes the vector of the first PLS component of ${X}_{{J}_{l},\gamma}\text{,}$ and *J* _{ l }(*l* = 1, ⋯, |*ξ*|) indexes the selected gene. Note that ${X}_{{J}_{l},\gamma}$ is a sub-matrix of *X*, consisting of only the columns that correspond to selected SNPs in the selected genes.

### Prior specification

*γ*for SNP selection is assumed to follow an independent Bernoulli prior distributions with the same parameter

*π*over all the

*γ*

_{ i }values.

Choice of *π* reflects a user’s prior belief in terms of the numbers of causal SNPs out of *P* candidates.

*g*-prior is commonly used for the regression coefficient parameters

*β*[32]. Yang and Song [33] generalized the g-prior by modifying the matrix inverse to the Moore-Penrose generalized matrix inverse. Since the multicollinearity problem is commonly encountered in GWAS data because of strong linkage disequilibrium between SNPs, we took a similar prior as that of Yang and Song,

where ${\left({T}_{\left(\xi ,\gamma \right)}^{\prime}{T}_{\left(\xi ,\gamma \right)}\right)}^{+}$ denotes the Moore-Penrose generalized inverse of ${T}_{\left(\xi ,\gamma \right)}^{\prime}{T}_{\left(\xi ,\gamma \right)}$. This idea was first adopted by our research team for microarray gene expression data [29].

*et al*. [28] to use an MRF to describe the prior distribution on each component of the gene selection indicator, i.e.,

*μ*and

*η*are tuning parameters, and

*Nb*

_{ j }is the set of neighbors of gene

*j*within the selected pathway. The corresponding multivariate form is given by:

where ${1}_{J}^{\prime}$ is the vector consisting of *J* 1’s. We denote matrix *R* to reflect gene-gene network topological structure, where elements *R* _{ ij } = 1 if there is a direct edge between the *i* ^{th} and *j* ^{th} genes, and *R* _{ ij } = 0 otherwise.

### Posterior distributions

*θ*= (

*γ*,

*ξ*,

*β*

_{(ξ,γ)},

*Z*) given (

*Y*,

*X*) is

where *I*(*A* _{ i }) is the indicator function and *A* _{ i } is either equal to {*Z* _{ i } : *Z* _{ i } > 0} or {*Z* _{ i } : *Z* _{ i } ≤ 0} corresponding to *Y* _{ i } = 1 or *Y* _{ i } = 0, respectively, and ${\lambda}_{1},\cdots {\lambda}_{{m}_{\xi}}\left({m}_{\xi}\le J\right)$ are the nonzero eigenvalues of ${\left({T}_{\left(\xi ,\gamma \right)}^{\prime}{T}_{\left(\xi ,\gamma \right)}\right)}^{+}$.

*β*is a nuisance parameter, we can integrate it out to obtain the joint posterior distribution of (

*Z*,

*ξ*,

*γ*|

*Y*,

*X*) as follows:

### Posterior inference via MCMC sampling

- I.
Sample

*ξ*and*γ*from*P*(*ξ*,*γ*|*Y*,*T*,*Z*): the selection parameters (*ξ*,*γ*) are updated using a Metropolis-Hastings algorithm which is modified from Stingo’s method [28] (details of the MCMC moves to update (*ξ*,*γ*) are given in Additional file 1). The method consists of randomly picking one of the following moves: (1) change the inclusion status of SNP and gene by randomly choosing from adding or removing a gene and a SNP at the same time; (2) change the inclusion status of SNP only by randomly choosing from adding or removing a SNP. - II.
Sample Z from

*P*(*Z*|*Y*,*T*,*γ*,*ξ*): directly sampling from this distribution is known to be difficult. In this article, we follow the method given in Devroye [35] to simulate samples from the univariate truncated normal distribution*P*(*Z*_{ i }|*Z*_{− i},*Y*,*T*,*γ*,*ξ*), where*Z*_{− i}is the vector of Z without the*i*^{th}element.

### Simulation

Simulation studies were conducted to assess the performances of iBVS, LASSO regression, and Stepwise regression using a proprietary set of MatLab codes, an R package glmnet, and the R package lars. We simulated three scenarios of varying different proportion of variance of Z explained by the genetic factors, labeled as follows: (1) H70: genes with network, 70% explained variance; (2) H50: genes with network, 50% explained variance; and (3) H30: genes with network, 30% explained variance.

For each scenario, 50 sets of genotypes without disease status were simulated using software Hapgen2 [36] based on the genotype data from Hapmap project (http://hapmap.ncbi.nlm.nih.gov/). We subsequently generated phenotypes corresponding to each scenario, with 400 individuals and 300 SNPs assorted into 22 genes for each data set. The first 200 individuals were used to fit the iBVS hierarchical model and to evaluate the performance of identifying causal SNPs of the three methods, while the other 200 individuals were used to assess the prediction performance of each method. All the simulations were run under the additive and dominant genetic model respectively to check the model flexibility of the proposed iBVS.

We simulated sets of phenotypes in the following way: First, we specified 10 SNPs *x* _{ j }(j = 1 … 10) as causal SNPs, which were positioned within 5 genes. In order to add network information to gene levels, a network was simulated between the genes. We then conducted a precision matrix *Ω* which contains the network relationship between the genes. If there is an edge between *p* ^{th} gene and *q* ^{th} gene in the network, *ω* _{ pq } and *ω* _{ qp } would be assigned with a non-zero value, otherwise 0. The vector *t* _{ i } was generated from a multivariate normal distribution with zero mean vector and covariance matrix *Σ* = *Ω* ^{− 1}. Then the causal gene score *T* _{ k } (k = 1…5) was calculated by considering both genotype data and gene network information, ${T}_{ik}={\displaystyle \sum _{SNP\phantom{\rule{0.1em}{0ex}}j\in \mathit{gene}\phantom{\rule{0.2em}{0ex}}k}{b}_{j}{x}_{kj}+}\phantom{\rule{0.1em}{0ex}}{t}_{ik}\text{,}$ where *b* _{ j }’s were carefully chosen to indicate different PLS configurations. We subsequently simulated the latent phenotypes score for each individual using *z* _{ i } = ∑*ϕ* _{ j }*T* _{ ik } + *ε* _{ i }, *ε* _{ i } ~ *N*(0, 1). Finally, binary phenotypes *Y* _{ i } for each individual was generated using ${Y}_{i}=\left\{\begin{array}{c}\hfill 1\hfill \\ \hfill 0\hfill \end{array}\right.\begin{array}{c}\hfill \mathrm{if}\phantom{\rule{0.2em}{0ex}}{z}_{i}\ge 0\phantom{\rule{0.1em}{0ex}}\hfill \\ \hfill \mathrm{if}\phantom{\rule{0.25em}{0ex}}{z}_{i}<0\hfill \end{array}$.

### Application

We applied iBVS, LASSO and Stepwise approaches to analyze a GWAS data set designed to identify genetic variants associated with leprosy [37]. The genotype data consisted of 492,109 SNPs from 706 cases and 514 controls after removing genetically unmatched controls, to obviate the need for correction for population stratification. All subjects were Han Chinese from eastern China. In order to select variables and assess the performance of the three variable selection strategies, we randomly divided the data set into two parts: a training set and a testing set, each with 610 samples. The training set was used for SNP selection, while the testing set for validating the selection results and comparing the three methods. The genotype was first coded under the additive genetic model, and we re-coded the genotypes following dominant genetic components and re-analyzed this real data set to check the model flexibility under different genetic effects.

## Results

### Simulation studies

#### Performance of variable selection

The average area under the curve (AUC) was calculated to evaluate the performance of casual SNP identification in each scenario. For SSVS, the AUC is calculated using the following formula [38],[39]. $AUC=\frac{1}{{n}^{D}{n}^{C}}{\displaystyle \sum _{i\in D,j\in C}I}\left\{{\gamma}_{i}>{\gamma}_{j}\right\}\text{,}$ where *D* is the set of the causal SNPs and *C* is the set of the non-causal SNPs; *n* ^{ D } and *n* ^{ C } are the number of causal and non-causal SNPs, respectively.

For the LASSO method, a simple modification of the Least Angle Regression (LAR) algorithm was implemented that calculates the entire LASSO path, which is an efficient way of computing the solution to any LASSO problem especially when *P* ≫ *N* [40]. Using the modified LARS algorithm, one may generate all LASSO solutions corresponding to different values of the penalty parameter. Selecting the active model at a given iteration would give one LASSO solution corresponding to a particular value of the penalty parameter. Hence, one can control the penalty parameter using a cutoff for the number of iterations [38]. For LASSO and Stepwise approaches, the following formula was used to calculate the AUC: $AUC=\frac{1}{{n}^{D}{n}^{C}}{\displaystyle \sum _{i\in D,j\in C}I}\left\{{s}_{i}<{s}_{j}\right\}\text{,}$ where *s* is the iteration at which the *i* ^{th} marker enters the model [38].

*Z*explained by the genetic factors. Obviously, under scenario H70 and H50, the iBVS has the highest averaged AUC (0.911 and 0.894) followed by Lasso (0.891 and 0.882), while the AUC of Stepwise is relative low (0.869 and 0.853). The AUC of iBVS drops to 0.792 with low explained variance (H30), with LASSO and Stepwise both approximating 0.77. The results revealed that iBVS has superior performance compared with that of LASSO and Stepwise regression. Similar trends can also be found under the dominant genetic model (Additional file 1: Table S2).

**Average AUC values of iBVS, LASSO and Stepwise**

Scenario | Average AUC | ||
---|---|---|---|

iBVS | LASSO | Stepwise | |

| 0.911 | 0.891 | 0.869 |

| 0.894 | 0.882 | 0.853 |

| 0.792 | 0.779 | 0.774 |

#### Performance of outcome prediction

We subsequently assessed the prediction performance of the three methods using the remaining 200 individuals. Prediction performances were evaluated by considering correctly/incorrectly predicted positive/negative outcomes. We calculated specificity and sensitivity for thresholds from 0.01 to 0.99, with steps of 0.01. Then the ROC was plotted using mean specificity and sensitivity for a given threshold. For iBVS, we use a two-stage strategy. First the posterior probabilities of all the SNPs were estimated by iBVS. The top *i* SNPs (*i* = 1…300) were subsequently fitted into a PLS logistic regression model, and 10-fold cross-validation was conducted to choose the optimal number of predictors *i*. For LASSO and Stepwise approaches, the optimal model was determined by a 10-fold cross-validation.

### Application

We first conducted screening of all SNPs, one by one by fitting the single-SNP logistic regression model with additive coding. By sorting on the *p*-values from the univariate analysis, we identified 100 genes, each containing at least one SNP with P-value smaller than 0.0004. We subsequently extracted all of the SNPs in each significant gene to comprise the joint effect of SNPs per one gene. The 100 genes selected above contained a total of 3,388 SNPs.

*R*matrix using the KEGG database (details please see the Additional file 1). We subsequently specified prior distributions as described in the Methods Section, with hyper parameters set as:

*π*= 0.01,

*μ*= − 2,

*η*= 0.8. Finally, the MCMC was conducted to make posterior inferences with 10,000 iterations as burn-in and 50,000 additional runs. Figure 3a shows the posterior SNP selection probabilities via our iBVS strategy with integrated biological priors.

A ten-fold cross-validation approach was employed to set a cut-off for determining the optimal prediction model. The top *i* SNPs were added into a PLS logistic model one by one, to estimate the cross-validation error, shown in Figure 3b. It can be seen that the smallest classification error appeared when the top 94 SNPs were selected as predictors. The classification error stabilizes after 94 SNPs were selected into the PLS logistic model, with some slight increase shown. Therefore the top 94 SNPs were selected as ‘significant’ predictors whose information was listed in (see the Additional file 1: Table S1).

**Information of 24 common SNPs selected by both iBVS and Lasso**

SNP | Chromosome | Position | Gene | Posterior probability | Lasso coefficient | Stepwise coefficient |
---|---|---|---|---|---|---|

rs9270984 | 6 | 32681969 | HLA-DR–DQ | 0.583 | 0.307 | 0.195 |

rs7595482 | 2 | 38106517 | FAM82A1 | 0.329 | −0.031 | - |

rs10133203 | 14 | 51425137 | GNG2 | 0.311 | −0.314 | - |

rs2517467 | 6 | 30997239 | VARS2 | 0.283 | 0.237 | 0.148 |

rs3764147 | 13 | 43355925 | C13orf31 | 0.272 | 0.227 | 0.114 |

rs1446297 | 2 | 38061737 | FAM82A1 | 0.256 | −0.208 | - |

rs2237585 | 7 | 94887754 | PON2 | 0.187 | −0.222 | - |

rs42490 | 8 | 90847650 | RIPK2 | 0.135 | −0.143 | - |

rs602875 | 6 | 32681607 | HLA-DR–DQ | 0.104 | −0.090 | - |

rs16945848 | 15 | 60913837 | TLN2 | 0.093 | 0.245 | - |

rs241409 | 6 | 32969898 | LOC100294145 | 0.082 | 0.018 | - |

rs12817755 | 12 | 38585079 | SLC2A13 | 0.08 | −0.137 | - |

rs1343104 | 20 | 57607136 | PHACTR3 | 0.075 | −0.06 | - |

rs10502281 | 11 | 123261833 | TMEM225 | 0.071 | −0.105 | - |

rs2305100 | 13 | 43346934 | CCDC122 | 0.066 | 0.001 | - |

rs447833 | 20 | 42696770 | ADA | 0.058 | 0.209 | - |

rs11632705 | 15 | 25141046 | GABRG3 | 0.057 | −0.043 | - |

rs17065164 | 13 | 43342706 | CCDC122 | 0.051 | −0.066 | - |

rs11900859 | 2 | 138039737 | THSD7B | 0.051 | 0.071 | - |

rs241443 | 6 | 32905093 | TAP2 | 0.045 | 0.18 | - |

rs1897419 | 2 | 137473187 | THSD7B | 0.045 | 0.023 | - |

rs1805867 | 8 | 91100250 | DECR1 | 0.043 | −0.126 | - |

rs2517598 | 6 | 30188253 | TRIM31 | 0.043 | 0.248 | - |

rs17110817 | 14 | 80120188 | CEP128 | 0.04 | 0.005 | - |

## Discussion

GWAS analyses typically approach data as a list of single SNPs, a strategy which has yielded a catalog of susceptibility loci for complex diseases. However, the statistically significant variants detected so far account for only a small proportion of the total phenotypic variation. Gene-based tests for association are increasingly being seen as useful complements to GWASs, demonstrating several attractive features compared with traditional SNP-based analysis [12]-[14]. Beyond gene-based methods, there is an increasing recognition of the potential contributions of pathway-based analyses, in which variants in groups of genes within specific pathways are considered together to predict the phenotype [41]-[43]. In this paper, we followed an integrative biomarker identification scheme to conduct a novel hierarchical model incorporating a gene-gene network or pathway information for SNP identification in GWAS via the Bayesian inference paradigm.

Three scenarios of data sets were simulated, each considering different proportions of variance of outcome explained by genetic factors. Simulation results show our iBVS method outperformed the LASSO and Stepwise methods in identifying causal SNPs in each scenario (Table 1). In addition, we also evaluate the prediction ability of the three methods using additional data sets, and show that iBVS had advantageous performance (Figure 2). The advantages of the proposed iBVS strategy is verified when the real network is known and explicitly employed through a prior specification with the MRF distribution.

After applying iBVS to an actual GWAS dataset, a panel of 94 SNPs were selected as predictors of leprosy. The LASSO method selected 100 SNPs, which included only 24 SNPs in common with iBVS. The Stepwise regression yielded a very parsimonious model with only 3 SNPs. The results indicate that the iBVS method has comparable prediction performance with LASSO, and advantageous with Stepwise. Moreover, all the results are quite stable under the different genetic models. Stepwise regression searches the model space by adding or removing one SNP at a time and therefore the searching is partial, leading to convergence at a local optimum. The reason iBVS does not outperform LASSO may be due to deficiencies in pathway information from existing databases that do not reflect complete signaling pathways. The performance of iBVS can be improved by developing a stochastic inference of the gene-gene networks from the data and merging it into the current iBVS MCMC algorithm, which remains a future goal. Compared with LASSO and other penalized regression methods, which lack appropriate interpretation, iBVS has an additional advantage in that it produces posterior probabilities for each variable. This is a particularly important advantage in GWAS because the primary goal of the analysis is to identify the effect of SNPs. Comparing to metrics like *p*-values, posterior probabilities have clear interpretation. A reviewer of this article suggested that many important traits are generally quantitative and are often controlled by multiple genes in shared biology pathways; our model could be naturally extended to analyze quantitative data by removing the latent variable *Z* from the hierarchical model.

Efficiency of stochastic algorithms often diminishes as the total number of variables increases [19],[21]. It would be appealing to remove noisy data points or those with lower quality before using a stochastic search. Therefore, we first screened the total number of SNPs using a conventional SNP-based model to filter the number of SNPs included in the Bayesian hierarchical model. This set of candidate SNPs and their associated genes is called the ‘signature set’ in the sense that they are possibly signaling SNPs/genes (in other words, causal or marker SNPs/genes). We extracted all the SNPs in each significant gene to comprise the joint effect of a gene, leaving the weaker candidate genes out of the iBVS. The screening step should be viewed as a general step not only for dimension reduction but for constructing a functional context before conducting BVS.

A few issues regarding our model choices and computation should be highlighted. We mainly adopted the perspective of subjective Bayesian analysis due to the fact that we want to incorporate informative priors from relevant scientific sources. Choosing an objective prior that satisfies some fundamental principles as summarized in Bayarri et al. [44] would be theoretically appealing in future work. Another issue concerns computational burden. With a large number of parameters in the model, the inference is mainly based on Monte Carlo simulation, which may take a prolonged time. Running over a single computer with 3.3GH CPU computer and 8GB memory, 6 days were required to finish the leprosy data analysis. With the advent of high-speed cluster computers and the existence of cloud computing technologies, it is becoming increasingly feasible to apply full iBVS methods for biomarker identification.

## Conclusions

We proposed an iBVS method to analyze high dimensional GWAS data sets based on a hierarchical model that incorporates an informative prior on networked gene interrelationships. Simulation studies showed that our iBVS method had better performance in both biomarker identification and disease prediction than LASSO and Stepwise models. A leprosy GWAS analysis showed iBVS method demonstrated a comparable performance with LASSO, and better than Stepwise methods. iBVS did not outperform LASSO, which may be due to deficiencies in existing signaling pathway databases that are likely to be improved as the knowledge base increases. In summary, the proposed iBVS strategy is a valid method for GWAS, having an additional advantage in the production of posterior probabilities for each variable that are again subject to continued refinement.

## Additional file

## Notes

### Acknowledgments

This research was supported by Eunice Kennedy Shriver National Institute of Child Health & Human Development, NIH, USA [5R01HD061404]; National Institute on Drug Abuse, NIH, USA [R44DA026683]; National Natural Science Foundation of China, China [81273177]. The authors would like to thank the editor and reviewers for their valuable comments and suggestions to improve the quality of the paper.

## Supplementary material

## References

- 1.Kettunen J, Tukiainen T, Sarin AP, Ortega-Alonso A, Tikkanen E, Lyytikäinen LP, Kangas AJ, Soininen P, Würtz P, Silander K, Dick DM, Rose RJ, Savolainen MJ, Viikari J, Kähönen M, Lehtimäki T, Pietiläinen KH, Inouye M, McCarthy MI, Jula A, Eriksson J, Raitakari OT, Salomaa V, Kaprio J, Järvelin MR, Peltonen L, Perola M, Freimer NB, Ala-Korpela M, Palotie A, et al: Genome-wide association study identifies multiple loci influencing human serum metabolite levels. Nat Genet. 2012, 44 (3): 269-276. 10.1038/ng.1073.PubMedPubMedCentralCrossRefGoogle Scholar
- 2.Chasman DI, Schürks M, Anttila V, de Vries B, Schminke U, Launer LJ, Terwindt GM, van den Maagdenberg AM, Fendrich K, Völzke H, Ernst F, Griffiths LR, Buring JE, Kallela M, Freilinger T, Kubisch C, Ridker PM, Palotie A, Ferrari MD, Hoffmann W, Zee RY, Kurth T: Genome-wide association study reveals three susceptibility loci for common migraine in the general population. Nat Genet. 2011, 43 (7): 695-698. 10.1038/ng.856.PubMedPubMedCentralCrossRefGoogle Scholar
- 3.Goode EL, Chenevix-Trench G, Song H, Ramus SJ, Notaridou M, Lawrenson K, Widschwendter M, Vierkant RA, Larson MC, Kjaer SK, Birrer MJ, Berchuck A, Schildkraut J, Tomlinson I, Kiemeney LA, Cook LS, Gronwald J, Garcia-Closas M, Gore ME, Campbell I, Whittemore AS, Sutphen R, Phelan C, Anton-Culver H, Pearce CL, Lambrechts D, Rossing MA, Chang-Claude J, Moysich KB, Goodman MT, et al: A genome-wide association study identifies susceptibility loci for ovarian cancer at 2q31 and 8q24. Nat Genet. 2010, 42 (10): 874-879. 10.1038/ng.668.PubMedPubMedCentralCrossRefGoogle Scholar
- 4.Smyth DJ, Cooper JD, Bailey R, Field S, Burren O, Smink LJ, Guja C, Ionescu-Tirgoviste C, Widmer B, Dunger DB, Savage DA, Walker NM, Clayton DG, Todd JA: A genome-wide association study of nonsynonymous SNPs identifies a type 1 diabetes locus in the interferon-induced helicase (IFIH1) region. Nat Genet. 2006, 38 (6): 617-619. 10.1038/ng1800.PubMedCrossRefGoogle Scholar
- 5.Cooper JD, Smyth DJ, Smiles AM, Plagnol V, Walker NM, Allen JE, Downes K, Barrett JC, Healy BC, Mychaleckyj JC, Warram JH, Todd JA: Meta-analysis of genome-wide association study data identifies additional type 1 diabetes risk loci. Nat Genet. 2008, 40 (12): 1399-1401. 10.1038/ng.249.PubMedPubMedCentralCrossRefGoogle Scholar
- 6.McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, Ioannidis JPA, Hirschhorn JN: Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet. 2008, 9 (5): 356-369. 10.1038/nrg2344.PubMedCrossRefGoogle Scholar
- 7.Manolio TA, Collins FS, Cox NJ, Goldstein DB, Hindorff LA, Hunter DJ, McCarthy MI, Ramos EM, Cardon LR, Chakravarti A, Cho JH, Guttmacher AE, Kong A, Kruglyak L, Mardis E, Rotimi CN, Slatkin M, Valle D, Whittemore AS, Boehnke M, Clark AG, Eichler EE, Gibson G, Haines JL, Mackay TF, McCarroll SA, Visscher PM: Finding the missing heritability of complex diseases. Nature. 2009, 461 (7265): 747-753. 10.1038/nature08494.PubMedPubMedCentralCrossRefGoogle Scholar
- 8.Visscher PM: Sizing up human height variation. Nat Genet. 2008, 40 (5): 489-490. 10.1038/ng0508-489.PubMedCrossRefGoogle Scholar
- 9.Gibson G: Hints of hidden heritability in GWAS. Nat Genet. 2010, 42 (7): 558-560. 10.1038/ng0710-558.PubMedCrossRefGoogle Scholar
- 10.Eichler EE, Flint J, Gibson G, Kong A, Leal SM, Moore JH, Nadeau JH: Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet. 2010, 11 (6): 446-450. 10.1038/nrg2809.PubMedPubMedCentralCrossRefGoogle Scholar
- 11.Stranger BE, Stahl EA, Raj T: Progress and promise of genome-wide association studies for human complex trait genetics. Genetics. 2011, 187 (2): 367-383. 10.1534/genetics.110.120907.PubMedPubMedCentralCrossRefGoogle Scholar
- 12.Liu JZ, McRae AF, Nyholt DR, Medland SE, Wray NR, Brown KM, Investigators AMFS, Hayward NK, Montgomery GW, Visscher PM, Martin NG, Macgregor S: A versatile gene-based test for genome-wide association studies. Am J Hum Genet. 2010, 87 (1): 139-145. 10.1016/j.ajhg.2010.06.009.PubMedPubMedCentralCrossRefGoogle Scholar
- 13.Yang J, Ferreira T, Morris AP, Medland SE, Madden PA, Heath AC, Martin NG, Montgomery GW, Weedon MN, Loos RJ, Frayling TM, McCarthy MI, Hirschhorn JN, Goddard ME, Visscher PM, Genetic Investigation of ANthropometric Traits (GIANT) Consortium, DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium: Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet. 2012, 44 (4): 369-375. 10.1038/ng.2213. S1-S3PubMedPubMedCentralCrossRefGoogle Scholar
- 14.Li M, Gui H, Kwan JS, Sham PC: GATES: a rapid and powerful gene-based association test using extended Simes procedure. Am J Hum Genet. 2011, 88 (3): 283-293. 10.1016/j.ajhg.2011.01.019.PubMedPubMedCentralCrossRefGoogle Scholar
- 15.Prentice RL, Lihong QI: Aspects of the design and analysis of high-dimensional SNP studies for disease risk estimation. Biostatistics. 2006, 7 (3): 339-354. 10.1093/biostatistics/kxj020.PubMedCrossRefGoogle Scholar
- 16.Sölkner J: Very many variables and limited numbers of observations; The p>> n problem in current statistical applications. Information Technology Interfaces (ITI). Proceedings of the ITI 2012 34th International Conference 25-28 June 2012. 2012, 13-14.Google Scholar
- 17.Tadesse MG, Sha N, Vannucci M: Bayesian variable selection in clustering high-dimensional data. J Am Stat Assoc. 2005, 100 (470): 602-617. 10.1198/016214504000001565.CrossRefGoogle Scholar
- 18.Mitchell TJ, Beauchamp JJ: Bayesian variable selection in linear regression. J Am Stat Assoc. 1988, 83 (404): 1023-1032. 10.1080/01621459.1988.10478694.CrossRefGoogle Scholar
- 19.Guan Y, Stephens M: Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann Appl Stat. 2011, 5 (3): 1780-1815. 10.1214/11-AOAS455.CrossRefGoogle Scholar
- 20.Fridley BL: Bayesian variable and model selection methods for genetic association studies. Genet Epidemiol. 2008, 33 (1): 27-37. 10.1002/gepi.20353.CrossRefGoogle Scholar
- 21.Wilson MA, Iversen ES, Clyde MA, Schmidler SC, Schildkraut JM: Bayesian model search and multilevel inference for SNP association studies. Ann Appl Stat. 2010, 4 (3): 1342-10.1214/09-AOAS322.PubMedPubMedCentralCrossRefGoogle Scholar
- 22.Banerjee S, Yandell BS, Yi N: Bayesian quantitative trait loci mapping for multiple traits. Genetics. 2008, 179 (4): 2275-2289. 10.1534/genetics.108.088427.PubMedPubMedCentralCrossRefGoogle Scholar
- 23.Russu A, Malovini A, Puca AA, Bellazzi R: Stochastic model search with binary outcomes for genome-wide association studies. J Am Med Inform Assn. 2012, 19 (e1): e13-e20. 10.1136/amiajnl-2011-000741.CrossRefGoogle Scholar
- 24.Hoggart CJ, Whittaker JC, De Iorio M, Balding DJ: Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet. 2008, 4 (7): e1000130-10.1371/journal.pgen.1000130.PubMedPubMedCentralCrossRefGoogle Scholar
- 25.Kwon S, Wang D, Guo X: Application of an iterative Bayesian variable selection method in a genome-wide association study of rheumatoid arthritis. BMC Proc. 2007, 1 (Suppl 1): S109-10.1186/1753-6561-1-s1-s109.PubMedPubMedCentralCrossRefGoogle Scholar
- 26.Torkamani A, Schork NJ: Pathway and network analysis with high-density allelic association data. Methods Mol Biol. 2009, 563: 289-301. 10.1007/978-1-60761-175-2_16.PubMedCrossRefGoogle Scholar
- 27.Baranzini SE, Galwey NW, Wang J, Khankhanian P, Lindberg R, Pelletier D, Wu W, Uitdehaag BMJ, Kappos L, Polman CH: Pathway and network-based analysis of genome-wide association studies in multiple sclerosis. Hum Mol Genet. 2009, 18 (11): 2078-2090. 10.1093/hmg/ddp120.PubMedPubMedCentralCrossRefGoogle Scholar
- 28.Stingo FC, Chen YA, Tadesse MG, Vannucci M: Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. Ann Appl Stat. 2011, 5 (3): 1978-2002. 10.1214/11-AOAS463.PubMedCrossRefGoogle Scholar
- 29.Peng B, Zhu D, Ander BP, Zhang X, Xue F, Sharp FR, Yang X: An Integrative Framework for Bayesian variable selection with informative priors for identifying genes and pathways. PLoS One. 2013, 8 (7): e67672-10.1371/journal.pone.0067672.PubMedPubMedCentralCrossRefGoogle Scholar
- 30.Chuang H, Lee E, Liu Y, Lee D, Ideker T: Network-based classification of breast cancer metastasis. Mol Syst Biol. 2007, 3: 140-10.1038/msb4100180.PubMedPubMedCentralCrossRefGoogle Scholar
- 31.Lee E, Chuang H, Kim J, Ideker T, Lee D: Inferring pathway activity toward precise disease classification. PLoS Comput Biol. 2008, 4 (11): e1000217-10.1371/journal.pcbi.1000217.PubMedPubMedCentralCrossRefGoogle Scholar
- 32.Zellner A: On assessing prior distributions and Bayesian regression analysis with g-prior distributions. Bayesian Inference Decision Techniques. 1986, 6: 233-243.Google Scholar
- 33.Ai-Jun Y, Xin-Yuan S: Bayesian variable selection for disease classification using gene expression data. Bioinformatics. 2010, 26 (2): 215-222. 10.1093/bioinformatics/btp638.CrossRefGoogle Scholar
- 34.Li F, Zhang NR: Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. J Am Stat Assoc. 2010, 105 (491): 1202-1214. 10.1198/jasa.2010.tm08177.CrossRefGoogle Scholar
- 35.Devroye L: Sample-based Non-uniform random variate generation. Proceedings of the 18th conference on Winter simulation. ACM. 1986, 260-265. 10.1145/318242.318443.CrossRefGoogle Scholar
- 36.Su Z, Marchini J, Donnelly P: HAPGEN2: simulation of multiple disease SNPs. Bioinformatics. 2011, 27 (16): 2304-2305. 10.1093/bioinformatics/btr341.PubMedPubMedCentralCrossRefGoogle Scholar
- 37.Zhang FR, Huang W, Chen SM, Sun LD, Liu H, Li Y, Cui Y, Yan XX, Yang HT, Yang RD: Genomewide association study of leprosy. New Engl J Med. 2009, 361 (27): 2609-2618. 10.1056/NEJMoa0903753.PubMedCrossRefGoogle Scholar
- 38.Srivastava S, Chen L: Comparison between the stochastic search variable selection and the least absolute shrinkage and selection operator for genome-wide association studies of rheumatoid arthritis. BMC Proc. 2009, 3 (Suppl 7): S21-10.1186/1753-6561-3-s7-s21.PubMedPubMedCentralCrossRefGoogle Scholar
- 39.Ma S, Huang J: Combining multiple markers for classification using ROC. Biometrics. 2007, 63 (3): 751-757. 10.1111/j.1541-0420.2006.00731.x.PubMedCrossRefGoogle Scholar
- 40.Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Ann Appl Stat. 2004, 32 (2): 407-499. 10.1214/009053604000000067.CrossRefGoogle Scholar
- 41.Ramanan VK, Shen L, Moore JH, Saykin AJ: Pathway analysis of genomic data: concepts, methods, and prospects for future development. Trends Genet. 2012, 28 (7): 323-332. 10.1016/j.tig.2012.03.004.PubMedPubMedCentralCrossRefGoogle Scholar
- 42.Consortium IMSG: Network-based multiple sclerosis pathway analysis with GWAS data from 15,000 cases and 30,000 controls. Am J Hum Genet. 2013, 92 (6): 854-10.1016/j.ajhg.2013.04.019.CrossRefGoogle Scholar
- 43.Mukherjee S, Kim S, Ramanan VK, Gibbons LE, Nho K, Glymour MM, Ertekin-Taner N, Montine TJ, Saykin AJ, Crane PK: Gene-based GWAS and biological pathway analysis of the resilience of executive functioning. Brain Imaging Behav. 2014, 8 (1): 110-118. 10.1007/s11682-013-9259-7.PubMedPubMedCentralCrossRefGoogle Scholar
- 44.Bayarri MJ, Berger JO, Forte A, García-Donato G: Criteria for Bayesian model choice with application to variable selection. Ann Appl Stat. 2012, 40 (3): 1550-1577. 10.1214/12-AOS1013.CrossRefGoogle Scholar

## Copyright information

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.