Background

Expression quantitative trait loci (eQTL) mapping is the process of identifying single nucleotide polymorphisms (SNPs) that play important roles in the expression of genes. It has been widely used to dissect genetic basis of complex traits [1,2]. Traditionally, associations between individual expression traits and SNPs are assessed separately [3,4].

Since genes in the same biological pathway are often co-regulated and may share a common genetic basis [5,6], it is crucial to understand how multiple modestly-associated SNPs interact to influence the phenotypes [7]. To address this issue, several approaches have been proposed to study the joint effect of multiple SNPs by testing the association between a set of SNPs and a gene expression trait. A straightforward approach is to follow the gene set enrichment analysis (GESA) [8]. In [9], the authors propose variance component models for SNP set testing. Aggregation-based approaches such as collapsing SNPs are investigated in [10]. In [11], the authors take confounding factors into consideration.

Despite their success, these methods have two common limitations. First, they only study the association between a set of SNPs and a single expression trait, thus overlook the joint effect of a set of SNPs on the activities of a set of genes, which may act and interact with each other to achieve certain biological function. Second, the SNP sets used in these methods are usually taken from known pathways. However, the existing knowledge on biological pathways is far from being complete. These methods cannot identify unknown associations between SNP sets or gene sets.

To address these limitations, in [12], a method is developed to identify cliques in a bipartite graph derived from the eQTL data. Cliques are used to model the hidden correlations between SNP sets and gene sets. However, this method needs the progeny strain information, which is used as a bridge for modeling the eQTL association graphs. In [13], the authors proposed a method to infer associations between sets of SNPs and sets of genes. However, this method does not consider the associations between individual SNPs and genes. A two-graph-guided multi-task Lasso approach was developed in [14]. This method needs to calculate gene co-expression network and SNP correlation network first. Errors and noises in these two networks may introduce bias in the final results. A graph regularized dual lasso approach considering the factor of group-wise association was developed in [15]. This method, however, needs extra SNP-SNP interaction network and PPI network data to penalize the regression model and it’s not able to infer novel group-wise associations. Note that all these methods do not consider confounding factors.

To better elucidate the genetic basis of gene expression and understand the underlying biology pathways, it is highly desirable to develop methods that can automatically infer associations between a group of SNPs and a group of genes. We refer to the process of identifying such associations as group-wise eQTL mapping. In contrast, we refer to the process of identifying associations between individual SNPs and genes as individual eQTL mapping. In this paper, we introduce a fast and robust approach to identify novel associations between sets of SNPs and sets of genes. Our model is a multi-layer linear-Gaussian model and uses two different types of hidden variables: one capturing group-wise associations and the other capturing confounding factors [11,16-20]. We apply an 1-norm on the parameters [3,21], which yields a sparse network with a large number of association weights being zero [22]. We develop an efficient optimization procedure that makes this approach suitable for large-scale studiesa. Extensive experimental evaluations using both simulated and real datasets demonstrate that the proposed methods can effectively capture both group-wise and individual associations and significantly outperforms the state-of-the-art eQTL mapping methods.

Methods

Preliminaries

Throughout the paper, we assume that, for each sample, the SNPs and genes are represented by column vectors. Let x = [ x 1,x 2,…,x K ]T represent the K SNPs in the study, where x i ∈{0,1,2} is a random variable corresponding to the i-th SNPb. Let z= [ z 1,z 2,…,z N ]T represent the N genes in the study, where z j is a continuous random variable corresponding to the j-th gene. Table 1 summarizes the main symbols used in this paper.

Table 1 Summary of notations

The traditional linear regression model for association mapping between x and z is

$$ \mathbf{z}=\boldsymbol{\beta}\mathbf{x}+\boldsymbol{\mu}+\boldsymbol{\epsilon}, $$
((1))

where z is a linear function of x with coefficient matrix β. μ is an N×1 translation factor vector. ε is the additive noise of Gaussian distribution with zero-mean and variance ψ I, where ψ is a scalar. That is, εN(0,ψ I).

The question now is how to define an appropriate objective function to decompose β which (1) can effectively detect both individual and group-wise eQTL associations, and (2) is efficient to compute so that it is suitable for large-scale studies. In the next, we will propose a group-wise eQTL detection method first, then improve it to capture both individual and group-wise associations. Then we will discuss how to boost the computational efficiency.

Graphical model for group-wise eQTL mapping

To infer associations between SNP sets and gene sets while taking into consideration confounding factors, we propose a graphical model as shown in Figure 1. This model is a two-layer linear Gaussian model. There are two different types of hidden variables in the middle layer. One is used to capture the group-wise association between SNP sets and gene sets. These latent variables are presented as y= [ y 1,y 2,…,y M ]T, where M is the total number of latent variables bridging SNP sets and gene sets. Each hidden variable may represent a latent factor regulating a set of genes, and its associated genes may correspond to a set of genes in the same pathway or participating in certain biological function. Another type of hidden variable, s= [ s 1,s 2,…,s H ]T, is used to model confounding factors. Note that this model allows a SNP or gene to participate in multiple (SNP set, gene set) pairs. This is reasonable because SNPs and genes may play different roles in multiple biology pathways.

Figure 1
figure 1

Graphical model with two types of hidden variables, shaded nodes denote observed random variables, and unshaded nodes denote latent variables.

Incorporating individual effect

In the graphical model shown in Figure 1, we use a hidden variable y as a bridge between a SNP set and a gene set to capture the group-wise effect. In addition, individual effects may exist as well [11]. To incorporate both individual and group-wise effects, we extend the model in Figure 1 and add one edge between x and z to capture individual associations as shown in Figure 2. We will show that this refinement will significantly improve the accuracy of model and enhance its computational efficiency.

Figure 2
figure 2

Refined graphical model to capture both individual and group-wise associations, shaded nodes denote observed random variables, and unshaded nodes denote latent variables.

Objective function

Next, we give the derivation of the objective function for the model in Figure 2. We assume that the two conditional probabilities follow normal distributions:

$$ \mathbf{y}|\mathbf{x} \sim N\left(\mathbf{y}|\textbf{A}x+\boldsymbol{\mu}_{\textbf{A}},{\sigma_{1}^{2}}\mathbf{I}_{\mathbf{M}}\right), $$
((2))

and

$$ \mathbf{z}|\mathbf{y,x}\sim N\left(\mathbf{z}|\mathbf{By} + \mathbf{Cx} + \mathbf{Ws} + \boldsymbol{\mu}_{\mathbf{B}},{\sigma_{2}^{2}}\mathbf{I}_{\mathbf{N}}\right), $$
((3))

where \(\textbf {A} \in \mathbb {R}^{M\times K}\) is the coefficient matrix between x and y, \(\textbf {B} \in \mathbb {R}^{N\times M}\) is the coefficient matrix between y and z, \(\textbf {C} \in \mathbb {R}^{N\times K}\) is the coefficient matrix between x and z to capture the individual associations, \(\textbf {W} \in \mathbb {R}^{N\times H}\) is the coefficient matrix of confounding factors. \(\boldsymbol {\mu }_{\textbf {A}}\in \mathbb {R}^{M\times 1}\) and \(\boldsymbol {\mu }_{\textbf {B}}\in \mathbb {R}^{N\times 1}\) are the translation factor vectors, \({\sigma _{1}^{2}}\mathbf {I}_{M}\) and \({\sigma _{2}^{2}}\mathbf {I}_{N}\) are the variances of the two conditional probabilities respectively (σ 1 and σ 2 are constant scalars and I M and I N are identity matrices).

Since the expression level of a gene is usually affected by a small fraction of SNPs, we impose sparsity on A, B and C. We assume that the entries of these matrices follow Laplace distributions:

A i,j Laplace(0,1/λ),

B i,j Laplace(0,1/γ), and

C i,j Laplace(0,1/α).

λ, γ and α will be used as parameters in the objective function. The probability density function of Laplace(μ,b) distribution is \(f(x|\mu,b)=\frac {1}{2b}\exp \left (-\frac {|x-\mu |}{b}\right)\).

Thus, we have

$$ \mathbf{y} = \mathbf{Ax} + \boldsymbol{\mu}_{\mathbf{A}} + \boldsymbol{\epsilon}_{1}, $$
((4))
$$ \mathbf{z} = \mathbf{By} + \mathbf{Cx} + \mathbf{Ws} + \boldsymbol{\mu}_{\mathbf{B}} + \boldsymbol{\epsilon}_{2}, $$
((5))

where \(\boldsymbol {\epsilon }_{1} \sim N\left (\textbf {0}, {\sigma _{1}^{2}} \mathbf {I}_{M}\right), \boldsymbol {\epsilon }_{2} \sim N\left (\textbf {0}, {\sigma _{2}^{2}} \mathbf {I}_{N}\right)\). From Eq. (2) we have

$$ \mathbf{By}|\mathbf{x} \sim N\left(\mathbf{BAx} + \mathbf{B}\boldsymbol{\mu}_{\mathbf{A}}, {\sigma_{1}^{2}} \mathbf{BB}^{\mathrm{T}}\right), $$
((6))

Assuming that the confounding factors follow normal distribution [11], sN(0,I H ), then we have

$$ \mathbf{W}\mathbf{s} \sim N(\mathbf{0}, \mathbf{WW}^{\mathrm{T}}). $$
((7))

We substitute Eq. (6), (7) into Eq. (5), and get

$$ {}\mathbf{z}|\mathbf{x}\!\sim\! N\!\left(\mathbf{BAx}\,+\,\mathbf{B}\boldsymbol{\mu}_{\mathbf{A}} \,+\, \mathbf{Cx} \,+\, \boldsymbol{\mu}_{\mathbf{B}}, {\sigma_{1}^{2}} \mathbf{BB}^{\mathrm{T}} + \mathbf{WW}^{\mathrm{T}} \,+\, {\sigma_{2}^{2}} \mathbf{I}_{N}\right). $$

From the formula above, we observe that the summand B μ A can also be integrated in μ B . Thus to simplify the model, we set μ A =0 and obtain

$$ \mathbf{z}|\mathbf{x} \sim N\left(\mathbf{BAx} + \mathbf{Cx} + \boldsymbol{\mu}_{\mathbf{B}}, {\sigma_{1}^{2}} \mathbf{BB}^{\mathrm{T}} + \mathbf{WW}^{\mathrm{T}} + {\sigma_{2}^{2}} \mathbf{I}_{N}\right). $$

To learn the parameters, we can use MLE (Maximize Likelihood Estimation) or MAP (Maximum a posteriori). Then, we get the likelihood function as \(p(\mathbf {z}|\mathbf {x})=\prod _{d=1}^{D} p(\mathbf {z}_{d}|\mathbf {x}_{d})\). Maximizing the likelihood function is identical to minimizing the negative log-likelihood. Here, the negative log-likelihood (loss function) is

$$ \begin{aligned} \mathcal{J}&=\sum_{d=1}^{D}\mathcal{J}_{d}\\ &=-1\cdot\log \prod_{d=1}^{D} p(\mathbf{z}_{d}|\mathbf{x}_{d})\\ &=\sum_{d=1}^{D}(-1)\cdot\log p(\mathbf{z}_{d}|\mathbf{x}_{d})\\ &= \frac{D \cdot N}{2} \log(2 \pi) + \frac{D}{2} \log|\boldsymbol{\Sigma}| \\ &\quad+ \frac{1}{2} \sum_{d=1}^{D} \left[\left(\mathbf{z}_{d} - \boldsymbol{\mu}_{d}\right)^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \left(\mathbf{z}_{d} - \boldsymbol{\mu}_{d}\right)\right], \end{aligned} $$
((8))

where

$$ \boldsymbol{\mu}_{d} = \mathbf{B A x}_{d} + \mathbf{C} \mathbf{x}_{d} + \boldsymbol{\mu}_{\mathbf{B}}, $$
$$ \boldsymbol{\Sigma} = {\sigma_{1}^{2}} \mathbf{B B}^{\mathrm{T}} + \mathbf{WW}^{\mathrm{T}} + {\sigma_{2}^{2}} \mathbf{I}_{N}. $$

Moreover, taking into account the prior distributions of A, B and C, we have

$$ \begin{aligned} &p(\mathbf{z}_{d},\textbf{A},\textbf{B},\textbf{C}|\mathbf{x}_{d},\textbf{W}, \sigma_{1}, \sigma_{2})\\&\quad= \exp(-\mathcal{J}_{d})\cdot\frac{\lambda}{2}\prod_{i,j}\exp(-\lambda|\textbf{A}_{i,j}|)\\ &\qquad\cdot\frac{\gamma}{2}\prod_{i,j}\exp(-\gamma|\textbf{B}_{i,j}|)\cdot\frac{\alpha}{2}\prod_{i,j}\exp(-\alpha|\textbf{C}_{i,j}|). \end{aligned} $$
((9))

Thus, we have the 1-regularized objective function

$$\begin{array}{@{}rcl@{}} \max_{\textbf{A},\textbf{B},\textbf{C},\textbf{W},\sigma_{1},\sigma_{2}}\log\prod_{d=1}^{D} p(\mathbf{z}_{d},\textbf{A},\textbf{B},\textbf{C}|\mathbf{x}_{d},\textbf{W}, \sigma_{1}, \sigma_{2}), \end{array} $$

which is identical to

$${} \min_{\textbf{A},\textbf{B},\textbf{C},\textbf{W},\sigma_{1},\sigma_{2}} [\!\mathcal{J}+D\cdot(\lambda||\textbf{A}||_{1}+\gamma||\textbf{B}||_{1}+\alpha||\textbf{C}||_{1})], $$
((10))

where ||·||1 is the 1-norm. λ, γ and α are the precision of the prior Laplace distributions of A, B and C respectively. They serve as the regularization parameters and can be determined by cross or holdout validation.

The explicit expression of μ B can be derived as follows. When A, B and C are fixed, we have \( \mathcal {J} = \frac {D \cdot N}{2} \log (2 \pi) + \frac {D}{2} \log |\boldsymbol {\Sigma }| + \frac {1}{2} \sum _{d=1}^{D} \vphantom {\boldsymbol {\Sigma }^{-1}}[\!(\mathbf {z}_{d} - \mathbf {B A x}_{d} - \mathbf {C x}_{d}- \boldsymbol {\mu }_{\mathbf {B}})^{\mathrm {T}} \boldsymbol {\Sigma }^{-1} (\mathbf {z}_{d} - \mathbf {B A x}_{d} - \mathbf {C x}_{d} - \boldsymbol {\mu }_{\mathbf {B}})] \). When D=1, this is a classic maximum likelihood estimatation problem, and we have μ B =z d B A x d C x d . When D>1, leveraging the fact that Σ −1 is symmetric, we convert the problem into a least-square problem, which leads to

$$\boldsymbol{\mu}_{\mathbf{B}} = \frac{1}{D} \sum\limits_{d=1}^{D} (\mathbf{z}_{d} - \mathbf{B A x}_{d} - \mathbf{C x}_{d}). $$

Substituting it into Eq. (8), we have

$$ \begin{aligned} \mathcal{J} =&\, \frac{D \cdot N}{2} \log(2 \pi) + \frac{D}{2} \log|\boldsymbol{\Sigma}| + \frac{1}{2} {\sum\nolimits}_{d=1}^{D} \left\{\left[\left(\mathbf{z}_{d} - \bar{\mathbf{z}}\right) \right.\right.\\&\left.- \left(\mathbf{B A} + \mathbf{C}\right) \left(\mathbf{x}_{d} - \bar{\mathbf{x}}\right)\right]^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} \left[\left(\mathbf{z}_{d} - \bar{\mathbf{z}}\right) \right.\\&\left.\left.- \left(\mathbf{B A} + \mathbf{C}\right) \left(\mathbf{x}_{d} - \bar{\mathbf{x}}\right)\right]\right\}, \end{aligned} $$
((11))

where

$$ \bar{\mathbf{x}} = \frac{1}{D}\sum_{d=1}^{D} \mathbf{x}_{d}, \qquad\qquad \bar{\mathbf{z}} = \frac{1}{D}\sum_{d=1}^{D} \mathbf{z}_{d}. $$

Optimization

To optimize the objective function, there are many off-the-shelf 1-penalized optimization tools. We use the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) algorithm described in [23]. The OWL-QN algorithm minimizes functions of the form

$$f(w) = loss(w) + c ||w||_{1}, $$

where loss(·) is an arbitrary differentiable loss function, and ||w||1 is the 1-norm of the parameter vector. It is based on the L-BFGS Quasi-Newton algorithm [24], with modifications to deal with the fact that the 1-norm is not differentiable. The algorithm is proven to converge to a local optimum of the parameter vector. The algorithm is very fast, and capable of scaling efficiently to problems with millions of parameters. Thus it is a good option for our problem where the parameter space is large when dealing with large scale eQTL data.

In addition to the loss function and penalized parameters, the OWL-QN algorithm also requires the gradient of the loss function, which (without detailed derivation) is given in the Additional file 1.

Computational speedup

In this section, we discuss how to speedup the optimization process for the proposed model. In the previous section, we have shown that A, B, C, W, σ 1, and σ 2 are the parameters to be solved. Here, we first derive an updating scheme for σ 2 when other parameters are fixed by following a similar technique as discussed in [25]. For other parameters, we develop an efficient method for calculating the inverse of the covariance matrix which is the main bottleneck of the optimization process.

Updating σ 2

When all other parameters are fixed, using spectral decomposition on \(\left ({\sigma _{1}^{2}} \mathbf {B B}^{\mathrm {T}} + \mathbf {WW}^{\mathrm {T}}\right)\), we have

$$ { \fontsize{9.5}{6}\begin{aligned} \boldsymbol{\Sigma} &= \left({\sigma_{1}^{2}} \mathbf{B B}^{\mathrm{T}} + \mathbf{WW}^{\mathrm{T}}\right) + {\sigma_{2}^{2}} \mathbf{I}_{N}\\ &= \left[\mathbf{U}, \mathbf{V}\right] \text{diag}\left(\lambda_{1}+{\sigma_{2}^{2}}, \ldots, \lambda_{N-q}+{\sigma_{2}^{2}}, 0, \ldots, 0\right) \left[\mathbf{U}, \mathbf{V}\right]^{\mathrm{T}}\\ &= \mathbf{U} \text{diag}\left(\lambda_{1}+{\sigma_{2}^{2}}, \ldots, \lambda_{N-q}+{\sigma_{2}^{2}}\right) \mathbf{U}^{\mathrm{T}}, \end{aligned}} $$
((12))

where U is an N×(Nq) eigenvector matrix corresponding to the nonzero eigenvalues; V is an N×q eigenvector matrix corresponding to the zero eigenvalues. A reasonable solution should have no zero eigenvalues in Σ, otherwise the loss function would be infinitely big. Therefore, q=0.

Thus

$$\boldsymbol{\Sigma}^{-1} = \mathbf{U} \text{diag}\left(\frac{1}{\lambda_{1}+{\sigma_{2}^{2}}}, \ldots, \frac{1}{\lambda_{N}+{\sigma_{2}^{2}}}\right) \mathbf{U}^{\mathrm{T}}. $$

Let U T(z d B A x d C x d μ B )= :[η d,1,η d,2,…,η d,N ]T. Then solving σ 2 is equivalent to minimizing

$$ \begin{aligned} l\left({\sigma_{2}^{2}}\right) =&\, \frac{D \cdot N}{2} \log(2 \pi) + \frac{D}{2} \sum\limits_{s=1}^{N} \log\left(\lambda_{s}+{\sigma_{2}^{2}}\right) \\ &+ \frac{1}{2} \sum\limits_{d=1}^{D} \sum\limits_{s=1}^{N} \frac{\eta_{d, s}^{2}}{\lambda_{s}+{\sigma_{2}^{2}}}, \end{aligned} $$
((13))

whose derivative is

$$l^{\prime}\left({\sigma_{2}^{2}}\right) = \frac{D}{2} \sum\limits_{s=1}^{N} \frac{1}{\lambda_{s}+{\sigma_{2}^{2}}} - \frac{1}{2} \sum_{d=1}^{D} \sum\limits_{s=1}^{N} \frac{\eta_{d, s}^{2}}{\left(\lambda_{s}+{\sigma_{2}^{2}}\right)^{2}}. $$

This is a 1-dimensional optimization problem that can be solved very efficiently.

Efficiently inverting the covariance matrix

From objective function Eq. 11 and the gradient of the parameters (given in the Additional file 1), the time complexity of each iteration in the optimization procedure is \(\mathcal {O}(DN^{2}M+DN^{2}H+DN^{3}+DNMK)\). Since MN and HN, the third term of the time complexity (\(\mathcal {O}(DN^{3})\)) is the bottleneck of the overall performance. This is for computing the inverse of the covariance matrix

$$\boldsymbol{\Sigma} = {\sigma_{1}^{2}} \mathbf{BB}^{\mathrm{T}} + \mathbf{WW}^{\mathrm{T}} + {\sigma_{2}^{2}} \mathbf{I}_{N}, $$

which is much more time-consuming than other matrix multiplication operations.

We devise an acceleration strategy that calculates Σ −1 using formula (14) in the following theorem. The complexity of computing the inverse reduces to \(\mathcal {O}(M^{3}+H^{3})\).

Theorem 1.

Given \(\mathbf {B} \in \mathbb {R}^{N \times M}\), \(\mathbf {W} \in \mathbb {R}^{N \times H}\), and

$$ \boldsymbol{\Sigma} = {\sigma_{2}^{2}} \mathbf{I}_{N} + {\sigma_{1}^{2}} \mathbf{BB}^{\mathrm{T}} + \mathbf{WW}^{\mathrm{T}}. $$

Then

$$ \mathbf{\Sigma}^{-1} = \mathbf{T} - \mathbf{TWS}^{-1} \mathbf{W}^{\mathrm{T}} \mathbf{T}, $$
((14))

where

$$ \mathbf{S} = \mathbf{I}_{H} + \mathbf{W}^{\mathrm{T}} \mathbf{TW}, $$
((15))
$$ \mathbf{T} = \sigma_{2}^{-2}\left(\mathbf{I}_{N} - {\sigma_{1}^{2}} \mathbf{B} \left({\sigma_{2}^{2}}\mathbf{I}_{M} + {\sigma_{1}^{2}} \mathbf{B}^{\mathrm{T}}\mathbf{B}\right)^{-1} \mathbf{B}^{\mathrm{T}}\right). $$
((16))

The proof is provided in the Additional file 1.

Results and discussion

We apply our method to both simulation datasets and yeast eQTL datasets [26] to evaluate its performance. For simplicity, we refer to the proposed model that only considers group-wise associations as Model 1, and the model that considers both individual and group-wise associations as Model 2. For comparision, we select several recent eQTL methods, including LORS [27], MTLasso2G [14], FaST-LMM [11], SET-eQTL [13] and Lasso [3]. The tuning parameters in the selected methods are learned using cross-validation. All experiments are performed on a PC with 2.20 GHz Intel i7 eight-core CPU and 8 GB memory.

Simulation study

We first evaluate whether Model 2 can identify both individual and group-wise associations. We adopt a similar setup for simulation study to that in [27,28] and generate synthetic datasets as follows. 100 SNPs are randomly selected from the yeast eQTL dataset [26]. N gene expression profiles are generated by Z j=β j X+Ξ j+E j (1≤jN), where E jN(0,η I) (η=0.1) denotes Gaussian noise. Ξ j is used to model non-genetic effects, which is drawn from N(0,ρ Λ), where ρ=0.1. Λ is generated by F F T, where \(\mathbf {F}\in \mathbb {R}^{D\times U}\) and F ij N(0,1). U is the number of hidden factors and is set to 10 by default. The association matrix β is shown in the top-left plot in Figure 3. The association strength is 1 for all selected SNPs. There are in total four group-wise associations of different scales. The associations on the diagonal are used to represent individual association signals in cis-regulation.

Figure 3
figure 3

Ground truth of matrix β and linkage weights estimated by Model 2 on simulated data. The x-axis represents traits and y-axis represents SNPs. Normalized absolute values of regression coefficients are used. Darker color implies stronger association.

The remaining three plots in Figure 3 show associations estimated by Model 2. From the figure, we can see that Model 2 well captures both individual and group-wise signals. For comparison, Figure 4 visualizes the association weights estimated by Model 1 and Model 2 when varying the number of hidden variables (M). We observe that for Model 1, when M=20, most of the individual association signals on the diagonal are not captured. As M increases, more individual association signals are detected by Model 1. In contrast, Model 2 recovers both individual and group-wise linkage signals with small M.

Figure 4
figure 4

Association weights estimated by Model 1 and Model 2 on simulated data with different M ’s. The x-axis represents traits and y-axis represents SNPs.

Next, we generate 50 simulated datasets with different signal-to-noise ratios (defined as \(SNR=\sqrt {\frac {Var(\mathbf {\beta }\mathbf {X})}{Var(\Xi + \mathbf {E})}}\)) in the eQTL datasets [27] to compare the performance of the selected methods. Here, we fix H=10,ρ=0.1, and use different η’s to control SNR. For each setting, we report the averaged result from the 50 datasets. For the proposed methods, we use B A+C as the overall associations. Since FaST-LMM needs extra information (e.g., the genetic similarities between individuals) and uses PLINK format, we do not list it here and will compare it on the real data set.

Figure 5 shows the ROC curves of TPR-FPR for performance comparison. The corresponding areas under the TPR-FPR curve and the areas under the precision-recall curve (AUCs) [14] are shown in Figure 6. It can be seen that Model 2 outperforms all alternative methods by a large margin. Model 2 outperforms Model 1 because it considers both group-wise and individual associations. Model 1 outperforms SET-eQTL because it considers confounding factors that is not considered by SET-eQTL. SET-eQTL considers all associations as group-wise, thus it may miss some individual associations. MTLasso2G is comparable to LORS because MTLasso2G considers the group-wise associations while neglecting confounding factors. LORS considers the confounding factors, but does not distinguish individual and group-wise associations. LORS outperforms Lasso since confounding factors are not considered in Lasso.

Figure 5
figure 5

The ROC curve of FPR-TPR on simulated data. The black solid line denotes what random guessing would have achieved.

Figure 6
figure 6

The areas under the precision-recall/FPR-TPR curve (AUCs) of different methods with different signal-to-noise ratios (defined as SNR ) on simulated data.

Shrinkage of C and B×A

As discussed in the Methods, the group-wise associations are encoded in B×A and individual associations are encoded in C. To enforce sparsity on A, B and C, we use Laplace prior on the elements of these matrices. Thus, it is interesting to study the the overall shrinkage of B×A and C. We randomly generate 7 predictors ({ x 1 , x 2 , , x 7}) and 1 response (z) with sample size 100. x i N(0,0.6·I)(i∈[ 1,7]). The response vector was generated with the formula: \(\mathbf {z}=5\cdot (\mathbf {x_{1}}+\mathbf {x_{2}})-3\cdot (\mathbf {x_{3}}+\mathbf {x_{4}})+2\cdot \mathbf {x_{5}}+\tilde {\boldsymbol {\epsilon }}\) and \(\tilde {\boldsymbol {\epsilon }}\in N(\mathbf {0},\mathbf {I})\). Thus, there are two groups of predictors ({ x 1,x 2} and { x 3,x 4}) and one individual predictor x 5. Figure 7 shows the Model 2 shrinkage of coefficients for B×A and C respectively. Each curve represents a coefficient as a function of the scaled parameter \(s=\frac {|\mathbf {B}\times \mathbf {A}|}{\max |\mathbf {B}\times \mathbf {A}|}\) or \(s=\frac {|\mathbf {C}|}{\max |\mathbf {C}|}\). We can see that the two groups of predictors can be identified by B×A as the most important variables, and the individual predictor can be identified by C.

Figure 7
figure 7

Model 2 shrinkage of coefficients for B×A and C respectively. Each curve represents a coefficient as a function of the scaled parameter \(s=\frac {|\mathbf {B}\times \mathbf {A}|}{\max |\mathbf {B}\times \mathbf {A}|}\) or \(s=\frac {|\mathbf {C}|}{\max |\mathbf {C}|}\).

Computational efficiency evaluation

Scalability is an important issue for eQTL study. To evaluate the techniques for speeding up the computational efficiency, we compare the running time with/without these techniques. Figure 8 shows the running time when varying the number of hidden variables (M) and number of traits (N). The results are consistent with the theoretical analysis in Methods part that the time complexity is reduced to \(\mathcal {O}(M^{3}+H^{3})\) from \(\mathcal {O}(N^{3})\) when using the improved method for inverting the covariance matrix. We also observe that Model 2 uses slightly more time than Model 1, since it has more parameters to optimize. However, to get similar performance, Model 1 needs a significantly larger number of hidden variables M. As shown in Figure 8(a), a larger M results in a longer running time. In some cases, Model 2 is actually faster than Model 1. As an example, to obtain the same performance (i.e., AUC), Model 1 needs 60 hidden variables (M), while Model 2 only needs M=20. In this case, from Figure 8(a), we can observe that Model 2 needs less time than Model 1 to obtain the same results.

Figure 8
figure 8

Running time performance on simulated data when varying N and M . When varying N, we fix M, and when varying M, we fix N.

Yeast eQTL study

We apply the proposed methods to a yeast (Saccharomyces cerevisiae) eQTL dataset of 112 yeast segregants generated from a cross of two inbred strains [26]. The dataset originally includes expression profiles of 6229 gene expression traits and genotype profiles of 2956 SNP markers. After removing SNPs with more than 10% missing values and merging consecutive SNPS with high linkage disequilibrium, we obtain 1017 SNPs with distinct genotypes [29]. In total, 4474 expression profiles are selected after removing the ones with missing values. It takes about 5 hours for Model 1, and 3 hours for Model 1 to run to completion. The regularization parameters are set by grid search in {0.1, 1, 10, 50, 100, 500, 1000, 2000}. Specifically, grid search trains the model with each combinations of three regularization parameters in the grid and evaluates their performance (by measuring out-of-sample loss function value) for a two-fold cross validation. Finally, the grid search algorithm outputs the settings that achieved the smallest loss in the validation procedure.

We use hold-out validation to find the optimal number of hidden variables M and H for each model. Specifically, we partition the samples into 2 subsets of equal size. We use one subset as training data and test the learned model using the other subset of samples. By measuring out-of-sample predictions, we can find optimal combination of M and H that avoids over-fitting. For each combination, optimal values for regularization parameters were determined with two-fold cross validation. The loss function values for different {M, H} combinations of Model2 are shown in Figure 9. We find that M = 30 and H = 10 for Model 2 delivers the best overall performance. Similarly, we find that the optimal M and H values for Model 1 are 150 and 10 respectively.

Figure 9
figure 9

Parameter tuning for M and H (Model 2).

The significant associations given by Model 1, Model 2, LORS, MTLasso2G and Lasso are shown in Figure 10. For Model 2, we can clearly see that the estimated matrices C and B×A well capture the non group-wise and group-wise signals respectively. C+B×A and C of Model 2 have stronger cis-regulatory signals and weaker trans-regulatory bands than that of Model 1, LORS, and Lasso. C of Model 2 has the weakest trans-regulatory bands. LORS has weaker trans-regulatory bands than Lasso since it considers confounding factors. With more hidden variables (larger M), Model 1 obtains stronger cis-regulatory signals.

Figure 10
figure 10

Significant associations discovered by different methods in yeast. The top 4500 associations ranked by abs(weight) are shown in each plot unless otherwise noted. The x-axis represents SNPs and y-axis represents genes (traits). Both SNPs and genes are arranged by their locations in the genome. C+B×A and C of Model 2 have stronger cis-regulatory signals and weaker trans-regulatory bands than Model 1, LORS, and Lasso. C of Model 2 has weaker trans-regulatory bands than others. LORS has weaker trans-regulatory bands than Lasso since it considers confounding factors.

cis- and trans-enrichment analysis

In total, the proposed two methods detect about 6000 associations with non-zero weight values (B×A for Model 1 and C+B×A for Model 2). We estimate their FDR values by following the method proposed in [27]. With FDR ≤ 0.01, both models obtain about 4500 associations. The visualization of significant associations detected by different methods is provided in Additional file 1.

We apply cis- and trans-enrichment analysis on the discovered associations. In particular, we follow the standard cis-enrichment analysis [30,31] to compare the performance of two competing models. The intuition behind cis-enrichment analysis is that more cis-acting SNPs are expected than trans-acting SNPs. A two-step procedure is used in the cis-enrichment analysis [30]: (1) for each model, we apply a one-tailed Mann-Whitney test on each SNP to test the null hypothesis that the model ranks its cis hypotheses (we use <500 bp for yeast) no better than its trans hypotheses, (2) for each pair of models compared, we perform a two-tailed paired Wilcoxon sign-rank test on the p-values obtained from the previous step. The null hypothesis is that the median difference of the p-values in the Mann-Whitney test for each SNP is zero. The trans-enrichment is implemented using a similar strategy as in [32], in which genes regulated by transcription factorsc are used as trans-acting signals.

The results of pairwise comparison of selected models are shown in Table 2. A p-value shows how significant a method on the left column outperforms a method in the top row in terms of cis-enrichment or trans-enrichment. We observe that the proposed Model 2 has significantly better cis-enrichment scores than other methods. For trans-enrichment, Model 2 is the best, and FaST-LMM comes in second. This is because both Model 2 and FaST-LMM consider confounding factors (FaST-LMM considers confounders from population structure) and joint effects of SNPs, but only Model 2 considers grouping of genes. Model 1 has poor performance because a larger M may be needed for Model 1 to capture those individual associations.

Table 2 Pairwise comparison of different models using cis -enrichment and trans -enrichment in yeast

Reproducibility of trans regulatory hotspots between studies

We also evaluate the consistency of calling eQTL hotspots between two independent glucose yeast datasets [33]. The glucose environment from Smith et al. [33] shares a common set of segregants. It includes 5493 probes measured in 109 segregates. Since our algorithm aims at finding group-wise associations, we focus on the consistency of regulatory hotspots.

We examine the reproducibility of trans regulatory hotspots based on the following criteria [18,19,27]. For each SNP, we count the number of associated genes from the detected SNP-gene associations. We use this number as the regulatory degree of each SNP. For Model2, LORS, and Lasso, all SNP-Gene pairs with non-zero association weights are defined as associations. Note that Model2 uses B A+C as the overall associations. For FaST-LMM, SNP-Gene pairs with a q-value < 0.001 are defined as associations. Note that we also tried different cutoffs for FaST-LMM (from 0.01 to 0.001), the results are similar. SNPs with large regulatory degrees are often referred to as hotspots. We sort SNPs by the extent of trans regulation (regulatory degrees) in a descending order. We denote the sorted SNPs lists as S 1 and S 2 for the two yeast datasets. Let \({S_{1}^{T}}\) and \({S_{2}^{T}}\) be the top T SNPs in the sorted SNP lists. The trans calling consistency of detected hotspots is defined as \(\frac {|{S_{1}^{T}}\bigcap {S_{2}^{T}}|}{T}\).

Figure 11 compares the reproducibility of trans regulatory hotspots given by different studies. It can be seen that the proposed Model2 gives much higher consistency than any other competitors do. In particular, the consistency of trans hotspots suggests the superiority of Model2 in identifying hotspots that are likely to have a true genetic underpinning.

Figure 11
figure 11

Consistency of detected eQTL hotspots between two independent yeast eQTL datasets.

Gene ontology enrichment analysis

As discussed in Methods, hidden variables y in the middle layer may model the joint effect of SNPs that have influence on a group of genes. To better understand the learned model, we look for correlations between a set of genes associated with a hidden variable and GO categories (Biological Process Ontology) [34]. In particular, for each gene set G, we identify the GO category whose set of genes is most correlated with G. We measure the correlation by a p-value determined by the Fisher’s exact test. Since multiple gene sets G need to be examined, the raw p-values need to be calibrated because of the multiple testing problem [35]. To compute the calibrated p-values for each gene set G, we perform a randomization test, wherein we apply the same test to randomly created gene sets that have the same number of genes as G. Specifically, the enrichment test is performed using DAVID [29]. And gene sets with calibrated p-values less than 0.01 are considered as significantly enriched.

The results from Model 2 are reported in Table 3. Each row of Table 3 represents the gene set associated with a hidden variable. All of these detected gene sets are significantly enriched in certain GO categories. The significantly enriched gene sets of Model 1 are included in Additional file 1.

Table 3 Summary of all detected groups of genes from Model 2 on yeast data

For comparison, we visualize the number of SNPs and genes in each group-wise association in Figure 12. We observe that 90 out of 150 gene sets reported by Model 1 are significantly enriched, and all 30 gene sets reported by Model 2 are significantly enriched (GOA results of Model 1 are reported in Additional file 2). This indicates that Model 2 is able to detect group-wise linkages more precisely than Model 1. We also study the hotspots detected by LORS, which affect >10 gene traits [28]. Specifically, we delve into the top 15 hotspots detected by LORS (ranking by number of associated genes for each SNP), as listed in Table 4. We can see that only 9 out of 15 top ranked hotspots are significantly enriched.

Figure 12
figure 12

Number of SNPs and genes in each group-wise association detected by Model 1 and Model 2 in yeast. x-axis is the associated SNP-Gene Group ID, y-axis is the number of SNPs or genes in the group. The red line denotes the number of SNPs in the group and the bar denotes number of genes in the group, among which the green bar indicates that the group of genes is significantly enriched by some GO categrory, while the red bar indicates that the groups of genes is not significantly enriched.

Table 4 Summary of the top 15 detected hotspots by LORS

Conclusion

A crucial challenge in eQTL study is to understand how multiple SNPs interact with each other to jointly affect the expression level of genes. In this paper, we propose a sparse graphical model to identify novel group-wise eQTL associations. The proposed model can also take into account potential confounding factors and individual associations. 1-regularization is applied to learn the sparse structure of the graphical model. We also introduce computational techniques to make this approach suitable for large scale studies. Extensive experimental evaluations using both simulated and real datasets demonstrate that the proposed methods can effectively capture both individual and group-wise signals and significantly outperform the state-of-the-art eQTL mapping methods.

Endnotes

a The software is implemented in both C++ and matlab, and publicly available at http://www.cs.unc.edu/~weicheng/Group-Wise-EQTL.zip.

b For example, 0, 1, 2 may encode the homozygous major allele, heterozygous allele, and homozygous minor allele, respectively.

c http://www.yeastract.com/download.php.