# Structured regularization for conditional Gaussian graphical models


## Abstract

Conditional Gaussian graphical models are a reparametrization of the multivariate linear regression model which explicitly exhibits (i) the partial covariances between the predictors and the responses, and (ii) the partial covariances between the responses themselves. Such models are particularly suitable for interpretability since partial covariances describe direct relationships between variables. In this framework, we propose a regularization scheme to enhance the learning strategy of the model by driving the selection of the relevant input features by prior structural information. It comes with an efficient alternating optimization procedure which is guaranteed to converge to the global minimum. On top of showing competitive performance on artificial and real datasets, our method demonstrates capabilities for fine interpretation, as illustrated on three high-dimensional datasets from spectroscopy, genetics, and genomics.

### Keywords

Multivariate regression · Regularization · Sparsity · Conditional Gaussian graphical model · Structured elastic net · Regulatory motif · QTL study · Spectroscopy

## 1 Introduction

The multivariate regression model aims at predicting *q* responses from a set of *p* predictors, relying on a training data set \(\left\{ (\mathbf {x}_i,\mathbf {y}_i)\right\} _{i=1,\dots ,n}\):

The ordinary least squares estimator amounts to performing *q* independent regressions, each column \(\mathbf {B}_j\) describing the weights associating the *p* predictors to the *j*th response. In the \(n<p\) setup, however, these estimators are not defined.

Mimicking the univariate-output case, multivariate penalized methods aim to regularize the problem by biasing the regression coefficients toward a given feasible set. Sparsity within the set of predictors is usually the most desired feature in the high-dimensional setting, which can be met in the multivariate framework by a straightforward application of the most popular penalty-based methods from the univariate world involving \(\ell _1\)-regularization. In the context of multitask learning, more sophisticated penalties encouraging similarities between parameters across outputs have been proposed. For example, many authors have suggested the use of group-norm penalties to set full rows of \(\mathbf {B}\) to zero (see, e.g., Obozinski et al. 2011; Chiquet et al. 2011). Refinements exist to cope with complex dependency structures, using graph or tree structures (Kim and Xing 2009, 2010).

Importantly, all the aforementioned methods are based on the regularization of the parameter matrix \(\mathbf {B}\), which accounts for both direct and indirect links between the inputs and the outputs. Alternatively, one may ask for sparsity in the direct links only. The distinction between direct and indirect links can be formalized when \((\mathbf {x}_i,\mathbf {y}_i)\) are jointly Gaussian. In this context, by conditioning \(\mathbf {y}_i\) on \(\mathbf {x}_i\), one obtains an expression of \(\mathbf {B}\) that depends on \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}} = \mathbf {R}^{-1}\), where \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) is the partial covariance matrix between \(\mathbf {x}_i\) and \(\mathbf {y}_i\) (see, e.g., Mardia et al. 1979). Note that imposing sparsity on the regression coefficients \(\mathbf {B}\) or on \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) is not equivalent and may therefore lead to different results. Here we choose to impose sparsity on \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\). Indeed, these partial covariances account for the *direct* relationships that exist between the predictors \(\mathbf {x}_i\) and the responses \(\mathbf {y}_i\). Approaches encouraging sparsity on the direct links \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) have been referred to as ‘*conditional Gaussian graphical models*’ (cGGM) or ‘*partial Gaussian graphical models*’ in the recent literature (Sohn and Kim 2012; Yuan and Zhang 2014). In these papers, the authors propose to regularize the cGGM log-likelihood by two \(\ell _1\)-norms acting respectively on the partial covariances between the features and the responses, \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\), and on the partial covariances between the responses themselves, via \(\mathbf {R}^{-1}\).

In the present paper we also consider the direct-link regularization approach for multivariate regression, with the following specifications. First, we typically consider situations where the number of outputs is small compared to the number of predictors and the sample size (while we insist on the fact that the number of predictors may still exceed the sample size). Consequently, no sparsity assumption is required for the partial covariances between outputs. Second, we consider applications where structural information about the effect of the inputs is available. Here the structural information will be embedded in the regularization scheme via an additional regularization term using an application-specific metric, in the same manner as in the ‘structured’ versions of the Elastic-Net (Slawski et al. 2010; Hebiri and Geer 2011; Lorbert et al. 2010), or the quadratic penalty function using the graph Laplacian (Rapaport et al. 2007; Li and Li 2010) proposed in the univariate-output case. Adding a structured regularization term is of importance when the interpretability of the estimated model is as important as its predictive performance. These two specifications (small number of outputs, availability of prior structural information) arise in application fields such as biology, agronomy or health, and three examples will be investigated hereafter.

We show that the resulting penalized likelihood optimization problem is jointly convex and can be solved efficiently using a two-step procedure for which algorithmic convergence guarantees are provided. Penalized criteria for the choice of the regularization parameters are also provided. The importance of embedding structural prior information will be exemplified in the various contexts of spectroscopy, genomic selection and regulatory motif discovery, illustrating how accounting for application-specific structure improves both performance and interpretability. The procedure is available as an R-package called **spring**, available on the R-forge.

The outline of the paper is as follows. In Sect. 2 we provide background on cGGM and present our regularization scheme. In Sect. 3, we develop an efficient optimization strategy in order to minimize the associated criterion. A paragraph also addresses the model selection issue. Section 4 is dedicated to illustrative simulation studies. In Sect. 5, we investigate three multivariate data sets: first, we consider an example in spectrometry with the analysis of cookie dough samples; second, the relationships between genetic markers and a series of phenotypes of the plant *Brassica napus* is addressed; and third, we investigate the discovery of regulatory motifs of yeast from time-course microarray experiments. In these applications, some specific underlying structuring priors arise, the integration of which within our model is detailed as it is one of the main contributions of our work.

## 2 Model setup

### 2.1 Background on cGGM

Denoting by \(\mathbf {A}_j\) the *j*th column of a matrix \(\mathbf {A}\), the \(\mathrm {vec}\) operator is defined as \(\mathrm {vec}(\mathbf{A}) = (\mathbf {A}_1^T \dots \mathbf {A}_p^T)^T\). The log-likelihood of (4)—which is a conditional likelihood regarding the joint model (2)—is written

On the one hand, \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) describes the *direct* relationships between predictors and responses, the support of which we are looking for in order to select relevant interactions. On the other hand, \(\mathbf {B}\) entails *both direct and indirect* influences, possibly due to some strong correlations between the responses, described by the covariance matrix \(\mathbf {R}\) (or equivalently its inverse \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\)). To provide additional insight into cGGM, Fig. 1 illustrates the relationships between \(\mathbf {B},{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and \(\mathbf {R}\) in two simple sparse scenarios where \(p=40\) and \(q=5\). Scenarios (a) and (b) are discriminated by the presence of a strong structure among the predictors. An important point to grasp in both scenarios is how strong correlations between outcomes can completely “mask” the direct links in the regression coefficients: the stronger the correlation in \(\mathbf {R}\), the less possible it is to distinguish in \(\mathbf {B}\) the non-zero coefficients of \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\). Scenario (c) illustrates the reverse situation of (a), where \(\mathbf {B}\) would be sparse and \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) would not. Choosing between the two is a matter of modeling. Our approach relies on the assumption that direct links are sparse.

In this context, sparse inference strategies have recently been proposed in Sohn and Kim (2012) and Yuan and Zhang (2014) when both \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and \(\mathbf {R}^{-1}\) are assumed to be sparse. In contrast, we discuss in the following a regularization scheme tailored to recover \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and \(\mathbf {R}^{-1}\) in the case when the former is sparse and structured and the latter is relatively small and dense, as in scenario (b) of Fig. 1.

### 2.2 Structured regularization with underlying sparsity

Our regularization scheme starts by considering some structural prior information about the relationships between the coefficients. We are typically thinking of a situation where similar inputs are expected to have similar direct relationships with the outputs. The right panel of Fig. 1 illustrates such a situation, where there exists an extreme neighborhood structure between the predictors. This depicts a pattern that acts along the rows of \(\mathbf {B}\) or \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\), as substantiated by the following Bayesian point of view.

#### 2.2.1 Bayesian interpretation

*g*-prior (see e.g. Brown et al. 1998; Krishna et al. 2009). In the following we consider the case where \(\mathbf {L}\) is given, is built upon some exogenous prior information, and has an arbitrary form. In Sect. 5, we will show how such a matrix can be constructed for a specific application.

#### 2.2.2 Criterion

### 2.3 Connection to other sparse methods

To get more insight into our model and to facilitate connections with existing approaches, we shall write the objective function (7) as a penalized *univariate* regression problem. This amounts to “vectorizing” model (4) with respect to \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\), i.e., to writing the objective as a function of \(({\varvec{\omega }},{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})\) where \({\varvec{\omega }}=\mathrm {vec}({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}})\). This is stated in the following proposition, which can be derived from straightforward matrix algebra, as proved in the “Appendix”. The main interest of this proposition will become clear when deriving the optimization procedure that aims at minimizing (7), as the optimization problem when \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\) is fixed turns into a generalized Elastic-Net problem. Note that we use \(\mathbf {A}^{1/2}\) to denote the square root of a matrix, obtained for instance by a Cholesky factorization in the case of a symmetric positive definite matrix.

**Proposition 1**

*p*-vector \({\varvec{\beta }}\), and the objective (7) can be rewritten

*g*th row of matrix \(\mathbf {B}\).

## 3 Learning

### 3.1 Optimization

In the classical framework of parametrization (1), alternate strategies where optimization is successively performed over \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\) have been proposed by Rothman et al. (2010) and Yin and Li (2011). They come with no guarantee of convergence to the global optimum, since the objective is only bi-convex. In the cGGM framework of Sohn and Kim (2012) and Yuan and Zhang (2014), the optimized criterion is jointly convex, yet no convergence result is provided regarding the optimization procedure proposed by the authors. Here we also consider the alternate strategy, for which the following proposition can be stated:

**Proposition 2**

The proof is given in “Appendix”, and relies on the fact that efficient procedures exist to solve the two convex sub-problems (9a) and (9b).

Because our procedure relies on alternating optimization, it is difficult to give either a global rate of convergence or a complexity bound. Nevertheless, the complexity of each iteration is easy to derive, since it amounts to two well-known problems: the main computational cost in (9a) is due to the SVD of a \(q\times q\) matrix, which costs \(\mathscr {O}(q^3)\). Concerning (9b), it amounts to the resolution of an Elastic-Net problem with \(p\,{\times }\, q\) variables and \(n \times q\) samples. If the final number of nonzero entries in \(\hat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}}\) is *k*, a good implementation with Cholesky update/downdate is roughly in \(\mathscr {O}(n p q^2 k)\) (see, e.g. Bach et al. 2012). Since we typically assume that \(p\ge n\ge q\), the global cost of a single iteration of the alternating scheme is thus \(\mathscr {O}(n p q^2 k)\), and we can theoretically treat problems with large *p* when *k* remains moderate.

Finally, we typically want to compute a series of solutions along the regularization path of Problem (7), i.e. for various values of \((\lambda _1,\lambda _2)\). To this end, we simply choose a grid of penalties \(\varLambda _1 \times \varLambda _2 = \left\{ \lambda _1^{\text {min}}, \dots , \lambda _1^{\text {max}}\right\} \times \left\{ \lambda _2^{\text {min}}, \dots , \lambda _2^{\text {max}}\right\} \). The process is easily distributed across different computer cores, each core corresponding to a value picked in \(\varLambda _2\). Then on each core—i.e. for a fixed \(\lambda _2\in \varLambda _2\)—we cover all the values of \(\lambda _1\in \varLambda _1\), relying on the warm-start strategy frequently used to go through the regularization path of \(\ell _1\)-penalized problems (see Osborne et al. 2000).
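This warm-start sweep can be sketched as follows in Python. This is a minimal illustration, not the **spring** implementation: the Elastic-Net subproblem is solved here by plain proximal gradient (ISTA) rather than the Cholesky-based solver of the paper, and the names `ista_enet` and `path_warm_start` are ours.

```python
import numpy as np

def ista_enet(X, y, lam1, lam2, beta0=None, n_iter=500):
    """ISTA solver for 1/(2n) ||y - X b||^2 + lam1 ||b||_1 + lam2/2 ||b||^2."""
    n, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    # step size = inverse Lipschitz constant of the smooth part's gradient
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 / n + lam2)
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n + lam2 * beta
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam1, 0.0)  # soft-threshold
    return beta

def path_warm_start(X, y, lambdas1, lam2):
    """Sweep lambda1 from largest to smallest, warm-starting each fit
    with the solution obtained for the previous (larger) lambda1."""
    beta, path = None, []
    for lam1 in sorted(lambdas1, reverse=True):
        beta = ista_enet(X, y, lam1, lam2, beta0=beta)
        path.append((lam1, beta.copy()))
    return path
```

Starting each fit from the previous solution typically reduces the number of inner iterations substantially when consecutive \(\lambda _1\) values are close, which is what makes covering a whole grid affordable.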

### 3.2 Model selection and parameter tuning

*K*-fold cross-validation is the recommended option (Hesterberg et al. 2008), despite its additional computational cost. Letting \(\kappa : \left\{ 1,\dots ,n\right\} \rightarrow \left\{ 1,\dots ,K\right\} \) be the function indexing the fold to which observation *i* is allocated, the CV-choices for \((\lambda _1^\text {cv},\lambda _2^\text {cv})\) are the ones that minimize

Information criteria provide a faster alternative that is well suited when *n* is not too small compared to the problem dimension. For penalized methods, a general form for various information criteria is expressed as a function of the likelihood *L* (defined by (5) here) and the effective degrees of freedom:

**Proposition 3**

Note that we can compute this expression at no additional cost, relying on computations already made during the optimization process.

## 4 Simulation studies

In this section, we would like to illustrate the new features of our proposal compared to several baselines in well controlled settings. To this end, we perform three simulation studies to evaluate (i) the gain brought by the estimation of the residual covariance \(\mathbf {R}\), (ii) the gain brought by the inclusion of informative structure on the predictors via \(\mathbf {L}\) and (iii) the behavior of the method in presence of both residual covariance and structure.

### 4.1 Implementation details

In our experiments, performances are compared with well-established regularization methods whose implementation is easily accessible: the LASSO (Tibshirani 1996), the fused-LASSO (Tibshirani et al. 2005), the multitask group-LASSO, MRCE (Rothman et al. 2010), the Elastic-Net (Zou and Hastie 2005) and the Structured Elastic-Net (Slawski et al. 2010). The LASSO and group-LASSO are fitted with the R-package **glmnet** (Friedman et al. 2010), the fused-LASSO with the R-package **FusedLasso** (Hoefling 2010) and MRCE with Rothman et al. (2010)’s package. All other methods are fitted using our own code. When applied to multiple outcomes, all univariate methods (LASSO, fused-LASSO, MRCE, Elastic-Net and Structured Elastic-Net) were run on each output separately. The tuning parameter was selected by a cross-validation step per output. Our own procedure is available as an R-package called **spring**, distributed on the R-forge.^{1} As such, we sometimes refer to our method as ‘SPRING’ in the simulation part.

#### 4.1.1 Data generation

Artificial datasets are generated according to the multivariate regression model (1). We assume that the decomposition \(\mathbf {B}= {\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}} {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1} = {\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\mathbf {R}\) holds for the regression coefficients. We control the sparsity pattern of \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) by arranging non-null entries according to a possible structure of the predictors along the rows of \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\). We always use uncorrelated Gaussian predictors \(\mathbf {x}_i\sim \mathscr {N}(\mathbf {0},\mathbf {I})\) in order not to excessively favor the methods that take this structure into account. The strength of the relationships between the outputs is tuned by the covariance matrix \(\mathbf {R}\). We measure the performance of the learning procedures by the prediction error (PE), estimated using a large test set of observations generated according to the true model. When relevant, the mean squared error (MSE) of the regression coefficients \(\mathbf {B}\) is also presented; for conciseness, it is omitted when it shows results quantitatively identical to PE.
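The generation scheme above can be sketched as follows, under assumed settings: the function name `make_data`, the choice of a Toeplitz residual covariance and the row-wise sparsity pattern are ours, while the decomposition \(\mathbf {B}= {\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\mathbf {R}\) follows the document's convention.

```python
import numpy as np

def make_data(n, p, q, tau=0.5, k=10, seed=0):
    """Simulate from Y = X B + E with B = Omega_xy R (document's convention)."""
    rng = np.random.default_rng(seed)
    # Toeplitz residual covariance R_ij = tau^|i-j| (one possible choice)
    R = tau ** np.abs(np.subtract.outer(np.arange(q), np.arange(q)))
    # sparse direct effects: exactly k non-null rows of Omega_xy
    Omega_xy = np.zeros((p, q))
    rows = rng.choice(p, size=k, replace=False)
    Omega_xy[rows] = rng.standard_normal((k, q))
    B = Omega_xy @ R
    X = rng.standard_normal((n, p))               # uncorrelated Gaussian predictors
    E = rng.multivariate_normal(np.zeros(q), R, size=n)  # correlated residuals
    Y = X @ B + E
    return X, Y, B, Omega_xy, R
```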

### 4.2 Influence of covariance between outcomes

The covariance matrix \(\mathbf {R}\) has a Toeplitz shape:^{2} one has \(\mathbf {R}_{ij} = \tau ^{|i-j|}\), for \(i,j=1,\dots ,q\). We consider three scenarios tuned by \(\tau \in \left\{ .1,.5,.9\right\} \), corresponding to an increasing correlation between the outcomes that eventually makes the cGGM more relevant. These settings have been used to generate panel (a) of Fig. 1. For each covariance scenario, we generate a training set of size \(n=50\) and a test set of size \(n=1000\). We assess the performance of SPRING (taking \(\mathbf {L}= \mathbf {I}\)) by comparison with three baselines: (i) the LASSO, (ii) the \(\ell _1/\ell _2\) multitask group-LASSO and (iii) SPRING with known covariance matrix \(\mathbf {R}\). As this corresponds to the best fit we can obtain with our proposal, we call this variant the “oracle” mode of SPRING. The final estimators are obtained by fivefold cross-validation on the training set. Figure 2 gives the boxplots of PE obtained for 100 replicates. As expected, the advantage of taking the covariance into account becomes more and more important for maintaining a low PE as \(\tau \) increases. When correlation is low, the LASSO dominates the group-LASSO; it is the other way around in the high-correlation setup, where the latter takes full advantage of its grouping effect along the outcomes. Still, our proposal remains significantly better as soon as \(\tau \) is substantial enough. We also note that our iterative algorithm does a good job, since SPRING remains close to its oracle variant.

### 4.3 Structure integration and robustness

In the univariate case, \({\varvec{\beta }}\) and \({\varvec{\omega }}\) are *p*-size vectors sharing the same sparsity pattern, such that \({\varvec{\beta }}= -{\varvec{\omega }}\sigma ^2\). In this situation, SPRING is close to Slawski et al. (2010)’s structured Elastic-Net, except that we hope for a better estimation of the coefficients thanks to the estimation of \(\sigma \). For comparison, we thus draw inspiration from the simulation settings originally used to illustrate the structured Elastic-Net: we set \({\varvec{\omega }}= (\omega _j)_{j=1}^p\), with \(p=100\), so that we observe a sparse vector with two smooth bumps, one positive and the other one negative:

Illustrating the gain brought by structure information on statistical performance

Method | Hamming dist. | MSE | PE |
---|---|---|---|
SPRING (\(\lambda _2=0.01\)) | 14.31 (3.51) | 0.05 (0.01) | 30.46 (2.12) |
SPRING (\(\lambda _2=0\)) | 20.66 (4.18) | 0.31 (0.07) | 55.73 (7.82) |
LASSO | 20.47 (4.08) | 0.31 (0.07) | 55.66 (7.78) |

Table 1 shows the typical gain brought by prior structure knowledge. We generate 100 learning samples of size \(n=100\) with \(\sigma ^2=5\) and report the estimated PE, MSE and Hamming distance (with respect to the true coefficients) for the LASSO and for two versions of SPRING, with (\(\lambda _2 = .01\)) and without (\(\lambda _2 = 0\)) informative prior. Recall that the Hamming distance is \(\sum _j s_j (1-\widehat{s}_j) + \widehat{s}_j (1-s_j)\), where \(s_j\) (resp. \(\widehat{s}_j\)) is 1 if variable *j* has a non-zero coefficient in the true (resp. estimated) model, and 0 otherwise. It quantifies the distance between the true and estimated supports. Incorporating relevant structural information leads to a dramatic improvement in prediction but also in the estimation of the vector of parameters. As expected, univariate SPRING with no prior performs like the LASSO.
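For reference, the support Hamming distance recalled above takes a couple of lines to compute (a sketch; the function name `support_hamming` is ours):

```python
import numpy as np

def support_hamming(beta_true, beta_hat):
    """Hamming distance between supports: sum_j s_j(1 - s^_j) + s^_j(1 - s_j)."""
    s = (np.asarray(beta_true) != 0).astype(int)      # true support indicator
    s_hat = (np.asarray(beta_hat) != 0).astype(int)   # estimated support indicator
    return int(np.sum(s * (1 - s_hat) + s_hat * (1 - s)))
```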

^{3} We then apply SPRING using, respectively, a non-informative prior equal to the identity matrix (thus mimicking the Elastic-Net) and a ‘wrong’ prior \(\mathbf {L}\) whose rows and columns remain unswapped. We also try the LASSO, the Elastic-Net and the structured Elastic-Net. All methods are tuned with fivefold cross-validation on the learning set. Table 2 presents the results averaged over 100 runs, both in terms of PE (using a size-1000 test set) and MSE.

Structure integration: performance and robustness

Method | Scenario | MSE | PE |
---|---|---|---|
LASSO | – | .336 (.096) | 58.6 (10.2) |
E-Net (\(\mathbf {L}=\mathbf {I}\)) | – | .340 (.095) | 59 (10.3) |
SPRING (\(\mathbf {L}=\mathbf {I}\)) | – | .358 (.094) | 60.7 (10) |
S. E-net (\(\mathbf {L}=\mathbf {D}^T \mathbf {D}\)) | Unswapped | .163 (.036) | 41.3 (4.08) |
S. E-net (\(\mathbf {L}=\mathbf {D}^T \mathbf {D}\)) | Swapped | .352 (.107) | 60.3 (11.42) |
SPRING (\(\mathbf {L}=\mathbf {D}^T \mathbf {D}\)) | Unswapped | .062 (.022) | 31.4 (2.99) |
SPRING (\(\mathbf {L}=\mathbf {D}^T \mathbf {D}\)) | Swapped | .378 (.123) | 62.9 (13.15) |

As expected, the methods that do not integrate any structural information (LASSO, Elastic-Net and SPRING with \(\mathbf {L}=\mathbf {I}\)) are not affected by the permutation, and we avoid these redundancies in Table 2 to save space. Overall, these three methods share similar performance both in terms of PE and MSE. When the prior structure is relevant, SPRING, and to a lesser extent the structured Elastic-Net, clearly outperform the other competitors. Surprisingly, this is particularly true in terms of MSE, where SPRING also dominates the structured Elastic-Net that works with the same information. This means that the estimation of the variance also helped in the inference process. Finally, these results essentially support the robustness of the structured methods which are not much altered when using a wrong prior specification.

### 4.4 Assessing both covariance and structure estimation

This last simulation is made under a complete setting, including a structure between the predictors and correlations between the outcomes, just like in Fig. 1, panel (b), with \(p=40\) and \(q=5\). The \(\mathbf {R}\) matrix has a Toeplitz shape with parameter \(\tau = 0.75\). The \(\mathbf {L}\) matrix is as described in Sect. 4.3. The results are displayed in Fig. 3. As expected, being the only method accounting for both the structure and the covariance, SPRING with structure performs well. Not accounting for the structure in SPRING (but keeping the covariance between outcomes) degrades the performance, which still remains comparable to that of the structured Elastic-net (which does account for the structure).

## 5 Application studies

In this section the flexibility of our proposal is illustrated by investigating three multivariate data problems from various contexts, namely spectroscopy, genetics and genomics, where we insist on the construction of the structuring matrix \(\mathbf {L}\).

### 5.1 Near-infrared spectroscopy of cookie dough pieces

#### 5.1.1 Context

In Near-Infrared (NIR) spectroscopy, one aims to predict one or several quantitative variables from the NIR spectrum of a given sample. Each sampled spectrum is a curve that represents the level of reflectance along the NIR region, that is, wavelengths from 800 to 2500 nanometers (nm). The quantitative variables are typically related to the chemical composition of the sample. The problem is then to select the most predictive region of the spectrum, i.e. some peaks that show good capabilities for predicting the response variable(s). This is known as a “calibration problem” in Statistics. The NIR technique is used in fields as diverse as agronomy, astronomy or pharmacology. In such experiments, one is likely to encounter very strong correlations and structure along the predictors. In this perspective, Hans (2011) proposes to apply the Elastic-Net, which is known to select groups of correlated predictors simultaneously. However, it is not adapted to the prediction of several responses simultaneously. In Brown et al. (2001), an interesting wavelet regression model with Bayesian inference is introduced that fits the multivariate regression framework, as does our proposal.

#### 5.1.2 Description of the dataset

The data are available in the **fds** R package. After data pretreatments as in Brown et al. (2001), we have \(n=39\) dough pieces in the training set: each sample consists in an NIR spectrum with \(p=256\) points measured from 1380 to 2400 nm (spaced by 4 nm), and in four quantitative variables that describe the percentages of fat, sucrose, flour and water in the piece of dough.

Prediction error for the cookie dough data

Method | Fat | Sucrose | Flour | Water |
---|---|---|---|---|
Stepwise MLR | . | 1.188 | .722 | .221 |
Decision theory | .076 | .566 | .265 | .176 |
PLS | .151 | .583 | .375 | .105 |
PCR | .160 | .614 | .388 | .106 |
Wavelet regression | .058 | .819 | .457 | .080 |
LASSO | . | .853 | .370 | .088 |
Fused-LASSO | .106 | 1.93 | .378 | .231 |
Group-LASSO | .127 | .918 | .467 | .102 |
MRCE | .151 | .821 | .321 | .081 |
Structured E-net | . | .596 | .363 | .082 |
SPRING (CV) | .065 | .397 | . | .083 |
SPRING (BIC) | .048 | . | .243 | . |

#### 5.1.3 Structure specification

We would like to account for the neighborhood structure between the predictors, which is obvious in the context of NIR spectroscopy: since spectra are continuous curves, a smooth neighborhood prior will encourage predictors to be selected by “wave”, which seems more satisfactory than isolated peaks. Thus, we naturally define \(\mathbf {L}\) by means of the first forward difference operator (11). We also tested higher orders of the operator to induce a stronger smoothing effect, but they do not lead to dramatic changes in terms of PE, so we omit them. Order *k* is simply obtained by powering the first-order matrix. Such techniques have been studied in a structured \(\ell _1\)-penalty framework in Kim et al. (2009), Tibshirani and Taylor (2011), where they are known as *trend filtering*. Our approach is different, though: it enters a multivariate framework and is based on a smooth \(\ell _2\)-penalty coupled with the \(\ell _1\)-penalty for selection.
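As an illustration, the first forward difference operator and the resulting structuring matrix can be built as follows. This is a sketch: we interpret “powering” as taking matrix powers of \(\mathbf {L}_1 = \mathbf {D}^T\mathbf {D}\), and the function names are ours.

```python
import numpy as np

def first_diff(p):
    """First forward difference operator D: (D beta)_j = beta_{j+1} - beta_j."""
    D = np.zeros((p - 1, p))
    idx = np.arange(p - 1)
    D[idx, idx] = -1.0
    D[idx, idx + 1] = 1.0
    return D

def structure_matrix(p, order=1):
    """Structuring matrix L = (D^T D)^order; larger orders smooth more."""
    D = first_diff(p)
    return np.linalg.matrix_power(D.T @ D, order)
```

With this choice, the quadratic penalty \({\varvec{\beta }}^T\mathbf {L}{\varvec{\beta }} = \Vert \mathbf {D}{\varvec{\beta }}\Vert ^2\) is the sum of squared first differences, which shrinks neighboring coefficients toward each other.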

#### 5.1.4 Results

The predictive performance of a series of regression techniques is compared in Table 3, evaluated on the same test set.

Figure 4 shows the regression coefficients adjusted with the penalized regression techniques: apart from the LASSO, which selects very isolated (and unstable) wavelengths, the non-zero regression coefficients have quite a wide spread and are therefore hard to interpret. As expected, the multitask group-Lasso activates the same predictors across the responses, which does not pay off in terms of the predictive performance reported in Table 3. The Fused-Lasso detects the most significant regions in the spectrum; however, the piece-wise constant shape of the estimated signal leads to poor predictive performance. On the other hand, the parameters fitted by our approach with BIC are represented in Fig. 5 with, from left to right, the estimators of the regression coefficients \(\mathbf {B}\), of the direct effects \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and of the residual covariance \(\mathbf {R}\). As expected, the regression coefficients show no sparsity pattern, since sparsity was imposed on the direct links in the estimation step. One can observe a strong spatial structure along the spectrum, characterized by waves induced by the smooth first-order difference prior. Concerning a potential structure between the outputs, we identify interesting regions where a strong correlation between the responses induces correlations between the regression coefficients. Consider for instance the wavelengths between 1750 and 2000 nm: the regression parameters related to “dry-flour” are clearly anti-correlated with those related to “sucrose”. Still, we cannot distinguish in \(\hat{\mathbf {B}}\) a direct effect of this region on either the flour or the sucrose composition. Such a distinction is achieved on the middle panel, where the selected direct effects \(\hat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}}\) are plotted: they define sparse predictive regions specific to each response which are well suited for interpretability purposes; in fact, it is now obvious that the 1750–2000 nm region is rather linked to sucrose.

### 5.2 Multi-trait genomic selection in Brassica napus

#### 5.2.1 Context

Estimated prediction error for the *Brassica napus* data for all traits analyzed jointly (standard error in parentheses)

Method | surv92 | surv93 | surv94 | surv97 | surv99 | flower0 | flower4 | flower8 |
---|---|---|---|---|---|---|---|---|
LASSO | .730 (.011) | .977 (.009) | .943 (.010) | .947 (.009) | .916 (.010) | .609 (.011) | .501 (.011) | .744 (.011) |
S. Enet | . | .987 (.009) | .941 (.011) | .945 (.009) | .911 (.010) | .577 (.011) | .478 (.010) | .727 (.012) |
MRCE | .759 (.010) | . | .917 (.006) | . | .926 (.006) | .591 (.011) | .479 (.011) | .736 (.011) |
group-lasso | .748 (.013) | .995 (.012) | .892 (.013) | .939 (.013) | .906 (.017) | .603 (.017) | .504 (.015) | .717 (.013) |
SPRING | .724 (.010) | .948 (.008) | . | .940 (.006) | . | . | . | . |

#### 5.2.2 Description of the dataset

We consider the *Brassica napus* dataset described in Ferreira et al. (1995) and Kole et al. (2002). Data consists in \(n=103\) double-haploid lines derived from 2 parent cultivars, ‘Stellar’ and ‘Major’, on which \(p = 300\) genetic markers and \(q=8\) traits (responses) were recorded. Each marker is a 0/1 covariate with \(x_i^j=0\) if line *i* has the ‘Stellar’ allele at marker *j*, and \(x_i^j=1\) otherwise. Traits included are percent winter survival for 1992, 1993, 1994, 1997 and 1999 (surv92, surv93, surv94, surv97, surv99, respectively), and days to flowering after no vernalization (flower0), 4 weeks vernalization (flower4) or 8 weeks vernalization (flower8).

#### 5.2.3 Structure specification

^{4}The covariance matrix \(\mathbf {L}^{-1}\) can hence be defined as \(\mathbf {L}_{ij}^{-1} = \rho ^{d_{ij}}\). Moreover, assuming recombination events are independent between \(M_1\) and \(M_2\) on the one hand, and \(M_2\) and \(M_3\) on the other hand, one has \(d_{13} = d_{12} + d_{23}\) and matrix \(\mathbf {L}^{-1}\) exhibits an inhomogeneous AR(1) profile. As a consequence, \(\mathbf {L}\) is tridiagonal with general elements
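Whatever the exact closed form of these elements, the tridiagonal claim is easy to check numerically. The sketch below uses hypothetical marker positions, and the function name `map_precision` is ours: since distances along a genetic map are additive (\(d_{13} = d_{12} + d_{23}\)), the covariance \(\rho ^{d_{ij}}\) has a Markov (inhomogeneous AR(1)) profile and its inverse vanishes outside the tridiagonal band.

```python
import numpy as np

def map_precision(positions, rho=0.8):
    """Build Sigma_ij = rho^{d_ij} from marker positions along a chromosome
    and return L = Sigma^{-1}, expected to be tridiagonal."""
    pos = np.asarray(positions, dtype=float)
    d = np.abs(np.subtract.outer(pos, pos))  # additive genetic distances
    Sigma = rho ** d
    return np.linalg.inv(Sigma)

L = map_precision([0.0, 1.2, 3.5, 4.1, 7.0])
# keep only the tridiagonal band and inspect what remains outside it
outside_band = L - np.triu(np.tril(L, 1), -1)
print(np.max(np.abs(outside_band)))  # numerically zero
```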

#### 5.2.4 Results

Estimated prediction error for the *Brassica napus* data for the flowering traits analyzed separately (standard error in parentheses)

Method | surv92 | surv93 | surv94 | surv97 | surv99 | flower0 | flower4 | flower8 |
---|---|---|---|---|---|---|---|---|
group-lasso | .783 (.012) | . | .907 (.011) | .950 (.010) | .911 (.011) | .597 (.019) | .536 (.016) | .757 (.022) |
SPRING | . | .986 (.010) | . | . | . | . | . | . |

As suggested by a reviewer, we ran the analysis on the survival and flowering responses separately. Indeed, merging the two subsets of traits may hamper the performance of the group Lasso, as it forces the active covariates to be the same for all responses. The results may also change for SPRING, as the correlations between the two subsets of traits are not accounted for when they are analyzed separately. The other (univariate) methods are not affected by this change. The results are summarized in Table 5. No dramatic improvement or degradation is observed for either method, and SPRING still displays better results.

### 5.3 Selecting regulatory motifs from multiple microarrays

#### 5.3.1 Context

In genomics, the expression of genes is initiated by transcription factors that bind to the DNA upstream of the coding regions, in so-called *regulatory regions*. This binding occurs when a given factor recognizes a certain (small) sequence called a *regulatory motif*. Genes hosting the same regulatory motif will be jointly expressed under certain conditions. As the binding relies on chemical affinity, some degeneracy can be tolerated in the motif definition, and motifs that differ only by small variations may share the same functional properties (see, e.g., Lajoie et al. 2012).

We are interested in the detection of such regulatory motifs, whose presence controls the gene expression profile. To this end, we try to establish a relationship between the expression levels of all genes across a series of conditions and the content of their respective regulatory regions in terms of motifs. In this context, we expect (i) the set of influential motifs to be small for each condition, (ii) the influential motifs for a given condition to be degenerate versions of each other, and (iii) the expression under similar conditions to be controlled by the same motifs.

#### 5.3.2 Description of the dataset

We use the microarray expression data of Gasch et al. (2000), collected on yeast (*Saccharomyces cerevisiae*). Among these assays, we consider 12 time-course experiments profiling \(n=5883\) genes under various environmental changes, as listed in Table 6. These expression sets form 12 potential response matrices \(\mathbf {Y}\), whose numbers of columns correspond to the numbers of time points.

Time-course data from Gasch et al. (2000) considered for regulatory motif discovery

Experiment | # time points | # motifs (\(k=7\)) | # motifs (\(k=8\)) | # motifs (\(k=9\)) |
---|---|---|---|---|
Heat shock | 8 | 30 | 68 | 43 |
Shift from 37 to \(25\,^\circ \)C | 5 | 3 | 11 | 33 |
Mild heat shock | 4 | 24 | 13 | 23 |
Response to \(\text {H}_2 \text {O}_2\) | 10 | 15 | 10 | 21 |
Menadione exposure | 9 | 16 | 1 | 7 |
DDT exposure 1 | 8 | 15 | 10 | 30 |
DDT exposure 2 | 7 | 11 | 33 | 21 |
Diamide treatment | 8 | 45 | 25 | 35 |
Hyperosmotic shock | 7 | 36 | 24 | 15 |
Hypo-osmotic shock | 5 | 20 | 8 | 29 |
Amino-acid starvation | 5 | 47 | 30 | 39 |
Diauxic shift | 7 | 16 | 14 | 20 |
Total number of unique motifs inferred | | 87 | 82 | 72 |

Comparison of SPRING-selected motifs with MotifDB patterns. Each cell corresponds to a MotifDB pattern (top) compared to a set of aligned SPRING motifs of size 7 (bottom).

Concerning the predictors, we consider the set of all motifs of length *k* formed with the four nucleotides, that is, \(\mathscr {M}_k = \left\{ A,C,G,T\right\} ^k\). There are \(p = |\mathscr {M}_k| = 4^k\) such motifs. Unless otherwise stated, the motifs in \(\mathscr {M}_k\) are listed in lexicographical order, e.g., when \(k=2\): \(AA,AC,AG,AT,CA,CC,\dots \) Then, the \(n\times p\) matrix of predictors \(\mathbf {X}\) is filled such that \(X_{ij}\) equals the occurrence count of motif *j* in the regulatory region of gene *i*.
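This construction of \(\mathbf {X}\) can be sketched as follows; a minimal Python version, where the function name and the handling of ambiguous bases are our own choices rather than part of the original pipeline:

```python
from itertools import product

import numpy as np

def motif_count_matrix(regions, k):
    """Build the n x 4^k predictor matrix X of motif occurrence counts.

    `regions` is a list of n upstream DNA sequences (strings over A, C,
    G, T); columns follow the lexicographical order AA..., AC..., ...
    """
    motifs = ["".join(m) for m in product("ACGT", repeat=k)]
    index = {m: j for j, m in enumerate(motifs)}
    X = np.zeros((len(regions), len(motifs)), dtype=int)
    for i, seq in enumerate(regions):
        for s in range(len(seq) - k + 1):      # slide a window of width k
            j = index.get(seq[s:s + k])
            if j is not None:                  # skip windows with ambiguous bases
                X[i, j] += 1
    return X, motifs
```

Occurrences are counted with overlaps, so `motif_count_matrix(["AAAA"], 2)` records three occurrences of `AA`.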

#### 5.3.3 Structure specification

*distance matrix* \(\mathbf {D}^{k,\ell }=(d^{k,\ell }_{ab})_{a,b\in \mathscr {M}_k}\) as

#### 5.3.4 Results

We apply our methodology to candidate motifs from \(\mathscr {M}_7,\mathscr {M}_8\) and \(\mathscr {M}_9\), which results in three lists of putative motifs having a direct effect on gene expression. Because the number of potential predictors is very large, and the matrix \(\mathbf {X}\) becomes sparser as *k* increases, we first perform a screening step that keeps the 5000 motifs with the highest marginal correlations with \(\mathbf {Y}\). Second, SPRING is applied to each of the twelve time-course experiments described in Table 6. The selection of \((\lambda _1,\lambda _2)\) is performed on a grid using the BIC (10). In the end, the three lists corresponding to \(k=7,8,9\) include 87, 82 and 72 motifs respectively, for which at least one coefficient in the associated row \(\widehat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}}(j,\cdot )\) was found non-null in at least one of the twelve experiments, as detailed in Table 6.
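A minimal sketch of such a screening step, assuming each predictor is scored by its largest absolute marginal correlation over the response columns (the aggregation rule across responses is our assumption, not spelled out above):

```python
import numpy as np

def screen_predictors(X, Y, keep=5000):
    """Return the (sorted) indices of the `keep` columns of X that are
    most marginally correlated with the responses Y.

    Each predictor is scored by its largest absolute Pearson correlation
    over the columns of Y; `keep` defaults to the 5000 motifs retained
    in the text.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    sx = np.sqrt((Xc ** 2).sum(axis=0))
    sy = np.sqrt((Yc ** 2).sum(axis=0))
    sx[sx == 0] = np.inf                      # constant predictors get score 0
    cor = (Xc.T @ Yc) / np.outer(sx, sy)      # p x q matrix of correlations
    score = np.abs(cor).max(axis=1)           # one score per predictor
    top = np.argsort(score)[::-1][:keep]      # indices of the best predictors
    return np.sort(top)
```

Only the retained columns of \(\mathbf {X}\) are then passed to the structured estimator, which keeps the subsequent optimization tractable.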

To assess the relevance of the selected motifs, we compared them with the MotifDB patterns available in Bioconductor (Shannon 2013), where known transcription-factor binding sites are recorded. There are 453 such reference motifs, with sizes varying from 5 to 23 nucleotides. Consider the case \(k=7\) for instance: among the 87 SPRING motifs, 62 match one MotifDB pattern each, and 25 are clustered into 11 MotifDB patterns, as depicted in Table 7. As seen in this table, the clusters of motifs selected by SPRING correspond to sets of variants of the same pattern. These clusters consist of motifs that are close according to the similarity encoded in the structure matrix \(\mathbf {L}\). In this example, the ability of SPRING to use domain-specific definitions of the structure between the predictors oriented the regression problem toward accounting for motif degeneracy, and helped in selecting motifs that are consistent with known binding sites.

## 6 Discussion

We introduced a general framework for multivariate regression that accounts for correlation between the outcomes and for a possible structure between the predictors, while inducing sparsity on the direct links between the predictors and the outcomes. The whole procedure is available through the R package **spring**. It provides a generic framework that can accommodate a broad variety of structures between the covariates via the matrix \(\mathbf {L}\). This comes at the price of the construction of this matrix, which is specific to each application. A parallel can be made with other generic machine learning approaches such as SVMs, where the kernel has to be tailored to a given application to ensure good performance. Although we recommend that the \(\mathbf {L}\) matrix be carefully designed to fully incorporate the prior knowledge, some typical shapes have been made available in the **spring** package.
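To illustrate what one such typical shape may look like, here is a hedged sketch of a simple choice: the Laplacian of a chain graph over ordered predictors, shifted to be positive definite (an illustration only, not necessarily the exact construction shipped in **spring**):

```python
import numpy as np

def chain_laplacian(p, eps=1e-2):
    """Structure matrix L for predictors ordered along a chain:
    the graph Laplacian of the path graph 1-2-...-p, shifted by
    eps * I so that L is positive definite.
    """
    L = 2.0 * np.eye(p)
    L[0, 0] = L[-1, -1] = 1.0                  # endpoints have degree 1
    idx = np.arange(p - 1)
    L[idx, idx + 1] = -1.0                     # neighbors are connected
    L[idx + 1, idx] = -1.0
    return L + eps * np.eye(p)
```

Such a matrix penalizes differences between coefficients of neighboring predictors, in the spirit of the structured elastic net discussed in the paper.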

The proposed approach is valid for a limited number of outcomes, typically \(q < n\). A natural extension of this work is to consider the case of high-dimensional outputs. This requires also imposing sparsity on the covariance matrix \(\mathbf {R}\) between the responses, most naturally via an additional graphical-Lasso-type penalty term. This can be achieved through a modification of step (9a) in the alternating optimization procedure. This is work in progress.

## Footnotes

- 1.
- 2.
We set \(\mathbf {R}\) to be a correlation matrix in order not to excessively penalize the LASSO or the group-LASSO, which both use the same tuning parameter \(\lambda _1\) across the outcomes (and thus the same variance estimator).

- 3.
We also used the same seed and CV-folds for both scenarios.

- 4.
This value directly arises from the definition of the genetic distance itself.

## Notes

### Acknowledgments

We would like to thank Mathieu Lajoie and Laurent Bréhélin for kindly sharing the dataset from Gasch et al. (2000). We also thank the reviewers for their questions and remarks, which helped us to improve our manuscript. This project was conducted in the framework of the project AMAIZING funded by the French ANR. This work has been partially supported by the GRANT Reg4Sel from the French INRA-SelGen metaprogram.

### References

- Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn. **4**(1), 1–106 (2012)
- Brown, P., Vannucci, M., Fearn, T.: Multivariate Bayesian variable selection and prediction. J. R. Stat. Soc. B **60**(3), 627–641 (1998)
- Brown, P., Fearn, T., Vannucci, M.: Bayesian wavelet regression on curves with applications to a spectroscopic calibration problem. J. Am. Stat. Assoc. **96**, 398–408 (2001)
- Chiquet, J., Grandvalet, Y., Ambroise, C.: Inferring multiple graphical structures. Stat. Comput. **21**(4), 537–553 (2011)
- de los Campos, G., Hickey, J., Pong-Wong, R., Daetwyler, H., Calus, M.: Whole genome regression and prediction methods applied to plant and animal breeding. Genetics **193**(2), 327–345 (2012)
- Efron, B.: The estimation of prediction error: covariance penalties and cross-validation (with discussion). J. Am. Stat. Assoc. **99**, 619–642 (2004)
- Ferreira, M., Satagopan, J., Yandell, B., Williams, P., Osborn, T.: Mapping loci controlling vernalization requirement and flowering time in *Brassica napus*. Theor. Appl. Genet. **90**, 727–732 (1995)
- Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. **33**, 1–22 (2010)
- Gasch, A., Spellman, P., Kao, C., Carmel-Harel, O., Eisen, M.B., Storz, G., Botstein, D., Brown, P.: Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell **11**(12), 4241–4257 (2000)
- Hans, C.: Elastic net regression modeling with the orthant normal prior. J. Am. Stat. Assoc. **106**, 1383–1393 (2011)
- Harville, D.: Matrix Algebra from a Statistician’s Perspective. Springer, New York (1997)
- Hebiri, M., van de Geer, S.: The smooth-lasso and other \(\ell _1 + \ell _2\) penalized methods. Electron. J. Stat. **5**, 1184–1226 (2011)
- Hesterberg, T., Choi, N.M., Meier, L., Fraley, C.: Least angle and \(\ell _{1}\) penalized regression: a review. Stat. Surv. **2**, 61–93 (2008)
- Hoefling, H.: A path algorithm for the fused lasso signal approximator. J. Comput. Graph. Stat. **19**(4), 984–1006 (2010)
- Kim, S., Xing, E.: Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet. **5**(8), e1000587 (2009)
- Kim, S., Xing, E.: Tree-guided group lasso for multi-task regression with structured sparsity. In: Proceedings of the 27th International Conference on Machine Learning, pp. 543–550 (2010)
- Kim, S.J., Koh, K., Boyd, S., Gorinevsky, D.: \(\ell _1\) trend filtering. SIAM Rev. **51**(2), 339–360 (2009)
- Kole, C., Thorman, C., Karlsson, B., Palta, J., Gaffney, P., Yandell, B., Osborn, T.: Comparative mapping of loci controlling winter survival and related traits in oilseed *Brassica rapa* and *B. napus*. Mol. Breed. **1**, 201–210 (2002)
- Krishna, A., Bondell, H., Ghosh, S.: Bayesian variable selection using an adaptive powered correlation prior. J. Stat. Plan. Inference **139**(8), 2665–2674 (2009)
- Lajoie, M., Gascuel, O., Lefort, V., Brehelin, L.: Computational discovery of regulatory elements in a continuous expression space. Genome Biol. **13**(11), R109 (2012). doi:10.1186/gb-2012-13-11-r109
- Li, C., Li, H.: Variable selection and regression analysis for graph-structured covariates with an application to genomics. Ann. Appl. Stat. **4**(3), 1498–1516 (2010)
- Li, X., Panea, C., Wiggins, C., Reinke, V., Leslie, C.: Learning “graph-mer” motifs that predict gene expression trajectories in development. PLoS Comput. Biol. **6**(4), e1000761 (2010)
- Lorbert, A., Eis, D., Kostina, V., Blei, D., Ramadge, P.: Exploiting covariate similarity in sparse regression via the pairwise elastic net. In: Teh, Y.W., Titterington, D.M. (eds.) Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-10), vol. 9, pp. 477–484 (2010)
- Mardia, K., Kent, J., Bibby, J.: Multivariate Analysis. Academic Press, London (1979)
- Marin, J.M., Robert, C.P.: Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer, New York (2007)
- Obozinski, G., Wainwright, M., Jordan, M.: Support union recovery in high-dimensional multivariate regression. Ann. Stat. **39**(1), 1–47 (2011)
- Osborne, B., Fearn, T., Miller, A., Douglas, S.: Application of near infrared reflectance spectroscopy to compositional analysis of biscuits and biscuit doughs. J. Sci. Food Agric. **35**, 99–105 (1984)
- Osborne, M.R., Presnell, B., Turlach, B.A.: On the lasso and its dual. J. Comput. Graph. Stat. **9**(2), 319–337 (2000)
- Park, T., Casella, G.: The Bayesian lasso. J. Am. Stat. Assoc. **103**, 681–686 (2008)
- Rapaport, F., Zinovyev, A., Dutreix, M., Barillot, E., Vert, J.P.: Classification of microarray data using gene networks. BMC Bioinform. **8**, 35 (2007)
- Rothman, A., Levina, E., Zhu, J.: Sparse multivariate regression with covariance estimation. J. Comput. Graph. Stat. **19**(4), 947–962 (2010)
- Shannon, P.: MotifDb: An Annotated Collection of Protein-DNA Binding Sequence Motifs. R package version 1.4.0 (2013)
- Slawski, M., zu Castell, W., Tutz, G.: Feature selection guided by structural information. Ann. Appl. Stat. **4**, 1056–1080 (2010)
- Sohn, K., Kim, S.: Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. JMLR W&CP **22**, 1081–1089 (2012)
- Städler, N., Bühlmann, P., van de Geer, S.: \(\ell _1\)-penalization for mixture regression models. Test **19**(2), 209–256 (2010). doi:10.1007/s11749-010-0197-z
- Stein, C.: Estimation of the mean of a multivariate normal distribution. Ann. Stat. **9**, 1135–1151 (1981)
- Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B **58**, 267–288 (1996)
- Tibshirani, R., Taylor, J.: The solution path of the generalized lasso. Ann. Stat. **39**(3), 1335–1371 (2011). doi:10.1214/11-AOS878
- Tibshirani, R., Taylor, J.: Degrees of freedom in lasso problems. Ann. Stat. **40**, 1198–1232 (2012)
- Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K.: Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. B **67**, 91–108 (2005)
- Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. **109**(3), 475–494 (2001)
- Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. **117**, 387–423 (2009)
- Yin, J., Li, H.: A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann. Appl. Stat. **5**, 2630–2650 (2011)
- Yuan, X.T., Zhang, T.: Partial Gaussian graphical model estimation. IEEE Trans. Inform. Theory **60**(3), 1673–1687 (2014)
- Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. B **67**, 301–320 (2005)