Statistics and Computing, Volume 27, Issue 3, pp 789–804

Structured regularization for conditional Gaussian graphical models

  • Julien Chiquet
  • Tristan Mary-Huard
  • Stéphane Robin

Abstract

Conditional Gaussian graphical models are a reparametrization of the multivariate linear regression model which explicitly exhibits (i) the partial covariances between the predictors and the responses, and (ii) the partial covariances between the responses themselves. Such models are particularly suitable for interpretability since partial covariances describe direct relationships between variables. In this framework, we propose a regularization scheme to enhance the learning strategy of the model by driving the selection of the relevant input features with prior structural information. It comes with an efficient alternating optimization procedure which is guaranteed to converge to the global minimum. On top of showing competitive performance on artificial and real datasets, our method demonstrates capabilities for fine interpretation, as illustrated on three high-dimensional datasets from spectroscopy, genetics, and genomics.

Keywords

Multivariate regression · Regularization · Sparsity · Conditional Gaussian graphical model · Structured elastic net · Regulatory motif · QTL study · Spectroscopy

1 Introduction

Multivariate regression, i.e. regression with multiple response variables, is increasingly used to model high dimensional problems. Compared to its univariate counterpart, the general linear model aims to predict several—say q—responses from a set of p predictors, relying on a training data set \(\left\{ (\mathbf {x}_i,\mathbf {y}_i)\right\} _{i=1,\dots ,n}\):
$$\begin{aligned} \mathbf {y}_i = \mathbf {B}^T \mathbf {x}_i + {\varvec{\varepsilon }}_i, \quad {\varvec{\varepsilon }}_i\sim \mathscr {N}(\mathbf {0},\mathbf {R}), \quad \forall i=1,\dots ,n. \end{aligned}$$
(1)
The \(p\times q\) matrix of regression coefficients \(\mathbf {B}\) and the \(q\times q\) covariance matrix \(\mathbf {R}\) of the Gaussian noise \({\varvec{\varepsilon }}_i\) are unknown. Note that we omit the intercept term and center the data for clarity. Model (1) has been studied by Mardia et al. (1979) in the low dimensional case where both ordinary and generalized least squares estimators of \(\mathbf {B}\) coincide and do not depend on \(\mathbf {R}\). These approaches boil down to performing q independent regressions, each column \(\mathbf {B}_j\) describing the weights associating the p predictors to the jth response. In the \(n<p\) setup however, these estimators are not defined.

Mimicking the univariate-output case, multivariate penalized methods aim to regularize the problem by biasing the regression coefficients toward a given feasible set. Sparsity within the set of predictors is usually the most wanted feature in the high-dimensional setting, which can be met in the multivariate framework by a straightforward application of the most popular penalty-based methods from the univariate world involving \(\ell _1\)-regularization. In the context of multitask learning, more sophisticated penalties encouraging similarities between parameters across outputs have been proposed. For example, many authors have suggested the use of group-norms penalties to set to zero full rows of \(\mathbf {B}\) (see, e.g., Obozinski et al. 2011; Chiquet et al. 2011). Refinements exist to cope with complex dependency structures, using graph or tree structures (Kim and Xing 2009, 2010).

Importantly, all aforementioned methods are based on the regularization of the parameter matrix \(\mathbf {B}\), which accounts for both direct and indirect links between the inputs and the outputs. Alternatively, one may ask for sparsity in the direct links only. The distinction between direct and indirect links can be formalized when \((\mathbf {x}_i,\mathbf {y}_i)\) are jointly Gaussian. In this context, by conditioning \(\mathbf {y}_i\) on \(\mathbf {x}_i\), one obtains an expression of \(\mathbf {B}\) that depends on \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}} = \mathbf {R}^{-1}\), where \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) is the partial covariance matrix between \(\mathbf {x}_i\) and \(\mathbf {y}_i\) (see, e.g. Mardia et al. 1979). Note that imposing sparsity on the regression coefficients \(\mathbf {B}\) or on \({\varvec{\varOmega }}_{xy}\) is not equivalent and may therefore lead to different results. Here we choose to impose sparsity on \({\varvec{\varOmega }}_{xy}\). Indeed, these partial covariances account for the direct relationships that exist between the predictors \(\mathbf {x}_i\) and the responses \(\mathbf {y}_i\). Approaches encouraging sparsity on the direct links \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) have been referred to as ‘conditional Gaussian graphical models’ (cGGM) or ‘partial Gaussian graphical models’ in the recent literature (Sohn and Kim 2012; Yuan and Zhang 2014). In these papers, the authors propose to regularize the cGGM log-likelihood by two \(\ell _1\)-norms acting respectively on the partial covariances between the features and the responses, \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\), and on the partial covariances between the responses themselves, via \(\mathbf {R}^{-1}\).

In the present paper we also consider the direct link regularization approach for multivariate regression, with the following specifications. First, we typically consider situations where the number of outputs is small compared to the number of predictors and the sample size (while we insist on the fact that the number of predictors may still exceed the sample size). Consequently, no sparsity assumption is required for the partial covariances between outputs. Second, we consider applications where structural information about the effect of the inputs is available. Here the structural information will be embedded in the regularization scheme via an additional regularization term using an application-specific metric, in the same manner as in the ‘structured’ versions of the Elastic-net (Slawski et al. 2010; Hebiri and Geer 2011; Lorbert et al. 2010), or the quadratic penalty based on the graph Laplacian (Rapaport et al. 2007; Li and Li 2010) proposed in the univariate-output case. Adding a structured regularization term is of importance when interpretability of the estimated model is as important as its predictive performance. These two specifications (small number of outputs, availability of prior structural information) arise in application fields such as biology, agronomy or health, and three examples will be investigated hereafter.

We show that the resulting penalized likelihood optimization problem is jointly convex and can be solved efficiently using a two-step procedure for which algorithmic convergence guarantees are provided. Penalized criteria for the choice of the regularization parameters are also provided. The importance of embedding structural prior information will be exemplified in the various contexts of spectroscopy, genomic selection and regulatory motif discovery, illustrating how accounting for application-specific structure improves both performance and interpretability. The procedure is implemented in an R-package called spring, available on the R-forge.

The outline of the paper is as follows. In Sect. 2 we provide background on cGGM and present our regularization scheme. In Sect. 3, we develop an efficient optimization strategy in order to minimize the associated criterion. A paragraph also addresses the model selection issue. Section 4 is dedicated to illustrative simulation studies. In Sect. 5, we investigate three multivariate data sets: first, we consider an example in spectrometry with the analysis of cookie dough samples; second, the relationships between genetic markers and a series of phenotypes of the plant Brassica napus is addressed; and third, we investigate the discovery of regulatory motifs of yeast from time-course microarray experiments. In these applications, some specific underlying structuring priors arise, the integration of which within our model is detailed as it is one of the main contributions of our work.

2 Model setup

2.1 Background on cGGM

The statistical framework of cGGM arises from a different parametrization of (1) which fills the gap between multivariate regression and GGM. It extends to the multivariate case the links existing between the linear model, partial correlations and GGM, as depicted for instance in Mardia et al. (1979). It amounts to investigating the joint probability distribution of \((\mathbf {x}_i, \mathbf {y}_i)\) in the Gaussian case, with the following block-wise decomposition of the covariance matrix \({\varvec{\varSigma }}\) and its inverse \({\varvec{\varOmega }}={\varvec{\varSigma }}^{-1}\):
$$\begin{aligned} {\varvec{\varSigma }}= \begin{pmatrix} {\varvec{\varSigma }}_{\mathbf {x}\mathbf {x}} &{}\quad {\varvec{\varSigma }}_{\mathbf {x}\mathbf {y}} \\ {\varvec{\varSigma }}_{\mathbf {y}\mathbf {x}} &{}\quad {\varvec{\varSigma }}_{\mathbf {y}\mathbf {y}} \\ \end{pmatrix},\quad {\varvec{\varOmega }}= \begin{pmatrix} {\varvec{\varOmega }}_{\mathbf {x}\mathbf {x}} &{}\quad {\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}} \\ {\varvec{\varOmega }}_{\mathbf {y}\mathbf {x}} &{}\quad {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}} \\ \end{pmatrix}. \end{aligned}$$
(2)
Back to the distribution of \(\mathbf {y}_i\) conditional on \(\mathbf {x}_i\), multivariate Gaussian analysis shows that, for centered \(\mathbf {x}_i\) and \(\mathbf {y}_i\),
$$\begin{aligned} \mathbf {y}_i | \mathbf {x}_i \sim \mathscr {N}\left( -{\varvec{\varOmega }}^{-1}_{\mathbf {y}\mathbf {y}}{\varvec{\varOmega }}_{\mathbf {y}\mathbf {x}} \mathbf {x}_i, {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1} \right) . \end{aligned}$$
(3)
Fig. 1

Toy examples to illustrate the relationships between \(\mathbf {B}, {\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and \(\mathbf {R}\) in the cGGM (better seen in color): on panel (a), a situation with no particular structure among the predictors; on panel (b), a strong neighborhood structure. For panels (a) and (b), we represent the effect of stronger correlations in \(\mathbf {R}\) on masking the direct links in \(\mathbf {B}\); panel (c) illustrates the reverse situation of (a) where \(\mathbf {B}\) would be sparse and \({\varvec{\varOmega }}_{xy}\) would not. (Color figure online)

This model, associated with the full sample \(\{(\mathbf {x}_i,\mathbf {y}_i)\}_{i=1}^n\), can be written in a matrix form by stacking in rows first the observations of the responses, and then of the predictors, in two data matrices \(\mathbf {Y}\) and \(\mathbf {X}\) with respective sizes \(n\times q\) and \(n\times p\), such that
$$\begin{aligned} \mathbf {Y}= -\mathbf {X}{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}} {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1} + {\varvec{\varepsilon }}, \mathrm {vec}({\varvec{\varepsilon }}) \sim \mathscr {N}\left( \mathbf {0}_{nq}, \mathbf {I}_{n} \otimes {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1} \right) , \end{aligned}$$
(4)
where, for a \(n \times p\) matrix \(\mathbf{A}\), denoting \(\mathbf{A}_j\) its jth column, the \(\mathrm {vec}\) operator is defined as \(\mathrm {vec}(\mathbf{A}) = (\mathbf {A}_1^T \dots \mathbf {A}_p^T)^T\). The log-likelihood of (4)—which is a conditional likelihood regarding the joint model (2)—is written
$$\begin{aligned} \log L({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}},{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}) = \frac{n}{2}\log \left| {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\right| - \frac{1}{2} \mathrm {tr}\left( \left( \mathbf {Y}+ \mathbf {X}{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1} \right) {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\left( \mathbf {Y}+ \mathbf {X}{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1} \right) ^T\right) + \mathrm {cst.} \end{aligned}$$
(5)
Introducing the empirical matrices of covariance \(\mathbf {S}_{\mathbf {y}\mathbf {y}} = n^{-1}\sum _{i=1}^n\mathbf {y}_i \mathbf {y}_i^T\), \(\mathbf {S}_{\mathbf {x}\mathbf {x}} = n^{-1}\sum _{i=1}^n\mathbf {x}_i \mathbf {x}_i^T\), and \(\mathbf {S}_{\mathbf {y}\mathbf {x}} = n^{-1}\sum _{i=1}^n\mathbf {y}_i \mathbf {x}_i^T\), one has
$$\begin{aligned} - \frac{2}{n}\log L({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}},{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})= & {} - \log \left| {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\right| + \mathrm {tr}\left( \mathbf {S}_{\mathbf {y}\mathbf {y}} {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}} \right) \nonumber \\&+\,2\mathrm {tr}\left( \mathbf {S}_{\mathbf {x}\mathbf {y}}{\varvec{\varOmega }}_{\mathbf {y}\mathbf {x}} \right) \nonumber \\&+ \, \mathrm {tr}({\varvec{\varOmega }}_{\mathbf {y}\mathbf {x}}\mathbf {S}_{\mathbf {x}\mathbf {x}} {\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}} {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1}) + \mathrm {cst.} \end{aligned}$$
(6)
We notice by comparing the cGGM (3) to the multivariate regression model (1) that \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1} = \mathbf {R}\text { and } \mathbf {B}= -{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}} {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1}\). Although equivalent to (1), the alternative parametrization (3) shows two important differences with several implications. First, the negative log-likelihood (5) can be shown to be jointly convex in \(({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}},{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})\) (Yuan and Zhang 2014). Minimization problems involving (5) will thus be amenable to a global solution, which facilitates both optimization and theoretical analysis. As such, the conditional negative log-likelihood (5) will serve as a building block for our learning criterion. Second, it unveils an easier interpretation of the relationships between input and output variables, which is especially appealing in modern data settings where the number of potential candidates is huge. Indeed, \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) describes the direct relationships between predictors and responses, the support of which we are looking for to select relevant interactions. On the other hand, \(\mathbf {B}\) entails both direct and indirect influences, possibly due to some strong correlations between the responses, described by the covariance matrix \(\mathbf {R}\) (or equivalently its inverse \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\)). To provide additional insights on the cGGM, Fig. 1 illustrates the relationships between \(\mathbf {B},{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and \(\mathbf {R}\) in two simple sparse scenarios where \(p=40\) and \(q=5\). Scenarios (a) and (b) differ in the presence of a strong structure among the predictors. An important point to grasp in both scenarios is how strong correlations between outcomes can completely “mask” the direct links in the regression coefficients: the stronger the correlation in \(\mathbf {R}\), the harder it is to distinguish in \(\mathbf {B}\) the non-zero coefficients of \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\). Scenario (c) illustrates the reverse situation of (a), where \(\mathbf {B}\) would be sparse and \({\varvec{\varOmega }}_{xy}\) would not. Choosing between the two is a matter of modeling. Our approach relies on the assumption that the direct links are sparse.
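To make the reparametrization concrete, the following R sketch (with arbitrary toy dimensions, not the settings of Fig. 1) builds a sparse \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and a Toeplitz residual covariance \(\mathbf {R}\), recovers \((\mathbf {B},\mathbf {R})\) through the identities above, and shows how a dense \(\mathbf {R}\) spreads the support of the direct links over \(\mathbf {B}\).

```r
## Toy R sketch of the cGGM reparametrization (arbitrary dimensions)
p <- 8; q <- 3
Omega_xy <- matrix(0, p, q)
Omega_xy[cbind(c(1, 4, 7), 1:3)] <- 1           # three direct links only
R        <- toeplitz(0.9^(0:(q - 1)))           # strong residual correlations
Omega_yy <- solve(R)                            # R = Omega_yy^{-1}

B <- -Omega_xy %*% solve(Omega_yy)              # B = -Omega_xy Omega_yy^{-1} = -Omega_xy R

sum(Omega_xy != 0)                              # 3 non-zero direct links
sum(B != 0)                                     # 9 non-zero regression coefficients: masking

## back from (B, R) to the direct links
all.equal(-B %*% solve(R), Omega_xy)            # TRUE
```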

In this context, sparse inference strategies have recently been proposed in Sohn and Kim (2012) and Yuan and Zhang (2014), where both \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and \(\mathbf {R}^{-1}\) are assumed to be sparse. In contrast, we discuss in the following a regularization scheme tailored to recover \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and \(\mathbf {R}^{-1}\) in the case where the former is sparse and structured and the latter is relatively small and dense, as in scenario (b) of Fig. 1.

2.2 Structured regularization with underlying sparsity

Our regularization scheme starts by considering some structural prior information about the relationships between the coefficients. We are typically thinking of a situation where similar inputs are expected to have similar direct relationships with the outputs. The right panel of Fig. 1 represents an illustration of such a situation, where there exists an extreme neighborhood structure between the predictors. This depicts a pattern that acts along the rows of \(\mathbf {B}\) or \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) as substantiated by the following Bayesian point of view.

2.2.1 Bayesian interpretation

Suppose that the similarities between the covariates can be encoded into a \(p \times p\) matrix \(\mathbf {L}\). Our aim is to account for this information when learning the coefficients. The Bayesian framework provides a convenient setup for defining the way the structural information should be accounted for. In the single output case (see, e.g. Marin and Robert 2007), a conjugate prior for \(\beta \) (the univariate version of \(\mathbf {B}\)) would be \(\mathscr {N}(\mathbf {0}, \sigma ^2\mathbf {L}^{-1})\). Combined with the covariance between the outputs, this gives
$$\begin{aligned} \mathrm {vec}(\mathbf {B}) \sim \mathscr {N}(\mathbf {0},\mathbf {R}\otimes \mathbf {L}^{-1}), \end{aligned}$$
where \(\otimes \) is the Kronecker product. This framework is similar to the one considered in Brown et al. (1998) (where the prior for \(\mathbf {B}\) would be written as \(\mathscr {N}(\mathbf {L}^{-1}, \mathbf {R})\)). Two classical choices for \(\mathbf {L}\) would be \(\mathbf {L}= \lambda \mathbf {I}\), where no prior structure is encoded, or \(\mathbf {L}= \mathbf {X}^T \mathbf {X}\), where the prior structure is borrowed from the empirical covariance matrix between the covariates. The latter structure is known as the g-prior (see e.g. Brown et al. 1998; Krishna et al. 2009). In the following, we consider the case where \(\mathbf {L}\) is given, is built upon some exogenous prior information and has an arbitrary form. In Sect. 5, we will show how such a matrix can be constructed for a specific application.
As explained in the previous section, we are primarily interested in the (sparse) estimation of \({\varvec{\varOmega }}_{xy}\). By properties of the \(\mathrm {vec}\) operator, the prior distribution of \(\mathbf {B}\) can be restated as
$$\begin{aligned} \mathrm {vec}({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}) \sim \mathscr {N}(\mathbf {0},\mathbf {R}^{-1} \otimes \mathbf {L}^{-1}) \end{aligned}$$
for the direct links. Choosing such a prior results in
$$\begin{aligned} \log \mathbb {P}({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}|\mathbf {L},\mathbf {R}) = -\frac{1}{2}\mathrm {tr}\left( {\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}^T \mathbf {L} {\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}} \mathbf {R}\right) + \mathrm {cst.} \end{aligned}$$

2.2.2 Criterion

Following this argument, we propose a criterion with two terms to regularize the conditional negative log-likelihood (5): first, a smooth trace term relying on the available structural information \(\mathbf {L}\); and second, an \(\ell _1\) norm that encourages sparsity among the direct links. We write the criterion as a function of \(({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}, {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})\) rather than \(({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}, \mathbf {R})\), although equivalent in terms of estimation, since the former leads to a convex formulation. The optimization problem then amounts to minimizing
$$\begin{aligned} J({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}},{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})= & {} -\frac{1}{n}\log L({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}},{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})\nonumber \\&+\, \frac{\lambda _2}{2} \mathrm {tr}\left( {\varvec{\varOmega }}_{\mathbf {y}\mathbf {x}}\mathbf {L}{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1}\right) \!+\! \lambda _1 \Vert {\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\Vert _1 . \end{aligned}$$
(7)
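For reference, criterion (7) can be evaluated directly from the empirical covariance matrices of Eq. (6); the R sketch below does so, up to the constant term, for given \(({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}},{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})\). The function name and its arguments are ours, not part of the spring package.

```r
## Evaluate criterion (7) up to the constant of Eq. (6)
spring_objective <- function(Omega_xy, Omega_yy, X, Y, L, lambda1, lambda2) {
  n   <- nrow(Y)
  Syy <- crossprod(Y) / n                       # S_yy
  Sxx <- crossprod(X) / n                       # S_xx
  Sxy <- crossprod(X, Y) / n                    # S_xy
  R   <- solve(Omega_yy)                        # Omega_yy^{-1}
  ## -(1/n) log-likelihood, i.e. one half of Eq. (6)
  negloglik <- 0.5 * (-determinant(Omega_yy, logarithm = TRUE)$modulus +
                      sum(Syy * Omega_yy) + 2 * sum(Sxy * Omega_xy) +
                      sum(diag(t(Omega_xy) %*% Sxx %*% Omega_xy %*% R)))
  ## structural (trace) and sparsity penalties of Eq. (7)
  penalty <- lambda2 / 2 * sum(diag(t(Omega_xy) %*% L %*% Omega_xy %*% R)) +
             lambda1 * sum(abs(Omega_xy))
  as.numeric(negloglik + penalty)
}
```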

2.3 Connection to other sparse methods

To get more insight into our model and to facilitate connections with existing approaches, we shall write the objective function (7) as a penalized univariate regression problem. This amounts to “vectorizing” model (4) with respect to \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\), i.e., to writing the objective as a function of \(({\varvec{\omega }},{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})\) where \({\varvec{\omega }}=\mathrm {vec}({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}})\). This is stated in the following proposition, which can be derived from straightforward matrix algebra, as proved in “Appendix”. The main interest of this proposition will become clear when deriving the optimization procedure that aims at minimizing (7), as the optimization problem when \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\) is fixed turns to a generalized Elastic-Net problem. Note that we use \(\mathbf {A}^{1/2}\) to denote the square root of a matrix, obtained for instance by a Cholesky factorization in the case of a symmetric positive definite matrix.

Proposition 1

Let \({\varvec{\omega }}= \mathrm {vec}({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}})\). An equivalent vector form of (7) is
$$\begin{aligned} J({\varvec{\omega }}, {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})= & {} -\frac{1}{2} \log |{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}| + \frac{1}{2 n} \left\| \tilde{\mathbf {y}} + \tilde{\mathbf {X}} {\varvec{\omega }}\right\| _2^2 \\&+\,\frac{\lambda _2}{2} {\varvec{\omega }}^T \left( {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1} \otimes \mathbf {L}\right) {\varvec{\omega }}+ \lambda _1 \left\| {\varvec{\omega }}\right\| _1, \end{aligned}$$
where the \(nq\)-dimensional vector \(\tilde{\mathbf {y}}\) and the \(n q \times p q\)-dimensional matrix \(\tilde{\mathbf {X}}\) depend on \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\) and are such that
$$\begin{aligned} \tilde{\mathbf {y}} = \mathrm {vec}(\mathbf {Y}{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{1/2}), \quad \tilde{\mathbf {X}} = \left( {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1/2} \otimes \mathbf {X}\right) . \end{aligned}$$
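Proposition 1 can be checked numerically: with a symmetric matrix square root (one admissible choice for \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{1/2}\)), the trace form of the data-fit term and its vectorized counterpart coincide, as in the following R sketch with arbitrary toy dimensions.

```r
## Numerical check of Proposition 1 (data-fit term only)
msqrt <- function(A) {                        # symmetric matrix square root
  e <- eigen(A, symmetric = TRUE)
  e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
}
set.seed(1)
n <- 10; p <- 6; q <- 3
X <- matrix(rnorm(n * p), n, p)
Y <- matrix(rnorm(n * q), n, q)
Omega_xy <- matrix(rnorm(p * q), p, q)
Omega_yy <- crossprod(matrix(rnorm(q * q), q, q)) + diag(q)   # positive definite

## trace form of the fit term
M     <- Y + X %*% Omega_xy %*% solve(Omega_yy)
trace <- sum(diag(M %*% Omega_yy %*% t(M)))

## vectorized form of Proposition 1
y.tilde <- as.vector(Y %*% msqrt(Omega_yy))
X.tilde <- kronecker(msqrt(solve(Omega_yy)), X)
omega   <- as.vector(Omega_xy)
vecform <- sum((y.tilde + X.tilde %*% omega)^2)

all.equal(trace, vecform)                     # TRUE
```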
When there is only one response (\(q=1\)), the variance–covariance matrix \(\mathbf {R}\) turns into a scalar denoted by \(\sigma ^2\) (its inverse \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\) into \(1/\sigma ^2\)), the matrix \(\mathbf {B}\) into a p-vector \({\varvec{\beta }}\), and the objective (7) can be rewritten
$$\begin{aligned} \log (\sigma ) + \frac{1}{2n\sigma ^2}\left\| \mathbf {y}-\mathbf {X}{\varvec{\beta }}\right\| _2^2 + \frac{\lambda _2}{2\sigma ^2} {\varvec{\beta }}^T\mathbf {L}{\varvec{\beta }}+ \frac{\lambda _1}{\sigma ^2} \Vert {\varvec{\beta }}\Vert _1. \end{aligned}$$
(8)
We recognize in (8) the log-likelihood of a “usual” linear model with Gaussian noise, unknown variance \(\sigma ^2\) and regression parameters \({\varvec{\beta }}\), plus a mixture of \(\ell _1\) and \(\ell _2\) norms. Assuming that \(\sigma \) is known, we recognize the general structured Elastic-net regression:
$$\begin{aligned} \frac{1}{2}\left\| \mathbf {y}-\mathbf {X}{\varvec{\beta }}\right\| _2^2 + \frac{\lambda _2}{2} {\varvec{\beta }}^T\mathbf {L}{\varvec{\beta }}+ {\lambda _1} \Vert {\varvec{\beta }}\Vert _1. \end{aligned}$$
If \(\sigma \) is unknown, letting \(\lambda _2=0\) in (8) leads to the \(\ell _1\)-penalized mixture regression model proposed by Städler et al. (2010), where both \({\varvec{\beta }}\) and \(\sigma ^2\) are inferred. In Städler et al. (2010), the authors insist on the importance of penalizing the vector of coefficients by an amount that is inversely proportional to \(\sigma \): in the closely related Bayesian Lasso framework of Park and Casella (2008), the prior on \({\varvec{\beta }}\) is defined conditionally on \(\sigma \) to guarantee an unimodal posterior.
We also mention the multitask group-lasso that has been proposed to deal with multivariate outcomes. The purpose is to select a subset of covariates that jointly influence all the outcomes. Denoting \(\check{\mathbf {y}} = \mathrm {vec}(\mathbf {Y})\), \(\check{{\varvec{\beta }}} = \mathrm {vec}(\mathbf {B})\) and \(\check{\mathbf {X}} = \mathbf {I}\otimes \mathbf {X}\), the objective function is
$$\begin{aligned} \left\| \check{\mathbf {y}}- \check{\mathbf {X}}\check{{\varvec{\beta }}}\right\| _2^2 + \sum _g \lambda _{g} \Vert \check{{\varvec{\beta }}}_g \Vert _2. \end{aligned}$$
where \(\check{{\varvec{\beta }}}_g = (\beta _{g1}, \dots \beta _{gq})\) stands for the gth row of matrix \(\mathbf {B}\).

3 Learning

3.1 Optimization

In the classical framework of parametrization (1), alternating strategies where optimization is successively performed over \(\mathbf {B}\) and \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\) have been proposed by Rothman et al. (2010) and Yin and Li (2011). They come with no guarantee of convergence to the global optimum since the objective is only bi-convex. In the cGGM framework of Sohn and Kim (2012) and Yuan and Zhang (2014), the optimized criterion is jointly convex, yet no convergence result is provided regarding the optimization procedure proposed by the authors. Here we also consider an alternating strategy, for which the following proposition can be stated:

Proposition 2

Let \(n\ge q\). Criterion (7) is jointly convex in \(({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}},{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}})\). Moreover, the alternate optimization
$$\begin{aligned} \hat{{\varvec{\varOmega }}}_{\mathbf {y}\mathbf {y}}^{(k+1)}= & {} \mathop {\mathrm {arg min}}\limits _{{\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\succ 0} J_{\lambda _1\lambda _2}\left( \hat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}}^{(k)}, {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\right) , \end{aligned}$$
(9a)
$$\begin{aligned} \hat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}}^{(k+1)}= & {} \mathop {\mathrm {arg min}}\limits _{{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}} J_{\lambda _1\lambda _2}\left( {\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}},\hat{{\varvec{\varOmega }}}_{\mathbf {y}\mathbf {y}}^{(k+1)}\right) . \end{aligned}$$
(9b)
leads to the optimal solution.

The proof is given in “Appendix”, and relies on the fact that efficient procedures exist to solve the two convex sub-problems (9a) and (9b).

Because our procedure relies on alternating optimization, it is difficult to give either a global rate of convergence or a complexity bound. Nevertheless, the complexity of each iteration is easy to derive, since it amounts to two well-known problems: the main computational cost in (9a) is due to the SVD of a \(q\times q\) matrix, which costs \(\mathscr {O}(q^3)\). Concerning (9b), it amounts to the resolution of an Elastic-Net problem with \(p\,{\times }\, q\) variables and \(n \times q\) samples. If the final number of nonzero entries in \(\hat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}}\) is k, a good implementation with Cholesky update/downdate is roughly in \(\mathscr {O}(n p q^2 k)\) (see, e.g. Bach et al. 2012). Since we typically assume that \(p\ge n\ge q\), the global cost of a single iteration of the alternating scheme is thus \(\mathscr {O}(n p q^2 k)\), and we can theoretically treat problems with large p as long as k remains moderate.
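For a fixed \({\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}\), sub-problem (9b) is exactly the generalized Elastic-Net of Proposition 1. As an illustration of its structure only, the R sketch below solves it with plain proximal gradient (ISTA) on the vectorized form; this is not the efficient active-set implementation alluded to above, just a minimal stand-in under toy settings.

```r
## Proximal-gradient (ISTA) sketch for sub-problem (9b), Omega_yy held fixed.
## Illustrative only: not the implementation used in the spring package.
msqrt <- function(A) {                        # symmetric matrix square root
  e <- eigen(A, symmetric = TRUE)
  e$vectors %*% diag(sqrt(e$values)) %*% t(e$vectors)
}
solve_9b <- function(X, Y, Omega_yy, L, lambda1, lambda2,
                     niter = 500, omega0 = NULL) {
  n <- nrow(X); p <- ncol(X); q <- ncol(Y)
  y.tilde <- as.vector(Y %*% msqrt(Omega_yy))
  X.tilde <- kronecker(msqrt(solve(Omega_yy)), X)
  Q <- kronecker(solve(Omega_yy), L)          # structural quadratic penalty
  H <- crossprod(X.tilde) / n + lambda2 * Q   # Hessian of the smooth part
  step  <- 1 / max(eigen(H, symmetric = TRUE, only.values = TRUE)$values)
  omega <- if (is.null(omega0)) rep(0, p * q) else omega0   # warm start possible
  for (it in seq_len(niter)) {
    grad  <- crossprod(X.tilde, y.tilde + X.tilde %*% omega) / n +
             lambda2 * Q %*% omega
    u     <- omega - step * grad
    omega <- sign(u) * pmax(abs(u) - step * lambda1, 0)      # soft-thresholding
  }
  matrix(omega, p, q)                         # back to Omega_xy
}
```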

Finally, we typically want to compute a series of solutions along the regularization path of Problem (7), i.e. for various values of \((\lambda _1,\lambda _2)\). To this end, we simply choose a grid of penalties \(\varLambda _1 \times \varLambda _2 = \left\{ \lambda _1^{\text {min}}, \dots , \lambda _1^{\text {max}}\right\} \times \left\{ \lambda _2^{\text {min}}, \dots , \lambda _2^{\text {max}}\right\} \). The process is easily distributed over different computer cores, each core corresponding to a value picked in \(\varLambda _2\). Then on each core—i.e. for a fixed \(\lambda _2\in \varLambda _2\)—we cover all the values of \(\lambda _1\in \varLambda _1\), relying on the warm-start strategy frequently used to go through the regularization path of \(\ell _1\)-penalized problems (see Osborne et al. 2000).
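A possible organization of this computation is sketched below in R: the values of \(\lambda _2\) are dispatched over cores with parallel::mclapply, and each core covers the \(\lambda _1\) path with warm starts. The single-fit routine solve_one is a hypothetical placeholder (e.g. an alternation of (9a) and (9b)); it is not part of the spring package API, and the data \(\mathbf {X}\), \(\mathbf {Y}\) and structure matrix \(\mathbf {L}\) are assumed to be in the workspace.

```r
## Sketch of the path computation over the grid Lambda_1 x Lambda_2.
## `solve_one` is a hypothetical placeholder for a single (lambda1, lambda2) fit;
## it is assumed to accept and return starting values for warm starts.
library(parallel)
lambda1.grid <- 10^seq(0, -3, length.out = 50)   # lambda1, from large to small
lambda2.grid <- 10^seq(1, -2, length.out = 10)
path <- mclapply(lambda2.grid, function(lambda2) {
  fits  <- vector("list", length(lambda1.grid))
  start <- NULL                                  # no warm start for the first lambda1
  for (l in seq_along(lambda1.grid)) {
    fits[[l]] <- solve_one(X, Y, L, lambda1.grid[l], lambda2, start = start)
    start <- fits[[l]]                           # warm start for the next lambda1
  }
  fits
}, mc.cores = 4)
```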

3.2 Model selection and parameter tuning

Model selection amounts here to choosing a couple of tuning parameters \((\lambda _1,\lambda _2)\) in the grid \(\varLambda _1 \times \varLambda _2 \), where \(\lambda _1\) tunes the sparsity while \(\lambda _2\) tunes the amount of structural regularization. We aim to pick either the model with minimum prediction error, or the one closest to the true model, assuming Eq. (3) holds. These two perspectives generally do not lead to the same model choices: when looking for the model minimizing the prediction error, K-fold cross-validation is the recommended option (Hesterberg et al. 2008) despite its additional computational cost. Letting \(\kappa : \left\{ 1,\dots ,n\right\} \rightarrow \left\{ 1,\dots ,K\right\} \) be the function indexing the fold to which observation i is allocated, the CV choices \((\lambda _1^\text {cv},\lambda _2^\text {cv})\) are the ones that minimize
$$\begin{aligned} \frac{1}{n} \sum _{i=1}^n \left\| \mathbf {x}_{i}^T \hat{\mathbf {B}}_{-\kappa (i)}^{\lambda _1,\lambda _2}- \mathbf {y}_{i} \right\| _2^2, \end{aligned}$$
where \(\hat{\mathbf {B}}_{-\kappa (i)}^{\lambda _1,\lambda _2}\) minimizes the objective (7) once the data in fold \(\kappa (i)\) have been removed. In place of the prediction error, other quantities can be cross-validated in our context, e.g. the negative log-likelihood (5). In the remainder of the paper, however, we only consider cross-validating the prediction error.
As an alternative to cross-validation, penalized criteria provide a fast way to perform model selection and are sometimes better suited to the selection of the true underlying model. Although relying on asymptotic derivations, they often offer good practical performance when the sample size n is not too small compared to the problem dimension. For penalized methods, a general form for various information criteria is expressed as a function of the likelihood L (defined by (5) here) and the effective degrees of freedom:
$$\begin{aligned} - 2\log L\left( \hat{{\varvec{\varOmega }}}^{\lambda _1,\lambda _2}_{\mathbf {x}\mathbf {y}}, \hat{{\varvec{\varOmega }}}^{\lambda _1,\lambda _2}_{\mathbf {y}\mathbf {y}}\right) + \mathrm {pen} \cdot \mathrm {df}_{\lambda _1, \lambda _2}. \end{aligned}$$
(10)
Setting \(\text {pen}=2\) or \(\log (n)\) respectively leads to AIC or BIC. For the practical evaluation of (10), we must give some sense to the effective degrees of freedom \(\text {df}_{\lambda _1,\lambda _2}\). We use the now classical definition of Efron (2004) and rely on the work of Tibshirani and Taylor (2012) to derive the following Proposition, the proof of which is postponed to “Appendix”.

Proposition 3

An unbiased estimator of \(\text {df}_{\lambda _1,\lambda _2}\) for our fitting procedure is
$$\begin{aligned} \hat{\mathrm {df}}_{\lambda _1, \lambda _2}= & {} \mathrm {card}(\mathscr {A})\\&- \, \lambda _2 \mathrm {tr}\left( (\hat{\mathbf {R}}\otimes \mathbf {L})_{\mathscr {A}\mathscr {A}} \left( (\hat{\mathbf {R}}\otimes (\mathbf {S}_{\mathbf {x}\mathbf {x}} + \lambda _2\mathbf {L}))_{\mathscr {A}\mathscr {A}}\right) ^{-1}\right) , \end{aligned}$$
where \(\mathscr {A} = \left\{ j: \mathrm {vec}\left( \hat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}}^{\lambda _1,\lambda _2}\right) \ne 0\right\} \) is the set of nonzero entries in \(\hat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}}^{\lambda _1,\lambda _2}\).

Note that we can compute this expression at no additional cost, relying on computations already made during the optimization process.
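A direct R transcription of the estimator of Proposition 3 could read as follows; it takes the fitted \(\hat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}}\) and \(\hat{\mathbf {R}}\), the structuring matrix \(\mathbf {L}\), the empirical covariance \(\mathbf {S}_{\mathbf {x}\mathbf {x}}\) and \(\lambda _2\) as inputs (how these are extracted from a fitted model is left aside).

```r
## Effective degrees of freedom of Proposition 3
df_spring <- function(Omega_xy_hat, R_hat, L, Sxx, lambda2) {
  A <- which(as.vector(Omega_xy_hat) != 0)      # active set of direct links
  if (length(A) == 0) return(0)
  K1 <- kronecker(R_hat, L)[A, A, drop = FALSE]
  K2 <- kronecker(R_hat, Sxx + lambda2 * L)[A, A, drop = FALSE]
  length(A) - lambda2 * sum(diag(K1 %*% solve(K2)))
}
## BIC of Eq. (10): -2 * logL + log(n) * df_spring(...)
```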

4 Simulation studies

In this section, we would like to illustrate the new features of our proposal compared to several baselines in well controlled settings. To this end, we perform three simulation studies to evaluate (i) the gain brought by the estimation of the residual covariance \(\mathbf {R}\), (ii) the gain brought by the inclusion of informative structure on the predictors via \(\mathbf {L}\) and (iii) the behavior of the method in presence of both residual covariance and structure.

4.1 Implementation details

In our experiments, performance is compared with well-established regularization methods, whose implementation is easily accessible: the LASSO (Tibshirani 1996), the fused-LASSO (Tibshirani et al. 2005), the multitask group-LASSO, MRCE (Rothman et al. 2010), the Elastic-Net (Zou and Hastie 2005) and the Structured Elastic-Net (Slawski et al. 2010). The LASSO and group-LASSO are fitted with the R-package glmnet (Friedman et al. 2010), the fused-LASSO with the R-package FusedLasso (Hoefling 2010) and MRCE with Rothman et al. (2010)’s package. All other methods are fitted using our own code. When applied to multiple outcomes, all univariate methods (LASSO, fused-LASSO, MRCE, Elastic-Net and Structured Elastic-Net) were run on each output separately. The tuning parameter was selected by a cross-validation step per output. Our own procedure is available as an R-package called spring, distributed on the R-forge.1 As such, we sometimes refer to our method as ‘SPRING’ in the simulation part.

4.1.1 Data generation

Artificial datasets are generated according to the multivariate regression model (1). We assume that the decomposition \(\mathbf {B}= -{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}} {\varvec{\varOmega }}_{\mathbf {y}\mathbf {y}}^{-1} = -{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\mathbf {R}\) holds for the regression coefficients. We control the sparsity pattern of \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) by arranging non null entries according to a possible structure of the predictors along the rows of \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\). We always use uncorrelated Gaussian predictors \(\mathbf {x}_i\sim \mathscr {N}(\mathbf {0},\mathbf {I})\) so as not to excessively favor the methods that take this structure into account. The strength of the relationships between the outputs is tuned by the covariance matrix \(\mathbf {R}\). We measure the performance of the learning procedures using the prediction error (PE), estimated on a large test set of observations generated according to the true model. When relevant, the mean squared error (MSE) of the regression coefficients \(\mathbf {B}\) is also presented. For conciseness, it is omitted when it shows results quantitatively identical to PE.

4.2 Influence of covariance between outcomes

The first simulation study aims to illustrate the advantage of splitting \(\mathbf {B}\) into \(-{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\mathbf {R}\) over working directly on \(\mathbf {B}\). We set \(p=40\) predictors, \(q=5\) outcomes and randomly select 25 non null entries in \(\left\{ -1,1\right\} \) to fill \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\). We do not put any structure along the predictors as this study intends to measure the gain of using the cGGM approach. The covariance follows a Toeplitz scheme:2 one has \(\mathbf {R}_{ij} = \tau ^{|i-j|}\), for \(i,j=1,\dots ,q\). We consider three scenarios tuned by \(\tau \in \left\{ .1,.5,.9\right\} \) corresponding to an increasing correlation between the outcomes that eventually makes the cGGM more relevant. These settings have been used to generate panel (a) of Fig. 1. For each covariance scenario, we generate a training set with size \(n=50\) and a test set with size \(n=1000\). We assess the performance of SPRING (taking \(\mathbf {L}= \mathbf {I}\)) by comparison with three baselines: (i) the LASSO, (ii) the \(\ell _1/\ell _2\) multitask group-LASSO and (iii) SPRING with known covariance matrix \(\mathbf {R}\). As it corresponds to the best fit we can obtain with our proposal, we call this variant the “oracle” mode of SPRING. The final estimators are obtained by fivefold cross-validation on the training set. Figure 2 gives the boxplots of PE obtained for 100 replicates. As expected, the advantage of taking the covariance into account becomes more and more important for maintaining a low PE when \(\tau \) increases. When correlation is low, the LASSO dominates the group-LASSO; this is the other way around in the high correlation setup, where the latter takes full advantage of its grouping effect along the outcomes. Still, our proposal remains significantly better as soon as \( \tau \) is substantial enough. We also note that our iterative algorithm does a good job since SPRING remains close to its oracle variant.
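For reproducibility, a minimal R sketch of this data-generation scheme (25 entries in \(\{-1,1\}\), Toeplitz covariance, uncorrelated Gaussian predictors) could read as follows; the seed and the value of \(\tau \) are arbitrary.

```r
## Data generation for the first simulation study (Sect. 4.2)
library(MASS)                                   # mvrnorm
set.seed(42)
p <- 40; q <- 5; n <- 50; n.test <- 1000; tau <- 0.5
Omega_xy <- matrix(0, p, q)
Omega_xy[sample(p * q, 25)] <- sample(c(-1, 1), 25, replace = TRUE)  # 25 direct links
R <- toeplitz(tau^(0:(q - 1)))                  # R_ij = tau^|i - j|
B <- -Omega_xy %*% R                            # B = -Omega_xy R

X      <- matrix(rnorm(n * p), n, p)            # uncorrelated Gaussian predictors
Y      <- X %*% B + mvrnorm(n, rep(0, q), R)    # model (1)
X.test <- matrix(rnorm(n.test * p), n.test, p)
Y.test <- X.test %*% B + mvrnorm(n.test, rep(0, q), R)

## prediction error of an estimate B.hat, as in Sect. 3.2:
## PE <- mean(rowSums((Y.test - X.test %*% B.hat)^2))
```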
Fig. 2

Illustration of the influence of correlations between outcomes. Scenarios \(\left\{ \text {low},\text {med},\text {high}\right\} \) map to \(\tau \in \left\{ .1,.5,.9\right\} \) in the Toeplitz-shaped covariance matrix \(\mathbf {R}\)

4.3 Structure integration and robustness

The second simulation study is designed to measure the impact of introducing an informative prior along the predictors via \(\mathbf {L}\). To remove any effect induced by the covariance between the outputs, we set \(q=1\). In this case the criterion is written as in Expression (8): \(\mathbf {R}\) boils down to a scalar \(\sigma ^2\) and \(\mathbf {B},{\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) turn to two p-size vectors sharing the same sparsity pattern such that \({\varvec{\beta }}= -{\varvec{\omega }}\sigma ^2\). In this situation, SPRING is close to Slawski et al. (2010)’s structured Elastic-Net, except that we hope for a better estimation of the coefficients thanks to the estimation of \(\sigma \). For comparison, we thus draw inspiration from the simulation settings originally used to illustrate the structured Elastic-Net: we set \({\varvec{\omega }}= (\omega _j)_{j=1}^p, \) with \(p=100\), so that we observe a sparse vector with two smooth bumps, one positive and the other one negative:
$$\begin{aligned} \omega _j ={\left\{ \begin{array}{ll} -( (30-j)^2 -100)/200 &{} \quad j=21,\dots 39,\\ \quad ( (70-j)^2 -100)/200 &{} \quad j=61,\dots 80,\\ 0 &{} \quad \text {otherwise}. \end{array}\right. } \end{aligned}$$
A natural choice for \(\mathbf {L}\) is to rely on the first forward difference operator as in the fused-Lasso (Tibshirani et al. 2005), or its smooth counterpart (Hebiri and Geer 2011). We thus set \(\mathbf {L}= \mathbf {D}^T\mathbf {D}\) with
$$\begin{aligned} \mathbf {D}_{ij} = {\left\{ \begin{array}{ll} 1 &{} \quad \text {if } i = j,\\ -1 &{}\quad \text {if } j = i+1 \\ 0 &{} \quad \text {otherwise}\\ \end{array}\right. }, \quad \begin{array}{l} i=1,\dots ,p-1, \\ j=1,\dots ,p. \end{array} \end{aligned}$$
(11)
Hence, \(\mathbf {L}\) is the combinatorial Laplacian of a chain graph between the successive predictors.
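In R, the coefficient vector \({\varvec{\omega }}\) and the structuring matrix \(\mathbf {L}= \mathbf {D}^T\mathbf {D}\) of Eq. (11) can be built in a few lines, as in the following sketch.

```r
## Sparse coefficient vector with two smooth bumps (Sect. 4.3)
p <- 100
omega <- numeric(p)
omega[21:39] <- -((30 - 21:39)^2 - 100) / 200   # positive bump
omega[61:80] <-  ((70 - 61:80)^2 - 100) / 200   # negative bump

## First forward difference operator D of Eq. (11) and chain-graph Laplacian L
D <- matrix(0, p - 1, p)
D[cbind(1:(p - 1), 1:(p - 1))] <-  1
D[cbind(1:(p - 1), 2:p)]       <- -1
L <- crossprod(D)                               # L = D'D
```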
Table 1  Illustrating the gain brought by structural information in terms of statistical performance

  Method                          Hamming dist.   MSE           PE
  SPRING (\(\lambda _2=0.01\))    14.31 (3.51)    0.05 (0.01)   30.46 (2.12)
  SPRING (\(\lambda _2=0\))       20.66 (4.18)    0.31 (0.07)   55.73 (7.82)
  LASSO                           20.47 (4.08)    0.31 (0.07)   55.66 (7.78)

Table 1 shows the typical gain brought by prior structure knowledge. We generate 100 learning samples of size \(n=100\) with \(\sigma ^2=5\) and report the estimated PE, MSE and Hamming distance to the true coefficient vector for the LASSO and for two versions of SPRING, with (\(\lambda _2 = .01\)) and without (\(\lambda _2 = 0\)) informative prior. Recall that the Hamming distance is \(\sum _j s_j (1-\widehat{s}_j) + \widehat{s}_j (1-s_j)\), where \(s_j\) (resp. \(\widehat{s}_j\)) is 1 if variable j has a non-zero coefficient in the true (resp. estimated) model, and 0 otherwise. It quantifies the distance between the true and estimated supports. Incorporating relevant structural information leads to a dramatic improvement in prediction but also in estimating the vector of parameters. As expected, univariate SPRING with no prior performs like the LASSO.
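For reference, the support-recovery criterion translates directly into R, for instance as:

```r
## Hamming distance between the true and estimated supports
support_hamming <- function(beta, beta.hat) {
  s     <- as.numeric(beta != 0)
  s.hat <- as.numeric(beta.hat != 0)
  sum(s * (1 - s.hat) + s.hat * (1 - s))        # equivalently, sum(s != s.hat)
}
```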

Now, the question is: what if we introduce a wrong prior, i.e. a matrix \(\mathbf {L}\) completely disconnected from the true structure of the coefficients? To answer this question, we use the same settings as above and randomly swap the entries in \({\varvec{\omega }}\), using exactly the same \(\mathbf {x}_i\) to generate the \(\mathbf {y}_i\).3 We then apply SPRING using respectively a non informative prior equal to the identity matrix (thus mimicking the Elastic-Net) and a ‘wrong’ prior \(\mathbf {L}\) whose rows and columns remain unswapped. We also try the LASSO, the Elastic-Net and the structured Elastic-Net. All methods are tuned with fivefold cross-validation on the learning set. Table 2 presents the results averaged over 100 runs both in terms of PE (using a size-1000 test set) and MSE.
Table 2  Structure integration: performance and robustness

  Method                                                Scenario    MSE          PE
  LASSO                                                 –           .336 (.096)  58.6 (10.2)
  E-Net (\(\mathbf {L}=\mathbf {I}\))                   –           .340 (.095)  59 (10.3)
  SPRING (\(\mathbf {L}=\mathbf {I}\))                  –           .358 (.094)  60.7 (10)
  S. E-net (\(\mathbf {L}=\mathbf {D}^T \mathbf {D}\))  Unswapped   .163 (.036)  41.3 (4.08)
                                                        Swapped     .352 (.107)  60.3 (11.42)
  SPRING (\(\mathbf {L}=\mathbf {D}^T \mathbf {D}\))    Unswapped   .062 (.022)  31.4 (2.99)
                                                        Swapped     .378 (.123)  62.9 (13.15)

  Swapped and unswapped results are the same for LASSO, Elastic-Net and SPRING for \(\mathbf {L}= \mathbf {I}\)

Fig. 3

Illustration of SPRING in a complete setting. Scenarios \(\left\{ \text {low}, \text {med}, \text {high}\right\} \) correspond to sample sizes \(n\in \left\{ 50, 100, 200\right\} \)

As expected, the methods that do not integrate any structural information (LASSO, Elastic-Net and SPRING with \(\mathbf {L}=\mathbf {I}\)) are not affected by the permutation, and we avoid these redundancies in Table 2 to save space. Overall, these three methods share similar performance both in terms of PE and MSE. When the prior structure is relevant, SPRING, and to a lesser extent the structured Elastic-Net, clearly outperform the other competitors. Surprisingly, this is particularly true in terms of MSE, where SPRING also dominates the structured Elastic-Net that works with the same information. This means that the estimation of the variance also helped in the inference process. Finally, these results essentially support the robustness of the structured methods which are not much altered when using a wrong prior specification.

4.4 Assessing both covariance and structure estimation

This last simulation is performed in a complete setting, including a structure between the predictors and correlations between the outcomes just like in Fig. 1, panel (b), with \(p=40\) and \(q=5\). The \(\mathbf {R}\) matrix has a Toeplitz shape with parameter \(\tau = 0.75\). The \(\mathbf {L}\) matrix is as described in Sect. 4.3. The results are displayed in Fig. 3. As expected, being the only method accounting for both the structure and the covariance, SPRING with structure performs well. Not accounting for the structure in SPRING (but keeping the covariance between outcomes) degrades the performance, which still remains comparable to that of the structured Elastic-net (which does account for the structure).

5 Application studies

In this section the flexibility of our proposal is illustrated by investigating three multivariate data problems from various contexts, namely spectroscopy, genetics and genomics, where we insist on the construction of the structuring matrix \(\mathbf {L}\).

5.1 Near-infrared spectroscopy of cookie dough pieces

5.1.1 Context

In Near-Infrared (NIR) spectroscopy, one aims to predict one or several quantitative variables from the NIR spectrum of a given sample. Each sampled spectrum is a curve that represents the level of reflectance along the NIR region, that is, wavelengths from 800 to 2500 nanometers (nm). The quantitative variables are typically related to the chemical composition of the sample. The problem is then to select the most predictive region of the spectrum, i.e. some peaks that show good capabilities for predicting the response variable(s). This is known as a “calibration problem” in statistics. The NIR technique is used in fields as diverse as agronomy, astronomy or pharmacology. In such experiments, one is likely to encounter very strong correlations and structure along the predictors. In this perspective, Hans (2011) proposes to apply the Elastic-Net, which is known to simultaneously select groups of correlated predictors; however, it is not adapted to the prediction of several responses simultaneously. In Brown et al. (2001), an interesting wavelet regression model with Bayesian inference is introduced that fits into the multivariate regression framework, as does our proposal.

5.1.2 Description of the dataset

We consider the cookie dough data from Osborne et al. (1984). The data, with the corresponding test and training sets, are available in the fds R package. After data pretreatments as in Brown et al. (2001), we have \(n=39\) dough pieces in the training set: each sample consists of an NIR spectrum with \(p=256\) points measured from 1380 to 2400 nm (spaced by 4 nm), and of four quantitative variables that describe the percentages of fat, sucrose, flour and water in the piece of dough.
Table 3  Prediction error for the cookie dough data

  Method               Fat    Sucrose   Flour   Water
  Stepwise MLR         .044   1.188     .722    .221
  Decision theory      .076   .566      .265    .176
  PLS                  .151   .583      .375    .105
  PCR                  .160   .614      .388    .106
  Wavelet regression   .058   .819      .457    .080
  LASSO                .044   .853      .370    .088
  Fused-LASSO          .106   1.93      .378    .231
  Group-LASSO          .127   .918      .467    .102
  MRCE                 .151   .821      .321    .081
  Structured E-net     .044   .596      .363    .082
  SPRING (CV)          .065   .397      .237    .083
  SPRING (BIC)         .048   .389      .243    .066

  Results of the best competitor appears in bold

Fig. 4

Parameters estimated by the penalized regression methods for the cookie data

5.1.3 Structure specification

We would like to account for the neighborhood structure between the predictors, which is obvious in the context of NIR spectroscopy: since spectra are continuous curves, a smooth neighborhood prior will encourage predictors to be selected by “wave”, which seems more satisfactory than isolated peaks. Thus, we naturally define \(\mathbf {L}\) by means of the first forward difference operator (11). We also tested higher orders of the operator to induce a stronger smoothing effect, but they do not lead to dramatic changes in terms of PE and we omit them. Order k is simply obtained by powering the first order matrix. Such techniques have been studied in a structured \(\ell _1\)-penalty framework in Kim et al. (2009) and Tibshirani and Taylor (2011) and are known as trend filtering. Our approach is different, though: it enters a multivariate framework and is based on a smooth \(\ell _2\)-penalty coupled with the \(\ell _1\)-penalty for selection.

5.1.4 Results

The predictive performance of a series of regression techniques is compared in Table 3, evaluated on the same test set.

The first four rows (namely stepwise multivariate regression, the Bayesian decision theory approach, Partial Least Squares and Principal Component Regression) correspond to calibration methods originally presented in Brown et al. (2001); row five (Bayesian Wavelet Regression) is the original proposal from Brown et al. (2001); the remainder of the table is due to our own analysis. Among all the considered methods, only the group-Lasso forces the same covariates to be active for all responses. We observe that, on this example, SPRING achieves extremely competitive results. The BIC performs notably well as a model selection criterion.
Fig. 5

Parameters estimated by our proposal for the cookie dough data (better seen in color). Since \(\hat{\mathbf {B}}\) and \(\hat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}}\) are opposite in sign in our model, we represent \(-\hat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}}\) to ease the comparison between direct and indirect effects. (Color figure online)

Figure 4 shows the regression coefficients adjusted with the penalized regression techniques: apart from the LASSO, which selects very isolated (and unstable) wavelengths, the non-zero regression coefficients have quite a wide spread and are therefore hard to interpret. As expected, the multitask group-Lasso activates the same predictors across the responses, which does not pay off in terms of predictive performance in Table 3. The Fused-Lasso detects the most significant regions in the spectrum; however, the piece-wise constant shape of the estimated signal leads to poor predictive performance. On the other hand, in Fig. 5, the direct effects selected by SPRING with BIC define predictive regions specific to each response, which are well suited for interpretability purposes. Parameters fitted by our approach with BIC are represented in Fig. 5 with, from left to right, the estimators of the regression coefficients \(\mathbf {B}\), of the direct effects \({\varvec{\varOmega }}_{\mathbf {x}\mathbf {y}}\) and of the residual covariance \(\mathbf {R}\). As expected, the regression coefficients show no sparsity pattern since sparsity was imposed on the direct links in the estimation step. One can observe a strong spatial structure along the spectrum, characterized by waves induced by the smooth first-order difference prior. Concerning a potential structure between the outputs, we identify interesting regions where a strong correlation between the responses induces correlations between the regression coefficients. Consider for instance the wavelengths between 1750 and 2000 nm: the regression parameters related to “dry-flour” are clearly anti-correlated with those related to “sucrose”. Still, we cannot tell from \(\hat{\mathbf {B}}\) whether this region has a direct effect on the flour or on the sucrose composition. Such a distinction is achieved in the middle panel, where the selected direct effects \(\hat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}}\) are plotted: they define sparse predictive regions specific to each response, and it is now clear that the 1750–2000 nm region is in fact linked to sucrose.

5.2 Multi-trait genomic selection in Brassica napus

5.2.1 Context

In plant and animal genetics, deriving an accurate prediction of complex traits is of major importance for the early detection and selection of individuals of high genetic value. Due to the continued advancement of high-throughput genotyping and sequencing technologies, genetic information is now available at low cost, whereas acquiring trait information remains expensive. A typical experiment will only contain a few hundred phenotyped individuals genotyped for thousands of markers. Regularization methods such as ridge or Lasso regression or their Bayesian counterparts have been proposed (de los Campos et al. 2012) to predict one or several phenotypes based on the information of genetic markers. Still, in most studies only single-trait genomic selection is performed, neglecting correlations between phenotypes and leading to poor prediction performance. Moreover, little attention has been devoted to the development of regularization methods including prior genetic knowledge.
Table 4  Estimated prediction error for the Brassica napus data for all traits analyzed jointly (standard error in parentheses)

  Method        surv92       surv93       surv94       surv97       surv99       flower0      flower4      flower8
  LASSO         .730 (.011)  .977 (.009)  .943 (.010)  .947 (.009)  .916 (.010)  .609 (.011)  .501 (.011)  .744 (.011)
  S. Enet       .697 (.011)  .987 (.009)  .941 (.011)  .945 (.009)  .911 (.010)  .577 (.011)  .478 (.010)  .727 (.012)
  MRCE          .759 (.010)  .919 (.003)  .917 (.006)  .924 (.004)  .926 (.006)  .591 (.011)  .479 (.011)  .736 (.011)
  group-lasso   .748 (.013)  .995 (.012)  .892 (.013)  .939 (.013)  .906 (.017)  .603 (.017)  .504 (.015)  .717 (.013)
  SPRING        .724 (.010)  .948 (.008)  .848 (.010)  .940 (.006)  .907 (.009)  .489 (.010)  .419 (.009)  .616 (.012)

  Results of the best competitor appear in bold

5.2.2 Description of the dataset

We consider the Brassica napus dataset described in Ferreira et al. (1995) and Kole et al. (2002). The data consist of \(n=103\) double-haploid lines derived from two parent cultivars, ‘Stellar’ and ‘Major’, on which \(p = 300\) genetic markers and \(q=8\) traits (responses) were recorded. Each marker is a 0/1 covariate with \(x_i^j=0\) if line i has the ‘Stellar’ allele at marker j, and \(x_i^j=1\) otherwise. Traits included are percent winter survival for 1992, 1993, 1994, 1997 and 1999 (surv92, surv93, surv94, surv97, surv99, respectively), and days to flowering after no vernalization (flower0), 4 weeks vernalization (flower4) or 8 weeks vernalization (flower8).

5.2.3 Structure specification

In a biparental line population, the correlation between two markers depends on their genetic distance defined in terms of recombination fraction. As a consequence, one expects adjacent markers on the sequence to be correlated, yielding similar direct relationships with the phenotypic traits. Denoting by \(d_{12}\) the genetic distance between markers \(M_1\) and \(M_2\), one has \(\text {cor}(M_1,M_2) = \rho ^{d_{12}}\), where \(\rho = .98\).4 The covariance matrix \(\mathbf {L}^{-1}\) can hence be defined as \(\mathbf {L}_{ij}^{-1} = \rho ^{d_{ij}}\). Moreover, assuming recombination events are independent between \(M_1\) and \(M_2\) on the one hand, and \(M_2\) and \(M_3\) on the other hand, one has \(d_{13} = d_{12} + d_{23}\) and matrix \(\mathbf {L}^{-1}\) exhibits an inhomogeneous AR(1) profile. As a consequence, \(\mathbf {L}\) is tridiagonal with general elements
$$\begin{aligned}&w_{i, i} = \frac{1 - \rho ^{2d_{i-1, i} + 2d_{i, i+1}}}{(1 - \rho ^{2d_{i-1, i}})(1 - \rho ^{2d_{i, i+1}})}, \\&w_{i, i+1} = \frac{-\rho ^{d_{i, i+1}}}{1 - \rho ^{2d_{i, i+1}}} \end{aligned}$$
and \(w_{i,j} = 0\) if \(|i-j| > 1\). For the first (resp. last) marker, the distance \(d_{i-1, i}\) (resp. \(d_{i, i+1}\)) is infinite.
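The tridiagonal matrix \(\mathbf {L}\) can be assembled directly from the vector of genetic distances between successive markers, as in the R sketch below (the distance vector is an assumed input; the convention of an infinite distance for the boundary markers yields the stated limiting values).

```r
## Structuring matrix L from the genetic map (Sect. 5.2.3)
build_L_markers <- function(d, rho = 0.98) {
  ## d: genetic distances between successive markers (length p - 1)
  p  <- length(d) + 1
  dl <- c(Inf, d)                     # d_{i-1,i}, infinite for the first marker
  dr <- c(d, Inf)                     # d_{i,i+1}, infinite for the last marker
  L  <- matrix(0, p, p)
  diag(L) <- (1 - rho^(2 * dl + 2 * dr)) /
             ((1 - rho^(2 * dl)) * (1 - rho^(2 * dr)))
  off <- -rho^d / (1 - rho^(2 * d))
  L[cbind(1:(p - 1), 2:p)] <- off
  L[cbind(2:p, 1:(p - 1))] <- off
  L
}
## sanity check on a small map: solve(build_L_markers(d)) has the AR(1) profile rho^{d_ij}
```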

5.2.4 Results

To compare SPRING and its competitors in terms of predictive performance, PE is estimated by randomly splitting the 103 samples into training and test sets with sizes 93 and 10. Before adjusting the models, we first scale the outcomes on the training and test sets to facilitate interpretability. Fivefold cross-validation is used on the training set to choose the tuning parameters. Two hundred random samplings of the test and training sets were conducted to estimate the PE given in Table 4.
Table 5  Estimated prediction error for the Brassica napus data for the flowering traits analyzed separately (standard error in parentheses)

  Method        surv92       surv93       surv94       surv97       surv99       flower0      flower4      flower8
  group-lasso   .783 (.012)  .959 (.011)  .907 (.011)  .950 (.010)  .911 (.011)  .597 (.019)  .536 (.016)  .757 (.022)
  SPRING        .732 (.010)  .986 (.010)  .906 (.011)  .927 (.009)  .897 (.009)  .484 (.018)  .443 (.015)  .639 (.021)

  Results of the best competitor appear in bold

All methods provide similar results although SPRING provides the smallest error for half of the traits. A picture of the between-response covariance matrix estimated with SPRING is given in Fig. 6. It reflects the correlation between the traits, which are either explained by an unexplored part of the genotype, by the environment or by some interaction between the two. The residuals of the flowering times exhibit strong correlations, whereas correlations between the survival rates are weak. It also shows that the survival traits have a larger residual variability than do the flowering traits, suggesting a higher sensitivity to environmental conditions.
Fig. 6

Brassica study: residual covariance estimation

As suggested by a reviewer, we ran the analysis on the survival and flowering responses separately. Indeed, merging the two subsets of traits may hamper the performance of the group Lasso, as it forces the active covariates to be the same for all responses. The result may also change for SPRING, as the correlations between the two subsets of traits are not accounted for when they are analyzed separately. The other (univariate) methods are not affected by this change. The results are summarized in Table 5. No dramatic improvement or degradation is observed for either of the two methods and SPRING still displays better results.

We then turn to the effects of each marker on the different traits. The left panels of Fig. 7 give both the regression coefficients (top) and the direct effects (bottom). The gray zones correspond to chromosomes 2, 8 and 10, respectively. The exact location of the markers within these chromosomes is displayed in the right panels, where the size of the dots reflects the absolute value of the regression coefficients (top) and of the direct effects (bottom). The interest of considering direct effects rather than regression coefficients appears clearly here, if one looks for example at chromosome 2. Three large overlapping regions are observed in the coefficient plot, for each flowering trait. A straightforward interpretation would suggest that the corresponding region controls the general flowering process. The direct effect plot allows one to go deeper and shows that these three responses are actually controlled by three separate sub-regions within this chromosome. The confusion in the coefficient plot only results from the strong correlations observed among the three flowering traits.
Fig. 7

Brassica study: direct and indirect genetic effects of the markers on the traits estimated by SPRING (color figure online)

5.3 Selecting regulatory motifs from multiple microarrays

5.3.1 Context

In genomics, the expression of genes is initiated by transcription factors that bind to the DNA upstream of the coding regions, in so-called regulatory regions. This binding occurs when a given factor recognizes a certain (small) sequence called a regulatory motif. Genes hosting the same regulatory motif will be jointly expressed under certain conditions. As the binding relies on chemical affinity, some degeneracy can be tolerated in the motif definition, and motifs that differ only by small variations may share the same functional properties (see, e.g., Lajoie et al. 2012).

We are interested in the detection of such regulatory motifs, the presence of which controls the gene expression profile. To this aim, we try to relate the expression levels of all genes across a series of conditions to the content of their respective regulatory regions in terms of motifs. In this context, we expect (i) the set of influential motifs to be small for each condition, (ii) the influential motifs for a given condition to be degenerate versions of each other, and (iii) the expression under similar conditions to be controlled by the same motifs.

5.3.2 Description of the dataset

In Gasch et al. (2000), a series of microarray experiments was conducted on yeast cells (Saccharomyces cerevisiae). Among these assays, we consider 12 time-course experiments profiling \(n=5883\) genes under various environmental changes, as listed in Table 6. These expression sets form 12 potential response matrices \(\mathbf {Y}\), whose numbers of columns correspond to the numbers of time points.
Table 6

Time-course data from Gasch et al. (2000) considered for regulatory motif discovery; the last three columns give the number of motifs selected for \(k=7\), \(k=8\) and \(k=9\)

Experiment                                 # time points   \(k=7\)   \(k=8\)   \(k=9\)
Heat shock                                 8               30        68        43
Shift from 37 to \(25\,^\circ \)C          5                3        11        33
Mild heat shock                            4               24        13        23
Response to \(\text {H}_2 \text {O}_2\)    10              15        10        21
Menadione exposure                         9               16         1         7
DDT exposure 1                             8               15        10        30
DDT exposure 2                             7               11        33        21
Diamide treatment                          8               45        25        35
Hyperosmotic shock                         7               36        24        15
Hypo-osmotic shock                         5               20         8        29
Amino-acid starvation                      5               47        30        39
Diauxic shift                              7               16        14        20
Total number of unique motifs inferred                     87        82        72

Table 7

Comparison of SPRING-selected motifs with MotifDB patterns. Each cell corresponds to a MotifDB pattern (top) compared to a set of aligned SPRING motifs of size 7 (bottom).

Concerning the predictors, we consider the set of all motifs of length k formed with the four nucleotides, that is, \(\mathscr {M}_k = \left\{ A,C,G,T\right\} ^k\). There are \(p = |\mathscr {M}_k| = 4^k\) such motifs. Unless otherwise stated, the motifs in \(\mathscr {M}_k\) are listed in lexicographical order, e.g., when \(k=2\): \(AA,AC,AG,AT,CA,CC,\dots \) and so on. The \(n\times p\) matrix of predictors \(\mathbf {X}\) is then filled such that \(X_{ij}\) equals the number of occurrences of motif j in the regulatory region of gene i.
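A minimal base-R sketch of this construction is given below (for illustration only; details such as counting overlapping occurrences or handling reverse complements are assumptions of the sketch, not specifications from the study).

```r
## Sketch: all k-mers in lexicographic order and their (overlapping) occurrence
## counts in each regulatory sequence. Toy sequences; k = 7, 8, 9 in the study.
k <- 2
bases  <- c("A", "C", "G", "T")
motifs <- apply(expand.grid(rep(list(bases), k), stringsAsFactors = FALSE)[, k:1],
                1, paste, collapse = "")               # AA, AC, AG, AT, CA, ...
count_motifs <- function(s, motifs, k) {
  kmers <- substring(s, 1:(nchar(s) - k + 1), k:nchar(s))
  as.integer(table(factor(kmers, levels = motifs)))
}
sequences <- c(gene1 = "ACGTACGT", gene2 = "AAAACCCC") # toy regulatory regions
X <- t(sapply(sequences, count_motifs, motifs = motifs, k = k))
colnames(X) <- motifs      # X[i, j] = count of motif j in the region of gene i
```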

5.3.3 Structure specification

As we expect the influential motifs for a given condition to be degenerate versions of each other, we first measure the similarity between any two motifs from \(\mathscr {M}_k\) with the Hamming distance, defined as
$$\begin{aligned} \forall a,b\in \mathscr {M}_k, \quad {\mathrm {dist}}(a,b) = \text {card}\{i:a_i\ne b_i\}. \end{aligned}$$
For a fixed value of interest \(0\le \ell \le k\), we further define the \(\ell \)-distance matrix \(\mathbf {D}^{k,\ell }=(d^{k,\ell }_{ab})_{a,b\in \mathscr {M}_k}\) as
$$\begin{aligned} \forall a,b\in \mathscr {M}_k, \quad d^{k,\ell }_{ab} = \left\{ \begin{array}{ll} 1 &{} \quad \text {if dist}(a,b) \le \ell \\ 0 &{} \quad \text {otherwise.} \\ \end{array}\right. \end{aligned}$$
\(\mathbf {D}^{k,\ell }\) can be viewed as the adjacency matrix of a graph whose nodes are the motifs and where an edge is present between two nodes when the two motifs are at a Hamming distance less than or equal to \(\ell \). We finally use the Laplacian of this graph as a structuring matrix \(\mathbf {L}^{k,\ell }=(\ell ^{k,\ell }_{ab})_{a,b\in \mathscr {M}_k}\), that is
$$\begin{aligned} \ell ^{k,\ell }_{ab} = {\left\{ \begin{array}{ll} \sum _{c\in \mathscr {M}_k} d^{k,\ell }_{ac} &{} \text {if } a = b, \\ -1 &{} \text {if } d^{k,\ell }_{ab} = 1, \\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(12)
Note that a similar proposal was made by Li et al. (2010) in the context of sparse, structured PLS. The derivation, however, is different, and the objective of that method is not the selection of individual motifs but of families of motifs, via the compression performed by the PLS.
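For illustration, the construction of \(\mathbf {D}^{k,\ell }\) and \(\mathbf {L}^{k,\ell }\) can be sketched in a few lines of base R on a toy case; for \(k=7,8,9\) the matrices are much larger and a sparse-matrix representation would be preferable.

```r
## Sketch of the structuring matrix L^{k,l} of Eq. (12): Laplacian of the graph
## connecting any two motifs at Hamming distance <= l (toy case k = 2, l = 1).
motifs <- c("AA","AC","AG","AT","CA","CC","CG","CT",
            "GA","GC","GG","GT","TA","TC","TG","TT")
hamming <- function(a, b) sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
build_L <- function(motifs, l) {
  D <- outer(motifs, motifs,
             Vectorize(function(a, b) as.integer(hamming(a, b) <= l)))
  L <- -D                          # l_ab = -1 whenever d_ab = 1 (a != b)
  diag(L) <- rowSums(D)            # degree terms on the diagonal, as in Eq. (12)
  L
}
L <- build_L(motifs, l = 1)
```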

5.3.4 Results

We apply our methodology to candidate motifs from \(\mathscr {M}_7\), \(\mathscr {M}_8\) and \(\mathscr {M}_9\), which results in three lists of putative motifs having a direct effect on gene expression. Because the number of potential predictors grows quickly with k while the matrix \(\mathbf {X}\) becomes sparser, we first perform a screening step that keeps the 5000 motifs with the highest marginal correlations with \(\mathbf {Y}\). Second, SPRING is applied to each of the twelve time-course experiments described in Table 6. The selection of \((\lambda _1,\lambda _2)\) is performed on a grid using the BIC (10). In the end, the three lists corresponding to \(k=7, 8, 9\) include respectively 87, 82 and 72 motifs for which at least one coefficient in the associated row \(\widehat{{\varvec{\varOmega }}}_{\mathbf {x}\mathbf {y}}(j,\cdot )\) was found to be nonzero in at least one of the twelve experiments, as detailed in Table 6.
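A possible sketch of the screening step is given below; with a multivariate \(\mathbf {Y}\), the marginal correlation of a motif is aggregated here as the maximum absolute correlation over the time points, which is an assumption of this sketch rather than a detail reported above.

```r
## Sketch of the screening step: keep the motifs whose counts are most
## correlated with the responses (aggregation by maximum absolute correlation
## over the columns of Y is an assumption of this sketch).
screen_motifs <- function(X, Y, n_keep = 5000) {
  score <- apply(abs(cor(X, Y)), 1, max)
  score[is.na(score)] <- 0                   # motifs with constant counts
  keep <- order(score, decreasing = TRUE)[seq_len(min(n_keep, ncol(X)))]
  X[, keep, drop = FALSE]
}
```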

To assess the relevance of the selected motifs, we compared them with the MotifDB patterns available in Bioconductor (Shannon 2013), where known transcription factor binding sites are recorded. There are 453 such reference motifs, with sizes varying from 5 to 23 nucleotides. Consider the case of \(k=7\) for instance: among the 87 SPRING motifs, 62 match one MotifDB pattern each and 25 are clustered into 11 MotifDB patterns, as depicted in Table 7. As seen in this table, the clusters of motifs selected by SPRING correspond to sets of variants of the same pattern. These clusters consist of motifs that are close according to the similarity encoded in the structure matrix \(\mathbf {L}\). In this example, the ability of SPRING to use a domain-specific definition of the structure between the predictors oriented the regression problem toward accounting for motif degeneracy and helped select motifs that are consistent with known binding sites.

6 Discussion

We introduced a general framework for multivariate regression that accounts for correlation between the outcomes and for a possible structure between the predictors, while inducing sparsity on the direct links between the predictors and the outcomes. The whole procedure is available through the R package spring. It provides a generic framework that can accommodate a broad variety of structures between the covariates via the matrix \(\mathbf {L}\). This comes at the price of constructing this matrix, which is specific to each application. A parallel can be made with other generic machine learning approaches such as SVMs, where the kernel has to be tailored to a given application to ensure good performance. Although we recommend that the \(\mathbf {L}\) matrix be carefully designed to fully incorporate the prior knowledge, some typical shapes are made available in the spring package.
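For instance, for predictors with a natural ordering (wavelengths in spectroscopy, markers along a chromosome), one typical choice is the Laplacian of the chain graph connecting consecutive predictors. The following base-R sketch illustrates this shape; it is not the spring package interface, whose exact functions are not reproduced here.

```r
## Chain-graph Laplacian for p ordered predictors: a typical structure that
## favours similar direct effects for neighbouring predictors.
chain_laplacian <- function(p) {
  L <- matrix(0, p, p)
  L[cbind(1:(p - 1), 2:p)] <- -1   # edge between consecutive predictors
  L[cbind(2:p, 1:(p - 1))] <- -1
  diag(L) <- -rowSums(L)           # node degrees: 1 at both ends, 2 elsewhere
  L
}
L <- chain_laplacian(10)
```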

The proposed approach is valid for a limited number of outcomes, typically \(q < n\). A natural extension of this work is to consider the case of high-dimensional outputs. This requires also imposing sparsity on the covariance matrix \(\mathbf {R}\) between the responses, most naturally via an additional graphical-Lasso-type penalty term. This can be achieved through a modification of step (9a) in the alternating optimization procedure. This is work in progress.

Footnotes

  1.
  2. We set \(\mathbf {R}\) to be a correlation matrix in order not to excessively penalize the LASSO or the group-LASSO, which both use the same tuning parameter \(\lambda _1\) across the outcomes (and thus the same variance estimator).
  3. We also used the same seed and CV folds for both scenarios.
  4. This value directly arises from the definition of the genetic distance itself.


Acknowledgments

We would like to thank Mathieu Lajoie and Laurent Bréhélin for kindly sharing the dataset from Gasch et al. (2000). We also thank the reviewers for their questions and remarks, which helped us to improve our manuscript. This project was conducted in the framework of the project AMAIZING funded by the French ANR. This work has been partially supported by the GRANT Reg4Sel from the French INRA-SelGen metaprogram.

References

  1. Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn. 4(1), 1–106 (2012)
  2. Brown, P., Vannucci, M., Fearn, T.: Multivariate Bayesian variable selection and prediction. J. R. Stat. Soc. B 60(3), 627–641 (1998)
  3. Brown, P., Fearn, T., Vannucci, M.: Bayesian wavelet regression on curves with applications to a spectroscopic calibration problem. J. Am. Stat. Assoc. 96, 398–408 (2001)
  4. Chiquet, J., Grandvalet, Y., Ambroise, C.: Inferring multiple graphical structures. Stat. Comput. 21(4), 537–553 (2011)
  5. de los Campos, G., Hickey, J., Pong-Wong, R., Daetwyler, H., Calus, M.: Whole genome regression and prediction methods applied to plant and animal breeding. Genetics 193(2), 327–345 (2012)
  6. Efron, B.: The estimation of prediction error: covariance penalties and cross-validation (with discussion). J. Am. Stat. Assoc. 99, 619–642 (2004)
  7. Ferreira, M., Satagopan, J., Yandell, B., Williams, P., Osborn, T.: Mapping loci controlling vernalization requirement and flowering time in Brassica napus. Theor. Appl. Genet. 90, 727–732 (1995)
  8. Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010)
  9. Gasch, A., Spellman, P., Kao, C., Carmel-Harel, O., Eisen, M.B., Storz, G., Botstein, D., Brown, P.: Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell 11(12), 4241–4257 (2000)
  10. Hans, C.: Elastic net regression modeling with the orthant normal prior. J. Am. Stat. Assoc. 106, 1383–1393 (2011)
  11. Harville, D.: Matrix Algebra from a Statistician’s Perspective. Springer, New York (1997)
  12. Hebiri, M., van de Geer, S.: The smooth-lasso and other \(\ell _1+\ell _2\) penalized methods. Electron. J. Stat. 5, 1184–1226 (2011)
  13. Hesterberg, T., Choi, N.M., Meier, L., Fraley, C.: Least angle and \(\ell _{1}\) penalized regression: a review. Stat. Surv. 2, 61–93 (2008)
  14. Hoefling, H.: A path algorithm for the fused lasso signal approximator. J. Comput. Graph. Stat. 19(4), 984–1006 (2010)
  15. Kim, S., Xing, E.: Statistical estimation of correlated genome associations to a quantitative trait network. PLoS Genet. 5(8), e1000587 (2009)
  16. Kim, S., Xing, E.: Tree-guided group lasso for multi-task regression with structured sparsity. In: Proceedings of the 27th International Conference on Machine Learning, pp. 543–550 (2010)
  17. Kim, S.J., Koh, K., Boyd, S., Gorinevsky, D.: \(\ell _1\) trend filtering. SIAM Rev. 51(2), 339–360 (2009)
  18. Kole, C., Thorman, C., Karlsson, B., Palta, J., Gaffney, P., Yandell, B., Osborn, T.: Comparative mapping of loci controlling winter survival and related traits in oilseed Brassica rapa and B. napus. Mol. Breed. 1, 201–210 (2002)
  19. Krishna, A., Bondell, H., Ghosh, S.: Bayesian variable selection using an adaptive powered correlation prior. J. Stat. Plan. Inference 139(8), 2665–2674 (2009)
  20. Lajoie, M., Gascuel, O., Lefort, V., Brehelin, L.: Computational discovery of regulatory elements in a continuous expression space. Genome Biol. 13(11), R109 (2012). doi:10.1186/gb-2012-13-11-r109
  21. Li, C., Li, H.: Variable selection and regression analysis for graph-structured covariates with an application to genomics. Ann. Appl. Stat. 4(3), 1498–1516 (2010)
  22. Li, X., Panea, C., Wiggins, C., Reinke, V., Leslie, C.: Learning “graph-mer” motifs that predict gene expression trajectories in development. PLoS Comput. Biol. 6(4), e1000761 (2010)
  23. Lorbert, A., Eis, D., Kostina, V., Blei, D., Ramadge, P.: Exploiting covariate similarity in sparse regression via the pairwise elastic net. In: Teh, Y.W., Titterington, D.M. (eds.) Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS-10), vol. 9, pp. 477–484 (2010)
  24. Mardia, K., Kent, J., Bibby, J.: Multivariate Analysis. Academic Press, London (1979)
  25. Marin, J.M., Robert, C.P.: Bayesian Core: A Practical Approach to Computational Bayesian Statistics. Springer, New York (2007)
  26. Obozinski, G., Wainwright, M., Jordan, M.: Support union recovery in high-dimensional multivariate regression. Ann. Stat. 39(1), 1–47 (2011)
  27. Osborne, B., Fearn, T., Miller, A., Douglas, S.: Application of near infrared reflectance spectroscopy to compositional analysis of biscuits and biscuit doughs. J. Sci. Food Agric. 35, 99–105 (1984)
  28. Osborne, M.R., Presnell, B., Turlach, B.A.: On the lasso and its dual. J. Comput. Graph. Stat. 9(2), 319–337 (2000)
  29. Park, T., Casella, G.: The Bayesian lasso. J. Am. Stat. Assoc. 103, 681–686 (2008)
  30. Rapaport, F., Zinovyev, A., Dutreix, M., Barillot, E., Vert, J.P.: Classification of microarray data using gene networks. BMC Bioinform. 8, 35 (2007)
  31. Rothman, A., Levina, E., Zhu, J.: Sparse multivariate regression with covariance estimation. J. Comput. Graph. Stat. 19(4), 947–962 (2010)
  32. Shannon, P.: MotifDb: An Annotated Collection of Protein-DNA Binding Sequence Motifs. R package version 1.4.0 (2013)
  33. Slawski, M., zu Castell, W., Tutz, G.: Feature selection guided by structural information. Ann. Appl. Stat. 4, 1056–1080 (2010)
  34. Sohn, K., Kim, S.: Joint estimation of structured sparsity and output structure in multiple-output regression via inverse-covariance regularization. JMLR W&CP 22, 1081–1089 (2012)
  35. Städler, N., Bühlmann, P., van de Geer, S.: \(\ell _1\)-penalization for mixture regression models. Test 19(2), 209–256 (2010). doi:10.1007/s11749-010-0197-z
  36. Stein, C.: Estimation of the mean of a multivariate normal distribution. Ann. Stat. 9, 1135–1151 (1981)
  37. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288 (1996)
  38. Tibshirani, R., Taylor, J.: The solution path of the generalized lasso. Ann. Stat. 39(3), 1335–1371 (2011). doi:10.1214/11-AOS878
  39. Tibshirani, R., Taylor, J.: Degrees of freedom in lasso problems. Ann. Stat. 40, 1198–1232 (2012)
  40. Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K.: Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. B 67, 91–108 (2005)
  41. Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001)
  42. Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. 117, 387–423 (2009)
  43. Yin, J., Li, H.: A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann. Appl. Stat. 5, 2630–2650 (2011)
  44. Yuan, X.T., Zhang, T.: Partial Gaussian graphical model estimation. IEEE Trans. Inf. Theory 60(3), 1673–1687 (2014)
  45. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301–320 (2005)

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • Julien Chiquet (1)
  • Tristan Mary-Huard (2)
  • Stéphane Robin (3)

  1. LaMME, UMR 8071 CNRS/Université d’Évry-Val-d’Essonne, Boulevard de France, France
  2. UMR de Génétique Végétale du Moulon, INRA/Univ. Paris Sud/CNRS, Ferme du Moulon, France
  3. UMR 518 AgroParisTech/INRA, Paris, France
