1 Introduction

One of the most common problems in statistics is linear regression. Given p samples of n-dimensional input data \(x_i^\mu\), i=1,…,n, and one-dimensional output data \(y^\mu\), μ=1,…,p, the task is to find weights \(w_i, w_0\) that best describe the relation

$$\begin{aligned} y^\mu= \sum_{i=1}^n w_i x_i^\mu+ w_0+ \xi^\mu \end{aligned}$$
(1)

for all μ, where \(\xi^\mu\) is zero-mean noise with inverse variance β.

The ordinary least squares (OLS) solution is given by \(\mathbf{w}=\chi^{-1}\mathbf{b}\) and \(w_{0}=\bar{y}-\sum_{i} w_{i} \bar{x}_{i}\), where χ is the input covariance matrix, b is the vector of input-output covariances and \(\bar{x}_{i}, \bar{y}\) are the mean values. There are several problems with the OLS approach. When p is small, it typically has low prediction accuracy due to overfitting. In particular, when p<n, χ is not of maximal rank, so its inverse is not uniquely defined. In addition, the OLS solution is not sparse: it finds \(w_i\ne0\) for all i, which often makes the solution difficult to interpret.
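For concreteness, a minimal numpy sketch of this OLS computation (array shapes and names are illustrative, not from the original); note that the solve step fails or becomes ill-conditioned exactly in the p<n regime discussed above.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares in terms of (co)variances.

    X : (p, n) array of inputs, y : (p,) array of outputs.
    Returns the weights w and the bias w0.
    """
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    chi = Xc.T @ Xc / X.shape[0]      # input covariance matrix
    b = Xc.T @ yc / X.shape[0]        # input-output covariances
    w = np.linalg.solve(chi, b)       # ill-defined when p < n (chi not of maximal rank)
    w0 = y_mean - w @ x_mean
    return w, w0
```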

These problems are well known, and a number of approaches exist to overcome them. The simplest is ridge regression, which adds a regularization term \(\frac{1}{2}\lambda\sum_{i} w_{i}^{2}\) with λ>0 to the OLS criterion. This has the effect that the input covariance matrix χ is replaced by χ+λI, which is of maximal rank for all p. One optimizes λ by cross validation. Ridge regression improves the prediction accuracy but not the interpretability of the solution.

Another approach is the lasso (Tibshirani 1996). It solves the OLS problem under the linear constraint \(\sum_i |w_i|\le t\). This problem is equivalent to adding an \(\ell_1\) regularization term \(\lambda\sum_i |w_i|\) to the OLS criterion. The optimization of the quadratic error under linear constraints can be solved efficiently; see Friedman et al. (2010) for a recent account. Again, λ or t may be found through cross validation. The advantage of the \(\ell_1\) regularization is that the solution tends to be sparse, which improves both the prediction accuracy and the interpretability of the solution.

The \(\ell_1\) and \(\ell_2\) regularization terms are known as shrinkage priors because their effect is to shrink the size of the \(w_i\). The idea of a shrinkage prior has been generalized by Frank and Friedman (1993) to the form \(\lambda\sum_i |w_i|^q\) with q>0, where q=1 and q=2 correspond to the lasso and ridge case, respectively. Better solutions can be obtained for q<1; however, the resulting optimization problem is no longer convex and therefore more difficult to solve.

An alternative Bayesian approach to obtain a sparse solution using an \(\ell_0\) penalty was proposed by George and McCulloch (1993), Mitchell and Beauchamp (1988) under the “spike and slab” formulation. There are n binary selector variables \(s_i\) such that the prior distribution over \(w_i\) is a mixture of a narrow (spike) and a wide (slab) Gaussian distribution, both centered on zero. The posterior distribution over \(s_i\) indicates whether input feature i is included in the model or not. Since the number of subsets of features is exponential in n, for large n one cannot compute the solution exactly. In addition, the posterior is a complex high-dimensional distribution over the \(w_i\) and the other (hyper) parameters of the model. Computing the posterior thus requires Markov chain Monte Carlo (MCMC) sampling (George and McCulloch 1993; Brown et al. 1998; Clyde and George 2004; Ishwaran and Rao 2005) or variational Bayesian approximations (Carbonetto and Stephens 2012; Titsias and Lázaro-Gredilla 2011; Logsdon et al. 2010).

Although Bayesian approaches tend to overfit less than maximum likelihood or maximum a posteriori (MAP) estimators, they also tend to be relatively slow. Here we propose a partial Bayesian approach, in which we apply a variational approximation to integrate out the binary (selector) variables in combination with a MAP approach for the remaining parameters. For clarity, we analyze this idea in its simplest form, in the absence of (hierarchical) priors. Instead, we infer the sparsity prior through cross validation. As we motivate below, we call the method the Variational Garrote (VG).

The paper is organized as follows. In Sect. 2 we introduce the model and derive the variational approximation. Related work is described in Sect. 3. In Sect. 4 we study the case where the design matrix is orthogonal; in this case the solution can be computed exactly in closed form, with no need to resort to approximations. In Sect. 5 we compare the VG numerically with a number of other MAP methods, such as the lasso and ridge regression, and with the paired mean field (PMF) method (Titsias and Lázaro-Gredilla 2011), a recently proposed variational Bayesian method. We conclude with a discussion in Sect. 6.

2 The variational approximation

Consider the regression model of the form (see Footnote 1)

$$\begin{aligned} y^\mu=\sum_{i=1}^n w_i s_i x^\mu_i+ \xi^\mu\quad \sum_{i=1}^n s_i \le t \end{aligned}$$
(2)

with \(s_i\in\{0,1\}\). The inputs i with \(s_i=1\) are the predictive ones. Using a Bayesian description, and denoting the data by \(D=\{x^\mu,y^\mu\}\), μ=1,…,p, the likelihood term is given by

$$\begin{aligned} p(y|\mathbf {x},\mathbf {s},\mathbf {w},\beta)&=\sqrt{\frac{\beta}{2\pi}}\exp \Biggl(- \frac{\beta }{2} \Biggl(y-\sum_{i=1}^n w_i s_i x_i \Biggr)^2 \Biggr) \\ p(D|\mathbf {s},\mathbf {w},\beta)&=\prod_\mu p \bigl(y^\mu|\mathbf {x}^\mu,\mathbf {s},\mathbf {w},\beta \bigr) \\ &= \biggl(\frac{\beta}{2\pi} \biggr)^{p/2}\exp \Biggl(-\frac{\beta p}{2} \Biggl(\sum_{i,j=1}^n s_i s_j w_i w_j \chi_{ij}-2\sum _{i=1}^n w_i s_i b_i +\sigma _y^2 \Biggr) \Biggr) \end{aligned}$$
(3)

with \(b_{i}=\frac{1}{p}\sum_{\mu}x_{i}^{\mu}y^{\mu}\), \(\sigma_{y}^{2}=\frac{1}{p}\sum_{\mu}(y^{\mu})^{2}\), \(\chi_{ij}=\frac{1}{p}\sum_{\mu}x_{i}^{\mu}x_{j}^{\mu}\).
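These statistics can be precomputed once from the data; a minimal numpy sketch (array shapes are assumptions):

```python
import numpy as np

def sufficient_statistics(X, y):
    """Compute b_i, sigma_y^2 and chi_ij as defined in the text.

    X : (p, n) array of inputs, y : (p,) array of outputs.
    """
    p = X.shape[0]
    b = X.T @ y / p            # b_i = (1/p) sum_mu x_i^mu y^mu
    sigma_y2 = y @ y / p       # sigma_y^2 = (1/p) sum_mu (y^mu)^2
    chi = X.T @ X / p          # chi_ij = (1/p) sum_mu x_i^mu x_j^mu
    return b, sigma_y2, chi
```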

We should also specify prior distributions over s,w,β. For concreteness, we assume that the prior over s is factorized over the individual s i , each with identical prior probability:

$$\begin{aligned} p(\mathbf {s}|\gamma)& =\prod_{i=1}^n p(s_i|\gamma) \qquad p(s_i|\gamma) =\frac{\exp (\gamma s_i )}{1+\exp(\gamma)} \end{aligned}$$
(4)

where γ is given and specifies the sparsity of the solution. We denote by p(w,β) the prior over the inverse noise variance β and the feature weights w. We leave this prior unspecified, since its choice does not affect the variational approximation.

The posterior becomes

$$\begin{aligned} p(\mathbf {s},\mathbf {w},\beta|D,\gamma)=\frac{p(\mathbf {w},\beta)p(\mathbf {s}|\gamma )p(D|\mathbf {s},\mathbf {w},\beta)}{p(D|\gamma)} \end{aligned}$$
(5)

Computing the MAP estimate or computing statistics from the posterior is complex, in particular due to the discrete nature of s. We propose to compute a variational approximation to the marginal posterior p(w,β|D,γ)=∑ s p(s,w,β|D,γ) and to compute the MAP solution with respect to w,β. Since p(D|γ) does not depend on w,β, we can ignore it.

The posterior distribution, Eq. (5), for given w,β is a typical Boltzmann distribution involving terms linear and quadratic in the \(s_i\). It is well known that when the effective couplings \(w_i w_j \chi_{ij}\) are small, one can obtain good approximations using methods that originated in the statistical physics community, where the \(s_i\) play the role of binary spins. Most prominently, one can use the mean field or variational approximation (Jordan et al. 1999), the TAP approximation (Kappen and Spanjers 2000) or belief propagation (BP) (Murphy et al. 1999). For introductions to these methods see also Opper and Saad (2001), Wainwright and Jordan (2008). Here, we develop a solution based on the simplest possible variational approximation and leave possible improvements using BP or structured mean field approximations for the future.

We approximate the sum by the variational bound (by Jensen’s inequality)

$$\begin{aligned} \log\sum_{\mathbf {s}}p(\mathbf {s}|\gamma)p(D|\mathbf {s}, \mathbf {w},\beta)& \ge-\sum_{\mathbf {s}} q(\mathbf {s}) \log \frac{q(\mathbf {s})}{p(\mathbf {s}|\gamma)p(D|\mathbf {s},\mathbf {w},\beta)} \\ &=-F(q,\mathbf {w},\beta). \end{aligned}$$
(6)

The probability distribution q(s) is called the variational approximation; it can be any positive probability distribution on s, and F(q,w,β) is called the variational free energy. The optimal q(s) is found by minimizing F(q,w,β) with respect to q(s), so that the tightest bound (best approximation) is obtained.

In order to be able to compute the variational free energy efficiently, q(s) must be a tractable probability distribution, such as a chain or a tree with limited tree-width (Barber and Wiegerinck 1999). Here we consider the simplest case where q(s) is a fully factorized distribution: \(q(\mathbf {s})=\prod_{i=1}^{n} q_{i}(s_{i})\) with q i (s i )=m i s i +(1−m i )(1−s i ), so that q is fully specified by the expected values m i =q i (s i =1), which we collectively denote by m.

The expectation values with respect to q can now be easily evaluated and the result is

$$\begin{aligned} F =& \frac{\beta p}{2} \Biggl(\sum_{i, j}^n m_i m_j w_i w_j \chi_{ij}+\sum_i m_i(1-m_i) w_i^2 \chi_{ii}-2\sum _{i=1}^n m_i w_i b_i +\sigma_y^2 \Biggr) \\ &{}-\gamma\sum_{i=1}^n m_i + \sum_{i=1}^n \bigl(m_i \log m_i+(1-m_i)\log (1-m_i) \bigr)- \frac{p}{2}\log\frac{\beta}{2\pi}, \end{aligned}$$
(7)

where we have omitted terms independent of m,β,w. The first line is due to the likelihood term, the second line is due to the prior on s and the entropy of q(s). The approximate marginal posterior is then

$$\begin{aligned} p(\mathbf {w},\beta|D,\gamma)&\propto p(\mathbf {w},\beta)\sum_{\mathbf {s}}p( \mathbf {s}|\gamma)p(D|\mathbf {s},\mathbf {w},\beta) \\ & \approx p(\mathbf {w},\beta) \exp\bigl(-F(\mathbf {m},\mathbf {w},\beta,\gamma)\bigr). \end{aligned}$$

We can compute the variational approximation m for given w,β,γ by minimizing F with respect to m. In addition, p(w,β|D,γ) needs to be maximized with respect to w,β. Note that the variational approximation depends only on the likelihood term and the prior p(s|γ), since these are the only terms that depend on s. Thus, for given w, the variational approximation does not depend on the particular choice of the prior p(w,β). For concreteness, we assume a flat prior p(w,β)∝1. Setting the derivatives of F with respect to m,w,β equal to zero gives the following set of fixed point equations:

$$\begin{aligned} m_i =&\sigma \biggl(\gamma+\frac{\beta p}{2}w_i^2 \chi_{ii} \biggr) \end{aligned}$$
(8)
$$\begin{aligned} \mathbf {w} =&\bigl(\chi'\bigr)^{-1} \mathbf {b}\qquad \chi'_{ij}=\chi_{ij}m_j+(1-m_j) \chi_{jj}\delta_{ij} \end{aligned}$$
(9)
$$\begin{aligned} \frac{1}{\beta} =&\sigma_y^2-\sum _{i=1}^n m_i w_i b_i \end{aligned}$$
(10)

with \(\sigma(x)=(1+\exp(-x))^{-1}\), and where in Eq. (10) we have used Eq. (9). Equations (8)–(10) provide the final solution. They can be solved by fixed point iteration, as outlined in Algorithm 1.

Algorithm 1

The Variational Garrote algorithm
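As a concrete illustration, the following is a minimal Python sketch of the inner loop of Algorithm 1, i.e. the fixed point iteration of Eqs. (8)–(10) for a single value of γ. The annealing over γ (Sect. 5.1) and the dual formulation of Appendix B are omitted; the damping parameter eta plays the role of the smoothing parameter η mentioned in Sect. 6. This is an illustrative sketch under these assumptions, not the authors' reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def vg_fixed_point(chi, b, sigma_y2, p, gamma, n_iter=200, eta=0.5):
    """Fixed point iteration of Eqs. (8)-(10) for a fixed sparsity gamma.

    chi : (n, n) input covariance, b : (n,) input-output covariance,
    sigma_y2 : output variance, p : number of samples.
    eta is a damping (smoothing) factor for the m-update.
    """
    n = len(b)
    m = np.full(n, 0.01)                      # start from a sparse variational solution
    beta = 1.0 / sigma_y2                     # initial noise estimate (overwritten below)
    for _ in range(n_iter):
        # Eq. (9): w = (chi')^{-1} b, chi'_ij = chi_ij m_j + (1 - m_j) chi_jj delta_ij
        chi_prime = chi * m[None, :] + np.diag((1.0 - m) * np.diag(chi))
        w = np.linalg.solve(chi_prime, b)
        # Eq. (10): 1/beta = sigma_y^2 - sum_i m_i w_i b_i
        beta = 1.0 / (sigma_y2 - np.sum(m * w * b))
        # Eq. (8): m_i = sigma(gamma + beta p w_i^2 chi_ii / 2), with damping
        m_new = sigmoid(gamma + 0.5 * beta * p * w**2 * np.diag(chi))
        m = (1.0 - eta) * m + eta * m_new
    return m, w, beta

# usage, with (b, sigma_y2, chi) computed as in the earlier sketch:
# m, w, beta = vg_fixed_point(chi, b, sigma_y2, p, gamma=-10.0)
```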

Within the variational/MAP approximation the predictive model is

$$\begin{aligned} y =&\sum_i m_i w_i x_i +\xi \end{aligned}$$
(11)

with \(\langle\xi^2\rangle=1/\beta\) and m,w,β as estimated by the above procedure.

Equation (11) has some similarity with Breiman's non-negative garrote method (Breiman 1995), which computes the solution in a two-step approach: it first computes \(w_i\) using OLS and then finds the \(m_i\) by minimizing

$$\begin{aligned} \sum_\mu \Biggl(y^\mu-\sum _{i=1}^n x_i^\mu w_i m_i \Biggr)^2\quad \mbox{subject to} \quad m_i\ge0\quad\sum_i m_i \le t. \end{aligned}$$

Because of this similarity, we refer to our method as the Variational Garrote (VG). Note that, because of the OLS step, the non-negative garrote requires p≥n. In contrast, the variational solution of Eqs. (8)–(10) computes the entire solution in one step (and, as we will see, does not require p≥n).

The model in Eqs. (2), (4) is also equivalent to a “spike and slab” prior on the weights, parametrized as the product of a Gaussian random variable \(w_i\) and a Bernoulli random variable \(s_i\):

$$\begin{aligned} p(w_i,s_i) & = \mathcal{N} \bigl(w_i|0,\sigma^2_w\bigr) \pi^{s_i}(1-\pi)^{1-s_i} \quad\forall i, \end{aligned}$$
(12)

under the identification that the VG assumes a constant (improper) prior on w i \((\sigma^{2}_{w} = \infty)\) and the relation between the sparsity γ and π is given by γ=log(π/(1−π)).

Let us pause to make some observations about the VG solution. One might naively expect that the variational approximation would simply consist of replacing \(w_i s_i\) in Eq. (2) by its variational expectation \(w_i m_i\). If this were the case, m would disappear entirely from the equations and one would expect in Eq. (9) the OLS solution with the usual input covariance matrix χ instead of the new matrix χ′ (note that in the special case \(m_i=1\) for all i, χ′=χ and Eq. (9) does reduce to the OLS solution). Instead, m and w are both optimized, giving in general a different solution from the OLS one (see Footnote 2).

When \(m_i<1\), χ′ differs from χ by a rescaling with \(m_i\) and the addition of a positive diagonal, a ‘variational ridge’. This is similar to the mechanism of ridge regression, but with the important difference that the diagonal term depends on i and is dynamically adjusted depending on the solution for m. Thus, the sparsity prior together with the variational approximation provides a mechanism that solves the rank problem. When all \(m_i<1\), χ′ is of maximal rank. Each \(m_i\) that approaches 1 reduces the rank by one. Thus, if χ has rank p<n, χ′ can still be of rank n as long as no more than p of the \(m_i\) equal 1, the remaining n−p of the \(m_i<1\) making up for the rank deficiency. Note that the size of \(m_i\) (and thus the rank of χ′) is controlled by γ through Eq. (8).

In the above procedure, we compute the VG solution for fixed γ and choose its optimal value through cross validation on independent data (Mitchell and Beauchamp 1988). This has the advantage that our result is independent of our (possibly incorrect) prior belief.

Another important advantage of varying γ manually is that it helps to avoid local minima. When we increase γ from a negative value \(\gamma_{\min}\) to a maximal value \(\gamma_{\max}\) in small steps, we obtain a sequence of solutions with decreasing sparseness. These solutions fit the data better, and as a result β increases with γ. Thus, increasing γ implements an annealing mechanism in which we sequentially obtain solutions at lower noise levels. We found empirically that this approach is effective in reducing the problem of local minima. To further deal with the effect of hysteresis (see Sect. 4) we recompute the solution from \(\gamma_{\max}\) down to \(\gamma_{\min}\) and choose the solution with the lowest free energy.

The minimal value of γ is chosen as the largest value such that m i =ϵ, with ϵ small. We find from Eqs. (8)–(10) that

$$\begin{aligned} \gamma_{\mathrm{min}} =-\frac{p b_i^2\chi_{ii}}{2\sigma_y^2}+\sigma^{-1}(\epsilon)+{ \mathcal{O}}( \epsilon ) \end{aligned}$$
(13)

with \(\sigma^{-1}(x)=\log(x/(1-x))\). We heuristically set the maximal value of γ as well as the step size.
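A sketch of this γ schedule, under the assumption that Eq. (13) is applied with the maximum over i so that all \(m_i\) start at or below ε; the choice of \(\gamma_{\max}\) and the number of steps are heuristic, as stated above (the values mirror Sect. 5.1):

```python
import numpy as np

def gamma_schedule(b, chi, sigma_y2, p, eps=1e-3, n_steps=100):
    """Annealing schedule for gamma.

    gamma_min follows Eq. (13); taking the maximum over i (an assumption
    about the intent of Eq. (13)) makes all m_i <= eps at the start.
    gamma_max and the step size are heuristic, as in Sect. 5.1.
    """
    logit = lambda x: np.log(x / (1.0 - x))          # sigma^{-1}(x)
    gamma_min = -p * np.max(b**2 * np.diag(chi)) / (2.0 * sigma_y2) + logit(eps)
    gamma_max = 0.02 * gamma_min
    return np.linspace(gamma_min, gamma_max, n_steps)  # forward pass; reverse for backward
```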

In Appendix B we provide an alternative fixed point iteration scheme that is more efficient in the large-n, small-p limit. Whereas Eqs. (8)–(10) require the repeated solution of an n-dimensional linear system, the dual formulation, Eqs. (8), (22), (25)–(28), requires the repeated solution of a p-dimensional linear system. Algorithm 1 summarizes the VG method.

3 Related work

The “spike and slab” model is one of the most widely used approaches to sparse Bayesian variable selection. Inference in this model has usually been performed by MCMC sampling. These methods address the combinatorial problem of searching all possible \(2^n\) combinations of predictors by sampling from the posterior distribution. There is an extensive literature on MCMC methods for this model, e.g. George and McCulloch (1993), Brown et al. (1998), Clyde and George (2004), Ishwaran and Rao (2005), O’Hara and Sillanpää (2009). However, their applicability to large-scale problems is limited, since designing a Markov chain that explores the parameter space efficiently is a difficult task. In this paper, we focus on the alternative Bayesian variational approach.

A mean field variational approximation for the spike and slab prior was proposed initially in Logsdon et al. (2010) in the context of genetic association studies. Their model differs from the VG in the sense that they use separate and different priors for positive and negative effects. They also use truncated normal distributions for the feature weights and place hyper-priors on γ.

More recently, an alternative variational approximation called paired mean field (PMF) has been proposed in Titsias and Lázaro-Gredilla (2011). It is defined on a model for multiple outputs and considers a linear combination of an input layer of basis functions governed by a Gaussian process, thereby unifying several sparse linear models such as sparse factor analysis or sparse matrix factorization. To relate the PMF model to the VG, we consider the uni-variate response case without the extra input layer. Instead of assuming a fully factorized variational approximation, PMF places each weight w i and bit s i in the same factor, i.e. \(q(\mathbf{w},\mathbf{s})=\prod_{i=1}^{n}q_{i}(w_{i},s_{i})\).

An important difference between the VG and the two previous methods is the algorithm used for parameter optimization. The VG computes the expectation of s (denoted m) but finds a MAP solution for w and β. The hyper-parameter γ is optimized using an annealing-reheating schedule and a validation dataset. In contrast, Logsdon et al. (2010) and Titsias and Lázaro-Gredilla (2011) rely exclusively on the expectation-maximization algorithm with random restarts. As we show later, this can have important consequences in terms of sub-optimality in cases where inputs are highly correlated.

Around the time of publication of this paper, we became aware of the work of Carbonetto and Stephens (2012). Their approach also considers the fully factorized case but assumes a joint prior for the hyper-parameters and uses importance sampling to compute their posterior distribution. Similarly to the VG, their algorithm considers an inner loop of coordinate ascent updates for \(m_i\) and \(w_i\) (see Footnote 3). The difference is that β is treated as a hyper-parameter, together with \(\sigma^{2}_{w}\) and π, and they are jointly integrated using importance sampling. The sampling step is in practice performed using a three-dimensional grid with a heuristically selected resolution. For each setting of the hyper-parameters, they compute the largest marginal likelihood solution \((\mathbf{m}^{\mathrm{(init)}},\mathbf{w}^{\mathrm{(init)}})\) using random initializations and, instead of annealing, the coordinate ascent updates are run again separately for each setting of the hyper-parameters with \((\mathbf{m}^{\mathrm{(init)}},\mathbf{w}^{\mathrm{(init)}})\) as initialization.

The fully factorized approximation considered here is also closely related to the one proposed for independent factor analysis (Attias 1999). A combination with a more complex form of annealing for MAP search has been proposed in Yoshida and West (2010) in the context of sparse latent factor analysis. They have shown that this type of optimization strategy can be useful to address the local minima problem and leads to robust estimation.

An alternative to the aforementioned variational approaches is the work of Hernández-Lobato et al. (2010), in which the expectation propagation (EP) algorithm is used in a multi-task setting where the latent variables indicate whether the corresponding features are used for classification in all the tasks or in none of them. EP also considers a factorized approximate distribution.

The problem of the inconsistency of the lasso penalty has been addressed by many authors and has led to several generalizations (see Tibshirani 2011 and references therein). Two popular approaches that, similarly to the VG, consider non-convex penalties are the Smoothly Clipped Absolute Deviation penalty (SCAD) (Fan and Li 2001) and the SparseNet (Mazumder et al. 2011). SCAD replaces the lasso penalty with a continuously differentiable function that reduces the amount of shrinkage for larger values of \(w_i\), with eventually no shrinkage as \(w_i\to\infty\). The SparseNet performs a coordinate-wise optimization of λ and q, covering the bridge of possible solution surfaces between the lasso (q=1) and variable selection (q=0).

From a Bayesian point of view, the lasso estimator can be viewed as solving a MAP estimation problem in which the feature weights have independent double-exponential (Laplace) priors. A complete Bayesian analysis for the lasso prior is developed in Park and Casella (2008). Fully Bayesian approaches compute posterior mean and median estimates using MCMC sampling and may lead to solutions that are not necessarily sparse. Recently, a Bayesian model that extends the double-exponential prior with a normal-exponential-gamma (NEG) distribution and uses MAP estimation has been proposed in Griffin and Brown (2011). The NEG prior has a finite spike at zero and heavy tails, thus preventing over-shrinkage of weights with large absolute values. The authors propose an EM method that alternates between estimating the prior variances of the weights (E-step) and the weight values conditioned on the variances (M-step). Other hyper-parameters are chosen using cross validation.

4 Orthogonal and uni-variate case

In this section we show, for the uni-variate case, that the solution is either unique or has two solutions, depending on the input-output correlation, the number of samples p and the sparsity prior γ. We derive a phase plot and show that the solution is unique when the sparsity prior is not too strong or when the input-output correlation is not too large. The input-output behavior of the VG is shown to be close to optimal: a smoothed version of hard feature selection. We argue that this behavior also holds in the multi-variate case.

Consider the case in which the inputs are uncorrelated: \(\chi_{ij}=\delta_{ij}\). In this case, we can derive the MAP solution of Eq. (5) exactly, without the need to resort to the variational approximation. Equation (5) reduces to a distribution that factorizes over i, with log probability proportional to

$$\begin{aligned} L=\frac{p}{2}\log\beta-\frac{\beta p}{2} \Biggl(\sum _{i=1}^n s_i \bigl(w_i^2-2w_i b_i\bigr) +\sigma_y^2 \Biggr)+\gamma\sum _{i=1}^n s_i \end{aligned}$$

Maximizing with respect to \(w_i\), β yields \(w_i=b_i\), \(\beta^{-1}=\sigma_{y}^{2}-\sum_{i=1}^{n} s_{i} b_{i}^{2}\) and

$$\begin{aligned} L=\frac{p}{2}\log\beta+\sum_{i=1}^n s_i \biggl(\frac{\beta p}{2}b_i^2+\gamma \biggr)-\frac{\beta p}{2} \sigma_y^2 \end{aligned}$$

Assume without loss of generality that the \(b_i^2\) are sorted in decreasing order. L is maximized by setting \(s_i=1\) when \(\frac{\beta p}{2}b_{i}^{2}+\gamma>0\) and \(s_i=0\) otherwise. Thus, the optimal solution is \(s_{1:k}=1\), \(s_{k+1:n}=0\), \(\beta^{-1}=\sigma_{y}^{2}-\sum_{i=1}^{k} b_{i}^{2}\), with k the smallest integer such that

$$\begin{aligned} \frac{\beta p}{2}b_{k+1}^2+\gamma<0 \end{aligned}$$
(14)

By varying γ from small to large, we find a sequence of solutions with decreasing sparsity.
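A minimal sketch of this exact solution for the orthogonal case, Eq. (14), recomputing β as features are added (an illustration, not the authors' code):

```python
import numpy as np

def exact_orthogonal_solution(b, sigma_y2, p, gamma):
    """Exact MAP solution for orthogonal inputs (chi_ij = delta_ij), Eq. (14).

    Features are considered in order of decreasing b_i^2 and added while
    (beta p / 2) b_i^2 + gamma >= 0, with beta recomputed at each step.
    """
    order = np.argsort(-b**2)                 # indices sorted by decreasing b_i^2
    s = np.zeros(len(b), dtype=int)
    explained = 0.0
    for i in order:
        beta = 1.0 / (sigma_y2 - explained)   # current estimate of the inverse noise
        if 0.5 * beta * p * b[i]**2 + gamma < 0:
            break                             # Eq. (14): stop adding features
        s[i] = 1
        explained += b[i]**2
    w = b.copy()                              # w_i = b_i
    beta = 1.0 / (sigma_y2 - explained)
    return s, w, beta
```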

In the variational approximation the solution is very similar but not identical. Equation (9) gives the same solution w i =b i . Equations (8) and (10) become

$$\begin{aligned} m_i =&\sigma \biggl(\gamma+\frac{\beta p}{2}b_i^2 \biggr) \\ \frac{1}{\beta} =&\sigma_y^2-\sum_i b_i^2 m_i \end{aligned}$$

which we can interpret as the variational approximations of Eq. (14), with m 1:k ≈1 and m k+1:n ≈0. The term \(\sum_{i} b_{i}^{2} m_{i}\) is the explained variance and is subtracted from the total output variance to give an estimate of the noise variance 1/β.

Note that although the posterior factorizes over the \(s_i\), the variational approximation is not identical to the exact MAP solution, Eq. (14), although the results are very similar. The relation is \(s_i=0\Leftrightarrow m_i<0.5\) and \(s_i=1\Leftrightarrow m_i>0.5\).

In order to further analyze the variational solution, we consider the 1-dimensional case. The variational equations become

$$\begin{aligned} m =& \sigma \biggl(\gamma+ \frac{p}{2} \frac{\rho}{1-\rho m} \biggr)=f(m) \end{aligned}$$
(15)
$$\begin{aligned} \frac{1}{\beta} =&\sigma^2_y(1-m \rho) \end{aligned}$$
(16)

with \(\rho=b^{2}/\sigma_{y}^{2}\) the squared correlation coefficient.

In Eq. (15), we have eliminated β and must solve this non-linear equation for m. The solution depends on the input-output correlation ρ, the number of samples p and the sparsity γ. For p=100, the solution for different ρ,γ is illustrated in Fig. 12 (see Appendix A). Equation (15) has one or three solutions for m, depending on the values of γ,ρ,p. The three solutions correspond to two local minima and one local maximum of the free energy F. For γ=−40 and γ=−10, we plot the stable solution(s) for different values of ρ in the insets in Fig. 1. The best variational solution for m is given by the solution with the lowest free energy, indicated by the solid lines in the insets in Fig. 1.

Fig. 1

Phase plot of ρ,γ for p=100 giving the different solutions for m. Dashed and dot-dashed lines for ρ>ρ*=0.28 are from Eq. (19), where two solutions for m exist. The solid line for ρ<ρ* is the solution for γ when m=1/2, indicating the transition from the unique solution m≈0 to the unique solution m≈1. The dotted line is the exact transition from s=0 to s=1 from Eq. (14). Insets show the solutions for m versus ρ for γ=−10, p=100 (top right) and for γ=−40, p=100 (bottom left). In the lower left corner of the insets, the unique solution m≈0 is found; in the top right corner, the unique solution m≈1 is found. Between the dot-dashed and the dashed line, the two variational solutions m≈0 and m≈1 co-exist

Figure 1 further shows the phase plot of γ,ρ, which indicates that the variational solution is unique for γ>γ* or for ρ<ρ*. The solid line for 0<ρ<ρ* in Fig. 1 indicates a smooth (second order) phase transition from m=0 to m=1. For ρ>ρ*, the transition from m=0 to m=1 is discontinuous: for each ρ there is a range of values of γ where two variational solutions m≈0 and m≈1 co-exist. For comparison, we also show the transition line between s=0 and s=1 according to the exact (non-variational) solution, Eq. (14).

The multi-valued variational solution results in a hysteresis effect. When the solution is computed for increasing γ, the m≈0 solution is obtained until it no longer exists. If the sequence of solutions is computed for decreasing γ the m≈1 solution is obtained for values of γ where previously the m≈0 solution was obtained.

From this simple one-dimensional case we may infer that the variational approximation is relatively easy to compute in the uni-modal region (small ρ or γ not too negative) and becomes more inaccurate in the region where multiple optima exist (region between the dot-dashed and dashed lines in Fig. 1).
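The hysteresis described above can be reproduced in a few lines. The sketch below iterates Eq. (15) with warm starts over a grid of γ values; the parameter values are illustrative only.

```python
import numpy as np

def univariate_m(rho, gamma, p, m0, n_iter=500):
    """Iterate Eq. (15), m = sigma(gamma + (p/2) rho/(1 - rho m)), from m0."""
    m = m0
    for _ in range(n_iter):
        m = 1.0 / (1.0 + np.exp(-(gamma + 0.5 * p * rho / (1.0 - rho * m))))
    return m

# Warm-started forward and backward sweeps over gamma (illustrative values).
# For rho above the critical value the two sweeps disagree over a range of
# gamma: this is the hysteresis effect described in the text.
p, rho = 100, 0.5
gammas = np.linspace(-40.0, 0.0, 200)
m_fwd, m_bwd = [], []
m = 0.001
for g in gammas:                       # forward pass: increasing gamma
    m = univariate_m(rho, g, p, m)
    m_fwd.append(m)
m = 0.999
for g in gammas[::-1]:                 # backward pass: decreasing gamma
    m = univariate_m(rho, g, p, m)
    m_bwd.append(m)
m_bwd = m_bwd[::-1]
```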

It is interesting to compare the uni-variate solution of the Variational Garrote with ridge regression, the lasso and Breiman's garrote; the latter three methods were previously compared in Tibshirani (1996). Suppose that data are generated from the model y=wx+ξ with \(\langle\xi^2\rangle=\langle x^2\rangle=1\). We compare the solutions as a function of w. The OLS solution is approximately \(w_{\mathrm{ols}}\approx\langle xy\rangle=w\), where we ignore the statistical deviations of order 1/p due to the finite data set size. Similarly, the ridge regression solution is \(w_{\mathrm{ridge}}=\lambda w\), with 0<λ<1 depending on the ridge prior. The lasso solution (for non-negative w) is \(w_{\mathrm{lasso}}=(w-\gamma)^{+}\) (Tibshirani 1996), with γ depending on the \(\ell_1\) constraint. Breiman's garrote solution is \(w_{\mathrm{garrote}}=(1-\frac{\gamma}{w^{2}})^{+} w\) (Tibshirani 1996), with γ depending on the \(\ell_1\) constraint. The VG solution is \(w_{\mathrm{vg}}=mw\), with m the solution of Eq. (15). Note that the VG solution depends, in addition to w and γ, on the unexplained variance \(\sigma_{y}^{2}\) and the number of samples p, whereas the other methods do not.
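For reference, the ridge, lasso and garrote shrinkage curves follow directly from the formulas above; the hyper-parameter values mirror the caption of Fig. 2 and are otherwise arbitrary. The VG curve would be obtained as \(w_{\mathrm{vg}}=m\cdot w\), with m computed from Eq. (15), e.g. with the fixed point iteration sketched earlier.

```python
import numpy as np

# Shrinkage functions as a function of the OLS estimate w (illustrative grid).
w = np.linspace(0.0, 3.0, 300)
w_ridge = 0.5 * w                                              # lambda = 0.5
w_lasso = np.maximum(w - 0.5, 0.0)                             # gamma = 1/2
w_garrote = np.maximum(1.0 - 0.25 / np.maximum(w, 1e-12)**2, 0.0) * w   # gamma = 1/4
# VG: w_vg = m * w, with m the fixed point of Eq. (15) (see the previous sketch).
```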

The qualitative difference between the solutions is shown in Fig. 2. The ridge regression solution is off by a constant multiplicative factor. The lasso solution is zero for small w and, for larger w, gives a solution that is shifted downwards by a constant amount. Breiman's garrote is identical to the lasso for small w and shrinks less for larger w. The VG gives almost ideal behavior and can be interpreted as a soft version of variable selection: for small w the solution is close to zero and the variable is ignored, and above a threshold it is identical to the OLS solution.

Fig. 2

Uni-variate solution for different regression methods. All methods yield a shrunken solution (deviation from the diagonal line). Variational Garrote (VG) with γ=−10, p=100 and \(\sigma_{y}^{2}=1\). Ridge regression with λ=0.5. Garrote with γ=1/4. Lasso with γ=1/2

The qualitative nature of the phase plot in Fig. 1 and the input-output behavior in Fig. 2 extend to the multi-variate orthogonal case. The symmetry breaking of feature i is independent of all other features, except for the term \(\delta= \sum_{j\ne i}b_{j}^{2} m_{j}\) that enters through β. If we increase γ, δ increases in steps each time one of the features j switches from \(m_j\approx0\) to \(m_j\approx1\). Thus δ is almost always constant, except at the step points. Since the critical values of ρ and γ depend in a simple way on δ, the phase plot for the multi-variate orthogonal case is qualitatively the same as for the uni-variate case.

5 Numerical examples

In the following examples, we compare the VG with the lasso, ridge regression and, in some cases, the paired mean field approach (PMF) (Titsias and Lázaro-Gredilla 2011). We show that the VG and PMF significantly outperform the lasso and ridge regression on a large number of different examples, both in terms of the accuracy of the solution and in prediction error. In addition, we show that the VG does not suffer from the inconsistency of the lasso method when the input correlations are large. We finally show how all methods compare as a function of the level of noise, the sparsity of the target solution, the number of samples and the number of irrelevant predictors.

For most of the examples, we generate training, validation and test sets. Inputs are generated from a zero-mean multi-variate Gaussian distribution with a specified covariance structure. We generate outputs \(y^{\mu}=\sum_{i} \hat{w}_{i} x_{i}^{\mu}+d\xi^{\mu}\) with \(d\xi^{\mu}\sim{ \mathcal{N}}(0,\hat{\sigma})\) and \(\hat{w}_{i}\) depending on the problem.
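A minimal sketch of this data-generating process, interpreting \(\hat{\sigma}\) as the noise standard deviation (names are illustrative):

```python
import numpy as np

def generate_data(w_hat, cov, p, sigma_hat, seed=0):
    """Generate one synthetic data set as described in the text.

    w_hat : (n,) teacher weights, cov : (n, n) input covariance,
    p : number of samples, sigma_hat : noise standard deviation.
    """
    rng = np.random.default_rng(seed)
    n = len(w_hat)
    X = rng.multivariate_normal(np.zeros(n), cov, size=p)   # zero-mean Gaussian inputs
    y = X @ w_hat + sigma_hat * rng.standard_normal(p)      # outputs with additive noise
    return X, y

# e.g. the covariance chi_ij = zeta^{|i-j|} used in Sect. 5.2:
n, zeta = 100, 0.5
idx = np.arange(n)
cov = zeta ** np.abs(idx[:, None] - idx[None, :])
```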

For the VG, ridge regression and the lasso, we optimize the model parameters on the training set and, when necessary, choose the hyper-parameters (γ for the VG, λ for ridge regression and the lasso) that minimize the quadratic error on the validation set. For the lasso, we use the method described in Friedman et al. (2010) (see Footnote 4).

Comparison with PMF is performed using the software available online for the regression case with one-dimensional output (see Footnote 5). For PMF, we merge the training and validation sets and use the resulting dataset as input for the PMF method. This ensures that all methods use the same data for parameter estimation.

We also consider a modified version of PMF that replaces the update of π in the M-step with a sequential annealing-reheating procedure such as the one proposed for γ in the VG. We observed empirically that the best strategy is to perform a sweep from sparse to dense solutions (π: 0→1, the forward pass) followed by a sweep from dense to sparse solutions (π: 1→0, the backward pass) and to select the solution with the maximum bound value (or minimum negative bound, as we report here) in the backward pass. PMF does not over-fit as a function of π and thus does not require the use of a validation set. We refer to this variant of PMF as PMF-ANNEAL.

We define the solution vector for a given method as v. For the VG, the components are \(v_i\equiv m_i w_i\). In the case of PMF and PMF-ANNEAL, \(m_i\) corresponds to the spike-and-slab variational posterior and \(w_i\) to the variational mean of the weights (see Footnote 6). For ridge regression and the lasso, \(v_i\equiv w_i\).

5.1 Small Example 1

In the first example, we take independent inputs \(x_{i}^{\mu}\sim{ \mathcal{N}}(0,1)\) and a teacher weight vector with only one non-zero entry: \(\hat{w}=(1,0,\ldots,0)\), with n=100 and \(\hat{\sigma}=1\). The training set size is p=50, the validation set size \(p_v=50\) and the test set size \(p_t=400\). We choose ϵ=0.001 in Eq. (13), \(\gamma_{\max}=0.02\gamma_{\min}\) and step size \(\Delta\gamma=-0.02\gamma_{\min}\) (see Algorithm 1 for details).

Results for a single run of the VG are shown in Fig. 3. In Fig. 3a, we plot the minimal variational free energy F versus γ for both the forward and the backward run. Note the hysteresis effect due to the local minima. For each γ, we use the solution with the lowest F. In Fig. 3b, we plot the training error and validation error versus γ. The optimal γ≈−21 is denoted by a star; the corresponding \(\sigma=1/\sqrt{\beta}=1.05\). In Fig. 3c, we plot the non-zero component \(v_1=m_1 w_1\) and the maximum absolute value of the remaining components versus γ. Note the robustness of the VG solution in the sense of the large range of γ values for which the correct solution is found. In Fig. 3d, we plot the optimal solution \(v_i=m_i w_i\) versus i.

Fig. 3

Top left (a): minimal variational free energy versus γ. The two curves correspond to warm-start solutions from small to large γ (‘forward’) and from large to small γ (‘backward’) (see also Algorithm 1). Top right (b): training and validation error versus γ. The optimal γ minimizes the validation error. Bottom left (c): solution \(v_1=m_1 w_1\) and \(\max_{i=2:n}|m_i w_i|\). The correct solution is found in the range γ≈−20 to γ≈−5. Bottom right (d): optimal solution \(v_i=w_i m_i\) versus i

In Fig. 4 we show the lasso (top row) and ridge regression (bottom row) results for the same data set. The optimal value of λ minimizes the validation error (star). In Fig. 4b, c we see that the lasso also selects a number of incorrect features. Figure 4b also shows that the lasso solution with a larger λ in the range 0.45<λ<0.95 would select the single correct feature, but would then estimate \(\hat{w}_{1}\) too small due to the large shrinkage effect. Ridge regression gives very bad results: the non-zero feature is estimated too small and the remaining features have large values. Note from Fig. 4e that ridge regression yields a non-sparse solution for all values of λ.

Fig. 4

Regression solution for the lasso and ridge regression for the same data set as in Fig. 3. Top row (a)–(c): lasso. Bottom row (d)–(f): ridge regression. Left column (a), (d): training and validation errors versus λ. Middle column (b), (e): solution for the non-zero feature \(v_1\) and the zero features \(\max_{i=2:n}|v_i|\). Right column (c), (f): optimal lasso and ridge regression solution \(v_i\) versus i

Table 1 shows that the VG significantly outperforms the lasso method and ridge regression both in terms of prediction error, the accuracy of the estimation of the parameters and the number of non-zero parameters. In this simple example, there is no significant difference in the prediction error of lasso, PMF and VG, but the lasso solution is significantly less sparse. There is no significant difference between the solutions found by PMF and VG.

Table 1 Results for Example 1 averaged over 20 instances. Train is mean squared error (MSE) on the training set. Val is MSE on the validation set. Test is MSE on the test set. # non-zero is the number of non-zero elements in the lasso solution and \(\sum_{i=1}^{n} (m_{i}>0.5)\) for VG and PMF. \(\|\delta \mathbf {v}\|_{1}=\sum_{i=1}^{n} |v_{i}-\hat{w}_{i}|\)

5.2 Small Example 2

In the second example, we consider the effect of correlations in the input distribution. Following Tibshirani (1996), we generate input data from a multi-variate Gaussian distribution with covariance matrix \(\chi_{ij}=\zeta^{|i-j|}\), with ζ=0.5. In addition, we choose multiple features non-zero: \(\hat{w}_{i}=1\) for i=1,2,5,10,50 and all other \(\hat {w}_{i}=0\). We use n=100, \(\hat{\sigma}=1\) and \(p/p_v/p_t=50/50/400\). In Table 2 we compare the performance of the VG, the lasso, ridge regression and PMF on 20 random instances. We see that the VG and PMF significantly outperform the lasso and ridge regression both in terms of prediction error and in the accuracy of the parameter estimates. Again, there is no significant difference between PMF and the VG.

Table 2 Results for Example 2. For definitions see caption of Table 1

5.3 Analysis of consistency: VG vs lasso

It is well known that the lasso may yield inconsistent results when input variables are correlated. In Zhao and Yu (2006), necessary and sufficient conditions for consistency are derived. In addition, they give a number of examples where the lasso gives inconsistent results. Their simplest example has three input variables \(x_1,x_2,x_3\), where \(x_1,x_2,\xi,e\) are independent normally distributed random variables, \(x_3=\frac{2}{3}x_1+\frac{2}{3}x_2+\xi\) and \(y=\sum_{i=1}^{3} \hat{w}_{i} x_{i} + e\), with p=1000. When \(\hat{w}=(-2,3,0)\) (Example b) this example is consistent, but when \(\hat{w}=(2,3,0)\) (Example a) it violates the consistency condition. The lasso and VG solutions for Example a for different values of λ and γ are shown in Fig. 5a, b, respectively. The VG solution \(v_i=m_i w_i\) in terms of \(m_i\) and \(w_i\) is shown in Fig. 5c, d. The average results over 100 instances for Examples a and b are shown in Table 3. We see that the VG does not suffer from inconsistency and always finds the correct solution, avoiding sub-optimal local minima.

Fig. 5

Lasso and VG solution for the inconsistent Example a of Zhao and Yu (2006). Top left: the lasso solution versus λ is called inconsistent because there is no λ for which the correct sparsity (\(w_{1,2}\ne0\), \(w_3=0\)) is obtained. Top right: the VG solution for v versus γ contains a large range of γ for which the correct solution is obtained. Bottom left: VG solution for m (curves for \(m_{1,2}\) are identical). Bottom right: VG solution for w (Color figure online)

Table 3 Accuracy of ridge regression, the lasso and the VG for Examples a and b from Zhao and Yu (2006). \(p=p_v=1000\). Parameters λ (ridge and lasso) and γ (VG) optimized through cross validation. \(\|\delta \mathbf {v}\|_{1}\) as before; max(|v 3|) is the maximum over 100 trials of the absolute value of v 3. Example a is inconsistent for the lasso and yields much larger errors than the VG. Example b is consistent and the quality of the lasso and the VG is similar. Ridge regression performs poorly on both examples

5.4 Effect of the noise

In this subsection we show the accuracy of the VG, the lasso and PMF as a function of the noise \(\hat{\sigma}^{2}\). We generate data with n=100, p=100, \(p_v=20\) and \(\hat{w}_{i}=1\) for 10 randomly chosen components i. We vary \(\hat{\sigma}^{2}\) in the range \(10^{-8}\) to 10 for two values of the correlation strength of the inputs, ζ=0.5 and ζ=0.95.

For weakly correlated inputs (Fig. 6a), we distinguish three noise domains: for large noise all methods produce errors of \({ \mathcal{O}}(1)\) and fail to find the predictive features. For intermediate and low noise levels, \(10^{0}>\hat{\sigma}^{2}>10^{-2}\), the VG and PMF perform significantly better than the lasso. In the limit of zero noise, the error of the VG and PMF keeps decreasing, whereas the lasso error saturates at a constant value.

Fig. 6

Accuracy as a function of the noise. n=100, p=100, \(p_v=20\) and \(\hat{w}_{i}=1\) for 10 randomly chosen components i. (a) For weakly correlated inputs (ζ=0.5) VG and PMF show comparable performance, superior to the lasso. (b) For strongly correlated inputs (ζ=0.95) VG performs better than PMF (error bars for PMF are not shown for clarity) but similarly to PMF-ANNEAL. (c) For ζ=0.95, p=60, \(p_v=5\), weights \(\hat{w}_{i}=\pm1\) and mixed input correlations, VG outperforms all methods on average

For strongly correlated inputs, Fig. 6b, we observe that whereas the error of VG scales approximately as before, PMF gets stuck in local minima in some instances, yielding worse average performance than VG. In contrast, the annealed version of PMF is able to avoid these sub-optimal solutions, resulting in average performance comparable to VG.

Finally, we consider a more challenging problem in which the weights have mixed signs \(\hat{w}_{i}=\pm1\), the inputs are positively and negatively correlated, and only a small number of samples is available (p=60, \(p_v=5\)). To generate negatively correlated inputs, we select a subset of the predictors and, for each selected predictor, first obtain the indices (sample numbers) of its values sorted in ascending order; we then replace the predictor values with the values sorted in descending order using those indices (see the sketch below). Average results for this setup are shown in Fig. 6c. In this case, the VG error scales as before, whereas PMF-ANNEAL gets stuck in sub-optimal solutions in some instances.
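A sketch of one reading of this rank-reversal procedure:

```python
import numpy as np

def reverse_rank_order(X, cols):
    """Induce negative correlations for the selected predictors.

    Each selected column keeps the same set of values, but they are
    re-assigned to the samples in reversed rank order, which flips the sign
    of its correlation with the remaining predictors.
    """
    X = X.copy()
    for j in cols:
        idx = np.argsort(X[:, j])              # samples ordered by ascending value
        X[idx, j] = np.sort(X[:, j])[::-1]     # assign the values in descending order
    return X
```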

We can thus conclude that the use of annealing-reheating in the hyper-parameter optimization only partially explains the better performance of the VG compared to PMF. The results on the mixed problem suggest that the combination of a naive mean field variational approximation with a MAP step also helps to avoid local minima.

5.5 Boston-housing dataset: VG vs PMF

We now focus on comparing in more detail the performance of the VG with PMF. In Titsias and Lázaro-Gredilla (2011), the Boston-housing dataset (see Footnote 7) is used to test the accuracy of the PMF approximation compared to a naive mean field approximation.

This is a linear regression problem that consists of 456 training examples with a one-dimensional response variable y (the housing value) and 13 predictors. We use the same setup as in Titsias and Lázaro-Gredilla (2011) to compare the VG with PMF. For PMF, the hyper-parameters were fixed to \(\sigma^{2}=0.1\times\mathrm{var}(y)\), π=0.25, \(\sigma^{2}_{w}=1\), where \(\operatorname{var}(y)\) denotes the output variance. For the VG, we use \(\beta=1/\sigma^2\), γ=log(π/(1−π)). Since γ and β are given, the VG algorithm reduces to iterating Eqs. (8) and (9) starting from a random m. Similarly, PMF reduces to performing an E-step given the fixed hyper-parameter values.

As in Titsias and Lázaro-Gredilla (2011), we use random initial values for the variational parameters between 0 and 1 (soft initialization) and random values equal to 0 or 1 (hard initialization). We considered as ground truth \(\hat{w}\equiv \mathbf {w}^{\mathrm{tr}}\) the result of the efficient paired Gibbs sampler developed in Titsias and Lázaro-Gredilla (2011).

Table 4 shows the results. The first and second rows show the errors reported in Titsias and Lázaro-Gredilla (2011) and the errors that we obtain using their software, respectively. We observe a small discrepancy in the average errors. However, if we consider the percentiles, the results are consistent.

Table 4 Comparison of VG and PMF in the Boston-housing dataset in terms of approximating the ground-truth \(\hat{w}\). Average errors \(\|\delta \mathbf {v}\|_{1}=\sum_{i=1}^{n} |v_{i}-\hat{w}_{i}|\), with v i the approximation of VG or PMF, together with 95 % confidence intervals (given by percentiles) obtained after 300 random initializations for both soft and extreme initializations

PMF finds two local optima depending on the initialization: one is the correct solution (error ≈10−3) whereas the other has error 0.454. These two solutions are found equally often for both soft or hard initializations, showing no dependence on the type of initialization, in agreement with Titsias and Lázaro-Gredilla (2011), and they are illustrated in Fig. 7(left).

Fig. 7

Boston-housing results. \(\mathbf{w}^{\mathrm{tr}}\) are the true weights. Left: PMF finds two solutions (in white and red). The red one is suboptimal (predictor 10 is a false negative and predictor 9 is underestimated). Middle: VG always finds the same optimum. Right: hysteresis effect for π in PMF. PMF is initially trapped in the local optimum (beginning of the forward pass). The global optimum is found for π>π* (π*≈0.7) and remains the solution in the backward pass (Color figure online)

The results of VG are shown on the third row of Table 4 and in Fig. 7(middle). Contrary to PMF, the VG shows no dependence on the initialization and always finds a solution with an error of order 10−3.

The result of the annealed version of PMF for a case in which PMF converged to the suboptimal solution is illustrated in Fig. 7 (right). The global optimum is found for π>π* (π*≈0.7) during the forward pass and remains the solution in the backward pass, showing the hysteresis effect mentioned for γ in the VG. This means that for π>π*, conventional PMF always converges to the global optimum, but this may not be the case for π<π*, depending on the initialization of the weights.

We also perform a similar experiment with the VG for fixed values of γ in the corresponding range \(\gamma=\log(\frac{\pi}{1-\pi})\); for each value of γ we run the VG using 100 random initial values. The VG never finds a suboptimal solution and always converges to the same solution, regardless of the fixed value of γ and the initialization. We thus conclude that the naive mean field variational approximation in combination with the MAP procedure does not suffer from local optima on this dataset.

5.6 Dependence on the number of samples

We now analyze the performance of all considered methods as a function of the number of samples available. We first analyze the case when inputs are not correlated and then consider correlations of practical relevance that appear in genetic datasets.

For these experiments, we generate data of dimension n=500 with noise level β=1. We explore two scenarios: very sparse problems with only 10 % active predictors and denser problems with 25 % active predictors. The weights of the non-zero elements take integer values in increasing order starting from 1, i.e. in the sparse case they take values from 1 to 50. We choose very small validation set sizes (\(p_v=p/10\)). Choosing larger validation set sizes worsens the performance of the VG compared to the PMF variants. This is due to the difference between using a cross validation or a Bayesian approach (the PMF variants use both the training and validation sets for learning).

5.6.1 Uncorrelated case

Figure 8 shows the performance for uncorrelated inputs. The top panels show the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve is calculated by thresholding the weight estimates: those weights that lie above (below) the threshold are considered active (inactive) predictors. The ROC curve plots the fraction of true positives versus the fraction of false positives for all threshold values. The area under the curve measures the ability of the method to correctly classify the predictors that are and are not active: a value of 1 represents perfect classification, whereas 0.5 represents random classification. The ROC is plotted as a function of the fraction of samples relative to the number of inputs, p/n.
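A sketch of this ROC/area computation, assuming the estimates are thresholded on their absolute value:

```python
import numpy as np

def roc_auc(v, w_hat):
    """Area under the ROC curve obtained by thresholding the weight estimates.

    v : estimated solution vector, w_hat : teacher weights; predictors with
    w_hat != 0 are the truly active ones.
    """
    scores, active = np.abs(v), (w_hat != 0)
    thresholds = np.concatenate(([np.inf], np.sort(scores)[::-1]))
    tpr = np.array([np.mean(scores[active] >= t) for t in thresholds])
    fpr = np.array([np.mean(scores[~active] >= t) for t in thresholds])
    return np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)   # trapezoidal rule
```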

Fig. 8

Uncorrelated case: performance as a function of the number of training samples p for two levels of sparsity (10 % and 25 % of non-zero entries). For each value, averages over 20 runs are plotted. Top: area under the ROC curves (see text for definition). Bottom: reconstruction error, defined as \(\|\delta \mathbf {v}\|_{1}=\sum_{i=1}^{n} |v_{i}-\hat{w}_{i}|\)

For both VG and PMF, we observe in all performance measures a transition from a regime where solutions are poor to a regime with almost perfect recovery. This transition, not noticeable in the other (convex) methods, occurs at around 35 % of examples for 10 % of sparsity (left column) and shifts to higher values for denser problems (≈60 % for 25 % of sparsity, right column).

If we compare VG with PMF we see that PMF performs slightly better than VG in terms of area under the ROC curve and reconstruction error in the small sample size limit. Above the threshold, VG and PMF show equivalent performance. We observe no difference between PMF and PMF-ANNEAL (results not shown).

We also see that lasso performs better than ridge regression, but the difference between both methods tends to be smaller for denser problems. Both lasso and ridge regression are significantly worse than VG and PMF.

5.6.2 Correlated case: genetic dataset

We now consider input data obtained from a genetic domain, where the inputs \(x_i\) denote single nucleotide polymorphisms (SNPs) that take values \(x_i\in\{0,1,2\}\). SNPs typically show correlations structured in blocks, where nearby SNPs are highly correlated but show no dependence on distant SNPs. An example of such a correlation matrix can be seen in Fig. 9. The raw genetic dataset for this experiment included 928 samples of 2399 three-valued SNP predictors. To generate the dataset used in the analysis, we keep the original correlation structure of the input data but generate the outputs artificially using a randomly chosen set of active/inactive predictors. This allows us to quantify the error of the different methods.

Fig. 9

Example of the input correlation matrix in the genetic dataset (Color figure online)

First, we filter out the less informative predictors (those with entropy smaller than \(\epsilon_e=0.9\)). This step removes 877 predictors. From the remaining set of 1522 predictors, we select the active ones incrementally, checking at each step that the correlation between a new active predictor and the rest of the active predictors is at most \(\epsilon_\zeta=0.9\). Once the active predictors have been selected, we randomly select the remaining (inactive) predictors to form a set of n=500 predictors in total (see the sketch below). The values of n, \(\epsilon_e\) and \(\epsilon_\zeta\) are chosen in a way that permits the analysis in terms of the sizes of the training and validation sets.
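A sketch of one reading of this selection procedure (the entropy base and the use of absolute correlations are assumptions):

```python
import numpy as np

def select_predictors(X, n_active, n_total, eps_e=0.9, eps_zeta=0.9, seed=0):
    """Predictor selection for the genetic dataset.

    X : (p, n_snps) matrix of SNP values in {0, 1, 2}.
    Returns indices of the active and inactive predictors.
    """
    rng = np.random.default_rng(seed)

    def entropy(col):
        _, counts = np.unique(col, return_counts=True)
        f = counts / counts.sum()
        return -np.sum(f * np.log2(f))          # base-2 entropy (assumption)

    # 1. drop uninformative predictors (entropy below eps_e)
    keep = [j for j in range(X.shape[1]) if entropy(X[:, j]) >= eps_e]

    # 2. incrementally pick active predictors with pairwise |correlation| <= eps_zeta
    active = []
    for j in rng.permutation(keep):
        if len(active) == n_active:
            break
        corr = [abs(np.corrcoef(X[:, j], X[:, a])[0, 1]) for a in active]
        if all(c <= eps_zeta for c in corr):
            active.append(j)

    # 3. fill up with randomly chosen inactive predictors
    rest = [j for j in keep if j not in active]
    inactive = rng.choice(rest, size=n_total - n_active, replace=False)
    return np.array(active), inactive
```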

Figure 10 shows the results. Contrary to the uncorrelated case, the existence of strong correlations between some of the predictors prevents a clear distinction between solution regimes as a function of sample size.

Fig. 10

Correlated case: performance as a function of the number of training samples p for two levels of sparsity (10 % and 25 % of non-zero entries). For each value, averages over 20 runs are plotted. Top: area under the ROC curves (see text for definition). Bottom: reconstruction error, defined as \(\|\delta \mathbf {v}\|_{1}=\sum_{i=1}^{n} |v_{i}-\hat{w}_{i}|\)

We observe, as before, that both the VG and PMF are the preferable methods for sufficiently large training set sizes. The difference between ridge regression and the lasso is more pronounced, and ridge regression can even be preferable to the lasso for denser problems when a large number of samples is available.

In all performance measures considered, the VG performs better than or comparably to PMF. In particular, the VG significantly outperforms PMF for denser problems, which are harder due to the presence of more local minima. PMF-ANNEAL significantly improves the results of conventional PMF for both sparsity levels. From these results we conclude that the VG shows better or comparable performance than any other method considered.

5.7 Scaling with dimension n

We conclude our empirical study by analyzing how the methods scale, both in terms of the quality of the solution and in terms of CPU time, as a function of the number of features n for a constant number of samples. We use data as in Example 2 above, with uncorrelated inputs.

Figure 11 shows the results for the VG, PMF and the lasso. For the VG, we use the dual method described in Appendix B. Figure 11a shows that the VG and PMF have constant quality in terms of the error \(\|\delta\mathbf{v}\|_1\), whereas the quality of the lasso deteriorates with n. Figure 11b shows that the VG and PMF have close to optimal \(\ell_0\) norms (\(\ell_0=5\)) and that the \(\ell_0\) norm of the lasso solution deteriorates with n. Figure 11c shows that the computation time of all methods scales approximately linearly with n. The lasso is significantly faster than the VG and PMF, and the VG is significantly faster than PMF. Note, however, that the VG and PMF methods are implemented in Matlab, whereas the lasso method uses an optimized Fortran implementation.

Fig. 11

Scaling with n: performance of the VG (dual version), PMF and the lasso as a function of the number of features n. (a) Error of the solution vector. (b) \(\ell_0\) norm of the solution vector. (c) CPU time in seconds (the dashed line corresponds to a linear fit). Data are generated as in Example 2. p=100, \(p_v=100\), β=2, ζ=0

6 Discussion

In this paper, we have analyzed a variational method for sparse regression using an \(\ell_0\) penalty. We have presented a minimal version of the model, with no (hierarchical) prior distributions, to highlight some important features: the variational ridge term that dynamically regularizes the regression; the input-output behavior as a smoothed version of hard feature selection; and a phase plot that shows when the variational solution is unique in the orthogonal design case for different p, ρ, γ.

The VG suffers from local minima, as can be expected for any method that needs to solve a non-convex problem. We have shown evidence that the combined variational/MAP approach, together with the annealing procedure that results from increasing γ, followed by a “heating” phase to detect hysteresis, works well in practice, helping to avoid local minima. In particular, we have shown that the VG can outperform a more complex model such as PMF precisely for this reason. Further, we have also observed that the VG can still be preferable to an improved version of PMF (PMF-ANNEAL) in a practical scenario with strongly correlated inputs and/or moderately sparse problems.

As mentioned in Sect. 3, the approach of Carbonetto and Stephens (2012) shares many similarities with the VG. It would be of interest to compare both approaches. We leave this comparison and other more powerful approximations, such as structured mean field approximation or belief propagation for future work.

We have seen that the performance of the VG is excellent in the zero-noise limit. In this limit, the regression problem reduces to a compressed sensing problem (Candes and Tao 2005; Donoho 2006). The performance of compressed sensing with an \(\ell_q\) sparseness penalty was analyzed theoretically in Kabashima et al. (2009), showing the superiority of the \(\ell_1\) penalty in comparison to the \(\ell_2\) penalty and suggesting the optimality of the \(\ell_0\) penalty. Our numerical results are in agreement with this finding.

Our implementation uses parallel updating of Eqs. (8)–(10) or, for the dual formulation, Eqs. (8), (22), (25)–(28). One may also consider sequential updating. This was done successfully for the lasso based on the idea of the Gauss-Seidel algorithm (Friedman et al. 2010). The advantage of such an approach is that each update is linear in both n and p, since only the non-zero components need to be updated. However, the number of updates needed to converge will be larger. The proof of convergence for such a coordinate descent method for the VG is likely to be more complex than for the lasso due to the non-convexity. As a result, a smoothing parameter η≠1 (see Algorithm 1) may still be required.