The Variational Garrote

In this paper, we present a new variational method for sparse regression using $L_0$ regularization. The variational parameters appear in the approximate model in a way that is similar to Breiman's Garrote model; we therefore refer to the method as the variational Garrote (VG). We show that the combination of the variational approximation and $L_0$ regularization has the effect of making the problem effectively of maximal rank, even when the number of samples is small compared to the number of variables. The VG is compared numerically with the Lasso method, ridge regression and the recently introduced paired mean field method (PMF) (M. Titsias & M. Lázaro-Gredilla, NIPS 2012). Numerical results show that the VG and the PMF yield more accurate predictions and reconstruct the true model more accurately than the other methods. It is shown that the VG finds correct solutions when the Lasso solution is inconsistent due to large input correlations. Globally, the VG is significantly faster than the PMF and tends to perform better as the problems become denser and the inputs more strongly correlated. The naive implementation of the VG scales cubically with the number of features. By introducing Lagrange multipliers we obtain a dual formulation of the problem that scales cubically in the number of samples, but close to linearly in the number of features.


Introduction
One of the most common problems in statistics is linear regression: given $p$ samples of $n$-dimensional input data $x_i^\mu$, $i = 1, \ldots, n$, and one-dimensional output data $y^\mu$, $\mu = 1, \ldots, p$, find weights $w_i, w_0$ that best describe the relation
$$y^\mu = w_0 + \sum_{i=1}^n w_i x_i^\mu + \xi^\mu$$
for all $\mu$, where $\xi^\mu$ is zero-mean noise with inverse variance $\beta$.
The ordinary least squares (OLS) solution is given by $w = \chi^{-1} b$ and $w_0 = \bar y - \sum_i w_i \bar x_i$, where $\chi$ is the input covariance matrix, $b$ is the vector of input-output covariances, and $\bar x_i, \bar y$ are the mean values. There are several problems with the OLS approach. When $p$ is small, it typically has low prediction accuracy due to overfitting. In particular, when $p < n$, $\chi$ is not of maximal rank and its inverse is not defined. In addition, the OLS solution is not sparse: it will find a solution with $w_i \ne 0$ for all $i$. Therefore, the interpretation of the OLS solution is often difficult.
These problems are well known, and a number of approaches exist to overcome them. The simplest is ridge regression, which adds a regularization term $\frac{1}{2}\lambda \sum_i w_i^2$ with $\lambda > 0$ to the OLS criterion. This has the effect that the input covariance matrix $\chi$ is replaced by $\chi + \lambda I$, which is of maximal rank for all $p$. One optimizes $\lambda$ by cross validation. Ridge regression improves the prediction accuracy but not the interpretability of the solution.
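As a concrete illustration, here is a minimal numerical sketch (ours, not from the paper) of how the ridge term makes the linear system solvable even when $p < n$:

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge regression on centered data: solve (chi + lam*I) w = b,
    where chi is the input covariance and b the input-output covariance."""
    p, n = X.shape
    chi = X.T @ X / p            # input covariance matrix chi
    b = X.T @ y / p              # input-output covariances b
    return np.linalg.solve(chi + lam * np.eye(n), b)

rng = np.random.default_rng(0)
p, n = 20, 50                    # p < n: chi alone is rank-deficient
X = rng.standard_normal((p, n))
X -= X.mean(axis=0)
y = X[:, 0] + 0.1 * rng.standard_normal(p)
y -= y.mean()
w = ridge(X, y, lam=0.5)         # chi + lam*I is full rank, so this succeeds
```

Solving with `chi` alone would fail here, since its rank is at most $p < n$.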
Another approach is the Lasso [1]. It solves the OLS problem under the linear constraint $\sum_i |w_i| \le t$, which is equivalent to adding an $L_1$ regularization term $\lambda \sum_i |w_i|$ to the OLS criterion. The optimization of the quadratic error under linear constraints is a quadratic program, for which efficient methods exist; see [2] for a recent account. Again, $\lambda$ or $t$ is found through cross validation. The advantage of the $L_1$ regularization is that the solution tends to be sparse, which improves both the prediction accuracy and the interpretability of the solution.
The $L_1$ and $L_2$ regularization terms are known as shrinkage priors because their effect is to shrink the size of the $w_i$. The idea of a shrinkage prior has been generalized by [3] to the form $\lambda \sum_i |w_i|^q$ with $q > 0$, where $q = 1, 2$ corresponds to the Lasso and ridge case, respectively. Better solutions can be obtained for $q < 1$; however, the resulting optimization problem is no longer convex and therefore more difficult.
An alternative Bayesian approach to obtaining a sparse solution was proposed by [4]. They introduce $n$ binary selector variables $s_i$ such that the prior distribution over $w_i$ is a mixture of a narrow (spike) and a wide (slab) Gaussian distribution, both centered on zero. The posterior distribution over $s_i$ indicates whether the solution $w_i$ is non-zero or zero. The computation of the posterior requires Gibbs sampling or another MCMC method.
Here, we propose a model where we introduce sparseness in the likelihood instead of in the prior:
$$y^\mu = \sum_{i=1}^n s_i w_i x_i^\mu + \xi^\mu, \qquad s_i = 0, 1 \qquad (1)$$
where we assume from here on, without loss of generality, that the data are centered: $\frac{1}{p}\sum_{\mu=1}^p x_i^\mu = \frac{1}{p}\sum_{\mu=1}^p y^\mu = 0$. The idea of this model is that the bits $s_i$ identify the predictive inputs $i$. When we optimize the posterior distribution with respect to the $s_i$, or use Bayesian integration, one finds the optimal subset of relevant features. Since the number of subsets is exponential in $n$, one could use MCMC sampling. Here, instead, we propose to approximate the posterior distribution over $\vec s$ and $\vec w$ using a variational approximation.

Model Eq. 1 has some similarity with Breiman's non-negative Garrote method [5]. The non-negative Garrote assumes the same model as Eq. 1, but with $0 \le s_i \le 1$ instead of binary $s_i$. It computes the $w_i$ using OLS and then finds the $s_i$ by minimizing the squared error $\sum_\mu \big(y^\mu - \sum_i s_i w_i x_i^\mu\big)^2$ subject to $s_i \ge 0$ and $\sum_i s_i \le t$. Because of this similarity, we refer to our method as the variational Garrote (VG). As we will see, the VG does not require the OLS solution and can find good solutions also when $p < n$.

Using a Bayesian description, and denoting the data by $D = \{\vec x^\mu, y^\mu\}$, $\mu = 1, \ldots, p$, the likelihood term is given by
$$p(D|\vec s, \vec w, \beta) = \prod_{\mu=1}^p \sqrt{\frac{\beta}{2\pi}} \exp\left(-\frac{\beta}{2}\Big(y^\mu - \sum_i s_i w_i x_i^\mu\Big)^2\right) \qquad (2)$$
We must also specify prior distributions over $\vec s, \vec w, \beta$. For concreteness, we assume that the prior over $\vec s$ factorizes over the individual $s_i$, each with identical prior probability:
$$p(\vec s|\gamma) = \prod_{i=1}^n p(s_i|\gamma), \qquad p(s_i|\gamma) = \frac{\exp(\gamma s_i)}{1 + \exp(\gamma)} \qquad (3)$$
with $\gamma$ a given parameter that specifies the sparsity of the solution. We denote by $p(\vec w, \beta)$ the prior over the inverse noise variance $\beta$ and the feature weights $\vec w$. We leave this prior unspecified since it can be chosen arbitrarily. The posterior becomes
$$p(\vec s, \vec w, \beta|D, \gamma) = \frac{p(\vec w, \beta)\, p(\vec s|\gamma)\, p(D|\vec s, \vec w, \beta)}{p(D|\gamma)} \qquad (4)$$
Computing the MAP estimate, or computing statistics from the posterior, is complex due to the discrete nature of $\vec s$. We propose to compute a variational approximation to the marginal posterior $p(\vec w, \beta|D, \gamma) = \sum_{\vec s} p(\vec s, \vec w, \beta|D, \gamma)$ and to compute the MAP solution with respect to $\vec w, \beta$.
Since p(D|γ) does not depend on w, β we can ignore it.
Note that the model specified by the likelihood Eq. 2 and the prior Eq. 3 is not equivalent to the spike and slab model [4]. The spike and slab model has a quadratic likelihood $p(D|\vec w, \beta)$ that does not depend on the selector variables. Instead, it has a prior $p(\vec w|\vec s)$ that depends on $\vec s$, of the form
$$p(\vec w|\vec s) = \prod_{i=1}^n \mathcal{N}(w_i|0, \beta_+^{-1})^{s_i}\, \mathcal{N}(w_i|0, \beta_-^{-1})^{1-s_i}$$
with $\beta_\pm$ the spike and slab precisions. Thus, the spike and slab model only contains an interaction between $w_i$ and $s_i$ of the form $s_i w_i^2$, whereas the likelihood Eq. 2 contains in addition interactions of the form $s_i w_i$.

The variational approximation
The posterior distribution Eq. 4 for given $\vec w, \beta$ is a typical Boltzmann distribution, involving terms linear and quadratic in the $s_i$, which play the role of binary spins. It is well known that one can obtain good approximations using methods that originated in the statistical physics community: most prominently, the mean field or variational approximation [6], or belief propagation [7]. For introductions to these methods see [8,9]. Here, we develop a solution based on the simplest possible variational approximation and leave possible improvements using belief propagation or structured mean field approximations for the future.
We approximate the intractable sum over $\vec s$ by a variational bound obtained from Jensen's inequality:
$$\log \sum_{\vec s} p(\vec s, \vec w, \beta|D, \gamma) \ge \sum_{\vec s} q(\vec s) \log \frac{p(\vec s, \vec w, \beta|D, \gamma)}{q(\vec s)} = -F(q, \vec w, \beta)$$
Here $q(\vec s)$ is called the variational approximation and can be any probability distribution on $\vec s$, and $F(q, \vec w, \beta)$ is called the variational free energy. The optimal $q(\vec s)$ is found by minimizing $F(q, \vec w, \beta)$ with respect to $q(\vec s)$, so that the tightest bound, i.e. the best approximation, is obtained.
In order to compute the variational free energy efficiently, $q(\vec s)$ must be a tractable probability distribution, such as a chain or a tree with limited tree-width [10]. Here we consider the simplest case, where $q(\vec s)$ is a fully factorized distribution: $q(\vec s) = \prod_{i=1}^n q_i(s_i)$, so that $q$ is fully specified by the expected values $m_i = q_i(s_i = 1)$, which we collectively denote by $\vec m$. The expectation values with respect to $q$ can then be evaluated easily, and the result is
$$F(\vec m, \vec w, \beta) = \frac{\beta p}{2}\Big(\sigma_y^2 - 2\sum_i m_i w_i b_i + \sum_{ij} m_i m_j w_i w_j \chi_{ij} + \sum_i m_i(1-m_i) w_i^2 \chi_{ii}\Big) - \frac{p}{2}\log\frac{\beta}{2\pi}$$
$$\qquad - \gamma \sum_i m_i + n \log(1 + e^\gamma)$$
$$\qquad + \sum_i \big(m_i \log m_i + (1-m_i)\log(1-m_i)\big)$$
where $\chi_{ij} = \frac{1}{p}\sum_\mu x_i^\mu x_j^\mu$, $b_i = \frac{1}{p}\sum_\mu x_i^\mu y^\mu$ and $\sigma_y^2 = \frac{1}{p}\sum_\mu (y^\mu)^2$. The first line is due to the likelihood term, the second line is due to the prior on $\vec s$, and the third line is the negative entropy of $q(\vec s)$. The approximate marginal posterior is then $p(\vec w, \beta|D, \gamma) \approx \exp(-F(\vec m, \vec w, \beta))$, up to normalization.

We compute the variational approximation $\vec m$ for given $\vec w, \beta, \gamma$ by minimizing $F$ with respect to $\vec m$. In addition, $p(\vec w, \beta|D, \gamma)$ needs to be maximized with respect to $\vec w, \beta$. Note that the variational approximation only depends on the likelihood term and the prior on $\vec s$, since these are the only terms that depend on $\vec s$. Thus, the variational approximation does not depend on the particular choice of the prior $p(\vec w, \beta)$. For concreteness, we assume a flat prior $p(\vec w, \beta) \propto 1$. Setting the derivatives of $F$ with respect to $\vec m, \vec w, \beta$ equal to zero gives the following set of fixed point equations:
$$m_i = \sigma\Big(\gamma + \frac{\beta p}{2} w_i^2 \chi_{ii}\Big) \qquad (8)$$
$$\sum_j \chi'_{ij} w_j = b_i, \qquad \chi'_{ij} = m_j \chi_{ij} + \delta_{ij}(1-m_j)\chi_{jj} \qquad (9)$$
$$\frac{1}{\beta} = \sigma_y^2 - \sum_i m_i w_i b_i \qquad (10)$$
with $\sigma(x) = (1 + \exp(-x))^{-1}$, and where in Eq. 10 we have used Eq. 9.

Eqs. 8-10 provide the solution to the variational Garrote. They can be solved by fixed point iteration as outlined in Appendix C: initialize $\vec m$ at random; compute $\vec w$ by solving the linear system Eq. 9 and $\beta$ from Eq. 10; then compute a new solution for $\vec m$ from Eq. 8. Within the variational/MAP approximation, the predictive model is given by
$$y = \sum_i m_i w_i x_i + \xi, \qquad \langle \xi^2 \rangle = 1/\beta \qquad (11)$$
with $\vec m, \vec w, \beta$ as estimated by the above procedure.

Let us pause to make some observations about this solution. One might naively expect that the variational approximation would simply consist of replacing $w_i s_i$ in Eq. 1 by its variational expectation $w_i m_i$.
If this were the case, $\vec m$ would disappear entirely from the equations and one would expect in Eq. 9 the OLS solution with the normal input covariance matrix $\chi$ instead of the new matrix $\chi'$ (note that in the special case that $m_i = 1$ for all $i$, $\chi' = \chi$ and Eq. 9 does reduce to the OLS solution). Instead, $\vec m$ and $\vec w$ are both optimized, giving in general a different solution than the OLS solution.
When $m_i < 1$, $\chi'$ differs from $\chi$ by the addition of a positive diagonal term. This is similar to the mechanism of ridge regression, but with the important difference that the diagonal term depends on $i$ and is dynamically adjusted depending on the solution for $\vec m$. Thus, the sparsity prior together with the variational approximation provides a mechanism that solves the rank problem. When all $m_i < 1$, $\chi'$ is of maximal rank. Roughly speaking, if $\chi$ has rank $p < n$, $\chi'$ can still be of rank $n$ when no more than $p$ of the $m_i$ equal 1, the remaining $n - p$ of the $m_i < 1$ making up for the rank deficiency. Note that the size of the $m_i$ (and thus the rank of $\chi'$) is controlled by $\gamma$ through Eq. 8.
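The fixed point iteration of Eqs. 8-10 can be sketched as follows (our reconstruction in Python; the damping factor `eta` and the fixed iteration count are our own choices, not taken from Appendix C):

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def vg_fixed_point(X, y, gamma, iters=200, eta=0.5):
    """Fixed point iteration for the variational Garrote (sketch).
    X: (p, n) centered inputs, y: (p,) centered outputs."""
    p, n = X.shape
    chi = X.T @ X / p                  # input covariance chi_ij
    b = X.T @ y / p                    # input-output covariance b_i
    sy2 = np.mean(y ** 2)              # output variance sigma_y^2
    m = np.full(n, 0.5)                # variational marginals q(s_i = 1)
    for _ in range(iters):
        # Eq. 9: solve chi' w = b, chi'_ij = m_j chi_ij + delta_ij (1-m_j) chi_jj
        chi_p = chi * m[None, :] + np.diag((1.0 - m) * np.diag(chi))
        w = np.linalg.solve(chi_p, b)
        # Eq. 10: 1/beta = sigma_y^2 - sum_i m_i w_i b_i (clamped for safety)
        beta = 1.0 / max(sy2 - np.sum(m * w * b), 1e-12)
        # Eq. 8: m_i = sigmoid(gamma + (beta p / 2) w_i^2 chi_ii), damped update
        m_new = sigma(gamma + 0.5 * beta * p * w ** 2 * np.diag(chi))
        m = (1.0 - eta) * m + eta * m_new
    return m, w, beta

rng = np.random.default_rng(1)
p, n = 200, 10
X = rng.standard_normal((p, n)); X -= X.mean(axis=0)
y = X[:, 0] + 0.1 * rng.standard_normal(p); y -= y.mean()
m, w, beta = vg_fixed_point(X, y, gamma=-8.0)
```

On this toy problem with a single predictive input, the iteration drives $m_1$ towards 1 and the remaining $m_i$ towards 0, as described above.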

Orthogonal and univariate case
In order to obtain further insight into the solution, consider the case in which the inputs are uncorrelated: $\chi_{ij} = \delta_{ij}$. Then $\chi'_{ij} = \delta_{ij}$ and Eq. 9 gives the solution $w_i = b_i$. Eqs. 8 and 10 become
$$m_i = \sigma\Big(\gamma + \frac{\beta p}{2} b_i^2\Big), \qquad \frac{1}{\beta} = \sigma_y^2 - \sum_i b_i^2 m_i$$
The term $\sum_i b_i^2 m_i$ is the explained variance and is subtracted from the total output variance to give an estimate of the noise variance $1/\beta$. When all $m_i = 1$, the explained variance is maximal. The Garrote solution with $0 \le m_i \le 1$ has reduced explained variance, with (hopefully) better prediction accuracy and interpretability.
In the one-dimensional case these equations become
$$m = \sigma\Big(\gamma + \frac{p}{2}\frac{\rho}{1 - m\rho}\Big) \qquad (12)$$
with $\rho = b^2/\sigma_y^2$ the squared correlation coefficient. In Eq. 12 we have eliminated $\beta$, and we must solve this non-linear equation for $m$. The solution depends on the input-output correlation $\rho$, the number of samples $p$ and the sparsity $\gamma$. For $p = 100$, the solution for different $\rho, \gamma$ is illustrated in figure 1 (see appendix A). Eq. 12 has one or two solutions for $m$, depending on the values of $\gamma, \rho, p$. The two solutions correspond to two local minima of the free energy $F$, which in the univariate case is given by
$$F(m) = \frac{p}{2}\log(1 - m\rho) - \gamma m + m\log m + (1-m)\log(1-m)$$
where we have omitted terms independent of $m$. The best variational solution is the one with the lowest free energy, indicated by the solid lines in fig. 1 (right).

It is interesting to compare the univariate solution with ridge regression, the Lasso and Breiman's Garrote; the latter three methods were previously compared in [1]. Suppose that data are generated from the model $y = wx + \xi$ with $\langle \xi^2 \rangle = \langle x^2 \rangle = 1$. We compare the solutions as a function of $w$. The OLS solution is approximately $w_{ols} \approx \langle xy \rangle = w$, where we ignore statistical deviations of order $1/p$ due to the finite data set size. Similarly, the ridge regression solution is $w_{ridge} \approx \lambda w$, with $0 < \lambda < 1$ depending on the ridge prior. The Lasso solution (for non-negative $w$) is $w_{lasso} = (w - \gamma)^+$ [1], with $\gamma$ depending on the $L_1$ constraint. Breiman's Garrote solution is $w_{garrote} = (1 - \gamma/w^2)^+ w$, with $\gamma$ depending on the $L_1$ constraint. The solutions are shown in fig. 2. These methods are qualitatively different: the ridge regression solution is off by a constant multiplicative factor; the Lasso solution is zero for small $w$ and for larger $w$ gives a solution that is shifted downwards by a constant amount; Breiman's Garrote is identical to the Lasso for small $w$ and shrinks less for larger $w$.
The VG gives almost ideal behavior and can be interpreted as a soft version of variable selection: for small $w$ the solution is close to zero and the variable is ignored, while above a threshold it is essentially identical to the OLS solution.
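These univariate shrinkage curves are easy to reproduce numerically. The sketch below is ours: the grid search over $m$ stands in for solving Eq. 12 exactly, and the univariate free energy used is our reconstruction from the text, assuming unit input and noise variance so that $b = w$ and $\sigma_y^2 = w^2 + 1$:

```python
import numpy as np

def lasso_univariate(w, gamma):
    """Soft thresholding: w_lasso = (|w| - gamma)^+ sign(w)."""
    return np.sign(w) * max(abs(w) - gamma, 0.0)

def garrote_univariate(w, gamma):
    """Breiman's Garrote: w_garrote = (1 - gamma / w^2)^+ w."""
    return max(1.0 - gamma / w ** 2, 0.0) * w

def vg_univariate(w, p=100, gamma=-10.0):
    """VG solution v = m * b, found by minimizing the univariate free energy
    F(m) = (p/2) log(1 - m rho) - gamma m + m log m + (1-m) log(1-m)
    over a grid of m values, with rho = b^2 / sigma_y^2."""
    b = w
    rho = b ** 2 / (b ** 2 + 1.0)
    m = np.linspace(1e-6, 1.0 - 1e-6, 10000)
    F = (p / 2) * np.log(1.0 - m * rho) - gamma * m \
        + m * np.log(m) + (1.0 - m) * np.log(1.0 - m)
    return m[np.argmin(F)] * b
```

For a strong teacher weight the VG returns nearly the OLS value, while for a weak one it returns nearly zero, reproducing the soft variable-selection behavior described above.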

A dual formulation
The solution of the system of Eqs. 8-10 by fixed point iteration requires the repeated solution of the $n$-dimensional linear system $\chi' \vec w = \vec b$. When $n > p$, we can obtain a more efficient method using a dual formulation.
We define new variables $z^\mu = \sum_i m_i w_i x_i^\mu$ and add Lagrange multipliers $\lambda^\mu$ to enforce these constraints. We show in Appendix B that the solution is found by solving a $p$-dimensional system of equations for the Lagrange multipliers, together with Eq. 8.

Cross validation
One remaining issue is how to choose the level of sparsity $\gamma$. If we had a prior belief about $\gamma$, we could optimize its value together with $\vec w, \beta$. Here, instead, we propose to compute the VG solution for fixed $\gamma$ and choose its optimal value through cross validation on independent data. This has the advantage that the result is independent of a (possibly incorrect) prior belief. Another important advantage is computational: when we increase $\gamma$ from a negative value towards zero in small steps, we obtain a sequence of solutions with decreasing sparseness. These solutions fit the data progressively better, and as a result $\beta$ increases with $\gamma$. Thus, increasing $\gamma$ implements an annealing approach in which we sequentially obtain solutions at lower noise levels. We found empirically that this approach is effective in reducing the problem of local minima. To further deal with the effect of hysteresis, we recompute the solution from $\gamma_{max}$ down to $\gamma_{min}$.
The minimal value of $\gamma$ is chosen as the largest value such that all $m_i \le \epsilon$, with $\epsilon$ small. From Eqs. 8-10, evaluated at the maximally sparse solution ($w_i = b_i/\chi_{ii}$, $1/\beta = \sigma_y^2$), we find
$$\gamma_{min} = \log\frac{\epsilon}{1-\epsilon} - \frac{p}{2}\max_i \frac{b_i^2}{\chi_{ii}\,\sigma_y^2} \qquad (20)$$
We heuristically set the maximal value of $\gamma$ as well as the step size. The VG algorithm is summarized in Appendix C.
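The annealing schedule for $\gamma$ can be sketched as follows (our reconstruction; $\gamma_{min}$ is evaluated at the maximally sparse solution $w_i = b_i/\chi_{ii}$, $1/\beta = \sigma_y^2$, and the heuristic choices $\gamma_{max} = 0.02\,\gamma_{min}$, $\Delta\gamma = -0.02\,\gamma_{min}$ follow the settings quoted for Example 1):

```python
import numpy as np

def gamma_schedule(X, y, eps=1e-3, frac=0.02):
    """Annealing schedule for gamma (sketch). gamma_min is the largest
    value at which all m_i stay below eps; gamma_max and the step size
    are set heuristically (gamma_max = frac * gamma_min,
    Delta gamma = -frac * gamma_min)."""
    p, n = X.shape
    chi_ii = np.mean(X ** 2, axis=0)       # diagonal of input covariance
    b = X.T @ y / p                        # input-output covariances
    sy2 = np.mean(y ** 2)                  # output variance
    gamma_min = np.log(eps / (1 - eps)) - (p / 2) * np.max(b ** 2 / (chi_ii * sy2))
    step = -frac * gamma_min               # positive: gamma increases towards 0
    return np.arange(gamma_min, frac * gamma_min + step / 2, step)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5)); X -= X.mean(axis=0)
y = X[:, 0] + rng.standard_normal(50); y -= y.mean()
gammas = gamma_schedule(X, y)
```

The resulting sequence runs from the very sparse regime ($\gamma_{min}$, all $m_i \approx 0$) up towards zero, with warm starts at each step providing the annealing effect described above.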

Numerical examples
In the following examples, we compare the Lasso, ridge regression and the VG. For each example, we generate a training set, a validation set and a test set. Inputs are generated from a zero-mean multivariate Gaussian distribution with specified covariance structure. We generate outputs $y^\mu = \sum_i \hat w_i x_i^\mu + d\xi^\mu$ with $d\xi^\mu \sim \mathcal N(0, \hat\sigma)$. For each value of the hyperparameters ($\gamma$ in the case of the VG, $\lambda$ in the case of ridge regression and the Lasso), we optimize the model parameters on the training set. For the Lasso, we used the method described in [2]. We choose the hyperparameters that minimize the quadratic error on the validation set. Finally, we report the mean squared error on the training, validation and test sets, as well as the mean squared error and mean absolute error in the estimated parameters. For the VG, we define the solution as $v_i = m_i w_i$.

In the first example (Example 1), we take independent inputs $x_i^\mu \sim \mathcal N(0, 1)$ and a teacher weight vector with only one non-zero entry: $\hat w = (1, 0, \ldots, 0)$, $n = 100$ and $\hat\sigma = 1$. The training set size is $p = 50$, the validation set size $p_v = 50$ and the test set size $p_t = 400$. We choose $\epsilon = 0.001$ in Eq. 20, $\gamma_{max} = 0.02\gamma_{min}$, $\Delta\gamma = -0.02\gamma_{min}$. For details of the algorithm see Appendix C. Results for a single run of the VG are shown in fig. 3. In fig. 3a, we plot the minimal variational free energy $F$ versus $\gamma$ for both the forward and backward run; note the hysteresis effect due to the local minima. For each $\gamma$, we use the solution with the lowest $F$. In fig. 3b, we plot the training error and validation error versus $\gamma$. The optimal $\gamma = -14.8$ is denoted by a star, and the corresponding $\sigma = 1/\sqrt\beta = 1.0845$. In fig. 3c, we plot the non-zero component $v_1 = m_1 w_1$ and the maximum absolute value of the remaining components versus $\gamma$. Note the robustness of the VG solution, in the sense of the large range of $\gamma$ values for which the correct solution is found. In fig. 3d, we plot the optimal solution $v_i = m_i w_i$ versus $i$.
In fig. 4 we show the Lasso (top row) and ridge regression (bottom row) results for the same data set. The optimal value of $\lambda$ minimizes the validation error (star). In fig. 4b,c we see that the Lasso also selects a number of incorrect features. Fig. 4b also shows that the Lasso solution with a larger $\lambda$, in the range $0.45 < \lambda < 0.95$, could select the single correct feature, but would then estimate $w$ too small due to the large shrinkage effect. Ridge regression gives very poor results: the non-zero feature is estimated too small and the remaining features have large values. Note from fig. 4e that ridge regression yields a non-sparse solution for all values of $\lambda$.
In table 1 we compare the performance of the VG, Lasso and ridge regression on 10 random instances. We see that the VG significantly outperforms the Lasso method and ridge regression both in terms of prediction error, the accuracy of the estimation of the parameters and the number of non-zero parameters.
In the second example (Example 2), we consider the effect of correlations in the input distribution. Following [1], we generate input data from a multivariate Gaussian distribution with covariance matrix $\chi_{ij} = \delta^{|i-j|}$, with $\delta = 0.5$. In addition, we choose multiple features non-zero: $\hat w_i = 1$ for $i = 1, 2, 5, 10, 50$ and $\hat w_i = 0$ otherwise. We use $n = 100$, $\hat\sigma = 1$ and $p/p_v/p_t = 50/50/400$. In table 2 we compare the performance of the VG, Lasso and ridge regression on 10 random instances. We see that the VG significantly outperforms the Lasso and ridge regression both in terms of prediction error and accuracy of the estimated parameters.
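For reference, Example 2's data can be generated as follows (a sketch; the function name and seed handling are ours):

```python
import numpy as np

def make_example2(p, n=100, delta=0.5, sigma=1.0, seed=0):
    """Generate data as in Example 2: Gaussian inputs with covariance
    chi_ij = delta^|i-j| and teacher weights w_i = 1 for features
    1, 2, 5, 10, 50 (1-based), zero otherwise."""
    rng = np.random.default_rng(seed)
    idx = np.arange(n)
    cov = delta ** np.abs(idx[:, None] - idx[None, :])   # chi_ij = delta^|i-j|
    X = rng.multivariate_normal(np.zeros(n), cov, size=p)
    w_true = np.zeros(n)
    w_true[[0, 1, 4, 9, 49]] = 1.0                       # 0-based indices
    y = X @ w_true + sigma * rng.standard_normal(p)
    return X, y, w_true

X, y, w_true = make_example2(p=2000)
```

Adjacent inputs are correlated with strength $\delta = 0.5$, which is exactly the regime where feature selection becomes harder.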
In fig. 5 we show the accuracy of the VG method and the Lasso method as a function of the noise $\hat\sigma$. We generate data according to Example 2 and vary $\hat\sigma^2$ in the range $10^{-4}$ to $10$. Fig. 5 shows the error $\delta w^2$ as defined above. We distinguish three noise domains: for very large noise, both the VG and the Lasso produce errors of $O(1)$ and fail to find the predictive features; for intermediate and low noise levels, the VG is approximately one order of magnitude more accurate than the Lasso; and in the limit of zero noise, the error of the VG keeps decreasing whereas the Lasso error saturates at a constant value.

It is well known that the Lasso method may yield inconsistent results when input variables are correlated. [11] derive a necessary and sufficient condition for consistency, and give a number of examples where the Lasso gives inconsistent results. Their simplest example has three input variables $x_1, x_2, x_3$, where $x_1, x_2, \xi, e$ are independent, normally distributed random variables, $x_3 = \frac{2}{3}x_1 + \frac{2}{3}x_2 + \xi$ and $y = \sum_{i=1}^3 \hat w_i x_i + e$, with $p = 1000$. When $\hat w = (2, 3, 0)$, this example violates the consistency condition. The Lasso and VG solutions for different values of $\lambda$ and $\gamma$, respectively, are shown in fig. 6. The average results over 100 instances of these problems are shown in table 3.

Table 3: Accuracy of Ridge, Lasso and Garrote for Examples 1a and 1b from [11]. $p = p_v = 1000$. Parameters $\lambda$ (Ridge and Lasso) and $\gamma$ (Garrote) optimized through cross validation. $\delta w_{1,2}$ as before; $\max(|w_3|)$ is the maximum over 100 trials of the absolute value of $w_3$. Example 1a is inconsistent for the Lasso and yields much larger errors than the Garrote. Example 1b is consistent and the quality of the Lasso and the Garrote is similar. Ridge regression performs poorly on both examples.
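The consistency condition of [11] (the "irrepresentable condition") can be checked directly for this example. The sketch below is ours; it evaluates the population-covariance version of the condition, assuming unit variance for $x_1, x_2, \xi$:

```python
import numpy as np

def irrepresentable_ok(cov, support, signs):
    """Strong irrepresentable condition (sketch):
    |C_{N,S} C_{S,S}^{-1} sign(w_S)| < 1 elementwise, where S indexes the
    active features and N the inactive ones."""
    S = np.asarray(support)
    N = np.setdiff1d(np.arange(cov.shape[0]), S)
    t = cov[np.ix_(N, S)] @ np.linalg.solve(cov[np.ix_(S, S)],
                                            np.asarray(signs, float))
    return bool(np.all(np.abs(t) < 1.0))

# Population covariance of (x1, x2, x3) with x3 = 2/3 x1 + 2/3 x2 + xi,
# x1, x2, xi independent with unit variance:
C = np.array([[1.0,       0.0,       2.0 / 3.0],
              [0.0,       1.0,       2.0 / 3.0],
              [2.0 / 3.0, 2.0 / 3.0, 8.0 / 9.0 + 1.0]])
```

For $\hat w = (2, 3, 0)$ the condition is violated, since $|2/3 + 2/3| = 4/3 > 1$, so the Lasso cannot recover the support; flipping the sign of one active weight makes the condition hold.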
In fig. 7 we show how the Lasso, ridge regression and the VG behave as a function of the number of samples. The VG is significantly more accurate than the Lasso for almost all values of $p$, as is evident from the $\delta w^2$ and validation errors. This is also evident from the $L_0$ and $L_1$ norms of the solutions found: the Lasso finds a solution with the same (correct) $L_1$ norm as the VG, but spreads it over many more features, as witnessed by the $L_0$ norm.
In fig. 8 we show how the Lasso, ridge regression and the VG behave as a function of the number of features. For the VG we use the dual method described above. The VG has constant quality in terms of the error $\delta w^2$ and close to optimal norms $L_0 = L_1 = 5$. The computation time of the VG scales approximately linearly with $n$. The Lasso is significantly faster, but its quality deteriorates with $n$.

Discussion
In this paper, we have introduced a new model for sparse regression by introducing binary switch variables $s_i$ that multiply the feature weights. In a Bayesian framework one obtains a posterior distribution that is intractable due to the binary nature of $\vec s$. We have proposed to approximate the posterior over $\vec s$ using a variational approximation, and have demonstrated that this approach is efficient and accurate, yielding results that significantly outperform the Lasso approach.
There are many points that need further investigation. The variational approximation can by replaced by any other approximate inference method. It would be interesting to see whether the results could be further improved by using Belief Propagation, the Cluster Variation Method or Monte Carlo Sampling.
We have not explored the use of different priors on $\vec w$ or on $\beta$. In addition, a prior could be imposed on $\gamma$. It is likely that for particular problems a suitable prior could further improve the results.
One of the important advantages of the Lasso over the VG is that the Lasso problem is convex whereas the VG may have local minima, as is illustrated in fig. 1. This is clearly a disadvantage of the VG, but is shared by all other methods that use L 0 regularization. We have not much explored the issue of local minima. For the examples that we presented the annealing approach that results from increasing γ, followed by a 'heating' phase to detect hysteresis, seems to work well. We have noted that for problems with a large number of features (O(n)), small p and high noise this approach may fail. The local minima problem depends on the approximation, and may be more or less severe when using other approximation schemes.
For an efficient implementation one may exploit the fact that the data matrix $x_i^\mu$ is sparse. This may be the case, for instance, in genetic regression problems, where $x_i$ denotes single nucleotide polymorphisms (SNPs) taking values $x_i = 0, 1, 2$. For each $i$ one may code them such that the value $x_i^\mu = 0$ is the most frequent one. In that case, subtracting the mean from the data so that $\sum_\mu x_i^\mu = 0$ would make $x_i^\mu$ non-sparse. The model as presented here should then be slightly generalized to also estimate a constant bias.
In principle, the model Eq. 1, when used in a full Bayesian setting, is non-linear: the predictive distribution $p(y|\vec x, D) = \int d\vec w\, d\beta \sum_{\vec s} p(y|\vec s, \vec w, \beta, \vec x)\, p(\vec s, \vec w, \beta|D)$ is a mixture of $2^n$ Gaussians. Only in the variational/MAP approximation does this model reduce to the linear model Eq. 11. It would be of interest to explore the Bayesian extension of the VG. Note, however, that in our experiments we find that $m_i \approx 0$ or $m_i \approx 1$ most of the time, and we encounter very few intermediate or multi-modal solutions. This suggests that Bayesian integration, at least as far as the binary variables are concerned, may be of limited additional value for the examples considered in this paper.
We have seen that the performance of the VG is excellent also in the zero noise limit. In this limit, the regression problem reduces to a compressed sensing problem [12,13]. The performance of compressed sensing with an $L_q$ sparseness penalty was analyzed theoretically in [14], showing the superiority of the $L_1$ penalty over the $L_2$ penalty and suggesting the optimality of the $L_0$ penalty. Our numerical results are in agreement with this finding.

Appendix A

Since $\rho \le 1$, there is a positive and a negative solution for $p$; the latter we disregard. $p^*$ is a decreasing function of $\rho$. For $p > p^*$, $D > 0$ and we find two solutions for $m$, which we denote by $m_{1,2} = \frac{b \pm \sqrt{D}}{2a}$. Note that the solutions at these critical points only depend on $\rho$ and $p$. For each of these solutions we must find $\gamma$ such that $f(m) = m$.