Bayesian fused lasso modeling via horseshoe prior

The Bayesian fused lasso is a sparse Bayesian method that shrinks both regression coefficients and their successive differences simultaneously. In this paper, we propose a Bayesian fused lasso modeling via horseshoe prior. By assuming a horseshoe prior on the differences of successive regression coefficients, the proposed method enables us to prevent over-shrinkage of those differences. We also propose a Bayesian hexagonal operator for regression with shrinkage and equality selection (HORSES) with horseshoe prior, which imposes priors on all pairwise differences of the regression coefficients. Simulation studies and an application to real data show that the proposed method gives better performance than existing methods.


Introduction
Recently, a wide variety of data have come to be used in statistical analysis. In particular, the analysis of high-dimensional data, such as image data and financial data, has taken on added significance. To handle such data, it is important to perform variable selection and variable fusion, which correspond to extracting relevant variables and capturing the group structure of the data, respectively. To this end, sparse regularization methods such as the lasso (Tibshirani 1996), the fused lasso (Tibshirani et al. 2005), and the hexagonal operator for regression with shrinkage and equality selection (HORSES) (Jang et al. 2015) have been proposed. These methods allow us to perform variable selection and variable fusion through the estimation of regression coefficients.
Meanwhile, many Bayesian counterparts of these regularization methods, in which priors on the regression coefficients correspond to the regularization terms, have also been proposed. For example, Park and Casella (2008) proposed Bayesian lasso, which shrinks regression coefficients by assuming that they follow a Laplace distribution. Furthermore, Park and Casella (2008) developed a Gibbs sampler using a hierarchical expression of the Laplace distribution. Kyung et al. (2010) extended Bayesian lasso by assuming Laplace distributions not only on the regression coefficients but also on their successive differences, which is called Bayesian fused lasso.
A Laplace prior tends to shrink its targets, such as regression coefficients and their successive differences, too much. To overcome this problem, the Student-t prior and the normal-exponential-gamma (NEG) distribution (Griffin and Brown 2005), which have heavier tails than a Laplace prior, have also been used. Song and Cheng (2020) proposed using a Student-t prior to construct Bayesian fusion models. Shimamura et al. (2019) proposed Bayesian fused lasso based on the hierarchical expression of an NEG prior. In addition, a horseshoe prior (Carvalho et al. 2010) is also often used instead of a Laplace prior. A horseshoe prior has an infinite spike at zero and a Cauchy-like tail, which leads to weak shrinkage of non-zero elements and strong shrinkage of elements that are exactly zero. Makalic and Schmidt (2015) introduced a linear regression model in which a horseshoe prior is assumed on the regression coefficients and developed a simple Gibbs sampler for it. However, these existing methods assume a horseshoe prior only on the regression coefficients.
In this paper, we propose Bayesian fused lasso modeling with horseshoe prior under the framework of linear regression models. To formulate the Bayesian model, we assume a Laplace prior on the regression coefficients and a horseshoe prior on their successive differences. We also propose Bayesian HORSES with horseshoe prior, where the horseshoe prior is assumed on every pairwise difference of the regression coefficients. We develop a Gibbs sampler for the parameters by using the hierarchical expression of the half-Cauchy prior (Wand et al. 2011) shown by Makalic and Schmidt (2015).
We note that Banerjee (2021) also proposed imposing a horseshoe prior on differences of coefficients. However, Banerjee (2021) used the model assumed in the one-dimensional fused lasso signal approximation of Friedman et al. (2007), which is a special case of a linear regression model. In addition, Banerjee (2021) did not perform variable selection, unlike our proposed method.
The remainder of the paper is organized as follows. Section 2 describes the Bayesian models and introduces sparse Bayesian modeling with horseshoe prior. In Section 3, we propose Bayesian fused lasso and Bayesian HORSES with horseshoe prior, and then develop Gibbs samplers for them. Section 4 presents Monte Carlo simulations and an application to real data that compare our proposed method with existing methods. We conclude in Section 5.

Sparse Bayesian linear regression modeling
In this section, we review Bayesian linear regression, Bayesian lasso, and Bayesian fused lasso. We also describe Bayesian linear regression via horseshoe prior.
Bayesian linear regression
We consider the linear regression model

y = Xβ + ε, ε ~ N_n(0_n, σ^2 I_n), (1)

where y is an n-dimensional response vector, X is an n × p design matrix, β = (β_1, β_2, . . ., β_p)^T is a p-dimensional regression coefficient vector, and ε is an n-dimensional error vector. Here 0_n is an n-dimensional vector of zeros, σ^2 is an error variance, and I_n is an n × n identity matrix. Without loss of generality, we suppose that the response variable is centered and the explanatory variables are standardized as follows:

Σ_{i=1}^n y_i = 0, Σ_{i=1}^n x_{ij} = 0, Σ_{i=1}^n x_{ij}^2 = n (j = 1, 2, . . ., p). (2)

Then the likelihood function is given by

f(y | X, β, σ^2) ∝ (σ^2)^{-n/2} exp{ -(y - Xβ)^T (y - Xβ) / (2σ^2) }.

Bayesian lasso
Tibshirani (1996) proposed the lasso, which performs parameter estimation and variable selection simultaneously from a frequentist viewpoint. He also mentioned that the lasso solution is identical to the posterior mode obtained by imposing the Laplace distribution on the parameter vector β as its prior.
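As a concrete illustration, the centering and standardization convention in (2) can be sketched in NumPy; the function name and random data below are purely illustrative:

```python
import numpy as np

def standardize(X, y):
    """Center the response and scale each column of X so that
    sum_i x_ij = 0 and sum_i x_ij^2 = n, matching the convention in (2)."""
    n = X.shape[0]
    y_centered = y - y.mean()
    X_centered = X - X.mean(axis=0)
    # scale each column so its squared norm equals n
    X_scaled = X_centered / np.sqrt((X_centered ** 2).sum(axis=0) / n)
    return X_scaled, y_centered

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
X_s, y_c = standardize(X, y)
```

After this transformation, every column of `X_s` sums to zero and has squared norm n = 50, and `y_c` has mean zero.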
Based on the perspective of Tibshirani (1996), Park and Casella (2008) established a Bayesian estimation procedure for the lasso, called Bayesian lasso. Bayesian lasso considers a conditional Laplace prior of the form

π(β | σ^2) = Π_{j=1}^p (λ / (2√(σ^2))) exp( -λ|β_j| / √(σ^2) ), (5)

where λ (> 0) is a hyper-parameter. Conditioning β on σ^2 makes the posterior distribution unimodal (for example, see Appendix A in Park and Casella (2008)). The prior distribution in (5) can be rewritten as

(λ / (2√(σ^2))) exp( -λ|β_j| / √(σ^2) ) = ∫_0^∞ N(β_j | 0, σ^2 τ_j^2) · (λ^2 / 2) exp(-λ^2 τ_j^2 / 2) dτ_j^2 (6)

by using a scale mixture of normals (Andrews and Mallows 1974). This equation means that the Laplace distribution is represented as the convolution of the following two distributions: a normal distribution N(β_j | 0, σ^2 τ_j^2) on the coefficient and an exponential distribution with rate λ^2/2 on the latent scale τ_j^2. For the parameter σ^2, the improper prior distribution π(σ^2) ∝ 1/σ^2 or any inverse gamma distribution is assumed. Based on the likelihood and the prior distributions, a Gibbs sampler for Bayesian lasso is developed. We omit the Gibbs sampler.
For details, we refer the reader to Park and Casella (2008).
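As a quick numerical check of the scale-mixture representation, one can draw the latent scales from the exponential distribution and verify that the resulting marginal draws behave like a Laplace sample; the values of λ and σ below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, sigma = 2.0, 1.0

# tau_j^2 ~ Exp(rate = lam^2 / 2), i.e. mean 2 / lam^2
tau2 = rng.exponential(scale=2.0 / lam**2, size=200_000)
# beta_j | sigma^2, tau_j^2 ~ N(0, sigma^2 * tau_j^2)
beta = rng.normal(0.0, sigma * np.sqrt(tau2))

# Marginally, beta follows a Laplace distribution with scale sigma / lam,
# whose variance is 2 * sigma^2 / lam^2 = 0.5 here, with kurtosis 6.
print(beta.var())
```

The empirical variance matches the Laplace variance 2σ^2/λ^2, and the empirical kurtosis is close to 6 rather than the normal value 3, confirming the heavier tail of the mixture.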

Bayesian fused lasso
The fused lasso (Tibshirani et al. 2005) encourages sparsity in both the coefficients and their successive differences. Kyung et al. (2010) proposed Bayesian fused lasso as a Bayesian counterpart of the fused lasso. Bayesian fused lasso assumes a prior distribution for β of the form

π(β | σ^2) ∝ exp{ -(λ_1 / √(σ^2)) Σ_{j=1}^p |β_j| - (λ_2 / √(σ^2)) Σ_{j=2}^p |β_j - β_{j-1}| }, (7)

where λ_1 and λ_2 are positive hyper-parameters. Similar to Bayesian lasso, a scale mixture of normals is applied, so that the prior distribution (7) is transformed into a hierarchical form: given exponentially distributed latent scales for the coefficients and for the successive differences, β follows a multivariate normal distribution. Using this hierarchical relationship, Kyung et al. (2010) developed a Gibbs sampler for Bayesian fused lasso.
To perform fully Bayesian estimation, Kyung et al. (2010) further assumed gamma distributions for the hyper-parameters λ_1 and λ_2 as

λ_1^2 ~ Ga(r_1, δ_1), λ_2^2 ~ Ga(r_2, δ_2),

where r_1, r_2, δ_1, and δ_2 are positive hyper-parameters. Here, the probability density function of the gamma distribution is given by

Ga(x | m, c) = (c^m / Γ(m)) x^{m-1} exp(-cx) (x > 0),

where m is a shape parameter and c is a rate parameter, both taking positive values. In Kyung et al. (2010), r_1 = 1, r_2 = 1, δ_1 = 10, and δ_2 = 10 are used because the gamma distribution is relatively flat with these parameter values. We omit the full conditional posteriors and the Gibbs samplers. For details, we refer the reader to Kyung et al. (2010).
Next, we explain HORSES by Jang et al. (2015). HORSES was proposed as an extension of the fused lasso; it imposes an L_1 penalty on all pairwise differences of the regression coefficients. In the Bayesian framework, this corresponds to assuming a Laplace-type prior of the form

π(β | σ^2) ∝ exp{ -(λ_1 / √(σ^2)) Σ_{j=1}^p |β_j| - (λ_2 / √(σ^2)) Σ_{j<k} |β_j - β_k| }

for the regression coefficients β. Note that HORSES is also known as generalized fused lasso (She 2010).
The Laplace distribution shrinks all of the regression coefficients to the same extent. Shimamura et al. (2019) proposed Bayesian fused lasso and Bayesian HORSES with NEG prior. This method assumes a Laplace prior on the regression coefficients and an NEG prior on their differences. Because an NEG prior has two properties, a spike at zero and an extremely flat tail, the method with an NEG prior has the advantage that exactly identical regression coefficients tend to be estimated as identical, while different regression coefficients tend to be estimated as different.

Bayesian linear regression model with horseshoe prior
Makalic and Schmidt (2015) proposed the following Bayesian linear regression model:

y | X, β, σ^2 ~ N_n(Xβ, σ^2 I_n),
β_j | λ_j, τ, σ^2 ~ N(0, λ_j^2 τ^2 σ^2) (j = 1, 2, . . ., p),
λ_j ~ C+(0, 1), τ ~ C+(0, 1). (11)

Here, C+(0, a) (a > 0) is a half-Cauchy distribution, which has the following density function:

p(x | a) = 2 / {π a (1 + (x/a)^2)} (x > 0).

The hierarchy of priors in (11) represents the horseshoe prior proposed in Carvalho et al. (2010). In the model with horseshoe prior, half-Cauchy prior distributions are assumed on the hyper-parameters λ_j and τ. The hyper-parameter λ_j adjusts the level of local shrinkage for the regression coefficient β_j, while the hyper-parameter τ determines the degree of global shrinkage for all regression coefficients. Owing to these two types of hyper-parameters, the horseshoe prior simultaneously enjoys a heavy tail and an infinitely tall spike at zero. These properties cause exactly zero regression coefficients to be shrunk strongly toward zero, while non-zero regression coefficients are left almost unshrunk.
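The behavior of the local scales can be seen through the shrinkage weight κ_j = 1/(1 + λ_j^2), a standard device in Carvalho et al. (2010): under λ_j ~ C+(0, 1), κ_j follows a Beta(1/2, 1/2) distribution, whose U shape (the "horseshoe") places most mass near complete shrinkage (κ ≈ 1) and near no shrinkage (κ ≈ 0). A small simulation confirms this:

```python
import numpy as np

rng = np.random.default_rng(2)

# lambda_j ~ C+(0, 1) is the absolute value of a standard Cauchy draw
lam = np.abs(rng.standard_cauchy(size=200_000))

# Shrinkage weight: kappa near 1 => coefficient shrunk to zero,
# kappa near 0 => coefficient left (almost) unshrunk.
kappa = 1.0 / (1.0 + lam**2)

counts, _ = np.histogram(kappa, bins=10, range=(0.0, 1.0))
print(kappa.mean())  # mean of Beta(1/2, 1/2) is 1/2
```

The histogram counts pile up in the outermost bins, reproducing the U shape that gives the prior its name.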
To develop a Gibbs sampler for the parameters, Makalic and Schmidt (2015) used a hierarchical expression of the half-Cauchy distribution (Wand et al. 2011): x follows C+(0, A) when x^2 and a have the following priors:

x^2 | a ~ IG(1/2, 1/a), a ~ IG(1/2, 1/A^2), (12)

where A is a positive constant and IG(α, β) denotes the inverse gamma distribution with shape α and scale β. Using (12), the priors of the model (11) can be expressed as follows:

β_j | λ_j^2, τ^2, σ^2 ~ N(0, λ_j^2 τ^2 σ^2),
λ_j^2 | ν_j ~ IG(1/2, 1/ν_j), τ^2 | ξ ~ IG(1/2, 1/ξ),
ν_1, . . ., ν_p, ξ ~ IG(1/2, 1). (13)

We omit the full conditional posteriors and the Gibbs samplers. For details, we refer the reader to Makalic and Schmidt (2015) and Nalenz and Villani (2018).
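The hierarchical expression of the half-Cauchy distribution is easy to verify by simulation: composing the two inverse gamma draws reproduces a half-Cauchy sample. The constant A below is arbitrary, and the inverse gamma sampler is built from a gamma draw:

```python
import numpy as np

rng = np.random.default_rng(3)
A = 2.5
n_draws = 200_000

def rinvgamma(shape, scale, size):
    """Draw from InverseGamma(shape, scale): if G ~ Gamma(shape, 1), then scale / G."""
    return scale / rng.gamma(shape, 1.0, size)

# Wand et al. (2011): a ~ IG(1/2, 1/A^2), x^2 | a ~ IG(1/2, 1/a)  =>  x ~ C+(0, A)
a = rinvgamma(0.5, 1.0 / A**2, n_draws)
x2 = rinvgamma(0.5, 1.0 / a, n_draws)
x = np.sqrt(x2)

# The median of a half-Cauchy C+(0, A) distribution is exactly A.
print(np.median(x))
```

The sample median closely matches A, as expected for C+(0, A); moments are not a useful check here because the Cauchy tail has no finite mean.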

Proposed method
In this section, we propose a Bayesian linear regression modeling procedure that assumes the horseshoe prior on the successive differences of the regression coefficients. We also extend this approach to HORSES.

Bayesian fused lasso with horseshoe prior
We propose assuming a Laplace prior on the regression coefficients and a horseshoe prior on their successive differences as follows:

π(β | σ^2, λ_1, λ_2, . . ., λ_p, τ) ∝ Π_{j=1}^p exp( -λ_1 |β_j| / √(σ^2) ) × Π_{j=2}^p N(β_j - β_{j-1} | 0, λ_j^2 τ^2 σ^2),
λ_j ~ C+(0, 1) (j = 2, . . ., p), τ ~ C+(0, 1). (14)

By assuming the prior (14), small differences between successive regression coefficients are shrunk heavily, while large differences are shrunk only slightly. Using a scale mixture of normals (Andrews and Mallows 1974) for the Laplace factors, the prior (14) can be expressed as follows:

β | σ^2, τ_1^2, . . ., τ_p^2, λ_2^2, . . ., λ_p^2, τ^2 ~ N_p(0_p, σ^2 B). (15)

Here, the inverse of the matrix B is the tridiagonal matrix represented by

(B^{-1})_{jj} = 1/τ_j^2 + 1/(λ_j^2 τ^2) + 1/(λ_{j+1}^2 τ^2), (B^{-1})_{j,j+1} = (B^{-1})_{j+1,j} = -1/(λ_{j+1}^2 τ^2), (16)

with the convention 1/(λ_1^2 τ^2) = 1/(λ_{p+1}^2 τ^2) = 0. The detailed calculation of (15) is given in Appendix A. Therefore, the priors on β, τ_1^2, . . ., τ_p^2, τ^2, λ_2^2, . . ., λ_p^2, ν_2, . . ., ν_p, and ξ are given by

β | σ^2, τ_1^2, . . ., τ_p^2, λ_2^2, . . ., λ_p^2, τ^2 ~ N_p(0_p, σ^2 B),
τ_j^2 ~ EXP(λ_1^2 / 2) (j = 1, 2, . . ., p),
λ_j^2 | ν_j ~ IG(1/2, 1/ν_j) (j = 2, . . ., p), τ^2 | ξ ~ IG(1/2, 1/ξ),
ν_2, . . ., ν_p, ξ ~ IG(1/2, 1), (17)

where EXP(x | d) is an exponential prior with density function EXP(x | d) = d exp(-dx) (x > 0) and d is positive. In addition, we assume an inverse gamma prior on σ^2 and a gamma prior on λ_1^2. By using the likelihood and the priors for the parameters, we can obtain the full conditional distributions of all parameters in closed form; by sampling from them in turn, we can perform the Gibbs sampling. We omit the explicit forms of the full conditional distributions.
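As a sketch of the central step of such a Gibbs sampler, the code below builds a tridiagonal prior precision matrix of the kind that arises from the fused-lasso scale-mixture algebra (Kyung et al. 2010) and draws β from the resulting multivariate normal full conditional. All function and variable names are illustrative, and this is a sketch under those assumptions rather than the paper's exact sampler:

```python
import numpy as np

def prior_precision(tau2, lam2, tau2_global):
    """Tridiagonal prior precision (B^{-1} in the paper's notation), combining
    Laplace mixing scales tau2[j] on beta_j and horseshoe scales
    tau2_global * lam2[j] on the successive differences (a sketch)."""
    p = len(tau2)
    d = tau2_global * lam2                 # variances of the p - 1 differences
    P = np.diag(1.0 / tau2)
    for j in range(p - 1):                 # difference beta_{j+1} - beta_j
        P[j, j] += 1.0 / d[j]
        P[j + 1, j + 1] += 1.0 / d[j]
        P[j, j + 1] -= 1.0 / d[j]
        P[j + 1, j] -= 1.0 / d[j]
    return P

def draw_beta(X, y, P, sigma2, rng):
    """Full conditional of beta: N(A^{-1} X'y, sigma^2 A^{-1}) with A = X'X + P."""
    A = X.T @ X + P
    A_inv = np.linalg.inv(A)
    return rng.multivariate_normal(A_inv @ X.T @ y, sigma2 * A_inv)

rng = np.random.default_rng(5)
p = 6
P = prior_precision(np.ones(p), np.ones(p - 1), 1.0)
X = rng.normal(size=(40, p))
y = X @ np.r_[np.zeros(3), np.ones(3)] + rng.normal(0.0, 0.5, size=40)
beta = draw_beta(X, y, P, 0.25, rng)
```

The precision matrix is symmetric positive definite by construction (a positive diagonal term plus a sum of rank-one difference penalties), so the normal draw is always well defined.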

Bayesian HORSES with horseshoe prior
Next, we propose assuming a Laplace prior on the regression coefficients and a horseshoe prior on all pairwise differences of the coefficients as follows:

π(β | σ^2, λ_1, {λ_{j,k}}, τ) ∝ Π_{j=1}^p exp( -λ_1 |β_j| / √(σ^2) ) × Π_{j<k} N(β_j - β_k | 0, λ_{j,k}^2 τ^2 σ^2),
λ_{j,k} ~ C+(0, 1) (j < k), τ ~ C+(0, 1). (18)

Therefore, the priors on β, τ_1^2, . . ., τ_p^2, τ^2, λ_{1,2}^2, . . ., λ_{p-1,p}^2, ν_{1,2}, . . ., ν_{p-1,p}, and ξ are given by

β | σ^2, {τ_j^2}, {λ_{j,k}^2}, τ^2 ~ N_p(0_p, σ^2 B),
τ_j^2 ~ EXP(λ_1^2 / 2) (j = 1, 2, . . ., p),
λ_{j,k}^2 | ν_{j,k} ~ IG(1/2, 1/ν_{j,k}), τ^2 | ξ ~ IG(1/2, 1/ξ),
ν_{j,k}, ξ ~ IG(1/2, 1), (19)

where λ_{j,k}^2 = λ_{k,j}^2, ν_{j,k}^2 = ν_{k,j}^2, and the (i, j)-element of B^{-1} is represented as

(B^{-1})_{ii} = 1/τ_i^2 + Σ_{k≠i} 1/(λ_{i,k}^2 τ^2), (B^{-1})_{ij} = -1/(λ_{i,j}^2 τ^2) (i ≠ j). (20)

By assuming an inverse gamma prior on σ^2 in the model (18), the full conditional distributions of all parameters are obtained in closed form, and we can perform the Gibbs sampling for Bayesian HORSES. Note that the hyper-parameter τ^2 is treated as a tuning parameter; its value is selected by a model selection criterion such as the widely applicable information criterion (WAIC) (Watanabe 2010).
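Since τ^2 is selected by WAIC, a sketch of how WAIC can be computed from posterior draws may be helpful. The Gaussian log-likelihood matches the linear regression model above; the interface (arrays of posterior draws) is an illustrative assumption:

```python
import numpy as np

def waic(y, X, beta_draws, sigma2_draws):
    """WAIC (Watanabe 2010) from S posterior draws (beta_draws: S x p,
    sigma2_draws: length S) under the Gaussian linear regression likelihood."""
    mu = beta_draws @ X.T                        # (S, n) fitted means
    s2 = sigma2_draws[:, None]                   # (S, 1) error variances
    log_lik = -0.5 * (np.log(2.0 * np.pi * s2) + (y - mu) ** 2 / s2)
    # pointwise log predictive density via a numerically stable log-mean-exp
    m = log_lik.max(axis=0)
    lppd = (m + np.log(np.exp(log_lik - m).mean(axis=0))).sum()
    p_waic = log_lik.var(axis=0, ddof=1).sum()   # effective number of parameters
    return -2.0 * (lppd - p_waic)

rng = np.random.default_rng(6)
n, p, S = 40, 4, 500
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 1.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(0.0, 0.5, size=n)
good = waic(y, X, beta_true + 0.05 * rng.normal(size=(S, p)), np.full(S, 0.25))
bad = waic(y, X, beta_true + 5.0 + 0.05 * rng.normal(size=(S, p)), np.full(S, 0.25))
```

A posterior concentrated near the true coefficients yields a much smaller WAIC than one centered far away, which is the behavior exploited when comparing candidate values of τ^2.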

Numerical studies
In this section, we compare the proposed method with existing methods through Monte Carlo simulations and show its effectiveness.In addition, we apply the proposed method to soil data (Bondell and Reich 2008).

Monte Carlo simulation
We conducted Monte Carlo simulations with artificial data generated from the true model

y = Xβ* + ε,

where β* is the p-dimensional true coefficient vector and the error vector ε is distributed as N_n(0_n, σ^2 I_n). In addition, x_i (i = 1, 2, . . ., n) is distributed according to the multivariate normal distribution N_p(0_p, Σ), where Σ_ij denotes the (i, j)-th element of Σ; the correlation structure Σ differs across the four cases considered. For each case, we considered σ = 0.5, 1.5. We simulated Cases 1 and 2 with β*_1 = (0.0_5^T, 1.0_5^T, 0.0_5^T, 1.0_5^T)^T and β*_2 = (0.0_5^T, 2.0_5^T, 0.0_5^T, 2.0_5^T)^T, where a_5 denotes a five-dimensional vector whose elements all equal a. We considered n = 50 for Cases 1 and 2 and n = 30, 50 for Cases 3 and 4; therefore, Cases 1 and 2 correspond to n > p, whereas Cases 3 and 4 correspond to n = p and n < p. We simulated 100 datasets for each case. Cases 1 and 2 follow example 1 in Shen and Huang (2010), whereas Cases 3 and 4 follow examples 2 and 3, respectively, in the same reference. For each case, the Gibbs sampling was run for 5,000 iterations, of which the first 2,000 were discarded as burn-in.
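The data-generating process can be sketched as follows. The correlation matrix used here is an illustrative AR(1)-type choice, an assumption for the sketch rather than one of the paper's case-specific Σ matrices:

```python
import numpy as np

def simulate(n, beta_star, Sigma, sigma, rng):
    """Generate one dataset from the true model y = X beta* + eps,
    with rows of X ~ N_p(0, Sigma) and eps ~ N_n(0, sigma^2 I_n)."""
    p = len(beta_star)
    L = np.linalg.cholesky(Sigma)          # correlate the rows of X
    X = rng.normal(size=(n, p)) @ L.T
    y = X @ beta_star + rng.normal(0.0, sigma, size=n)
    return X, y

rng = np.random.default_rng(4)
p = 20
beta_star = np.r_[np.zeros(5), np.ones(5), np.zeros(5), np.ones(5)]  # beta*_1
# AR(1)-type correlation, Sigma_ij = 0.5^{|i-j|} (illustrative assumption)
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X, y = simulate(50, beta_star, Sigma, sigma=0.5, rng=rng)
```

Repeating this call 100 times with fresh seeds produces the replicated datasets over which the error measures below are averaged.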
We compared Bayesian fused lasso with horseshoe prior (BFH) to Bayesian fused lasso (BFL) and Bayesian fused lasso with NEG prior (BFNEG).For BFNEG, the shape parameter in the Gamma distribution in the NEG prior was set to 0.5 according to the simulation study in Griffin and Brown (2011), while the rate parameter was selected by WAIC.
To evaluate the accuracy of the estimation of regression coefficients, we used the mean squared error (MSE):

MSE = (1/100) Σ_{k=1}^{100} ||β̂(k) - β*||^2,

where β̂(k) = (β̂_1(k), . . ., β̂_p(k))^T is the regression coefficient vector estimated from the k-th dataset. We also computed MSE_diff, expressed as

MSE_diff = (1/100) Σ_{k=1}^{100} ||β̂_diff(k) - β*_diff||^2,

where β*_diff is a vector of the non-zero successive differences of the true regression coefficients and β̂_diff(k) is the estimated value of β*_diff from the k-th dataset. MSE_diff assesses how close the non-zero differences of the estimated regression coefficients are to the true differences. For example, when the regression coefficients are given by β* = (0.0_5^T, 2.0_5^T, 0.0_5^T, 2.0_5^T)^T, the non-zero successive differences are those between the 5th and 6th, the 10th and 11th, and the 15th and 16th elements of β*, so that β*_diff = (2.0, -2.0, 2.0)^T. In addition, we computed the prediction squared error (PSE) to evaluate the accuracy of prediction.

The results are summarized in Tables 1, 2, 3, and 4. The proposed method BFH shows smaller MSEs and PSEs than BFL in almost all cases. This indicates that BFH outperformed BFL not only when n > p but also when n = p and n < p. In addition, BFH gives smaller values of MSE_diff than BFL in almost all cases, because BFH does not shrink the non-zero differences of regression coefficients as strongly as BFL does. Furthermore, in comparison to BFNEG, BFH gives smaller values of MSE in 15 cases, PSE in 13 cases, and MSE_diff in 11 cases out of the 16 cases. BFH worked better than BFNEG especially when successive regression coefficients were more highly correlated than other pairs of regression coefficients. These results show that BFH estimates the true regression coefficients more closely and captures the non-zero differences of regression coefficients better than the existing methods.
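The two error measures can be sketched directly. The exact normalization (averaging the squared error over datasets without dividing further by the dimension) is an assumption of this sketch:

```python
import numpy as np

def mse(beta_hats, beta_star):
    """MSE = (1/K) sum_k || beta_hat(k) - beta* ||^2 over K datasets."""
    return np.mean(np.sum((beta_hats - beta_star) ** 2, axis=1))

def mse_diff(beta_hats, beta_star):
    """Same criterion restricted to the non-zero successive differences of beta*."""
    d_star = np.diff(beta_star)
    idx = np.nonzero(d_star)[0]            # positions of the true jumps
    d_hat = np.diff(beta_hats, axis=1)[:, idx]
    return np.mean(np.sum((d_hat - d_star[idx]) ** 2, axis=1))

beta_star = np.array([0.0, 0.0, 2.0, 2.0])
beta_hats = np.array([[0.0, 0.0, 2.0, 2.0],
                      [0.0, 0.0, 3.0, 2.0]])
print(mse(beta_hats, beta_star), mse_diff(beta_hats, beta_star))  # -> 0.5 0.5
```

Here only the jump between the 2nd and 3rd elements is non-zero, so MSE_diff compares a single estimated difference per dataset with the true value 2.0.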

Application
We applied Bayesian HORSES with horseshoe prior in Section 3.2 to the Appalachian Mountains Soil Data, which were analyzed in Bondell and Reich (2008) and Jang et al. (2015). This dataset is available from https://blogs.unimelb.edu.au/howard-bondell/#tab25 and was used to examine the relationship between soil characteristics and rich-cove forest diversity. The dataset was collected at twenty 500 m^2 plots in the Appalachian Mountains. Forest diversity, represented as the number of different plant species, is used as the response variable, and 15 soil characteristics measured at the 20 plots are used as explanatory variables. The data are the averages of five equally spaced measurements in each plot. We standardized the soil data before the analysis.
We compared Bayesian HORSES with horseshoe prior (BHH) to Bayesian HORSES (BH) and Bayesian HORSES with NEG prior (BHNEG). For this application, we chose the hyper-parameter λ_2^2 for BH from five candidates between 10^{-4} and 10^{-2}, τ^2 for BHH from five candidates between 10^4 and 10^6, and γ_2^2 for BHNEG from five candidates between 1 and 10. We set the hyper-parameter λ_2 for BHNEG to 0.5. We performed leave-one-out cross-validation to assess the performance of the models. In each estimation, the Gibbs sampling was run for 10,000 iterations, of which 5,000 were discarded as burn-in. The mean values of the cross-validation error, CV, are summarized in Table 5.
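The leave-one-out loop itself is simple to sketch. Below, `fit_predict` is a hypothetical stand-in for one full Gibbs run followed by a posterior-mean prediction; it is replaced by ordinary least squares here so that the sketch is runnable:

```python
import numpy as np

def loo_cv(X, y, fit_predict):
    """Leave-one-out CV: fit on the other n - 1 plots, predict the held-out one,
    and return the mean squared prediction error."""
    n = len(y)
    errs = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        errs[i] = (y[i] - fit_predict(X[mask], y[mask], X[i])) ** 2
    return errs.mean()

def ls_fit_predict(X_tr, y_tr, x_te):
    # stand-in for the Bayesian fit: ordinary least squares prediction
    coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return x_te @ coef

rng = np.random.default_rng(7)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -1.0, 0.0]) + rng.normal(0.0, 0.1, size=20)
cv = loo_cv(X, y, ls_fit_predict)
```

With n = 20 plots, this matches the setting of the application: each estimation uses 19 observations, and CV averages the 20 squared prediction errors.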
From Table 5, the value of CV for our proposed method is smaller than that for BH. BHNEG gives the smallest value of CV but the largest standard deviation. On the other hand, our proposed method gives the smallest standard deviation and the second smallest value of CV. Table 6 shows the regression coefficients estimated from all 20 samples. Note that the hyper-parameters were selected by WAIC. The scales of the estimated values differ from those in Jang et al. (2015), but we observe that BHH captures almost the same group structure as found in Jang et al. (2015). Considering these results, BHH gives relatively stable estimates that capture the group structure of the variables compared to the existing methods.

Conclusions
We proposed Bayesian fused lasso modeling with horseshoe prior, and developed a Gibbs sampler for the parameters by using a scale mixture of normals and a hierarchical expression of the half-Cauchy prior. In addition, we extended the method to Bayesian HORSES. Through numerical studies, we showed that our proposed method is superior to the existing methods in terms of prediction and estimation accuracy.
In Bayesian HORSES with horseshoe prior, we select the value of the global shrinkage parameter τ^2 by WAIC. It would be interesting to assume a proper prior on τ^2 instead. Considering how to accelerate the convergence of the Gibbs sampling for our proposed method would also be worthwhile. We leave these topics as future work.

Table 5 .
: Results for analysis of Appalachian Mountains Soil Data.

Table 6 .
: Estimated values of regression coefficients in analysis of Appalachian Mountains Soil Data.