Bayesian sparse convex clustering via global-local shrinkage priors

Sparse convex clustering is to cluster observations and conduct variable selection simultaneously in the framework of convex clustering. Although the weighted $L_1$ norm as the regularization term is usually employed in the sparse convex clustering, this increases the dependence on the data and reduces the estimation accuracy if the sample size is not sufficient. To tackle these problems, this paper proposes a Bayesian sparse convex clustering via the idea of Bayesian lasso and global-local shrinkage priors. We introduce Gibbs sampling algorithms for our method using scale mixtures of normals. The effectiveness of the proposed methods is shown in simulation studies and a real data analysis.


Introduction
Cluster analysis is an unsupervised learning method that aims to assign observations to several clusters so that similar individuals belong to the same group. It is widely used in the research fields such as biology, genomics, and many other fields of science. In general, conventional clustering methods such as k-means clustering, hierarchical clustering, and Gaussian mixture model are pointed out with instability due to the non-convex optimization.
Convex clustering proposed by Hocking et al. (2011) searches the centers of clusters collectively and allocates each individual to them. Convex relaxation ensures that it achieves a unique global optimization regardless of initialization. The estimates can be obtained by solving a regularization problem, which is similar to sparse regularization in regression analysis. However, the convex clustering does not work well if data contain large amount of irrelevant features.
Sparse regularization is used to exclude irrelevant information. Wang et al. (2018) proposed sparse convex clustering to perform the convex clustering and variable selection simultaneously. The sparse convex clustering estimates sparse models by L 1 norm in addition to regularization term of the convex clustering. Also, Wang et al. (2018) used the L 1 norms as the convex clustering penalties, and then the penalty was assumed to be different weights according to the individual and the feature. However, it is pointed out by Griffin and Brown (2011) that the penalty used in the sparse convex clustering is dependent on the data and may lead to model estimation accuracy degradation when the sample size is small.
Our proposed methods overcome the problem that penalties in the sparse convex clustering depend heavily on weight. Therefore, even when the sample size is small, estimation is possible without depending on the weight. For this reason, we first introduce a Bayesian formulation of sparse convex clustering, and then propose a Bayesian sparse convex clustering based on a global-local (GL) prior distribution. As the GL prior, we consider three types of distributions: a normal-exponential-gamma distribution (Griffin and Brown 2005), a horseshoe distribution (Carvalho et al. 2010), and a Dirichlet-Laplace distribution (Bhattacharya et al. 2015). The Gibbs sampling algorithm for our proposed models is derived by using scale mixtures of normals (Andrews and Mallows 1974).
The rest of this paper is organized as follows. Section 2 devotes the convex clustering method. In Section 3, we propose a sparse convex clustering in a Bayesian formulation. In Section 4, we propose a Bayesian convex clustering with the GL shrinkage prior distributions. The performances of the proposed methods are compared with existing method by conducting Monte Carlo simulation in Section 5 and real data analysis in Section 6. Concluding remarks are given in Section 7.

Preliminaries
In this section, we describe the convex clustering. This is a convex relaxation of clustering such as k-means clustering and hierarchical clustering. The convexity overcomes the instability of conventional clustering methods. In addition, we describe the sparse convex clustering which simultaneously clusters observations and performs variable selection.

Convex clustering
Let X ∈ R n×p be a data matrix with n observations and p variables, and x i (i = 1, 2, · · · , n) be an i-th row of X. The convex clustering for these n observations is formulated by the following minimization problem using an n × p feature matrix A = (a 1 , · · · , a n ) T : where a i is a p dimensional vector corresponding to the x i , · q is the L q norm of a vector and γ (≥ 0) is a regularization parameter. Ifâ i 1 =â i 2 for the estimated valueâ i , the i 1 -th individual and the i 2 -th individual belong to the same cluster. The γ controls the number of same rows ofÂ = (â 1 , · · · ,â n ) T , which is corresponding to the estimated number of clusters. Both k-means clustering and hierarchical clustering are equivalent to considering the L 0 norm for the second term in the problem (1), and hence a non-convex optimization problem occurs (Hocking et al. 2011). The convex clustering can be viewed as a convex relaxation of k-means Among the estimates of the same observation, the lines connect the estimates whose values of the regularization parameter are close.
clustering and hierarchical clustering. The convex relaxation also guarantees to achieve a unique global minimization. Hocking et al. (2011) proposed cluster path to visualize steps of clustering. The cluster path can be regarded as a continuous regularization path (Efron et al. 2004) of the optimal solution formed by changing γ. Figure 1 shows the cluster path of two interlocking half moons in Section 5.1. It shows the relationship between values of the regularization parameter and estimates of the feature vectors a i (i = 1, · · · , n). The estimates exist near the corresponding observations when the value of the regularization parameter is small, while the estimates concentrate on a one point when the value is large. The characteristics of the data can be considered from the grouping order and positional relationship of the estimates.

Sparse convex clustering
In the conventional convex clustering, when irrelevant information is included in the data, the accuracy of estimating clusters tends to be low. On the other hand, the sparse convex clustering (Wang et al. 2018) is an effective method for such data as irrelevant information can be eliminated using sparse estimation.
The sparse convex clustering considers the following optimization problem: where γ 1 (≥ 0) and γ 2 (≥ 0) are regularization parameters, w i 1 ,i 2 (≥ 0) and u j (≥ 0) are weights, q ∈ {1, 2, ∞}, E = {(i 1 , i 2 ); w i 1 ,i 2 = 0, i 1 < i 2 }, and a ·j = (a 1j , · · · , a nj ) T is the column vector of the feature matrix A. The third term imposes a penalty similar to group lasso (Yuan and Lin 2006) and has the effect that â ·j 1 = 0. When â ·j 1 = 0, the j-th column of X is removed from the model. It means variable selection. γ 1 and w i 1 i 2 adjust the cluster size, while γ 2 and u j adjust the number of features. The weight w i 1 ,i 2 plays an important role to impose a penalty adaptively to features. Wang et al. (2018) used a following weight parameter where the ι m i 1 ,i 2 equals 1 if the observation x i 1 is included in m nearest neighbors of the observation x i 2 , and 0 otherwise. This choice of weights works well for a wide range of φ when m is small. In our numerical studies, m is fixed at 5 and φ is fixed at 0.5, as in Wang et al. (2018).
Similar to the adaptive lasso (Zou 2006) in a regression problem, the penalty for the sparse convex clustering can be adjusted flexibly by using weight parameters. However, it is pointed out by Griffin and Brown (2011) that such penalties are strongly dependent on data. Depending on the data, such as the small number of samples, the accuracy of model estimation may be degradationed.

Sparse convex clustering in Bayesian formulation
By extending the sparse convex clustering into a Bayesian formulation, we use the entire posterior distribution to provide a probabilistic measure of uncertainty.

Bayesian sparse convex clustering
In this section, we reformulate the sparse convex clustering in terms of Bayesian approach. Similar to Bayesian lasso (Park and Casella 2008), which extends lasso to a Bayesian formulation, we regard regularized maximum likelihood estimates as MAP estimates.
We consider the following model: where ε is the p dimensional error vector distributed as N p (0 p , σ 2 I p ), a is a feature vector, and σ 2 (> 0) is a variance parameter. Then, the likelihood function is given by Next, we specify the prior distribution of feature matrix A as where λ 1 (> 0), w i 1 ,i 2 (> 0), λ 2 (> 0), u j (> 0) are hyperparameters, E = {(i 1 , i 2 ) : 1 ≤ i 1 < i 2 ≤ n}, and #E is the number of elements in E. λ 1 and λ 2 are corresponding to γ 1 and γ 2 in (2). This prior distribution is an extension of Bayesian group lasso in linear regression models (Xu and Ghosh 2015). The estimate of the sparse convex clustering corresponds to the MAP estimate in the following joint posterior distribution: where π(σ 2 ) is the non-informative scale-invariant prior π(σ 2 ) = 1/2σ 2 or inverse-gamma prior π(σ 2 ) = IG(ν 1 /2, η 0 /2). An inverse-gamma probability density function is given by where ν (> 0) is a shape parameter and η (> 0) is a scale parameter. We obtain estimates of each parameter by performing MCMC algorithm by Gibbs sampling. Therefore, it is necessary to derive the full conditional distribution for each parameter. Because it is difficult to derive full conditional distributions from (4), we provide a hierarchical representation of the prior distribution. To derive the hierarchical representation, the prior distribution π(A|σ 2 ) is rewritten as follows: This is based on the hierarchical representation of the Laplace distribution in the form For details, we refer to Andrews and Mallows (1974). From this relationship, we assume following priors: As a result, it enables us to carry out Bayesian estimation by Gibbs sampling. The details of the sampling procedure are described in Appendix A.1.

Unimodality of joint posterior distribution
In Bayesian modeling, theoretical and computational problems arise when there exist multiple posterior modes. Theoretically, it is doubtful whether a single posterior mean, median, or mode will represent an appropriate summary of the bimodal posterior distribution. The convergence speed of Gibbs sampling is a computational problem. Although it is possible to perform Gibbs sampling, the convergence is practically too slow. Park and Casella (2008) showed that the joint posterior distribution has single peak in Lasso type Bayes sparse modeling. We demonstrate that the joint posterior distribution of (4) is unimodal. Similar to Park and Casella (2008), we use a continuous transformation with a continuous inverse to show the unimodality of the logarithmic concave density.
The logarithm of the posterior (4) is: Consider the transformation defined by which is continuous when 0 < σ 2 < ∞. We define Φ = (φ 1 , · · · , φ n ) T = (φ ·1 , · · · , φ ·p ). The log posterior (6) is transformed by performing variable conversion in the form The second and fifth terms are clearly concave in (Φ, ρ), and the third and fourth terms are a second concave surface in (Φ, ρ). Therefore, if log π(·), which is the logarithm of prior for σ 2 , is concave, (7) becomes concave. Assuming a prior distribution such as inverse gamma distribution (5) for σ 2 , log π(·) is a concave function. Therefore, the whole log posterior distribution is concave.

MAP estimate by weighted posterior means
In Bayesian sparse modeling, unweighted posterior mean is often used as a substitute for MAP estimates, but the accuracy is not high and sometimes it is far from the MAP estimates. Then we introduce weighted posterior mean in this section. We define a vector θ containing all the parameters as follows: where τ i = (τ i1 , · · · , τ in ) and τ = ( τ 1 , · · · , τ p ). For example, θ 1 = a 1 and θ n+1 = τ 1 . In addition, we assume the parameter vector corresponding to the b-th MCMC sample is 2n+2 ), where the range of b is from 1 to B. We introduce weights corresponding to the b-th MCMC sample as follows: where the L(X|θ) is the likelihood function, the π(θ) is the prior,θ l ,θ l+1 , · · · ,θ 2n+2 }, and theθ l is an estimate of θ l . It can be seen that this weight corresponds to the value of the posterior probability according to Bayes' theorem. This weight was also used in the sparsified algorithm by Shimamura et al. (2019).
Using this weight, we obtain the posterior average as follows: . Therefore, we adoptθ l as an estimate of θ l . The performance of this estimate is examined by numerical studies in Sections 5.1.

Bayesian sparse convex clustering via global-local shrinkage priors
Polson and Scott (2010) proposed a global-local (GL) shrinkage prior distribution. Generally speaking, when we use the Laplace prior distribution, it is necessary to pay attention to how to handle contraction for irrelevant parameters and robustness against not irrelevant parameters. The important features of the GL shrinkage prior distribution are that it has peaks at the origin and heavy tails. This features make it possible to handle shrinkage of entire variables and individual variables shrinkage estimated to be zero. Therefore, irrelevant parameters are sparsifed, and not irrelevant ones are robustly estimated. The penalty for the sparse convex clustering has similar characteristics. The penalty of the sparse convex clustering is weighted on individual and feature quantities. This weighted penalty is one of the key factors improving accuracy. However, this penalty has the problem that is highly dependent on data. By using the GL prior distribution, it is possible to properly control the dependence on data by the Bayesian approach. Polson and Scott (2010) formulated the GL scale mixtures of normal for vector a = (a 1 , · · · , a p ) as follows: Each τ 2 j (> 0) is called a local shrinkage parameter and ν (> 0) is called a global shrinkage parameter. This leads to efficient Gibbs sampling based on block update of parameters.
We need to specify the priors π(τ 2 j ) and π(ν 2 ). In the next subsections, we provide some concrete formulations for π(τ 2 j ) and π(ν 2 ). Griffin and Brown (2005) proposed using an NEG distribution as an alternative to a Laplace distribution for prior distribution of regression coefficients. By using an NEG distribution, we can perform more flexible sparse modeling than a Laplace distribution. The NEG density function is given by

NEG prior distribution
where κ = (2 λ λ)/(γ √ π)Γ(λ + 1/2) is a normalization constant, D −2λ−1 is a parabolic cylinder function, and λ (> 0) and γ (> 0) are hyperparameters that control the sparsity of the θ. The parabolic cylinder function is a solution of the second-order linear ordinary differential equation and its integral representation is given by The NEG density function can be expressed as a hierarchical representation NEG (θ|λ, γ) where Exp(·|µ) is a exponential distribution and Ga(·|k, λ) is a gamma distribution. Therefore, the prior distribution of each parameter is as follows: Using the NEG distribution on the feature matrix A, we propose the following prior: By using hierarchical representation of the NEG distribution, the prior distribution π(A|σ 2 ) is decomposed into: This result allows us to develop a Gibbs sampling algorithm for a Bayesian sparse convex clustering with the NEG prior distribution. The details of the algorithm are given in Appendix A.2.

Horseshoe prior distribution
The horseshoe prior distribution was proposed to overcome the unstable sparse estimation of other shrinkage priors. For more details of the horseshoe prior distribution, we refer to Carvalho et al. (2010). The horseshoe density function is given by The prior distribution of each parameter is as follows: Here ν (> 0) is a hyperparameter that controls the sparsity of the θ j , C + (x 0 , γ) is a half Cauchy distribution on the positive reals, where x 0 is a location parameter and γ is a scale parameter. The smaller the value of the hyperparameter ν, the more the number of parameters {θ j } are estimated to be zero. Using the horseshoe distribution on the feature matrix A, we propose the following prior: π(A|σ 2 ) ∝ (σ 2 ) −(#E+p)/2 Hor 1 2σ a ν 1 Hor 1 2σ a ν 2 , where a = ( a i 1 − a i 2 2 ; (i 1 , i 2 ) ∈ E) and a = ( a ·j 2 ; j = 1, · · · , p). By using hierarchical representation of the horseshoe distribution, the prior distribution π(A|σ 2 ) is a variant as follows: Then we can estimate posterior distribution by Gibbs sampling. The details of the algorithm are given in Appendix A.3.

Dirichlet-Laplace prior distribution
The Dirichlet-Laplace prior was proposed to provide simple sufficient conditions for posterior consistency (Bhattacharya et al. 2015). It is known that a Bayesian regression model with this prior distribution has the consistent property of variable selection asymptotically. Also, we can obtain joint posterior distributions for a Bayesian regression model when we employ this prior. This is an advantage because most of prior distributions induce not a joint posterior distribution but a marginal posterior distribution and a joint posterior distribution has more information than a marginal distribution in general. The Dirichlet-Laplace density function is given by where τ = (τ 1 , · · · , τ p ) T . The prior distribution of each parameter is where α (> 0) is a hyperparameter that controls the sparsity of the θ j and Dir(α, · · · , α) is a Dirichlet distribution. The sum of the Dirichlet distribution random variables is one. Also, the mean is E[τ j ] = 1/p and the variance is Var(τ j ) = (p − 1)/{p 2 (pα + 1)}. When the value of α is small, most of the parameters {τ j } are close to zero and the remaining parameters are close to one. If {τ j } is close to zero, {θ j } is also close to zero.
Using the Dirichlet-Laplace distribution on the feature matrix A, we propose the following prior: By using hierarchical representation of the Dirichlet-Laplace distribution, the prior distribution π(A|σ 2 ) is a variant as follows.
Then we can estimate posterior distribution by Gibbs sampling. The details of the algorithm are given in Appendix A.4.

Artificial data analysis
In this section, we conduct numerical studies to evaluate the performance of the proposed methods using artificial data. First, clustering performance is evaluated by a illustrative example that includes no irrelevant features. Next, we evaluate the accuracy of the sparsity by performing simulations including irrelevant features.

Illustrative example
We demonstrated our proposed methods with artificial data. The data were generated according to two interlocking half moons with n = 50 observations, K = 2 clusters, and p = 2 features. Figure 2 shows one example of two interlocking half moons. In this setting, we did not perform sparse estimation. The cluster formation was discussed by comparing cluster paths of each method. For each generated dataset, the estimates were obtained by using 50, 000 iterations of Gibbs sampler. Candidates of the hyperparameters were set by λ min exp{(log λ max − log λ min ) · (i/m)} for i = 1, · · · , m. For the hyperparameter λ in a Bayesian convex clustering by Lapalce prior distribution (Bscvc), we set m = 50, λ min = 0.05, and λ max = 90.0. For the hyperparameters λ 1 and γ 1 in a Bayesian convex clustering by NEG prior distribution (Bsnegcvc). For the hyperparameters λ 1 , we set m = 30, λ min = 0.0001, and λ max = 2.75. For the hyperparameters Figure 2: The two interlocking half moons with n = 50 observations. γ 1 , we set m = 2, λ min = 0.4, and λ max = 0.5. The weighted posterior means introduced in Section 3.3 was used for Bscvc and Bsnegcvc estimates. Figure 3 shows the results. The outline of cluster formation is the same between two methods. The order in which the samples form clusters was also the same. If the distance between estimated feature values of different clusters does not decrease, the accuracy of cluster estimation will improve in the concept of the convex clustering. However, the distance between all features is close due to the effect of sparse regularization. Scvc used weights to bring only features belonging to the same cluster closer. Bsnegcvc used NEG distribution instead of weights. In the cluster path in Figure 3(b), the estimated feature values are merged at a position more further from the origin than other methods. This can be seen especially in the upper right and lower left of the figure. This result shows that the close feature values were merged while the distance between the distant feature values kept. This is a factor that improves the accuracy of Bsnegcvc's clustering estimation.

Simulation studies
We demonstrated our proposed methods with artificial data including irrelevant features. The data were generated according to two interlocking half moons with n = 100 observations, K = 2 clusters, and p = 40 features. The features consists of 38 irrelevant features and 2 relevant features. The irrelevant features were independently generated from N(0, 0.5 2 ). We considered three methods: the sparse convex clustering (Scvc), Bscvc, and Bnegscvc.
As the estimation accuracy, we used the RAND index which is a correctness of cluster estimation. The RAND index ranges between 0 and 1, and a higher value indicates better performance. The RAND index is given by Here C * = {C * 1 , · · · , C * r } is the true clusters set and C = { C 1 , · · · , C s } is the estimated clusters set. In addition, we used the false negative rate (FNR) and the false positive rate (FPR) for the accuracy of sparse estimation: where, {a * j |j = 1, · · · , p} are the true feature vectors and {â j |j = 1, · · · , p} are the estimated feature vectors. Estimated indicators are calculated by 50 times. The setting of the iteration count and the hyperparameter candidate was the same as in Section 5.1. To ensure fair comparisons, we used the results with hyperparameters that maximize the RAND index.
The simulation results are summarized in Table 1. Bscvc and Bnegscvc outperform Scvc in terms of the RAND index and FNR. Because Bscvc has high accuracy in the RAND index, it can be seen that the cluster is accurately estimated. On the other hand, the Bnegscvc has high FNR accuracy, and it can be seen that irrelevant features are correctly reduced.

Application
We applied our proposed methods to a real dataset: the LIBRAS movement data from the Machine Learning Repository (Lichman 2013). The LIBRAS movement dataset has 15 classes. Each classes are divided by the type of the hand movement. There are 24 observations in each class, and each observation has 90 features consisting of hand movement coordinates. In this numerical experiment, 5 classes were randomly selected from 15 classes. This procedure is repeated 25 times to calculate and compare accuracies. Accuracies of each method is evaluated using the RAND index, the estimated number of clusters, and the number of selected features. This is the same as Wang et al. (2018). We computed the mean and standard deviation of each accuracy. As in Section 5.2, we used the results with hyperparameters that maximize the RAND index for comparisons. The results are summarized in Table 2. Bsnegcvc shows the best RAND index. The RAND index of Bsnegcvc is slightly higher than that of other methods. The estimated number of clusters in Scvc is the fewest among all methods. Bsnegcvc selects the smallest number of features.

Conclusion
We proposed the sparse convex clustering in a Bayesian formulation. Using the global-local shrinkage prior distribution, we constructed a Bayesian model to various data with more flexible constraints than the ordinary L 1 type convex clustering. We overcame the problem such that the sparse convex clustering depends on weights in the regularization term. Furthermore, we proposed a weighted posterior mean based on a posteriori probability to provide more accurate MAP estimation. In Section 6, the computational time in our proposed methods is about 20 minutes at each hyperparameter. Using the global-local shrinkage prior increases the computational cost, and hence we need to balance the feasibility of the calculation with the accuracy of the estimation. In numerical experiment, the hyperparameters with the best accuracy were selected as with Wang et al. (2018). It is also interesting to develop information criteria to select the hyperparameters. We leave these interesting topics as future work.

Appendix Formulation of Gibbs sampling
This appendix introduces a specific Gibbs sampling method for a Bayesian sparse convex clustering.

A.1 Bayesian sparse convex clustering
The prior distribution is transformed as follows: The full conditional distribution is obtained as follows: where IGauss(x|µ, λ) denotes the inverse-Gaussian distribution with a density function

A.2 Bayesian NEG sparse convex clustering
The prior distribution is transformed as follows: The full conditional distribution is obtained as follows: