Robust Bayesian model selection for variable clustering with the Gaussian graphical model

Variable clustering is important for exploratory analysis. However, only a few dedicated methods for variable clustering with the Gaussian graphical model have been proposed. Worse, small insignificant partial correlations due to noise can dramatically change the clustering result when evaluating, for example, with the Bayesian information criterion (BIC). In this work, we address this issue by proposing a Bayesian model that accounts for negligibly small, but not necessarily zero, partial correlations. Based on our model, we propose to evaluate a variable clustering result using the marginal likelihood. To address the intractable calculation of the marginal likelihood, we propose two solutions: one based on a variational approximation and another based on MCMC. Experiments on simulated data show that the proposed method is similarly accurate as BIC in the no-noise setting, but considerably more accurate when there are noisy partial correlations. Furthermore, on real data the proposed method provides clustering results that are intuitively sensible, which is not always the case when using BIC or its extensions.


Introduction
The Gaussian graphical model (GGM) has become an invaluable tool for detecting partial correlations between variables. Assuming the variables are jointly drawn from a multivariate normal distribution, the sparsity pattern of the precision matrix reveals which pairs of variables are independent given all other variables [Anderson, 2004]. In particular, we can find clusters of variables that are mutually independent by grouping the variables according to their entries in the precision matrix.
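As a small illustration (our own, not from the paper), partial correlations can be read off a precision matrix Ω via ρ_ij = −Ω_ij / √(Ω_ii Ω_jj), so zero cross-block entries identify mutually independent groups:

```python
import numpy as np

def partial_correlations(Omega):
    """Partial correlation matrix from a precision matrix Omega:
    rho_ij = -Omega_ij / sqrt(Omega_ii * Omega_jj)."""
    d = np.sqrt(np.diag(Omega))
    R = -Omega / np.outer(d, d)
    np.fill_diagonal(R, 1.0)
    return R

# Block-diagonal precision matrix: variable 2 is independent of 0 and 1
# given the rest, so its cross partial correlations are exactly zero.
Omega = np.array([[2.0, 0.5, 0.0],
                  [0.5, 2.0, 0.0],
                  [0.0, 0.0, 1.0]])
R = partial_correlations(Omega)
```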
However, in practice, it can be difficult to find a meaningful clustering due to noise in the partial correlations. The noise can be due to sampling, which is in particular the case when the number of observations n is small, or due to small non-zero partial correlations in the true precision matrix that might be considered insignificant. In this work, we are particularly interested in the latter type of noise. In the extreme, small partial correlations might lead to a connected graph of variables, where no grouping of variables can be identified. For an exploratory analysis such a result might not be desirable.
As an alternative, we propose to find a clustering of variables such that the partial correlation between two variables in different groups is negligibly small, but not necessarily zero. The open question, which we try to address here, is whether there is a principled model selection criterion for this scenario.
For example, the Bayesian information criterion (BIC) [Schwarz, 1978] is a popular model selection criterion for the Gaussian graphical model. However, in the noisy setting it does not have any formal guarantees. As a solution, we propose here a Bayesian model that explicitly accounts for small partial correlations between variables in different clusters.
Under our proposed model, the marginal likelihood of the data can then be used to identify the correct clustering (if there is a ground truth, in theory), or at least a meaningful clustering that helps the analysis (in practice). The marginal likelihood of our model does not have an analytic solution. Therefore, we provide two approximations: the first is a variational approximation, the second is based on MCMC.
Experiments on simulated data show that the proposed method is similarly accurate as BIC in the no-noise setting, but considerably more accurate when there are noisy partial correlations. The proposed method also compares favorably to two previously proposed methods for variable clustering and model selection, namely the Clustered Graphical Lasso (CGL) [Tan et al., 2015] and the Dirichlet Process Variable Clustering (DPVC) [Palla et al., 2012] method.
Our paper is organized as follows. In Section 2, we discuss previous work related to variable clustering and model selection. In Section 3, we introduce a basic Bayesian model for evaluating variable clusterings, which we then extend in Section 4 to handle noise on the precision matrix. For the proposed model, the calculation of the marginal likelihood is infeasible, and we describe our approximation strategy in Section 5. Since enumerating all possible clusterings is intractable, we describe in Section 6 a heuristic based on spectral clustering to limit the number of candidate clusterings. We evaluate the proposed method on synthetic and real data in Sections 7 and 8, respectively. Finally, we discuss our findings in Section 9.

Related Work
Finding a clustering of variables is equivalent to finding an appropriate block structure of the covariance matrix. Recently, Tan et al. [2015] and Devijver and Gallopin [2016] suggested detecting block-diagonal structure by thresholding the absolute values of the covariance matrix. Their methods perform model selection using the mean squared error of randomly left-out elements of the covariance matrix [Tan et al., 2015], or a slope heuristic [Devijver and Gallopin, 2016].
Several Bayesian latent variable models have also been proposed for this task [Marlin and Murphy, 2009, Sun et al., 2014, Palla et al., 2012]. Each clustering, including the number of clusters, is either evaluated using the variational lower bound [Marlin and Murphy, 2009], or by placing a Dirichlet process prior over clusterings [Palla et al., 2012, Sun et al., 2014]. However, all of the above methods assume that the partial correlations of variables across clusters are exactly zero.
An exception is the work in [Marlin et al., 2009], which proposes to regularize the precision matrix such that partial correlations of variables that belong to the same cluster are penalized less than those belonging to different clusters. For that purpose, they introduce three hyper-parameters: λ_1 (for the within-cluster penalty), λ_0 (for the across-cluster penalty), with λ_0 > λ_1, and λ_D for a penalty on the diagonal elements. The clusters do not need to be known a priori and are estimated by optimizing a lower bound on the marginal likelihood. As such, their method can also find variable clusterings even when the true partial correlation of variables in different clusters is not exactly zero. However, the clustering result is influenced by the three hyper-parameters λ_0, λ_1, and λ_D, which have to be determined using cross-validation.
Recently, the work in [Sun et al., 2015, Hosseini and Lee, 2016] relaxed the assumption of a clean block structure by allowing some variables to belong to two clusters. The model selection issue, in particular determining the number of clusters, is either addressed with heuristics [Sun et al., 2015] or cross-validation [Hosseini and Lee, 2016].

The Bayesian Gaussian Graphical Model for Clustering
Our starting point for variable clustering is the following Bayesian Gaussian graphical model. Let us denote by p the number of variables, and by n the number of observations. We assume that each observation x ∈ R^p is generated i.i.d. from a multivariate normal distribution with zero mean and covariance matrix Σ.
Assuming that there are k groups of variables that are mutually independent, we know that, after an appropriate permutation of the variables, Σ has the block structure

    Σ = blockdiag(Σ_1, . . ., Σ_k),

where Σ_j ∈ R^{p_j × p_j}, and p_j is the number of variables in cluster j.
By placing an inverse Wishart prior over each block Σ_j, we arrive at the following Bayesian model:

    Σ_j ∼ InverseWishart(ν_j, Σ_{j,0}),   j = 1, . . ., k,
    x_i | Σ, C ∼ Normal(0, Σ),   i = 1, . . ., n,        (1)

where ν_j and Σ_{j,0} are the degrees of freedom and the scale matrix, respectively. We set ν_j = p_j + 1 and Σ_{j,0} = I_{p_j}, leading to a non-informative prior on Σ_j. C denotes the variable clustering, which imposes the block structure on Σ. We will refer to this model as the basic inverse Wishart prior model. Assuming we are given a set of possible variable clusterings C, we can then choose the clustering C* that maximizes the posterior probability of the clustering, i.e.

    C* = argmax_{C ∈ C} p(C | X),
where we denote by X the observations x_1, . . ., x_n, and p(C) is a prior over the clusterings, which we assume to be uniform. Here, we refer to p(X | C) as the marginal likelihood (given the clustering). For the basic inverse Wishart prior model, the marginal likelihood can be calculated analytically, see e.g. [Lenkoski and Dobra, 2011].
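For reference, the analytic marginal likelihood of the basic inverse Wishart prior model factorizes over blocks. A minimal sketch (our own, using the standard zero-mean Gaussian/inverse-Wishart evidence formula, not the paper's code):

```python
import numpy as np
from scipy.special import multigammaln

def log_marglik_block(X_block):
    """Log evidence of one block: x_i ~ N(0, Sigma_j) with
    Sigma_j ~ InvWishart(nu_j = p_j + 1, I), integrated analytically."""
    n, p = X_block.shape
    nu = p + 1                       # non-informative choice from the text
    S = X_block.T @ X_block          # scatter matrix of the block
    _, logdet = np.linalg.slogdet(np.eye(p) + S)
    return (-0.5 * n * p * np.log(np.pi)
            + multigammaln(0.5 * (nu + n), p)
            - multigammaln(0.5 * nu, p)
            - 0.5 * (nu + n) * logdet)   # log|I| term vanishes

def log_marglik(X, clustering):
    """log p(X | C): sum of block evidences for a clustering,
    given as a list of column-index arrays."""
    return sum(log_marglik_block(X[:, idx]) for idx in clustering)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 6))
val = log_marglik(X, [np.arange(3), np.arange(3, 6)])
```

Candidate clusterings can then be ranked directly by this quantity under a uniform prior p(C).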

Proposed Model
In this section, we extend the Bayesian model from Equation (1) to account for non-zero partial correlations between variables in different clusters. For that purpose, we introduce a matrix Σ_ε ∈ R^{p×p} that models the noise on the precision matrix. The full joint probability of our model is given as follows:

    Σ_j ∼ InverseWishart(ν_j, Σ_{j,0}),   j = 1, . . ., k,
    Σ_ε ∼ InverseWishart(ν_ε, Σ_{ε,0}),
    x_i | Σ, Σ_ε, C ∼ Normal(0, Ξ),   i = 1, . . ., n,        (2)

where Ξ := (Σ^{-1} + βΣ_ε^{-1})^{-1}. As before, the block structure of Σ is given by the clustering C. The proposed model is the same as in Equation (1), with the main difference that the noise term βΣ_ε^{-1} is added to the precision matrix of the normal distribution. β > 0 is a hyper-parameter that is fixed to a small positive value, accounting for the degree of noise on the precision matrix. Furthermore, we assume non-informative priors on Σ_j and Σ_ε by setting ν_j = p_j + 1, Σ_{j,0} = I_{p_j}, ν_ε = p + 1, and Σ_{ε,0} = I_p.
Remark on the parameterization. We note that, as an alternative parameterization, we could have defined Ξ := (Σ^{-1} + Σ_ε^{-1})^{-1} and instead placed a prior on Σ_ε that encourages Σ_ε^{-1} to be small in terms of some matrix norm. For example, we could have set Σ_{ε,0} = (1/β) I_p.

Estimation of the Marginal Likelihood
The marginal likelihood of the data given our proposed model can be expressed as follows:

    p(X | C) = ∫∫ [ ∏_{i=1}^n Normal(x_i | 0, Ξ) ] p(Σ | C) p(Σ_ε) dΣ dΣ_ε,

where Ξ := (Σ^{-1} + βΣ_ε^{-1})^{-1}. Clearly, if β = 0, we recover the basic inverse Wishart prior model, as discussed in Section 3, and the marginal likelihood has a closed-form solution due to the conjugacy of the covariance matrix of the Gaussian and the inverse Wishart prior. However, if β > 0, there is no analytic solution anymore. Therefore, we propose to use either an estimate based on a variational approximation (Section 5.2) or one based on MCMC (Section 5.3). Both of our estimates require the calculation of the maximum a posteriori (MAP) solution, which we explain first in Section 5.1.
Remark on BIC-type approximation of the marginal likelihood. We note that for our proposed model an approximation of the marginal likelihood using BIC is not sensible. To see this, recall that BIC consists of two terms: the data log-likelihood under the model with the maximum likelihood estimate, and a penalty depending on the number of free parameters. The maximum likelihood estimate satisfies

    (Σ̂^{-1} + βΣ̂_ε^{-1})^{-1} = S,

where S is the sample covariance matrix. Note that, without the specification of a prior, it is valid that Σ and Σ_ε are not positive definite as long as the matrix Ξ is, and the data log-likelihood under the model with the maximum likelihood estimate is simply ∑_{i=1}^n log Normal(x_i | 0, S), which is independent of the clustering. The number of free parameters is (p^2 − p)/2, which is also independent of the clustering. That means, for any clustering we end up with the same BIC.
Furthermore, a Laplace approximation as used in the generalized Bayesian information criterion [Konishi et al., 2004] is also not suitable, since in our case the parameter space is over the positive definite matrices.
Solution using a 3-Block ADMM. Finding the MAP can be formulated as a convex optimization problem by a change of parameterization: defining X := Σ^{-1}, X_j := Σ_j^{-1}, and X_ε := Σ_ε^{-1}, we get the following convex optimization problem:

    minimize_{X_ε, X_1, . . ., X_k ≻ 0}  f_1(X_ε) + f_2(X_1, . . ., X_k) + f_3(X + βX_ε),        (3)

where f_1 and f_2 collect the negative log-prior terms, f_3 collects the negative log-likelihood terms, and, for simplifying notation, constants absorbing n and the hyper-parameters are introduced. From this form, we see immediately that the problem is strictly convex jointly in X_ε and X. We further reformulate the problem by introducing an additional variable Z:

    minimize f(X_ε, X_1, . . ., X_k, Z)   subject to   Z = X + βX_ε.

It is tempting to use a 2-block ADMM algorithm, as e.g. in [Boyd et al., 2011], which leads to two optimization problems: an update of (X, X_ε) and an update of Z. However, unfortunately, in our case the resulting optimization problem for updating (X, X_ε) does not have an analytic solution. Therefore, instead, we suggest the use of a 3-block ADMM, which cycles through the following updates:

    X_ε^{t+1} := argmin_{X_ε ≻ 0} L_ρ(X_ε, X^t, Z^t, U^t),
    X^{t+1} := argmin_{X ≻ 0} L_ρ(X_ε^{t+1}, X, Z^t, U^t),
    Z^{t+1} := argmin_{Z ≻ 0} L_ρ(X_ε^{t+1}, X^{t+1}, Z, U^t),
    U^{t+1} := U^t + ρ (X^{t+1} + βX_ε^{t+1} − Z^{t+1}),

where L_ρ denotes the augmented Lagrangian, U is the Lagrange multiplier, X^t, Z^t, U^t denote X, Z, U at iteration t, and ρ > 0 is the learning rate. Each of the above sub-problems (the first with variable X_ε, the second with variable X, the third with variable Z) can be solved efficiently via the following strategy: setting the gradient of each sub-problem to zero yields a condition of the form

    λ V^{-1} = ρ (V − R),        (4)

for a symmetric matrix R and a constant λ > 0, which can be solved for V via an eigenvalue decomposition as follows. Since R is a symmetric matrix (not necessarily positive or negative semi-definite), we have the eigenvalue decomposition R = Q L Q^T, where Q is an orthonormal matrix and L is a diagonal matrix with real values. Writing V = Q Y Q^T, the solution Y must also be a diagonal matrix, i.e. Y_ij = 0 for j ≠ i, and we must have that

    λ / Y_ii = ρ (Y_ii − L_ii).        (5)

Equation (5) is equivalent to ρ Y_ii^2 − ρ L_ii Y_ii − λ = 0, and therefore one solution is

    Y_ii = ( L_ii + sqrt(L_ii^2 + 4λ/ρ) ) / 2.

Note that for λ > 0, we have Y_ii > 0. Therefore, the resulting Y solves Equation (4), and moreover V = Q Y Q^T ≻ 0. That means we can solve the semi-definite problem with only one eigenvalue decomposition, and the cost is therefore in O(p^3). Finally, we note that, in contrast to the 2-block ADMM, a general 3-block ADMM does not have a convergence guarantee for every ρ > 0. However, using a recent result from [Lin et al., 2015], we show in Appendix A that in our case the conditions for convergence are met for any ρ > 0.
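The eigenvalue-based sub-problem solution can be sketched as follows. This is a minimal sketch of the generic problem min_V −λ log|V| + (ρ/2)‖V − R‖_F² over positive definite V, whose stationarity condition is exactly the one discussed above; λ, ρ, and R stand for each sub-problem's constants:

```python
import numpy as np

def prox_neg_logdet(R, lam, rho):
    """Minimize -lam*log|V| + (rho/2)*||V - R||_F^2 over V > 0 using a
    single symmetric eigendecomposition (hence O(p^3) per update)."""
    L, Q = np.linalg.eigh(R)                          # R = Q diag(L) Q^T
    # positive root of rho*y^2 - rho*l*y - lam = 0 for each eigenvalue l
    y = 0.5 * (L + np.sqrt(L ** 2 + 4.0 * lam / rho))
    return (Q * y) @ Q.T                              # V = Q diag(y) Q^T

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
R = (A + A.T) / 2.0          # symmetric, possibly indefinite
V = prox_neg_logdet(R, lam=1.0, rho=2.0)
```

Note that the returned V is positive definite even when R is indefinite, which matches the observation that Y_ii > 0 for λ > 0.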

Variational Approximation of the Marginal Likelihood
Here we explain our strategy for the calculation of a variational approximation of the marginal likelihood. For simplicity, let θ denote the vector of all parameters, X the observed data, and η the vector of all hyper-parameters.
Let θ̂ denote the posterior mode. Furthermore, let g(θ) be an approximation of the posterior distribution p(θ | X, η, C) that is accurate around the mode θ̂.
Then we have

    p(X | η, C) = p(X | θ, η) p(θ | η) / p(θ | X, η, C) ≈ p(X | θ̂, η) p(θ̂ | η) / g(θ̂).

Note that for the Laplace approximation we would use g(θ) = N(θ | θ̂, V), where V is an appropriate covariance matrix. However, here the posterior p(θ | X, η, C) is a probability measure over the positive definite matrices and not over R^d, which makes the Laplace approximation inappropriate.
Instead, we suggest approximating the posterior distribution p(Σ_ε, Σ_1, . . ., Σ_k | x_1, . . ., x_n, ν_ε, Σ_{ε,0}, {ν_j}_j, {Σ_{j,0}}_j, C) by the factorized distribution

    g(Σ_ε, Σ_1, . . ., Σ_k) := g_ε(Σ_ε) ∏_{j=1}^k g_j(Σ_j).

We define g_ε(Σ_ε) and g_j(Σ_j) as inverse Wishart densities whose scale matrices are chosen such that the mode of g_ε equals Σ̂_ε, the mode of the posterior probability p(Σ_ε | X, η, C) (as calculated in the previous section). Note that this choice ensures that the mode of g_ε is the same as the mode of p(Σ_ε | x_1, . . ., x_n, η, C). Analogously, the scale matrix of g_j is chosen such that its mode equals Σ̂_j, the mode of the posterior probability p(Σ_j | X, η, C). The remaining parameters ν_{g,ε} ∈ R and ν_{g,j} ∈ R are optimized by minimizing the KL-divergence between the factorized distribution g and the posterior distribution p(Σ_ε, Σ_1, . . ., Σ_k | x_1, . . ., x_n, η, C). The details of the following derivations are given in Appendix B. For simplicity, let us denote g_J := ∏_{j=1}^k g_j; then the KL-divergence can be evaluated analytically up to a constant c with respect to g_ε and g_j, except for the term E_{g_J, g_ε}[log |Σ^{-1} + βΣ_ε^{-1}|], which cannot be computed analytically; therefore we need to resort to an approximation.
Under a simplifying assumption on this expectation, stated in Appendix B, the term can be approximated in closed form, again up to a constant c with respect to g_ε and g_j.
From the above expression, we see that we can optimize the parameters of g_ε and g_j independently from each other. The optimal parameters ν̂_{g,ε} for g_ε and ν̂_{g,j} for g_j are each obtained from a one-dimensional non-convex optimization problem, which we solve with Brent's method [Brent, 1971].
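Such one-dimensional problems can be handled with SciPy's Brent-based bounded minimizer. The objective below is a hypothetical stand-in (the actual objectives in ν_{g,ε} and ν_{g,j} are derived in Appendix B):

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical stand-in objective: smooth, one-dimensional, mildly non-convex.
def objective(nu):
    return (np.log(nu) - 2.0) ** 2 + 0.1 * np.sin(3.0 * nu)

# Brent-style bounded minimization over an interval for nu.
res = minimize_scalar(objective, bounds=(1.0, 100.0), method='bounded')
```

Since the objective is non-convex, the bounded search only guarantees a local minimizer; restarting from a few bounds is a common safeguard.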

MCMC Estimation of Marginal Likelihood
As an alternative to the variational approximation, we investigate an MCMC estimate based on Chib's method [Chib, 1995, Chib and Jeliazkov, 2001].
To simplify the description, we introduce the following notation: we write θ_1, . . ., θ_{k+1} for the parameter blocks Σ_ε, Σ_1, . . ., Σ_k, and θ̂_i for the corresponding posterior modes. Furthermore, we define θ_{<i} := {θ_1, . . ., θ_{i−1}} and θ_{>i} := {θ_{i+1}, . . ., θ_{k+1}}. For simplicity, we also suppress in the notation the explicit conditioning on the hyper-parameters η and the clustering C, which are both fixed. Following the strategy of Chib [1995], the marginal likelihood can be expressed as

    p(X) = p(X | θ̂) p(θ̂) / ∏_{i=1}^{k+1} p(θ̂_i | X, θ̂_1, . . ., θ̂_{i−1}).        (7)

In order to approximate p(X) with Equation (7), we need to estimate p(θ̂_i | X, θ̂_1, . . ., θ̂_{i−1}). First, note that we can express the value of the conditional posterior distribution at θ̂_i as follows (see Chib and Jeliazkov [2001], Section 2.3):

    p(θ̂_i | X, θ̂_{<i}) = E_1[ α(θ_i, θ̂_i | θ_{>i}) q_i(θ̂_i) ] / E_2[ α(θ̂_i, θ_i | θ_{>i}) ],        (8)

where E_1 is with respect to p(θ_{≥i} | X, θ̂_{<i}), E_2 is with respect to q_i(θ_i) p(θ_{>i} | X, θ̂_{<i}, θ̂_i), q_i(θ_i) is a proposal distribution for θ_i, and the acceptance probability of moving from state θ_i to state θ'_i, holding the other states ψ fixed, is defined as

    α(θ_i, θ'_i | ψ) := min{ 1, [ p(X | θ'_i, ψ) p(θ'_i) q_i(θ_i) ] / [ p(X | θ_i, ψ) p(θ_i) q_i(θ'_i) ] }.        (9)

Next, using Equation (8), we can estimate p(θ̂_i | X, θ̂_{<i}) with a Monte Carlo approximation with M samples:

    p̂(θ̂_i | X, θ̂_{<i}) = [ (1/M) ∑_{m=1}^M α(θ_i^m, θ̂_i | θ_{>i}^m) q_i(θ̂_i) ] / [ (1/M) ∑_{m=1}^M α(θ̂_i, θ_i^{q,m} | θ_{>i}^{q,m}) ],        (10)

where (θ_i^m, θ_{>i}^m) ∼ p(θ_{≥i} | X, θ̂_{<i}), and θ_i^{q,m} ∼ q_i(θ_i). Finally, in order to sample from p(θ_{≥i} | X, θ̂_{<i}), we propose to use the Metropolis-Hastings within Gibbs sampler shown in Algorithm 1, where MH_j(θ_j^t, ψ) denotes the Metropolis-Hastings update with current state θ_j^t and acceptance probability α(θ_j, θ'_j | ψ) as in Equation (9), and θ_{≥i}^0 is a sample after the burn-in. For the proposal distribution q_i(θ_i), we use an inverse Wishart distribution calibrated to the conditional posterior, where κ > 0 is a hyper-parameter of the MCMC algorithm that is chosen to control the acceptance probability. Note that if we choose κ = 1 and β is 0, then the proposal distribution q_i(θ_i) equals the posterior distribution p(θ_i | X, θ̂_1, . . ., θ̂_{i−1}). However, in practice, we found that the acceptance probabilities can be too small, leading to unstable estimates and division by 0 in Equation (10). Therefore, for our experiments we chose κ = 10.
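To make the ordinate-based estimate concrete, here is a self-contained toy sketch of the Chib-Jeliazkov estimator on a one-dimensional conjugate model. This is our own illustration, not the paper's sampler (there, each θ_i is a covariance block and the proposal is inverse Wishart); the conjugate model is chosen because its exact log marginal likelihood is available for checking:

```python
import numpy as np

# Toy model with a known answer: y_i ~ N(theta, 1), theta ~ N(0, 1).
def log_post_unnorm(theta, y):
    """Unnormalized log posterior, vectorized over theta."""
    theta = np.asarray(theta, dtype=float)
    ll = -0.5 * y.size * np.log(2 * np.pi) \
         - 0.5 * ((y - theta[..., None]) ** 2).sum(axis=-1)
    lp = -0.5 * np.log(2 * np.pi) - 0.5 * theta ** 2
    return ll + lp

def q_logpdf(x, mean, s):
    return -0.5 * np.log(2 * np.pi * s ** 2) - 0.5 * ((x - mean) / s) ** 2

def chib_jeliazkov_log_marglik(y, n_samples=20000, s=0.5, seed=0):
    """log p(y) = log p(y|t*) + log p(t*) - log p(t*|y), with the posterior
    ordinate p(t*|y) estimated from random-walk Metropolis-Hastings output."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_samples)
    theta = 0.0
    for t in range(n_samples):                 # random-walk MH
        prop = theta + s * rng.standard_normal()
        if np.log(rng.uniform()) < log_post_unnorm(prop, y) - log_post_unnorm(theta, y):
            theta = prop
        chain[t] = theta
    chain = chain[n_samples // 10:]            # discard burn-in
    t_star = chain.mean()
    lp_star = log_post_unnorm(t_star, y)
    # numerator: E_posterior[ alpha(theta, t*) q(t* | theta) ]
    a_num = np.exp(np.minimum(0.0, lp_star - log_post_unnorm(chain, y)))
    num = np.mean(a_num * np.exp(q_logpdf(t_star, chain, s)))
    # denominator: E_{q(.|t*)}[ alpha(t*, theta) ]
    draws = t_star + s * rng.standard_normal(chain.size)
    a_den = np.exp(np.minimum(0.0, log_post_unnorm(draws, y) - lp_star))
    log_ordinate = np.log(num) - np.log(np.mean(a_den))
    return float(lp_star - log_ordinate)
```

In the toy model the exact evidence is that of y ∼ N(0, I + 11^T), which the estimate should match closely.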

Restricting the hypotheses space
The number of possible clusterings grows with the Bell numbers, so it is infeasible to enumerate all possible clusterings even when the number of variables p is small. It is therefore crucial to restrict the hypotheses space to a subset of all clusterings that is likely to contain the true clustering. We denote this subset by C*.
Algorithm 1 Metropolis-Hastings within Gibbs sampler for sampling from p(θ_{≥i} | X, θ̂_{<i}).
for t from 1 to M do
    for j from i to k + 1 do
        θ_j^{t+1} := MH_j(θ_j^t, ψ)
    end for
end for

We suggest using spectral clustering on different estimates of the precision matrix to acquire the set of clusterings C*. A motivation for this heuristic is given in Appendix C.
First, for an appropriate λ, we estimate the precision matrix using

    X̂ := argmin_{X ≻ 0}  − log |X| + tr(SX) + λ ∑_{i ≠ j} |X_{ij}|^q.

In our experiments, we take q = 1, which is equivalent to the Graphical Lasso [Friedman et al., 2008] with an l1-penalty on all entries of X except the diagonal.
In the next step, we construct the graph Laplacian L from the absolute values of the estimated precision matrix:

    W_{ij} := |X̂_{ij}| for i ≠ j,   W_{ii} := 0,   L := D − W,   with D := diag(∑_j W_{1j}, . . ., ∑_j W_{pj}).        (13)
Finally, we use k-means clustering on the eigenvectors of the Laplacian L. The details of acquiring the set of clusterings C* with the spectral clustering method are summarized in Algorithm 2. In Section 7.1, we confirm experimentally that, even in the presence of noise, C* often contains the true clustering, or clusterings that are close to the true clustering.
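A minimal sketch of this pipeline, using scikit-learn's GraphicalLasso and KMeans as stand-ins for the paper's implementation (the unnormalized Laplacian on the absolute precision entries is our reading of Equations (13)):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.covariance import GraphicalLasso

def spectral_from_precision(prec, k):
    """Cluster variables: Laplacian of |precision|, then k-means on the
    eigenvectors belonging to the k smallest eigenvalues."""
    W = np.abs(np.asarray(prec, dtype=float)).copy()
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W               # unnormalized Laplacian
    _, eigvecs = np.linalg.eigh(L)               # ascending eigenvalues
    U = eigvecs[:, :k]
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)

def spectral_candidate(X, lam, k):
    """One candidate clustering from data: graphical-lasso precision
    estimate with penalty lam, then spectral clustering into k groups."""
    prec = GraphicalLasso(alpha=lam).fit(X).precision_
    return spectral_from_precision(prec, k)

# Exact block-diagonal precision: two groups of three variables.
block = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [1.0, 1.0, 2.0]])
prec = np.zeros((6, 6))
prec[:3, :3] = block
prec[3:, 3:] = block
labels = spectral_from_precision(prec, 2)
```

Sweeping lam and k and collecting the resulting label vectors yields the restricted hypotheses space.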

Posterior distribution over number of clusters
In principle, the posterior distribution over the number of clusters can be calculated using

    p(k | X) = ∑_{C ∈ C_k} p(C | X),

where C_k denotes the set of all clusterings with number of clusters equal to k. Since this is computationally infeasible, we use the following approximation:

    p(k | X) ≈ ∑_{C ∈ C*_k} p(C | X) / ∑_{C ∈ C*} p(C | X),

where C*_k is the set of all clusterings with k clusters that are in the restricted hypotheses space C*.

Algorithm 2 Spectral clustering for variable clustering with the Gaussian graphical model.
J := set of regularization parameter values. K_max := maximum number of considered clusters.
for λ in J do
    estimate the precision matrix X̂ with regularization parameter λ
    (e_1, . . ., e_{K_max}) := determine the eigenvectors corresponding to the K_max lowest eigenvalues of the Laplacian L as defined in Equations (13)
    for k from 2 to K_max do
        C_{λ,k} := k-means clustering of the variables based on (e_1, . . ., e_k)
        C* := C* ∪ C_{λ,k}
    end for
end for
return restricted hypotheses space C*

Simulation Study
In this section, we evaluate the proposed method on simulated data for which the ground truth is available. In Section 7.1, we evaluate the quality of the restricted hypotheses space C*; in Section 7.2, we evaluate the proposed method's ability to select the best clustering in C*.
In all experiments the number of variables is p = 40, and the ground truth is 4 clusters with 10 variables each.
For generating positive-definite covariance matrices, we consider the following two distributions: InvW(p + 1, I_p), and Uniform_p, with dimension p. We denote by U ∼ Uniform_p a positive-definite matrix generated by drawing a symmetric matrix A with uniformly distributed entries and then shifting its spectrum away from zero using λ_min(A), the smallest eigenvalue of A, so that U is positive definite. For generating Σ, we either sample each block j from InvW(p_j + 1, I_{p_j}) or from Uniform_{p_j}.
For generating the noise matrix Σ_ε, we sample either from InvW(p + 1, I_p) or from Uniform_p. The final data is then sampled as

    x_i ∼ Normal(0, (Σ^{-1} + η Σ_ε^{-1})^{-1}),

where η defines the noise level.
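The generation scheme can be sketched as follows (the exact entry distribution and spectrum-shift constant in the Uniform_p construction are our assumptions, since the paper's display equations are not reproduced here):

```python
import numpy as np

def sample_uniform_pd(p, rng):
    """'Uniform_p'-style positive-definite matrix: symmetric uniform
    entries, spectrum shifted above zero via the smallest eigenvalue
    (entry range and shift constant are assumptions)."""
    A = rng.uniform(-1.0, 1.0, size=(p, p))
    A = (A + A.T) / 2.0
    return A + (np.abs(np.linalg.eigvalsh(A).min()) + 0.1) * np.eye(p)

def sample_data(n, block_sizes, eta, rng):
    """X ~ N(0, (Sigma^{-1} + eta * Sigma_eps^{-1})^{-1}) with
    block-diagonal Sigma and a dense noise matrix Sigma_eps."""
    p = sum(block_sizes)
    Sigma = np.zeros((p, p))
    start = 0
    for pj in block_sizes:
        Sigma[start:start + pj, start:start + pj] = sample_uniform_pd(pj, rng)
        start += pj
    Sigma_eps = sample_uniform_pd(p, rng)
    cov = np.linalg.inv(np.linalg.inv(Sigma) + eta * np.linalg.inv(Sigma_eps))
    cov = (cov + cov.T) / 2.0      # symmetrize against round-off
    return rng.multivariate_normal(np.zeros(p), cov, size=n)

rng = np.random.default_rng(0)
X = sample_data(100, [3, 3], eta=0.1, rng=rng)
```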
For evaluation, we use the adjusted normalized mutual information (ANMI), where 0.0 means that any correspondence with the true labels is at chance level, and 1.0 means that a perfect one-to-one correspondence exists [Vinh et al., 2010]. We repeated all experiments 5 times and report the average ANMI score.

Evaluation of the restricted hypotheses space
First, independent of any model selection criterion, we check the quality of the clusterings that are found with the spectral clustering algorithm from Section 6. We also compare to single and average linkage clustering as used in [Tan et al., 2015].
The set of all clusterings that are found is denoted by C * (the restricted hypotheses space).
In order to evaluate the quality of the restricted hypotheses space C*, we report the oracle performance calculated by max_{C ∈ C*} ANMI(C, C_T), where C_T denotes the true clustering, and ANMI(C, C_T) denotes the ANMI score when comparing clustering C with the true clustering. In particular, a score of 1.0 means that the true clustering is contained in C*.
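The oracle score over a candidate set can be computed directly; here scikit-learn's adjusted_mutual_info_score serves as a stand-in for ANMI (its default normalization may differ slightly from Vinh et al. [2010]):

```python
from sklearn.metrics import adjusted_mutual_info_score

def oracle_score(candidates, true_labels):
    """Best agreement, over all candidate clusterings, with the truth."""
    return max(adjusted_mutual_info_score(true_labels, c) for c in candidates)

candidates = [[0, 0, 1, 1], [0, 1, 0, 1], [0, 0, 0, 1]]
truth = [1, 1, 0, 0]      # matches the first candidate up to label names
score = oracle_score(candidates, truth)
```

Because the score is adjusted for chance and invariant to label permutations, a score of 1.0 indicates the true clustering is in the candidate set.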
The results of all experiments with noise level η ∈ {0.0, 0.01, 0.1} are shown in Table 1, for balanced clusters, and Table 2, for unbalanced clusters.
From these results, we see that the restricted hypotheses space found by spectral clustering contains around 100 clusterings, considerably fewer than the number of all possible clusterings. More importantly, we also see that C* acquired by spectral clustering either contains the true clustering or a clustering that is close to the truth. In contrast, the hypotheses space restricted by single and average linkage is smaller, but more often misses the true clustering.

Evaluation of clustering selection criteria
Here, we evaluate the performance of our proposed method for selecting the correct clustering in the restricted hypotheses space C*. We compare our proposed method (variational) with several baselines and two previously proposed methods [Tan et al., 2015, Palla et al., 2012]. Except for the two previously proposed methods, we created C* with the spectral clustering algorithm from Section 6.
As cluster selection criteria, we compare our method to the extended Bayesian information criterion (EBIC) with γ ∈ {0, 0.5, 1} [Chen and Chen, 2008, Foygel and Drton, 2010], the Akaike information criterion (AIC) [Akaike, 1973], and the Calinski-Harabasz Index (CHI) [Caliński and Harabasz, 1974]. Note that EBIC and AIC are calculated based on the basic Gaussian graphical model (i.e., the model in Equation (1), but ignoring the prior specification). Furthermore, we note that EBIC is model consistent, and therefore, assuming that the true precision matrix contains only non-zero entries, will asymptotically choose the clustering that has only one cluster with all variables in it. However, as an advantage for EBIC, we exclude that clustering. Furthermore, we note that, in contrast to EBIC and AIC, the Calinski-Harabasz Index is not a model-based cluster evaluation criterion. It is a heuristic that uses as clustering criterion the ratio of the variance within and across clusters. As such, it is expected to give reasonable clustering results if the noise is considerably smaller in magnitude than the within-cluster partial correlations.
We remark that EBIC and AIC are not well defined if the sample covariance matrix is singular, in particular if n < p or n ≈ p. As an ad-hoc remedy, which works well in practice, we always add 0.001 times the identity matrix to the covariance matrix (see also Ledoit and Wolf [2004]).
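This remedy is a one-line shrinkage; a sketch (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 10))         # n < p: sample covariance is singular
S = np.cov(X, rowvar=False)
S_reg = S + 0.001 * np.eye(S.shape[0])   # small ridge makes S positive definite
```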
Finally, we also compare the proposed method to two previous approaches for variable clustering: the Clustered Graphical Lasso (CGL) as proposed in [Tan et al., 2015], and the Dirichlet Process Variable Clustering (DPVC) model as proposed in [Palla et al., 2012], for which an implementation is available. DPVC models the number of clusters using a Dirichlet process. CGL uses for model selection the mean squared error for recovering randomly left-out elements of the covariance matrix, and for clustering either single linkage clustering (SLC) or average linkage clustering (ALC). For conciseness, we show only the results for ALC, since they tended to be better than those for SLC.
The results of all experiments with noise level η ∈ {0.0, 0.01, 0.1} are shown in Tables 3 and 4, for balanced clusters, and Tables 5 and 6, for unbalanced clusters.
The tables also contain the performance of the proposed method for β ∈ {0, 0.01, 0.02, 0.03}. Note that β = 0.0 corresponds to the basic inverse Wishart prior model, for which we can calculate the marginal likelihood analytically.
Comparing the proposed method with different β, we see that β = 0.02 offers good clustering performance in both the no-noise and the noisy setting. In contrast, model selection with EBIC and AIC performs well in the no-noise scenario, as expected; however, in the noisy setting they tend to select incorrect clusterings. In particular, for large sample sizes EBIC tends to fail to identify correct clusterings.
The Calinski-Harabasz Index performs well in the noisy settings, whereas in the no-noise setting it performs unsatisfactorily.
In Figures 1 and 2, we show the posterior distribution with and without noise on the precision matrix, respectively. In both cases, given that the sample size n is large enough, the proposed method is able to correctly estimate the number of clusters. In contrast, the basic inverse Wishart prior model underestimates the number of clusters when n is large and there is noise in the precision matrix.

Comparison of variational and MCMC estimate
Here, we compare our variational approximation with MCMC on a small-scale simulated problem where it is computationally feasible to estimate the marginal likelihood with MCMC. We generated synthetic data as in the previous section, with the only difference that we set the number of variables p to 12.
The number of samples M for MCMC was set to 10000, of which we used 10% as burn-in. For two randomly picked clusterings, for n = 12 and n = 1200000, we checked the acceptance rates and convergence using the multivariate extension of the Gelman-Rubin diagnostic [Brooks and Gelman, 1998]. The average acceptance rates were around 80%, and the potential scale reduction factor was 1.01.
The runtime of MCMC was around 40 minutes for evaluating one clustering, whereas for the variational approximation the runtime was around 2 seconds. The results are shown in Table 7, suggesting that the quality of the selected clusterings using the variational approximation is similar to MCMC.

Real Data Experiments
In this section, we investigate the properties of the proposed model selection criterion on three real data sets. In all cases, we use the spectral clustering algorithm from Section 6 to create cluster candidates. All variables were normalized to have mean 0 and variance 1. For all methods except DPVC, the number of clusters is considered to be in {2, 3, 4, . . ., min(p − 1, 15)}; DPVC automatically selects the number of clusters by assuming a Dirichlet process prior. We evaluated the proposed method with β = 0.02 using the variational approximation.

Mutual Funds
Here we use the mutual funds data, which has been previously analyzed in [Scott and Carvalho, 2008, Marlin et al., 2009]. The data contains 59 mutual funds (p = 59) grouped into 4 clusters: U.S. bond funds, U.S. stock funds, balanced funds (containing U.S. stocks and bonds), and international stock funds. The number of observations is 86.
The results of all methods are visualized in Table 8. It is difficult to interpret the results produced by EBIC (γ = 1.0), AIC, and the Calinski-Harabasz Index. In contrast, the proposed method and EBIC (γ = 0.0) produce results that are easier to interpret. In particular, our results suggest that there is a considerable correlation between the balanced funds and the U.S. stock funds, which was also observed in Marlin et al. [2009].
In Figure 3, we show a two-dimensional representation of the data that was found using Laplacian Eigenmaps [Belkin and Niyogi, 2003]. The figure supports the claim that balanced funds and U.S. stock funds have similar behavior.

Gene Regulations
We also tested our method on the gene expression data that was analyzed in [Hirose et al., 2017]. The data consists of 11 genes with 445 gene expressions. The true gene regulations are known in this case and are shown in Figure 4, adapted from [Hirose et al., 2017]. The most important fact is that there are two independent groups of genes, and any clustering that mixes these two can be considered wrong.
We show the results of all methods in Figure 5, where we mark each cluster with a different color superimposed on the true regulatory structure. Here, only the clusterings selected by the proposed method, EBIC (γ = 1.0), and the Calinski-Harabasz Index correctly divide the two groups of genes.

Aviation Sensors
As a third data set, we use the flight aviation data set from NASA. The data set contains sensor information sampled from airplanes during operation. We extracted the information of 16 continuous-valued sensors that were recorded for different flights, with in total 25032364 samples.
The clustering results are shown in Table 9. The data set does not have any ground truth, but the clustering result of our proposed method is reasonable: Cluster 9 groups sensors that measure or affect altitude, Cluster 8 correctly clusters the left and right sensors for measuring the rotation around the axis pointing through the nose of the aircraft, and in Cluster 2 all sensors that measure the angle between chord and flight direction are grouped together. It also appears reasonable that the yellow hydraulic system of the left part of the plane has little direct interaction with the hydraulic system of the right part (Cluster 1 and Cluster 4), and that the sensor for the rudder, influencing the direction of the plane, is mostly independent of the other sensors (Cluster 5).
In contrast, the clusterings selected by the basic inverse Wishart prior, EBIC, and AIC are difficult to interpret. We note that we did not compare to DPVC, since the large number of samples made the MCMC algorithm of DPVC infeasible.

Discussion and Conclusions
We have introduced a new method for evaluating variable clusterings based on the marginal likelihood of a Bayesian model that takes into account noise on the precision matrix. Since the calculation of the marginal likelihood is analytically intractable, we proposed two approximations: a variational approximation and an approximation based on MCMC. Experimentally, we found that the variational approximation is considerably faster than MCMC and also leads to accurate model selections.
We compared our proposed method to several standard model selection criteria. In particular, we compared to BIC and the extended BIC (EBIC), which are often the methods of choice for model selection in Gaussian graphical models. However, we emphasize that EBIC was designed to handle the situation where p is on the order of n, and has not been designed to handle noise. As a consequence, our experiments showed that in practice its performance depends highly on the choice of the γ parameter. In contrast, the proposed method, with fixed hyper-parameters, shows better performance on various simulated and real data.
We also compared our method to two other previously proposed methods, namely Cluster Graphical Lasso (CGL) [Tan et al., 2015] and Dirichlet Process Variable Clustering (DPVC) [Palla et al., 2012], which performs clustering and model selection jointly. However, it appears that in many situations the model selection algorithm of CGL is not able to detect the true model, even if there is no noise. On the other hand, the Dirichlet process assumption of DPVC appears to be very restrictive, again leading to many situations where the true model (clustering) is missed. Overall, our method performs better in terms of selecting the correct clustering on synthetic data with ground truth, and selects meaningful clusters on real data.
The python source code for variable clustering and model selection with the proposed method and all baselines is available at https://github.com/andrade-stats/robustBayesClustering.

A Convergence of 3-block ADMM
We can write the optimization problem in (3) as a minimization of three functions f_1, f_2, f_3 coupled by an equality constraint. First note that f_1, f_2 and f_3 are convex, proper, closed functions. Since X, X_1, ..., X_k ≻ 0, the equality constraint implies Z ≻ 0. Assuming that the global minimum is attained, we can assume that Z ⪯ σI for some large enough σ > 0. As a consequence, we have that ∇²f_3(Z) = Z⁻¹ ⊗ Z⁻¹ ⪰ σ⁻²I, and therefore f_3 is a strongly convex function. Analogously, f_1 and f_2 are strongly convex functions, and therefore also coercive. This allows us to apply Theorem 3.2 of [Lin et al., 2015], which guarantees the convergence of the 3-block ADMM.
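The strong-convexity bound on ∇²f_3 can be checked numerically on a random instance. The sketch below (an illustrative sanity check only, not part of the actual algorithm) verifies that when the eigenvalues of Z are bounded above by σ, the smallest eigenvalue of the Hessian Z⁻¹ ⊗ Z⁻¹ of f_3(Z) = -log det(Z) is at least σ⁻².

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive definite Z.
A = rng.standard_normal((4, 4))
Z = A @ A.T + 0.1 * np.eye(4)

# Choose sigma as the largest eigenvalue, so Z <= sigma * I by construction.
sigma = np.linalg.eigvalsh(Z).max()

# Hessian of f3(Z) = -log det(Z) is the Kronecker product Z^-1 (x) Z^-1.
Z_inv = np.linalg.inv(Z)
hessian = np.kron(Z_inv, Z_inv)

# Strong-convexity bound from the appendix: lambda_min(H) >= sigma^-2.
min_eig = np.linalg.eigvalsh(hessian).min()
assert min_eig >= 1.0 / sigma**2 - 1e-10
```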

B Derivation of variational approximation
Here, we give more details of the KL-divergence minimization from Section 5.2. Recall that the remaining parameters ν_g ∈ R and ν_{g,j} ∈ R are optimized by minimizing the KL-divergence between the factorized distribution g and the posterior distribution p(Σ, Σ_1, ..., Σ_k | x_1, ..., x_n, η, C). We have that the KL-divergence decomposes up to a constant c with respect to g and g_j. However, the term E_{g,g_j}[log |Σ⁻¹ + βΣ⁻¹|] cannot be evaluated analytically, and therefore we need to resort to some sort of approximation. Under this approximation, we obtain an expression in which c is a constant with respect to g and g_j, and from which we see that the parameters of g and g_j can be optimized independently of each other; this yields the optimal parameter ν_g for g.

Proposition 1. Optimization problems 1 and 2 have the same solution. Moreover, the m-dimensional null space of L can be chosen such that each basis vector is the indicator vector of one variable block of X.
Proof. First let us define the matrix X̃ by X̃_ij := |X_ij|^q. Then clearly, X is block sparse with m blocks iff X̃ is. Furthermore, X̃_ij ≥ 0, and L is the unnormalized Laplacian as defined in [Von Luxburg, 2007]. We can therefore apply Proposition 2 of [Von Luxburg, 2007] to find that the dimension of the eigenspace of L corresponding to eigenvalue 0 is exactly the number of blocks in X̃. Also from Proposition 2 of [Von Luxburg, 2007] it follows that each such eigenvector e_k ∈ R^p can be chosen such that it indicates the variables belonging to the same block, i.e., e_k(i) ≠ 0 iff variable i belongs to block k.
Using the nuclear norm as a convex relaxation of the rank constraint, we minimize the relaxed objective with an appropriately chosen λ_m. By the definition of L, we have that L is positive semi-definite, and therefore ||L||_* = trace(L). As a consequence, we can rewrite the above problem accordingly. Finally, for the purpose of learning the Laplacian L, we ignore the term βX and set it to zero. This will necessarily lead to an estimate of X^* that is not a clean block matrix, but one that has small non-zero entries between blocks. Nevertheless, spectral clustering is known to be robust to such violations [Ng et al., 2002]. This leads to Algorithm 2 in Section 6.
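The null-space characterization in Proposition 1 is easy to verify numerically. The toy matrix below is illustrative: for a block-sparse X with two blocks, the unnormalized Laplacian of the similarity matrix X̃_ij = |X_ij|^q has a two-dimensional null space.

```python
import numpy as np

# Block-sparse matrix X with two variable blocks: {0,1,2} and {3,4}.
X = np.zeros((5, 5))
X[:3, :3] = 0.5
X[3:, 3:] = 0.7
np.fill_diagonal(X, 1.0)

q = 1  # exponent of the similarity transform
X_tilde = np.abs(X) ** q             # similarity matrix, X~_ij = |X_ij|^q
D = np.diag(X_tilde.sum(axis=1))     # degree matrix
L = D - X_tilde                      # unnormalized graph Laplacian

# Multiplicity of eigenvalue 0 equals the number of blocks (here: 2).
eigvals = np.linalg.eigvalsh(L)
num_blocks = int(np.sum(eigvals < 1e-10))
print(num_blocks)  # 2
```

As in the proof, each null-space eigenvector can be chosen as the indicator vector of one block, which is what Algorithm 2 exploits via spectral clustering.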

Figure 1: Posterior distribution of the number of clusters for the proposed method (top row) and the basic inverse Wishart prior model (bottom row). Ground truth is 4 clusters; there is no noise on the precision matrix.

Figure 2: Posterior distribution of the number of clusters for the proposed method (top row) and the basic inverse Wishart prior model (bottom row). Ground truth is 4 clusters; noise was added to the precision matrix.

Figure 3: Two-dimensional representation of the mutual funds data.

Table 9: Evaluation of selected clusterings of the Aviation Sensor Data with 16 variables. Here the size of the restricted hypothesis space |C^*| found by spectral clustering was 28.