1 Introduction

The Gaussian graphical model (GGM) has become an invaluable tool for detecting partial correlations between variables. Assuming the variables are jointly drawn from a multivariate normal distribution, the sparsity pattern of the precision matrix reveals which pairs of variables are independent given all other variables (Anderson 2004). In particular, we can find clusters of variables that are mutually independent by grouping the variables according to their entries in the precision matrix.

For example, in gene expression analysis, variable clustering is often considered to be helpful for data exploration (Palla et al. 2012; Tan et al. 2015).

However, in practice, it can be difficult to find a meaningful clustering due to noise in the estimated partial correlations. This noise can stem from sampling, which is in particular the case when the number of observations n is small, or from small nonzero partial correlations in the true precision matrix that might be considered insignificant. In this work, we are particularly interested in the latter type of noise. In the extreme, small partial correlations might lead to a connected graph of variables, where no grouping of variables can be identified. For an exploratory analysis, such a result might not be desirable.

As an alternative, we propose to cluster variables such that the partial correlation between any two variables in different clusters is negligibly small, but not necessarily zero. The open question, which we try to address here, is whether there is a principled model selection criterion for this scenario.

For example, the Bayesian information criterion (BIC) (Schwarz 1978) is a popular model selection criterion for the Gaussian graphical model. However, in this noise setting it does not come with any formal guarantees. As a solution, we propose here a Bayesian model that explicitly accounts for small partial correlations between variables in different clusters.

Under our proposed model, the marginal likelihood of the data can then be used to identify the correct clustering (if there is a ground truth, as in theory), or at least a meaningful clustering (in practice) that helps the analysis. Since the marginal likelihood of our model does not have an analytic solution, we provide two approximations: the first is a variational approximation, and the second is based on MCMC.

Experiments on simulated data show that the proposed method is about as accurate as BIC in the no noise setting, but considerably more accurate when there are noisy partial correlations. The proposed method also compares favorably to two previously proposed methods for variable clustering and model selection, namely the Clustered Graphical Lasso (CGL) (Tan et al. 2015) and the Dirichlet Process Variable Clustering (DPVC) (Palla et al. 2012) method.

Our paper is organized as follows. In Sect. 2, we discuss previous works related to variable clustering and model selection. In Sect. 3, we introduce a basic Bayesian model for evaluating variable clusterings, which we then extend in Sect. 4.1 to handle noise on the precision matrix. For the proposed model, the calculation of the marginal likelihood is infeasible, and we describe two approximation strategies in Sect. 4.2. Furthermore, since enumerating all possible clusterings is also intractable, we describe in Sect. 4.3 a heuristic based on spectral clustering to limit the number of candidate clusterings. We evaluate the proposed method on synthetic and real data in Sects. 5 and 6, respectively. Finally, we discuss our findings in Sect. 7.

2 Related work

Finding a clustering of variables is equivalent to finding an appropriate block structure of the covariance matrix. Recently, Tan et al. (2015) and Devijver and Gallopin (2018) suggested detecting block-diagonal structure by thresholding the absolute values of the covariance matrix. Their methods perform model selection using the mean squared error of randomly left-out elements of the covariance matrix (Tan et al. 2015), and a slope heuristic (Devijver and Gallopin 2018).

Several Bayesian latent variable models have also been proposed for this task (Marlin and Murphy 2009; Sun et al. 2014; Palla et al. 2012). Each clustering, including the number of clusters, is either evaluated using the variational lower bound (Marlin and Murphy 2009), or by placing a Dirichlet process prior over clusterings (Palla et al. 2012; Sun et al. 2014). However, all of the above methods assume that the partial correlations of variables across clusters are exactly zero.

An exception is the work in Marlin et al. (2009), which proposes to regularize the precision matrix such that partial correlations of variables that belong to the same cluster are penalized less than those belonging to different clusters. For that purpose, they introduce three hyper-parameters: \(\lambda _1\) (for the within-cluster penalty), \(\lambda _0\) (for the across-cluster penalty), with \(\lambda _0 > \lambda _1\), and \(\lambda _D\) (for a penalty on the diagonal elements). The clusters do not need to be known a priori and are estimated by optimizing a lower bound on the marginal likelihood. As such, their method can also find variable clusterings, even when the true partial correlation of variables in different clusters is not exactly zero. However, the clustering result is influenced by the three hyper-parameters \(\lambda _0, \lambda _1\), and \(\lambda _D\), which have to be determined using cross-validation.

Recently, the works in Sun et al. (2015) and Hosseini and Lee (2016) relax the assumption of a clean block structure by allowing some variables to belong to two clusters. The model selection issue, in particular determining the number of clusters, is either addressed with heuristics (Sun et al. 2015) or cross-validation (Hosseini and Lee 2016).

3 The Bayesian Gaussian graphical model for clustering

Our starting point for variable clustering is the following Bayesian Gaussian graphical model. Let us denote by d the number of variables, and n the number of observations. We assume that each observation \({\mathbf {x}} \in {\mathbb {R}}^d\) is generated i.i.d. from a multivariate normal distribution with zero mean and covariance matrix \(\varSigma \). Assuming that there are k groups of variables that are mutually independent, we know that, after appropriate permutation of the variables, \(\varSigma \) has the following block structure

$$\begin{aligned} \varSigma = \left( \begin{array}{ccc} \varSigma _1 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad \ddots &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad \varSigma _k \end{array} \right) , \end{aligned}$$

where \(\varSigma _j \in {\mathbb {R}}^{d_j \times d_j}\), and \(d_j\) is the number of variables in cluster j.

By placing an inverse Wishart prior over each block \(\varSigma _j\), we arrive at the following Bayesian model

$$\begin{aligned} \begin{aligned}&p({\mathbf {x}}_1, \ldots , {\mathbf {x}}_n, \varSigma | \{ \nu _{j} \}_j, \{\varSigma _{j,0}\}_j, {\mathcal {C}}) \\&\quad = \prod _{i = 1}^n \text {Normal} ({\mathbf {x}}_i | {\mathbf {0}}, \varSigma ) \prod _{j=1}^k \text {InvW}(\varSigma _j | \nu _{j}, \varSigma _{j,0}), \end{aligned} \end{aligned}$$
(1)

where \(\nu _{j}\) and \(\varSigma _{j,0}\) are the degrees of freedom and the scale matrix, respectively. We set \(\nu _{j} = d_j + 1\) and \(\varSigma _{j,0} = I_{d_j}\), leading to a non-informative prior on \(\varSigma _j\). \({\mathcal {C}}\) denotes the variable clustering which imposes the block structure on \(\varSigma \). We will refer to this model as the basic inverse Wishart prior model.

Assuming we are given a set of possible variable clusterings \({\mathscr {C}}\), we can then choose the clustering \(\mathcal {{\hat{C}}}\) that maximizes the posterior probability of the clustering, i.e.,

$$\begin{aligned} \mathcal {{\hat{C}}} = \mathop {{{\,\mathrm{arg\,max}\,}}}_{{\mathcal {C}} \in {\mathscr {C}}} p({\mathcal {C}} | {\mathscr {X}}) = \mathop {{{\,\mathrm{arg\,max}\,}}}_{{\mathcal {C}} \in {\mathscr {C}}} p({\mathscr {X}} | {\mathcal {C}}) \cdot p({\mathcal {C}}), \end{aligned}$$
(2)

where we denote by \({\mathscr {X}}\) the observations \({\mathbf {x}}_1, \ldots , {\mathbf {x}}_n\), and \(p({\mathcal {C}})\) is a prior over the clusterings which we assume to be uniform. Here, we refer to \(p({\mathscr {X}} | {\mathcal {C}})\) as the marginal likelihood (given the clustering). For the basic inverse Wishart prior model, the marginal likelihood can be calculated analytically, see, e.g., (Lenkoski and Dobra 2011).
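To make the closed form concrete, the following sketch evaluates the log marginal likelihood of the basic model by exploiting that it factorizes over the blocks and that each block is a standard Normal/inverse-Wishart conjugate pair; the function and variable names are ours, not taken from the released code.

```python
import numpy as np
from scipy.special import multigammaln

def log_marginal_likelihood_basic(X, clusters):
    """Closed-form log p(X | C) for the basic inverse Wishart prior model.

    X: n x d data matrix (rows are observations); clusters: list of index arrays.
    Uses the non-informative setting nu_j = d_j + 1, Sigma_{j,0} = I_{d_j}."""
    n, d = X.shape
    total = 0.0
    for c in clusters:
        d_j = len(c)
        nu_j = d_j + 1
        Scale0 = np.eye(d_j)                      # prior scale matrix Sigma_{j,0}
        S_j = X[:, c].T @ X[:, c]                 # scatter matrix of the block
        # standard conjugate Normal / inverse Wishart marginal likelihood of the block
        total += (-0.5 * n * d_j * np.log(np.pi)
                  + multigammaln((nu_j + n) / 2.0, d_j)
                  - multigammaln(nu_j / 2.0, d_j)
                  + 0.5 * nu_j * np.linalg.slogdet(Scale0)[1]
                  - 0.5 * (nu_j + n) * np.linalg.slogdet(Scale0 + S_j)[1])
    return total
```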

4 Proposed method

In this section, we introduce our proposed method for finding variable clusters.

First, in Sect. 4.1, we extend the basic inverse Wishart prior model from Eq. (1) in order to account for nonzero partial correlations between variables in different clusters. Given the proposed model, the marginal likelihood \(p({\mathscr {X}} | {\mathcal {C}})\) no longer has a closed form solution. Therefore, in Sects. 4.2.2 and 4.2.3, we discuss two different methods for approximating the marginal likelihood. The first method is based on a variational approximation around the maximum a posteriori (MAP) solution. The second method is an MCMC method based on Chib’s method (Chib 1995; Chib and Jeliazkov 2001). The latter has the advantage of being asymptotically correct for a large number of posterior samples, but at considerably higher computational cost. The former is much faster to evaluate and experimentally produces solutions similar to the MCMC method (see comparison in Sect. 5.3).

Finally, in Sect. 4.3, we propose to use a spectral clustering method to limit the clustering candidates to a set \({\mathscr {C}}^*\), where \({\mathscr {C}}^* \subseteq {\mathscr {C}}\). Based on this subset \({\mathscr {C}}^*\), we can then select the model maximizing the posterior probability [as in Eq. (2)], or calculate approximate posterior distributions over clusterings. We restrict the hypotheses space to \({\mathscr {C}}^*\), since even for a moderate number of variables, say \(d =40\), the size of the hypotheses space \(|{\mathscr {C}}|\) is \(>10^{36}\). Therefore, MCMC sampling over the hypotheses space could likewise explore only a small subset of the whole space, but at higher computational costs [see also Hans et al. (2007), Scott and Carvalho (2008) for a discussion of related high-dimensional problems].

4.1 A Bayesian Gaussian graphical model for clustering under noisy conditions

In this section, we extend the Bayesian model from Eq. (1) to account for nonzero partial correlations between variables in different clusters. For that purpose, we introduce the matrix \(\varSigma _{\epsilon } \in {\mathbb {R}}^{d \times d}\) that models the noise on the precision matrix. The full joint probability of our model is given as follows:

$$\begin{aligned} \begin{aligned}&p({\mathbf {x}}_1, \ldots , {\mathbf {x}}_n, \varSigma , \varSigma _{\epsilon } | \nu _{\epsilon }, \varSigma _{\epsilon ,0}, \{ \nu _{j} \}_j, \{\varSigma _{j,0}\}_j, {\mathcal {C}}) \\&\quad = \prod _{i = 1}^n \text {Normal} ({\mathbf {x}}_i | {\mathbf {0}}, \varXi ) \\&\quad \cdot \text {InvW}(\varSigma _{\epsilon } | \nu _{\epsilon }, \varSigma _{\epsilon ,0}) \prod _{j=1}^k \text {InvW}(\varSigma _j | \nu _{j}, \varSigma _{j,0}) , \end{aligned} \end{aligned}$$
(3)

where \(\varXi := (\varSigma ^{-1} + \beta \varSigma _{\epsilon }^{-1})^{-1}\), and

$$\begin{aligned} \varSigma := \left( \begin{array}{ccc} \varSigma _1 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad \ddots &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad \varSigma _k \end{array} \right) . \end{aligned}$$

As before, the block structure of \(\varSigma \) is given by the clustering \({\mathcal {C}}\). The proposed model is the same model as in Eq. (1), with the main difference that the noise term \(\beta \varSigma _{\epsilon }^{-1}\) is added to the precision matrix of the normal distribution.

\(1 \gg \beta > 0\) is a hyper-parameter that is fixed to a small positive value accounting for the degree of noise on the precision matrix. Furthermore, we assume non-informative priors on \(\varSigma _j\) and \(\varSigma _{\epsilon }\) by setting \(\nu _{j} = d_j + 1\), \(\varSigma _{j,0} = I_{d_j}\) and \(\nu _{\epsilon } = d + 1\), \(\varSigma _{\epsilon ,0} = I_d\).

Remark on the parameterization We note that as an alternative parameterization, we could have defined \(\varXi := (\varSigma ^{-1} + \varSigma _{\epsilon }^{-1})^{-1}\), and instead place a prior on \(\varSigma _{\epsilon }\) that encourages \(\varSigma _{\epsilon }^{-1}\) to be small in terms of some matrix norm. For example, we could have set \(\varSigma _{\epsilon ,0} = \frac{1}{\beta } I_d\). We chose the parameterization \(\varXi := (\varSigma ^{-1} + \beta \varSigma _{\epsilon }^{-1})^{-1}\), since it allows us to set \(\beta \) to 0, which recovers the basic inverse Wishart prior model.

4.2 Estimation of the marginal likelihood

The marginal likelihood of the data given our proposed model can be expressed as follows:

$$\begin{aligned}&p({\mathbf {x}}_1, \ldots , {\mathbf {x}}_n | \nu _{\epsilon }, \varSigma _{\epsilon ,0}, \{ \nu _{j} \}_j, \{\varSigma _{j,0}\}_j, {\mathcal {C}}) \\&\quad = \int \text {Normal} ({\mathbf {x}}_1, \ldots , {\mathbf {x}}_n | {\mathbf {0}}, \varXi ) \\&\qquad \cdot \text {InvW}(\varSigma _{\epsilon } | \nu _{\epsilon }, \varSigma _{\epsilon ,0}) \prod _{j=1}^k \text {InvW}(\varSigma _j | \nu _{j}, \varSigma _{j,0}) \\&\qquad \, d\varSigma _{\epsilon } \, d\varSigma _{1} \cdots d\varSigma _{k} , \end{aligned}$$

where \(\varXi := (\varSigma ^{-1} + \beta \varSigma _{\epsilon }^{-1})^{-1}\), and the integration is over the positive-definite matrices \(\varSigma _{\epsilon } \succ 0\) and \(\varSigma _{1} \succ 0, \ldots , \varSigma _{k} \succ 0\).

Clearly, if \(\beta = 0\), we recover the basic inverse Wishart prior model, as discussed in Sect. 3, and the marginal likelihood has a closed form solution due to the conjugacy of the covariance matrix of the Gaussian and the inverse Wishart prior. However, if \(\beta > 0\), there is no analytic solution anymore. Therefore, we propose to use an estimate based on either a variational approximation (Sect. 4.2.2) or MCMC (Sect. 4.2.3). Both of our estimates require the calculation of the maximum a posteriori (MAP) solution, which we explain first in Sect. 4.2.1.

Remark on BIC type approximation of the marginal likelihood We note that for our proposed model an approximation of the marginal likelihood using BIC is not sensible. To see this, recall that BIC consists of two terms: the data log-likelihood under the model with the maximum likelihood estimate, and a penalty depending on the number of free parameters. The maximum likelihood estimate is

$$\begin{aligned} {\hat{\varSigma }}, {\hat{\varSigma }}_{\epsilon } = \mathop {{{\,\mathrm{arg\,max}\,}}}_{\varSigma , \varSigma _{\epsilon }} \sum _{i = 1}^n \log \text {Normal} ({\mathbf {x}}_i | {\mathbf {0}}, (\varSigma ^{-1} + \beta \varSigma _{\epsilon }^{-1})^{-1}), \end{aligned}$$

where S denotes the sample covariance matrix. Note that without the specification of a prior, the estimates \({\hat{\varSigma }}\) and \({\hat{\varSigma }}_{\epsilon }\) themselves do not need to be positive definite as long as the matrix \({\hat{\varSigma }}^{-1} + \beta {\hat{\varSigma }}_{\epsilon }^{-1}\) is positive definite. Therefore, \({\hat{\varSigma }}^{-1} + \beta {\hat{\varSigma }}_{\epsilon }^{-1} = S^{-1}\), and the data likelihood under the model with the maximum likelihood estimate is simply \(\sum _{i = 1}^n \log \text {Normal} ({\mathbf {x}}_i | {\mathbf {0}}, S)\), which is independent of the clustering. Furthermore, the number of free parameters is \((d^2 - d) / 2\), which is also independent of the clustering. This means that any clustering leads to the same BIC.

Furthermore, a Laplace approximation as used in the generalized Bayesian information criterion (Konishi et al. 2004) is also not suitable, since in our case the parameter space is the set of positive-definite matrices.

4.2.1 Calculation of the maximum a posteriori solution

Finding the exact MAP is crucial for the quality of the marginal likelihood approximations that we describe later in Sects. 4.2.2 and 4.2.3. In this section, we explain in detail how the corresponding optimization problem can be solved with a 3-block ADMM method, which is guaranteed to converge to the global optimum.

First note that

$$\begin{aligned}&p(\varSigma , \varSigma _{\epsilon } | {\mathbf {x}}_1, \ldots , {\mathbf {x}}_n, \nu _{\epsilon }, \varSigma _{\epsilon ,0}, \{ \nu _{j} \}_j, \{\varSigma _{j,0}\}_j, {\mathcal {C}}) \\&\quad \propto \text {Normal} ({\mathbf {x}}_1, \ldots , {\mathbf {x}}_n | {\mathbf {0}}, \varXi ) \\&\qquad \cdot \prod _{j=1}^k \text {InvW}(\varSigma _j | \nu _{j}, \varSigma _{j,0}) \\&\qquad \cdot \text {InvW}(\varSigma _{\epsilon } | \nu _{\epsilon }, \varSigma _{\epsilon ,0}) \end{aligned}$$

where \(\varXi := (\varSigma ^{-1} + \beta \varSigma _{\epsilon }^{-1})^{-1}\).

Therefore,

$$\begin{aligned}&\log p(\varSigma , \varSigma _{\epsilon } | {\mathbf {x}}_1, \ldots , {\mathbf {x}}_n, \nu _{\epsilon }, \varSigma _{\epsilon ,0}, \{ \nu _{j} \}_j, \{\varSigma _{j,0}\}_j, {\mathcal {C}}) \\&\quad = -\frac{n}{2} \log |\varXi | -\frac{n}{2} \text {trace} (S \varXi ^{-1} ) \\&\qquad -\frac{\nu _{\epsilon } + d + 1}{2} \log |\varSigma _{\epsilon }| - \frac{1}{2} \text {trace} (\varSigma _{\epsilon ,0} \varSigma _{\epsilon }^{-1}) \\&\qquad + \sum _{j = 1}^{k} \left( -\frac{\nu _{j} + d_j + 1}{2} \log |\varSigma _{j}| - \frac{1}{2} \text {trace} (\varSigma _{j,0} \varSigma _{j}^{-1}) \right) \\&\qquad + const \\&\quad = \frac{1}{2} \Big (n \cdot \log |\varXi ^{-1}| - n \cdot \text {trace} (S \varXi ^{-1} ) \\&\qquad + (\nu _{\epsilon } + d + 1) \cdot \log |\varSigma _{\epsilon }^{-1}| - \text {trace} (\varSigma _{\epsilon ,0} \varSigma _{\epsilon }^{-1}) \\&\qquad + \sum _{j = 1}^{k} \Big ( (\nu _{j} + d_j + 1) \cdot \log |\varSigma _{j}^{-1}| - \text {trace} (\varSigma _{j,0} \varSigma _{j}^{-1}) \Big ) \Big ) \\&\qquad + const , \end{aligned}$$

where the constant is with respect to \(\varSigma _{\epsilon }, \varSigma _{1}, \ldots \varSigma _{k}\), and \(d_j\) denotes the number of variables in cluster j.

Solution using a 3-Block ADMM Finding the MAP can be formulated as a convex optimization problem by a change of parameterization: by defining \(X := \varSigma ^{-1}\), \(X_j := \varSigma ^{-1}_j\), and \(X_{\epsilon } := \varSigma _{\epsilon }^{-1}\), we get the following convex optimization problem:

$$\begin{aligned} \begin{aligned}&\mathop {{{\,\mathrm{minimize}\,}}}_{X \succ 0, X_{\epsilon } \succ 0} \; n \cdot \text {trace}(S (X + \beta X_{\epsilon })) - n \cdot \log |X + \beta X_{\epsilon }| \\&\quad + \text {trace}(A_{\epsilon } X_{\epsilon }) - a_{\epsilon } \cdot \log |X_{\epsilon }| \\&\quad + \sum _{j = 1}^{k} \Big ( \text {trace}(A_j X_j) - a_j \cdot \log |X_j| \Big ) , \end{aligned} \end{aligned}$$
(4)

where, for simplifying notation, we introduced the following constants:

$$\begin{aligned}&A_{\epsilon } := \varSigma _{\epsilon ,0} , \\&a_{\epsilon } := \nu _{\epsilon } + d + 1 , \\&A_{j} := \varSigma _{j,0} , \\&a_{j} := \nu _{j} + d_j + 1 . \end{aligned}$$

From this form, we see immediately that the problem is strictly convex jointly in \(X_{\epsilon }\) and X.Footnote 1

We further reformulate the problem by introducing an additional variable Z:

$$\begin{aligned}&\text {minimize} \; f(X_{\epsilon }, X_1, \ldots , X_k, Z) \\&\quad \text {subject to} \; \\&\quad \; Z = X + \beta X_{\epsilon } , \\&\quad X_{\epsilon }, X_1, \ldots , X_k, Z \succeq 0 , \end{aligned}$$

with

$$\begin{aligned} f(X_{\epsilon }, X_1, \ldots , X_k, Z)&:= n \cdot \text {trace}(S Z) - n \cdot \log |Z| \\&\quad + \text {trace}(A_{\epsilon } X_{\epsilon }) - a_{\epsilon } \cdot \log |X_{\epsilon }| \\&\quad + \sum _{j = 1}^{k} \Big ( \text {trace}(A_j X_j) - a_j \cdot \log |X_j| \Big ) . \end{aligned}$$

It is tempting to use a 2-block ADMM algorithm, as, e.g., in Boyd et al. (2011), which leads to two optimization problems: an update of \(X, X_{\epsilon }\) and an update of Z. Unfortunately, in our case the resulting optimization problem for updating \(X, X_{\epsilon }\) does not have an analytic solution. Therefore, we suggest instead the use of a 3-block ADMM, which updates the following sequence:

$$\begin{aligned} X^{t + 1}&:= \mathop {{{\,\mathrm{arg\,min}\,}}}_{X_1, \ldots , X_k \succ 0} \; \sum _{j = 1}^{k} \Big ( \text {trace}(A_j X_j) - a_j \cdot \log |X_j| \Big )\\&\quad + \text {trace}(U^t (X + \beta X_{\epsilon }^t - Z^t)) \\&\quad + \frac{\rho }{2} || X + \beta X_{\epsilon }^t - Z^t ||_F^2 , \\ X_{\epsilon }^{t + 1}&:= \mathop {{{\,\mathrm{arg\,min}\,}}}_{X_{\epsilon } \succ 0} \; \text {trace}(A_{\epsilon } X_{\epsilon }) - a_{\epsilon } \cdot \log |X_{\epsilon }| \\&\quad + \text {trace}(U^t (X^{t+1} + \beta X_{\epsilon } - Z^t)) \\&\quad + \frac{\rho }{2} || X^{t+1} + \beta X_{\epsilon } - Z^t ||_F^2 , \\ Z^{t+1}&:= \mathop {{{\,\mathrm{arg\,min}\,}}}_{Z \succ 0} \; n \cdot \text {trace}(S Z) - n \cdot \log |Z| \\&\quad + \text {trace}(U^t (X^{t+1} + \beta X_{\epsilon }^{t+1} - Z)) \\&\quad + \frac{\rho }{2} || X^{t+1} + \beta X_{\epsilon }^{t + 1} - Z ||_F^2 , \\ U^{t+1}&:= \rho (X^{t+1} + \beta X_{\epsilon }^{t+1} - Z^{t + 1}) + U^t , \end{aligned}$$

where U is the Lagrange multiplier; \(X^t, Z^t, U^t\) denote X, Z, U at iteration t; and \(\rho > 0\) is the learning rate.Footnote 2

Each of the above sub-optimization problems can be solved efficiently via the following strategy. The zero gradient condition for the first optimization problem with variable X is

$$\begin{aligned} - X_j^{-1} + \frac{\rho }{a_j} X_j = - \frac{1}{a_j} (A_j + U_j + \rho (\beta X_{\epsilon ,j} - Z_j)) . \end{aligned}$$

The zero gradient condition for the second optimization problem with variable \(X_{\epsilon }\) is

$$\begin{aligned} - X_{\epsilon }^{-1} + \frac{\rho \beta ^2}{a_{\epsilon }} X_{\epsilon } = - \frac{1}{a_{\epsilon }} ( A_{\epsilon } + \beta U + \rho \beta (X - Z)) . \end{aligned}$$

The zero gradient condition for the third optimization problem with variable Z is

$$\begin{aligned} - Z^{-1} + \frac{\rho }{n} Z = \frac{1}{n} ( U - nS + \rho (X + \beta X_{\epsilon })) . \end{aligned}$$

Each of the above three optimization problems can be solved via an eigenvalue decomposition as follows. We need to find V such that it satisfies:

$$\begin{aligned} - V^{-1} + \lambda V = R \; \; \wedge \; \; V \succeq 0 \end{aligned}$$

Since R is a symmetric matrix (not necessarily positive or negative semi-definite), we have the eigenvalue decomposition:

$$\begin{aligned} QLQ^T = R , \end{aligned}$$

where Q is an orthonormal matrix and L is a diagonal matrix with real values. Denoting \(Y := Q^T V Q\), we have

$$\begin{aligned} - Y^{-1} + \lambda Y = L , \end{aligned}$$
(5)

Since the solution Y must also be a diagonal matrix, we have \(Y_{ij} = 0\), for \(j \ne i\), and we must have that

$$\begin{aligned} - (Y_{ii})^{-1} + \lambda Y_{ii} = L_{ii} . \end{aligned}$$
(6)

Then, Eq. (6) is equivalent to

$$\begin{aligned} \lambda Y_{ii}^2 - L_{ii} Y_{ii} -1 = 0 , \end{aligned}$$

and therefore, one solution is

$$\begin{aligned} Y_{ii} = \frac{L_{ii} + \sqrt{L_{ii}^2 + 4 \lambda }}{2 \lambda } . \end{aligned}$$

Note that for \(\lambda > 0\), we have \(Y_{ii} > 0\). Therefore, the resulting Y solves Eq. (5), and moreover

$$\begin{aligned} V = Q Y Q^T \succ 0 . \end{aligned}$$

That means we can solve the semi-definite problem with only one eigenvalue decomposition; the cost is therefore \(O(d^3)\).
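The following short sketch implements this solver; it returns the positive-definite V for given symmetric R and \(\lambda > 0\) (function and variable names are ours):

```python
import numpy as np

def solve_matrix_equation(R, lam):
    """Solve -V^{-1} + lam * V = R for symmetric R and lam > 0.

    Returns the unique positive-definite solution using a single
    eigenvalue decomposition, i.e., at O(d^3) cost."""
    L, Q = np.linalg.eigh(R)                      # R = Q diag(L) Q^T
    # positive root of lam * y^2 - L_ii * y - 1 = 0, so that all Y_ii > 0
    y = (L + np.sqrt(L ** 2 + 4.0 * lam)) / (2.0 * lam)
    return (Q * y) @ Q.T                          # V = Q diag(y) Q^T is positive definite
```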

Finally, we note that in contrast to the 2-block ADMM, a general 3-block ADMM does not have a convergence guarantee for any \(\rho > 0\). However, using a recent result from (Lin et al. 2018), we can show in “Appendix A” that in our case the conditions for convergence are met for any \(\rho > 0\).
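For concreteness, the sketch below assembles the full 3-block ADMM iteration from the update equations above, reusing the solve_matrix_equation helper from the previous sketch; the initialization, fixed iteration count, and choice of \(\rho \) are illustrative assumptions rather than the exact settings of our implementation.

```python
import numpy as np

def map_admm(S, n, clusters, beta, rho=1.0, n_iter=500):
    """3-block ADMM sketch for the MAP of (X, X_eps) = (Sigma^{-1}, Sigma_eps^{-1}).

    S: d x d sample covariance, clusters: list of index arrays defining the blocks,
    beta: noise hyper-parameter (beta > 0), rho: ADMM learning rate."""
    d = S.shape[0]
    # constants of Eq. (4) under the non-informative priors of Sect. 4.1
    A_eps, a_eps = np.eye(d), 2 * d + 2                      # nu_eps + d + 1
    A = [np.eye(len(c)) for c in clusters]
    a = [2 * len(c) + 2 for c in clusters]                   # nu_j + d_j + 1

    X, X_eps, Z, U = np.eye(d), np.eye(d), np.eye(d), np.zeros((d, d))
    for _ in range(n_iter):
        # X update: one independent sub-problem per diagonal block
        X = np.zeros((d, d))
        for j, c in enumerate(clusters):
            blk = np.ix_(c, c)
            R_j = -(A[j] + U[blk] + rho * (beta * X_eps[blk] - Z[blk])) / a[j]
            X[blk] = solve_matrix_equation(R_j, rho / a[j])
        # X_eps update
        R_eps = -(A_eps + beta * U + rho * beta * (X - Z)) / a_eps
        X_eps = solve_matrix_equation(R_eps, rho * beta ** 2 / a_eps)
        # Z update
        R_z = (U - n * S + rho * (X + beta * X_eps)) / n
        Z = solve_matrix_equation(R_z, rho / n)
        # dual update
        U = U + rho * (X + beta * X_eps - Z)
    return X, X_eps
```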

4.2.2 Variational approximation of the marginal likelihood

Here, we explain our strategy for the calculation of a variational approximation of the marginal likelihood. For simplicity, let \(\varvec{\theta }\) denote the vector of all parameters, \({\mathscr {X}}\) the observed data, and \(\varvec{\eta }\) the vector of all hyper-parameters.

Let \(\hat{\varvec{\theta }}\) denote the posterior mode. Furthermore, let \(g(\varvec{\theta })\) be an approximation of the posterior distribution \(p(\varvec{\theta } | {\mathscr {X}}, \varvec{\eta }, {\mathcal {C}})\) that is accurate around the mode \(\hat{\varvec{\theta }}\).

Then, we have

$$\begin{aligned} \begin{aligned} p({\mathscr {X}} | \varvec{\eta }, {\mathcal {C}})&= \frac{p(\varvec{\theta } , {\mathscr {X}} | \varvec{\eta }, {\mathcal {C}})}{p(\varvec{\theta } | {\mathscr {X}}, \varvec{\eta }, {\mathcal {C}})} \\&= \frac{p(\hat{\varvec{\theta }}, {\mathscr {X}} | \varvec{\eta }, {\mathcal {C}})}{p(\hat{\varvec{\theta }} | {\mathscr {X}}, \varvec{\eta }, {\mathcal {C}})} \approx \frac{p(\hat{\varvec{\theta }} , {\mathscr {X}} | \varvec{\eta }, {\mathcal {C}})}{g(\hat{\varvec{\theta }})} . \end{aligned} \end{aligned}$$
(7)

Note that for the Laplace approximation we would use \(g(\varvec{\theta }) = N(\varvec{\theta } | \hat{\varvec{\theta }}, V)\), where V is an appropriate covariance matrix. However, here the posterior \(p(\varvec{\theta } | {\mathscr {X}}, \varvec{\eta }, {\mathcal {C}})\) is a probability measure over the positive-definite matrices and not over \({\mathbb {R}}^d\), which makes the Laplace approximation inappropriate.

Instead, we suggest to approximate the posterior distribution \(p(\varSigma _{\epsilon }, \varSigma _{1}, \ldots , \varSigma _{k} | {\mathbf {x}}_1, \ldots , {\mathbf {x}}_n, \nu _{\epsilon }, \varSigma _{\epsilon ,0}, \{ \nu _{j} \}_j, \{\varSigma _{j,0}\}_j, {\mathcal {C}})\) by the factorized distribution

$$\begin{aligned} g := g_{\epsilon }(\varSigma _{\epsilon }) \cdot \prod _{j=1}^{k} g_{j}(\varSigma _{j}) . \end{aligned}$$

We define \(g_{\epsilon }(\varSigma _{\epsilon })\) and \(g_{j}(\varSigma _j)\) as follows:

$$\begin{aligned} g_{\epsilon }(\varSigma _{\epsilon }) := \text {InvW}(\varSigma _{\epsilon } | \nu _{g, \epsilon }, \varSigma _{g, \epsilon }) , \end{aligned}$$

with

$$\begin{aligned} \varSigma _{g,\epsilon } := (\nu _{g, \epsilon } + d + 1) \cdot {\hat{\varSigma }}_{\epsilon } , \end{aligned}$$

where \({\hat{\varSigma }}_{\epsilon }\) is the mode of the posterior probability \(p(\varSigma _{\epsilon } | {\mathscr {X}}, \varvec{\eta }, {\mathcal {C}})\) (as calculated in the previous section). Note that this choice ensures that the mode of \(g_{\epsilon }\) is the same as the mode of \(p(\varSigma _{\epsilon } | {\mathbf {x}}_1, \ldots , {\mathbf {x}}_n, \varvec{\eta }, {\mathcal {C}})\). Analogously, we set

$$\begin{aligned} g_{j}(\varSigma _{j}) := \text {InvW}(\varSigma _{j} | \nu _{g, j}, \varSigma _{g, j}) , \end{aligned}$$

with

$$\begin{aligned} \varSigma _{g,j} := (\nu _{g, j} + d_j + 1) \cdot {\hat{\varSigma }}_{j} , \end{aligned}$$

where \({\hat{\varSigma }}_{j}\) is the mode of the posterior probability \(p(\varSigma _{j} | {\mathscr {X}}, \varvec{\eta }, {\mathcal {C}})\). The remaining parameters \(\nu _{g, \epsilon } \in {\mathbb {R}}\) and \(\nu _{g,j} \in {\mathbb {R}}\) are optimized by minimizing the KL-divergence between the factorized distribution g and the posterior distribution \(p(\varSigma _{\epsilon }, \varSigma _{1}, \ldots \varSigma _{k} | {\mathbf {x}}_1, \ldots , {\mathbf {x}}_n, \varvec{\eta }, {\mathcal {C}})\). The details of the following derivations are given in “Appendix B”. For simplicity, let us denote \(g_J := \prod _{j=1}^{k} g_{j}\), then we have

$$\begin{aligned} KL(g || p)&= - \int g_{\epsilon }(\varSigma _{\epsilon }) \cdot \prod _{j=1}^{k} g_{j}(\varSigma _{j}) \\&\quad \log \frac{p(\varSigma _{\epsilon }, \varSigma _{1}, \ldots \varSigma _{k}, {\mathbf {x}}_1, \ldots , {\mathbf {x}}_n | \varvec{\eta }, {\mathcal {C}})}{ g_{\epsilon }(\varSigma _{\epsilon }) \cdot \prod _{j=1}^{k} g_{j}(\varSigma _{j})} d \varSigma _{\epsilon } d\varSigma \\&\quad + c \\&= - \frac{1}{2} n {{\,\mathrm{{\mathbb {E}}}\,}}_{g_J, g_{\epsilon }}[\log |\varSigma ^{-1} + \beta \varSigma _{\epsilon }^{-1}|] \\&\quad + \frac{1}{2} (\nu _{\epsilon } + d + 1) {{\,\mathrm{{\mathbb {E}}}\,}}_{g_{\epsilon }}[ \log |\varSigma _{\epsilon }| ] \\&\quad + \frac{1}{2} \text {trace} ((\varSigma _{\epsilon ,0} + \beta nS) {{\,\mathrm{{\mathbb {E}}}\,}}_{g_{\epsilon }}[ \varSigma _{\epsilon }^{-1} ] ) \\&\quad - \text {Entropy}[g_{\epsilon }] \\&\quad + \frac{1}{2} \sum _{j = 1}^{k} (\nu _{j} + d_j + 1) {{\,\mathrm{{\mathbb {E}}}\,}}_{g_j}[ \log |\varSigma _{j}|] \\&\quad + \frac{1}{2} \sum _{j = 1}^{k} \text {trace} ((\varSigma _{j,0} + nS_j) {{\,\mathrm{{\mathbb {E}}}\,}}_{g_j}[\varSigma _{j}^{-1}]) \\&\quad - \sum _{j = 1}^{k} \text {Entropy}[g_{j}] + c , \end{aligned}$$

where c is a constant with respect to \(g_{\epsilon }\) and \(g_j\). However, the term \({{\,\mathrm{{\mathbb {E}}}\,}}_{g_J, g_{\epsilon }}[\log |\varSigma ^{-1} + \beta \varSigma _{\epsilon }^{-1}|]\) cannot be computed analytically; therefore, we need to resort to an additional approximation.

We assume that \({{\,\mathrm{{\mathbb {E}}}\,}}_{g_J, g_{\epsilon }}[\log |\varSigma ^{-1} + \beta \varSigma _{\epsilon }^{-1}|] \approx {{\,\mathrm{{\mathbb {E}}}\,}}_{g_J, g_{\epsilon }}[\log |\varSigma ^{-1}|]\). This way, we get

$$\begin{aligned} KL(g || p)&\approx KL(g_{\epsilon } \, || \, \text {InvW}(\nu _{\epsilon }, \varSigma _{\epsilon ,0} + \beta nS)) \\&\quad + \sum _{j = 1}^{k} KL(g_j \, || \, \text {InvW} (\nu _{j} + n, \varSigma _{j,0} + nS_j)) \\&\quad + c' , \end{aligned}$$

where we used that

$$\begin{aligned} {{\,\mathrm{{\mathbb {E}}}\,}}_{g_J, g_{\epsilon }}[\log |\varSigma ^{-1}|] = - \sum _{j = 1}^{k} {{\,\mathrm{{\mathbb {E}}}\,}}_{g_j}[\log |\varSigma _j|] , \end{aligned}$$

and \(c'\) is a constant with respect to \(g_{\epsilon }\) and \(g_j\).

From the above expression, we see that we can optimize the parameters of \(g_{\epsilon }\) and \(g_j\) independently from each other. The optimal parameter \({\hat{\nu }}_{g, \epsilon }\) for \(g_{\epsilon }\) is

$$\begin{aligned} {\hat{\nu }}_{g, \epsilon }&= \mathop {{{\,\mathrm{arg\,min}\,}}}_{\nu _{g, \epsilon }} KL(g_{\epsilon } \, || \, \text {InvW}(\nu _{\epsilon }, \varSigma _{\epsilon ,0} + \beta nS)) \\&= \mathop {{{\,\mathrm{arg\,min}\,}}}_{\nu _{g, \epsilon }} \frac{\nu _{g, \epsilon }}{\nu _{g, \epsilon } + d + 1} \text {trace} \Big (\Big (\varSigma _{\epsilon ,0} + \beta nS\Big ) {\hat{\varSigma }}_{\epsilon }^{-1}\Big ) \\&\quad - 2 \log \varGamma _d\left( \frac{ \nu _{g, \epsilon }}{2}\right) - \nu _{g, \epsilon } d + d \nu _{\epsilon } \log (\nu _{g, \epsilon } + d + 1) \\&\quad + (\nu _{g, \epsilon } - \nu _{\epsilon }) \sum _{i=1}^{d} \psi \left( \frac{\nu _{g, \epsilon } - d + i}{2} \right) . \end{aligned}$$

And analogously, we have

$$\begin{aligned} {\hat{\nu }}_{g, j}&= \mathop {{{\,\mathrm{arg\,min}\,}}}_{\nu _{g, j}} \, \frac{\nu _{g, j}}{\nu _{g, j} + d_j + 1} \text {trace} \Big (\Big (\varSigma _{j,0} + nS_j\Big ) {\hat{\varSigma }}_{j}^{-1}\Big ) \\&\quad - 2 \log \varGamma _{d_j}\left( \frac{ \nu _{g, j}}{2}\right) - \nu _{g, j} d_j \\&\quad + d_j (\nu _{j} + n) \log (\nu _{g, j} + d_j + 1) \\&\quad + (\nu _{g, j} - \nu _{j} - n) \sum _{i=1}^{d_j} \psi \left( \frac{\nu _{g, j} - d_j + i}{2} \right) . \end{aligned}$$

Each is a one-dimensional non-convex optimization problem that we solve with Brent’s method (Brent 1971).
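As an illustration, the one-dimensional objective for \(\nu _{g, \epsilon }\) can be minimized as in the sketch below; we use SciPy's bounded Brent-type minimizer, and the search interval as well as the argument names are our own assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import multigammaln, psi

def optimal_nu_g_eps(Sigma_eps_hat, Sigma_eps0, S, n, beta, nu_eps):
    """Fit nu_{g,eps} by minimizing the one-dimensional objective given above."""
    d = Sigma_eps_hat.shape[0]
    tr_term = np.trace((Sigma_eps0 + beta * n * S) @ np.linalg.inv(Sigma_eps_hat))

    def objective(nu):
        digamma_sum = sum(psi((nu - d + i) / 2.0) for i in range(1, d + 1))
        return (nu / (nu + d + 1.0) * tr_term
                - 2.0 * multigammaln(nu / 2.0, d)
                - nu * d
                + d * nu_eps * np.log(nu + d + 1.0)
                + (nu - nu_eps) * digamma_sum)

    # the inverse Wishart density requires nu > d - 1; the upper bound is heuristic
    res = minimize_scalar(objective, bounds=(float(d), float(d + 10 * n)), method="bounded")
    return res.x
```

The analogous optimization for each \(\nu _{g, j}\) only replaces d, \(\nu _{\epsilon }\), and the trace term by their block-specific counterparts.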

Discussion: Advantages over full variational approaches We described here an approximation to the marginal likelihood that can be considered as a blending of the ideas of the Laplace approximation (using the MAP) and a variational approximation where all parameters are learned by minimizing the Kullback–Leibler divergence between a variational distribution and the true posterior distribution. We refer to the latter as a full variational approximation. For simplicity, here, let us denote by \(\varSigma \) the positive-definite matrix for which we seek the posterior distribution, and let \(\varSigma _g\) denote the parameter matrix of the variational distribution.

An obvious limitation of the full variational approach is that the expectation involving \(\varSigma \) can no longer be calculated analytically. As a solution, recent works on black-box variational inference propose to use a Monte Carlo estimate of the expectation of the gradient. In order to address the high variance of this estimator, several techniques have been proposed (e.g., control variates and Rao–Blackwellization), among which the reparameterization trick appears to be the most promising (Ranganath et al. 2014; Kingma and Welling 2013; Kucukelbir et al. 2017). In particular, Stan (Carpenter et al. 2017) provides a readily available implementation of the reparameterization trick (Kucukelbir et al. 2017), named automatic differentiation variational inference (ADVI). In ADVI, the transformation is \(\varSigma _g := L^T L\) with L being a triangular matrix where each component is sampled from N(0, 1). The matrix L is the parameter of the variational distribution and is optimized with stochastic gradient descent. However, note that this optimization problem is a stochastic non-convex problem. In contrast, finding the MAP is a non-stochastic convex optimization problem, and the proposed solution is guaranteed to converge to the global minimum. Apart from that, we note that a full variational approximation does not have any theoretical quality guarantees, including the case where \(\beta \rightarrow 0\). In the general case, our approach also does not have such guarantees. However, in the special case where \(\beta \rightarrow 0\), we know that the true posterior distribution is an inverse Wishart distribution and therefore matches our choice of the variational distribution.
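Putting the pieces of this subsection together, the following sketch evaluates the approximation of Eq. (7) once the MAP blocks and the variational degrees of freedom have been computed; the argument names are illustrative, and the non-informative prior settings of Sect. 4.1 are assumed:

```python
import numpy as np
from scipy.stats import invwishart, multivariate_normal

def log_marginal_likelihood_vi(X_data, clusters, Sigma_hat_blocks, Sigma_eps_hat,
                               nu_g_blocks, nu_g_eps, beta):
    """Eq. (7): log p(X | C) ~= log p(theta_hat, X) - log g(theta_hat)."""
    n, d = X_data.shape
    # assemble the block-diagonal MAP estimate of Sigma
    Sigma_hat = np.zeros((d, d))
    for c, Sig_j in zip(clusters, Sigma_hat_blocks):
        Sigma_hat[np.ix_(c, c)] = Sig_j
    Xi = np.linalg.inv(np.linalg.inv(Sigma_hat) + beta * np.linalg.inv(Sigma_eps_hat))

    # log p(theta_hat, X): likelihood plus priors, evaluated at the MAP
    log_joint = multivariate_normal.logpdf(X_data, mean=np.zeros(d), cov=Xi).sum()
    log_joint += invwishart.logpdf(Sigma_eps_hat, df=d + 1, scale=np.eye(d))
    for c, Sig_j in zip(clusters, Sigma_hat_blocks):
        log_joint += invwishart.logpdf(Sig_j, df=len(c) + 1, scale=np.eye(len(c)))

    # log g(theta_hat): the factorized inverse Wishart approximation at its mode
    log_g = invwishart.logpdf(Sigma_eps_hat, df=nu_g_eps,
                              scale=(nu_g_eps + d + 1) * Sigma_eps_hat)
    for c, Sig_j, nu_g in zip(clusters, Sigma_hat_blocks, nu_g_blocks):
        log_g += invwishart.logpdf(Sig_j, df=nu_g, scale=(nu_g + len(c) + 1) * Sig_j)

    return log_joint - log_g
```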

4.2.3 MCMC estimation of marginal likelihood

As an alternative to the variational approximation, we investigate an MCMC estimation based on Chib’s method (Chib 1995; Chib and Jeliazkov 2001).

To simplify the description, we introduce the following notation

$$\begin{aligned}&\varvec{\theta }_1 := \varSigma _{\epsilon } , \\&\varvec{\theta }_{2}, \ldots , \varvec{\theta }_{k+1} := \varSigma _1, \ldots , \varSigma _k. \end{aligned}$$

Furthermore, we define \(\varvec{\theta }_{< i} := \{\varvec{\theta }_{1}, \ldots , \varvec{\theta }_{i - 1} \}\) and \(\varvec{\theta }_{> i} := \{\varvec{\theta }_{i+1}, \ldots , \varvec{\theta }_{k+1} \}\). For simplicity, we also suppress in the notation the explicit conditioning on the hyper-parameters \(\varvec{\eta }\) and the clustering \({\mathcal {C}}\), which are both fixed.

Following the strategy of Chib (1995), the marginal likelihood can be expressed as

$$\begin{aligned} \begin{aligned} p({\mathscr {X}})&= \frac{p(\hat{\varvec{\theta }}_1, \ldots , \hat{\varvec{\theta }}_{k+1} , {\mathscr {X}})}{p(\hat{\varvec{\theta }}_1, \ldots , \hat{\varvec{\theta }}_{k+1} | {\mathscr {X}}) } \\&= \frac{p(\hat{\varvec{\theta }}_1, \ldots , \hat{\varvec{\theta }}_{k+1} , {\mathscr {X}})}{ \prod _{i = 1}^{k+1} p(\hat{\varvec{\theta }}_i | {\mathscr {X}}, \hat{\varvec{\theta }}_{1} \ldots , \hat{\varvec{\theta }}_{i-1}) } \end{aligned} \end{aligned}$$
(8)

In order to approximate \(p({\mathscr {X}})\) with Eq. (8), we need to estimate \(p(\hat{\varvec{\theta }}_i | {\mathscr {X}}, \hat{\varvec{\theta }}_1, \ldots \hat{\varvec{\theta }}_{i-1})\). First, note that we can express the value of the conditional posterior distribution at \(\hat{\varvec{\theta }}_i\), as follows (see Chib and Jeliazkov (2001), Section 2.3):

$$\begin{aligned} \begin{aligned}&p(\hat{\varvec{\theta }}_i | {\mathscr {X}}, \hat{\varvec{\theta }}_1, \ldots , \hat{\varvec{\theta }}_{i-1})\\&\quad = \frac{{{\,\mathrm{{\mathbb {E}}}\,}}_{\varvec{\theta }_{\ge i} \sim p(\varvec{\theta }_{\ge i} | {\mathscr {X}}, \hat{\varvec{\theta }}_{< i})} [ \alpha (\varvec{\theta }_i, \hat{\varvec{\theta }}_i | \hat{\varvec{\theta }}_{< i}, \varvec{\theta }_{> i}) \, q_i(\hat{\varvec{\theta }}_i)] }{{{\,\mathrm{{\mathbb {E}}}\,}}_{\varvec{\theta }_i \sim q_i(\varvec{\theta }_i), \, \varvec{\theta }_{> i} \sim p(\varvec{\theta }_{> i} | {\mathscr {X}}, \hat{\varvec{\theta }}_{\le i})} [ \alpha (\hat{\varvec{\theta }}_i, \varvec{\theta }_i | \hat{\varvec{\theta }}_{< i}, \varvec{\theta }_{> i})] } , \end{aligned} \end{aligned}$$
(9)

where \(q_i(\varvec{\theta }_i)\) is a proposal distribution for \(\varvec{\theta }_i\), and the acceptance probability of moving from state \(\varvec{\theta }_i\) to state \(\varvec{\theta }_i'\), holding the other states fixed is defined as

$$\begin{aligned} \alpha (\varvec{\theta }_i, \varvec{\theta }_i' | \varvec{\theta }_{< i}, \varvec{\theta }_{> i}) := \min \left\{ 1, \frac{p({\mathscr {X}}, \varvec{\theta }_{< i}, \varvec{\theta }_{> i}, \varvec{\theta }_i') \cdot q_i(\varvec{\theta }_i) }{p({\mathscr {X}}, \varvec{\theta }_{< i}, \varvec{\theta }_{> i}, \varvec{\theta }_i) \cdot q_i(\varvec{\theta }_i') } \right\} . \end{aligned}$$
(10)

Next, using Eq. (9), we can estimate \(p(\hat{\varvec{\theta }}_i | {\mathscr {X}}, \hat{\varvec{\theta }}_1, \ldots , \hat{\varvec{\theta }}_{i-1})\) with a Monte Carlo approximation with M samples:

$$\begin{aligned} \begin{aligned}&p(\hat{\varvec{\theta }}_i | {\mathscr {X}}, \hat{\varvec{\theta }}_1, \ldots \hat{\varvec{\theta }}_{i-1})\\&\quad \approx \frac{ \frac{1}{M} \sum _{m=1}^M \alpha \left( \varvec{\theta }_i^{i,m}, \hat{\varvec{\theta }}_i | \hat{\varvec{\theta }}_{< i}, \varvec{\theta }_{> i}^{i,m}\right) q_i(\hat{\varvec{\theta }}_i) }{\frac{1}{M} \sum _{m=1}^M \alpha \left( \hat{\varvec{\theta }}_i, \varvec{\theta }_i^{q,m} | \hat{\varvec{\theta }}_{< i}, \varvec{\theta }_{> i}^{i+1, m} \right) } \end{aligned} \end{aligned}$$
(11)

where \(\varvec{\theta }_{i}^{a, m} \sim p(\varvec{\theta }_{i} | {\mathscr {X}}, \hat{\varvec{\theta }}_{< a})\), \(\varvec{\theta }_{> i}^{a, m} \sim p(\varvec{\theta }_{> i} | {\mathscr {X}}, \hat{\varvec{\theta }}_{< a})\), and \(\varvec{\theta }_{i}^{q,m} \sim q(\varvec{\theta }_i)\).

Finally, in order to sample from \(p(\varvec{\theta }_{\ge i} | {\mathscr {X}}, \hat{\varvec{\theta }}_{< i})\), we propose to use the Metropolis–Hastings within Gibbs sampler as shown in Algorithm 1. \(MH_j (\varvec{\theta }_j^{t}, \varvec{\psi })\) denotes the Metropolis–Hastings algorithm with current state \(\varvec{\theta }_j^{t}\), and acceptance probability \(\alpha (\varvec{\theta }_j, \varvec{\theta }_j' | \varvec{\psi })\), Eq. (10), and \(\varvec{\theta }_{\ge i}^{0}\) is a sample after the burn-in. For the proposal distribution \(q_i(\varvec{\theta }_i)\), we use

$$\begin{aligned} q_i := \left\{ \begin{array}{ll} \text {InvW}(\nu , {\hat{\varSigma }}_{\epsilon } \cdot (\nu + d + 1)) \\ \hbox { with}\ \nu = \beta \kappa \cdot n + \nu _{\epsilon } &{} \text {if } i = 1,\\ \text {InvW}(\nu , {\hat{\varSigma }}_{i-1} \cdot (\nu + d_{i-1} + 1)) \\ \hbox { with}\ \nu = (1 - \beta ) \kappa \cdot n + \nu _{i-1} &{} \text {else. } \end{array} \right. \end{aligned}$$
(12)

Here, \(\kappa >0\) is a hyper-parameter of the MCMC algorithm that is chosen to control the acceptance probability. Note that if we choose \(\kappa = 1\) and \(\beta \) is 0, then the proposal distribution \(q_i(\varvec{\theta }_i)\) equals the posterior distribution \(p(\varvec{\theta }_i | {\mathscr {X}}, \hat{\varvec{\theta }}_1, \ldots \hat{\varvec{\theta }}_{i-1})\). However, in practice, we found that the acceptance probabilities can be too small, leading to unstable estimates and division by 0 in Eq. (11). Therefore, for our experiments we chose \(\kappa = 10\).
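As an illustration of Eq. (12), the proposals can be constructed with SciPy's inverse Wishart as sketched below; the container arguments (MAP blocks, block dimensions, prior degrees of freedom) and their names are assumptions made for the sake of the example:

```python
from scipy.stats import invwishart

def make_proposal(i, Sigma_hats, dims, nus, beta, kappa, n):
    """Proposal q_i of Eq. (12): an inverse Wishart whose mode is the i-th MAP block.

    i = 1 corresponds to Sigma_eps; i >= 2 to the cluster block Sigma_{i-1}.
    Sigma_hats[i], dims[i], nus[i] hold the MAP estimate, dimension, and prior
    degrees of freedom of the block associated with theta_i (1-based indexing)."""
    if i == 1:
        nu = beta * kappa * n + nus[1]
    else:
        nu = (1.0 - beta) * kappa * n + nus[i]
    # scaling by (nu + dims[i] + 1) makes the mode of InvW(nu, scale) equal to Sigma_hats[i]
    return invwishart(df=nu, scale=Sigma_hats[i] * (nu + dims[i] + 1.0))
```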

(Algorithm 1)
Table 1 Evaluation of restricted hypotheses space for \(d =40\), \(n \in \{20, 40, 400, 4000, 40{,}000, 4{,}000{,}000\}\)
Table 2 Same setting as in Table 1 but with unbalanced clusters
Fig. 1

The ANMI scores of the clustering selected by the proposed method (blue), EBIC (orange), and Calinski–Harabasz Index (green) on synthetic data sets with \(d=40\) and ground truth being 4 balanced clusters. The upper and lower rows show results where the true precision matrix was generated from an inverse Wishart distribution and a uniform distribution, respectively. No noise setting (left column), small noise (middle column), large noise (right column). An ANMI score of 0.0 means correspondence with the true clustering at pure chance level and 1.0 means perfect correspondence. In both settings, with and without noise, the proposed method tends to be among the best. In contrast, EBIC tends to suffer in the noise setting for large n, and the Calinski–Harabasz Index performs sub-optimally in the no noise setting. (Color figure online)

Fig. 2

Same settings as in Fig. 1, but ground truth being 4 unbalanced clusters

4.3 Restricting the hypotheses space

The number of possible clusterings grows with the Bell numbers, and therefore it is infeasible to enumerate all possible clusterings, even if the number of variables d is small. It is therefore crucial to restrict the hypotheses space to a subset of all clusterings that is likely to contain the true clustering. We denote this subset as \({\mathscr {C}}^*\).

Table 3 Evaluation of clustering results for \(d =40\), \(n \in \{20, 40, 400, 4000, 40{,}000, 4{,}000{,}000\}\)
Table 4 Evaluation of clustering results with \(d =40\), \(n \in \{20, 40, 400, 4000, 40{,}000, 4{,}000{,}000\}\)
Table 5 Evaluation of clustering results for \(d =40\), \(n \in \{20, 40, 400, 4000, 40{,}000, 4{,}000{,}000\}\)
Table 6 Evaluation of clustering results with \(d =40\), \(n \in \{20, 40, 400, 4000, 40{,}000, 4{,}000{,}000\}\)

We suggest using spectral clustering on different estimates of the precision matrix to acquire the set of clusterings \({\mathscr {C}}^*\). A motivation for this heuristic is given in “Appendix C”.

First, for an appropriate \(\lambda \), we estimate the precision matrix using

$$\begin{aligned} X^{*} := \mathop {{{\,\mathrm{arg\,min}\,}}}_{X \succeq 0} - \log |X| + \text {trace} (X S ) + \lambda \sum _{i \ne j} |X_{ij}|^q . \end{aligned}$$
(13)

In our experiments, we take \(q = 1\), which is equivalent to the Graphical Lasso (Friedman et al. 2008) with an \(\ell _1\)-penalty on all entries of X except the diagonal. In the next step, we construct the Laplacian L as follows.

$$\begin{aligned} \begin{aligned}&L_{ii} = \sum _{k \ne i} |X_{ik}^*|^q , \\&L_{ij} = - |X_{ij}^*|^q \quad \text {for } i \ne j . \end{aligned} \end{aligned}$$
(14)

Finally, we use k-means clustering on the eigenvectors of the Laplacian L. The details of acquiring the set of clusterings \({\mathscr {C}}^*\) using the spectral clustering method are summarized below:

(Algorithm 2)
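A compact sketch of this procedure, using scikit-learn's graphical lasso and k-means, is given below; the label canonicalization and the handling of the penalty grid are our own implementation choices, not necessarily those of Algorithm 2:

```python
import numpy as np
from sklearn.covariance import graphical_lasso
from sklearn.cluster import KMeans

def candidate_clusterings(S, lambdas, k_range):
    """Build the restricted hypotheses space C* via spectral clustering.

    S: sample covariance of standardized variables, lambdas: grid of penalties
    (the set J), k_range: candidate numbers of clusters (e.g., range(2, 16))."""
    def canonical(labels):
        # relabel clusters by first appearance so identical partitions compare equal
        mapping = {}
        return tuple(mapping.setdefault(l, len(mapping)) for l in labels)

    candidates = set()
    for lam in lambdas:
        _, X_star = graphical_lasso(S, alpha=lam)        # Eq. (13) with q = 1
        W = np.abs(X_star)                               # affinities |X*_ij|
        np.fill_diagonal(W, 0.0)
        L = np.diag(W.sum(axis=1)) - W                   # Laplacian of Eq. (14)
        _, eigvecs = np.linalg.eigh(L)                   # eigenvectors, ascending eigenvalues
        for k in k_range:
            V = eigvecs[:, :k]                           # k eigenvectors with smallest eigenvalues
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(V)
            candidates.add(canonical(labels))
    return [np.array(c) for c in candidates]
```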

In Sect. 5.1 we confirm experimentally that, even in the presence of noise, \({\mathscr {C}}^*\) often contains the true clustering, or clusterings that are close to the true clustering.

4.3.1 Posterior distribution over number of clusters

In principle, the posterior distribution for the number of clusters can be calculated using

$$\begin{aligned} p(k | {\mathscr {X}}) \propto \sum _{{\mathcal {C}} \in {\mathscr {C}}_{k}} p({\mathscr {X}} | {\mathcal {C}}) , \end{aligned}$$

where \({\mathscr {C}}_{k}\) denotes the set of all clusterings with number of clusters being equal to k. Since this is computationally infeasible, we use the following approximation

$$\begin{aligned} p(k | {\mathscr {X}}) \propto \sum _{{\mathcal {C}} \in {\mathscr {C}}_{k}} p({\mathscr {X}} | {\mathcal {C}}) \approx \sum _{{\mathcal {C}} \in {\mathscr {C}}^*_{k}} p({\mathscr {X}} | {\mathcal {C}}) , \end{aligned}$$

where \({\mathscr {C}}^*_{k}\) is the set of all clusterings with k clusters that are in the restricted hypotheses space \({\mathscr {C}}^*\).
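Given the (approximate) log marginal likelihoods of all clusterings in \({\mathscr {C}}^*\), this posterior can be computed in a numerically stable way, e.g., as in the following sketch (function and variable names are ours):

```python
import numpy as np
from scipy.special import logsumexp

def posterior_number_of_clusters(clusterings, log_marglik):
    """Approximate p(k | X) over the restricted hypotheses space C*.

    clusterings: list of label arrays, log_marglik: their log marginal likelihoods
    (uniform prior over clusterings is assumed, as in Eq. (2))."""
    log_marglik = np.asarray(log_marglik, dtype=float)
    ks = np.array([len(set(c)) for c in clusterings])
    log_post = {int(k): logsumexp(log_marglik[ks == k]) for k in np.unique(ks)}
    total = logsumexp(list(log_post.values()))
    return {k: float(np.exp(v - total)) for k, v in log_post.items()}
```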

5 Simulation study

In this section, we evaluate the proposed method on simulated data for which the ground truth is available. In Sect. 5.1, we evaluate the quality of the restricted hypotheses space \({\mathscr {C}}^*\); in Sect. 5.2, we then evaluate the proposed method’s ability to select the best clustering in \({\mathscr {C}}^*\).

For the number of clusters, we consider the range from 2 to 15. For the set of regularization parameters of the spectral clustering method, we use \(J := \{0.0001, 0.0005, 0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.01\}\) (see Algorithm 2).

In all experiments, the number of variables is \(d =40\), and, unless stated otherwise, the ground truth is 4 clusters with 10 variables each.

For generating positive-definite covariance matrices, we consider the following two distributions: \(\text {InvW}(d + 1, I_{d})\), and \(\text {Uniform}_d\), with dimension d. We denote by \(U \sim \text {Uniform}_d\) the positive-definite matrix generated in the following way

$$\begin{aligned}&U = A + (0.001 - \lambda _{min} (A)) I_d , \end{aligned}$$

where \(\lambda _{min} (A)\) is the smallest eigenvalue of A, and A is drawn as follows:

$$\begin{aligned}&A_{i,j} = A_{j,i} \sim \text {Uniform}(-1, 1) \, , i \ne j \\&A_{i,i} = 0 . \end{aligned}$$

For generating \(\varSigma \), we either sample each block j from \(\text {InvW}(d_j + 1, I_{d_j})\) or from \(\text {Uniform}_{d_j}\).

For generating the noise matrix \(\varSigma _{\epsilon }\), we sample either from \(\text {InvW}(d + 1, I_{d})\) or from \(\text {Uniform}_{d}\). The final data are then sampled as follows:

$$\begin{aligned} x \sim N(0, (\varSigma ^{-1} + \eta \varSigma _{\epsilon }^{-1})^{-1}) , \end{aligned}$$

where \(\eta \) defines the noise level.
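The following sketch reproduces this data-generating process (block-diagonal \(\varSigma \), noise matrix \(\varSigma _{\epsilon }\), noise level \(\eta \)); the random seed and function names are arbitrary, and only the inverse Wishart option for the noise matrix is shown:

```python
import numpy as np
from scipy.stats import invwishart

rng = np.random.default_rng(0)

def uniform_pd(d):
    """Draw U ~ Uniform_d: symmetric off-diagonal entries in (-1, 1), shifted to be PD."""
    A = rng.uniform(-1.0, 1.0, size=(d, d))
    A = np.triu(A, 1)
    A = A + A.T                                           # symmetric with zero diagonal
    return A + (0.001 - np.linalg.eigvalsh(A).min()) * np.eye(d)

def simulate(n, block_sizes, eta, block_dist="invwishart"):
    """Sample n observations with block-diagonal Sigma and noise Sigma_eps at level eta."""
    d = sum(block_sizes)
    Sigma, start = np.zeros((d, d)), 0
    for d_j in block_sizes:
        if block_dist == "invwishart":
            block = invwishart.rvs(df=d_j + 1, scale=np.eye(d_j), random_state=rng)
        else:
            block = uniform_pd(d_j)
        Sigma[start:start + d_j, start:start + d_j] = block
        start += d_j
    Sigma_eps = invwishart.rvs(df=d + 1, scale=np.eye(d), random_state=rng)
    cov = np.linalg.inv(np.linalg.inv(Sigma) + eta * np.linalg.inv(Sigma_eps))
    return rng.multivariate_normal(np.zeros(d), cov, size=n)

# e.g., 4 balanced clusters of 10 variables, small noise level
X = simulate(n=400, block_sizes=[10, 10, 10, 10], eta=0.01)
```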

For evaluation we use the adjusted normalized mutual information (ANMI), where 0.0 means that any correspondence with the true labels is at chance level, and 1.0 means that a perfect one-to-one correspondence exists (Vinh et al. 2010). We repeated all experiments 5 times and report the average ANMI score.
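For reference, one readily available implementation of this score is scikit-learn's adjusted mutual information, which follows Vinh et al. (2010); the labelings below are placeholders:

```python
from sklearn.metrics import adjusted_mutual_info_score

# 0.0 ~ chance level, 1.0 = perfect one-to-one correspondence
labels_true = [0, 0, 1, 1, 2, 2]
labels_pred = [1, 1, 0, 0, 2, 2]
print(adjusted_mutual_info_score(labels_true, labels_pred))   # 1.0 (same partition, relabeled)
```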

5.1 Evaluation of the restricted hypotheses space

First, independent of any model selection criteria, we check here the quality of the clusterings that are found with the spectral clustering algorithm from Sect. 4.3. We also compare to single and average linkage clustering as used in (Tan et al. 2015).

The set of all clusterings that are found is denoted by \({\mathscr {C}}^*\) (the restricted hypotheses space).

In order to evaluate the quality of the restricted hypotheses space \({\mathscr {C}}^*\), we report the oracle performance calculated by \(\max _{{\mathcal {C}} \in {\mathscr {C}}^*} \text {ANMI}({\mathcal {C}}, {\mathcal {C}}_T)\), where \({\mathcal {C}}_T\) denotes the true clustering, and \(\text {ANMI}({\mathcal {C}}, {\mathcal {C}}_T)\) denotes the ANMI score when comparing clustering \({\mathcal {C}}\) with the true clustering. In particular, a score of 1.0 means that the true clustering is contained in \({\mathscr {C}}^*\).

The results of all experiments with noise level \(\eta \in \{0.0, 0.01, 0.1\}\) are shown in Table 1, for balanced clusters, and Table 2, for unbalanced clusters.

From these results, we see that the restricted hypotheses space found by spectral clustering contains around 100 clusterings, considerably fewer than the number of all possible clusterings. More importantly, we also see that the \({\mathscr {C}}^*\) acquired by spectral clustering either contains the true clustering or a clustering that is close to the truth. In contrast, the hypotheses space restricted by single and average linkage is smaller, but more often misses the true clustering.

5.2 Evaluation of clustering selection criteria

Here, we evaluate the performance of our proposed method for selecting the correct clustering in the restricted hypotheses space \({\mathscr {C}}^*\). We compare our proposed method (variational) with several baselines and two previously proposed methods (Tan et al. 2015; Palla et al. 2012). Except for the two previously proposed methods, all methods use the set \({\mathscr {C}}^*\) created with the spectral clustering algorithm from Sect. 4.3.

As cluster selection criteria, we compare our method to the extended Bayesian information criterion (EBIC) with \(\gamma \in \{0, 0.5, 1\}\) (Chen and Chen 2008; Foygel and Drton 2010), the Akaike information criterion (AIC) (Akaike 1973), and the Calinski–Harabasz Index (CHI) (Caliński and Harabasz 1974). Note that EBIC and AIC are calculated based on the basic Gaussian graphical model (i.e., the model in Eq. 1, but ignoring the prior specification).Footnote 3 Furthermore, we note that EBIC is model consistent and therefore, assuming that every entry of the true precision matrix is nonzero, will asymptotically choose the clustering that places all variables into a single cluster. As an advantage for EBIC, we exclude that clustering. Furthermore, we note that in contrast to EBIC and AIC, the Calinski–Harabasz Index is not a model-based cluster evaluation criterion. The Calinski–Harabasz Index is a heuristic that uses as clustering criterion the ratio of the variance within and across clusters. As such, it is expected to give reasonable clustering results if the noise is considerably smaller in magnitude than the within-cluster partial correlations.

Fig. 3

Posterior distribution of the number of clusters of the proposed method (top row) and the basic inverse Wishart prior model (bottom row). Ground truth is 4 clusters; there is no noise on the precision matrix

Fig. 4

Posterior distribution of the number of clusters of the proposed method (top row) and the basic inverse Wishart prior model (bottom row). Ground truth is 4 clusters; noise was added to the precision matrix

We remark that EBIC and AIC are not well defined if the sample covariance matrix is singular, in particular if \(n < d\) or \(n \approx d\). As an ad hoc remedy, which works well in practice,Footnote 4 we always add 0.001 times the identity matrix to the covariance matrix (see also Ledoit and Wolf (2004)).

Finally, we also compare the proposed method to two previous approaches for variable clustering: the Clustered Graphical Lasso (CGL) proposed in Tan et al. (2015), and the Dirichlet process variable clustering (DPVC) model proposed in Palla et al. (2012), for which an implementation is available. DPVC models the number of clusters using a Dirichlet process. For model selection, CGL uses the mean squared error of recovering randomly left-out elements of the covariance matrix; for clustering, it uses either single linkage clustering (SLC) or average linkage clustering (ALC). For conciseness, we show only the results for ALC, since they tended to be better than those for SLC.

A summary of the experiments, with noise level \(\eta \in \{0.0, 0.01, 0.1\}\), limited to the proposed method, EBIC, and Calinski–Harabasz Index, is shown in Figs. 1 and 2, for balanced and unbalanced clusters, respectively. Detailed results of all experiments are shown in Tables 3 and 4, for balanced clusters, and Tables 5 and 6, for unbalanced clusters. The tables also contain the performance of the proposed method for \(\beta \in \{0, 0.01, 0.02, 0.03\}\). Note that \(\beta = 0.0\) corresponds to the basic inverse Wishart prior model for which we can calculate the marginal likelihood analytically.

Comparing the proposed method with different \(\beta \), we see that \(\beta = 0.02\) offers good clustering performance in both the no noise and the noisy setting. In contrast, model selection with EBIC and AIC performs, as expected, well in the no noise scenario; however, in the noisy setting both tend to select incorrect clusterings. In particular, for large sample sizes EBIC tends to fail to identify correct clusterings.

The Calinski–Harabasz Index performs well in the noisy settings, whereas in the no noise setting its performance is unsatisfactory.

Table 7 Comparison of variational and MCMC estimate. Evaluation of clustering results for \(d =12\), \(n \in \{12, 120, 1200, 1{,}200{,}000\}\)
Table 8 Evaluation of selected clusterings of the mutual funds data

In Figs. 3 and 4, we show the posterior distribution of the number of clusters without and with noise on the precision matrix, respectively.Footnote 5 In both cases, given that the sample size n is large enough, the proposed method is able to correctly estimate the number of clusters. In contrast, the basic inverse Wishart prior model underestimates the number of clusters for large n when there is noise on the precision matrix.

5.3 Comparison of variational and MCMC estimate

Here, we compare our variational approximation with MCMC on a small scale simulated problem where it is computationally feasible to estimate the marginal likelihood with MCMC. We generated synthetic data as in the previous section, only with the difference that we set the number of variables d to 12.

The number of samples M for MCMC was set to 10,000, where we used 10% as burn-in. For two randomly picked clusterings for \(n = 12\), and \(n = 1{,}200{,}000\), we checked the acceptance rates and convergence using the multivariate extension of the Gelman–Rubin diagnostic (Brooks and Gelman 1998). The average acceptance rates were around \(80\%\), and the potential scale reduction factor was 1.01.

The runtime of MCMC was around 40 minutes for evaluating one clustering, whereas for the variational approximation the runtime was around 2 seconds.Footnote 6 The results are shown in Table 7, suggesting that the quality of the selected clusterings using the variational approximation is similar to MCMC.

6 Real data experiments

In this section, we investigate the properties of the proposed model selection criterion on three real data sets. In all cases, we use the spectral clustering algorithm from “Appendix C” to create cluster candidates. All variables were normalized to have mean 0 and variance 1. For all methods, except DPVC, the number of clusters is considered to be in \(\{2, 3, 4, \ldots , \min (d - 1,15) \}\). DPVC automatically selects the number of clusters by assuming a Dirichlet process prior. We evaluated the proposed method with \(\beta = 0.02\) using the variational approximation.

6.1 Mutual funds

Here, we use the mutual funds data, which has been previously analyzed in (Scott and Carvalho 2008; Marlin et al. 2009). The data contain 59 mutual funds (d =59) grouped into 4 clusters: US bond funds, US stock funds, balanced funds (containing US stocks and bonds), and international stock funds. The number of observations is 86.

The results of all methods are visualized in Table 8. It is difficult to interpret the results produced by EBIC (\(\gamma = 1.0\)), AIC, and the Calinski–Harabasz Index. In contrast, the proposed method and EBIC (\(\gamma = 0.0\)) produce results that are easier to interpret. In particular, our results suggest that there is a considerable correlation between the balanced funds and the US stock funds which was also observed in Marlin et al. (2009).

In Fig. 5, we show a two-dimensional representation of the data, that was found using Laplacian eigenmaps (Belkin and Niyogi 2003). The figure supports the claim that balanced funds and the US stock funds have similar behavior.

Fig. 5

Two-dimensional representation of the mutual funds data suggesting that balanced funds and US stock funds are difficult to separate (one cluster), whereas US bond funds and international stock funds appear to form mostly separate clusters

Fig. 6

Gene regulations of E. coli as given in (Hirose et al. 2017; Alberts et al. 2014), suggesting that the gene groups {lexA, uvrA, uvrB, uvrC, uvrD, recA} and {crp, lacl, lacZ, lacY, lacA} should be separated

6.2 Gene regulations

We also tested our method on the gene expression data that were analyzed in Hirose et al. (2017). The data consist of 11 genes with 445 gene expression measurements. The true gene regulations are known in this case and shown in Fig. 6, adapted from Hirose et al. (2017). The most important fact is that there are two independent groups of genes, and any clustering that mixes these two can be considered wrong.

We show the results of all methods in Fig. 7, where we mark each cluster with a different color superimposed on the true regulation structure. Here, only the clusterings selected by the proposed method, EBIC (\(\gamma = 1.0\)), and the Calinski–Harabasz Index correctly separate the two groups of genes.

6.3 Aviation sensors

As a third data set, we use the flight aviation data set from NASA.Footnote 7 The data set contains sensor information sampled from airplanes during operation. We extracted the information of 16 continuous-valued sensors that were recorded for different flights with in total 25,032,364 samples.

The clustering results are shown in Table 9. The data set does not have any ground truth, but the clustering result of our proposed method is reasonable: Cluster 9 groups sensors that measure or affect altitude,Footnote 8 Cluster 8 correctly clusters the left and right sensors for measuring the rotation around the axis pointing through the nose of the aircraft, and in Cluster 2 all sensors that measure the angle between chord and flight direction are grouped together. It also appears reasonable that the yellow hydraulic system of the left part of the plane has little direct interaction with the green hydraulic system of the right part (Cluster 1 and Cluster 4). And the sensor for the rudder, influencing the direction of the plane, is mostly independent of the other sensors (Cluster 5).

In contrast, the clustering selected by the basic inverse Wishart prior, EBIC, and AIC is difficult to interpret. We note that we did not compare to DPVC, since the large number of samples made the MCMC algorithm of DPVC infeasible.

Fig. 7

Clusterings of gene regulations network of E. coli. The clustering results are visualized by different colors. Here, the size of the restricted hypotheses space \(|{\mathscr {C}}^*|\) found by spectral clustering was 18. Only the proposed method, EBIC (\(\gamma = 1.0\)), and Calinski–Harabasz correctly divide the gene groups {lexA, uvrA, uvrB, uvrC, uvrD, recA} and {crp, lacl, lacZ, lacY, lacA}

7 Discussion and conclusions

We have introduced a new method for evaluating variable clusterings based on the marginal likelihood of a Bayesian model that takes into account noise on the precision matrix. Since the calculation of the marginal likelihood is analytically intractable, we proposed two approximations: a variational approximation and an approximation based on MCMC. Experimentally, we found that the variational approximation is considerably faster than MCMC and also leads to accurate model selection.

We compared our proposed method to several standard model selection criteria. In particular, we compared to BIC and the extended BIC (EBIC), which are often the methods of choice for model selection in Gaussian graphical models. However, we emphasize that EBIC was designed to handle the situation where d is on the order of n, and it was not designed to handle noise. As a consequence, our experiments showed that in practice its performance depends highly on the choice of the \(\gamma \) parameter. In contrast, the proposed method, with fixed hyper-parameters, shows better performance on various simulated and real data sets.

We also compared our method to two other previously proposed methods, namely the Cluster Graphical Lasso (CGL) (Tan et al. 2015) and Dirichlet Process Variable Clustering (DPVC) (Palla et al. 2012), which perform clustering and model selection jointly. However, it appears that in many situations the model selection algorithm of CGL is not able to detect the true model, even if there is no noise. On the other hand, the Dirichlet process assumption of DPVC appears to be very restrictive, again leading to many situations where the true model (clustering) is missed. Overall, our method performs better in terms of selecting the correct clustering on synthetic data with ground truth, and selects meaningful clusters on real data.

Table 9 Evaluation of selected clusterings of the Aviation Sensor Data with 16 variables

The python source code for variable clustering and model selection with the proposed method and all baselines is available at https://github.com/andrade-stats/robustBayesClustering.