Optimal Bayesian estimators for latent variable cluster models

In cluster analysis interest lies in probabilistically capturing partitions of individuals, items or observations into groups, such that those belonging to the same group share similar attributes or relational profiles. Bayesian posterior samples for the latent allocation variables can be effectively obtained in a wide range of clustering models, including finite mixtures, infinite mixtures, hidden Markov models and block models for networks. However, due to the categorical nature of the clustering variables and the lack of scalable algorithms, summary tools that can interpret such samples are not available. We adopt a Bayesian decision theoretical approach to define an optimality criterion for clusterings and propose a fast and context-independent greedy algorithm to find the best allocations. One important facet of our approach is that the optimal number of groups is automatically selected, thereby solving the clustering and the model-choice problems at the same time. We consider several loss functions to compare partitions and show that our approach can accommodate a wide range of cases. Finally, we illustrate our approach on both artificial and real datasets for three different clustering models: Gaussian mixtures, stochastic block models and latent block models for networks.


Introduction
Cluster analysis plays a central role in statistics and machine learning, yet it is not immediately clear how one can appropriately summarise the output of partitions from a Bayesian clustering model.This article seeks to address this impasse, proposing an optimality criterion for clusterings derived from decision theory, and a greedy algorithm to estimate the optimal partition and number of groups.Clustering models are often represented as discrete latent variable models: each of the data objects corresponds to the elements of V = {1, 2, . . ., N } and is characterised by a categorical latent variable z = {1, 2, . . ., K} denoting its group label.Such variables are often called clustering variables or allocations.Notable examples of latent variable clustering models include: product partition models (Hartigan 1990; Barry and Hartigan 1992), finite mixtures (McLachlan and Peel 2004), infinite mixtures (Quintana (2006) and references therein), latent block models for networks (Nowicki and Snijders 2001;Govaert 1995), hidden Markov models (MacDonald and Zucchini 1997).
The motivation for this paper ensues from the introduction within the statistical community of the so-called trans-dimensional samplers.One well known and widely used sampler is the reversible jump algorithm of Green (1995), extended to the context of finite mixtures by Richardson and Green (1997) and to hidden Markov models by Robert et al. (2000).Reversible jump Markov chain Monte Carlo allows one to explore a number of models with a single Markov chain that "jumps" between them, thereby estimating both the model parameters and the posterior model probabilities.A more recent transdimensional Markov Chain Monte Carlo algorithm is the allocation sampler introduced by Nobile and Fearnside (2007).This takes advantage of the fact that, in some finite mixture models, the marginal posterior distribution of the allocation variables can be obtained by analytically integrating out all of the model parameters.This allows one to use a collapsed Gibbs sampler and obtain a posterior marginal sample for the clustering variables.One advantage of this method is that the number of groups can be inferred at each step from the clustering variables automatically, hence obtaining posterior probabilities for the different models.The core idea of the allocation sampler has been recently extended to a number of frameworks, including latent class analysis (White et al. 2016), latent block models (Wyse and Friel 2012), stochastic block models (McDaid et al. 2013), latent position models (Friel et al. 2013), and change point analysis (Benson and Friel 2016).In Bayesian nonparametrics a similar approach has been proposed by (Neal 2000;Favaro and Teh 2013) for Dirichlet process mixture models.
Both reversible jump and allocation sampler return a trans-dimensional sample for the allocations.Theoretically, such a sample contains all of the posterior information needed for the clustering of the data, however, interpreting such information is a very challenging task.Since the allocations are categorical variables, usual summary statistics such as the mean, median and quantiles are not well defined.In addition, these Markov chain Monte Carlo algorithms are sensitive to label-switching issues (Stephens 2000), in fact, when using the latent variable representation, all mixture models are non-identifiable up to a permutation of the cluster labels.In addition, the sample itself may be computationally impractical to handle, since even basic operations may require a cost that grows with N 2 or the square of the size of the sample.
The problem described really boils down to a very simple research question: we want to summarise the information provided by a sample of partitions into an optimal partition.This issue has been addressed in several previous works, such as Strehl and Ghosh (2003), Gionis et al. (2007), Dahl (2009), and Fritsch and Ickstadt (2009), where the authors propose a number of approaches that define a theoretical optimal partition and introduce algorithms to find it.One critique to these contributions is that the proposed methodologies lack a sound theoretical background and they may be seen as ad-hoc.
In this work we use a Bayesian decision theoretic framework to define an optimality criterion for partitions, as previously proposed by Binder (1978), Lau and Green (2007), and Wade and Ghahramani (2015).From the Bayesian theoretical point of view, our approach defines the best possible solution to the partitioning problem using the information contained in the sample.Also, an important facet of this methodology is that it builds upon recent adaptations of the allocation sampler (Wyse and Friel 2012;McDaid et al. 2013;Friel et al. 2013;White et al. 2016), making up for one important shortcoming of these samplers: the interpretation of the results.
The essence of the decision theoretic framework lies in the definition of a loss function in the space of partitions, which is often a metric measuring how different two partitions are.Then, the optimal partition is estimated as the one minimising the average loss with respect to the sample given.In the Bayesian perspective, this is equivalent to adopting a Bayes estimator (or Bayes action), which is the decision minimising the Expected Posterior Loss (EPL).
We propose a greedy algorithm as means to find the optimal partition, focusing on its computational complexity and scalability.The algorithm can deal with a wide family of loss functions and requires only the sample of partitions as input.Hence our methodology has wide applicability and is the only scalable procedure that can be used to perform Bayesian clustering for a relatively arbitrary loss function.One important advantage of our algorithmic frameworks is that the resulting optimal clustering automatically determines the optimal number of groups.
Previous works (Lau and Green 2007;Wade and Ghahramani 2015) were confined to the case of Bayesian nonparametric models.Here we stress that this approach is automatically extended to a very general clustering context, and hence we propose applications to several different frameworks.
The plan of the paper is summarised as follows: Section 2 describes the theoretical foundations of Bayesian clustering; in Section 3 we describe the properties of several loss functions to compare partitions, and we characterise the wide breadth to which our method extends; in Section 4 we introduce our greedy algorithm and analyse its complexity and features, whereas Section 5 shows an interesting procedure that can be used to potentially save an amount of computational time.Finally, three applications to real datasets are proposed in Section 6: the galaxies' dataset for univariate Gaussian finite mixtures, the French political blogosphere for stochastic block models, and the congressional voting data for latent block models.Section 7 closes the paper with some final comments.

Bayesian clustering: the theory
Let Z be a T ×N matrix, where, for every t = 1, . . ., T and i = 1, . . ., N , z ti is a categorical variable (typically z ti ∈ {1, 2, . . ., N }) indicating the cluster label of observation i at iteration t.The rows of Z determine a sample of partitions of the same set V = {1, 2, . . ., N }, and we assume that such sample is drawn from the posterior distribution of a clustering model, given the observed data Y.An alternative representation of the sample would be z (1) , . . ., z (T ) , where z (t) = {z t1 , . . ., z tN } ∈ Z corresponds to the t-th row of Z, and Z is the space of all partitions of V.
Interest lies in conveying the information provided by the posterior sample into a single optimal partition.Bayesian decision theory offers an elegant approach to tackle this task, essentially recasting the clustering problem into one of decision making.
The first step consists of choosing a loss function L : Z × Z → R. For any two partitions (hereafter also called decisions) a and z, the quantity L (a, z) indicates the loss occurring when the decision a is chosen while z is the correct partition.The choice of the loss function adopted is completely arbitrary and supposedly situational, nonetheless some loss functions have interesting features and tend to work well in many contexts.A loss function is not necessarily a distance in the space of partitions although this is often regarded as a desirable property, since it helps particularly in the interpretation and representation of the results.
An optimal decision (also called Bayes action) is one minimising the expected posterior loss, defined as: Considering that the posterior sample z (1) , . . ., z (T ) ∼ π ( • |Y) is available, for every decision a ∈ Z, an unbiased estimator of the associated expected posterior loss results as: We aim then at finding the decision â minimising the approximate expected posterior loss: â = arg min a∈Z ψ (a) . (3) 3 Choice of the loss function

Common loss functions
Given the sample Z, a naive but fast method to obtain an optimal clustering would be to consider the single partition that obtained the highest posterior value during the sampling, i.e.: In a decision theoretic context, this is equivalent to choosing a 0−1 loss defined as: since the Bayes action minimising (3) would simply be the mode of the sample.The sign "≡" here means that there exists a label permutation σ such that σ Reading the definition in (5), the loss is zero iff the partitions are equivalent.In all of the other cases, the loss is 1 regardless of how different the partitions actually are.This peculiar behaviour makes the 0−1 loss rather unappealing as means to compare partitions.
Note that all of the clustering algorithms that return a MAP estimate can be interpreted in this context as tools minimising the expected 0−1 loss, although they normally do not require the sample Z, and are computationally cheap.Hence MAP estimates may be criticised since in the Bayesian paradigm the corresponding loss is not particularly sensible.
Another loss function that is commonly used is the quadratic loss, which gives the posterior mean as Bayes action.However, in a clustering context this has little meaning due to the categorical nature of the variables, which makes any sort of averaging of allocations not particularly meaningful.

Loss functions to compare partitions
A more sensible approach would be to choose a loss function that is specifically designed to compare partitions.In recent years, many measures to compare partitions have been proposed, each with very different properties and characteristics.The works of Meilă (2007), Vinh et al. (2010), and Wade and Ghahramani (2015) and references therein offer an excellent overview.
A common approach used to compare partitions (here a and z denote two arbitrary partitions with K a and K z groups, respectively) relies on the K a ×K z contingency matrix (or confusion matrix), whose entries are defined as: where g varies among the groups of a, and h among those of z.The entries of such a matrix simply count the number of items that a classifies in group g and z classifies in group h, for every g and h.
Here, we focus on loss functions that depend on a and z only through the entries of n a,z .This is a fairly general and reasonable assumption which is in line with the theory developed by Binder (1978); in fact, most metrics can be transformed into functions of the counts (see Vinh et al. (2009) and references therein).
We assume that the loss function has the following representation: where f 0 , f 1 , f 2 , f 3 are real valued functions that can be evaluated in constant time and n a g and n z h indicate the sizes of group g and h, respectively, i.e.: for every g = 1, . . ., K a and h = 1, . . ., K z .The assumption determined by ( 7) is actually not restrictive: most of the commonly used loss functions for partitions satisfy this condition.We note that the arguments of the function f 0 include the following quantities as special cases: • The entropies of a and z, describing the uncertainty associated to a and z, respectively: • The joint entropy of a and z: This describes instead the uncertainty of the random variable with pdf given by the quantities n a,z gh /N , for every g and h.
• The mutual information, which can be evaluated from the entropies and joint entropy: This quantity is particularly meaningful and has been advocated in a normalised version by Strehl and Ghosh (2003) as a distance measure between partitions.
Note the common convention that x log 2 x = 0 if x = 0. Evidently these informationbased quantities can be obtained as special cases of the functions f 1 , f 2 and f 3 , making our assumption rather general and broadly satisfied.
Here follows a brief description of some well-known loss functions that can be considered with our approach.
Binder's loss (B).We use a special case of a more general formula first introduced by Binder (1978): This loss is equivalent to the Hamming distance (Meilă 2012) and to the Rand index (Rand 1971).Binder's loss has an interesting property that simplifies greatly the minimisation of (3).One can in fact easily construct a so-called posterior similarity matrix of size N × N , whose entries b ij denote the estimated posterior probability of i and j being allocated to the same group, for every i and j in V.Then, the Binder Bayes action satisfies: where 1 A is equal to 1 if the event A is true or zero otherwise.This simplifies the minimisation problem since (13) depends on the sample only through the posterior similarity matrix, which can be effectively computed beforehand.

The variation of information (VI)
. This loss is one we particularly focus on in this paper, and is defined as: The VI loss, first studied in Meilă (2007), has received an increasing amount of attention in the last decade, mainly due to its strong mathematical foundations and practical efficiency.In the paper by Meilă (2007) as well as in subsequent works such as Wade and Ghahramani (2015), the mathematical properties and behaviour of the VI loss have been deeply studied.We mention that this loss is a metric, that it forms a lattice and that it is horizontally and vertically aligned in the space of partitions.In addition, it is invariant to label-switching, i.e. switching labels for either a or z will not affect the value L V I (a, z).More details regarding the theoretical properties of the VI loss can be found in Meilă (2007).
The normalised variation of information (NVI).This loss is defined as: The normalised version of the VI loss takes values in [0, 1].This scale-invariance may facilitate the interpretation and the comparisons of partitions under different conditions.Since we adopt an optimisation approach, this feature is not crucial in our framework due to the partitions always referring to the same set of individuals.
The normalised information distance (NID).This loss is defined as: The NID loss has been advocated in Vinh et al. (2010) as a general purpose -context independent -loss function with desirable behaviours.

Minimisation of the expected posterior loss
An exhaustive search within Z becomes impractical even for very small N (the cardinality of Z is a number with more than 100 digits if N = 100).Therefore, the minimisation can be seen as a binary programming optimisation problem which is known to be NP-hard, and hence not solvable through exact methods.Also, the objective function requires the calculation of the sum in (2) at each evaluation.Getting a new posterior sample at each step is not a practical option, hence the same sample is used for all of the evaluations of (2).Nonetheless, even a single evaluation of the objective function can become computationally burdensome when the size of the sample is large.Therefore, the decision theoretic approach becomes soon impractical as N and T increase, and finding scalable procedures is crucial.In this section we introduce a new algorithm that, using greedy updates, is able to estimate the Bayes action for the wide family of loss functions satisfying (7), requiring in input only the posterior sample of partitions.

Greedy algorithm
Heuristic greedy algorithms have been recently rediscovered as a means to maximise the so-called exact Integrated Complete Likelihood in various contexts: stochastic block models (Côme and Latouche 2015), latent block models (Wyse et al. 2014), Gaussian finite mixtures (Bertoletti et al. 2015).Similar approaches have also been proposed in Bayesian nonparametrics for Dirichlet prior mixtures (Raykov et al. 2014) although in this case they did not cast the clustering problem into the optimisation of an exact model-based clustering criterion.Among the many papers adopting types of greedy optimisation, we find the approaches of Besag (1986), Strehl and Ghosh (2003), and Newman (2004) particularly related to ours.
We propose a greedy algorithm that updates a partition by changing the cluster memberships of single observations using a greedy heuristic, hence decreasing the expected posterior loss of the partition at each step.As input, the algorithm only requires a starting partition, the posterior sample Z and a user-specified parameter K up , equal to the maximum number of groups allowed (a reasonable default value would be K up = N ).The algorithm cycles over the observations in random order, and, for each of these, it tries all of the possible reallocations, eventually choosing the one giving the best decrease in the objective function.The notation a i:r→s denotes the partition a where the observation i has been reallocated from group r to s.At each move, the number of groups may increase (if the observation is reallocated to an empty group) or decrease (if a group is left empty), although the latter scenario is much more frequent.Due to the low probability of creating new groups, it is generally advisable to start with a partition made of close to K up groups.The procedure stops when a complete sweep over all observations yields no change in the expected posterior loss.The pseudo-code for the algorithm is shown in Algorithm 1.

Algorithm 1 Greedy algorithm
1: Let a be the starting partition.

7:
while V is not empty do 8: Pick i at random from V and delete it from V.
Due to the greedy nature of this procedure, the algorithm is bound to return a local optimum, rather than a global one.Consequently, several restarts with different initial partitions may be required.However, convergence is usually reached in very few iterations, in each run.Regarding the starting partition, this may either be chosen at random or it may be set to be the clustering yielding the highest posterior value as in (4).A possible alternative lies in between the two cases, i.e. the MAP partition may be changed to some extent by reallocating some observations at random.
One interesting feature of the greedy algorithm is that the whole space of partitions is explored, hence the optimal partitions may differ substantially from all of the clusterings in the sample.In fact, many non-optimal solutions may have higher posterior values than the optimal one.In contrast to Côme and Latouche (2015) and Wyse et al. (2014), we do not perform any final merge step, as in most cases this did not improve the results.

Complexity
The basic operation that determines the complexity of the greedy optimisation is the evaluation of the variation in the objective function when a possible reallocation is tested (line 9 in the pseudo-code 1).Assume that the move from a to a i:r→s is being tested, for some groups r and s.The following quantity needs to be evaluated: which in turn requires, ∀t = 1, . . ., T : For a certain t, the move only affects two entries of n a (i.e.n a r and n a s ) and two entries of n a,z (t) (i.e.n a,z (t) rv and n a,z (t)

sv
, where v = z ti ).This means that the change in the arguments of f 0 can be evaluated in a constant time, hence making the cost of evaluating ∆ψ ∼ O (T ).
Since the algorithm tries all possible moves for each observation, the overall computational cost is O (T N K up ).

Comparisons with other algorithms
Both Lau and Green (2007) and Wade and Ghahramani (2015) propose original algorithmic frameworks to minimise an expected posterior loss.While Lau and Green (2007) only focus on Binder's loss, Wade and Ghahramani (2015) also extend the procedure to the VI loss, albeit resorting to an approximation of the objective function.Both methodologies take advantage of the posterior similarity matrix representation, briefly pointed out in (13).Note that this representation is exclusive to the Binder's loss, hence these approaches lack the possibility to be generalised to other loss functions, unless approximations are introduced.
The computational cost for an evaluation of the objective function ( 13) does not depend on T , since the posterior information contained in the sample is summarised in the posterior similarity matrix.The calculation of the posterior similarity matrix itself requires O (T N 2 ) operations, yet this can be performed offline and it is unlikely to impact the overall computing time.
On the other hand, our algorithm does not require a N 2 cost at any stage, hence it should be preferrable when the number of observations to classify is very large.We note that, due to the dependence of the complexity on T , our algorithm will benefit if the sample is small and thinned with a large lag.A trade-off between the reliability of the posterior sample and computing time should be assessed, in that one should provide a sample that is as small as possible but not so small that the approximation to the posterior distribution is not reliable.As concerns the dependency on K up , ideally one should choose K up = N , but this evidently would make the procedure impractical in large N scenarios.
More generally, the computational cost of the algorithm may be compared to the complexity of the sampler used to get the posterior sample.In fact, one key advantage of the collapsed Gibbs samplers proposed in Nobile and Fearnside (2007), McDaid et al. (2013), and Wyse and Friel (2012) is their computational efficiency.The posterior sample returned by these samplers is necessary to perform the minimisation of the expected posterior loss.Hence, an ideal complexity for the optimisation problem should be not higher than that required by the sampler in the first place.Unfortunately, when analysing these samplers, new quantities (the number of dimensions for Gaussian finite mixtures, or the number of edges in block models) come into play, making a strict comparison of the complexity not possible.However, in our applications we noticed that for stochastic block models and latent block models the computational bottleneck was set by the samplers, and not by the greedy algorithm.

Classes of equivalences in the posterior sample
Since the sample space Z is discrete, the posterior sample Z may contain repetitions, due to the sampler returning to the same partition during the sampling procedure.This suggests that, regardless of the partition a, a number of the calculations required to obtain L (a, z) is redundant.In fact, given a partition z, the following holds: for all i = 1, . . ., N and g = 1, . . ., K up and ∀t : z (t) ≡ z.
It follows that the posterior sample can be summarised into the sample of its unique rows Z = z(1) , . . ., z( T ) and a vector of counts ω = ω (1) , . . ., ω ( T ) describing how many times the corresponding partition appears in the original sample Z. Therefore the approximate expected posterior loss can be equivalently written as: A similar reasoning can be used to make the calculation of ψ (a i→g ) more efficient.
The main difficulty in applying the technique just described lies in identifying the new representation efficiently.One problem consists in the implementation of the operator "≡" since partitions should be compared up to a permutation of the labels.To solve this, we use a procedure described in Strehl and Ghosh (2003) that defines a unique labelling for all partitions: the first item is assigned to cluster 1, and then iteratively the next item is assigned either to an existing cluster or to the next empty cluster.Using this re-labelling, any two equivalent partitions will be transformed into the same sequence of digits in a computational time O (T N ).
Furthermore, the same vector can be seen as a number in base−K up representation which uniquely identifies the corresponding partition and the equivalence class imposed by "≡".Hence a sorting algorithm can be used to reorder the sample according to such identifiers, for a computational cost of O (N T log T ), where N is the cost of a single comparison of partitions.Once the partitions are sorted, the unique set and the corresponding weights can be obtained in O (T N ).
The advantage provided by this representation heavily depends on the dataset and on the corresponding marginal posterior distribution: less repetitions will appear if the posterior is flat and the partitioning very uncertain.On the other hand, the computational savings may be substantial in cases where only few partitions have a high posterior value.
Note that the sorting procedure creates a new computational bottleneck in the case where log T > K up .However, we found this is not relevant in practical terms and negligible when compared to the computational time demanded by the actual optimisation.
In a machine learning context, the weighted sample Z may be interpreted as a cluster ensemble problem, whereby each partition corresponds to the output of a clustering algorithm and the counts are weights describing the relative (possibly subjective) importance of the solution.Our methodology may be applied in this scenario without further modifications, providing a sound background to the decision making process.

Real data examples
In this Section, we provide three applications of our methodology to different clustering contexts, and compare the results obtained with previous analyses.To avoid confusion, we show the results only for the VI loss, and note that the other losses lead to similar partitions.

The data
The dataset considered is composed of the velocities of 82 distant galaxies diverging from the Milky Way.Interest lies in understanding whether velocity can be used to discriminate clusters of galaxies.The dataset has been first analysed from a statistical point of view in Roeder (1990), and has been re-proposed in numerous papers dealing with mixture models, including Richardson and Green (1997), Stephens (2000), and Wade and Ghahramani (2015).

The model
The observed data is denoted by Y = {y 1 , . . ., y N }, where N = 82 and y i ∈ R for every i = 1, . . ., N .As in Bertoletti et al. (2015), a Gaussian finite mixture model is adopted: where λ 1 , . . ., λ g are the mixture weights and N • ; µ, 1 r denotes the univariate Gaussian distribution with mean µ and variance 1/r.The number K of Gaussian components in the mixture is unknown and hence to be inferred.
Following a latent variable framework, an allocation variable z i is associated to each observation, denoting which Gaussian component has generated the corresponding y i : This allows a more tractable expression for the likelihood, conditionally on the allocations: We specify a Bayesian hierarchical structure on both the likelihood parameters and the allocation variables.For every group g, the parameters r g and µ g are independent realisations of a Gamma(γ, δ) and a Gaussian 0, [τ r g ] −1 , respectively.As concerns the allocations, these are distributed as independent Multinomials with parameters θ = (θ 1 , . . ., θ K ), where θ is a Dirichlet distributed random vector.The hyperparameters are set as in Bertoletti et al. (2015): τ = 0.01, γ = 0.5, δ = 0.5, and Dirichlet hyperparameter α = 4.
Since conjugate priors are used, most of the model parameters can be integrated out analytically.Hence the following marginal distributions can be obtained in exact form for the data and allocations: More details on the integrations can be found in Nobile and Fearnside (2007) and Bertoletti et al. (2015).A consequence of these results is that the marginal posterior for the allocations can be obtained analytically, too: Such marginal posterior distribution can be used as target in a Markov chain Monte Carlo sampler, thereby obtaining the posterior sample Z.Note that, since all of the model parameters have been integrated out, trans-dimensional moves can be easily implemented, so that the chain effectively explores all of the possible models.In Appendix A.1, a general algorithm to sample from this distribution is described.The same sampler is used to get the posterior sample Z for the galaxies' dataset.

Results
We obtained a sample for the allocations using the collapsed sampler described in Appendix A.1.One million observations were first discarded as burn-in, then one observation every hundredth was retained until a sample size of 10,000 was obtained.The chain appeared to mix well suggesting convergence to the target distribution.The sample was post-processed using the method described in Section 5.Then, several runs of the greedy algorithm were performed, using both the noisy MAP and completely random starting partitions.The left panel of Figure 1 shows a histogram of the observed data with the overall best clustering found.The number of groups for the VI Bayes action is 3, which

Galaxy dataset VI loss optimal partition
Velocity Density 10000 15000 20000 25000 30000 35000 0.00000 0.00005 0.00010 0.00015 0.00020 q q q q qq q q q q q q q q q q 2 4 6 8 is in line with the results of Wade and Ghahramani (2015), but notably different from the results of Richardson and Green (1997).The computational time needed to get the sample was about 45 seconds, whereas an average of 5 seconds was required for each run of the greedy algorithm, with K up fixed to 50 for both algorithms.
6.2 Stochastic block models: French political blogosphere

The data
The data first appeared in Zanghi et al. (2008) and consist of a undirected graph where nodes represent political blogs' websites and edges represent hyperlinks between them.As in Latouche et al. (2011) we focus only on a subset of the original dataset, available in the R package mixer.The data consist of a single day snapshot of political blogs automatically extracted on the 14th of October 2006 and manually classified by the "Observatoire Présidentielle project".The graph is composed of 196 nodes and 1432 edges, and the main political parties are the UMP (French "republican"), UDF ("moderate" party), liberal party (supporters of economic liberalism) and PS (French "democrat"), although 11 different parties appear in total.The observed data is modelled by the adjacency matrix Y whose entries are defined as follows: 1 if an undirected edge between blogs i and j appear; 0 otherwise; for every 1 ≤ i < j ≤ N .

Stochastic block models
Stochastic block models (Nowicki and Snijders 2001) are finite mixture models for networks, whereby the clustering problem is formulated on the nodes of the network and the connection profile of each node is selected by its cluster membership.For every i, the allocation variable z i denotes the group to which node i belongs, and, as in the Gaussian finite mixture context, a Multinomial-Dirichlet structure is assumed on these variables.
The number of underlying groups K is unknown and hence to be inferred.Conditionally on the allocations, the likelihood for the graph Y = {y ij : 1 ≤ i < j ≤ N } factorises as: Here, Π is a symmetric K × K matrix of connection probabilities, where the generic element π gh indicates the probability that an edge occurs between a node in group g and a node in group h, for any g and h in {1, . . ., K}.Furthermore, each π gh is assumed to be a realisation of an independent Beta random variable.The hyperparameters for the Beta and Dirichlet distributions are all set to 0.5.Since conjugate priors are used, all of the model parameters can be integrated out analytically.It follows that, as in the Gaussian finite mixture context, the quantity p (z|Y) is available anaylitically and can be targeted by the sampler described in Appendix A.1.Further details on the integration can be found in McDaid et al. (2013) and Côme and Latouche (2015).

Results
First, we performed block modelling using the variational algorithm implemented in the package mixer and obtained a partitioning to be used as reference.The optimal variational solution has 12 groups, which roughly correspond to the political affiliations, as shown in Table 1.
Table 1: French blogs: confusion matrix for the variational partition and the political affiliations.
The VI-optimal partition exhibits 18 groups, and is represented in the right panel of Figure 2. Figure 3 shows instead the reordered adjacency matrices for the three different partitions.The posterior distribution for the number of groups is shown in Figure 4.As in the galaxies' dataset, the optimal number of group contrasts with the modal value of the posterior distribution.
It appears that the VI-optimal clustering is a finer partition that splits up some of the larger groups into subgroups.Nonetheless from Figure 3 it is clear that this entails a better discrimination of the profiles of blogs.A confusion matrix matching the solution to the political affiliations is shown in Table 2.The liberals are well discriminated in both the variational and VI-optimal partitions.The two partitions also agree on the blogs affiliated to the UDF party: 24 of them are well-discriminated and isolated from the rest, a subset of 6 blogs are classified into their own group, 1 blog is associated to the UMP party and 1 is not well-recognised.The main differences between the two partitions arise with respect to the other two relevant parties: UMP and PS.In these two cases it appears that the relational profiles of the blogs are not particularly determined by the political affiliation, since both partitions recognise a number of subgroups within each party, signaling heterogeneity.UMP is decomposed in 5 subgroups in both partitions, while PS is decomposed in 6 and 7 subgroups for the variational and VI partition, respectively.

Latent block model: Congressional voting data
We propose an application of our methodology to the UCI Congressional voting data, previously analysed in Wyse and Friel (2012) and Wyse et al. (2014).

Adjacency matrix for political affiliations
Adjacency matrix for VI Bayes action Adjacency matrix for variational action Table 2: French blogs.Confusion matrix for the VI-optimal partition and the political affiliations.

The data
The data record whether 435 members of the 98 th congress voted "yay" or "nay" on 16 key issues.Abstained and absent were treated as "nays".Also, information on the political affiliation of each member is available: 267 individuals are "democrats" and 168 "republicans".Following Wyse and Friel (2012) and Wyse et al. (2014), the data are rearranged into a bipartite network, whereby two types of nodes are defined (one corresponding to congress members and one to issues) and only undirected edges between nodes of different types are allowed.Similarly to stochastic block models, an adjacency matrix Y is used to summarise the data, with edges corresponding to "yays" (y ij = 1) and non-edges corresponding to "nays" (y ij = 0).Note that in this case the matrix Y has size 435 × 16, whereby rows correspond to congressmen and columns to issues.

Bipartite latent block model
A latent block model (see, for instance, Wyse et al. (2014)) is used to model the bipartite graph.A clustering problem is formulated on both the rows and columns of the adjacency matrix: two partitions r and c determine the clustering of congress members and issues, respectively.The number of groups of r and c are denoted by K r and K c , respectively, and are unknown.These two partitions independently follow the same Multinomial-Dirichlet structure as described in previous applications.
As concerns the likelihood of the model, a K r × K c matrix Π is introduced, so that its generic element π gh ∈ [0, 1] corresponds to the probability of the occurance of an edge from a node in group g to a node in group h.Hence, conditionally on the allocations, the likelihood can be factorised into independent blocks: Bipartite latent block models may also be recast as finite mixture models, where the mixture is with respect to the partitions: The connection probabilities π gh are realisations of independent Beta random variables for every g = 1, . . ., K r and h = 1, . . ., K c , and all of the hyperparameters are fixed to 0.5.
Since conjugate priors are used, all of the model parameters can be integrated out analytically, thereby obtaining the marginal posterior p (r, c|Y) in exact form.Further details on the integration can be found in Wyse and Friel (2012) and Wyse et al. (2014).

Results
The algorithm of Wyse and Friel (2012) was used to obtain a sample for the allocations of both congress members and issues.Similarly to previous analyses, 1 million observations were discarded and 10,000 were used as final sample using a thinning of 100.The partitioning of the data corresponding to the highest posterior value was saved as a reference.We found that posing a clustering problem on the issues was not particularly interesting in that very few issues were aggregated in the same cluster, hence we will here show only the cluster analysis on the congress members.The sample of partitions for the members was processed through the procedure of Section 5, and then several runs of the greedy optimisation were performed.The computational time needed to get the sample was about 30 hours, whereas an average of 70 seconds was required for each run of the greedy algorithm, with K up fixed to 30. Figure 5 shows the posterior sample for the number of groups.The reordered adjacency matrices for the MAP and the VI-optimal partition q q q q q q q q 4 5 6 are shown in Figure 6.From the confusion table shown in Table 3 it appears that the two main political factions are split into 3 subgroups each, with a total of 29 individuals against the tide.

Conclusions
We have proposed a Bayesian approach to summarise a sample of partitions from an arbitrary clustering context.We have described a greedy algorithm capable of finding the optimal partition in a wide range of clustering frameworks.The algorithm can handle many well-known loss functions.In our analyses, we focused on the variation of information loss, which has proven to be particularly effective in the optimisation context.One appealing advantage of our methodology is that it can scale well with the number of items to be classified, hence being a useful general tool to use in an arbitrary clustering context.In fact, since previous methods focused only on particular choices of the loss function, our methodology is the only scalable method that can encompass most comparison measures within a unified framework.Also, label-switching issues do not affect our method.

MAP Bayes
The greedy algorithm usually converges with very few iterations, however several restarts are useful to avoid convergence to local optima.We noticed that, when compared to other similar greedy routines (Côme and Latouche 2015;Wyse et al. 2014;Bertoletti et al. 2015), the algorithm is more likely to converge to the global optimum, even though no final hierarchical merge step is used.This may be a consequence of the fact that the objective function is generally smoother and easier to optimise.
The wide applicability of our algorithm comes at a cost: each step of the optimisation process involves a computational cost depending on the size of the sample T , which can easily make the problem intractable if a large sample is used.However, in most cases this impasse can be downsized simply by "thinning" the sample.As concerns storage costs, the the main bottleneck is set by the T contingency tables of size K 2 that are used throughout the optimisation.
To emphasise the context-independence of our approach, we have proposed applications to real datasets for three different clustering frameworks.In the Gaussian finite mixture case (galaxies' dataset), the results look interesting and in line with the work of Wade and Ghahramani (2015).The results on the French political blogosphere appear to be very different from those obtained through previous analyses.On one hand an overestimation of the number of groups may be argued, on the other the groups obtained with our approach are evidently more homogeneous.A clustering problem on the members of the congressional voting data has also been proposed: here the two main political factions are well recognised and the results seem to agree with the previous analyses of Wyse and Friel (2012) and Wyse et al. (2014).
For each of the dataset analysed in this paper we have obtained the marginal sample for the allocation variables using a collapsed Gibbs sampler, which is a tool able to explore a number of models at the same time.This type of approach aims at improving the mixing of the Markov chain while keeping a low computational cost, and it generally works well in many clustering frameworks.However, due to the discrete nature of the sampled variables, rarely the sampler achieves good acceptance rates, and in some cases this causes a very slow mixing of the chain.This in turn biases the results obtained through the loss function approach, since our method heavily relies on the good quality of the sample of partitions.Unfortunately, at the moment there are no good solutions to address this impasse, suggesting that future research should focus on introducing new ways to explore the space of partitions Z in a clever way, hence making MCMC approaches more efficient.

Figure 1 :
Figure1: Galaxy dataset.On the left panel, the V I loss best partition found is shown.The right panel shows the posterior probabilities for the number of groups.The distribution has a peak at K = 4, which contrasts with the number of clusters in the optimal partition, equal to 3.

Figure 2 :
Figure 2: French blogs.Representation of the French political blogs network, with colours and node labels denoting cluster memberships.

Figure 3 :Figure 4 :
Figure 3: French blogs.Reordered adjacency matrices for three different partitioning of the French political blogs dataset: available political affiliations (left panel), VI-optimal allocations (central panel) and variational optimal allocations (right panel).

Figure 5 :
Figure5: Congressional voting data.Posterior distribution for the number of groups of congress members.The VI-optimal value K = 6 corresponds to the modal value, and the distribution is noticeably right-skewed.

Figure 6 :
Figure 6: Congressional voting data.Reordered adjacency matrices for the MAP and the VI-optimal partitions.The partitions on the columns (issues) are equivalent, whereas the rows are clustered in different ways, although the number of clusters is equal.

Table 3 :
Congressional voting data.Confusion matrix comparing the political affiliation with the VI-optimal partition.