Probabilistic consensus clustering using evidence accumulation
Abstract
Clustering ensemble methods produce a consensus partition of a set of data points by combining the results of a collection of base clustering algorithms. In the evidence accumulation clustering (EAC) paradigm, the clustering ensemble is transformed into a pairwise coassociation matrix, thus avoiding the label correspondence problem, which is intrinsic to other clustering ensemble schemes. In this paper, we propose a consensus clustering approach based on the EAC paradigm, which is not limited to crisp partitions and fully exploits the nature of the coassociation matrix. Our solution determines probabilistic assignments of data points to clusters by minimizing a Bregman divergence between the observed coassociation frequencies and the corresponding cooccurrence probabilities expressed as functions of the unknown assignments. We additionally propose an optimization algorithm to find a solution under any double-convex Bregman divergence. Experiments on both synthetic and real benchmark data show the effectiveness of the proposed approach.
Keywords
Consensus clustering · Evidence accumulation · Ensemble clustering · Bregman divergence
1 Introduction
Clustering ensemble methods look for consensus solutions from a set of base clustering algorithms, thus trying to combine into a single partition the information present in many different ones. Several authors have shown that these methods tend to reveal more robust and stable cluster structures than the individual clusterings in the ensemble (Fred 2001; Fred and Jain 2002; Strehl and Ghosh 2003). Leveraging an ensemble of clusterings is considerably more difficult than combining an ensemble of classifiers, due to the label correspondence problem: how to put in correspondence the cluster labels produced by different clustering algorithms? This problem is made more serious if clusterings with different numbers of clusters are allowed in the ensemble.
A possible solution to sidestep the cluster label correspondence problem has been proposed in the Evidence Accumulation Clustering (EAC) framework (Fred and Jain 2005). The core idea is based on the assumption that similar data points are very likely grouped together by some clustering algorithm and, conversely, data points that cooccur very often in the same cluster should be regarded as being very similar. Hence, it is reasonable to summarize a clustering ensemble in terms of a pairwise similarity matrix, called coassociation matrix, where each entry counts the number of clusterings in the ensemble in which a given pair of data points is placed in the same cluster. This new mapping can then be used as input for any similarity-based clustering algorithm. In Fred and Jain (2005), agglomerative hierarchical algorithms are used to extract the consensus partition (e.g., Single Link, Average Link, or Ward’s Link). In Fred and Jain (2006), an extension is proposed, entitled Multi-Criteria Evidence Accumulation Clustering (Multi-EAC), which filters the cluster combination process using a cluster stability criterion. Instead of using the information of the different partitions, it is assumed that, since algorithms can have different levels of performance in different regions of the space, only certain clusters should be considered.
The way the coassociation matrix is exploited in the literature is very naïve. Indeed, standard approaches based on EAC simply run a generic pairwise clustering algorithm with the coassociation matrix as input. The underlying clustering criteria of ad hoc algorithms, however, do not take advantage of the statistical interpretation of the computed similarities, which is an intrinsic part of the EAC framework. Also, the direct application of a clustering algorithm to the coassociation matrix typically induces a hard partition of the data. Although having crisp partitions as a baseline for the accumulation of evidence of data organization is reasonable, this assumption is too restrictive in the phase of producing a consensus clustering. Indeed, the consensus partition is a solution that tries to accommodate the different clusterings in the ensemble, and by allowing soft assignments of data points to clusters we can preserve some information about their intrinsic variability and capture the level of uncertainty of the overall label assignments, which would not be detected in the case of hard consensus partitions. The variability in the clustering solutions of the ensemble might depend not only on the different algorithms and parametrizations adopted to build the ensemble, but also on the presence of clusters that naturally overlap in the data. This is the case for many important applications such as clustering microarray gene expression data, text categorization, perceptual grouping, labelling of visual scenes and medical diagnosis. In these cases, having a consensus solution in terms of a soft partition also makes it possible to detect overlapping clusters characterizing the data. It is worth mentioning that the importance of dealing with overlapping clusters has been recognized long ago (Jardine and Sibson 1968) and, in the machine learning community, there has been a renewed interest around this problem (Banerjee et al. 2005a; Heller and Ghahramani 2007).
As an alternative, the consensus extraction could be obtained by running fuzzy k-medoids (Mei and Chen 2010) on the coassociation matrix as if it were a standard similarity matrix, or fuzzy k-means (Bezdek 1981) by interpreting each row of the coassociation matrix as a feature vector. However, such solutions would not take into account the underlying probabilistic meaning of the coassociation matrix and lack any formal statistical support.
In this paper, we propose a consensus clustering approach which is based on the EAC paradigm. Our solution fully exploits the nature of the coassociation matrix and does not lead to crisp partitions, as opposed to the standard approaches in the literature. Indeed, it consists of a model in which data points are probabilistically assigned to clusters. Moreover, each entry of the coassociation matrix, which is derived from the ensemble, is regarded as a realization of a Binomial random variable, parametrized by the unknown cluster assignments, that counts the number of times two specific data points are expected to be clustered together. A consensus clustering is then obtained by means of a maximum likelihood estimation of the unknown probabilistic cluster assignments. We further show that this approach is equivalent to minimizing the Kullback-Leibler (KL) divergence between the observed cooccurrence frequencies derived from the coassociation matrix and the cooccurrence probabilities parametrizing the Binomial random variables. By replacing the KL-divergence with any Bregman divergence, we come up with a more general formulation for consensus clustering. In particular we consider, as an additional example, the case where the squared Euclidean distance is used as divergence. We also propose an optimization algorithm to solve the minimization problem derived from our formulation, which works for any double-convex Bregman divergence, and a comprehensive set of experiments shows the effectiveness of our new consensus clustering approach.
The remainder of the paper is organized as follows. In Sect. 2 we provide definitions and notations that will be used across the manuscript. In Sects. 3 and 4, we describe the proposed formulation for consensus clustering and the corresponding optimization problem. In Sect. 5, we present an optimization algorithm that can be used to find a consensus solution. Section 6 briefly reviews related work, and Sect. 7 reports experimental results. Finally, Sect. 8 presents some concluding remarks. A preliminary version of this paper appeared in Rota Bulò et al. (2010).
2 Notation and definitions
Sets are denoted by uppercase calligraphic letters (e.g., \(\mathcal {O}\), \(\mathcal {E}\), …) except for \(\mathbb{R}\) and \(\mathbb{R}_{+}\) which represent as usual the sets of real numbers and nonnegative real numbers, respectively. The cardinality of a finite set is written as |⋅|. We denote vectors with lowercase boldface letters (e.g., x, y, …) and matrices with uppercase boldface letters (e.g., X, Y, …). The ith component of a vector x is denoted as x _{ i } and the (i,j)th component of a matrix Y is written as y _{ ij }. The transposition operator is given by the symbol ^{⊤}. The ℓ _{ p }-norm of a vector x is written as ∥x∥_{ p } and we implicitly assume the ℓ _{2} (or Euclidean) norm when p is omitted. We denote by e _{ n } an n-dimensional column vector of all 1’s and by \(\mathbf {e}_{n}^{(j)}\) the jth column of the n-dimensional identity matrix. The trace of matrix \(\mathbf {M}\in\mathbb{R}^{n\times n}\) is given by \(\operatorname {Tr}(\mathbf {M})=\sum_{i=1}^{n}m_{ii}\). The domain of a function f is denoted by dom(f), and the indicator function of a predicate P gives 1 if P is true, 0 otherwise.
Examples of double-convex Bregman divergences
Divergence  ϕ(x)  d _{ ϕ }(x,y)  Domain 

Squared ℓ _{2}  ∥x∥^{2}  ∥x−y∥^{2}  \(\mathbf {x},\mathbf {y}\in\mathbb{R}^{K}\) 
Mahalanobis  x ^{⊤} A x  (x−y)^{⊤} A(x−y)  \(\mathbf {x},\mathbf {y}\in\mathbb{R}^{K}\), A≽0 
Kullback-Leibler  −H(x)  \(\sum_{j=1}^{K} x_{j}\log ( \frac{x_{j}}{y_{j}} )\)  x,y∈Δ _{ K } 
Generalized I-div.  −H(x)−e ^{⊤} x  \(\sum_{j=1}^{K} x_{j}\log ( \frac{x_{j}}{y_{j}} )-x_{j}+y_{j}\)  \(\mathbf {x},\mathbf {y}\in\mathbb{R}_{+}^{K}\) 
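As a small numerical illustration of the table above, the following sketch evaluates a Bregman divergence from its generating function using the standard definition d_ϕ(x,y)=ϕ(x)−ϕ(y)−⟨x−y,∇ϕ(y)⟩. The function names are hypothetical and chosen only for this example; the sketch merely checks that ϕ(x)=∥x∥^{2} yields the squared ℓ _{2} distance and that ϕ(x)=−H(x) yields the KL-divergence on the simplex.

```python
import math

def bregman(phi, grad_phi, x, y):
    """Standard Bregman divergence: phi(x) - phi(y) - <x - y, grad phi(y)>."""
    inner = sum((xi - yi) * gi for xi, yi, gi in zip(x, y, grad_phi(y)))
    return phi(x) - phi(y) - inner

def sq_norm(x):
    # phi(x) = ||x||^2
    return sum(v * v for v in x)

def sq_norm_grad(x):
    return [2.0 * v for v in x]

def neg_entropy(x):
    # phi(x) = -H(x) = sum_j x_j log x_j, with 0 log 0 = 0
    return sum(v * math.log(v) for v in x if v > 0)

def neg_entropy_grad(x):
    # assumes strictly positive entries in the second argument
    return [math.log(v) + 1.0 for v in x]

# Two points of the simplex (illustrative values only).
x = [0.2, 0.3, 0.5]
y = [0.4, 0.4, 0.2]

d_l2 = bregman(sq_norm, sq_norm_grad, x, y)
d_kl = bregman(neg_entropy, neg_entropy_grad, x, y)
```

When x and y do not sum to the same value, the second divergence carries the additional −x _{ j }+y _{ j } terms of the generalized I-divergence row.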
3 A probabilistic model for consensus clustering
Consensus clustering is an unsupervised learning approach that summarizes an ensemble of partitions obtained from a set of base clustering algorithms into a single consensus partition. In this section, we introduce a novel model for consensus clustering, which collapses the information gathered from the clustering ensemble into a single partition, in which data points are assigned to clusters in a probabilistic sense.
Let \(\mathcal {O}=\{1,\dots,n\}\) be the indices of a set of data points to be clustered and let \(\mathcal {E}=\{p_{u}\}_{u=1}^{N}\) be a clustering ensemble, i.e., a set of N clusterings obtained by different algorithms with possibly different parametrizations and/or initializations and/or subsampled versions of the data set. Each clustering \(p_{u} \in \mathcal {E}\) is a function \(p_{u}:\mathcal {O}_{u} \rightarrow \{1,\ldots,K_{u}\}\) assigning a cluster out of K _{ u } available ones to data points in \(\mathcal {O}_{u}\subseteq \mathcal {O}\), where \(\mathcal {O}_{u}\) and K _{ u } can be different across the clusterings indexed by u. We put forward data subsampling as a most general framework for the following reasons: it favours the diversity of the clustering ensemble and it models situations of distributed clustering where local clusters have only partial access to the data.
In the standard EAC literature the coassociation matrix holds the fraction of times two data points are coclustered, while in our definition it holds the number of times this event occurs. The reason for this choice stems from the fact that we allow subsampling in the ensemble construction. Consequently, the number of times two data points appear together in a clustering of the ensemble is not constant over all possible pairs. This renders the observed fraction of times coclustering occurs statistically more significant for some pairs of data points and less for others. By considering C in absolute terms and by keeping track of the quantities N _{ ij } we can capture this information. As an example, suppose that the ensemble consists of 100 partitions and that, due to subsampling, a pair of samples (i,j) coappears in 80 partitions and is coclustered in 70 of them. Then, N=100, N _{ ij }=80 and c _{ ij }=70.
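The counting described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: each clustering is represented as a dictionary from point index to cluster label, points missing from a dictionary are taken to be excluded by subsampling, and the diagonal entries are simply left at zero. The function name is hypothetical.

```python
def coassociation_counts(n, ensemble):
    """Return (C, N): c_ij counts co-clusterings, N_ij counts co-appearances.

    ensemble: list of dicts {point index: cluster label}; a point absent
    from a dict was not sampled by that clustering (assumed representation).
    """
    C = [[0] * n for _ in range(n)]
    N = [[0] * n for _ in range(n)]
    for labels in ensemble:
        points = sorted(labels)
        for a, i in enumerate(points):
            for j in points[a + 1:]:
                # i and j both appear in this clustering
                N[i][j] += 1
                N[j][i] += 1
                if labels[i] == labels[j]:
                    # ... and they share a cluster
                    C[i][j] += 1
                    C[j][i] += 1
    return C, N

# Toy ensemble of three clusterings over four points.
ensemble = [
    {0: 1, 1: 1, 2: 2, 3: 2},
    {0: 1, 1: 1, 2: 1},      # point 3 not sampled
    {1: 3, 2: 4, 3: 4},      # point 0 not sampled
]
C, N = coassociation_counts(4, ensemble)
```

Labels are compared only within a single clustering, so no correspondence between the labels of different clusterings is needed.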
Matrix Y ^{∗}, the solution of problem (1), provides probabilistic cluster assignments for the data points, which constitute the solution to the consensus clustering problem according to our model.
Summary of notation
Symbol  Description 

N _{ ij }  Number of clusterings in the ensemble in which data points i and j both appear 
c _{ ij }  Number of times data points i and j are coclustered 
C  C=[c _{ ij }], coassociation matrix 
y _{ i }  Probability distribution of data point i over the set of K clusters 
Y  \(\mathbf {Y}=[\mathbf {y}_{1},\dots,\mathbf {y}_{n}]\in \varDelta _{K}^{n}\) 
C _{ ij }  \(C_{ij} \sim \mbox{Binomial} ( N_{ij} , \mathbf {y}_{i}^{\top} \mathbf {y}_{j} )\) 
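To make the last row of the table concrete, the sketch below evaluates the negative log-likelihood of the counts under \(C_{ij} \sim \mbox{Binomial}(N_{ij}, \mathbf {y}_{i}^{\top} \mathbf {y}_{j})\). It is an assumption of this example, not a statement of the paper's exact objective, that the sum runs over unordered pairs i<j and that the binomial coefficients, which do not depend on Y, are dropped. Y is stored as a list of n columns, and all names are hypothetical.

```python
import math

def cooccurrence_prob(Y, i, j):
    """y_i^T y_j for Y stored as a list of n length-K distributions."""
    return sum(a * b for a, b in zip(Y[i], Y[j]))

def neg_log_likelihood(Y, C, N):
    """Negative Binomial log-likelihood over pairs i < j (constants dropped)."""
    n = len(Y)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if N[i][j] == 0:
                continue  # pair never observed together
            p = cooccurrence_prob(Y, i, j)
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard the logarithms
            total -= C[i][j] * math.log(p) + (N[i][j] - C[i][j]) * math.log(1.0 - p)
    return total

# Toy data: points 0 and 1 are usually coclustered, point 2 rarely joins them.
N = [[0, 10, 10], [10, 0, 10], [10, 10, 0]]
C = [[0, 9, 1], [9, 0, 1], [1, 1, 0]]
Y_good = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]
Y_flat = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
nll_good = neg_log_likelihood(Y_good, C, N)
nll_flat = neg_log_likelihood(Y_flat, C, N)
```

An assignment that reproduces the observed frequencies gets a lower value than the uninformative uniform assignment.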
4 A class of alternative formulations
The formulation introduced in the previous section for consensus clustering can be seen as a special instance of a more general setting, which will be described in this section.
Intuitively, the solution Y ^{∗} to (2) is a probabilistic cluster assignment yielding a minimum Bregman divergence between the observed cooccurrence statistics of each pair of data points and the estimated ones. Moreover, each term of f(Y) is weighted by N _{ ij } in order to account for the statistical significance of the observations.
The formulation in (2) encompasses the one introduced in the previous section as a special case. Indeed, by considering the parametrization ϕ(x)=−H(x), we have that B _{ ϕ }≡D _{ KL }, i.e., the Bregman divergence coincides with the KL-divergence, and by simple algebra the equivalence between (2) and (1) can be derived. For a formal proof we refer to Proposition 1 in the Appendix.
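The "simple algebra" can be checked numerically for a single pair. The sketch below assumes that the observed frequency c/N and the model probability p are compared as two-outcome (Bernoulli) distributions; under that assumption the negative Binomial log-likelihood and N times the KL-divergence differ by a quantity that does not depend on p, so both have the same minimizer. Function names are hypothetical.

```python
import math

def binom_nll(c, N, p):
    """Negative log-likelihood of c successes in N trials with probability p."""
    return -(math.log(math.comb(N, c)) + c * math.log(p) + (N - c) * math.log(1 - p))

def bernoulli_kl(q, p):
    """KL divergence between (q, 1-q) and (p, 1-p), with 0 log 0 = 0."""
    out = 0.0
    if q > 0:
        out += q * math.log(q / p)
    if q < 1:
        out += (1 - q) * math.log((1 - q) / (1 - p))
    return out

# Counts from the example in Sect. 3: coappear 80 times, coclustered 70 times.
c, N = 70, 80
gap_a = binom_nll(c, N, 0.6) - N * bernoulli_kl(c / N, 0.6)
gap_b = binom_nll(c, N, 0.9) - N * bernoulli_kl(c / N, 0.9)
```

The two gaps coincide up to rounding, which is the per-pair content of the equivalence; the weight N in front of the divergence is what the N _{ ij } weighting in f(Y) provides.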
In the next section we will cover the algorithmic aspects of the computation of probabilistic assignments, which represent our solution to the consensus clustering problem.
5 Optimization algorithm
In this section, we describe an efficient optimization procedure that finds a local solution to (2) and works for any double-convex Bregman divergence. This procedure falls in the class of primal line-search methods because it iteratively finds a feasible descent direction, i.e., one satisfying the constraints and guaranteeing a local decrease of the objective.
This section is organized into four parts. The first part is devoted to the problem of finding a feasible descent direction, while the second part addresses the problem of searching for a better solution along that direction. In the third part, we summarize the optimization algorithm and provide some additional techniques to reduce its computational complexity. Finally, in the last part we show how our algorithm can be adapted to efficiently cluster large-scale datasets.
5.1 Computation of a search direction
Given a nonoptimal feasible solution \(\mathbf {Y}\in \varDelta _{K}^{n}\) of (2), we can look for a better solution along a direction \(\mathbf {D}\in\mathbb {R}^{K\times n}\) by finding a value of ϵ such that f(Z _{ ϵ })<f(Y), where Z _{ ϵ }=Y+ϵ D. The search direction D is said to be feasible and descending at Y if the two following conditions hold for all sufficiently small positive values of ϵ: \(\mathbf {Z}_{\epsilon}\in \varDelta _{K}^{n}\) and f(Z _{ ϵ })<f(Y).
The search direction D ^{∗} at Y obtained from (5) is clearly feasible since it belongs to \(\mathcal {D} (\mathbf {Y})\), and it is also always descending unless Y satisfies the Karush-Kuhn-Tucker (KKT) conditions, i.e., the first-order necessary conditions for local optimality, for the minimization problem in (2). This result is formally proven in Proposition 4 in the Appendix.
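The following sketch shows one plausible way to pick such a direction. It is a reconstruction from the surrounding text rather than the paper's formula (5): it assumes that the direction moves probability mass in a single column J from a cluster V with y _{ VJ }>0 to a cluster U, and that J, U and V are chosen to maximize the gap g _{ J }(Y)_{ V }−g _{ J }(Y)_{ U } used later as stopping quantity. The data layout and names are hypothetical.

```python
def search_direction(G, Y):
    """Pick (J, U, V, gap) from gradients G and assignments Y.

    G[i][k]: partial derivative of f with respect to y_ki (assumed given).
    Y[i][k]: current probability of cluster k for point i.
    The move shifts mass from cluster V to cluster U in column J.
    """
    best = None
    for i, (g, y) in enumerate(zip(G, Y)):
        # cluster whose increase lowers the objective the most
        U = min(range(len(g)), key=lambda k: g[k])
        # mass can only be removed where it is currently positive
        support = [k for k in range(len(g)) if y[k] > 0]
        V = max(support, key=lambda k: g[k])
        gap = g[V] - g[U]
        if best is None or gap > best[3]:
            best = (i, U, V, gap)
    return best

# Toy gradients for two points and three clusters (illustrative values).
G = [[0.5, 0.1, 0.3], [2.0, -1.0, 0.0]]
Y = [[0.2, 0.3, 0.5], [0.0, 0.5, 0.5]]
J, U, V, gap = search_direction(G, Y)
```

Each gradient entry is read once, and a gap of zero means that no such single-column move decreases the objective to first order.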
5.2 Computation of an optimal step size
By the convexity of (9) and Kachurovskii’s theorem (Kachurovskii 1960) we have that ρ is nondecreasing in the interval 0≤ϵ≤y _{ VJ }. Moreover, ρ(0)<0 since D ^{∗} is a descending direction as stated by Proposition 4. Otherwise, we would have that Y satisfies the KKT conditions for local optimality.

if ρ(y _{ VJ })≤0 then ϵ ^{∗}=y _{ VJ } for f(Z _{ ϵ }) would be nonincreasing in the feasible set of (9);

if ρ(y _{ VJ })>0 then ϵ ^{∗} is a zero of ρ, which can in general be found using a dichotomic search that preserves the opposite signs of ρ at the endpoints of the search interval.
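The two cases above can be sketched as a plain bisection. The sketch assumes that ρ is supplied as a callable that is nondecreasing on the interval and negative at 0; the function name and the tolerance parameter `delta` are choices made for this example.

```python
def optimal_step(rho, eps_max, delta=1e-8):
    """Step size in [0, eps_max] for a nondecreasing rho with rho(0) < 0."""
    if rho(eps_max) <= 0:
        # objective does not increase anywhere on the interval: go to the end
        return eps_max
    lo, hi = 0.0, eps_max  # invariant: rho(lo) < 0 <= rho(hi)
    while hi - lo > delta:
        mid = 0.5 * (lo + hi)
        if rho(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative derivative with a zero at 0.3 inside [0, 0.5].
step_inner = optimal_step(lambda e: 2.0 * e - 0.6, 0.5)
# Illustrative derivative that stays negative on [0, 0.5].
step_bound = optimal_step(lambda e: e - 1.0, 0.5)
```

The loop halves the interval until it is shorter than `delta`, so the number of evaluations of ρ grows like log_{2}(1/δ).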
In some cases (9) has a closed-form solution. This of course depends on the nature of the Bregman divergence adopted. For instance, if we consider the squared ℓ _{2} distance as a divergence (i.e., ϕ(x)=∥x∥^{2}), then f(Z _{ ϵ }) becomes a quadratic polynomial in the variable ϵ, which can be trivially minimized in closed form.
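For the quadratic case, the closed form amounts to clipping the vertex of the parabola to the feasible interval. The sketch assumes that the coefficients a and b of f(Z _{ ϵ })=aϵ^{2}+bϵ+const have already been computed from the data (how they are obtained is not shown here) and that b<0 because the direction is descending; the function name is hypothetical.

```python
def quadratic_step(a, b, eps_max):
    """Minimize a*eps**2 + b*eps + const over [0, eps_max], assuming b < 0."""
    if a <= 0:
        # no curvature: the objective keeps decreasing, so take the full step
        return eps_max
    # unconstrained minimizer -b/(2a), clipped to the feasible interval
    return min(max(-b / (2.0 * a), 0.0), eps_max)

step_inside = quadratic_step(2.0, -1.0, 1.0)   # vertex inside the interval
step_clipped = quadratic_step(2.0, -8.0, 1.0)  # vertex beyond the interval
```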
5.3 Algorithm
At an abstract level, the algorithm iteratively finds a feasible, descending direction D ^{∗} at the current solution Y ^{(t)}, computes the optimal step ϵ ^{∗} and performs an update of the solution as Y ^{(t+1)}=Y ^{(t)}+ϵ ^{∗} D ^{∗}. This procedure is iterated until a stopping criterion is met.
In order to obtain a per-iteration time complexity that is linear in the number of variables, we exploit the extreme sparseness of the search direction D ^{∗} for the update of matrix Y ^{(t)} ^{⊤} Y ^{(t)} (denoted by A ^{(t)} in the pseudocode) and for the update of the gradient vectors \(g_{i}^{(t)}\). Each iteration, indeed, depends on these two fundamental quantities. Specifically, the computation of A ^{(t+1)} can be obtained in O(n) by simply changing the Jth row and the Jth column of A ^{(t)} (it follows from the update formula at line 10). By exploiting A ^{(t+1)}, the gradient vectors can be computed in O(Kn). In fact, we obtain \(g_{i}^{(t+1)}\) for all \(i\in \mathcal {O}\setminus\{J\}\) by performing a constant time operation on each entry of \(g_{i}^{(t)}\) (lines 12–14) and we compute \(g_{J}^{(t+1)}\) (line 15) in O(Kn) as well. Having A ^{(t)} and the gradient vectors computed allows us to find the search direction D ^{∗} at line 8 in O(nK), since it suffices to access each element of the gradient vectors only once to determine J, U and V. Moreover, the computation of the optimal step size at line 9 can be carried out in O(nlog_{2}(1/δ)), if a dichotomic search is employed, and in constant time in cases where a closed-form solution exists (e.g., if ϕ(x)=∥x∥^{2}). Finally, the update of the solution at line 11 can be carried out in constant time by the sparsity of D ^{∗}. The time complexity of each iteration is thus given by O(nmax(K,log_{2}(1/δ))).
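The O(n) update of A can be illustrated as follows. The sketch assumes, as inferred from the text, that a step changes only column J of Y by moving an amount ϵ from cluster V to cluster U; it is not a transcription of the pseudocode, and the names are hypothetical. Y is stored as a list of n columns.

```python
def gram(Y):
    """A = Y^T Y computed from scratch, for Y given as a list of n columns."""
    n = len(Y)
    return [[sum(a * b for a, b in zip(Y[i], Y[j])) for j in range(n)] for i in range(n)]

def update_gram(A, Y, J, U, V, eps):
    """Update A in place for y_J <- y_J + eps * (e_U - e_V).

    Y must still hold the OLD assignments; only row and column J of A change.
    """
    for i in range(len(Y)):
        if i != J:
            delta = eps * (Y[i][U] - Y[i][V])
            A[J][i] += delta
            A[i][J] += delta
    # (y_J + eps*d)^T (y_J + eps*d) with d = e_U - e_V and d^T d = 2
    A[J][J] += 2.0 * eps * (Y[J][U] - Y[J][V]) + 2.0 * eps * eps

Y = [[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]]
A = gram(Y)
J, U, V, eps = 0, 0, 1, 0.3
update_gram(A, Y, J, U, V, eps)
# apply the same sparse step to Y and recompute A from scratch for comparison
Y[J][U] += eps
Y[J][V] -= eps
A_check = gram(Y)
```

The incremental update touches 2n−1 entries, whereas recomputing A from scratch costs O(n ^{2} K).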
The most costly part of the algorithm is the initialization (lines 2–5), which has O(n ^{2} K) time complexity. Hence, the overall complexity of the algorithm is O(n ^{2} K+mn×max(K,log_{2}(1/δ))) where m is the number of required iterations, which is difficult to know in advance. As a rule of thumb, we need m∈Ω(nK) iterations to converge, because every entry of Y should be modified at least once. In that case the complexity is determined by the iterations only.
Finally, the stopping criterion should ideally test whether D ^{∗} is a descending direction. Indeed, if this does not hold then we know that Y ^{(t)} satisfies the KKT conditions (this follows from Proposition 4 in the Appendix) and we can stop. In practice, we simply check whether the quantity g _{ J }(Y ^{(t)})_{ V }−g _{ J }(Y ^{(t)})_{ U } is below a given threshold τ and stop if this happens. Indeed, if that quantity is precisely zero, then Y ^{(t)} satisfies the KKT conditions. Additionally, we put an upper bound on the number of iterations.
5.4 A note on scalability
In applications where the number of data points to cluster is very large, the computation of the whole coassociation matrix becomes infeasible. In such cases one resorts to sparsifying the coassociation matrix by keeping a number of entries that scales linearly with the number of data points.
Our algorithm can be easily adapted to deal with sparse coassociation matrices. Assume that \(\mathcal {P} \) contains only a sparse set of observable data point pairs. Let ℓ be the expected average number of entries of \(\mathcal {P} _{i}\), i.e., \(\ell=\sum_{i\in \mathcal {O} }|\mathcal {P} _{i}|/n\) and assume that the input quantities c _{ ij } and N _{ ij } are given only for the pairs \(\{i,j\}\in \mathcal {P} \). Since we need to know the value of \(\mathbf {y}_{i}^{\top} \mathbf {y}_{j}\) again only for pairs of data points in \(\mathcal {P} \), the computation of A ^{(0)} is not fully required and only the entries indexed by \(\mathcal {P} \) should be computed. This reduces to O(Kℓn) the complexity of line 2 of Algorithm 1, where ℓ≪n. The same complexity characterizes the initialization of the gradient at lines 3–5. The subsequent updates of matrix A ^{(t)} at line 10 and of the gradient at lines 12–15 require only O(ℓ) and O(K ℓ) operations, respectively. By adopting a priority queue (e.g., heap based), the computation of the optimal direction in terms of U, V and J at line 8 requires only an overall complexity of O(Klog_{2}(n)) per iteration. This can be achieved by initially storing in the priority queue the best values of U and V for all \(i\in \mathcal {O} \) and by updating the priorities based on the sparse changes in the gradient values. The optimal step at line 9 can be computed in O(ℓlog_{2}(1/δ)), where δ is the tolerance for the dichotomic search. Finally, the update of Y still has constant complexity. The overall per-iteration complexity becomes O(max(ℓlog_{2}(1/δ),Klog_{2}(n))). As for the number of iterations, the considerations made in Sect. 5.3 still hold.
6 Related work
Several consensus methods have been proposed in the literature (Fred 2001; Strehl and Ghosh 2003; Fred and Jain 2005; Topchy et al. 2004; Dimitriadou et al. 2002; Ayad and Kamel 2008; Fern and Brodley 2004). Some of these methods are based on the similarity between data points, which is induced by the clustering ensemble, others are based on estimates of similarity between partitions and others cast the problem as a categorical clustering problem. All these methods tend to reveal a more robust and stable clustering solution than the individual clusterings used as input for the problem. A very recent survey can be found in Ghosh et al. (2011).
Strehl and Ghosh (2003) formulated the clustering ensemble problem as an optimization problem based on the maximal average mutual information between the optimal combined clustering and the clustering ensemble, presenting three algorithms to solve it that explore graph theoretical concepts. The first one, entitled Cluster-based Similarity Partitioning Algorithm (CSPA), uses a graph partitioning algorithm, METIS (Karypis and Kumar 1998), for extracting a consensus partition from the coassociation matrix. The second and third algorithms, Hyper Graph Partitioning Algorithm (HGPA) and Meta CLustering Algorithm (MCLA), respectively, are based on hypergraphs, where vertices correspond to data points, and the hyperedges, which allow the connection of several vertices, correspond to clusters of the clustering ensemble. HGPA obtains the consensus solution using a hypergraph partitioning algorithm, HMETIS (Karypis et al. 1997); MCLA uses another heuristic, which allows clustering clusters.
Fern and Brodley (2004) reduce the problem to graph partitioning. The proposed model, entitled Hybrid Bipartite Graph Formulation (HBGF), uses as vertices both instances and clusters of the ensemble, retaining all of the information provided by the clustering ensemble and making it possible to consider the similarity among instances and among clusters. The partitioning of this bipartite graph is produced using the multiway spectral graph partitioning algorithm proposed by Ng et al. (2001), which optimizes the normalized cut criterion (Shi and Malik 2000), or, as an alternative, the graph partitioning algorithm METIS (Karypis and Kumar 1998).
These approaches were later extended by Punera and Ghosh (2007, 2008) to allow soft base clusterings in the clustering ensemble, showing that the addition of information on the ensemble is useful; the proposed models were the soft versions of CSPA, MCLA, and HBGF. Additionally they proposed to use information-theoretic K-means (Dhillon et al. 2003), an algorithm very similar to K-means that differs only in the distance measure (it uses the KL-divergence), for clustering in the feature space obtained by concatenating all the posteriors from the ensemble.
Topchy et al. (2003, 2004, 2005) proposed two different formulations, both derived from similarities between the partitions in the ensemble, rather than similarities between data points, differently from the case of coassociation based approaches. The first one is a multinomial mixture model (MM) over the labels of the clustering ensemble, thus each partition is considered as a feature with categorical attributes. The second one is based on the notion of median partition and is entitled Quadratic Mutual Information Algorithm (QMI). The median partition is defined as the partition that best summarizes the partitions of the ensemble.
Wang et al. (2009, 2011) extended this idea, introducing a Bayesian version of the multinomial mixture model, the Bayesian cluster ensembles (BCE). Although the posterior distribution cannot be calculated in closed form, it is approximated using variational inference and Gibbs sampling, in a very similar procedure as in latent Dirichlet allocation (LDA) models (Griffiths and Steyvers 2004; Steyvers and Griffiths 2007), but applied to a different input feature space, the feature space of the labels of the ensembles. In Wang et al. (2010), a nonparametric version of BCE was proposed.
Ayad and Kamel (2008), following Dimitriadou et al. (2002), proposed the idea of cumulative voting as a solution to the problem of aligning the cluster labels. Each clustering of the ensemble is transformed into a probabilistic representation with respect to a common reference clustering. Three voting schemes are presented: Un-normalized fixed-Reference Cumulative Voting (URCV), fixed-Reference Cumulative Voting (RCV), and Adaptive Cumulative Voting (ACV).
Lourenço et al. (2011) modelled the problem of consensus extraction taking pairwise information as the input space and using a generative aspect model for dyadic data. The consensus solution is found by solving a maximum likelihood estimation problem using the Expectation-Maximization (EM) algorithm.
Our framework is also related to Nonnegative Matrix Factorization (Paatero and Tapper 1994; Lee and Seung 2000), which is the problem of approximately factorizing a given matrix M into two entrywise nonnegative matrices F and G, so that M≈F G. Indeed our formulation can be regarded as a kind of matrix factorization of the coassociation matrix in terms of matrix Y ^{⊤} Y under the constraint that Y is column stochastic. This particular setting has been considered, for the ℓ _{2} norm, in Arora et al. (2011) and in Nepusz et al. (2008).
7 Experimental results
In this section we evaluate our formulation using synthetic datasets and real-world datasets from the UCI Irvine and UCI KDD Machine Learning Repository. We performed four different series of experiments: (i) we study the convergence properties of our algorithm on synthetically generated coassociation matrices, (ii) we compare the consensus clustering obtained on different datasets with the known, crisp, ground truth partitions using standard evaluation criteria and we compare against other consensus clustering approaches, (iii) we perform an experiment on a large-scale dataset with incomplete partitions in the ensemble, (iv) we perform a qualitative analysis of a real-world dataset by deriving additional information from the probabilistic output of our algorithm.
We evaluate the performance of our Probabilistic Consensus Clustering (PCC) algorithm with KL-divergence (PCCKL) and with squared ℓ _{2} divergence (PCCℓ _{2}). From the quantitative perspective, we compare the performance of PCCℓ _{2} and PCCKL against other state-of-the-art consensus algorithms: the classical EAC algorithm using as extraction criteria the hierarchical agglomerative single-link (EACSL) and average-link (EACAL) algorithms; Cluster-based Similarity Partitioning Algorithm (CSPA) (Strehl and Ghosh 2003); Hybrid Bipartite Graph Formulation (HBGF) (Fern and Brodley 2004); Mixture Model (MM) (Topchy et al. 2004, 2005); Quadratic Mutual Information Algorithm (QMI) (Topchy et al. 2003, 2005).
7.1 Simulated data
7.2 UCI and synthetic data
We followed the usual strategy of producing clustering ensembles and combining them using the coassociation matrix. Two different types of ensembles were created: (1) using k-means with random initialization and a random number of clusters (Lourenço et al. 2010), splitting natural clusters into micro-blocks; (2) combining multiple algorithms (agglomerative hierarchical algorithms: single, average, Ward, and centroid link; k-means (Jain and Dubes 1988); spectral clustering (Ng et al. 2001)) with different numbers of clusters, inducing block-wise coassociation matrices.
Benchmark datasets
Data set  K  n  Ensemble K _{ i }  

(s–1) rings  3  450  2–8 
(s–2) image1  8  1000  8–15, 20, 30 
(r–1) iris  3  150  3–10, 15, 20 
(r–2) wine  3  178  4–10, 15, 20 
(r–3) housevotes  2  232  4–10, 15, 20 
(r–4) ionosphere  2  351  4–10, 15, 20 
(r–5) stdyeastcell  5  384  5–10, 15, 20 
(r–6) breastcancer  2  683  3–10, 15, 20 
(r–7) optdigits  10  1000  10, 12, 15, 20, 35, 50 
Results from experiments conducted with an ensemble of type 1 evaluated with criterion \(\mathcal{H}\). See Sect. 7.2 for details
Alg  Stat  s–1  s–2  r–1  r–2  r–3  r–4  r–5  r–6  r–7  

PCCKL  avg  0.53  0.57  0.86  0.96  0.89  0.54  0.92  0.63  0.87 
std  0.02  0.03  0.10  0.00  0.01  0.00  0.11  0.00  0.04  
max  0.55  0.61  0.91  0.96  0.90  0.54  0.97  0.63  0.90  
min  0.51  0.55  0.69  0.96  0.88  0.54  0.73  0.63  0.80  
PCCℓ _{2}  avg  0.57  0.46  0.71  0.97  0.88  0.54  0.74  0.60  0.87 
std  0.00  0.02  0.00  0.00  0.00  0.00  0.00  0.02  0.03  
max  0.57  0.48  0.71  0.97  0.88  0.54  0.74  0.62  0.89  
min  0.57  0.43  0.71  0.97  0.88  0.54  0.74  0.57  0.82  
QMI  avg  0.52  0.39  0.55  0.67  0.77  0.43  0.73  0.64  0.36 
std  0.13  0.03  0.10  0.18  0.16  0.11  0.24  0.07  0.10  
max  0.75  0.43  0.72  0.96  0.93  0.57  0.97  0.75  0.49  
min  0.44  0.36  0.46  0.53  0.53  0.30  0.41  0.57  0.24  
MM  avg  0.54  0.34  0.61  0.65  0.70  0.38  0.78  0.62  0.57 
std  0.02  0.03  0.08  0.13  0.11  0.07  0.18  0.12  0.05  
max  0.57  0.39  0.73  0.85  0.85  0.48  0.95  0.75  0.63  
min  0.51  0.30  0.52  0.52  0.54  0.32  0.54  0.48  0.49  
HBGF  avg  0.50  0.37  0.66  0.71  0.58  0.44  0.68  0.65  0.49 
std  0.11  0.07  0.12  0.18  0.06  0.07  0.13  0.07  0.05  
max  0.64  0.47  0.84  0.96  0.66  0.50  0.86  0.71  0.56  
min  0.35  0.29  0.55  0.49  0.53  0.34  0.54  0.53  0.43  
EAC  SL  1.00  0.67  0.75  0.67  0.67  0.53  0.66  0.67  0.62 
AL  1.00  0.59  0.89  0.93  0.87  0.69  0.97  0.54  0.80  
WL  0.73  0.47  0.89  0.96  0.85  0.54  0.73  0.61  0.90  
CSPA  0.78  0.49  0.97  0.92  0.93  0.52  0.85  0.68  0.87 
Results from experiments conducted with an ensemble of type 2 evaluated with criterion \(\mathcal{H}\). See Sect. 7.2 for details
Alg  Stat  s–1  s–2  r–1  r–2  r–3  r–4  r–5  r–6  r–7  

PCCKL  avg  0.71  0.71  0.97  0.97  0.91  0.69  0.97  0.73  0.61 
std  0.03  0.02  0.00  0.00  0.00  0.00  0.00  0.00  0.02  
max  0.75  0.73  0.97  0.97  0.91  0.69  0.97  0.73  0.64  
min  0.69  0.68  0.97  0.97  0.91  0.69  0.97  0.73  0.58  
PCCℓ _{2}  avg  0.68  0.63  0.95  0.97  0.91  0.68  0.96  0.73  0.60 
std  0.00  0.04  0.02  0.00  0.00  0.00  0.03  0.00  0.01  
max  0.68  0.67  0.97  0.97  0.91  0.68  0.97  0.73  0.62  
min  0.68  0.57  0.93  0.97  0.91  0.68  0.91  0.72  0.59  
QMI  avg  0.49  0.30  0.44  0.59  0.75  0.52  0.84  0.69  0.24 
std  0.11  0.11  0.11  0.11  0.20  0.12  0.17  0.05  0.15  
max  0.62  0.48  0.62  0.65  0.91  0.67  0.97  0.73  0.50  
min  0.38  0.23  0.33  0.40  0.53  0.35  0.65  0.64  0.15  
MM  avg  0.53  0.38  0.61  0.57  0.63  0.48  0.85  0.67  0.51 
std  0.06  0.06  0.16  0.07  0.12  0.05  0.09  0.03  0.04  
max  0.64  0.46  0.87  0.64  0.80  0.55  0.96  0.70  0.54  
min  0.50  0.30  0.46  0.48  0.53  0.43  0.76  0.63  0.44  
HBGF  avg  0.43  0.27  0.48  0.55  0.62  0.38  0.74  0.62  0.50 
std  0.04  0.02  0.07  0.11  0.03  0.05  0.15  0.10  0.06  
max  0.46  0.30  0.56  0.70  0.68  0.46  0.88  0.77  0.57  
min  0.37  0.25  0.39  0.39  0.58  0.34  0.54  0.54  0.41  
EAC  SL  0.63  0.71  0.69  0.39  0.52  0.36  0.65  0.64  0.32 
AL  0.38  0.71  0.97  0.39  0.91  0.36  0.66  0.65  0.42  
WL  0.39  0.59  0.93  0.93  0.88  0.70  0.96  0.65  0.76  
CSPA  0.68  0.56  0.96  0.93  0.92  0.53  0.85  0.66  0.78 
Results from experiments conducted with an ensemble of type 1, evaluated with the Adjusted Rand index. See Sect. 7.2 for details.
Alg  Stat  s–1  s–2  r–1  r–2  r–3  r–4  r–5  r–6  r–7
PCC-KL  avg  0.59  0.86  0.87  0.95  0.81  0.79  0.88  0.53  0.96
PCC-KL  std  0.01  0.01  0.05  0.00  0.02  0.00  0.15  0.00  0.01
PCC-KL  max  0.60  0.88  0.89  0.95  0.82  0.79  0.94  0.53  0.96
PCC-KL  min  0.58  0.85  0.78  0.95  0.79  0.78  0.61  0.53  0.95
PCC-\(\ell_2\)  avg  0.60  0.83  0.78  0.96  0.78  0.78  0.61  0.52  0.96
PCC-\(\ell_2\)  std  0.00  0.02  0.00  0.00  0.00  0.00  0.00  0.01  0.01
PCC-\(\ell_2\)  max  0.60  0.84  0.78  0.96  0.78  0.78  0.61  0.53  0.96
PCC-\(\ell_2\)  min  0.60  0.80  0.78  0.96  0.78  0.78  0.61  0.51  0.95
QMI  avg  0.52  0.61  0.65  0.75  0.69  0.61  0.70  0.54  0.68
QMI  std  0.12  0.08  0.10  0.11  0.15  0.22  0.22  0.05  0.13
QMI  max  0.66  0.68  0.78  0.95  0.86  0.75  0.94  0.63  0.84
QMI  min  0.41  0.48  0.56  0.68  0.50  0.23  0.51  0.51  0.55
MM  avg  0.57  0.74  0.65  0.70  0.60  0.65  0.71  0.55  0.89
MM  std  0.05  0.01  0.07  0.10  0.09  0.08  0.18  0.06  0.01
MM  max  0.60  0.75  0.71  0.83  0.75  0.75  0.91  0.62  0.90
MM  min  0.49  0.72  0.55  0.54  0.50  0.53  0.50  0.50  0.88
HBGF  avg  0.57  0.75  0.71  0.76  0.52  0.71  0.60  0.55  0.87
HBGF  std  0.04  0.04  0.08  0.11  0.02  0.01  0.11  0.03  0.03
HBGF  max  0.63  0.78  0.83  0.95  0.55  0.72  0.75  0.59  0.90
HBGF  min  0.52  0.69  0.61  0.67  0.50  0.70  0.50  0.50  0.83
EAC  SL  1.00  0.86  0.79  0.73  0.55  0.70  0.55  0.55  0.89
EAC  AL  1.00  0.86  0.88  0.90  0.77  0.81  0.94  0.50  0.95
EAC  WL  0.66  0.84  0.88  0.95  0.75  0.79  0.61  0.52  0.96
CSPA  -  0.78  0.83  0.97  0.90  0.86  0.78  0.75  0.56  0.96
Results from experiments conducted with an ensemble of type 2, evaluated with the Adjusted Rand index. See Sect. 7.2 for details.
Alg  Stat  s–1  s–2  r–1  r–2  r–3  r–4  r–5  r–6  r–7
PCC-KL  avg  0.72  0.89  0.97  0.96  0.83  0.81  0.94  0.60  0.85
PCC-KL  std  0.03  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.02
PCC-KL  max  0.77  0.89  0.97  0.96  0.83  0.81  0.94  0.60  0.87
PCC-KL  min  0.70  0.88  0.97  0.96  0.83  0.81  0.94  0.60  0.83
PCC-\(\ell_2\)  avg  0.69  0.88  0.94  0.96  0.83  0.81  0.92  0.60  0.83
PCC-\(\ell_2\)  std  0.00  0.01  0.03  0.00  0.00  0.00  0.05  0.00  0.01
PCC-\(\ell_2\)  max  0.69  0.88  0.97  0.96  0.83  0.81  0.94  0.60  0.84
PCC-\(\ell_2\)  min  0.69  0.87  0.92  0.96  0.83  0.81  0.84  0.60  0.82
QMI  avg  0.56  0.30  0.42  0.63  0.69  0.62  0.78  0.58  0.41
QMI  std  0.06  0.22  0.09  0.16  0.17  0.23  0.21  0.03  0.30
QMI  max  0.61  0.68  0.56  0.74  0.83  0.80  0.93  0.60  0.84
QMI  min  0.49  0.17  0.33  0.34  0.50  0.23  0.54  0.54  0.19
MM  avg  0.58  0.76  0.65  0.63  0.55  0.71  0.75  0.56  0.86
MM  std  0.05  0.03  0.11  0.09  0.08  0.03  0.12  0.02  0.02
MM  max  0.63  0.81  0.84  0.72  0.68  0.75  0.92  0.58  0.88
MM  min  0.50  0.73  0.58  0.54  0.50  0.67  0.63  0.53  0.83
HBGF  avg  0.55  0.72  0.60  0.64  0.53  0.67  0.65  0.54  0.87
HBGF  std  0.02  0.01  0.05  0.05  0.02  0.03  0.13  0.06  0.01
HBGF  max  0.57  0.74  0.66  0.71  0.56  0.71  0.79  0.64  0.88
HBGF  min  0.52  0.70  0.52  0.58  0.51  0.63  0.50  0.50  0.85
EAC  SL  0.61  0.89  0.78  0.36  0.50  0.25  0.55  0.54  0.46
EAC  AL  0.53  0.89  0.96  0.36  0.84  0.28  0.55  0.54  0.60
EAC  WL  0.54  0.88  0.92  0.92  0.79  0.81  0.93  0.54  0.93
CSPA  -  0.72  0.84  0.95  0.91  0.86  0.78  0.74  0.55  0.94
The performance of PCC-KL and PCC-\(\ell_2\) depends on the type of ensemble. On ensemble (1), both generally perform worse than EAC and CSPA (on both the Adjusted Rand index and CI), which seem well suited to this kind of ensemble. Nevertheless, on the UCI datasets both obtain promising results: PCC-\(\ell_2\) is the best on 1 of the 9 datasets, and PCC-KL is the best on 1 of the 9 and is very close to the best consensus in several cases. On ensemble (2), PCC-KL obtains the best results on almost all datasets (7 of 9).
It is also important to note that the standard deviation of the proposed methods is very low, being close to zero on almost every dataset.
7.3 A large-scale experiment
To show that our algorithm can also be used on large-scale datasets, we present an experiment on a KDD Cup 1999 dataset. From the available datasets we analysed a subset of “kddcup.data10percent” consisting of 120,000 data points characterized by 41 attributes and distributed in 3 classes. The preprocessing consisted of standardizing numerical features and discretizing categorical features, arriving at a 39-dimensional feature space. We produced an ensemble of 100 K-means partitions obtained on random subsets of the dataset (sampling rate 50 %) with random initializations and a random number of clusters (2≤K≤10). Since the ensemble is composed of incomplete partitions, the consensus clustering phase becomes more challenging.
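As an illustration only, the ensemble construction described above could be sketched as follows. This is not the code used in the experiments: the function names, the plain Lloyd-style K-means, the toy data and the convention of marking unsampled points with the label -1 are all assumptions made for the example.

```python
import random

def kmeans_labels(points, k, rng, iters=20):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = rng.sample(points, k)  # random initialization from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for idx, p in enumerate(points):
            labels[idx] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: mean of the members; an empty cluster keeps its center.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def build_ensemble(data, n_partitions=100, rate=0.5, k_min=2, k_max=10, seed=0):
    """Each partition clusters a random subset; unsampled points get label -1."""
    rng = random.Random(seed)
    n = len(data)
    ensemble = []
    for _ in range(n_partitions):
        subset = rng.sample(range(n), int(rate * n))  # 50 % sampling by default
        k = rng.randint(k_min, k_max)                 # random number of clusters
        sub_labels = kmeans_labels([data[i] for i in subset], k, rng)
        partition = [-1] * n                          # -1 marks "not in this partition"
        for i, lab in zip(subset, sub_labels):
            partition[i] = lab
        ensemble.append(partition)
    return ensemble

# Toy usage: two Gaussian blobs of 20 points each (hypothetical data).
data_rng = random.Random(1)
toy = [(data_rng.gauss(cx, 0.3), data_rng.gauss(cy, 0.3))
       for cx, cy in [(0, 0), (5, 5)] for _ in range(20)]
toy_ensemble = build_ensemble(toy, n_partitions=5)
```

Each partition in such an ensemble labels only half of the points, which is what makes the partitions incomplete.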
To cope with the large number of data points, which makes constructing the full coassociation matrix infeasible in terms of both space and computation time, we ran a sparsified version of our algorithm as described in Sect. 5.4. Specifically, we created \(\mathcal{P}\) by sampling a share of 0.25 ‰ of the data point pairs from the available ones (around 8 billion). Our algorithms (PCC-\(\ell_2\) and PCC-KL) were run with a maximum of nK iterations. Our non-parallelized C implementations of PCC-\(\ell_2\) and PCC-KL took on average 13.8 s and 16.7 s, respectively, to deliver a solution on a dual-core 64-bit Pentium 2.8 GHz with 4 GB RAM (only one core was effectively used). We were able to compare our algorithm only against CSPA, which nevertheless obtained competitive results in the previous set of experiments. All other approaches could not be run, either because of the large size of the dataset or because they cannot handle incomplete partitions in the ensemble.
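To give a sense of scale: taking n = 120,000 exactly, there are n(n-1)/2 ≈ 7.2 billion unordered pairs, so a share of 0.25 ‰ corresponds to roughly 1.8 million sampled pairs. The sketch below shows one way such a sparsified coassociation could be computed. It is an assumption-laden illustration rather than the procedure of Sect. 5.4: the names `sample_pairs` and `sparse_coassociation`, the use of -1 for points missing from a partition, and the pair of counts kept for each sampled pair are all hypothetical.

```python
import random

def sample_pairs(n, share, rng):
    """Draw distinct unordered pairs (i, j) with i < j, uniformly at random."""
    target = max(1, round(share * n * (n - 1) / 2))
    pairs = set()
    while len(pairs) < target:  # rejection sampling, cheap when share is small
        i, j = rng.sample(range(n), 2)
        pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)

def sparse_coassociation(ensemble, pairs):
    """For each sampled pair return (c_ij, N_ij).

    N_ij counts the partitions in which both points are present and
    c_ij counts those in which they also share a cluster label.
    """
    counts = {}
    for i, j in pairs:
        both = [(p[i], p[j]) for p in ensemble if p[i] != -1 and p[j] != -1]
        counts[(i, j)] = (sum(a == b for a, b in both), len(both))
    return counts

# Toy usage: 3 incomplete partitions of 4 points (-1 means "not sampled").
toy_ensemble = [[0, 0, 1, -1],
                [0, 1, 1, 0],
                [-1, 0, 0, 0]]
toy_counts = sparse_coassociation(toy_ensemble, [(0, 1), (1, 2), (0, 3)])
# toy_counts == {(0, 1): (1, 2), (1, 2): (2, 3), (0, 3): (1, 1)}
```

Only the sampled pairs are ever stored, so memory grows with the size of the sample rather than with the square of the number of data points.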
Our consensus solution provides a considerable improvement of this score, confirming the effectiveness of our algorithm also in the presence of incomplete partitions in the ensemble.
7.4 Visualizing probabilistic relations
8 Conclusions
In this paper, we introduced a new probabilistic consensus clustering formulation based on the EAC paradigm. Each entry of the coassociation matrix, derived from the ensemble, is regarded as a Binomial random variable parametrized by the unknown class assignments. We showed that maximizing the log-likelihood of this model is equivalent to minimizing the KL divergence between the coassociation relative frequencies and the cooccurrence probabilities of the Binomial model. This formulation can be seen as a special case of a more general setting in which the KL divergence is replaced with any Bregman divergence. We proposed an algorithm to find a consensus clustering solution according to our model, which works with any double-convex Bregman divergence. We also showed how the algorithm can be adapted to deal with large-scale datasets. Experiments on synthetic and real-world datasets have demonstrated the effectiveness of our approach with ensembles composed of heterogeneous partitions obtained from multiple algorithms (agglomerative hierarchical algorithms, k-means, spectral clustering) with varying numbers of clusters. Additionally, we have shown that our algorithm is able to deal with large-scale datasets and can be applied successfully to ensembles containing incomplete partitions. On different datasets and ensembles, we outperformed the competing state-of-the-art algorithms and obtained particularly strong results on the large-scale experiment. The qualitative analysis of the probabilistic consensus solutions provided some evidence that the proposed formulation can discover new structures in data. For the PenDigits dataset, we showed visual relationships between overlapping clusters representing the same digit, using the centroids of each cluster and similarities between clusters obtained from the probabilities of the consensus solution.
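The Binomial model summarized above can be made concrete with a small sketch. It assumes that the cooccurrence probability of two points is the inner product \(y_i^\top y_j\) of their probabilistic assignment vectors, and that each pair comes with a count \(c_{ij}\) of coassociations out of \(N_{ij}\) partitions containing both points. The function names and the clipping constant `eps` are illustrative choices, not part of the original formulation.

```python
import math

def cooccurrence(y_i, y_j):
    """Probability that two points fall in the same cluster (inner product)."""
    return sum(a * b for a, b in zip(y_i, y_j))

def neg_log_likelihood(counts, Y, eps=1e-12):
    """Negative Binomial log-likelihood of the counts, up to an additive constant.

    counts maps a pair (i, j) to (c_ij, N_ij); Y[i] is the assignment vector
    of point i (non-negative entries summing to one).
    """
    total = 0.0
    for (i, j), (c, n) in counts.items():
        # Clip away from 0 and 1 so that the logarithms stay finite.
        q = min(max(cooccurrence(Y[i], Y[j]), eps), 1.0 - eps)
        total -= c * math.log(q) + (n - c) * math.log(1.0 - q)
    return total

# Toy usage: points 0 and 1 always co-occur, point 2 never joins them.
toy_counts = {(0, 1): (10, 10), (0, 2): (0, 10), (1, 2): (0, 10)}
crisp = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
uniform = [(0.5, 0.5)] * 3
assert neg_log_likelihood(toy_counts, crisp) < neg_log_likelihood(toy_counts, uniform)
```

In this toy case the assignment that matches the observed frequencies attains the lower objective value, as expected when minimizing the divergence between observed frequencies and model probabilities.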
Acknowledgements
This work was partially financed by an ERCIM “Alain Bensoussan” Fellowship Programme under the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement n. 246016, by Fundação para a Ciência e Tecnologia, under grants PTDC/EIA-CCO/103230/2008, SFRH/PROTEC/49512/2009 and PEst-OE/EEI/LA0008/2011, and by the Área Departamental de Engenharia Electronica e Telecomunicações e de Computadores of Instituto Superior de Engenharia de Lisboa, whose support the authors gratefully acknowledge.
References
Arora, R., Gupta, M., Kapila, A., & Fazel, M. (2011). Clustering by left-stochastic matrix factorization. In L. Getoor & T. Scheffer (Eds.), ICML (pp. 761–768). Omnipress.
Ayad, H., & Kamel, M. S. (2008). Cumulative voting consensus method for partitions with variable number of clusters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1), 160–173.
Banerjee, A., Krumpelman, C., Basu, S., Mooney, R. J., & Ghosh, J. (2005a). Model-based overlapping clustering. In Int. conf. on knowledge discovery and data mining (pp. 532–537).
Banerjee, A., Merugu, S., Dhillon, I., & Ghosh, J. (2005b). Clustering with Bregman divergences. Journal of Machine Learning Research, 6, 1705–1749.
Bezdek, J. (1981). Pattern recognition with fuzzy objective function algorithms. Norwell: Kluwer Academic.
Bezdek, J., & Hathaway, R. (2002). VAT: a tool for visual assessment of (cluster) tendency. In Proceedings of the 2002 international joint conference on neural networks 2002, IJCNN’02 (Vol. 3, pp. 2225–2230).
Boyd, S., & Vandenberghe, L. (2004). Convex optimization (1st ed.). Cambridge: Cambridge University Press.
Cui, Y., Fern, X. Z., & Dy, J. G. (2010). Learning multiple non-redundant clusterings. In Transactions on Knowledge Discovery from Data (TKDD) (Vol. 4, pp. 1–32).
Dhillon, I. S., Mallela, S., & Kumar, R. (2003). A divisive information-theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3, 1265–1287.
Dimitriadou, E., Weingessel, A., & Hornik, K. (2002). A combination scheme for fuzzy clustering. In AFSS’02 (pp. 332–338).
Färber, I., Günnemann, S., Kriegel, H., Kröger, P., Müller, E., Schubert, E., Seidl, T., & Zimek, A. (2010). On using class-labels in evaluation of clusterings. In MultiClust: 1st international workshop on discovering, summarizing and using multiple clusterings.
Fern, X. Z., & Brodley, C. E. (2004). Solving cluster ensemble problems by bipartite graph partitioning. In Proc. ICML ’04.
Frank, A., & Asuncion, A. (2012). UCI machine learning repository. http://archive.ics.uci.edu/ml.
Fred, A. (2001). Finding consistent clusters in data partitions. In J. Kittler & F. Roli (Eds.), Multiple classifier systems (Vol. 2096, pp. 309–318). Berlin: Springer.
Fred, A., & Jain, A. (2002). Data clustering using evidence accumulation. In Proc. of the 16th int’l conference on pattern recognition (pp. 276–280).
Fred, A., & Jain, A. (2005). Combining multiple clustering using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835–850.
Fred, A., & Jain, A. (2006). Learning pairwise similarity for data clustering. In Proc. of the 18th int’l conference on pattern recognition (ICPR), Hong Kong (Vol. 1, pp. 925–928).
Ghosh, J., & Acharya, A. (2011). Cluster ensembles. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(4), 305–315.
Karypis, G., Aggarwal, R., Kumar, V., & Shekhar, S. (1997). Multilevel hypergraph partitioning: applications in VLSI domain. In Proc. design automation conf.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1), 5228–5235.
Heller, K., & Ghahramani, Z. (2007). A nonparametric Bayesian approach to modeling overlapping clusters. In Int. conf. AI and statistics.
Jain, A. K., & Dubes, R. (1988). Algorithms for clustering data. New York: Prentice Hall.
Jardine, N., & Sibson, R. (1968). The construction of hierarchic and non-hierarchic classifications. Computer Journal, 11, 177–184.
Kachurovskii, I. R. (1960). On monotone operators and convex functionals. Uspehi Matematičeskih Nauk, 15(4), 213–215.
Karypis, G., & Kumar, V. (1998). Multilevel algorithms for multi-constraint graph partitioning. In Proceedings of the 10th supercomputing conference.
Lee, D. D., & Seung, H. S. (2000). Algorithms for non-negative matrix factorization. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), NIPS (pp. 556–562). Cambridge: MIT Press.
Lourenço, A., Fred, A., & Figueiredo, M. (2011). A generative dyadic aspect model for Evidence Accumulation Clustering. In Proc. 1st int. conf. similarity-based pattern recognition, SIMBAD’11 (pp. 104–116). Berlin/Heidelberg: Springer.
Lourenço, A., Fred, A., & Jain, A. K. (2010). On the scalability of evidence accumulation clustering. In Proc. 20th international conference on pattern recognition (ICPR), Istanbul, Turkey.
Mei, J. P., & Chen, L. (2010). Fuzzy clustering with weighted medoids for relational data. Pattern Recognition, 43(5), 1964–1974.
Meila, M. (2003). Comparing clusterings by the variation of information. In Springer (Ed.), Proc. of the sixteenth annual conf. of computational learning theory, COLT.
Nepusz, T., Petróczi, A., Négyessy, L., & Bazsó, F. (2008). Fuzzy communities and the concept of bridgeness in complex networks. Physical Review A, 77, 016107.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: analysis and an algorithm. In NIPS (pp. 849–856). Cambridge: MIT Press.
Paatero, P., & Tapper, U. (1994). Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2), 111–126.
Punera, K., & Ghosh, J. (2007). Soft consensus clustering. In Advances in fuzzy clustering and its applications. New York: Wiley.
Punera, K., & Ghosh, J. (2008). Consensus-based ensembles of soft clusterings. Applied Artificial Intelligence, 22(7&8), 780–810.
Rota Bulò, S., Lourenço, A., Fred, A., & Pelillo, M. (2010). Pairwise probabilistic clustering using evidence accumulation. In Proc. 2010 int. conf. on structural, syntactic, and statistical pattern recognition, SSPR&SPR’10 (pp. 395–404).
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(8), 888–905.
Steyvers, M., & Griffiths, T. (2007). Latent semantic analysis: a road to meaning. In Probabilistic topic models. Laurence Erlbaum.
Strehl, A., & Ghosh, J. (2003). Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 583–617.
Topchy, A., Jain, A., & Punch, W. (2003). Combining multiple weak clusterings. In IEEE intl. conf on data mining, Melbourne (pp. 331–338).
Topchy, A., Jain, A., & Punch, W. (2004). A mixture model of clustering ensembles. In Proc. of the SIAM conf. on data mining.
Topchy, A., Jain, A. K., & Punch, W. (2005). Clustering ensembles: models of consensus and weak partitions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12), 1866–1881.
Wang, H., Shan, H., & Banerjee, A. (2009). Bayesian cluster ensembles. In 9th SIAM int. conf. on data mining.
Wang, H., Shan, H., & Banerjee, A. (2011). Bayesian cluster ensembles. Statistical Analysis and Data Mining, 4(1), 54–70.
Wang, P., Domeniconi, C., & Laskey, K. B. (2010). Nonparametric Bayesian clustering ensembles. In ECML PKDD’10 (pp. 435–450).