Abstract
Clustering ensemble methods produce a consensus partition of a set of data points by combining the results of a collection of base clustering algorithms. In the evidence accumulation clustering (EAC) paradigm, the clustering ensemble is transformed into a pairwise co-association matrix, thus avoiding the label correspondence problem, which is intrinsic to other clustering ensemble schemes. In this paper, we propose a consensus clustering approach based on the EAC paradigm, which is not limited to crisp partitions and fully exploits the nature of the co-association matrix. Our solution determines probabilistic assignments of data points to clusters by minimizing a Bregman divergence between the observed co-association frequencies and the corresponding co-occurrence probabilities expressed as functions of the unknown assignments. We additionally propose an optimization algorithm to find a solution under any double-convex Bregman divergence. Experiments on both synthetic and real benchmark data show the effectiveness of the proposed approach.
Introduction
Clustering ensemble methods look for consensus solutions from a set of base clustering algorithms, thus trying to combine into a single partition the information present in many different ones. Several authors have shown that these methods tend to reveal more robust and stable cluster structures than the individual clusterings in the ensemble (Fred 2001; Fred and Jain 2002; Strehl and Ghosh 2003). Leveraging an ensemble of clusterings is considerably more difficult than combining an ensemble of classifiers, due to the label correspondence problem: how to put in correspondence the cluster labels produced by different clustering algorithms? This problem is made more serious if clusterings with different numbers of clusters are allowed in the ensemble.
A possible solution to sidestep the cluster label correspondence problem has been proposed in the Evidence Accumulation Clustering (EAC) framework (Fred and Jain 2005). The core idea rests on the assumption that similar data points are very likely to be grouped together by some clustering algorithm and, conversely, that data points which co-occur very often in the same cluster should be regarded as very similar. Hence, it is reasonable to summarize a clustering ensemble in terms of a pairwise similarity matrix, called the co-association matrix, where each entry counts the number of clusterings in the ensemble in which a given pair of data points is placed in the same cluster. This new mapping can then be used as input for any similarity-based clustering algorithm. In Fred and Jain (2005), agglomerative hierarchical algorithms are used to extract the consensus partition (e.g., Single Link, Average Link, or Ward's Link). In Fred and Jain (2006), an extension called Multi-Criteria Evidence Accumulation Clustering (Multi-EAC) is proposed, which filters the cluster combination process using a cluster stability criterion. Rather than using all the information from the different partitions, it assumes that, since algorithms can perform differently in different regions of the space, only certain clusters should be considered.
The way the co-association matrix is exploited in the literature is rather naïve. Indeed, standard approaches based on EAC simply run a generic pairwise clustering algorithm with the co-association matrix as input. The underlying clustering criteria of such ad hoc algorithms, however, do not take advantage of the statistical interpretation of the computed similarities, which is an intrinsic part of the EAC framework. Also, the direct application of a clustering algorithm to the co-association matrix typically induces a hard partition of the data. Although having crisp partitions as a baseline for the accumulation of evidence of data organization is reasonable, this assumption is too restrictive in the phase of producing a consensus clustering. Indeed, the consensus partition is a solution that tries to accommodate the different clusterings in the ensemble, and by allowing soft assignments of data points to clusters we can preserve some information about their intrinsic variability and capture the level of uncertainty of the overall label assignments, which would not be detected with hard consensus partitions. The variability in the clustering solutions of the ensemble may depend not only on the different algorithms and parametrizations adopted to build the ensemble, but also on the presence of clusters that naturally overlap in the data. This is the case for many important applications, such as clustering microarray gene expression data, text categorization, perceptual grouping, labelling of visual scenes, and medical diagnosis. In these cases, having a consensus solution in terms of a soft partition also allows the detection of overlapping clusters characterizing the data. It is worth mentioning that the importance of dealing with overlapping clusters was recognized long ago (Jardine and Sibson 1968) and, in the machine learning community, there has been renewed interest in this problem (Banerjee et al. 2005a; Heller and Ghahramani 2007).
As an alternative, the consensus extraction could be obtained by running fuzzy k-medoids (Mei and Chen 2010) on the co-association matrix as if it were a standard similarity matrix, or fuzzy k-means (Bezdek 1981) by interpreting each row of the co-association matrix as a feature vector. However, such solutions would not take into account the underlying probabilistic meaning of the co-association matrix and lack any formal statistical support.
In this paper, we propose a consensus clustering approach based on the EAC paradigm. Our solution fully exploits the nature of the co-association matrix and does not lead to crisp partitions, as opposed to the standard approaches in the literature. Indeed, it consists of a model in which data points are probabilistically assigned to clusters. Moreover, each entry of the co-association matrix, which is derived from the ensemble, is regarded as a realization of a Binomial random variable, parametrized by the unknown cluster assignments, that counts the number of times two specific data points are expected to be clustered together. A consensus clustering is then obtained by means of a maximum likelihood estimation of the unknown probabilistic cluster assignments. We further show that this approach is equivalent to minimizing the Kullback-Leibler (KL) divergence between the observed co-occurrence frequencies derived from the co-association matrix and the co-occurrence probabilities parametrizing the Binomial random variables. By replacing the KL divergence with any Bregman divergence, we arrive at a more general formulation for consensus clustering. In particular, we consider, as an additional example, the case where the squared Euclidean distance is used as the divergence. We also propose an optimization algorithm to solve the minimization problem derived from our formulation, which works for any double-convex Bregman divergence, and a comprehensive set of experiments shows the effectiveness of our new consensus clustering approach.
The remainder of the paper is organized as follows. In Sect. 2, we provide the definitions and notation that will be used throughout the manuscript. In Sects. 3 and 4, we describe the proposed formulation for consensus clustering and the corresponding optimization problem. In Sect. 5, we present an optimization algorithm that can be used to find a consensus solution. Section 6 briefly reviews related work, and Sect. 7 reports experimental results. Finally, Sect. 8 presents some concluding remarks. A preliminary version of this paper appeared in Rota Bulò et al. (2010).
Notation and definitions
Sets are denoted by uppercase calligraphic letters (e.g., \(\mathcal {O}\), \(\mathcal {E}\), …) except for \(\mathbb{R}\) and \(\mathbb{R}_{+}\), which represent as usual the sets of real and nonnegative real numbers, respectively. The cardinality of a finite set is written as \(|\cdot|\). We denote vectors with lowercase boldface letters (e.g., x, y, …) and matrices with uppercase boldface letters (e.g., X, Y, …). The ith component of a vector x is denoted by x_i and the (i,j)th component of a matrix Y is written y_{ij}. The transposition operator is given by the symbol ^{⊤}. The ℓ_p-norm of a vector x is written ∥x∥_p, and when p is omitted the ℓ_2 (or Euclidean) norm is implicitly assumed. We denote by e_n an n-dimensional column vector of all 1's and by \(\mathbf {e}_{n}^{(j)}\) the jth column of the n-dimensional identity matrix. The trace of a matrix \(\mathbf {M}\in\mathbb{R}^{n\times n}\) is given by \(\operatorname {Tr}(\mathbf {M})=\sum_{i=1}^{n}m_{ii}\). The domain of a function f is denoted by dom(f), and \([P]\) is the indicator function, giving 1 if the predicate P is true and 0 otherwise.
A probability distribution over a finite set {1,…,K} is an element of the standard simplex Δ_K, which is defined as

$$\varDelta_K = \bigl\{\mathbf{x} \in \mathbb{R}_+^K : \mathbf{e}_K^{\top} \mathbf{x} = 1\bigr\}.$$
The support σ(x) of a probability distribution x ∈ Δ_K is the set of indices corresponding to positive components of x, i.e.,

$$\sigma(\mathbf{x}) = \bigl\{k \in \{1,\dots,K\} : x_k > 0\bigr\}.$$
The entropy of a probability distribution x ∈ Δ_K is given by

$$H(\mathbf{x}) = -\sum_{k=1}^{K} x_k \log x_k,$$

and the Kullback-Leibler divergence between two distributions x, y ∈ Δ_K is given by

$$D_{KL}(\mathbf{x} \,\|\, \mathbf{y}) = \sum_{k=1}^{K} x_k \log\frac{x_k}{y_k},$$
where we assume log 0 ≡ −∞ and 0 log 0 ≡ 0.
Given a continuously differentiable, real-valued and strictly convex function \(\phi:\varDelta _{K}\rightarrow\mathbb {R}\), we denote by

$$B_{\phi}(\mathbf{x} \,\|\, \mathbf{y}) = \phi(\mathbf{x}) - \phi(\mathbf{y}) - \nabla\phi(\mathbf{y})^{\top}(\mathbf{x} - \mathbf{y})$$

the Bregman divergence associated with ϕ for points x, y ∈ Δ_K, where ∇ is the gradient operator. By construction, the Bregman divergence is convex in its first argument. If convexity also holds in the second one, then we say that the Bregman divergence is double-convex.
The Kullback-Leibler divergence is a special case of double-convex Bregman divergence, obtained by setting ϕ(x) = −H(x). In Table 1 we report other examples of double-convex Bregman divergences.
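As a quick numerical sanity check (ours, not from the paper), the identity between the KL divergence and the Bregman divergence generated by the negative entropy can be verified directly; the function names below are illustrative:

```python
import numpy as np

def neg_entropy(x):
    # phi(x) = -H(x) = sum_k x_k log x_k
    return np.sum(x * np.log(x))

def grad_neg_entropy(x):
    # gradient of phi: log(x_k) + 1
    return np.log(x) + 1.0

def bregman(phi, grad_phi, x, y):
    # B_phi(x || y) = phi(x) - phi(y) - <grad phi(y), x - y>
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

def kl(x, y):
    # Kullback-Leibler divergence between distributions on the simplex
    return np.sum(x * np.log(x / y))
```

Since both arguments lie on the simplex, the linear term of the Bregman divergence cancels and `bregman(neg_entropy, grad_neg_entropy, x, y)` coincides with `kl(x, y)`.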
A probabilistic model for consensus clustering
Consensus clustering is an unsupervised learning approach that summarizes an ensemble of partitions obtained from a set of base clustering algorithms into a single consensus partition. In this section, we introduce a novel model for consensus clustering, which collapses the information gathered from the clustering ensemble into a single partition, in which data points are assigned to clusters in a probabilistic sense.
Let \(\mathcal {O}=\{1,\dots,n\}\) be the indices of a set of data points to be clustered and let \(\mathcal {E}=\{p_{u}\}_{u=1}^{N}\) be a clustering ensemble, i.e., a set of N clusterings obtained by different algorithms with possibly different parametrizations and/or initializations and/or subsampled versions of the data set. Each clustering \(p_{u} \in \mathcal {E}\) is a function \(p_{u}:\mathcal {O}_{u} \rightarrow \{1,\ldots,K_{u}\}\) assigning one of K_u available clusters to each data point in \(\mathcal {O}_{u}\subseteq \mathcal {O}\), where \(\mathcal {O}_{u}\) and K_u can differ across the clusterings indexed by u. We adopt data subsampling as the most general framework for two reasons: it favours the diversity of the clustering ensemble, and it models situations of distributed clustering where local clusterers have only partial access to the data.
Since each clustering in the ensemble may stem from a subsampled version of the original dataset, some pairs of data points may not appear in all clusterings. Let Ω_ij ⊆ {1,…,N} denote the set of indices of the clusterings in the ensemble where both data points i and j appear, i.e., \((u\in \varOmega_{ij}) \Leftrightarrow (\{i,j\}\subseteq \mathcal {O}_{u} )\), and let N_ij = |Ω_ij| denote its cardinality. Clearly, Ω_ij = Ω_ji and consequently N_ij = N_ji for all pairs (i,j) of data points. The ensemble of clusterings is summarized in the co-association matrix C = [c_ij] ∈ {0,…,N}^{n×n}, where

$$c_{ij} = \sum_{u \in \varOmega_{ij}} \bigl[\, p_u(i) = p_u(j) \,\bigr]$$

is the number of times i and j are co-clustered in the ensemble \(\mathcal{E}\); of course, 0 ≤ c_ij ≤ N_ij ≤ N and c_ij = c_ji.
In the standard EAC literature, the co-association matrix holds the fraction of times two data points are co-clustered, while in our definition it holds the number of times this event occurs. The reason for this choice stems from the fact that we allow subsampling in the ensemble construction. Consequently, the number of times two data points appear together in a clustering of the ensemble is not constant over all possible pairs. This renders the observed fraction of co-clusterings statistically more significant for some pairs of data points than for others. By considering C in absolute terms and keeping track of the quantities N_ij, we can capture this information. As an example, suppose the ensemble consists of N = 100 partitions and, due to subsampling, a pair of data points (i,j) co-appears in N_ij = 80 of them and is co-clustered 70 times; then c_ij = 70.
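A minimal sketch of how the counts c_ij and N_ij could be accumulated from a subsampled ensemble; representing each partial clustering as a dict from data-point index to cluster label is our assumption for illustration, not the paper's:

```python
import numpy as np

def coassociation(n, ensemble):
    """Accumulate co-clustering evidence from a clustering ensemble.

    `ensemble` is a list of dicts mapping a data-point index to its
    cluster label; points absent from a dict were not subsampled in
    that clustering.  Returns (C, N): c_ij counts how often i and j
    are co-clustered, n_ij how often both appear in a clustering.
    """
    C = np.zeros((n, n), dtype=int)
    N = np.zeros((n, n), dtype=int)
    for labels in ensemble:
        idx = list(labels)
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                i, j = idx[a], idx[b]
                N[i, j] += 1; N[j, i] += 1          # pair observed together
                if labels[i] == labels[j]:           # pair co-clustered
                    C[i, j] += 1; C[j, i] += 1
    return C, N
```

By construction 0 <= c_ij <= n_ij <= N and both matrices are symmetric, matching the definition above.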
Our model assumes that each data point has an unknown probability of being assigned to each cluster. We denote by y_i = (y_{1i},…,y_{Ki})^{⊤} ∈ Δ_K the probability distribution over the set of K clusters {1,…,K} which characterizes data point \(i\in \mathcal {O}\), i.e., \(y_{ki} = \mathbb{P}[ i \in \mathcal {C} _{k} ]\), where \(\mathcal {C} _{k}\) denotes the subset of \(\mathcal {O}\) that constitutes the kth cluster. The model parameter K should not be understood as the desired number of clusters, but rather as a maximum number of clusters. Without prior knowledge, K might coincide with the number of data points, i.e., K = n. Finally, we store all the y_i's in a K×n matrix \(\mathbf {Y}=[\mathbf {y}_{1},\dots,\mathbf {y}_{n}]\in \varDelta _{K}^{n}\). The probability that data points i and j are co-clustered is thus

$$\sum_{k=1}^{K} y_{ki}\, y_{kj} = \mathbf{y}_i^{\top} \mathbf{y}_j.$$
Let C_ij, i<j, be a Binomial random variable representing the number of times that data points i and j are co-clustered; from the assumptions above we have that \(C_{ij} \sim \mbox{Binomial} ( N_{ij} , \mathbf {y}_{i}^{\top} \mathbf {y}_{j} )\), that is,

$$\mathbb{P}[C_{ij} = c_{ij} \mid \mathbf{Y}] = \binom{N_{ij}}{c_{ij}} \bigl(\mathbf{y}_i^{\top} \mathbf{y}_j\bigr)^{c_{ij}} \bigl(1 - \mathbf{y}_i^{\top} \mathbf{y}_j\bigr)^{N_{ij} - c_{ij}}.$$
Each element c_ij, i<j, of the co-association matrix C is interpreted as a sample of the random variable C_ij and, due to the symmetry of C, the entries c_ij and c_ji are considered the same sample. The model treats the different C_ij's as independent. This simplification is essential in practice, because by decoupling the pairwise, or higher-order, correlations present in the ensemble, the likelihood becomes more tractable. Consequently, the probability of observing C, given the cluster probabilities Y, is given by

$$\mathbb{P}[\mathbf{C} \mid \mathbf{Y}] = \prod_{\{i,j\} \in \mathcal{P}} \mathbb{P}[C_{ij} = c_{ij} \mid \mathbf{Y}],$$
where \(\mathcal {P} =\{ \{i,j\}\subseteq \mathcal {O}: i\neq j\}\) is the set of all distinct pairs of data points. Since we consider the observations c _{ ij } and c _{ ji } as being identical due to the symmetry of C, the product is taken over the set of distinct unordered pairs of data points.
We can now estimate the unknown cluster assignments by maximizing the log-likelihood \(\log\mathbb{P}[\mathbf {C}\mid\mathbf {Y}]\) with respect to Y, which is given by

$$\log\mathbb{P}[\mathbf{C} \mid \mathbf{Y}] = \sum_{\{i,j\} \in \mathcal{P}} \Bigl[ c_{ij} \log\bigl(\mathbf{y}_i^{\top} \mathbf{y}_j\bigr) + (N_{ij} - c_{ij}) \log\bigl(1 - \mathbf{y}_i^{\top} \mathbf{y}_j\bigr) \Bigr] + \text{const}.$$

This yields the following maximization problem, where terms not depending on Y have been dropped:

$$\mathbf{Y}^{*} \in \operatorname*{arg\,max}_{\mathbf{Y} \in \varDelta_K^n} \; \sum_{\{i,j\} \in \mathcal{P}} \Bigl[ c_{ij} \log\bigl(\mathbf{y}_i^{\top} \mathbf{y}_j\bigr) + (N_{ij} - c_{ij}) \log\bigl(1 - \mathbf{y}_i^{\top} \mathbf{y}_j\bigr) \Bigr]. \tag{1}$$
Matrix Y ^{∗}, the solution of problem (1), provides probabilistic cluster assignments for the data points, which constitute the solution to the consensus clustering problem according to our model.
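For illustration, the Binomial log-likelihood being maximized in problem (1) can be evaluated as follows; the function and variable names are ours, and the constant binomial-coefficient terms are dropped as in the text:

```python
import numpy as np

def neg_log_likelihood(Y, C, N, eps=1e-12):
    """Negative Binomial log-likelihood of the co-association counts.

    Y : (K, n) column-stochastic matrix of cluster probabilities.
    C : (n, n) co-clustering counts; N : (n, n) co-occurrence counts.
    Constant binomial-coefficient terms are omitted.
    """
    P = Y.T @ Y                        # p_ij = y_i^T y_j
    P = np.clip(P, eps, 1 - eps)       # guard the logarithms
    iu = np.triu_indices_from(C, k=1)  # distinct pairs {i, j}, i < j
    c, n_ij, p = C[iu], N[iu], P[iu]
    return -np.sum(c * np.log(p) + (n_ij - c) * np.log(1 - p))
```

Minimizing this quantity over column-stochastic Y is equivalent to the maximization in (1).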
In Table 2 we summarize the notation introduced in this section.
A class of alternative formulations
The formulation introduced in the previous section for consensus clustering can be seen as a special instance of a more general setting, which will be described in this section.
Let \(\psi:\mathbb {R}\rightarrow\mathbb {R}^{2}\) be the function mapping a scalar to a 2-dimensional vector defined as ψ(x) = (x, 1−x)^{⊤}, and let \(d_{\phi}:\mathbb {R}\times\mathbb {R}\rightarrow\mathbb {R}\) be given as follows:

$$d_{\phi}(x, y) = B_{\phi}\bigl(\psi(x) \,\|\, \psi(y)\bigr).$$
Consider now the following class of formulations for consensus clustering, which is parametrized by a continuously differentiable and strictly convex function \(\phi:\varDelta _{2}\rightarrow\mathbb {R}\):

$$\mathbf{Y}^{*} \in \operatorname*{arg\,min}_{\mathbf{Y} \in \varDelta_K^n} f(\mathbf{Y}), \tag{2}$$

where

$$f(\mathbf{Y}) = \sum_{\{i,j\} \in \mathcal{P}} N_{ij}\, d_{\phi}\!\left(\frac{c_{ij}}{N_{ij}},\, \mathbf{y}_i^{\top} \mathbf{y}_j\right). \tag{3}$$
Intuitively, the solution Y^{∗} to (2) is a probabilistic cluster assignment yielding a minimum Bregman divergence between the observed co-occurrence statistics of each pair of data points and the estimated ones. Moreover, each term of f(Y) is weighted by N_ij in order to account for the statistical significance of the observations.
The formulation in (2) encompasses the one introduced in the previous section as a special case. Indeed, by considering the parametrization ϕ(x) = −H(x), we have that B_ϕ ≡ D_KL, i.e., the Bregman divergence coincides with the KL divergence, and by simple algebra the equivalence between (2) and (1) can be derived. For a formal proof, we refer to Proposition 1 in the Appendix.
Different algorithms for consensus clustering can be derived by adopting different Bregman divergences in (2), i.e., by changing the way errors between observed frequencies and estimated probabilities of co-occurrence are penalized. This is close in spirit to Banerjee et al. (2005b), where a similar approach was adopted in the context of partitional data clustering. In addition to the formulation corresponding to the KL divergence, in this paper we also study the case where a squared ℓ_2 penalization is considered in (3), i.e., when ϕ(x) = ∥x∥² and d_ϕ becomes the squared Euclidean distance. This yields the following optimization problem:

$$\mathbf{Y}^{*} \in \operatorname*{arg\,min}_{\mathbf{Y} \in \varDelta_K^n} \; \sum_{\{i,j\} \in \mathcal{P}} N_{ij} \left(\frac{c_{ij}}{N_{ij}} - \mathbf{y}_i^{\top} \mathbf{y}_j\right)^{2}. \tag{4}$$
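The family of objectives f(Y) can be sketched with pluggable divergences; the Bernoulli embedding ψ(x) = (x, 1−x) follows the section, while the helper names are ours. Note that with ϕ(x) = ∥x∥² the Bregman divergence on ψ gives 2(f − p)²; the constant factor does not affect the minimizer:

```python
import numpy as np

def kl_div(f, p, eps=1e-12):
    # KL divergence between Bernoulli vectors (f, 1-f) and (p, 1-p)
    f, p = np.clip(f, eps, 1 - eps), np.clip(p, eps, 1 - eps)
    return f * np.log(f / p) + (1 - f) * np.log((1 - f) / (1 - p))

def sq_l2(f, p):
    # squared Euclidean distance between (f, 1-f) and (p, 1-p)
    return 2.0 * (f - p) ** 2

def objective(Y, C, N, d_phi):
    """f(Y): N_ij-weighted divergence between observed co-clustering
    frequencies c_ij / N_ij and modeled probabilities y_i^T y_j."""
    P = Y.T @ Y
    iu = np.triu_indices_from(C, k=1)
    freq = C[iu] / np.maximum(N[iu], 1)
    return np.sum(N[iu] * d_phi(freq, P[iu]))
```

Swapping `d_phi` between `kl_div` and `sq_l2` switches between the formulations corresponding to (1) and (4).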
In the next section we will cover the algorithmic aspects of the computation of probabilistic assignments, which represent our solution to the consensus clustering problem.
Optimization algorithm
In this section, we describe an efficient optimization procedure that finds a local solution to (2) and works for any double-convex Bregman divergence. This procedure falls in the class of primal line-search methods, because it iteratively finds a feasible descent direction, i.e., one satisfying the constraints and guaranteeing a local decrease of the objective.
This section is organized into four parts. The first part is devoted to the problem of finding a feasible, descent direction, while the second part addresses the problem of searching a better solution along that direction. In the third part, we summarize the optimization algorithm and provide some additional techniques to reduce its computational complexity. Finally, in the last part we show how our algorithm can be adapted to efficiently cluster largescale datasets.
Computation of a search direction
Given a nonoptimal feasible solution \(\mathbf {Y}\in \varDelta _{K}^{n}\) of (2), we can look for a better solution along a direction \(\mathbf {D}\in\mathbb {R}^{K\times n}\) by finding a value of ϵ such that f(Z _{ ϵ })<f(Y), where Z _{ ϵ }=Y+ϵ D. The search direction D is said to be feasible and descending at Y if the two following conditions hold for all sufficiently small positive values of ϵ: \(\mathbf {Z}_{\epsilon}\in \varDelta _{K}^{n}\) and f(Z _{ ϵ })<f(Y).
Our algorithm considers search directions at Y that are everywhere zero except for two entries lying on the same column. Specifically, it selects directions belonging to the following set:

$$\mathcal{D}(\mathbf{Y}) = \Bigl\{ \bigl(\mathbf{e}_K^{(u)} - \mathbf{e}_K^{(v)}\bigr)\bigl(\mathbf{e}_n^{(j)}\bigr)^{\top} : u, v \in \{1,\dots,K\},\; v \in \sigma(\mathbf{y}_j),\; j \in \mathcal{O} \Bigr\}.$$
Here, the condition v ∈ σ(y_j) guarantees that every direction in \(\mathcal {D} (\mathbf {Y})\) is feasible at Y (see Proposition 2 in the Appendix). Among this set, taking a greedy decision, we select the direction leading to the steepest descent, i.e., we look for a solution to the following optimization problem:

$$\mathbf{D}^{*} \in \operatorname*{arg\,min}_{\mathbf{D} \in \mathcal{D}(\mathbf{Y})} \operatorname{Tr}\bigl(\nabla f(\mathbf{Y})^{\top} \mathbf{D}\bigr). \tag{5}$$
By exploiting the definition of \(\mathcal {D} (\mathbf {Y})\), the solution to (5) can be written as \(\mathbf {D}^{*}=(\mathbf {e}_{K}^{(U)}-\mathbf {e}_{K}^{(V)})(\mathbf {e}_{n}^{(J)})^{\top}\), where the indices U, V and J are determined as follows. Let U_j, V_j be given by

$$U_j \in \operatorname*{arg\,min}_{u \in \{1,\dots,K\}} g_j(\mathbf{Y})_u, \qquad V_j \in \operatorname*{arg\,max}_{v \in \sigma(\mathbf{y}_j)} g_j(\mathbf{Y})_v,$$

for all \(j\in \mathcal {O} \), where g_j(Y) is the partial derivative of f with respect to y_j, which is given by

$$g_j(\mathbf{Y}) = \sum_{i \in \mathcal{P}_j} N_{ij}\, \partial_2\, d_{\phi}\!\left(\frac{c_{ij}}{N_{ij}},\, \mathbf{y}_i^{\top} \mathbf{y}_j\right) \mathbf{y}_i,$$

with \(\partial_2\) denoting the partial derivative of d_ϕ with respect to its second argument. Here \(\mathcal {P} _{j}=\{i\in \mathcal {O} :\{i,j\}\in \mathcal {P} \}\). Then, by Proposition 3 in the Appendix, J can be computed as

$$J \in \operatorname*{arg\,max}_{j \in \mathcal{O}} \bigl\{ g_j(\mathbf{Y})_{V_j} - g_j(\mathbf{Y})_{U_j} \bigr\},$$

while U = U_J and V = V_J.
The search direction D^{∗} at Y obtained from (5) is clearly feasible, since it belongs to \(\mathcal {D} (\mathbf {Y})\), but it is also always descending, unless Y satisfies the Karush-Kuhn-Tucker (KKT) conditions, i.e., the first-order necessary conditions for local optimality, for the minimization problem in (2). This result is formally proven in Proposition 4 in the Appendix.
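Assuming the gradients g_j(Y) are stacked as columns of a K×n matrix G, the greedy choice of U, V and J could be sketched as follows (a hypothetical helper, not the authors' code):

```python
import numpy as np

def pick_direction(Y, G, tol=1e-12):
    """Greedy choice of the sparse search direction D* = (e_U - e_V) e_J^T.

    G is the K x n matrix whose j-th column is the gradient g_j(Y).
    U_j minimizes the gradient (entry to increase); V_j maximizes it
    over the support of y_j (entry to decrease, so it must be positive);
    J maximizes the resulting descent gap.
    """
    K, n = Y.shape
    U = np.argmin(G, axis=0)
    # restrict the maximization to the support of each column of Y
    masked = np.where(Y > tol, G, -np.inf)
    V = np.argmax(masked, axis=0)
    gaps = G[V, np.arange(n)] - G[U, np.arange(n)]
    J = int(np.argmax(gaps))
    return U[J], V[J], J, gaps[J]
```

When the returned gap is (numerically) zero, no descending direction exists in the considered set, which matches the stopping criterion discussed later in the section.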
Computation of an optimal step size
Once a feasible descending direction \(\mathbf {D}^{*}=(\mathbf {e}_{K}^{(U)}-\mathbf {e}_{K}^{(V)})(\mathbf {e}_{n}^{(J)})^{\top}\) has been computed from (5), we have to find an optimal step size ϵ^{∗} that achieves a decrease in the objective value. The optimal step is given as a solution to the following one-dimensional optimization problem:

$$\epsilon^{*} \in \operatorname*{arg\,min}_{0 \le \epsilon \le y_{VJ}} f(\mathbf{Z}_{\epsilon}), \tag{9}$$
where Z_ϵ = Y + ϵ D^{∗} and the feasible interval for ϵ follows from the constraint that \(\mathbf {Z}_{\epsilon}\in \varDelta _{K}^{n}\). This problem is convex thanks to the double-convexity assumption imposed on the Bregman divergence (see Proposition 5 in the Appendix).
Let ρ(ϵ′) denote the first-order derivative of f with respect to ϵ evaluated at ϵ′, i.e.,

$$\rho(\epsilon') = \frac{\mathrm{d}}{\mathrm{d}\epsilon} f(\mathbf{Z}_{\epsilon}) \bigg|_{\epsilon = \epsilon'}.$$
By the convexity of (9) and Kachurovskii's theorem (Kachurovskii 1960), ρ is non-decreasing in the interval 0 ≤ ϵ ≤ y_{VJ}. Moreover, ρ(0) < 0, since D^{∗} is a descending direction as stated by Proposition 4; otherwise, Y would satisfy the KKT conditions for local optimality.
In order to compute the optimal step size ϵ^{∗} in (9) we distinguish two cases:

- if ρ(y_{VJ}) ≤ 0, then ϵ^{∗} = y_{VJ}, since f(Z_ϵ) is non-increasing over the feasible set of (9);

- if ρ(y_{VJ}) > 0, then ϵ^{∗} is a zero of ρ, which can be found in general by a dichotomic search that preserves the discordant signs of ρ at the endpoints of the search interval.
Specifically, if the second case holds, the optimal step size ϵ^{∗} can be found by iteratively updating the search interval as follows:

$$\bigl[\ell^{(t)}, r^{(t)}\bigr] = \begin{cases} \bigl[\ell^{(t-1)}, m^{(t-1)}\bigr] & \text{if } \rho\bigl(m^{(t-1)}\bigr) > 0,\\ \bigl[m^{(t-1)}, r^{(t-1)}\bigr] & \text{otherwise}, \end{cases}$$

for all t > 0, starting from [ℓ^{(0)}, r^{(0)}] = [0, y_{VJ}], where m^{(t)} denotes the center of the segment [ℓ^{(t)}, r^{(t)}], i.e., m^{(t)} = (ℓ^{(t)} + r^{(t)})/2. Since an approximation of ϵ^{∗} is sufficient, the dichotomic search is carried out until the interval size falls below a given threshold. If δ is this threshold, the number of iterations required is at worst log_2(y_{VJ}/δ).
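The dichotomic search can be sketched as follows, with ρ passed as a callable; this is a generic bisection under the stated monotonicity assumption, not the authors' implementation:

```python
def optimal_step(rho, upper, delta=1e-6):
    """Dichotomic search for the optimal step size.

    rho   : derivative of f(Z_eps) w.r.t. eps, non-decreasing on [0, upper],
            with rho(0) < 0 (descending direction).
    upper : y_VJ, the largest feasible step.
    Returns upper if rho stays non-positive; otherwise a zero of rho
    located within tolerance delta.
    """
    if rho(upper) <= 0:          # f is non-increasing on the whole interval
        return upper
    lo, hi = 0.0, upper          # invariant: rho(lo) <= 0 < rho(hi)
    while hi - lo > delta:
        mid = 0.5 * (lo + hi)
        if rho(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)
```

The loop halves the interval each iteration, giving the log_2(y_VJ/δ) worst-case bound stated above.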
In some cases, (9) has a closed-form solution, depending on the nature of the Bregman divergence adopted. For instance, if we take the squared ℓ_2 distance as the divergence (i.e., ϕ(x) = ∥x∥²), then f(Z_ϵ) becomes a quadratic polynomial in ϵ, which can be trivially minimized in closed form.
Algorithm
The proposed consensus clustering method is summarized in Algorithm 1. The input arguments consist of the ensemble of clusterings \(\mathcal {E}\), the parameter ϕ for the Bregman divergence, and an initial guess Y ^{(0)} for the cluster assignments (cluster assignments are uniformly distributed in the absence of prior knowledge).
At an abstract level, the algorithm iteratively finds a feasible, descending direction D ^{∗} at the current solution Y ^{(t)}, computes the optimal step ϵ ^{∗} and performs an update of the solution as Y ^{(t+1)}=Y ^{(t)}+ϵ ^{∗} D ^{∗}. This procedure is iterated until a stopping criterion is met.
In order to obtain a per-iteration time complexity that is linear in the number of variables, we exploit the extreme sparseness of the search direction D^{∗} for the update of the matrix Y^{(t)⊤} Y^{(t)} (denoted by A^{(t)} in the pseudocode) and of the gradient vectors \(g_{i}^{(t)}\). Each iteration, indeed, depends on these two fundamental quantities. Specifically, A^{(t+1)} can be obtained in O(n) by simply changing the Jth row and the Jth column of A^{(t)} (this follows from the update formula at line 10). By exploiting A^{(t+1)}, the gradient vectors can be computed in O(Kn). In fact, we obtain \(g_{i}^{(t+1)}\) for all \(i\in \mathcal {O}\setminus\{J\}\) by performing a constant-time operation on each entry of \(g_{i}^{(t)}\) (lines 12–14), and we compute \(g_{J}^{(t+1)}\) (line 15) in O(Kn) as well. Having A^{(t)} and the gradient vectors available allows us to find the search direction D^{∗} at line 8 in O(Kn), since it suffices to access each element of the gradient vectors only once to determine J, U and V. Moreover, the computation of the optimal step size at line 9 can be carried out in O(n log_2(1/δ)) if a dichotomic search is employed, and in constant time when a closed-form solution exists (e.g., if ϕ(x) = ∥x∥²). Finally, the update of the solution at line 11 can be carried out in constant time thanks to the sparsity of D^{∗}. The time complexity of each iteration is thus O(n max(K, log_2(1/δ))).
The most costly part of the algorithm is the initialization (lines 2–5), which has O(n²K) time complexity. Hence, the overall complexity of the algorithm is O(n²K + mn max(K, log_2(1/δ))), where m is the number of required iterations, which is difficult to know in advance. As a rule of thumb, we need m ∈ Ω(nK) iterations to converge, because every entry of Y should be modified at least once. In that case the complexity is dominated by the iterations alone.
Finally, the stopping criterion ideally should test whether D^{∗} is a descending direction. Indeed, if this does not hold, then Y^{(t)} satisfies the KKT conditions (this follows from Proposition 4 in the Appendix) and we can stop. In practice, we simply check whether the quantity g_J(Y^{(t)})_V − g_J(Y^{(t)})_U is below a given threshold τ and stop if it is; indeed, if that quantity is exactly zero, then Y^{(t)} satisfies the KKT conditions. Additionally, we put an upper bound on the number of iterations.
A note on scalability
In applications where the number of data points to cluster is very large, computing the whole co-association matrix becomes infeasible. In such cases, one resorts to sparsifying the co-association matrix by keeping a number of entries that scales linearly with the number of data points.
Our algorithm can be easily adapted to deal with sparse co-association matrices. Assume that \(\mathcal {P} \) contains only a sparse set of observable data point pairs. Let ℓ be the expected average number of entries of \(\mathcal {P} _{i}\), i.e., \(\ell=\sum_{i\in \mathcal {O} }|\mathcal {P} _{i}|/n\), and assume that the input quantities c_ij and N_ij are given only for the pairs \(\{i,j\}\in \mathcal {P} \). Since we need the value of \(\mathbf {y}_{i}^{\top} \mathbf {y}_{j}\) again only for pairs of data points in \(\mathcal {P} \), the computation of A^{(0)} is not fully required and only the entries indexed by \(\mathcal {P} \) need to be computed. This reduces the complexity of line 2 of Algorithm 1 to O(Kℓn), where ℓ ≪ n. The same complexity characterizes the initialization of the gradient at lines 3–5. The subsequent updates of the matrix A^{(t)} at line 10 and of the gradient at lines 12–15 require only O(ℓ) and O(Kℓ) operations, respectively. By adopting a priority queue (e.g., heap-based), the computation of the optimal direction in terms of U, V and J at line 8 requires only O(K log_2(n)) per iteration. This can be achieved by initially storing in the priority queue the best values of U and V for all \(i\in \mathcal {O} \) and by updating the priorities based on the sparse changes in the gradient values. The optimal step at line 9 can be computed in O(ℓ log_2(1/δ)), where δ is the tolerance of the dichotomic search. Finally, the update of Y can still be carried out in constant time. The overall per-iteration complexity thus becomes O(max(ℓ log_2(1/δ), K log_2(n))). As for the number of iterations, the considerations made in Sect. 5.3 still hold.
Related work
Several consensus methods have been proposed in the literature (Fred 2001; Strehl and Ghosh 2003; Fred and Jain 2005; Topchy et al. 2004; Dimitriadou et al. 2002; Ayad and Kamel 2008; Fern and Brodley 2004). Some of these methods are based on the similarity between data points, which is induced by the clustering ensemble, others are based on estimates of similarity between partitions and others cast the problem as a categorical clustering problem. All these methods tend to reveal a more robust and stable clustering solution than the individual clusterings used as input for the problem. A very recent survey can be found in Ghosh et al. (2011).
Strehl and Ghosh (2003) formulated the clustering ensemble problem as an optimization problem based on the maximal average mutual information between the optimal combined clustering and the clustering ensemble, presenting three algorithms, based on graph-theoretic concepts, to solve it. The first one, called the Cluster-based Similarity Partitioning Algorithm (CSPA), uses a graph partitioning algorithm, METIS (Karypis and Kumar 1998), for extracting a consensus partition from the co-association matrix. The second and third algorithms, the Hyper Graph Partitioning Algorithm (HGPA) and the Meta CLustering Algorithm (MCLA), respectively, are based on hypergraphs, where vertices correspond to data points and the hyperedges, which allow the connection of several vertices, correspond to clusters of the clustering ensemble. HGPA obtains the consensus solution using a hypergraph partitioning algorithm, HMETIS (Karypis et al. 1997), while MCLA uses a heuristic that clusters the clusters themselves.
Fern and Brodley (2004) reduce the problem to graph partitioning. The proposed model, called the Hybrid Bipartite Graph Formulation (HBGF), uses as vertices both instances and clusters of the ensemble, retaining all of the information provided by the clustering ensemble and allowing the similarity among instances and among clusters to be considered. The partitioning of this bipartite graph is produced using the multiway spectral graph partitioning algorithm proposed by Ng et al. (2001), which optimizes the normalized cut criterion (Shi and Malik 2000), or, as an alternative, the graph partitioning algorithm METIS (Karypis and Kumar 1998).
These approaches were later extended by Punera and Ghosh (2007, 2008) to allow soft base clusterings in the clustering ensemble, showing that the additional information in the ensemble is useful; the proposed models were soft versions of CSPA, MCLA, and HBGF. Additionally, they proposed using information-theoretic k-means (Dhillon et al. 2003), an algorithm very similar to k-means but using the KL divergence as its distance measure, for clustering in the feature space obtained by concatenating all the posteriors from the ensemble.
Topchy et al. (2003, 2004, 2005) proposed two different formulations, both derived from similarities between the partitions in the ensemble rather than similarities between data points, unlike co-association-based approaches. The first is a multinomial mixture model (MM) over the labels of the clustering ensemble, in which each partition is treated as a feature with categorical attributes. The second is based on the notion of median partition and is called the Quadratic Mutual Information Algorithm (QMI). The median partition is defined as the partition that best summarizes the partitions of the ensemble.
Wang et al. (2009, 2011) extended this idea by introducing a Bayesian version of the multinomial mixture model, Bayesian cluster ensembles (BCE). Although the posterior distribution cannot be computed in closed form, it is approximated using variational inference and Gibbs sampling, in a procedure very similar to that of latent Dirichlet allocation (LDA) models (Griffiths and Steyvers 2004; Steyvers and Griffiths 2007), but applied to a different input feature space, namely the label space of the ensemble. In Wang et al. (2010), a nonparametric version of BCE was proposed.
Ayad and Kamel (2008), following Dimitriadou et al. (2002), proposed the idea of cumulative voting as a solution to the problem of aligning cluster labels. Each clustering of the ensemble is transformed into a probabilistic representation with respect to a common reference clustering. Three voting schemes are presented: Un-normalized fixed-Reference Cumulative Voting (URCV), fixed-Reference Cumulative Voting (RCV), and Adaptive Cumulative Voting (ACV).
Lourenço et al. (2011) modelled the problem of consensus extraction taking pairwise information as input space and using a generative aspect model for dyadic data. A consensus solution is extracted by solving a maximum likelihood estimation problem using the Expectation-Maximization (EM) algorithm.
Our framework is also related to Nonnegative Matrix Factorization (Paatero and Tapper 1994; Lee and Seung 2000), i.e., the problem of approximately factorizing a given matrix M into two entrywise nonnegative matrices F and G such that M≈FG. Indeed, our formulation can be regarded as a kind of matrix factorization of the co-association matrix in terms of a matrix Y^{⊤}Y, under the constraint that Y is column stochastic. This particular setting has been considered, for the ℓ_2 norm, in Arora et al. (2011) and in Nepusz et al. (2008).
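Read this way, the ℓ_2 variant amounts to minimizing ‖C − Y^{⊤}Y‖_F² over column-stochastic Y. The sketch below illustrates this factorization view with plain projected gradient descent; the projection routine, step size, and iteration count are our own illustrative choices, not the optimization algorithm proposed in the paper:

```python
import numpy as np

def project_columns_to_simplex(Y):
    """Euclidean projection of every column of Y onto the probability simplex."""
    K, n = Y.shape
    out = np.empty_like(Y)
    for j in range(n):
        v = np.sort(Y[:, j])[::-1]
        css = np.cumsum(v) - 1.0
        # largest index rho with v[rho] > css[rho] / (rho + 1)
        rho = np.nonzero(v - css / np.arange(1, K + 1) > 0)[0][-1]
        out[:, j] = np.maximum(Y[:, j] - css[rho] / (rho + 1.0), 0.0)
    return out

def factorize_coassociation(C, K, steps=500, lr=0.01, seed=0):
    """Approximate the co-association matrix C (n x n) as Y^T Y with a
    column-stochastic K x n matrix Y, minimizing the squared Frobenius error."""
    rng = np.random.default_rng(seed)
    Y = project_columns_to_simplex(rng.random((K, C.shape[0])))
    for _ in range(steps):
        R = Y.T @ Y - C                                   # current residual
        Y = project_columns_to_simplex(Y - lr * 2.0 * Y @ (R + R.T))
    return Y
```

On a block-structured C the recovered columns tend to concentrate on one cluster each, mirroring the crisp structure; on noisy co-associations they stay genuinely soft.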
Experimental results
In this section we evaluate our formulation on synthetic datasets and on real-world datasets from the UCI Machine Learning Repository and the UCI KDD Archive. We performed four series of experiments: (i) we study the convergence properties of our algorithm on synthetically generated co-association matrices; (ii) we compare the consensus clusterings obtained on different datasets with the known, crisp, ground truth partitions using standard evaluation criteria, and we compare against other consensus clustering approaches; (iii) we perform an experiment on a large-scale dataset with incomplete partitions in the ensemble; (iv) we perform a qualitative analysis of a real-world dataset by deriving additional information from the probabilistic output of our algorithm.
We evaluate the performance of our Probabilistic Consensus Clustering (PCC) algorithm with the KL-divergence (PCC-KL) and with the squared ℓ_2 divergence (PCC-ℓ_2). From the quantitative perspective, we compare PCC-ℓ_2 and PCC-KL against other state-of-the-art consensus algorithms: the classical EAC algorithm using as extraction criteria the hierarchical agglomerative single-link (EAC-SL) and average-link (EAC-AL) algorithms; the Cluster-based Similarity Partitioning Algorithm (CSPA) (Strehl and Ghosh 2003); the Hybrid Bipartite Graph Formulation (HBGF) (Fern and Brodley 2004); the Mixture Model (MM) (Topchy et al. 2004, 2005); and the Quadratic Mutual Information Algorithm (QMI) (Topchy et al. 2003, 2005).
In order to evaluate the quality of a consensus clustering result against a hard ground truth partition, we convert our probabilistic assignments into hard assignments according to a maximum likelihood criterion. We then compare two hard clusterings \(\mathcal{P}=\{\mathcal {P} _{1},\dots,\mathcal {P} _{k}\}\) and \(\mathcal {Q} =\{\mathcal {Q} _{1},\dots,\mathcal {Q} _{k}\}\) using the \(\mathcal{H}\) criterion based on cluster matching (Meila 2003) and the Adjusted Rand index (Jain and Dubes 1988), which is based on counting pairs. Note that we assume without loss of generality that \(\mathcal {P} \) and \(\mathcal {Q} \) have the same number of elements, since we can add empty clusters where needed. The \(\mathcal{H}\) criterion (Meila 2003) gives the accuracy of the partitions and is obtained by finding the optimal one-to-one matching between the clusters in \(\mathcal {P} \) and the ground truth labels in \(\mathcal {Q} \):
\[\mathcal{H}(\mathcal{P},\mathcal{Q})=\frac{1}{n}\max_{v}\sum_{i=1}^{k}\bigl|\mathcal{P}_{i}\cap\mathcal{Q}_{v_{i}}\bigr|,\]
where the vector v in the maximization runs over all possible permutations of the vector (1,…,k).
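For moderate k, the maximization over permutations can be carried out directly; a small illustrative sketch taking hard label vectors as input (for larger k one would switch to the Hungarian algorithm):

```python
import numpy as np
from itertools import permutations

def h_criterion(p_labels, q_labels, k):
    """H criterion: fraction of points matched under the best one-to-one
    relabelling of the clusters in P against those in Q."""
    n = len(p_labels)
    # contingency[i, j] = number of points in cluster i of P and cluster j of Q
    contingency = np.zeros((k, k), dtype=int)
    for p, q in zip(p_labels, q_labels):
        contingency[p, q] += 1
    best = max(sum(contingency[i, v[i]] for i in range(k))
               for v in permutations(range(k)))
    return best / n
```

For example, a prediction that merely renames every cluster still scores 1.0.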
When a soft ground truth partition is given in terms of a probabilistic assignment \(\mathbf {Z}\in \varDelta _{k}^{n}\), we evaluate the divergence between a soft consensus partition \(\mathbf {Y}\in \varDelta _{k}^{n}\) and Z in terms of the Jensen-Shannon (JS) divergence. In more detail, let D _{ JS }(⋅∥⋅) denote the JS-divergence between two distributions given as points of Δ _{ k }. Then the divergence between Z and Y is given by
\[\mathcal{J}(\mathbf{Y},\mathbf{Z})=\min_{\mathbf{P}}\frac{1}{n}\sum_{i=1}^{n}D_{JS}(\mathbf{P}\mathbf{y}_{i}\,\|\,\mathbf{z}_{i}),\]
where the matrix P in the minimization runs over all possible k×k permutation matrices. Similarly to the case of hard partitions, we assume without loss of generality that Z and Y have the same number of rows, since we can add zero rows if necessary to fill the gap. Note that \(0\leq \mathcal{J}(\mathbf {Y},\mathbf {Z})\leq 1\) holds for any \(\mathbf {Y},\mathbf {Z}\in \varDelta _{K}^{n}\).
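A direct transcription of this criterion (JS-divergence taken in log base 2, so that every term lies in [0, 1], and brute-force minimization over the K! row permutations, which is fine for small K):

```python
import numpy as np
from itertools import permutations

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions (log base 2)."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def j_criterion(Y, Z):
    """Average JS-divergence between the columns of Y and Z, minimized over
    all permutations of the K rows (i.e., over cluster relabellings)."""
    K, n = Y.shape
    return min(
        np.mean([js_divergence(Y[list(v), i], Z[:, i]) for i in range(n)])
        for v in permutations(range(K))
    )
```

Because the permutation is optimized out, a consensus that only renames clusters scores a divergence of zero.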
For the qualitative experiments, we analyse the probabilistic assignments Y in order to exploit the information about cluster uncertainty. For this analysis, we remove probability values lower than a predefined threshold δ and then renormalize each column to sum to one. We measure the normalized similarity between two clusters i, j as the expected number of common elements over the expected cardinality of the ith cluster:
\[m_{ij}=\frac{\sum_{l=1}^{n}y_{il}\,y_{jl}}{\sum_{l=1}^{n}y_{il}}.\]
Given a set of data points \(\{\mathbf {x}_{i}\}_{i=1}^{n}\), we define the centroid of class k according to Y as the weighted mean
\[\mathbf{c}_{k}=\frac{\sum_{i=1}^{n}y_{ki}\,\mathbf{x}_{i}}{\sum_{i=1}^{n}y_{ki}}.\]
Given the weight matrix M=[m _{ ij }], which is usually sparse, and the centroids, we can visualize the obtained clusters and their relationships simply by drawing them in the plane as a weighted graph. The structures found in these graphs, such as paths or cliques, highly depend on the type of data and on the geometry of the consensus set, leading to a different and interesting way of interpreting the consensus results.
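These quantities follow directly from Y. A sketch, assuming Y is the K×n matrix of probabilistic assignments with columns summing to one (the threshold value and function names are illustrative):

```python
import numpy as np

def threshold_assignments(Y, delta=0.05):
    """Drop probabilities below delta and renormalize each column to sum to one."""
    Y = np.where(Y >= delta, Y, 0.0)
    return Y / Y.sum(axis=0, keepdims=True)

def cluster_similarity(Y):
    """m[i, j]: expected number of points shared by clusters i and j, divided
    by the expected cardinality of cluster i (note: not symmetric)."""
    shared = Y @ Y.T                          # expected pairwise overlaps
    return shared / Y.sum(axis=1, keepdims=True)

def class_centroids(X, Y):
    """Centroid of each class: the Y-weighted mean of the data points (X is n x d)."""
    return (Y @ X) / Y.sum(axis=1, keepdims=True)
```

With crisp assignments, cluster_similarity reduces to the identity matrix and class_centroids to the per-cluster means; soft overlap shows up as off-diagonal mass.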
Simulated data
We first study the proposed formulation in a synthetic experiment with soft partitions as ground truth. A soft partition Y ^{∗} is determined by generating 4 isotropic, bivariate Gaussian distributions, each consisting of 200 data points, with mean vectors randomly selected in the four quadrants, and by computing for each point the normalized probability of its having been generated by each of the 4 Gaussians. Given a soft partition Y ^{∗}, we artificially generated an ensemble by randomly sampling N=1000 hard partitions with cluster assignments drawn according to Y ^{∗}, and we constructed the corresponding co-association matrices. Figure 1(a) illustrates one example of such a dataset, where there is some overlap between the components, and Fig. 1(b) shows the corresponding co-association matrix. We generated 10 different datasets according to this procedure.
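The ensemble-generation step can be sketched as follows: each of the N hard partitions draws one label per point from its soft assignment, and the co-association matrix accumulates how often each pair of points ends up together (function and parameter names are illustrative):

```python
import numpy as np

def sample_coassociation(Y_star, N=1000, seed=0):
    """Sample N hard partitions from the soft ground truth Y_star (K x n,
    columns sum to one) and return the co-association frequency matrix."""
    rng = np.random.default_rng(seed)
    K, n = Y_star.shape
    C = np.zeros((n, n))
    for _ in range(N):
        # draw one hard label per point from its assignment distribution
        labels = np.array([rng.choice(K, p=Y_star[:, i]) for i in range(n)])
        C += labels[:, None] == labels[None, :]
    return C / N
```

With a crisp Y_star every within-cluster entry equals 1 exactly; overlap between the Gaussian components shows up as intermediate co-association frequencies.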
For each dataset we ran our PCC-KL and PCC-ℓ_2 algorithms with the purpose of recovering the ground truth soft partition Y ^{∗}. Although the optimal number of clusters is K=4, we ran our algorithms with a larger value, K=8. This is not a problem, as our formulation can automatically tune itself to select a smaller number of clusters. Indeed, Fig. 2(a) shows the estimated cluster assignments Y corresponding to the dataset in Fig. 1, where only 4 components have significant probabilities, confirming our previous claim. We evaluated the divergence between the ground truth soft partition and the one recovered by our algorithms on each of the 10 datasets using the \(\mathcal {J}\) criterion introduced at the beginning of Sect. 7. Both our algorithms obtained an average divergence of 0.0012 with a standard deviation of ±0.00005, which indicates a good recovery of the ground truth probabilistic cluster assignments.
UCI and synthetic data
We followed the usual strategy of producing clustering ensembles and combining them using the co-association matrix. Two different types of ensembles were created: (1) using k-means with random initialization and a random number of clusters (Lourenço et al. 2010), splitting natural clusters into micro-blocks; (2) combining multiple algorithms (agglomerative hierarchical algorithms: single, average, Ward, and centroid link; k-means (Jain and Dubes 1988); spectral clustering (Ng et al. 2001)) with different numbers of clusters, inducing block-wise co-association matrices.
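A minimal sketch of ensemble (1): repeated runs of a plain Lloyd k-means with random initialization and a random number of clusters, accumulated into a co-association matrix. The k range and iteration count here are illustrative, and a library k-means would serve equally well:

```python
import numpy as np

def kmeans_labels(X, k, rng, iters=20):
    """Plain Lloyd's algorithm with centers initialized at random data points."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels

def random_kmeans_coassociation(X, n_partitions=50, k_min=2, k_max=10, seed=0):
    """Ensemble (1): random-k, random-init k-means runs accumulated into a
    co-association frequency matrix."""
    rng = np.random.default_rng(seed)
    n = len(X)
    C = np.zeros((n, n))
    for _ in range(n_partitions):
        k = int(rng.integers(k_min, k_max + 1))
        labels = kmeans_labels(X, k, rng)
        C += labels[:, None] == labels[None, :]
    return C / n_partitions
```

Choosing k well above the true number of clusters is what splits natural clusters into the micro-blocks mentioned above.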
Table 3 summarizes the main characteristics of the UCI and synthetic datasets used in the evaluation, and the parameters used for generating ensemble (2). Figure 3 illustrates the synthetic datasets used in the evaluation: (a) rings; (b) image1.
We summarize the performance of both algorithms over several runs, accounting for possibly different solutions due to initialization, in terms of the \(\mathcal{H}\) and Adjusted Rand criteria, in Tables 4, 5 and 6, 7, respectively. For both validation indices we present the average performance (avg), standard deviation (std), maximum (max), and minimum (min), highlighting in bold the best results for each dataset.
The performance of PCC-KL and PCC-ℓ_2 depends on the type of ensemble. On ensemble (1), PCC-KL and PCC-ℓ_2 generally perform worse than EAC and CSPA (on both the Adjusted Rand index and CI), which seem very well suited to this kind of ensemble. Nevertheless, on the UCI datasets both obtain promising results: PCC-ℓ_2 is the best on 1 (out of 9) datasets, and PCC-KL is the best on 1 (out of 9) and very close to the best consensus in several other cases. On ensemble (2), PCC-KL obtains the best results on almost all datasets, 7 (out of 9).
It is also important to notice that the standard deviation of the proposed methods is very low, being very close to zero on almost every dataset.
Figure 4 shows examples of the obtained co-association matrices, reordered according to the VAT algorithm (Bezdek and Hathaway 2002) to highlight the clustering structure. The color scheme ranges from black (c _{ ij } =0) to white (c _{ ij }=N _{ ij }), corresponding to the magnitude of the similarity. As can be seen, the co-association matrix of ensemble (1) does not have a very evident block-wise structure, since it was produced by splitting natural clusters into smaller ones, inducing micro-blocks in the co-association matrix; for ensemble (2), the co-association matrices have a much more block-wise form, as they were generated by combining several algorithms with numbers of clusters ranging from small to large. The results show that block-wise matrices are very well suited to the proposed model, even in cases with considerable overlap.
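VAT itself is a small, self-contained procedure: starting from one endpoint of the largest dissimilarity, it repeatedly appends the unvisited point closest to the points already ordered (a Prim-like traversal), so that mutually close points end up adjacent. A sketch; for a co-association matrix C built from N partitions, the dissimilarity would be, e.g., D = 1 − C/N:

```python
import numpy as np

def vat_order(D):
    """VAT reordering of a symmetric dissimilarity matrix D."""
    n = D.shape[0]
    # start from one endpoint of the largest dissimilarity
    first = int(np.unravel_index(np.argmax(D), D.shape)[0])
    order = [first]
    rest = [i for i in range(n) if i != first]
    while rest:
        sub = D[np.ix_(order, rest)]              # ordered -> unvisited distances
        j = rest[int(np.argmin(sub.min(axis=0)))]  # closest unvisited point
        order.append(j)
        rest.remove(j)
    return order
```

Permuting the rows and columns of C by the returned order makes the block structure visible in a heat map.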
A largescale experiment
In order to show that our algorithm can also be used on large-scale datasets, we report here an experiment on a KDD Cup 1999 dataset. From the available datasets we analysed a subset of “kddcup.data_10_percent”, consisting of 120,000 data points characterized by 41 attributes distributed over 3 classes. The preprocessing consisted in standardizing numerical features and discretizing categorical features, arriving at a 39-dimensional feature space. We produced an ensemble consisting of 100 k-means partitions obtained on random subsets of the dataset (sampling rate 50%) with random initializations and random numbers of clusters (2≤K≤10). Since the ensemble is composed of incomplete partitions, the consensus clustering phase becomes more challenging.
In order to cope with the large number of data points, which makes the construction of the co-association matrix impossible from both a space and a computational time perspective, we ran a sparsified version of our algorithm as described in Sect. 5.4. Specifically, we created \(\mathcal {P} \) by sampling a share of 0.25‰ of the available data point pairs (around 8 billion). Our algorithms (PCC-ℓ_2 and PCC-KL) were run with a maximum number of nK iterations. Our non-parallelized C implementations of PCC-ℓ_2 and PCC-KL took on average 13.8 s and 16.7 s, respectively, to deliver a solution on a dual-core 64-bit Pentium 2.8 GHz with 4 GB RAM (only one core was effectively used). We were able to compare our algorithm only against CSPA, which nevertheless obtained competitive results in the previous set of experiments. All other approaches could not be run, either due to the large size of the dataset or because of their inability to handle incomplete partitions in the ensemble.
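The sparsification amounts to estimating co-association frequencies only on a random subset of pairs, never materializing the full n×n matrix; with incomplete partitions, each frequency is normalized by the number of partitions in which both points actually appear. A sketch under these assumptions (labels of −1 mark points absent from a partition; names are illustrative):

```python
import numpy as np

def sampled_pair_frequencies(ensemble, n_pairs, seed=0):
    """Co-association frequencies on a random sample of point pairs.
    `ensemble` is a list of label arrays; -1 marks a point absent from that
    (incomplete) partition. n_pairs must not exceed the number of pairs."""
    rng = np.random.default_rng(seed)
    n = len(ensemble[0])
    pairs = set()
    while len(pairs) < n_pairs:
        i, j = (int(x) for x in rng.integers(0, n, size=2))
        if i < j:
            pairs.add((i, j))
    freqs = {}
    for i, j in pairs:
        together = seen = 0
        for labels in ensemble:
            if labels[i] >= 0 and labels[j] >= 0:  # both points observed here
                seen += 1
                together += int(labels[i] == labels[j])
        freqs[(i, j)] = together / seen if seen else None
    return freqs
```

Memory then scales with the number of sampled pairs rather than with n², which is what makes the 120,000-point experiment feasible.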
We report in Fig. 5 the accuracy (\(\mathcal{H}\) criterion) obtained by the algorithms for varying values of the parameter K. At the optimal number of clusters, K=3, all approaches achieve their best score, but our approach outperforms CSPA with both the KL-divergence and the squared ℓ_2 divergence, the former being slightly better than the latter. Moreover, it turns out that our approach can automatically tune the number of clusters, thus being more robust to overestimations of the parameter K. Indeed, we remark that for our approach the parameter K is intended as a maximum number of clusters rather than the exact number of clusters that the algorithm must deliver. The partitions in the ensemble achieved an average accuracy of 79%. Clearly, due to the presence of incomplete partitions, this score was computed by considering only the data points that were effectively used in each partition.
Our consensus solution provides a considerable improvement of this score, confirming the effectiveness of our algorithm also in the presence of incomplete partitions in the ensemble.
In Fig. 6 we report the accuracy obtained when varying the percentage of observed entries in the co-association matrix. Performance remains essentially constant for percentages larger than 0.08‰. When the number of observed entries is reduced further, we experience the expected performance drop.
Visualizing probabilistic relations
It was observed in Färber et al. (2010) and Cui et al. (2010) that new structures can be discovered in data using previously known information. The case study we consider here is the Pen-Digits dataset (Frank and Asuncion 2012), which contains handwritten digits produced by different persons. Each digit is stored as a sequence of 8 (x,y) positions, collected at different time intervals during the execution of the digit. A manual analysis in Cui et al. (2010) highlights that the same digit can be written in different ways, but this information is not contained in the ground truth, which simply collects all instances of a digit in the same class. These observations become apparent if we build a consensus matrix for each digit taken in isolation and then visualize the obtained classes. Each digit can be written in different ways, but some of these ways are not completely different, so when building the consensus matrix we can see that the corresponding classes overlap. As an example we consider the digit ‘2’. In Fig. 7(a) we can see that the co-association matrix contains 5 blocks, two of which, the second and the third, are highly overlapped. The resulting matrix Y (Fig. 7(b)) reflects the overlap by assigning uncertainty to the two overlapping classes. The uncertainty is not symmetric: the third class actually appears to be a subclass of the second. In Fig. 8 we show the five class centroids and their pairwise similarities. The centroids are ordered so that the upper image is the centroid of the first class and the others follow in clockwise order. Each centroid visualizes eight points and the order in which they appear. The visualization of the centroids explains the similarities/differences that are numerically encoded on the edges; in particular, the dependence of class three on classes one and two is clear.
Conclusions
In this paper, we introduced a new probabilistic consensus clustering formulation based on the EAC paradigm. Each entry of the co-association matrix derived from the ensemble is regarded as a Binomial random variable, parametrized by the unknown class assignments. We showed that the log-likelihood function corresponding to this model coincides with the KL-divergence between the co-association relative frequencies and the co-occurrence probabilities parametrized by the Binomial random variables. This formulation can be seen as a special case of a more general setting in which the KL-divergence is replaced by an arbitrary Bregman divergence. We proposed an algorithm to find a consensus clustering solution according to our model, which works with any double-convex Bregman divergence, and we showed how it can be adapted to deal with large-scale datasets. Experiments on synthetic and real-world datasets demonstrated the effectiveness of our approach with ensembles composed of heterogeneous partitions obtained from multiple algorithms (agglomerative hierarchical algorithms, k-means, spectral clustering) with varying numbers of clusters. Additionally, we showed that our algorithm can deal with large-scale datasets and can successfully be applied to ensembles with incomplete partitions. On different datasets and ensembles, we outperformed the competing state-of-the-art algorithms and obtained particularly strong results in the large-scale experiment. The qualitative analysis of the probabilistic consensus solutions provided some evidence that the proposed formulation can discover new structures in data. For the Pen-Digits dataset, we showed visual relationships between overlapping clusters representing the same digit, using the centroids of each cluster and inter-cluster similarities obtained from the probabilities of the consensus solution.
References
Arora, R., Gupta, M., Kapila, A., & Fazel, M. (2011). Clustering by leftstochastic matrix factorization. In L. Getoor & T. Scheffer (Eds.), ICML (pp. 761–768). Omnipress.
Ayad, H., & Kamel, M. S. (2008). Cumulative voting consensus method for partitions with variable number of clusters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(1), 160–173.
Banerjee, A., Krumpelman, C., Basu, S., Mooney, R. J., & Ghosh, J. (2005a). Modelbased overlapping clustering. In Int. conf. on knowledge discovery and data mining (pp. 532–537).
Banerjee, A., Merugu, S., Dhillon, I., & Ghosh, J. (2005b). Clustering with Bregman divergences. Journal of Machine Learning Research, 6, 1705–1749.
Bezdek, J. (1981). Pattern recognition with fuzzy objective function algorithms. Norwell: Kluwer Academic.
Bezdek, J., & Hathaway, R. (2002). VAT: a tool for visual assessment of (cluster) tendency. In Proceedings of the 2002 international joint conference on neural networks 2002, IJCNN’02 (Vol. 3, pp. 2225–2230).
Boyd, S., & Vandenberghe, L. (2004). Convex optimization (1st ed.). Cambridge: Cambridge University Press.
Cui, Y., Fern, X. Z., & Dy, J. G. (2010). Learning multiple nonredundant clusterings. In Transactions on Knowledge Discovery from Data (TKDD) (Vol. 4, pp. 1–32).
Dhillon, I. S., Mallela, S., & Kumar, R. (2003). A divisive informationtheoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3, 1265–1287.
Dimitriadou, E., Weingessel, A., & Hornik, K. (2002). A combination scheme for fuzzy clustering. In AFSS’02 (pp. 332–338).
Färber, I., Günnemann, S., Kriegel, H., Kröger, P., Müller, E., Schubert, E., Seidl, T., & Zimek, A. (2010). On using classlabels in evaluation of clusterings. In MultiClust: 1st international workshop on discovering, summarizing and using multiple clusterings.
Fern, X. Z., & Brodley, C. E. (2004). Solving cluster ensemble problems by bipartite graph partitioning. In Proc. ICML ’04.
Frank, A., & Asuncion, A. (2012). UCI machine learning repository. http://archive.ics.uci.edu/ml.
Fred, A. (2001). Finding consistent clusters in data partitions. In J. Kittler & F. Roli (Eds.), Multiple classifier systems (Vol. 2096, pp. 309–318). Berlin: Springer.
Fred, A., & Jain, A. (2002). Data clustering using evidence accumulation. In Proc. of the 16th int’l conference on pattern recognition (pp. 276–280).
Fred, A., & Jain, A. (2005). Combining multiple clustering using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 835–850.
Fred, A., & Jain, A. (2006). Learning pairwise similarity for data clustering. In Proc. of the 18th int’l conference on pattern recognition (ICPR), Hong Kong (Vol. 1, pp. 925–928).
Ghosh, J., & Acharya, A. (2011). Cluster ensembles. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(4), 305–315.
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl 1), 5228–5235.
Heller, K., & Ghahramani, Z. (2007). A nonparametric Bayesian approach to modeling overlapping clusters. In Int. conf. on AI and statistics.
Jain, A. K., & Dubes, R. (1988). Algorithms for clustering data. New York: Prentice Hall.
Jardine, N., & Sibson, R. (1968). The construction of hierarchic and non-hierarchic classifications. Computer Journal, 11, 177–184.
Kachurovskii, I. R. (1960). On monotone operators and convex functionals. Uspehi Matematičeskih Nauk, 15(4), 213–215.
Karypis, G., Aggarwal, R., Kumar, V., & Shekhar, S. (1997). Multilevel hypergraph partitioning: applications in VLSI domain. In Proc. design automation conf.
Karypis, G., & Kumar, V. (1998). Multilevel algorithms for multiconstraint graph partitioning. In Proceedings of the 10th supercomputing conference.
Lee, D. D., & Seung, H. S. (2000). Algorithms for nonnegative matrix factorization. In T. K. Leen, T. G. Dietterich, & V. Tresp (Eds.), NIPS (pp. 556–562). Cambridge: MIT Press.
Lourenço, A., Fred, A., & Figueiredo, M. (2011). A generative dyadic aspect model for Evidence Accumulation Clustering. In Proc. 1st int. conf. similaritybased pattern recognition, SIMBAD’11 (pp. 104–116). Berlin/Heidelberg: Springer.
Lourenço, A., Fred, A., & Jain, A. K. (2010). On the scalability of evidence accumulation clustering. In Proc. 20th international conference on pattern recognition (ICPR), Istanbul, Turkey.
Mei, J. P., & Chen, L. (2010). Fuzzy clustering with weighted medoids for relational data. Pattern Recognition, 43(5), 1964–1974.
Meila, M. (2003). Comparing clusterings by the variation of information. In Proc. of the 16th annual conf. on computational learning theory (COLT). Berlin: Springer.
Nepusz, T., Petróczi, A., Négyessy, L., & Bazsó, F. (2008). Fuzzy communities and the concept of bridgeness in complex networks. Physical Review A, 77, 016107.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: analysis and an algorithm. In NIPS (pp. 849–856). Cambridge: MIT Press.
Paatero, P., & Tapper, U. (1994). Positive matrix factorization: a nonnegative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2), 111–126.
Punera, K., & Ghosh, J. (2007). Soft consensus clustering. In Advances in fuzzy clustering and its applications. New York: Wiley.
Punera, K., & Ghosh, J. (2008). Consensusbased ensembles of soft clusterings. Applied Artificial Intelligence, 22(7&8), 780–810.
Rota Bulò, S., Lourenço, A., Fred, A., & Pelillo, M. (2010). Pairwise probabilistic clustering using evidence accumulation. In Proc. 2010 int. conf. on structural, syntactic, and statistical pattern recognition, SSPR&SPR’10 (pp. 395–404).
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 22(8), 888–905.
Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In Handbook of latent semantic analysis. Mahwah: Lawrence Erlbaum.
Strehl, A., & Ghosh, J. (2003). Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3, 583–617.
Topchy, A., Jain, A., & Punch, W. (2003). Combining multiple weak clusterings. In IEEE intl. conf on data mining, Melbourne (pp. 331–338).
Topchy, A., Jain, A., & Punch, W. (2004). A mixture model of clustering ensembles. In Proc. of the SIAM conf. on data mining.
Topchy, A., Jain, A. K., & Punch, W. (2005). Clustering ensembles: models of consensus and weak partitions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12), 1866–1881.
Wang, H., Shan, H., & Banerjee, A. (2009). Bayesian cluster ensembles. In 9th SIAM int. conf. on data mining.
Wang, H., Shan, H., & Banerjee, A. (2011). Bayesian cluster ensembles. Statistical Analysis and Data Mining, 4(1), 54–70.
Wang, P., Domeniconi, C., & Laskey, K. B. (2010). Nonparametric Bayesian clustering ensembles. In ECML PKDD’10 (pp. 435–450).
Acknowledgements
This work was partially financed by an ERCIM “Alain Bensoussan” Fellowship Programme under the European Union Seventh Framework Programme (FP7/2007-2013), grant agreement n. 246016, by Fundação para a Ciência e Tecnologia, under grants PTDC/EIA-CCO/103230/2008, SFRH/PROTEC/49512/2009 and PEst-OE/EEI/LA0008/2011, and by the Área Departamental de Engenharia Electrónica e Telecomunicações e de Computadores of Instituto Superior de Engenharia de Lisboa, whose support the authors gratefully acknowledge.
Editors: Emmanuel Müller, Ira Assent, Stephan Günnemann, Thomas Seidl, and Jennifer Dy.
Appendix: Proof of results
Proposition 1
Let ϕ(x)=−H(x). Maximizers of (1) are minimizers of (2) and vice versa.
Proof
We have that for all \(i,j\in \mathcal {O}\), i≠j,
where ψ(x)=(x,1−x)^{⊤}.
The result follows from
□
Proposition 2
Any search direction \(\mathbf {D}\in \mathcal {D} (\mathbf {Y})\) is feasible for (2).
Proof
Let \(\mathbf {D}=(\mathbf {e}_{K}^{u}-\mathbf {e}_{K}^{v})(\mathbf {e}_{n}^{j})^{\top}\in \mathcal {D} (\mathbf {Y})\) and Z _{ ϵ }=Y+ϵ D. For any ϵ,
As ϵ increases, only the (v,j)th entry of Z _{ ϵ }, which is given by y _{ vj }−ϵ, decreases. This entry is nonnegative for all values of ϵ satisfying ϵ≤y _{ vj }. Hence, \(\mathbf {Z}_{\epsilon}\in \varDelta _{K}^{n}\) for all sufficiently small positive values of ϵ. □
Proposition 3
A solution to (5) is
where J is given in (8), U=U _{ J }, and V=V _{ J } (U _{ i } and V _{ i } are defined in (6)).
Proof
Let \(\mathcal {I} (\mathbf {Y})\) be a set of triplets of indices given by
Optimization problem (5) can be rewritten as follows by exploiting the definition of \(\mathcal {D} (\mathbf {Y})\):
and \(\mathbf {D}^{*}=(\mathbf {e}_{K}^{U}-\mathbf {e}_{K}^{V})(\mathbf {e}_{n}^{J})^{\top}\). Here, J can be further characterized as the solution to
The result follows by exploiting the definition of U _{ i } and V _{ i } in (6). □
Proposition 4
If \(\mathbf {Y}\in \varDelta _{K}^{n}\) does not satisfy the KKT firstorder necessary conditions for (2) then the search direction D ^{∗} at Y, which is solution to (5), is descending.
Proof
To prove the result we have to show that f(Y+ϵ D ^{∗})<f(Y) holds for all sufficiently small values of ϵ. This is equivalent to proving that
The KKT necessary conditions for local optimality for (2) are the following:
where \(\mathbf {M}=(\boldsymbol{\mu}_{1},\dots,\boldsymbol{\mu}_{n})\in\mathbb {R}^{K\times n}_{+}\) and \(\boldsymbol{\lambda}\in\mathbb {R}^{n}\) are the Lagrangian multipliers. We can express the Lagrange multipliers λ in terms of Y from the relation
which yields \(\lambda_{i}=\mathbf {y}_{i}^{\top}g_{i}(\mathbf {Y})\) for all \(i\in \mathcal {O}\). This can then be used to obtain an alternative characterization of the KKT conditions, where the Lagrange multipliers do not appear:
where
The two characterizations (12) and (13) are equivalent. This can be verified by exploiting the nonnegativity of both matrices M and Y and the complementary slackness conditions. Additionally, we have that \([r_{j}(\mathbf {Y})]_{U_{j}}\leq 0\leq [r_{j}(\mathbf {Y})]_{V_{j}}\) for all \(j\in \mathcal {O}\). In fact,
and by subtracting \(\mathbf {y}_{j}^{\top}g_{j}(\mathbf {Y})\) we obtain the desired relation
Now, by assuming Y to be feasible but not satisfying the KKT conditions, we derive from (13) that there exists \(j\in \mathcal {O}\) such that at least one of the two following cases holds: [r _{ j }(Y)]_{ u }<0 for some u∈{1,…,K}, or [r _{ j }(Y)]_{ v }>0 for some v∈σ(y _{ j }). This, by definition of U _{ j },V _{ j } and by (14), implies that \([r_{j}(\mathbf {Y})]_{U_{j}}<0\leq [r_{j}(\mathbf {Y})]_{V_{j}}\) or \([r_{j}(\mathbf {Y})]_{U_{j}}\leq 0< [r_{j}(\mathbf {Y})]_{V_{j}}\). Hence, by definition of J,
from which the result follows. □
Proposition 5
The optimization problem in (9) is convex, provided that the Bregman divergence is doubleconvex.
Proof
The search direction D ^{∗}, the solution to (5), is everywhere null except for two entries of the Jth column. This, together with the fact that the sum in (3) is taken over all pairs (i,j) such that i≠j, implies that the second argument of every B _{ ϕ }(⋅∥⋅) function is linear in ϵ. The Bregman divergence B _{ ϕ }(⋅∥⋅) adopted is by assumption double-convex, in particular convex in its second argument, and the same trivially holds for the function d _{ ϕ }. Since convexity is preserved by the composition of convex functions with linear ones and by the sum of convex functions (Boyd and Vandenberghe 2004), it follows that the minimization problem in (9) is convex as well. □
Lourenço, A., Rota Bulò, S., Rebagliati, N. et al. Probabilistic consensus clustering using evidence accumulation. Mach Learn 98, 331–357 (2015). https://doi.org/10.1007/s1099401353396
Keywords
 Consensus clustering
 Evidence Accumulation
 Ensemble clustering
 Bregman divergence