In this section, we first formally define the problem of alternative clustering in Sect. 3.1 and then detail our algorithm COGNAC to address such problem in Sect. 3.2. Then, based on COGNAC, we propose another algorithm (SGAC) for generating a set of different alternative clusterings in Sect. 3.3. Some techniques for analyzing and visualizing the Pareto front returned by our algorithm are also presented in Sect. 3.4.
The alternative clustering problem
Given a dataset \(\mathbf{X}=\lbrace \mathbf{x}_{i} \rbrace_{i=1}^{N}\) of N objects, the traditional clustering problem is to partition this dataset into K disjoint clusters such that the clustering quality is as high as possible. Let C be a clustering solution where C(i) is the index of the cluster that x
i
belongs to, \(\mathbb{D}(\mathbf{C}, \overline{\mathbf{C}})\) be the dissimilarity between two clusterings C and \(\overline{\mathbf{C}}\), and \(\mathbb{Q}(\mathbf{C})\) be the inner quality of a clustering C. We defer the definition of \(\mathbb{D}(\mathbf{C}, \overline{\mathbf{C}})\) and \(\mathbb{Q}(\mathbf{C})\) to Sect. 3.2 where we also present other components of our algorithm and define in this section the dissimilarity between a clustering and a set of clusterings, the overall quality of a clustering and the dominance relation between two clusterings.
Definition 1
(Dissimilarity)
The dissimilarity between a clustering C and a set \(\overline{\mathbf{S}}\) of clusterings is the minimum dissimilarity between C and all clusterings \(\overline{\mathbf{C}}\in \overline{\mathbf{S}}\):
$$ \mathbb{D}(\mathbf{C}, \overline{\mathbf{S}}) = \min_{\overline{\mathbf{C}} \in \overline{\mathbf{S}}} \mathbb{D}(\mathbf{C}, \overline{\mathbf{C}}) $$
(1)
In Fig. 1a and Fig. 1b, we illustrate the benefit of using minimum dissimilarity over maximum dissimilarity and average dissimilarity in defining the dissimilarity between a clustering and a clustering set. Assume that the clustering set is \(\overline{\mathbf{S}}=\lbrace \mathbf{C}_{1}, \mathbf{C}_{2}\rbrace\) and we want to select a clustering between two clusterings C
a
and C
b
based on the dissimilarity. As it can be seen in Fig. 1a, although C
a
is very similar (or can be equal to) C
1 but its maximum dissimilarity to the clustering set \(\overline{\mathbf{S}}\) (which is the dissimilarity between C
a
and C
2) is greater than the maximum dissimilarity of C
b
to \(\overline{\mathbf{S}}\) (which is the dissimilarity between C
b
and C
2). Therefore, based on the maximum dissimilarity, C
a
will be selected. However, from human interpretation, C
b
is more different from the clustering set \(\overline{\mathbf{S}}\) than C
a
. Similarly, in Fig. 1b, C
a
has a higher average dissimilarity to the clustering set \(\overline{\mathbf{S}}\) than C
b
and C
a
will be selected if the average dissimilarity is used. However, C
b
is clearly more different from the clustering set \(\overline{\mathbf{S}}\) than C
a
.
Definition 2
(Overall Clustering Quality)
The overall quality of a clustering C is characterized by the following bi-objective function \(\mathbb{F}(\mathbf{C}, \overline{\mathbf{S}})\):
Definition 3
(Clustering Dominance)
Given a set \(\overline{\mathbf{S}}\) of negative clusterings, a clustering C
dominates another clustering C′ w.r.t \(\overline{\mathbf{S}}\), written \(\mathbf{C} \succ_{\overline{\mathbf{S}}} \mathbf{C}'\) iff one quality objective of C is better than or equal to that of C′ and the other objective of C is strictly better than that of C′. Formally, \(\mathbf{C}\succ_{\overline{\mathbf{S}}} \mathbf{C}'\) if and only if the following conditions hold:
Finally, the alternative clustering problem is defined as follows:
Definition 4
(Alternative Clustering)
Given a set \(\overline{\mathbf{S}}\) of negative clusterings, alternative clustering is the problem of finding a representative set of clusterings C along the Pareto front defined by the bi-objective function \(\mathbb{F}(\mathbf{C}, \overline{\mathbf{S}})\).
A cluster-oriented algorithm for alternative clustering
In multi-objective optimization problems, usually there are efficient optimal solutions which cannot be compared. Therefore, one of the main goals of multi-objective optimization is to approximate the Pareto front. Evolutionary algorithms (EAs) possess several characteristics that are well suitable for multi-objective optimization. EAs approximate the Pareto front in a single run by maintaining a solution set or population. In each iteration, this solution set Q is modified by two basic steps: selection and variation. The selection step chooses only well-adapted candidates from Q to form a set P of parent solutions. Then, in the variation step, the parent candidates in P are used to produce the next generation through recombination and mutation. The two steps are repeated until a number of generations is reached. We adapt one of the most popular EAs, NSGAII (Deb et al. 2002), for the alternative clustering problem defined in Sect. 3.1.
Applying EAs to the clustering problem is not straightforward because the traditional recombination and mutation operators of EAs are not very suitable for the clustering problem. The first reason is that they often cannot produce offspring solutions with good characteristics inherited from their parents (Hruschka et al. 2009). Besides, in the case of fixed number of clusters, these operators can also produce invalid solutions. Therefore, when applying NSGAII, we replace these operators by two new operators called Cluster-Oriented Recombination and Neighbor-Oriented Mutation which can produce a valid offspring with good properties inherited from its parents. We defer the discussion of the deficiencies of the traditional genetic operators in Sects. 3.2.4 and 3.2.5, where we also describe our new genetic operators in detail. In the next sections, we summarize the NSGAII mechanism and present our modifications for the alternative clustering problem.
Fast nondominated sorting algorithm (NSGAII)
The pseudo code of NSGAII is shown in Algorithm 1. Let P be the fixed size of populations generated in the algorithm. The algorithm first creates an initial parent population P
0 and then produces an offspring population Q
0 from P
0 by the usual binary tournament selection, recombination, and mutation operators. The binary tournament selection procedure picks randomly two solutions and returns the non-dominated one. The set of non-dominated solutions P
best is initialized as the set of non-dominated solutions in P
0∪Q
0. Then for each generation t, the procedure FastNonDominatedSort(R
t
) classifies all solutions in the combined population R
t
=P
t
∪Q
t
into different dominance fronts (sorted in the ascending order of dominance depth where the first front is the set of non-dominated solutions). The pseudo code of the FastNonDominatedSort procedure is presented in Algorithm 2. We denote n
p
the number of solutions that dominate a solution p∈P and S
p
the set of solutions that p dominates. The solutions p with n
p
=0 are placed in the first non-dominated front. Then, for each solution p with n
p
=0, we visit each member q of its dominated set S
p
and reduce its domination count by one. When doing so, if n
q
=0 then we put q in a separate list Q. These members will belong the second nondominated front. The above process is repeated until all solutions are classified. The complexity of this procedure is O(P
2
Λ) where P is the population size, Λ is the number of objectives. In the case of alternative clustering with two objectives, the complexity of the FastNonDominatedSort procedure is O(P
2). Because the combined population includes all solutions of the previous parent population P
t
and the offspring population Q
t
, the non-dominated solutions found in previous generations are always kept in the next generations.
The algorithm sequentially adds the solutions of the first fronts \(\mathbb{F}_{i}\) to the next parent population P
t+1 if after adding \(\mathbb{F}_{i}\), the size of P
t+1 is still less than or equal to P. Otherwise, the remaining vacancies of P
t+1 are filled by P−|P
t+1| solutions of \(\mathbb{F}_{i}\) with the largest crowding distances. The crowding distance of a solution in a population is an estimate of the solution density around that solution. The crowding distance of a solution p is measured as the sum of the normalized distances of two solutions on the left and right side of that solution along each objective. As illustrated in Fig. 2, the crowding distance of the solution p is the sum of side lengths of the cuboid (shown with a dashed box). The larger the crowding distance of a solution is, the less the solution density surrounding that solution is. Therefore, adding the largest crowding distance points encourages the diversity of the next parent population P
t+1. The parent population P
t+1 is now used to create a new offspring population Q
t+1 by the regular evolutionary operators like binary tournament selection, recombination, mutation. In order to create a new solution p, NSGAII selects two parents p
1 and p
2 by the binary tournament selection and then applies the recombination operator on p
1 and p
2 to produce p. With probability of α, a mutation operator can be applied on p to increase the perturbation level. Then, the set P
best
of non-dominated solutions obtained so far is updated by the non-dominated solutions in Q
t+1. The whole process is repeated for the next generations. The complexity of generating a new offspring population Q
t+1 from its parent population P
t+1 is O(PΩ) where Ω is the complexity of computing Λ objectives. In alternative clustering problem, we have two objectives, therefore, the total complexity of NSGAII is O(T(P
2+PΩ)) where T is the number of generations.
In each generation, the number of non-dominated solutions is bounded by the population size P. Therefore, when running NSGAII with T generations, the size of the result set P
best
is bounded by PT. However, when the populations are moved towards the true Pareto front, the solutions at generation t mostly dominate the solutions of previous generation t−1. Therefore, in practice, the size of P
best
is around P and much smaller than PT. Only when NSGAII has converged at generation t<T, but it is still running for other T−t generations, then the size of P
best
can grow up larger than P because of very similar non-dominated solutions produced by NSGAII after converging.
The application of NSGAII to the alternative clustering problem requires the following choices:
-
Two objective functions.
-
A genetic encoding of clusterings.
-
Recombination and mutation operators to generate a new offspring population from a parent population.
-
An effective initialization scheme.
In the next sections, we present the above components.
Objective functions
We consider the Vector Quantization Error (VQE)—normally used in K-Means (Lloyd 1982)—for measuring the clustering quality, because the base clustering algorithm used in AFDT and AFDT2 is also K-Means. This objective has been shown to be very robust for noisy datasets. The VQE of a clustering solution C
t
is the sum of the square distances from each data object x
i
to the centroid of the cluster \(\mathbf{C}_{t}^{k} \) where x
i
belongs to. The VQE function is:
$$ \mathit{VQE}(\mathbf{C}_t) = \sum_{\mathbf{C}_t^k\in \mathbf{C}_t} \sum_{\mathbf{x}_i \in \mathbf{C}_t^k} \bigl\| \mathbf{x}_i - \boldsymbol{\mu}_t^k\bigr\|^2 $$
(4)
where \(\mathbf{C}_{t}^{k} \) is a cluster in the clustering solution C
t
, \(\boldsymbol{\mu}_{t}^{k} \) is the centroid of \(\mathbf{C}_{t}^{k} \), and \(\| \mathbf{x}_{i} - \boldsymbol{\mu}_{t}^{k}\|^{2} \) is the squared Euclidean distance between an item and its centroid. However, in the text datasets, the cosine distance is more suitable than the Euclidean distance. Therefore, when measuring the performance on the text datasets, we replace the Euclidean distance by the cosine distance. The Cosine VQE is:
$$ \mathit{CosineVQE}(\mathbf{C}_t) = \sum_{\mathbf{C}_t^k\in \mathbf{C}_t} \sum_{\mathbf{x}_i \in \mathbf{C}_t^k} \bigl(1 - \mathit{cosine}\bigl(\mathbf{x}_i, \boldsymbol{\mu}_t^k\bigr)\bigr) $$
(5)
where \(\mathit{cosine}(\mathbf{x}_{i}, \boldsymbol{\mu}_{t}^{k})\) is the cosine similarity between x
i
and \(\boldsymbol{\mu}_{t}^{k} \). The smaller the VQE is, the better the quality of a clustering is. The cost of computing VQE for a clustering C
t
is O(ND) where N is the dataset size and D is the number of dimensions of data objects.
To measure the similarity between two clusterings, we use the popular Adjusted Rand Index (ARI) (Hubert and Arabie 1985). ARI is a normalized version of the Rand Index (RI) proposed by Rand (1971). The Rand Index RI(C
1,C
2) between two clusterings C
1 and C
2 is simply defined as \(\frac{n_{11} + n_{00}}{ n_{11}+n_{10}+n_{01}+n_{00}} \) where n
11 is the number of object pairs that are in the same cluster in both two clusterings; n
00 is the number of pairs that are in different clusters in both clusterings; n
10 is the number of pairs that are assigned in the same cluster by the clustering C
1 and in different clusters by the clustering C
2; n
01 is the number of pairs that are assigned in different clusters by the clustering C
1 and in the same cluster by the clustering C
2. A problem with RI is that the expected value for two random clusterings is not constant. Hubert and Arabie (1985) fix this issue by introducing a normalized version of RI, called ARI. The ARI between two solutions C
1 and C
2 is defined as follows:
$$ \everymath{\displaystyle} \begin{array}{lll} \mathit{ARI}(\mathbf{C}_1, \mathbf{C}_2) &=& \frac{\mathit{Index} - \mathit{ExpectedIndex}}{\mathit{MaxIndex} - \mathit{ExpectedIndex}} \\[6pt] \mathit{ARI}(\mathbf{C}_1, \mathbf{C}_2) &=& \frac{\sum_{ij} \binom{n_{ij}}{2} - \bigl[\sum_{i} \binom{n_{i.}}{2} \sum_{j} \binom{n_{.j}}{2}\bigr]/\binom{n}{2}}{\frac{1}{2}\bigl[\sum_{i} \binom{n_{i.}}{2} + \sum_{j} \binom{n_{.j}}{2}\bigr] - \bigl[\sum_{i} \binom{n_{i.}}{2} \sum_{j} \binom{n_{.j}}{2}\bigr]/\binom{n}{2}} \end{array} $$
(6)
where n
ij
is the number of common data objects of two clusters X
i
and X
j
produced by the clustering solutions C
1 and C
2, and n
i.=∑
j
n
ij
,n
.j
=∑
i
n
ij
. The maximum value of ARI is 1 when two clusterings are identical. The value of ARI is around zero, or even a negative value, when two clusterings are very different. As we prefer different alternative clusterings, the smaller the ARI is, the better the dissimilarity between two clusterings is. In other words, we minimize the maximum similarity between the alternative clustering and the negative clustering set. Therefore, we define the similarity between an alternative clustering and a negative clustering set (similarly to the dissimilarity definition in Eq. (1)) as the maximum similarity between that alternative clustering and the clusterings in the negative clustering set. The complexity of computing ARI between two clusterings is O(N) where N is the dataset size. Therefore, the total complexity of COGNAC when optimizing VQE and ARI is O(T(P
2+PND)) where T is the number of generations and P is the population size. In other words, fixing the number of generations and the population size, the complexity of COGNAC increases linearly with the dataset size N and the number of dimensions D of data objects.
Genetic encoding of clusterings
We use the cluster-index based representation to encode clustering solutions. In detail, a clustering solution C
t
of N data objects \(\lbrace \mathbf{x}_{i} \rbrace_{i=1}^{N} \) is a N-dimensional vector where C
t
(i) is the index of the cluster where the data object x
i
belongs to. The index of each cluster is in the range of 1 to K with K is the fixed number of clusters. For example, with a dataset of 10 objects, and the number of clusters is 3, the clustering solution C
t
=[1113331122] represents there clusters: X
1={x
1,x
2,x
3,x
7,x
8}, X
2={x
4,x
5,x
6}, X
3={x
9,x
10}.
Cluster-oriented recombination operator
Although the cluster-index encoding is popular in the literature, its main disadvantage is that the traditional recombination operators often do not produce offspring solutions which inherit good properties from their parents. The first problem is that the traditional recombination operators are performed on the object level whereas the clustering meaning is considered on the cluster level. In other words, the clusters are the smallest units containing information about the quality of the clustering solution to which they belong (Falkenauer 1994). Another drawback is that one clustering can be represented by many chromosomes, e.g. two chromosomes [123144] and [314322] represents the same solution of four clusters C
1={x
1,x
4},C
2={x
2},C
3={x
3},C
4={x
5,x
6}. Therefore, performing recombination operators without a correct matching of clusters can return invalid solutions as in the following example:
The offspring [3|23|322] not only has an invalid number of clusters but also is very different from its parents. In this case, the offspring should be identical to their parents because they represent the same clustering solution.
To solve the above problems, we propose a cluster-oriented recombination operator where recombination is performed on clusters rather than on separate objects. The idea of performing recombination on clusters was first proposed by Falkenauer (1994) for the bin packing problem. However, their recombination operator cannot be applied to the clustering problem as it assumes special characteristics of the bin packing problem. In addition, their recombination does not perform a matching before merging clusters of two parents, therefore invalid solutions can still be returned.
The pseudo-code of our cluster-oriented recombination operator is presented in Algorithm 3. We first find a perfect matching M between clusters of two parents such that the number of common objects between them is largest. In this paper, we use the perfect matching algorithm proposed by Munkres (1957). The complexity of the matching algorithm is O(K
4) (or O(K
3) if optimized) where K is the number of clusters. Often, K is very small compared to the dataset size N, therefore the overhead of computing a perfect matching is relatively small.
Then, we perform uniform crossover on clusters as follows. First, we select a set I of K/2 random positions in {1,…,K} to copy clusters \(\mathbf{C}^{i}_{p_{1}} \) (where i∈I) of the first parent \(\mathbf{C}_{p_{1}} \) to the offspring C
o
. Let U be the set of all unassigned objects. Then, for each remaining position i∈{1,…,K}∖I, we compute the set \(\mathbf{C}^{\mathbf{M}(i)}_{p_{2}} \cap \mathbf{U}\) of unassigned objects in the cluster \(\mathbf{C}^{\mathbf{M}(i)}_{p_{2}} \) of the second parent \(\mathbf{C}_{p_{2}} \). If all objects in \(\mathbf{C}^{\mathbf{M}(i)}_{p_{2}} \) are assigned, it means that \(\mathbf{C}^{\mathbf{M}(i)}_{p_{2}} \) is strictly included in some cluster of the first parent. Therefore, we move all objects in \(\mathbf{C}^{\mathbf{M}(i)}_{p_{2}} \) to cluster i to avoid empty clusters. Otherwise, we simply assign the unassigned objects in \(\mathbf{C}^{\mathbf{M}(i)}_{p_{2}} \cap \mathbf{U}\) to cluster i. After merging clusters from two parents, there are still unassigned (or orphan) objects. These orphan objects will be assigned to the clusters of one randomly chosen parent. We assign the orphan objects to the clusters of only one parent to preserve good characteristics from that parent.
An example of a dataset with 12 objects is in Fig. 3a. The number of clusters is 3. The clusters of two parents are as in Fig. 3b, 3c. The perfect matching M will match: \(\mathbf{C}^{1}_{p_{1}} \rightarrow \mathbf{C}^{\mathbf{M}(1)=3}_{p_{2}} \), \(\mathbf{C}^{2}_{p_{1}} \rightarrow \mathbf{C}^{\mathbf{M}(2)=1}_{p_{2}} \), \(\mathbf{C}^{3}_{p_{1}} \rightarrow \mathbf{C}^{\mathbf{M}(3)=2}_{p_{2}} \) as in Fig. 3d. Assume that I={1}. We copy cluster \(\mathbf{C}^{1}_{p_{1}} \) from \(\mathbf{C}_{p_{1}} \), and move the unassigned objects in two clusters \(\mathbf{C}^{\mathbf{M}(2)=1}_{p_{2}} \), \(\mathbf{C}^{\mathbf{M}(3)=2}_{p_{2}} \) from \(\mathbf{C}_{p_{2}} \) to the offspring, as in Fig. 3e. Then, we assign the orphan object 5 to the cluster \(\mathbf{C}^{2}_{o} \) as in the first parent to obtain the offspring as in Fig. 3f. As it can be seen, the offspring inherits all good properties from its parents and converges to a correct clustering solution.
Neighbor-oriented mutation operator
In the traditional mutation operators, usually some objects are selected and moved randomly to different clusters. However, moving an object x
i
to a random cluster C
k can radically decrease the clustering quality when x
i
is too far from C
k. Also, if only few objects are moved to new clusters, the resulting perturbation can be too small for escaping local minima. But if many objects are moved to random clusters, the quality of the offspring can be strongly degraded. Therefore, determining the set of items to move is a difficult task. As it was the case for recombination, the traditional mutation operators can also produce invalid solutions when moving an object in a singleton cluster to a new cluster without checking the validity. To solve these problems, we replace the traditional mutation operators with a new operator called the Neighbor-Oriented Mutation operator.
The pseudo-code of the new mutation operator is presented in Algorithm 4. In detail, with the probability of ρ, each data object x
i
in a cluster with size greater than 1 is moved to the cluster of one of its γ nearest neighbors x
j
. In other words, a proportion ρ of the data objects will be selected randomly and moved to the clusters of one of their nearest neighbors. Besides, we do not move objects of singleton clusters, therefore the validity of the offspring solution is guaranteed. Moving an item in this manner avoids assigning it to a very far cluster, therefore the search space is reduced significantly, but the clustering solution is still of good quality. Besides, this operator can also produce arbitrarily-shaped clusters by linking near objects, e.g., a long cluster can be formed as a chain of near objects. Note that moving an object to one of the clusters of its nearest neighbors is different from moving that object to the nearest cluster: the first strategy allows arbitrary-shape clusters whereas the second one favors spherical clusters. To reduce the computational cost, the nearest neighbors of all objects will be computed only once before calling the neighbor-oriented mutation operator. For high-dimensional datasets, the distance between the objects can be computed by a suitable kernel. In this paper, we simply use the Euclidean distance.
The larger the value of ρ is, the more perturbed the offspring is. On the contrary, with small values of ρ, the search space is restricted, and as a result, the probability of remaining stuck in local minima increases. A similar issue also regards the number of nearest neighbors (parameter γ). A large value of γ allows moving an object x
i
to far clusters and a small value of γ permits moving x
i
only to near clusters. Setting γ too large results in random moves of data objects and thus wasting a lot of computing resources because very far objects can be assigned to the same cluster. But setting γ too small can limit the search space too much and keep the algorithm confined close to local minima.
To solve the above problems, we propose a method inspired by Simulated Annealing (Kirkpatrick et al. 1983). At the beginning, both parameters are assigned large values to allow the algorithm to explore potential solution regions. Then, these parameters are gradually decreased to help the algorithm exploit the most promising regions. This scheme automatically shifts in a gradual manner from diversification to intensification. In detail, in the first generation, ρ is assigned to ρ
max
and then in the next generations, ρ is decreased sequentially by multiplying by a decay factor ρ
dec
such that the value of ρ in the last generation is ρ
min
. Mathematically, the probability ρ
t
of moving an object at the t-th generation is computed as \(\rho_{t} = \rho_{max} \rho_{dec}^{t} \) and \(\rho_{max} \rho_{dec}^{T} = \rho_{min}\) where T is the number of generations. Formally, ρ
dec
is computed based on ρ
min
and ρ
max
as follows:
$$ \rho_{\mathit{dec}} = \sqrt[T]{\frac{\rho_{\mathit{min}}}{\rho_{\mathit{max}}}} $$
(7)
Similarly, the number of nearest neighbours γ is first set to γ
max
and then sequentially decreased by multiplying by a decay factor γ
dec
. However, we only decrease γ in the first T/2 generations and keep γ as γ
min
in the remaining generations to guarantee a large enough perturbation for the algorithm to escape from local minima. In detail, for the t-th generation where t<T/2, \(\gamma_{t} = \gamma_{\mathit{max}} \gamma_{\mathit{dec}} ^{t}\) where γ
dec
is computed such that \(\gamma_{\mathit{max}} \gamma_{\mathit{dec}} ^{T/2} = \gamma_{\mathit{min}}\). In other words, γ
dec
is computed as:
$$ \gamma_{\mathit{dec}} = \sqrt[T/2]{\frac{\gamma_{\mathit{min}}}{\gamma_{\mathit{max}}}} $$
(8)
For the t-th generation where t≥T/2, γ
t
is set to γ
min
. In the implementation, γ
t
is a double variable and rounded to an integer by a built-in ceiling function when calling the Neighbor-Oriented Mutation operator.
Initialization
The initialization phase plays an important role in guiding the algorithm towards the true Pareto front. If the initial population contains only random solutions which are very different from the negative clusterings, then the algorithm explores well the region where the dissimilarity of the alternative clusterings and the negative ones is high. Analogously, if solutions similar to the negative clusterings are included into the initial set, then the algorithm often produces high-quality clusterings but similar to the negative ones. Here we assume that the negative clusterings are of high quality because they are usually obtained from single objective clustering algorithms. From this observation, we generate the initial population such that half of them are dissimilar clusterings and the rest are high-quality clusterings as follows.
Generating dissimilar clusterings: Let P be the initial population size and K
neg
and K
alter
be the number of clusters in negative clusterings and alternative clusterings, respectively. We generate P/2 dissimilar solutions from pairs of negative clusterings and individual negative clusterings as follows.
For each pair of two negative clustering solutions C
1 and C
2, we first find a perfect matching M between their clusters. Then, for each pair of matched clusters \(\mathbf{C}^{i}_{1} \rightarrow \mathbf{C}^{\mathbf{M}(i)}_{2}\), we compute a common cluster \(\mathbf{C}^{i}_{\mathit{com}} = \mathbf{C}^{i}_{1} \cap \mathbf{C}^{\mathbf{M}(i)}_{2}\), and a xor cluster \(\mathbf{C}^{i}_{\mathit{xor}} = (\mathbf{C}^{i}_{1} \cup \mathbf{C}^{\mathbf{M}(i)}_{2})\setminus \mathbf{C}^{i}_{\mathit{com}}\). Then we randomly merge two nearest common clusters or xor clusters until their total number equals K
alter
to generate a dissimilar offspring C
o
. The distance between two common clusters or two xor clusters is computed as the Euclidean distance between their centroids. The offspring is very dissimilar from its parents because in their parents, the objects in two common or two xor clusters are in different clusters, but in the offspring they are placed into the same cluster. Note that we do not merge a common cluster and a xor cluster because they can reproduce a cluster which equals one of parents’ clusters. If the number of clusters in two negative solutions are different, before matching, we sequentially split the largest cluster of the solution with the smaller number of clusters into two sub-clusters by K-Means until the number of clusters in two solutions are the same.
For individual negative clustering C
1, we first extract its K
neg
clusters \(\lbrace \mathbf{C}^{i}_{1} \rbrace_{i=1}^{K_{\mathit{neg}}}\). Next, for each cluster \(\mathbf{C}^{i}_{1}\), we use K-Means (Lloyd 1982) to partition this cluster into K
alter
sub-clusters \(\lbrace \mathbf{C}^{i_{j}}_{1}\rbrace_{j=1}^{K_{\mathit{alter}}}\). The remaining objects in \(\mathbf{X} \setminus \mathbf{C}^{i}_{1} \) are assigned to the j-th nearest sub-cluster \(\mathbf{C}^{i_{j}}_{1}\), with probability \(\alpha^{-j} / \sum_{t=1}^{K} \alpha^{-t} \) to form a dissimilar offspring C
o
. The parameter α is a factor determining the perturbation level of the offspring solution. In other words, the probability of assigning an unassigned object to its (j+1)-th nearest sub-cluster is α times smaller than the probability of assigning that object to the j-th nearest sub-cluster. The smaller α is, the more perturbed the offspring is, therefore we vary α from α
min
=2 to α
max
=10 to generate a diverse set of dissimilar solutions. In detail, from each cluster \(\mathbf{C}^{i}_{1}\) and a value α∈{α
min
,…,α
max
}, we generate \(\frac{P/2 - N_{\overline{C}}(N_{\overline{C}}-1)/2 }{N_{\overline{C}}K_{\mathit{neg}} (\alpha_{\mathit{max}} - \alpha_{\mathit{min}} + 1)}\) dissimilar offspring where \(N_{\overline{C}}\) is the number of negative clusterings. The distance between an object and a sub-cluster is computed as the Euclidean distance between that object and the sub-cluster centroid. The offspring generated by the above strategy is very dissimilar to their parents because the objects in each cluster \(\mathbf{C}^{i}_{1}\) of their parents C
1 are now assigned to different clusters. This strategy is similar to the ensemble clustering algorithm proposed by Gondek and Hofmann (2005), but different because of the perturbation parameter α to diversify the offspring set.
Generating high-quality clustering: We generate \(\frac{P/2}{N_{\overline{C}}}\) high quality offspring from each negative clustering C
1 as follows. First, we extract its K
neg
clusters \(\lbrace \mathbf{C}^{i}_{1} \rbrace_{i=1}^{K_{\mathit{neg}}}\). If K
neg
>K
alter
, we merge sequentially two nearest clusters until the number of clusters is K
alter
. In the case where K
neg
<K
alter
, we split iteratively the largest cluster into two sub-clusters by K-Means until the number of clusters equals K
alter
. Then, we compute the cluster centroids \(\lbrace \boldsymbol{\mu}^{i}_{1} \rbrace_{i=1}^{K_{\mathit{alter}}}\) and assign each object to its i-th nearest centroid with the probability \(\alpha^{-i} / \sum_{t=1}^{K_{\mathit{alter}}} \alpha^{-t} \) to obtain a new offspring. Similar to the procedure of generating dissimilar offspring, we also vary α from α
min
=2 to α
max
=10 for diversifying the high-quality offspring set.
Sequential generation of alternative clusterings
Based on the COGNAC algorithm (presented in Sect. 3.2), we propose the algorithm SGAC (the abbreviation of Sequential Generation of Alternative Clusterings) to generate a sequence of alternative clusterings as in Algorithm 5. First, the negative clustering set contains only the initial negative clustering and the alternative clustering set is empty. This initial negative clustering is often obtained from popular single objective clustering algorithms like K-Means (Lloyd 1982), Hierarchical Clustering (Lance and Williams 1967). Then in each iteration, the user will select one of the alternative clusterings returned by COGNAC. We defer the detailed discussion of the selection technique in Sect. 3.4. This alternative clustering is added to both sets of negative and alternative clusterings. Therefore, the alternative clustering generated in each next iteration is guaranteed to be different from the previous alternative clusterings. Finally, the set of all different alternative clusterings is returned to the user.
Analyzing the Pareto front
In order to reduce the number of solutions presented to users, we apply a traditional clustering algorithm like K-Means to partition the solution set into K clusters. Because the range of dissimilarity and quality are different, when clustering the solutions, we normalize their dissimilarity and quality as follows:
where \(\mu_{\mathbb{D}}, \sigma_{\mathbb{D}},\mu_{\mathbb{Q}}, \sigma_{\mathbb{Q}}\) are the mean and standard deviation of dissimilarity and quality of the solution set, respectively. For each cluster of solutions S
i
, the ranges of its dissimilarity and quality are represented in two border solutions: the one with the highest dissimilarity and lowest quality, and the other one with the highest quality and lowest dissimilarity. Therefore, users only need to consider the two border solutions of each cluster and quickly discard unsuitable clusters. If the user is satisfied with one of the border solutions, the algorithm can stop. Otherwise, the user selects a cluster of solutions with a reasonable range of dissimilarity and quality. Then, he can analyze that cluster more deeply by partitioning it again into sub-clusters and repeating the above process until a satisfactory solution is met.
Figure 4 shows an example of partitioning the Pareto solutions into 3 clusters. In cluster 2, the solution C
2a
is the solution with the highest dissimilarity and lowest quality, and the solution C
2b
is the solution with the highest quality and lowest dissimilarity. Assume that the range of dissimilarity and quality of two solutions C
2a
and C
2b
satisfies the users’ requirement. If they satisfy with one of the two border solutions, they can stop the algorithm. Otherwise, if users want to have finer solutions, they can run a traditional clustering algorithm like K-Means (Lloyd 1982) to partition cluster 2 into three other sub-clusters and repeat the whole process.
Besides, when plotting all solutions, the figures are very difficult to read, therefore, we filter the similar solutions on the Pareto front as follows. First, we sort all solutions in the descending order of the dissimilarity objective. Then, we add the first solution with the highest dissimilarity to the filtered Pareto front S
filtered
. For each next solution C, we compute the normalized difference on each objective between C and the previously added solution C′. Denote S
∗ the Pareto solution set returned by COGNAC. The normalized difference (w.r.t a negative clustering set \(\overline{\mathbf{S}}\)) in dissimilarity Δ
D
and in quality Δ
Q
of two solutions C and C′ are computed as:
If the normalized difference on two objectives \({\varDelta}_{\mathbb{D}}(\mathbf{C}, \mathbf{C}')\) and \({\varDelta}_{\mathbb{Q}}(\mathbf{C}, \mathbf{C}')\) between two solutions C and C′ are equal to or greater than a threshold δ, then C is added to the filtered Pareto front S
filtered
. The above process is repeated until all solutions are considered. This technique can also be applied before partitioning the approximated Pareto front into K clusters to remove similar solutions.