Abstract
Kleinberg introduced an axiomatic system for clustering functions. Out of three axioms, he proposed, two (scale invariance and consistency) are concerned with data transformations that should produce the same clustering under the same clustering function. The so-called consistency axiom provides the broadest range of transformations of the data set. Kleinberg claims that one of the most popular clustering algorithms, k-means does not have the property of consistency. We challenge this claim by pointing at invalid assumptions of his proof (infinite dimensionality) and show that in one dimension in Euclidean space the k-means algorithm has the consistency property. We also prove that in higher dimensional space, k-means is, in fact, inconsistent. This result is of practical importance when choosing testbeds for implementation of clustering algorithms while it tells under which circumstances clustering after consistency transformation shall return the same clusters. Two types of remedy are proposed: gravitational consistency property and dataset consistency property which both hold for k-means and hence are suitable when developing the mentioned testbeds.
1 Introduction
In his heavily cited paper (Kleinberg, 2002), Kleinberg introduced an axiomatic system for clustering functions. A clustering function applied to a dataset S produces a partition Γ. A partition Γ of a set S into k subsets (clusters) is to be understood as the set of subsets Γ = {C1,C2,...,Ck} such that \(\cup _{i=1}^{k} C_{i}=S\), Ci ∩ Cj = ∅ for any i≠j, \(C_{i}\subseteq S\) and Ci≠∅ for any i. Kleinberg (2002, Section 2) defines clustering function as:
Definition 1
A clustering function is a function f that takes a distance function d on [set] S [of size n ≥ 2] and returns a partition Γ of S. The sets in Γ will be called its clusters.
where the distance is understood by him as
Definition 2
With the set \(S=\{1,2,\dots ,n\}\) [...] we define a distance function to be any function \(d: S \times S\rightarrow \mathbb {R}\) such that for distinct i,j ∈ S we have d(i,j) ≥ 0,d(i,j) = 0 if and only if i = j, and d(i,j) = d(j,i).
Of the three axioms he proposed, two are concerned with data transformations that should produce the same clustering (partition) under the same clustering function. We can speak here about “clustering preserving transformations” induced by these axioms. The so-called consistency axiom, mentioned below, shall be of interest to us here, as it provides the broadest range of transformations. Note that, following the literature, we use the terms “property” and “axiom” interchangeably.
Property 1
Let Γ be a partition of S, and d and \(d^{\prime }\) two distance functions on S. We say that \(d^{\prime }\) is a Γ-transformation of d if (a) for all i,j ∈ S belonging to the same cluster of Γ, we have \(d^{\prime }(i, j) \le d(i, j)\) and (b) for all i,j ∈ S belonging to different clusters of Γ, we have \(d^{\prime }(i, j) \ge d(i, j)\). The clustering function f has the consistency property if for each distance function d and its Γ-transformation \(d^{\prime }\) the following holds: if f(d) = Γ, then \(f(d^{\prime }) = {{\varGamma }}\)
Subsequently, we will use the term Γ-transformation interchangeably with Γ-based consistency transformation or just consistency transformation. Let us also mention the other clustering preservation axiom of Kleinberg, that is the scale-invariance axiom.
Property 2
A function f has the scale-invariance property if for any distance function d and any α > 0, we have f(d) = f(α ⋅ d).
The validity or non-validity of any clustering preserving axiom for a given clustering function is of vital practical importance, as it may serve as a foundation for a testbed of the correctness of the function. Any modern software developing firm creates tests for its software in order to ensure its proper quality. Generators providing versatile test data are therefore of significance because they may detect errors unforeseen by the developers. Thus the consistency axiom may be used to generate new test data from existent one knowing a priori what the true result of clustering should be. The scale-invariance axiom may be used too, but obviously, the diversity of derived sets is much smaller.
Kleinberg defined a class of clustering functions, called the centroid functions, as follows: for any natural number k ≥ 2 and any continuous, non-decreasing, and unbounded function \(g: \mathbb {R}^{+}\rightarrow \mathbb {R}^{+}\), the (k;g)-centroid clustering consists of: (1) choosing the set of k centroid points \(T \subseteq S\) for which the objective function \({\Delta }^{g}_ d(T) = {\sum }_{i\in S} g(d(i, T))\) is minimized, where \( d(i, T) = \min \limits _{j\in T} d(i, j) \); (2) obtaining a partition of S into k clusters by assigning each point to the element of T closest to it. He claims that the objective function underlying k-means clustering is obtained by setting g(d) = d2. This is not quite correct, because cluster centers in k-means do not necessarily belong to S, though with a dense set S the approximation may be relatively good. It would be more appropriate if Kleinberg had spoken of the k-medoids algorithm.
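To make the gap concrete, here is a small numerical illustration (toy data of our own choosing): with g(d) = d², the (k;g)-centroid objective must place the center at a data point (a medoid), whereas k-means places it at the gravity center, which generally gives a strictly lower cost.

```python
# Toy illustration: (k;g)-centroid (medoid-style) cost vs the true k-means
# cost for k = 1 on three points on a line.
points = [0.0, 1.0, 5.0]

# (1;g)-centroid cost with g(d) = d^2: the single center must be a data point
medoid_cost = min(sum((x - c) ** 2 for x in points) for c in points)

# k-means cost: the optimal single center is the mean, which need not lie in S
mean = sum(points) / len(points)
kmeans_cost = sum((x - mean) ** 2 for x in points)

print(medoid_cost, kmeans_cost)  # 17.0 14.0
```

The medoid restriction makes the cost strictly larger here, which is exactly why setting g(d) = d² does not recover the k-means objective.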
Note that his distance definition (Def. 2) is not a Euclidean one and not even metric, as he stresses. This is of vital importance because based on this he formulates and proves a theorem (his Theorem 4.1)
Theorem 1
Theorem 4.1 from Kleinberg (2002). For every k ≥ 2 and every function g [...] and for [data set size] n sufficiently large relative to k, the (k;g)-centroid clustering function [this term encompassing k-means] does not satisfy the Consistency property.
which we claim is wrong with respect to k-means for a number of reasons, as we will show below:
- The objective function underlying k-means clustering is not obtained by setting g(d) = d2, contrary to Kleinberg’s assumption (k-medoids is obtained instead).
- k-means always works in a fixed-dimensional space, while his proof relies on a space of unlimited dimension.
- Unlimited dimensionality implies a serious software testing problem, because the algorithm’s correctness cannot be established by testing, as the number of required tests is too vast.
- The consistency property holds for k-means in one-dimensional space.

The last result opens the problem of whether or not consistency also holds in higher dimensions.
We begin our presentation by recalling the basics of the k-means algorithm in Section 2. We recall Kleinberg’s proof of k-means inconsistency and point at its weak points in Section 3. Then we investigate the impact of dimensionality on k-means consistency in Section 4. In Section 5 we discuss the reasons for inconsistency in multi-dimensional spaces and propose a remedy in terms of gravitational consistency and generalized gravitational consistency. In Section 6, we suggest still a different way around the problem by proposing the dataset consistency property. Section 7 reports on some experiments illustrating selected insights from the paper. Conclusions are presented in Section 8.
2 k-Means algorithm
The popular clustering algorithm, k-means (MacQueen, 1967) strives to minimize the partition quality function (called also partition cost function)

\(J(U,M)= {\sum }_{i=1}^{m} {\sum }_{j=1}^{k} u_{ij} \|\mathbf {x}_{i}- \boldsymbol {\mu }_{j}\|^{2}\)    (1)
where xi, \(i=1,\dots , m\) are the data points, M is the matrix of cluster centers μj, \(j=1,\dots , k\), and U is the cluster membership indicator matrix, consisting of entries uij, where uij is equal to 1 if among all of cluster (gravity) centers μj is the closest to xi, and is 0 otherwise.
It can be rewritten in various ways, of which the following are of interest to us here. Let \({{\varGamma }}=\{C_{1},\dots ,C_{k}\}\) be a partition of the data set into k clusters \(C_{1},\dots ,C_{k}\). Then

\(Q({{\varGamma }})= {\sum }_{C\in {{\varGamma }}} {\sum }_{\mathbf {x}_{i} \in C} \|\mathbf {x}_{i}- \boldsymbol {\mu }(C)\|^{2}\)    (2)
where \( \boldsymbol {\mu }(C)=\frac {1}{|C|}{\sum }_{\mathbf {x}_{i} \in C} \mathbf {x}_{i}\) is the gravity center of the cluster C. The above can also be presented as

\(Q({{\varGamma }})= {\sum }_{C\in {{\varGamma }}} \frac {1}{2|C|} {\sum }_{\mathbf {x} \in C} {\sum }_{\mathbf {y} \in C} \|\mathbf {x}-\mathbf {y}\|^{2}\)    (3)
The problem of seeking the pair (U,M) minimizing J from equation (1) is called the k-means-problem. This problem is known to be NP-hard. We will call k-means-ideal an algorithm that finds a pair (U,M) minimizing J from equation (1). Practical implementations of k-means usually find only a local minimum of J. There exist various variants of this algorithm; for an overview of many of them, see, e.g., Wierzchoń and Kłopotek (2018). An algorithm is said to be from the k-means family if it has the structure described by Algorithm 1. We will use a version with random initialization (randomly chosen initial seeds) as well as an artificial one initialized close to the true cluster centers, which mimics k-means-ideal.
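The alternating structure just described can be sketched as follows (a minimal illustrative implementation of our own, not a reproduction of Algorithm 1; the `init` parameter mimics the two initialization variants mentioned above):

```python
import random

def lloyd_kmeans(points, k, init=None, iters=100, seed=0):
    """Minimal sketch of a k-means-family algorithm: seed k centers, then
    alternate assignment and update steps until the centers stabilize
    (a local minimum of J)."""
    centers = list(init) if init else random.Random(seed).sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step: nearest center wins
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # update step: replace each center by its cluster's gravity center
        new_centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[j]
                       for j, cl in enumerate(clusters)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters

# two well-separated groups; seeding one center in each mimics k-means-ideal
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
centers, clusters = lloyd_kmeans(pts, 2, init=[(0.0, 0.0), (10.0, 0.0)])
```

With well-separated seeds the iteration recovers the intuitive partition; with random seeds it may get stuck in a poor local minimum, which is the phenomenon discussed in the experiments of Section 7.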
3 Kleinberg’s proof of Theorem 1 and its unlimited dimensionality deficiency
Kleinberg’s proof, restricted to the case k = 2 only, runs as follows: Consider a set of points S = X ∪ Y where X,Y are disjoint and |X| = m, |Y | = γm, where γ > 0 is “small”. ∀i,j∈X d(i,j) = r, ∀i,j∈Y d(i,j) = 𝜖 < r, ∀i∈X,j∈Y d(i,j) = r + δ, where δ > 0 and δ is “small”. By choosing γ,𝜖,r,δ appropriately, the optimal choice of k = 2 centroids will consist of one point from X and one from Y. The resulting partition is Γ = {X,Y }. Let us divide X into X = X0 ∪ X1 with X0,X1 of equal cardinality. Reduce the distances so that \(\forall _{c=0,1}\forall _{i,j\in X_{c}} d^{\prime }(i,j)=r^{\prime }<r\) and \(d^{\prime }=d\) otherwise. If \(r^{\prime }\) is “sufficiently small”, then the optimal choice of two centroids for S will consist of one point from each Xc, yielding a different partition of S. But \(d^{\prime }\) is a Γ-transform of d, so a violation of consistency occurs. This concludes Kleinberg’s proof of Theorem 1.
The proof cited above is a bit eccentric, because the clusters are heavily unbalanced (k-means tends to produce rather balanced clusters). Furthermore, the distance function is awkward, because Kleinberg’s counter-example would require an embedding in a very high-dimensional space, untypical for k-means applications. It needs to be mentioned that Kleinberg’s proof, sketchy in nature, omits many details. Kleinberg uses a distance definition that is broader than the Euclidean one and therefore does not consider space dimensionality. k-means, on the other hand, in its basic version explicitly assumes a Euclidean space. This is the reason why we consider Kleinberg’s proof in the light of Euclidean space embedding.
We claim in brief:
Theorem 2
Kleinberg’s proof of Kleinberg (2002) Theorem 4.1, that k-means (k = 2) is not consistent, is not valid in \(\mathbb {R}^{p}\) for data sets of cardinality n > 2(p + 1).
Proof
In terms of the concepts used in Kleinberg’s proof, either the set X or the set Y is of cardinality p + 2 or higher. Kleinberg requires the distances between these p + 2 points to be all identical, which is impossible in \(\mathbb {R}^{p}\) (at most p + 1 points may be pairwise equidistant). □
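The bound invoked in the proof can be illustrated constructively: in \(\mathbb {R}^{p}\), p + 1 pairwise equidistant points do exist (the vertices of a regular simplex). The sketch below (helper names are ours) builds them from the p standard basis vectors plus one extra point on the diagonal:

```python
import math

def regular_simplex(p):
    """p + 1 pairwise equidistant points in R^p: the p standard basis
    vectors plus lam*(1,...,1), with lam chosen so every gap is sqrt(2)."""
    pts = [[1.0 if i == j else 0.0 for j in range(p)] for i in range(p)]
    lam = (1 + math.sqrt(p + 1)) / p  # solves p*lam^2 - 2*lam - 1 = 0
    pts.append([lam] * p)
    return pts

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

pts = regular_simplex(3)
gaps = {round(dist(a, b), 9) for i, a in enumerate(pts) for b in pts[i + 1:]}
print(gaps)  # exactly one distinct pairwise distance: sqrt(2)
```

No such construction exists for p + 2 points, which is what invalidates Kleinberg’s equidistance assumption in a fixed-dimensional Euclidean space.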
Furthermore, Kleinberg’s minimized target function

\( {{\Delta }^{g}_{d}}(T) = {\sum }_{i\in S} g(d(i, T)) \)
where \( d(i, T) =\min \limits _{j\in T} d(i, j) \), differs significantly from the formula (3). For the original set X, the formula (3) would return \(\frac 12 (m-1) r^{2} \), while Kleinberg’s would produce (m − 1)r2. For a combination of a elements from X and b elements from Y in one cluster we get \( \frac {a(a-1)r^{2}/2+b(b-1)\epsilon ^{2}/2+ab(r+\delta )^{2}}{a+b}\) from (2) or the minimum of (a − 1)r2 + b(r + δ)2 and (b − 1)𝜖2 + a(r + δ)2 for Kleinberg’s \({{\Delta }^{g}_{d}}(T)\). The discrepancy between these formulas is shown in Fig. 1. We assumed there r = 10, 𝜖 = 8, δ = 1 and m = 1000.
We see immediately that
Theorem 3
Kleinberg’s target function does not match the real k-means target function.
4 The impact of dimensionality on the consistency property
As visible from Theorem 2, the dimensionality of the space impacts the validity of Kleinberg’s proof of the inconsistency of k-means. However, this does not answer the question whether or not k-means is actually consistent in a fixed-dimensional space. In this section we show that k-means is in fact consistent in one-dimensional space (Theorem 4), but it is inconsistent in three or more dimensions (Theorem 5) and also inconsistent in two dimensions (Theorem 6).
Theorem 4
k-means is consistent in one dimensional Euclidean space.
The proof is postponed to the Appendix A.
But what about higher dimensions?
Theorem 5
k-means in 3D is not consistent.
The proof, by example, is postponed to Appendix A. The example used in that proof is more realistic (balanced, in Euclidean space) than Kleinberg’s and shows that the inconsistency of k-means in \(\mathbb {R}^{m}\) is a real problem. The example used in the proof of Theorem 5 demonstrates not only a consistency violation but also a refinement-consistency violation, and not only in 3D but also in higher dimensions (as a 3D example may always be embedded in n dimensions, n > 3). So what about the case of two dimensions, 2D?
Theorem 6
k-means in 2D is not consistent.
Proof
The proof of Theorem 6 uses a less realistic example than that of Theorem 5; hence Theorem 5 was worth considering in spite of the fact that it is implied by Theorem 6. Imagine a unit circle with data points arranged as follows (Fig. 2 left): one data point in the center, and the remaining points arranged on the circle at the following angular positions with respect to the circle center.
Set A = {\(13^{o},14^{o},\dots ,22^{o}, -13^{o},-14^{o},\dots ,-22^{o}\)}. Set B = {\(133^{o},134^{o},\dots ,142^{o}, -133^{o},-134^{o},\dots ,-142^{o}\)}. k-means with k = 2 will merge the points of set B and the circle middle point into one cluster, and set A into the other cluster. After a Γ-transformation (Fig. 2 right), let A turn to \(A^{\prime }\) identical with A and let B change to \(B^{\prime }\) = {\(162^{o},163^{o},\dots ,171^{o}, -162^{o},-163^{o},\dots ,-171^{o}\)}, while the point in the center of the circle remains in its position. Now k-means with k = 2 yields one cluster consisting of the points of set \(B^{\prime }\) and a second cluster consisting of the circle middle point and the set \(A^{\prime }\). Thus the center point of the circle switches clusters upon the Γ-transformation. □
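The example from the proof can be replayed numerically; the sketch below (a plain Lloyd iteration of our own, seeded at the intuitive cluster means) checks that the circle's center point indeed switches clusters after the transformation:

```python
import math

def points_at(degrees):
    """Unit-circle points at the given angular positions (degrees)."""
    return [(math.cos(math.radians(a)), math.sin(math.radians(a)))
            for a in degrees]

def lloyd2(points, centers, iters=50):
    """Plain Lloyd iteration for k = 2 from given initial centers
    (no empty-cluster handling; enough for this small example)."""
    clusters = [[], []]
    for _ in range(iters):
        clusters = [[], []]
        for p in points:
            d0 = (p[0] - centers[0][0])**2 + (p[1] - centers[0][1])**2
            d1 = (p[0] - centers[1][0])**2 + (p[1] - centers[1][1])**2
            clusters[0 if d0 <= d1 else 1].append(p)
        centers = [tuple(sum(c) / len(cl) for c in zip(*cl)) for cl in clusters]
    return clusters

mean = lambda pts: tuple(sum(c) / len(pts) for c in zip(*pts))
A = points_at(list(range(13, 23)) + list(range(-22, -12)))
B = points_at(list(range(133, 143)) + list(range(-142, -132)))
Bp = points_at(list(range(162, 172)) + list(range(-171, -161)))
center = (0.0, 0.0)

before = lloyd2(A + B + [center], [mean(A), mean(B)])   # center joins B
after = lloyd2(A + Bp + [center], [mean(A), mean(Bp)])  # center joins A'
print(center in before[1], center in after[0])  # True True
```

Before the transformation the center lies closer to the gravity center of B (about 0.74 away) than to that of A (about 0.95); after moving B to B′ the B-side centroid recedes to about 0.97, so the center point defects to the A-side cluster.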
5 Reasons for multidimensional inconsistency
In order to investigate the reasons for k-means inconsistency in higher dimensions, in analogy to the proof of Theorem 4 from Section 4, let us consider two alternative partitions in a multi-dimensional space:
- the partition \({{\varGamma }}{}_{1}=\{C_{1.},\dots , C_{k.}\}\), which will be the basis for the Γ-transform,
- and the competing partition \({{\varGamma }}{}_{2}=\{C_{.1},\dots , C_{.k^{\prime }}\}\).
Assume further that Cij = Ci. ∩ C.j are the non-empty intersections of clusters Ci. ∈Γ1, C.j ∈Γ2 of both partitions. Define minind(Ci.), resp. maxind(Ci.), as the minimal/maximal index j such that Cij is not empty. Q(Γ1) will be the sum of the centered sums of squares over all Cij plus the squared distances of the centers of all Cij to the center of Ci., weighted by the cardinality of Cij.
We can derive the formula for Q(Γ1) in the same way as in the proof of Theorem 4 in Appendix A (equation (8)):

\(Q({{\varGamma }}{}_{1})= {\sum }_{i,j; C_{ij}\ne \emptyset } {\sum }_{\mathbf {x} \in C_{ij} } \|\mathbf {x}- \boldsymbol {\mu }(C_{ij})\|^{2} + {\sum }_{C_{i.}\in {{\varGamma }}{}_{1}} 0.5{\sum }_{j; C_{ij}\ne \emptyset } {\sum }_{j^{\prime }; C_{ij^{\prime }}\ne \emptyset } \frac {|C_{ij}| \cdot |C_{ij^{\prime }}|}{|C_{i.}|} \| \boldsymbol {\mu }(C_{ij})- \boldsymbol {\mu }(C_{ij^{\prime }})\|^{2}\)    (4)
Q(Γ2) can also be derived in analogy to equation (8) in the proof of Theorem 4 in Appendix A as:

\(Q({{\varGamma }}{}_{2})= {\sum }_{i,j; C_{ij}\ne \emptyset } \frac {|C_{ij}|}{|C_{.j}|} {\sum }_{\mathbf {x}\in C_{ij} } \|\mathbf {x}- \boldsymbol {\mu }(C_{ij})\|^{2} + {\sum }_{C_{.j}\in {{\varGamma }}{}_{2}} \frac {0.5}{|C_{.j}|} {\sum }_{i\ne i^{\prime }} {\sum }_{\mathbf {x}\in C_{ij} } {\sum }_{\mathbf {y}\in C_{i^{\prime }j} } \|\mathbf {x}-\mathbf {y}\|^{2}\)    (5)
The first summand of Q(Γ1), that is \({\sum }_{i,j; C_{ij}\ne \emptyset } {\sum }_{\mathbf {x} \in C_{ij} } \|\mathbf {x}- \boldsymbol {\mu }(C_{ij}) \|^{2} \), will decrease upon a Γ1-based consistency transformation. The reason is that \({\sum }_{\mathbf {x}\in C_{ij} } \|\mathbf {x}- \boldsymbol {\mu }(C_{ij})\|^{2}\) is equivalent to \( \frac {0.5}{|C_{ij}|} {\sum }_{\mathbf {x}\in C_{ij},\mathbf {y}\in C_{ij} } \| \mathbf {x}-\mathbf {y}\|^{2} \), which decreases because the distances between elements of Cij decrease, as they all lie in the same cluster Ci.. As far as the summands of Q(Γ2) are concerned, the first, equal to \({\sum }_{i,j; C_{ij}\ne \emptyset } \frac {|C_{ij}|}{|C_{.j}|} {\sum }_{\mathbf {x}\in C_{ij} } \|\mathbf {x} -\boldsymbol {\mu }(C_{ij})\|^{2} \), will therefore also decrease upon the Γ1 transformation, but not by the same absolute value as the first summand of Q(Γ1), because always |Cij| ≤ |C.j|. The second summand of Q(Γ2), that is \({\sum }_{C_{.j}\in {{\varGamma }}{}_{2}} \frac {0.5}{|C_{.j}|} {\sum }_{i\ne i^{\prime }} {\sum }_{\mathbf {x}\in C_{ij} } {\sum }_{\mathbf {y}\in C_{i^{\prime }j} } \|\mathbf {x}-\mathbf {y}\|^{2}\),
will increase, because x and y stem from different clusters of Γ1. If Γ1 was the optimal clustering for the k-means cost function prior to the Γ1 transformation, it would remain so afterward if the second summand of Q(Γ1), that is \({\sum }_{C_{i.}\in {{\varGamma }}{}_{1}} 0.5{\sum }_{j; C_{ij}\ne \emptyset } {\sum }_{j^{\prime }; C_{ij^{\prime }}\ne \emptyset } \frac {|C_{ij}| \cdot |C_{ij^{\prime }}|}{|C_{i.}|} \| \boldsymbol {\mu }(C_{ij})- \boldsymbol {\mu }(C_{ij^{\prime }})\|^{2}\), decreased. However, in a multidimensional space this is no longer guaranteed, because \(\| \boldsymbol {\mu }(C_{ij})- \boldsymbol {\mu }(C_{ij^{\prime }})\|^{2}\) may increase even when the points of the cluster \(C_{i.}\) are getting closer to one another. An immediate remedy would be to require that for any two convex subsets \(C_{ij},C_{ij^{\prime }}\) of Ci., \(\| \boldsymbol {\mu }(C_{ij})- \boldsymbol {\mu }(C_{ij^{\prime }})\|^{2}\) be non-increasing upon the Γ1 transformation. This condition is not easy to check. However, if one decreases all distances within one cluster Ci. by the very same factor, then this condition holds. It also holds if, within an orthogonal coordinate system, one decreases all distances within one cluster Ci. along each dimension by a factor specific to the dimension and the cluster. Under such circumstances, the distances within a cluster will not necessarily be changed by the same factor.
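The pairwise identity invoked above, relating the centered sum of squares of a set to its pairwise squared distances, can be checked numerically (toy data of our own choosing):

```python
import random

# Check: sum_{x in C} ||x - mu(C)||^2  ==  (0.5/|C|) * sum_{x,y in C} ||x - y||^2
rng = random.Random(1)
C = [(rng.random(), rng.random(), rng.random()) for _ in range(7)]
mu = tuple(sum(c) / len(C) for c in zip(*C))

sq = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
lhs = sum(sq(x, mu) for x in C)
rhs = 0.5 / len(C) * sum(sq(x, y) for x in C for y in C)
print(abs(lhs - rhs) < 1e-12)  # True
```

This identity is what lets the within-cluster summands of Q be reasoned about purely in terms of pairwise distances, which a Γ-transformation controls directly.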
So, define the gravitational consistency as follows:
Property 3
Let Γ be a partition of S, and d and \(d^{\prime }\) two distance functions on S. We say that \(d^{\prime }\) is a Γ-gravitational-transformation of d if (a) for all i,j ∈ S belonging to the same cluster of Γ, we have \(d^{\prime }(i, j) =\alpha d(i, j)\) where 0 < α ≤ 1 and α is specified for a given cluster (may be different for different clusters) and (b) for all i,j ∈ S belonging to different clusters of Γ, we have \(d^{\prime }(i, j) \ge d(i, j)\). The clustering function f has the gravitational consistency property if for each distance function d and its Γ-gravitational-transformation \(d^{\prime }\) the following holds: if f(d) = Γ, then \(f(d^{\prime }) = {{\varGamma }}\)
Theorem 7
k-means ideal has the gravitational consistency property.
Proof
Straightforward from the above. □
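A small numerical sketch (helper names are ours) of why the gravitational transformation is safe: contracting a cluster uniformly toward its gravity center scales the distance between any two sub-cluster centers by exactly the factor α, so the second summand of Q(Γ1) cannot increase.

```python
import math

def shrink(cluster, alpha):
    """Γ-gravitational contraction of one cluster: pull every point toward
    the cluster's gravity center, scaling all within-cluster distances by alpha."""
    mu = tuple(sum(c) / len(cluster) for c in zip(*cluster))
    return [tuple(m + alpha * (x - m) for x, m in zip(p, mu)) for p in cluster]

cluster = [(0.0, 0.0), (1.0, 0.0), (4.0, 2.0), (5.0, 3.0)]
sub1, sub2 = cluster[:2], cluster[2:]  # an arbitrary split into sub-clusters
shrunk = shrink(cluster, 0.5)
s1, s2 = shrunk[:2], shrunk[2:]

mean = lambda pts: tuple(sum(c) / len(pts) for c in zip(*pts))
ratio = math.dist(mean(s1), mean(s2)) / math.dist(mean(sub1), mean(sub2))
print(ratio)  # 0.5: sub-center distances shrink by exactly alpha
```

Since sub-cluster means transform affinely under the contraction, the distance between them is multiplied by α regardless of how the cluster is split, which is the condition that Kleinberg's unrestricted Γ-transformation fails to guarantee.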
Define also the generalized gravitational consistency as follows:
Property 4
Let Γ be a partition of S, and d and \(d^{\prime }\) two distance functions on S. We say that \(d^{\prime }\) is a Γ-generalized-gravitational-transformation of d if (a) for all i ∈ S belonging to the same cluster C of Γ, with μ(C) being its gravity center, and for an orthogonal coordinate system CS specific to this cluster, for each coordinate axis a ∈ CS we have \(d_a^{\prime }(i, \boldsymbol {\mu }(C)) =\alpha (C,a) d_a(i, \boldsymbol {\mu }(C))\), where 0 < α(C,a) ≤ 1, da is the length of the projection of the vector \((i,\boldsymbol {\mu }(C))\) onto the coordinate axis a (same for \(d^{\prime }\)), and α(C,a) is specified for a given cluster and coordinate (it may be different for different clusters and different coordinates), and (b) for all i,j ∈ S belonging to different clusters of Γ, we have \(d^{\prime }(i, j) \ge d(i, j)\). The clustering function f has the generalized gravitational consistency property if for each distance function d and its Γ-generalized-gravitational-transformation \(d^{\prime }\) the following holds: if f(d) = Γ, then \(f(d^{\prime }) = {{\varGamma }}\).
Theorem 8
k-means ideal has the generalized gravitational consistency property.
Proof
Straightforward from the above. □
6 Dataset consistency
The gravitational consistency can be viewed as too rigid, as it imposes a very strict limitation on how the distances between data elements can change. Though generalized gravitational consistency is less restrictive, the variations of distances within a cluster are nonetheless quite restricted, being determined by only as many factors as there are dimensions.
Note that we have so far considered the case when arbitrary data was clustered by the clustering algorithm. Let us now investigate whether or not we can define data set properties for which Kleinberg’s consistency property would hold for k-means. We would then speak of dataset consistency.
The idea we present here is quite simplistic, but nonetheless, it demonstrates that clustering algorithm properties may be implied by data set properties.
Assume we know what properties a dataset needs to possess so that we know in advance the partition Γ0 for which the absolute minimum of the k-means quality function Q(Γ) (3) is obtained. Assume that this property depends, among others, on the distances between cluster centers. When performing a Γ-transformation, a cluster center can move by at most the distance between the cluster center and the most distant point of the cluster. So it is sufficient to add, to the distances between the clusters, the maximum relocation for each cluster. Hence after the Γ-transformation, the distances are still sufficient to ensure the absolute minimum of the k-means target function.
The only task to do now is to identify this property of a dataset, allowing to know in advance the aforementioned absolute minimum of k-means Q-function.
So we will investigate below under what circumstances it is possible to tell, without an exhaustive check, that the well-separated clusters constitute the global minimum of k-means. We will see that the ratio between the largest and the smallest cluster cardinality plays an important role here.
Definition 3
There is a gap g between two clusters A,B, if the distance between (hyper)balls centered at gravity centers of these clusters and enclosing each cluster amounts to g.
Let us consider a set of clusters \({{\varGamma }}=\{C_1,\dots ,C_k\}\), where k is the number of clusters, ni is the number of elements in cluster Ci, ri is the radius of the (hyper)ball centered at the gravity center of cluster Ci and containing all the data points of the cluster, \(M=\max \limits _i n_i\), and \(m=\min \limits _i n_i\). Let g be the gap between every two clusters Ci,Cj fulfilling the conditions (6) and (7)
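The gap of Definition 3 is straightforward to compute for point sets in Euclidean space; the helper below (our own sketch) returns a negative value when the enclosing balls overlap:

```python
import math

def gap(A, B):
    """Gap of Definition 3: distance between the enclosing balls centered
    at the gravity centers of clusters A and B (negative if they overlap)."""
    mean = lambda pts: tuple(sum(c) / len(pts) for c in zip(*pts))
    radius = lambda pts, m: max(math.dist(p, m) for p in pts)
    ma, mb = mean(A), mean(B)
    return math.dist(ma, mb) - radius(A, ma) - radius(B, mb)

print(gap([(0, 0), (2, 0)], [(10, 0), (12, 0)]))  # 8.0
```

In a test generator, one would verify this quantity against the lower bounds of conditions (6) and (7) before accepting a generated dataset.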
Theorem 9
A clustering Γ0 for which conditions (6) and (7), imposing constraints on the gap g between clusters, hold is the optimal clustering, that is, the one with the lowest value of Q(Γ) among all partitions of the same cardinality as Γ0.
Proof has been postponed to Appendix A.
Therefore we may call the above-mentioned well-separated clustering an absolute clustering.
Definition 4
A clustering is called absolute if conditions (6) and (7), imposing constraints on the gap g between clusters, hold.
One sees immediately that inner cluster consistency is kept, this time in terms of global optimum, under the restraint to k clusters.
Theorem 10
k-means-ideal, applied to a dataset with gaps between intrinsic clusters amounting to g plus the radii of the clusters between which the gap is measured, has Kleinberg’s consistency property.
The proof is straightforward.
7 Experiments
7.1 Theorem 4 related experiments
Experiments were performed to check whether Theorem 4, which denies Kleinberg’s findings for one-dimensional space, really holds. Samples were generated from the uniform distribution (sample sizes 100, 200, 400, 1000, 2000, 4000, 10000). Each sample was clustered into k clusters for each \(k=2,\dots ,floor(\sqrt {samplesize})\) using k-means (R package) with 100k restarts. Subsequently, a Γ-transformation was performed whereby the distances within a cluster were decreased by a randomly chosen factor (a separate factor for each pair of neighboring data points) and, at the same time, the clusters were moved apart so that the distances between elements of distinct clusters were not decreased. Then k-means clustering was performed with 100k restarts in two variants. The first variant used random initialization. The second variant was initialized at the midpoints of the original (rescaled) cluster intervals. Additionally, for control purposes, the original samples were reclustered. The number of partitions for which errors in restoring the original clustering were observed was counted. Experiments were repeated ten times. Table 1 presents the average results obtained.
In this table, looking at the errors for variant 1, we see that more errors are committed with increasing sample size (and hence increasing maximum k). This contrasts with variant 2, where the number of errors is negligible. The second variant differs from the first in that the seeds are distributed so that there is one in each intrinsic cluster.
Clearly, Theorem 4 holds (as visible from variant 2). At the same time, however, the table shows that k-means with random initialization is unable to initialize properly for a larger number k of clusters in spite of the large number of restarts (variant 1). This is confirmed by the experiments with reclustering the original data.
This study also shows how a test data generator may work when comparing variants of the k-means algorithm (for one-dimensional data).
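Such a generator might be sketched as follows (an illustrative one-dimensional variant with names of our own; it shrinks within-cluster gaps by random factors and separates the clusters generously enough that no between-cluster distance can decrease):

```python
import random

def gamma_transform_1d(clusters, rng=None, push=1.0):
    """Sketch of a 1-D consistency-transformation generator: shrink every
    gap between neighboring points within a cluster by a random factor,
    then place the clusters so far apart that every between-cluster
    distance exceeds any original distance."""
    rng = rng or random.Random(0)
    span = max(max(c) for c in clusters) - min(min(c) for c in clusters)
    out, start = [], 0.0
    for cl in clusters:                       # clusters ordered left to right
        cl = sorted(cl)
        new = [start]
        for a, b in zip(cl, cl[1:]):          # shrink each neighboring gap
            new.append(new[-1] + (b - a) * rng.uniform(0.1, 1.0))
        out.append(new)
        start = new[-1] + span + push         # cross distances now exceed span
    return out
```

By Theorem 4, k-means applied to the transformed data must restore the original one-dimensional partition, so such generated instances come with a known ground truth for regression tests.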
7.2 Theorem 5 related experiments
A simulation was performed concerning the relocation of points of the line segments AB,AC from the proof of Theorem 5.
The results are presented in Table 2. The top row, named \(\measuredangle C^{\prime }AB^{\prime }\), represents the angle between the line segments \(AC^{\prime }\) and \(AB^{\prime }\) after rotation of the AB and AC line segments upon the Γ-transformation. The effects of this rotating transformation are measured by the following quantities:
- wrong Γ - the number of k-means clustering errors compared to the original clustering before the Γ-transformation (consisting in the rotation of AB,AC), out of 4000 data points in both clusters.
Initially, the angle \(\measuredangle CAB\) between the line segments AB,AC was a right angle (π/2). As shown in Table 2, the angle between these line segments was decreased in steps of π/20 down to π/20, and clustering using k-means (with 50 restarts) was performed.
The k-means algorithm, applied to the data set AB ∪ AC ∪ DE ∪ DF, returned, as expected, two clusters: AB ∪ AC and DE ∪ DF (the column \(\frac {\pi }{2}\)). As visible in the row wrong Γ, the number of clustering errors compared to the original clustering increased up to over 4% of data points being misclassified upon rotation. It is apparent that k-means is in fact not consistent in three dimensions, as claimed in Theorem 5.
In order to better illustrate the importance of the concept of gravitational consistency, an experiment was performed related to equation (5) (first line). As previously, a data set related to the \(AB^{\prime } \cup AC^{\prime }\) subsets of the data for appropriate rotations of the line segments AB,AC was considered. This data set was split into two parts: 1) subcluster Z1, consisting of points with distance to A not higher than 20, and 2) subcluster Z2, consisting of the remaining points, so that Z1 ∪ Z2 = AB ∪ AC. While the rotation was performed, the following statistics of \(Z_{1}^{\prime },Z_{2}^{\prime }\), the images of Z1,Z2 after rotation, were observed:
- μ sc 1 - the distance between the mean of the cluster \(AB^{\prime }\cup AC^{\prime }\) and the mean of subcluster \(Z_{1}^{\prime }\),
- μ sc 2 - the distance between the mean of the cluster \(AB^{\prime }\cup AC^{\prime }\) and the mean of subcluster \(Z_{2}^{\prime }\),
- SS sc 1 - the contribution of subcluster \(Z_{1}^{\prime }\) to the sum of squares of \(AB^{\prime }\cup AC^{\prime }\),
- SS sc 2 - the contribution of subcluster \(Z_{2}^{\prime }\) to the sum of squares of \(AB^{\prime }\cup AC^{\prime }\).
When the angle \(\measuredangle C^{\prime }AB^{\prime }\) was decreased (Γ-transformation), the distances between points within both subsets \(Z_{1}^{\prime },Z_{2}^{\prime }\), as well as between the two subsets, were decreased. The distance between the gravity center of the entire data set \(AB^{\prime }\cup AC^{\prime }\) and the gravity center of the second subset \(Z_{2}^{\prime }\) was also decreasing, as visible in the row μ sc 2 of Table 2. However, the distance between the gravity center of the entire data set and the gravity center of the first subset \(Z_{1}^{\prime }\) was increasing, as visible in the row μ sc 1 of Table 2. The contribution of this subset to the overall sum of squares of the entire set was also increasing, as visible from the row SS sc 1 of Table 2. This demonstrates that the Γ-transformation, though decreasing the distances between cluster data points, does not necessarily decrease the distance between sub-cluster centers and the cluster center, which results in the inconsistency of k-means under Kleinberg’s Γ-transformation.
7.3 Theorem 7 related experiments
Experiments were also performed referring to Theorem 7; the results are summarized in Table 3. The following metrics were used:
- α - the contraction coefficient from Theorem 7,
- wrong α - the number of k-means clustering errors compared to the original clustering before the Γ-gravitational-transformation of the AB,AC cluster.
The experiments were performed on the same data as in the previous subsection. The Γ-gravitational-transformation was performed for the (original) cluster AB ∪ AC with α as indicated in the row α. The choice of α was based on the requirement that the Γ-transformation and the Γ-gravitational-transformation should yield a resulting cluster with the same variance of the data points after transformation. As visible in the row wrong α, no error in data clustering was induced by the Γ-gravitational-transformation, as expected from Theorem 7.
8 Conclusions
In this paper, we have provided a definite answer to the problem of whether or not the k-means algorithm possesses the consistency property. The answer is negative except for one-dimensional space. Settling this problem was necessary because Kleinberg’s proof concerning this property was inappropriate for the real application area of k-means, that is, a fixed-dimensional Euclidean space. The result precludes the usage of the consistency axiom as a generator of test examples for the k-means clustering function (except for one-dimensional data) and implies the need to seek alternatives.
We proposed gravitational consistency, generalized gravitational consistency and dataset consistency as alternatives to Kleinberg’s consistency property. The Γ-gravitational-transformation, as an alternative to the Γ-transformation, preserves the k-means clustering, but it is a bit rigid because it keeps the proportions between distances within a single cluster. The generalized Γ-gravitational-transformation does not have this disadvantage, though there is still some rigidness where changes in distances are concerned. The dataset consistency transformation is more flexible but requires quite large distances between the clusters. We believe, however, that these three alternatives can still generate a sufficient set of datasets for software tests. Note that an orientation on k-means is not too serious a limitation of usefulness, as quite a large number of modern clustering algorithms encompass k-means clustering, to mention just the whole branch of spectral clustering.
Kleinberg’s consistency was the subject of strong criticism, and new variants were proposed, like monotonic consistency (Strazzeri and Sánchez-García, 2018) or MST-consistency (Zadeh, 2010). See also the criticism in Carlsson and Mémoli (2010) and Correa-Morris (2013). The mentioned new definitions of consistency are apparently restrictions of Γ-consistency, and therefore Theorem 4 would remain valid for them. Monotonic consistency seems not to impose restrictions affecting Kleinberg’s proof of k-means violating consistency. Therefore, in those cases, the consistency of k-means under higher dimensionality needs to be investigated. Note that we have also challenged the result of Wei (2017), who claims that Kleinberg’s consistency may be achieved by k-means with random initialization (see our Theorem 5). The shift of axioms from the clustering function to the quality measure (Ben-David and Ackerman, 2008) was suggested as a response to the problems with consistency, but this approach fails to tell what the outcome of clustering should be, which is not useful for the mentioned test generator application.
It should be noted that, besides the Kleinberg axiomatic system, other axiomatic frameworks have been proposed, which may serve as foundations for deriving new test data sets from existing ones: for unsharp partitioning by Wright (1973), for graph clustering by van Laarhoven and Marchiori (2014), for cost-function-driven algorithms by Ben-David and Ackerman (2009), for linkage algorithms by Ackerman et al. (2010), for hierarchical algorithms by Carlsson and Mémoli (2010), Gower (1990), and Thomann et al. (2015), for multiscale clustering by Carlsson and Mémoli (2008), for settings with increasing sample sizes by Hopcroft and Kannan (2012), for community detection by Zeng et al. (2016), and for pattern clustering by Shekar (1988). They were not investigated here and are a bit hard to compare, because they were proposed for different classes of clustering algorithms that do not cover the setting relevant for k-means, that is, the embedding in Euclidean space and the partition not only of the sample but of the whole sample space.
Notes
In a test run with 100 restarts, in the first case we got clusters of equal sizes, with cluster centers at (17, 0, 0) and (−17, 0, 0) (between_SS / total_SS = 40%), whereas after rotation we got clusters of sizes 1800 and 2200, with centers at (26, 0, 0) and (−15, 0, 0) (between_SS / total_SS = 59%).
References
Ackerman, M., Ben-David, S., & Loker, D. (2010). Characterization of linkage-based clustering. In COLT 2010 (pp. 270–281).
Ben-David, S., & Ackerman, M. (2008). Measures of clustering quality: A working set of axioms for clustering. In Proc. Advances in Neural Information Processing Systems, (Vol. 21 pp. 121–128).
Ben-David, S., & Ackerman, M. (2009). Measures of clustering quality: a working set of axioms for clustering. In D Koller, D Schuurmans, Y Bengio, & L Bottou (Eds.). Advances in neural information processing systems, (Vol. 21 pp. 121–128). Curran Associates Inc.
Carlsson, G., & Mémoli, F. (2008). Persistent clustering and a theorem of J. Kleinberg. arXiv:0808.2241.
Carlsson, G., & Mémoli, F. (2010). Characterization, stability and convergence of hierarchical clustering methods. Journal of Machine Learning Research, 11, 1425–1470.
Correa-Morrisa, J. (2013). An indication of unification for different clustering approaches. Pattern Recognition, 46, 2548–2561.
Gower, J.C. (1990). Clustering axioms. Classification Society of North America Newsletter, pp 2–3.
Hopcroft, J., & Kannan, R. (2012). Computer science theory for the information age. Chapter 8.13.2. A Satisfiable Set of Axioms, p 272ff.
Kleinberg, J. (2002). An impossibility theorem for clustering. In Proc. NIPS 2002 (pp. 446–453). http://books.nips.cc/papers/files/nips15/LT17.pdf
Klopotek, M.A., & Klopotek, R.A. (2020). Clustering algorithm consistency in fixed dimensional spaces. In D Helic, G Leitner, M Stettinger, A Felfernig, & Z W Ras (Eds.) Foundations of intelligent systems - 25th international symposium, ISMIS 2020, Graz, Austria, September 23–25, 2020, Proceedings, Springer, Lecture notes in computer science, (Vol. 12117 pp. 352–361), DOI https://doi.org/10.1007/978-3-030-59491-6_33.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proc. fifth Berkeley symp. on math. Statist. and Prob., (Vol. 1 pp. 281–297). University of California Press.
Shekar, B. (1988). A knowledge-based approach to pattern clustering. PhD thesis, Indian Institute of Science.
Strazzeri, F., & Sánchez-García, R.J. (2018). Morse theory and an impossibility theorem for graph clustering. arXiv:1806.06142.
Thomann, P., Steinwart, I., & Schmid, N. (2015). Towards an axiomatic approach to hierarchical clustering of measures. Journal of Machine Learning Research, 16, 1949–2002.
van Laarhoven, T., & Marchiori, E. (2014). Axioms for graph clustering quality functions. Journal of Machine Learning Research, 15, 193–215.
Wei, Jh. (2017). Two examples to show how k-means reaches richness and consistency. DEStech Transactions on Computer Science and Engineering https://doi.org/10.12783/dtcse/aita2017/16001.
Wierzchoń, S., & Kłopotek, M. (2018). Modern clustering algorithms. Studies in Big Data 34, Springer.
Wright, W. (1973). A formalization of cluster analysis. Pattern Rec, 5(3), 273–282.
Zadeh, R. (2010). Towards a principled theory of clustering. http://stanford.edu/rezab/papers/principled.pdf.
Zeng, G., Wang, Y., Pu, J., Liu, X., Sun, X., & Zhang, J. (2016). Communities in preference networks: Refined axioms and beyond. In ICDM, (Vol. 2016 pp. 599–608).
Acknowledgements
We would like to acknowledge support for this project from the Polish government fundamental research funds.
Funding
This research was funded by the Polish government fundamental research funds.
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
Additional information
Availability of data and material
Only data generated as described in the paper were used.
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This is an extended version of a conference paper (Klopotek & Klopotek, 2020).
Appendix : A: Proofs of selected theorems
1.1 A.1 Proof of Theorem 4
Proof
Consider two alternative partitions in one dimensional space:
- the partition \({{\varGamma }}{}_1=\{C_{1.},\dots , C_{k.}\}\), which will be the base for the Γ-transform,
- and the competing partition \({{\varGamma }}{}_2=\{C_{.1},\dots , C_{.k^{\prime }}\}\).
Due to the nature of k-means, each cluster of each partition after the Γ-transform can be represented as an interval not intersecting any other cluster of the same partition. For Γ1 this holds before the transform, and therefore also afterward. Γ2 is supposed to be the competing optimal partition, so it surely holds for it afterward. We intend to demonstrate that under a Γ1-based transformation, that is, assuming that the intrinsic partition is Γ1, the k-means target function for Γ1 decreases by at least as much as that for Γ2. For simplicity, assume that the cluster indices grow with the growing value of the cluster center.
For this purpose, let Cij = Ci. ∩ C.j denote the non-empty intersections of clusters Ci. ∈ Γ1 and C.j ∈ Γ2 of both partitions. Define minind(Ci.), resp. maxind(Ci.), as the minimal/maximal index j such that Cij is non-empty. Then Q(Γ1) is the sum of the centered sums of squares over all Cij plus the squared distances of the centers of all Cij to the center of Ci., each weighted by the cardinality of Cij (easily derived from formula (2)).
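The decomposition of Q(Γ1) used here is the standard within/between sum-of-squares identity. As a quick numeric sanity check (our own sketch in plain Python with hypothetical 1-D data, not part of the original proof):

```python
# Verify the decomposition used for Q(Gamma_1): the centred sum of squares of a
# cluster equals the within-subcluster sums of squares plus the cardinality-weighted
# squared distances of subcluster centres to the cluster centre.
def mean(xs):
    return sum(xs) / len(xs)

def css(xs):  # centred sum of squares
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

cluster = [0.0, 1.0, 2.0, 7.0, 8.0, 9.5]          # a hypothetical cluster C_i.
subclusters = [[0.0, 1.0, 2.0], [7.0, 8.0, 9.5]]  # a hypothetical split C_i1, C_i2

total = css(cluster)
within = sum(css(c) for c in subclusters)
between = sum(len(c) * (mean(c) - mean(cluster)) ** 2 for c in subclusters)
print(total, within + between)  # the two numbers agree
```

The same identity holds coordinate-wise in higher dimensions.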
Please note that
$$Q({{\varGamma}}_1)=\sum_{i,j;\, C_{ij}\ne\emptyset}\ \sum_{x \in C_{ij}} (x-\mu(C_{ij}))^2 \ +\ \sum_{C_{i.}\in{{\varGamma}}_1}\ \sum_{j;\, C_{ij}\ne\emptyset} |C_{ij}|\,(\mu(C_{ij})-\mu(C_{i.}))^2$$
The Q(Γ2) can be computed analogously, but let us follow a bit distinct path (starting from formula (3)).
Both summands of Q(Γ1), that is, \(\left ({\sum }_{i,j; C_{ij}\ne \emptyset } {\sum }_{x \in C_{ij} } (x- \mu (C_{ij}) )^2 \right ) \) and \(\left ({\sum }_{C_{i.}\in {{\varGamma }}{}_1} {\sum }_{j; C_{ij}\ne \emptyset } |C_{ij}| (\mu (C_{ij})- \mu (C_{i.}))^2 \right ) \), will decrease upon a Γ1-based consistency transformation. Each (x − μ(Cij))2 decreases because the distances between the elements of Cij decrease, as they all lie in the same cluster Ci.. Each \((\mu (C_{ij})- \mu (C_{ij^{\prime }}))^2\) decreases because all the elements constituting Cij and \(C_{ij^{\prime }}\) belong to the same cluster Ci.: there is always an extreme data point Pij ∈ Cij separating it from \(C_{ij^{\prime }}\), and as the points of both Cij and \(C_{ij^{\prime }}\) get closer to Pij under the Γ1 transformation, the centers of both Cij and \(C_{ij^{\prime }}\) also get closer to Pij, and hence closer to each other. As far as the summands of Q(Γ2) are concerned, the first one, equal to \(\left ({\sum }_{i,j; C_{ij}\ne \emptyset } \frac {|C_{ij}|}{|C_{.j}|} {\sum }_{x\in C_{ij} } (x-\mu (C_{ij}))^2 \right )\), will also decrease upon the Γ1 transformation, but not by the same absolute value as the first summand of Q(Γ1), that is \(\left ({\sum }_{i,j; C_{ij}\ne \emptyset } {\sum }_{x\in C_{ij} } (x-\mu (C_{ij}))^2 \right )\), because always |Cij| ≤ |C.j|. But the second summand of Q(Γ2), that is
$$\sum_{j;\, C_{.j}\in{{\varGamma}}_2}\ \frac{1}{2|C_{.j}|}\ \sum_{i\ne l}\ \sum_{x\in C_{ij}}\ \sum_{y\in C_{lj}} (x-y)^2,$$
will increase, because such x, y stem from different clusters of Γ1. Therefore, if Γ1 was the optimal clustering for the k-means cost function prior to the Γ1 transformation, it remains so afterward. □
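The mechanism of the proof can be illustrated numerically. In 1-D, an optimal 2-means partition is a contiguous split of the sorted data, so exhaustive search finds the exact optimum. The sketch below (our own illustration with hypothetical data) applies a Γ1-based consistency transformation, shrinking within-cluster distances while letting no between-cluster distance decrease, and checks that the optimal split is preserved:

```python
def css(xs):  # centred sum of squares of a non-empty list
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def best_split(xs):  # exact optimal 2-means split index of 1-D data
    xs = sorted(xs)
    costs = [css(xs[:i]) + css(xs[i:]) for i in range(1, len(xs))]
    return min(range(len(costs)), key=costs.__getitem__) + 1

data = [0.0, 1.0, 2.5, 9.0, 10.0, 11.5]   # hypothetical 1-D dataset
i = best_split(data)                      # Gamma_1 = {data[:i], data[i:]}
c1, c2 = data[:i], data[i:]

def shrink(cluster, factor):  # pull every point towards the cluster centre
    m = sum(cluster) / len(cluster)
    return [m + factor * (x - m) for x in cluster]

# Gamma_1-based consistency transform: within-cluster distances are halved,
# and shifting the right cluster by +2 keeps every between-cluster distance
# from decreasing (verified below).
new_c1, new_c2 = shrink(c1, 0.5), [x + 2.0 for x in shrink(c2, 0.5)]
between_ok = all(y2 - x2 >= y - x for x, x2 in zip(c1, new_c1)
                 for y, y2 in zip(c2, new_c2))
print(between_ok, best_split(new_c1 + new_c2) == i)  # True True
```

The transformed data are still split at the same point, in line with Theorem 4.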
1.2 A.2 Proof of Theorem 5
Proof
Let A, B, C, D, E, F be points in three-dimensional space with coordinates A(1, 0, 0), B(33, 32, 0), C(33, −32, 0), D(−1, 0, 0), E(−33, 0, −32), F(−33, 0, 32). Let SAB, SAC, SDE, SDF be sets of, say, 1000 points each, randomly uniformly distributed over the line segments (excluding endpoints) AB, AC, DE, DF, resp. Let X = SAB ∪ SAC ∪ SDE ∪ SDF. k-means with k = 2 applied to X yields the partition Γ = {SAB ∪ SAC, SDE ∪ SDF}, as expected (see Fig. 3 left). Let us perform a Γ transformation consisting of rotating the line segments AB, AC around the point A in the plane spanned by the first two coordinates (X and Y) towards the first coordinate axis (the X axis), so that the angle between this axis and \(AB^{\prime }\), resp. \(AC^{\prime }\), is, say, one degree. To verify that this is a Γ transformation, consider some points P, Q, with P on the line segment AB and Q on the line segment AC. Their distance amounts to \(|PQ|=\sqrt {|PA|^2+|AQ|^2-2\cos \limits (\measuredangle BAC) |PA| |AQ|}\). Let the images of P, Q be \(P^{\prime },Q^{\prime }\), resp., whereby obviously \(|P^{\prime }A|=|PA|\), \(|AQ^{\prime }|=|AQ|\), and \(|\measuredangle B^{\prime }AC^{\prime }|<|\measuredangle BAC|\). Therefore
$$|P^{\prime}Q^{\prime}|=\sqrt{|PA|^2+|AQ|^2-2\cos(\measuredangle B^{\prime}AC^{\prime})\,|PA|\,|AQ|}\ \le\ |PQ|,$$
as expected of a Γ transformation for points of the same cluster. Let us now consider a point R on the line segment DE and the distance |RP| between points from two different clusters. Let Rx, Px be the orthogonal projections of R, P onto the X axis, resp. R and P lie in two orthogonal planes, spanned by the X, Z and X, Y axes, resp. Therefore \(|RP|=\sqrt {|RR_x|^2+|R_xP_x|^2+|P_xP|^2}\), whereby |RxPx| = |RxD| + |DA| + |APx|. Hence
$$|RP|=\sqrt{|RR_x|^2+(|R_xD|+|DA|+|AP_x|)^2+|P_xP|^2}.$$
Let \(P_x^{\prime }\) be the orthogonal projection of \(P^{\prime }\) onto the X axis. Then, after the Γ transformation, the distance of interest \(|RP^{\prime }|\) turns out to be \(|RP^{\prime }|=\sqrt {|RR_x|^2+|R_xP_x^{\prime }|^2+|P_x^{\prime }P^{\prime }|^2}\), that is
$$|RP^{\prime}|=\sqrt{|RR_x|^2+(|R_xD|+|DA|+|AP_x^{\prime}|)^2+|P_x^{\prime}P^{\prime}|^2}\ \ge\ |RP|,$$
because \(|AP_x^{\prime}|\ge |AP_x|\) while \(|AP_x^{\prime}|^2+|P_x^{\prime}P^{\prime}|^2=|AP_x|^2+|P_xP|^2=|PA|^2\),
as expected of a Γ transformation for points of two different clusters.
Now k-means with k = 2 yields a different partition, splitting the line segments \(AB^{\prime }\) and \(AC^{\prime }\) (see Fig. 3 right, and Note 1). □
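The construction can be checked numerically. The sketch below (our own illustration; the coordinates and the one-degree target angle are taken from the proof, while the deterministic sampling grid is our assumption) verifies that the rotation is indeed a Γ transformation: distances within the rotated cluster do not increase, and distances to the other cluster do not decrease:

```python
import math

# Points from the proof.
A, B, C = (1.0, 0.0, 0.0), (33.0, 32.0, 0.0), (33.0, -32.0, 0.0)
D, E, F = (-1.0, 0.0, 0.0), (-33.0, 0.0, -32.0), (-33.0, 0.0, 32.0)

def seg(p, q, n=10):  # n interior sample points of the segment pq
    return [tuple(p[i] + t * (q[i] - p[i]) for i in range(3))
            for t in (j / (n + 1) for j in range(1, n + 1))]

def rotated(p, sign):
    # Image of a point of AB (sign=+1) / AC (sign=-1) after rotating its segment
    # about A in the XY-plane to a 1-degree angle with the X axis; the distance
    # to the rotation centre A is preserved.
    r = math.dist(A, p)
    a = math.radians(1.0)
    return (A[0] + r * math.cos(a), sign * r * math.sin(a), 0.0)

left = seg(D, E) + seg(D, F)                       # cluster S_DE u S_DF (unchanged)
right = seg(A, B) + seg(A, C)                      # cluster S_AB u S_AC
right_new = ([rotated(p, +1) for p in seg(A, B)] +
             [rotated(p, -1) for p in seg(A, C)])

within_ok = all(math.dist(p2, q2) <= math.dist(p, q) + 1e-9
                for p, p2 in zip(right, right_new)
                for q, q2 in zip(right, right_new))
between_ok = all(math.dist(r, p2) >= math.dist(r, p) - 1e-9
                 for r in left for p, p2 in zip(right, right_new))
print(within_ok, between_ok)  # True True: a valid Gamma transformation
```

Running k-means on such data before and after the rotation (e.g., with many random restarts, as in the note above) then exhibits the change of partition claimed in the theorem.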
1.3 A.3 Proof of Theorem 9
Proof
In particular, let us consider the set of k clusters \({{\varGamma }}{}=\{C_1,\dots ,C_k\}\) of cardinalities \(n_1,\dots ,n_k\) and with radii of balls enclosing the clusters (with centers located at cluster centers) \(r_1,\dots , r_k\).
We are interested in a gap g between clusters such that it does not make sense to split each cluster Ci into subclusters \(C_{i1},\dots , C_{ik}\) and to combine them into a set of new clusters \(\mathcal {S}=\{S_1,\dots ,S_k\}\) such that \(S_j=\cup _{i=1}^k C_{ij}\).
We seek a g such that the highest possible combined central sum of squares over the clusters Ci is lower than the lowest conceivable combined sum of squares around the respective centers of the clusters Sj. Let Var(C) denote the variance of the cluster C (the average squared distance to the cluster gravity center). Let rij be the distance of the center of subcluster Cij to the center of cluster Ci, and let vilj be the distance of the center of subcluster Cij to the center of subcluster Clj. So the total k-means function for the set of clusters \((C_1,\dots ,C_k)\) will amount to:
And the total k-means function for the set of clusters \((S_1,\dots ,S_k)\) will amount to:
Should \((C_1,\dots ,C_k)\) constitute the absolute minimum of the k-means target function, then \(Q(\mathcal {S})\ge Q(\mathcal {C})\) should hold, that is:
This implies:
To maximize \({\sum }_{j=1}^k n_{ij}r_{ij}^2\) for a single cluster Ci with enclosing ball radius ri, note that rij should be set to ri. Let \(m_j=\arg \max \limits _{j \in \{1,\dots ,k\}} n_{ij}\). If we set rij = ri for all j except mj, then the maximal \(r_{i{m_j}}\) is delimited by the relation \({\sum }_{j=1; j\ne m_j}^k n_{ij}r_{ij}\ge n_{i{m_j}}r_{i{m_j}}\). So
So if we can guarantee that the gap between cluster balls (of clusters from Γ) amounts to g, then surely
because in such case g ≤ vilj for all i,l,j.
By combining inequalities (12), (13) and (14) we see that the global minimum is granted if the following holds:
One can distinguish two cases: either (1) there exists a cluster St containing two subclusters Cpt, Cqt such that \(t=\arg \max \limits _j |C_{pj}|\) and \(t=\arg \max \limits _j |C_{qj}|\) (the maximum-cardinality subclusters of their respective original clusters Cp, Cq), or (2) not.
Consider the first case. Let Cp, Cq be the two clusters of which Cpt and Cqt are the subclusters of highest cardinality, resp. This implies that \(n_{pt}\ge \frac 1k n_p\) and \(n_{qt}\ge \frac 1k n_q\). It also implies that for i ≠ p, i ≠ q, we have nit ≤ ni/2.
Note that
So, in order to fulfill inequality (15), it is sufficient to require that
This is, of course, maximized over all combinations of p and q.
Let us proceed to the second case. Here each cluster Sj contains the maximum-cardinality subcluster of a different cluster Ci. As the relation between Sj and Ci is one-to-one, we can reindex the Sj in such a way that Sj contains the maximum-cardinality subcluster Cjj of Cj. Let us rewrite the inequality (15).
This is met if
This is the same as:
This is fulfilled if:
Let M be the maximum over \(n_1,\dots ,n_k\). The above holds if
Let m be the minimum over \(n_1,\dots ,n_k\). The above holds if
This is the same as
The above will hold, if for every \(i=1,\dots ,k\)
So the inequality (15) is fulfilled if both inequality (16) and inequality (17) are satisfied by an appropriately chosen g. But relation (17) is identical with (7), and (16) is identical with (6). □
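The quantities used throughout this proof can be illustrated numerically. The sketch below (our own 1-D illustration with hypothetical data, not part of the proof) computes the k-means target function as \(Q={\sum }_i n_i\, Var(C_i)\) and shows that, with a sufficiently large gap g between the clusters, regrouping subclusters across clusters only increases the cost:

```python
def var(c):  # average squared distance to the cluster gravity centre
    m = sum(c) / len(c)
    return sum((x - m) ** 2 for x in c) / len(c)

def Q(partition):  # k-means target: sum over clusters of n_i * Var(C_i)
    return sum(len(c) * var(c) for c in partition)

g = 100.0                                          # a gap well above any bound needed here
c1 = [0.0, 1.0, 2.0, 3.0]                          # hypothetical cluster C_1
c2 = [x + 3.0 + g for x in [0.0, 1.0, 2.0, 3.0]]   # hypothetical cluster C_2, gap g away
gamma = [c1, c2]

# Regroup: each S_j combines one subcluster of C_1 with one subcluster of C_2.
s1 = c1[:2] + c2[:2]
s2 = c1[2:] + c2[2:]
print(Q(gamma), Q(gamma) < Q([s1, s2]))  # 10.0 True
```

With the gap dominating the within-cluster spreads, the original partition keeps the strictly lower cost, which is the intuition behind the gap condition derived above.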
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Kłopotek, M.A., Kłopotek, R.A. Issues in clustering algorithm consistency in fixed dimensional spaces. Some solutions for k-means. J Intell Inf Syst 57, 509–530 (2021). https://doi.org/10.1007/s10844-021-00657-6