1 Introduction

In his heavily cited paper (Kleinberg, 2002), Kleinberg introduced an axiomatic system for clustering functions. A clustering function applied to a dataset S produces a partition Γ. A partition Γ of a set S into k subsets (clusters) is to be understood as the set of subsets \({{\varGamma }}=\{C_{1},C_{2},\dots ,C_{k}\}\) such that \(\cup _{i=1}^{k} C_{i}=S\), \(C_{i}\cap C_{j}=\emptyset \) for any \(i\ne j\), \(C_{i}\subseteq S\) and \(C_{i}\ne \emptyset \) for any i. Kleinberg (2002, Section 2) defines a clustering function as:

Definition 1

A clustering function is a function f that takes a distance function d on [set] S [of size n ≥ 2] and returns a partition Γ of S. The sets in Γ will be called its clusters.

where the distance is understood by him as

Definition 2

With the set \(S=\{1,2,\dots ,n\}\) [...] we define a distance function to be any function \(d: S \times S\rightarrow \mathbb {R}\) such that for distinct \(i,j\in S\) we have d(i,j) ≥ 0, d(i,j) = 0 if and only if i = j, and d(i,j) = d(j,i).

Out of the three axioms he proposed, two are concerned with data transformations that should produce the same clustering (partition) under the same clustering function. We can speak here of “clustering preserving transformations” induced by these axioms. The so-called consistency axiom, quoted below, will be of interest to us here as it provides the broadest range of transformations. Note that, following the literature, we use the terms “property” and “axiom” interchangeably.

Property 1

Let Γ be a partition of S, and d and \(d^{\prime }\) two distance functions on S. We say that \(d^{\prime }\) is a Γ-transformation of d if (a) for all \(i,j\in S\) belonging to the same cluster of Γ, we have \(d^{\prime }(i, j) \le d(i, j)\) and (b) for all \(i,j\in S\) belonging to different clusters of Γ, we have \(d^{\prime }(i, j) \ge d(i, j)\). The clustering function f has the consistency property if for each distance function d and its Γ-transformation \(d^{\prime }\) the following holds: if f(d) = Γ, then \(f(d^{\prime }) = {{\varGamma }}\).

Subsequently, we will use the term Γ-transformation interchangeably with Γ-based consistency transformation or just consistency transformation. Let us also mention Kleinberg’s other clustering preserving axiom, namely the scale-invariance axiom.

Property 2

A function f has the scale-invariance property if for any distance function d and any α > 0, we have f(d) = f(αd).

The validity or non-validity of any clustering preserving axiom for a given clustering function is of vital practical importance, as it may serve as a foundation for a testbed of the correctness of the function. Any modern software development firm creates tests for its software in order to ensure proper quality. Generators providing versatile test data are therefore of significance, because they may detect errors unforeseen by the developers. Thus the consistency axiom may be used to generate new test data from existing data, knowing a priori what the true result of clustering should be. The scale-invariance axiom may be used too, but obviously the diversity of the derived sets is much smaller.
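As an illustration of this testing idea, a Γ-transformation of a distance matrix can be generated mechanically from an already verified clustering result. The sketch below is a minimal illustration in Python (not part of Kleinberg's paper); the function name and the factor ranges are our own choices.

```python
import numpy as np

def gamma_transform(d, labels, shrink=0.8, stretch=1.2, rng=None):
    """Produce a Gamma-transformation d' of a distance matrix d with respect
    to the partition encoded by `labels`: within-cluster distances may only
    shrink, between-cluster distances may only grow (Property 1)."""
    rng = np.random.default_rng() if rng is None else rng
    n = d.shape[0]
    d_new = d.copy()
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                factor = rng.uniform(shrink, 1.0)   # <= 1: shrink within a cluster
            else:
                factor = rng.uniform(1.0, stretch)  # >= 1: stretch across clusters
            d_new[i, j] = d_new[j, i] = d[i, j] * factor
    return d_new

# Usage: if a clustering function f satisfies consistency and f(d) yields `labels`,
# then f(gamma_transform(d, labels)) is expected to return `labels` as well,
# so (d', labels) is a ready-made test case derived from (d, labels).
```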

Kleinberg defined a class of clustering functions, called centroid functions, as follows: for any natural number k ≥ 2, and any continuous, non-decreasing, and unbounded function \(g: \mathbb {R}^{+}\rightarrow \mathbb {R}^{+}\), the (k;g)-centroid clustering consists of (1) choosing the set of k centroid points \(T \subseteq S\) for which the objective function \({\Delta }^{g}_ d(T) = {\sum }_{i\in S} g(d(i, T))\) is minimized, where \( d(i, T) = \min \limits _{j\in T} d(i, j) \), and (2) obtaining a partition of S into k clusters by assigning each point to the element of T closest to it. He claims that the objective function underlying k-means clustering is obtained by setting \(g(d) = d^{2}\). This is not quite correct, because cluster centers in k-means do not necessarily belong to S, though for a dense set S the approximation may be relatively good. It would be more appropriate if Kleinberg spoke of the k-medoid algorithm.

Note that his distance definition (Def. 2) is not a Euclidean one and not even a metric, as he stresses. This is of vital importance, because based on it he formulates and proves a theorem (his Theorem 4.1)

Theorem 1

Theorem 4.1 from Kleinberg (2002). For every k ≥ 2 and every function g [...] and for [data set size] n sufficiently large relative to k, the (k;g)-centroid clustering function [this term encompassing k-means] does not satisfy the Consistency property.

which we claim is wrong with respect to k-means for a number of reasons as we will show below. The reasons are:

  • The objective function underlying k-means clustering is not obtained by setting \(g(d) = d^{2}\), contrary to Kleinberg’s assumption (the k-medoid objective is obtained instead).

  • k-means always works in a fixed-dimensional space, while his proof relies on a space of unlimited dimensionality.

  • Unlimited dimensionality implies a serious software testing problem, because the algorithm’s correctness cannot be established by testing when the number of required tests is too vast.

  • The consistency property holds for k-means in one-dimensional space.

The last result opens the problem of whether or not the consistency also holds for higher dimensions.

We begin our presentation by recalling the basics of the k-means algorithm in Section 2. We recall Kleinberg’s proof of k-means inconsistency and point out its weak points in Section 3. Then we investigate the impact of dimensionality on k-means consistency in Section 4. In Section 5 we discuss the reasons for inconsistency in multi-dimensional spaces and propose a remedy in terms of gravitational consistency and generalized gravitational consistency. In Section 6 we suggest yet another way around the problem by proposing the dataset consistency property. Section 7 reports on some experiments illustrating selected insights from the paper. Conclusions are presented in Section 8.

2 k-Means algorithm

The popular k-means clustering algorithm (MacQueen, 1967) strives to minimize the partition quality function (also called the partition cost function)

$$ Q(U,M)=\sum\limits_{i=1}^{m}\sum\limits_{j=1}^{k} u_{ij}\|\mathbf{x}_{i} - \boldsymbol{\mu}_{j}\|^{2} $$
(1)

where xi, \(i=1,\dots , m\), are the data points, M is the matrix of cluster centers μj, \(j=1,\dots , k\), and U is the cluster membership indicator matrix with entries uij, where uij equals 1 if μj is the closest of all cluster (gravity) centers to xi, and 0 otherwise.

It can be rewritten in various ways, of which the following are of interest to us here. Let \({{\varGamma }}=\{C_{1},\dots ,C_{k}\}\) be a partition of the data set into k clusters \(C_{1},\dots ,C_{k}\). Then

$$ Q({{\varGamma}})=\sum\limits_{j=1}^{k}\sum\limits_{\mathbf{x}_{i} \in C_{j}} \|\mathbf{x}_{i} - \boldsymbol{\mu}(C_{j})\|^{2} $$
(2)

where \( \boldsymbol {\mu }(C)=\frac {1}{|C|}{\sum }_{\mathbf {x}_{i} \in C} \mathbf {x}_{i}\) is the gravity center of the cluster C. The above can be presented also as

$$ Q({{\varGamma}})=\frac12 \sum\limits_{j=1}^{k} \frac{1}{|C_{j}|} \sum\limits_{\mathbf{x}_{i} \in C_{j}}\sum\limits_{\mathbf{x}_{l} \in C_{j}} \|\mathbf{x}_{i} -\mathbf{x}_{l}\|^{2} $$
(3)
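The equivalence of (2) and (3) is the standard identity relating the within-cluster sum of squared deviations from the gravity center to the sum of squared pairwise distances. A minimal numeric sanity check of this identity (toy data and cluster sizes chosen arbitrarily) could look as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
clusters = [rng.normal(size=(n, 3)) for n in (5, 8, 12)]  # three toy clusters in R^3

# Formula (2): sum of squared distances to the cluster gravity center
q2 = sum(((C - C.mean(axis=0)) ** 2).sum() for C in clusters)

# Formula (3): half the sum of squared pairwise distances, divided by cluster size
q3 = 0.0
for C in clusters:
    diffs = C[:, None, :] - C[None, :, :]          # all pairwise difference vectors
    q3 += 0.5 * (diffs ** 2).sum() / len(C)

print(np.isclose(q2, q3))  # expected: True
```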

The problem of seeking the pair (U,M) minimizing Q from equation (1) is called the k-means problem. This problem is known to be NP-hard. We will call k-means-ideal an algorithm that finds a pair (U,M) minimizing Q from equation (1). Practical implementations of k-means usually find some local minimum of Q(). There exist various variants of this algorithm; for an overview of many of them see, e.g., Wierzchoń and Kłopotek (2018). An algorithm is said to be from the k-means family if it has the structure described by Algorithm 1. We will use a version with random initialization (randomly chosen initial seeds) as well as an artificial one initialized close to the true cluster centers, which mimics k-means-ideal.

[Algorithm 1: the generic structure of a k-means family algorithm]
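Algorithm 1 is referenced here only through its generic structure: initialize k centers, then alternate the assignment step and the center-update step until the assignment stabilizes. A minimal Lloyd-style sketch of such a k-means family member, with an optional user-supplied initialization mimicking the near-ideal variant used later in the experiments, is given below; the function name and parameters are illustrative.

```python
import numpy as np

def kmeans_family(X, k, init_centers=None, max_iter=100, rng=None):
    """Generic k-means family member: initialize centers, then alternate
    (1) assignment of points to the nearest center and (2) recomputation of
    centers as cluster gravity centers, until the assignment stabilizes."""
    rng = np.random.default_rng() if rng is None else rng
    centers = (X[rng.choice(len(X), k, replace=False)]
               if init_centers is None else np.asarray(init_centers, dtype=float))
    labels = None
    for _ in range(max_iter):
        # assignment step: nearest center in squared Euclidean distance
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # update step: move each center to the gravity center of its cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```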

3 Kleinberg’s proof of Theorem 1 and its unlimited dimensionality deficiency

Kleinberg’s proof, restricted to the case k = 2 only, runs as follows. Consider a set of points S = X ∪ Y where X,Y are disjoint and |X| = m, |Y| = γm, where γ > 0 is “small”. Assume \(\forall _{i,j\in X}\ d(i,j) = r\), \(\forall _{i,j\in Y}\ d(i,j) = \epsilon < r\), and \(\forall _{i\in X, j\in Y}\ d(i,j) = r + \delta \), where δ > 0 and δ is “small”. By choosing γ,𝜖,r,δ appropriately, the optimal choice of k = 2 centroids will consist of one point from X and one from Y. The resulting partition is Γ = {X,Y}. Let us divide X into X = X0 ∪ X1 with X0,X1 of equal cardinality. Reduce the distances so that \(\forall _{c=0,1}\forall _{i,j\in X_{c}} d^{\prime }(i,j)=r^{\prime }<r\) and \(d^{\prime }=d\) otherwise. If \(r^{\prime }\) is “sufficiently small”, then the optimal choice of two centroids for S will now consist of one point from each Xc, yielding a different partition of S. But \(d^{\prime }\) is a Γ-transform of d, so a violation of consistency occurs. This concludes Kleinberg’s proof of Theorem 1.

The proof cited above is somewhat eccentric, because the clusters are heavily unbalanced (k-means tends to produce rather balanced clusters). Furthermore, the distance function is awkward, because Kleinberg’s counter-example would require an embedding in a very high dimensional space, non-typical for k-means applications. It needs to be mentioned that Kleinberg’s proof, sketchy in nature, omits many details. Kleinberg uses a distance definition that is broader than Euclidean and therefore he does not consider space dimensionality. k-means, on the other hand, in its basic version, explicitly assumes a Euclidean space. This is the reason why we consider Kleinberg’s proof in the light of Euclidean space embedding.

We claim in brief:

Theorem 2

Kleinberg’s proof of Kleinberg (2002) Theorem 4.1 that k-means (k = 2) is not consistent, is not valid in \(\mathbb {R}^{p}\) for data sets of cardinality n > 2(p + 1).

Proof

In terms of the concepts used in Kleinberg’s proof, either the set X or the set Y is of cardinality p + 2 or higher. Kleinberg requires that the distances between these p + 2 points are all identical, which is impossible in \(\mathbb {R}^{p}\) (at most p + 1 points may be mutually equidistant). □
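The bound on the number of mutually equidistant points used in the proof is the standard regular-simplex fact; for completeness we sketch the argument (it is not part of Kleinberg's or our formal development). If \(x_{1},\dots ,x_{q}\in \mathbb {R}^{p}\) are mutually equidistant with common distance r > 0, then the vectors \(v_{i}=x_{i}-x_{q}\), \(i=1,\dots ,q-1\), satisfy

$$ \|v_{i}\|^{2}=r^{2}, \qquad \langle v_{i},v_{j}\rangle =\tfrac12\left(\|v_{i}\|^{2}+\|v_{j}\|^{2}-\|v_{i}-v_{j}\|^{2}\right)=\tfrac{r^{2}}{2} \quad (i\ne j), $$

so their Gram matrix equals \(\tfrac{r^{2}}{2}(I+J)\), with J the all-ones matrix, which is positive definite; hence \(v_{1},\dots ,v_{q-1}\) are linearly independent and q ≤ p + 1.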

Furthermore, Kleinberg’s minimized target function

$$ {{\Delta}^{g}_{d}}(T) = \sum\limits_{i\in S} g(d(i, T)) $$
(4)

where \( d(i, T) =\min \limits _{j\in T} d(i, j) \), differs significantly from formula (3). For the original set X, formula (3) would return \(\frac 12 (m-1) r^{2} \), while Kleinberg’s would produce \((m-1)r^{2}\). For a combination of a elements from X and b elements from Y in one cluster, we get \( \frac {a(a-1)r^{2}/2+b(b-1)\epsilon ^{2}/2+ab(r+\delta )^{2}}{a+b}\) from (2), or the minimum of \((a-1)r^{2} + b(r+\delta )^{2}\) and \((b-1)\epsilon ^{2} + a(r+\delta )^{2}\) for Kleinberg’s \({{\Delta }^{g}_{d}}(T)\). The discrepancy between these formulas is shown in Fig. 1, where we assumed r = 10, 𝜖 = 8, δ = 1 and m = 1000.

Fig. 1: Quotient of Kleinberg’s k-means target (formula (4)) and the real k-means target (formula (3))
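The two closed forms compared in Fig. 1 can be evaluated directly. The sketch below uses the constants r = 10, 𝜖 = 8, δ = 1, m = 1000 from the text; the cardinality of Y and the grid of cluster compositions are our own illustrative assumptions.

```python
import numpy as np

r, eps, delta, m = 10.0, 8.0, 1.0, 1000       # constants used for Fig. 1
size_Y = 200                                   # assumed |Y| = gamma*m; not specified in the text

def kmeans_target(a, b):
    """Formula (3)-style cost of a single cluster with a points from X and b from Y (a, b >= 1)."""
    return (a*(a-1)*r**2/2 + b*(b-1)*eps**2/2 + a*b*(r+delta)**2) / (a + b)

def kleinberg_target(a, b):
    """Kleinberg's centroid cost (4) for the same cluster: best centroid from X or from Y."""
    return min((a-1)*r**2 + b*(r+delta)**2, (b-1)*eps**2 + a*(r+delta)**2)

for a in (10, 100, 500, 1000):
    print(a, kleinberg_target(a, size_Y) / kmeans_target(a, size_Y))
```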

We see immediately that

Theorem 3

Kleinberg’s target function does not match the real k-means target function.

4 The impact of dimensionality on the consistency property

As visible from Theorem 2, the dimensionality of the space impacts the validity of Kleinberg’s proof of the inconsistency of k-means. However, this does not answer the question whether or not k-means is actually consistent in a fixed-dimensional space. In this section we show that k-means is in fact consistent in one-dimensional space (Theorem 4), but it is inconsistent in three or more dimensions (Theorem 5) and likewise in two dimensions (Theorem 6).

Theorem 4

k-means is consistent in one-dimensional Euclidean space.

The proof is postponed to the Appendix A.

But what about higher dimensions?

Theorem 5

k-means in 3D is not consistent.

The proof, by example, is postponed to Appendix A. The example used in that proof is more realistic (balanced, in Euclidean space) than that of Kleinberg and shows that the inconsistency of k-means in \(\mathbb {R}^{m}\) is a real problem. The example demonstrates not only a consistency violation but also a refinement-consistency violation, and not only in 3D but also in higher dimensions (as a 3D example may always be embedded in n dimensions, n > 3). So what about the case of two dimensions (2D)?

Theorem 6

k-means in 2D is not consistent.

Proof

The proof of Theorem 6 uses a less realistic example than that of Theorem 5; hence Theorem 5 was worth considering in spite of the fact that it is implied by Theorem 6. Imagine a unit circle with data points arranged as follows (Fig. 2 left): one data point in the center, and the remaining points arranged on the circle at the following angular positions with respect to the circle center.

Fig. 2: Inconsistency of k-means in 2D Euclidean space. Left: data partition before the consistency transform. Right: data partition after the consistency transform. Cluster elements are marked in blue and green; red points indicate cluster centers

Set A = {\(13^{\circ},14^{\circ},\dots ,22^{\circ}, -13^{\circ},-14^{\circ},\dots ,-22^{\circ}\)}. Set B = {\(133^{\circ},134^{\circ},\dots ,142^{\circ}, -133^{\circ},-134^{\circ},\dots ,-142^{\circ}\)}. k-means with k = 2 will merge the points of the set B and the circle middle point into one cluster, and the set A into the other cluster. After a Γ transformation (Fig. 2 right) let A turn into \(A^{\prime }\) identical with A, and let B change to \(B^{\prime }\) = {\(162^{\circ},163^{\circ},\dots ,171^{\circ}, -162^{\circ},-163^{\circ},\dots ,-171^{\circ}\)}, while the point in the center of the circle remains in its position. Now k-means with k = 2 yields one cluster consisting of the points of the set \(B^{\prime }\) and a second cluster consisting of the circle middle point and the set \(A^{\prime }\). The center point of the circle thus switches clusters upon the Γ transformation (Fig. 2 right). □
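The counterexample can be checked numerically. Below is a minimal sketch using scikit-learn's KMeans with many restarts as a stand-in for k-means-ideal; this substitution is an assumption of the sketch, not part of the proof.

```python
import numpy as np
from sklearn.cluster import KMeans

def on_circle(degrees):
    a = np.radians(degrees)
    return np.column_stack([np.cos(a), np.sin(a)])

angles_A = list(range(13, 23)) + list(range(-22, -12))
angles_B = list(range(133, 143)) + list(range(-142, -132))
angles_B_prime = list(range(162, 172)) + list(range(-171, -161))

center = np.zeros((1, 2))
X_before = np.vstack([center, on_circle(angles_A), on_circle(angles_B)])
X_after = np.vstack([center, on_circle(angles_A), on_circle(angles_B_prime)])

for name, X in [("before", X_before), ("after", X_after)]:
    labels = KMeans(n_clusters=2, n_init=100, random_state=0).fit_predict(X)
    # does the circle center (first point) share a cluster with the first A point?
    print(name, "center clustered with A:", bool(labels[0] == labels[1]))
# expected: before -> False (center goes with B), after -> True (center goes with A')
```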

5 Reasons for multidimensional inconsistency

In order to investigate the reasons for k-means inconsistency in higher dimensions, in analogy to the proof of Theorem 4 from Section 4, let us consider two alternative partitions in a multi-dimensional space:

  • the partition \({{\varGamma }}{}_{1}=\{C_{1.},\dots , C_{k.}\}\), which will be the basis for the Γ-transform,

  • and the competing partition \({{\varGamma }}{}_{2}=\{C_{.1},\dots , C_{.k^{\prime }}\}\).

Assume further that \(C_{ij} = C_{i.}\cap C_{.j}\) are the non-empty intersections of clusters \(C_{i.}\in {{\varGamma }}{}_{1}\), \(C_{.j}\in {{\varGamma }}{}_{2}\) of both partitions. Define minind(Ci.), resp. maxind(Ci.), as the minimal/maximal index j such that Cij is not empty. Q(Γ1) will be the sum of the centered sums of squares over all Cij plus the squared distances of the centers of all Cij to the center of Ci., each weighted by the cardinality of Cij.

We can derive the formula for Q(Γ1) in the same way as equation (8) in the proof of Theorem 4 in Appendix A:

$$ \begin{array}{@{}rcl@{}} Q({{\varGamma}}{}_{1})&=& \left( \sum\limits_{i,j; C_{ij}\ne \emptyset} \sum\limits_{\mathbf{x} \in C_{ij} } \|\mathbf{x}- \boldsymbol{\mu}(C_{ij}) \|^{2} \right) \\ && +\left( \sum\limits_{C_{i.}\in {{\varGamma}}{}_{1}} \sum\limits_{j; C_{ij}\ne \emptyset} |C_{ij}| \| \boldsymbol{\mu}(C_{ij})- \boldsymbol{\mu}(C_{i.})\|^{2} \right) \\ &=& \left( \sum\limits_{i,j; C_{ij}\ne \emptyset} \sum\limits_{\mathbf{x} \in C_{ij} } \|\mathbf{x}- \boldsymbol{\mu}(C_{ij}) \|^{2} \right) \\&&+\left( \sum\limits_{C_{i.}\in {{\varGamma}}{}_{1}} 0.5\sum\limits_{j; C_{ij}\ne \emptyset} \sum\limits_{j^{\prime}; C_{ij^{\prime}}\ne \emptyset} \frac{|C_{ij}| \cdot|C_{ij^{\prime}}|}{|C_{i.}|} \| \boldsymbol{\mu}(C_{ij})- \boldsymbol{\mu}(C_{ij^{\prime}})\|^{2} \right) \end{array} $$
(5)

Q(Γ2) can also be derived in analogy to equation (8) in the proof of Theorem 4 in Appendix A as:

$$ \begin{array}{@{}rcl@{}} Q({{\varGamma}}{}_{2})&=& \left( \sum\limits_{i,j; C_{ij}\ne \emptyset} \frac{|C_{ij}|}{|C_{.j}|} \sum\limits_{\mathbf{x}\in C_{ij} } \| \mathbf{x}-\boldsymbol{\mu}(C_{ij})\|^{2} \right) \\& &+ \sum\limits_{C_{.j}\in {{\varGamma}}{}_{2}} \frac{0.5}{|C_{.j}|} \left( \sum\limits_{i^{\prime};C_{i^{\prime}j}\ne \emptyset} \sum\limits_{i^{\prime\prime};i^{\prime}\ne i^{\prime\prime}, C_{i^{\prime\prime}j}\ne \emptyset} \sum\limits_{\mathbf{x}\in C_{i^{\prime}j},\mathbf{y}\in C_{i^{\prime\prime}j} } \| \mathbf{x}-\mathbf{y}\|^{2} \right) \end{array} $$

The first summand of Q(Γ1), that is \(\left ({\sum }_{i,j; C_{ij}\ne \emptyset } {\sum }_{\mathbf {x} \in C_{ij} } \|\mathbf {x}- \boldsymbol {\mu }(C_{ij}) \|^{2} \right ) \), will decrease upon a Γ1-based consistency transformation. The reason is that \({\sum }_{\mathbf {x}\in C_{ij} } \|\mathbf {x}- \boldsymbol {\mu }(C_{ij})\|^{2}\) is equivalent to \( \frac {0.5}{|C_{ij}|} \left ({\sum }_{\mathbf {x}\in C_{ij},\mathbf {y}\in C_{ij} } \| \mathbf {x}-\mathbf {y}\|^{2} \right ) \), which decreases because the distances between elements of Cij decrease, as they all lie in the same cluster Ci.. As far as the summands of Q(Γ2) are concerned, the first one, equal to \(\left ({\sum }_{i,j; C_{ij}\ne \emptyset } \frac {|C_{ij}|}{|C_{.j}|} {\sum }_{\mathbf {x}\in C_{ij} } \|\mathbf {x} -\boldsymbol {\mu }(C_{ij})\|^{2} \right )\), will therefore also decrease upon the Γ1 transformation, but not by the same absolute value as the first summand of Q(Γ1), that is \(\left ({\sum }_{i,j; C_{ij}\ne \emptyset } {\sum }_{\mathbf {x}\in C_{ij} } \|\mathbf {x}-\boldsymbol{\mu }(C_{ij})\|^{2} \right )\), because always |Cij| ≤ |C.j|. The second summand of Q(Γ2), that is

$$\sum\limits_{C_{.j}\in {{\varGamma}}{}_{2}} \frac{0.5}{|C_{.j}|} \left( \sum\limits_{i^{\prime};C_{i^{\prime}j}\ne \emptyset} \sum\limits_{i^{\prime\prime};i^{\prime}\ne i^{\prime\prime}, C_{i^{\prime\prime}j}\ne \emptyset} \sum\limits_{\mathbf{x}\in C_{i^{\prime}j},\mathbf{y}\in C_{i^{\prime\prime}j} } \|\mathbf{x}-\mathbf{y}\|^{2} \right) $$

will increase, because x,y stem from different clusters of Γ1. If Γ1 was the optimal clustering for the k-means cost function prior to the Γ1 transformation, it would remain so afterward if the second summand of Q(Γ1), that is \({\sum }_{C_{i.}\in {{\varGamma }}{}_{1}} 0.5{\sum }_{j; C_{ij}\ne \emptyset } {\sum }_{j^{\prime }; C_{ij^{\prime }}\ne \emptyset } \frac {|C_{ij}| \cdot |C_{ij^{\prime }}|}{|C_{i.}|} \| \boldsymbol {\mu }(C_{ij})- \boldsymbol {\mu }(C_{ij^{\prime }})\|^{2}\), decreased. However, in a multidimensional space this is not guaranteed anymore, because \(\| \boldsymbol {\mu }(C_{ij})- \boldsymbol {\mu }(C_{ij^{\prime }})\|^{2}\) may increase when the points of the cluster \(C_{i.}\) are getting closer to one another. An immediate remedy would then be to require that for any two convex subsets \(C_{ij},C_{ij^{\prime }}\) of Ci., \(\| \boldsymbol {\mu }(C_{ij})- \boldsymbol {\mu }(C_{ij^{\prime }})\|^{2}\) be non-increasing upon the Γ1 transformation. This condition is not easy to check. However, if one decreases all distances within one cluster Ci. by the very same factor, then this condition holds. It also holds if, within an orthogonal coordinate system, one decreases all distances within one cluster Ci. along each dimension by a factor specific to the dimension and the cluster. Under such circumstances, the distances within a cluster will not necessarily be changed by the same factor.
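The phenomenon invoked above, that a Γ-transformation may leave all within-cluster distances non-increasing and yet move the gravity centers of two sub-clusters apart, can be illustrated with a tiny 2D configuration of two arms meeting at a point, mimicking the rotation experiment of Section 7.2. The coordinates below are our own illustration, not data from the proofs.

```python
import numpy as np

def arm_points(theta, radii):
    """Points on two arms of equal length meeting at the origin at angle theta."""
    d1 = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    d2 = np.array([np.cos(theta / 2), -np.sin(theta / 2)])
    return np.vstack([r * d1 for r in radii] + [r * d2 for r in radii])

radii = np.arange(1.0, 41.0)             # 40 points per arm
near = radii <= 20                        # sub-cluster split: points close to the apex
mask = np.concatenate([near, near])       # apply the same split on both arms

for theta in (np.pi / 2, np.pi / 4):      # closing the angle only shrinks within-cluster distances
    X = arm_points(theta, radii)
    mu_near, mu_far = X[mask].mean(axis=0), X[~mask].mean(axis=0)
    print(np.degrees(theta), np.linalg.norm(mu_near - mu_far))
# the distance between the two sub-cluster centers grows as the angle shrinks
# (about 14.1 at 90 degrees vs. about 18.5 at 45 degrees)
```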

So, define the gravitational consistency as follows:

Property 3

Let Γ be a partition of S, and d and \(d^{\prime }\) two distance functions on S. We say that \(d^{\prime }\) is a Γ-gravitational-transformation of d if (a) for all \(i,j\in S\) belonging to the same cluster of Γ, we have \(d^{\prime }(i, j) =\alpha d(i, j)\), where 0 < α ≤ 1 and α is specified for a given cluster (it may be different for different clusters), and (b) for all \(i,j\in S\) belonging to different clusters of Γ, we have \(d^{\prime }(i, j) \ge d(i, j)\). The clustering function f has the gravitational consistency property if for each distance function d and its Γ-gravitational-transformation \(d^{\prime }\) the following holds: if f(d) = Γ, then \(f(d^{\prime }) = {{\varGamma }}\).

Theorem 7

k-means ideal has the gravitational consistency property.

Proof

Straightforward from the above. □
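On point data, a Γ-gravitational transformation can be realized by contracting every cluster toward its own gravity center by a cluster-specific factor α, which scales all within-cluster distances by exactly α, and by translating whole clusters apart so that condition (b) of Property 3 holds. A minimal sketch (function name and arguments are illustrative; the caller remains responsible for choosing translations large enough):

```python
import numpy as np

def gravitational_transform(X, labels, alphas, translations=None):
    """Contract each cluster toward its gravity center by its own factor
    0 < alpha <= 1 (this multiplies all within-cluster distances by alpha) and
    optionally translate whole clusters by caller-chosen vectors so that
    between-cluster distances do not decrease (condition (b) of Property 3)."""
    X_new = X.astype(float).copy()
    for c, alpha in alphas.items():
        idx = labels == c
        mu = X[idx].mean(axis=0)
        X_new[idx] = mu + alpha * (X[idx] - mu)      # uniform contraction toward mu
        if translations is not None and c in translations:
            X_new[idx] += np.asarray(translations[c], dtype=float)
    return X_new
```

By Theorem 7, k-means-ideal applied to such transformed data is expected to reproduce the original partition, which makes this transformation another candidate for deriving test data with a known target clustering.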

Define also the generalized gravitational consistency as follows:

Property 4

Let Γ be a partition of S, and d and \(d^{\prime }\) two distance functions on S. We say that \(d^{\prime }\) is a Γ-generalized-gravitational-transformation of d if (a) for all \(i\in S\) belonging to the same cluster C of Γ, with μ(C) being its gravity center, and for an orthogonal coordinate system CS specific to this cluster, for each coordinate axis \(a\in CS\) we have \(d_a^{\prime }(i, \boldsymbol {\mu }(C)) =\alpha (C,a) d_a(i, \boldsymbol {\mu }(C))\), where 0 < α(C,a) ≤ 1, da being the length of the projection of the vector from i to μ(C) onto the coordinate axis a (analogously for \(d^{\prime }\)), and α(C,a) is specified for a given cluster and coordinate (it may be different for different clusters and different coordinates), and (b) for all \(i,j\in S\) belonging to different clusters of Γ, we have \(d^{\prime }(i, j) \ge d(i, j)\). The clustering function f has the generalized gravitational consistency property if for each distance function d and its Γ-generalized-gravitational-transformation \(d^{\prime }\) the following holds: if f(d) = Γ, then \(f(d^{\prime }) = {{\varGamma }}\).

Theorem 8

k-means ideal has the generalized gravitational consistency property.

Proof

Straightforward from the above. □

6 Dataset consistency

Gravitational consistency can be viewed as too rigid, as it imposes a very strict limitation on how the distances between data elements may change. Though generalized gravitational consistency is less restrictive, the variations of distances within a cluster are nonetheless quite restricted, being determined by only as many factors as there are dimensions.

Note that so far we have considered the case where arbitrary data is clustered by the clustering algorithm. Let us now investigate whether or not we can define dataset properties for which Kleinberg’s consistency property would hold for k-means. We then speak of dataset consistency.

The idea we present here is quite simplistic, but nonetheless, it demonstrates that clustering algorithm properties may be implied by data set properties.

Assume we know what properties a dataset needs to possess so that we know in advance the partition Γ0 for which the absolute minimum of the k-means quality function Q(Γ) (3) is obtained. Assume that this property depends, among others, on the distances between cluster centers. When performing a Γ-transformation, a cluster center can move by at most the distance between the cluster center and the most distant point of the cluster. So it is sufficient to add to the distances between the clusters the maximum relocation for each cluster. Hence after the Γ-transformation, the distances are still sufficient to ensure the absolute minimum of the k-means target function.

The only remaining task is to identify a dataset property that allows one to know in advance the aforementioned absolute minimum of the k-means Q-function.

So we will investigate below under what circumstances it is possible to tell, without an exhaustive check, that well-separated clusters constitute the global minimum of k-means. We will see that the ratio between the largest and the smallest cluster cardinality plays an important role here.

Definition 3

There is a gap g between two clusters A,B if the distance between the (hyper)balls centered at the gravity centers of these clusters and enclosing each cluster amounts to g.

Let us consider a set of clusters \({{\varGamma }}=\{C_{1},\dots ,C_{k}\}\), where k is the number of clusters, ni is the number of elements in cluster Ci, ri is the radius of the (hyper)ball centered at the gravity center of cluster Ci and containing all the data points of the cluster Ci, \(M=\max \limits _{i} n_{i}\), and \(m=\min \limits _{i} n_{i}\). Let g be the gap between every two clusters Ci,Cj, fulfilling conditions (6) and (7)

$$ \begin{array}{@{}rcl@{}} \forall_{p,q; p\ne q;p,q=1,\dots,k } \ \ \ \ g&\ge k\sqrt{n_{p} +n_{q} +n } \sqrt{ \frac{ {\sum}_{i=1}^{k} n_{i} {r_{i}^{2}} }{ n_{p}n_{q} } } \end{array} $$
(6)
$$ \forall_{i=1,\dots,k } \ \ \ \ g \ge r_{i} \sqrt{k\frac{M+n}{ m} } $$
(7)

Theorem 9

A clustering Γ0 for which conditions (6) and (7), imposing constraints on the gap g between clusters, hold is an optimal clustering, that is, one with the lowest value of Q(Γ) among all partitions of the same cardinality as Γ0.

Proof has been postponed to Appendix A.

Therefore we may refer to a clustering exhibiting the above-mentioned well-separatedness as an absolute clustering.

Definition 4

A clustering is called absolute if conditions (6) and (7), imposing constraints on the gap g between clusters, hold.
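For a labelled dataset, conditions (6) and (7) can be checked directly. The sketch below (the helper name is our own) interprets the gap between two clusters as the distance between their enclosing balls, as in Definition 3, and tests both inequalities conservatively for every pair of clusters.

```python
import numpy as np
from itertools import combinations

def is_absolute_clustering(X, labels):
    """Conservative check of conditions (6) and (7): every pairwise gap between
    the enclosing balls of two clusters must meet both lower bounds."""
    ids = np.unique(labels)
    k, n = len(ids), len(X)
    centers = np.array([X[labels == c].mean(axis=0) for c in ids])
    radii = np.array([np.linalg.norm(X[labels == c] - centers[i], axis=1).max()
                      for i, c in enumerate(ids)])
    sizes = np.array([(labels == c).sum() for c in ids])
    M, m = sizes.max(), sizes.min()
    s = (sizes * radii ** 2).sum()
    bound7 = (radii * np.sqrt(k * (M + n) / m)).max()     # condition (7), for all i

    for p, q in combinations(range(k), 2):
        gap = np.linalg.norm(centers[p] - centers[q]) - radii[p] - radii[q]
        bound6 = k * np.sqrt(sizes[p] + sizes[q] + n) * np.sqrt(s / (sizes[p] * sizes[q]))
        if gap < max(bound6, bound7):                      # conditions (6) and (7)
            return False
    return True
```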

One sees immediately that inner-cluster consistency is kept, this time in terms of the global optimum, under the restriction to k clusters.

Theorem 10

k-means ideal, applied to a dataset with gaps between intrinsic clusters amounting to g plus the radii of the clusters between which the gap is measured, has Kleinberg’s consistency property.

The proof is straightforward.

7 Experiments

7.1 Theorem 4 related experiments

Experiments have been performed to check whether or not Theorem 4, which contradicts Kleinberg’s findings for one-dimensional space, really holds. Samples were generated from the uniform distribution (sample sizes 100, 200, 400, 1000, 2000, 4000, 10000). Each sample was then clustered into k clusters for each \(k=2,\dots ,floor(\sqrt {samplesize})\) using k-means (R package) with 100k restarts. Subsequently, a Γ transformation was performed in which the distances within a cluster were decreased by a randomly chosen factor (a separate factor for each pair of neighboring data points), and at the same time the clusters were moved apart so that the distance between cluster elements of distinct clusters was not decreased. Then k-means clustering was performed with 100k restarts in two variants. The first variant used random initialization. The second variant was initialized with the midpoints of the original (rescaled) cluster intervals. Additionally, for control purposes, the original samples were reclustered. The number of partitions for which errors in restoring the original clustering were observed was counted. Experiments were repeated ten times. Table 1 presents the average results obtained.
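The 1D Γ transformation used in this experiment, shrinking within-cluster gaps by random factors and moving the clusters apart so that no between-cluster distance decreases, can be sketched as follows. This is our own Python reconstruction (the original experiments used R) with illustrative factor ranges; it relies on the clusters being contiguous intervals on the line, which holds for 1D k-means partitions.

```python
import numpy as np

def gamma_transform_1d(x, labels, rng=None):
    """Shrink within-cluster gaps by random factors, then enlarge every
    between-cluster gap by the total shrinkage, so that no between-cluster
    distance can decrease (a valid 1D Gamma-transformation for contiguous clusters)."""
    rng = np.random.default_rng() if rng is None else rng
    order = np.argsort(x)
    xs, ls = x[order], labels[order]
    gaps = np.diff(xs)
    within = ls[1:] == ls[:-1]                          # gaps between points of the same cluster
    new_gaps = gaps.copy()
    new_gaps[within] = gaps[within] * rng.uniform(0.2, 1.0, size=within.sum())
    total_shrink = (gaps[within] - new_gaps[within]).sum()
    new_gaps[~within] = gaps[~within] + total_shrink    # push clusters apart sufficiently
    new_xs = np.concatenate([[xs[0]], xs[0] + np.cumsum(new_gaps)])
    out = np.empty_like(new_xs)
    out[order] = new_xs                                  # restore the original point order
    return out
```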

Table 1 Validation of Theorem 4

In this table, looking at the errors for variant 1, we see that more errors are committed with increasing sample size (and hence increasing maximum k). This contrasts with variant 2, where the number of errors is negligible. The second variant differs from the first in that the seeds are distributed so that there is one in each intrinsic cluster.

Clearly, Theorem 4 holds (as is visible from variant 2). At the same time, however, the table shows that k-means with random initialization is unable to initialize properly for a larger number k of clusters in spite of a large number of restarts (variant 1). This is confirmed by the experiments with reclustering of the original data.

This study also shows how a test data generator may work when comparing variants of the k-means algorithm (for one-dimensional data).

7.2 Theorem 5 related experiments

A simulation was performed concerning the relocation of points of the line segments AB,AC from the proof of Theorem 5.

The results are presented in Table 2. The top row, named \(\measuredangle C^{\prime }AB^{\prime }\), represents the angle between the line segments \(AC^{\prime }\) and \(AB^{\prime }\) after rotation of the line segments AB and AC upon the Γ transformation. The effects of this rotating transformation are measured by the following quantities

  • wrong Γ - number of k-means clustering errors compared to the original clustering before Γ transformation (consisting in the rotation of AB,AC) (out of 4000 data points in both clusters).

Initially, the angle \(\measuredangle CAB\) between the line segments AB,AC was a right angle (π/2). As shown in Table 2, the angle between these line segments was decreased in steps of π/20 down to π/20 and the clustering using k-means (with 50 restarts) was performed.

Table 2 Validation of Theorem 5. Explanations of row labels are provided in the text

The k-means algorithm, applied to the data set \(AB\cup AC\cup DE\cup DF\), returned, as expected, two clusters: \(AB\cup AC\) and \(DE\cup DF\) (the column \(\frac {\pi }{2}\)). As visible in the row wrong Γ, the number of clustering errors compared to the original clustering increased upon rotation, up to over 4% of data points being misclassified. It is apparent that k-means is indeed not consistent in three dimensions, as claimed in Theorem 5.

In order to better illustrate the importance of the concept of gravitational consistency, an experiment was performed related to equation (5) (first line). As previously, a data set related to the \(AB^{\prime } \cup AC^{\prime }\) subsets of the data for appropriate rotations of the line segments AB,AC was considered. This data set was split into two parts: (1) subcluster Z1 consisting of points with distance to A not greater than 20, and (2) subcluster Z2 consisting of the remaining points, so that \(Z_{1}\cup Z_{2} = AB\cup AC\). While the rotation was performed, the following statistics of \(Z_{1}^{\prime },Z_{2}^{\prime }\), the images of Z1,Z2 after rotation, were observed:

  • μ sc 1 - distance between the mean of the cluster \(AB^{\prime }\cup AC^{\prime }\) and the mean of subcluster \(Z_{1}^{\prime }\),

  • μ sc 2 - distance between the mean of the cluster \(AB^{\prime }\cup AC^{\prime }\) and the mean of subcluster \(Z_{2}^{\prime }\),

  • SS sc 1 - contribution of subcluster \(Z_{1}^{\prime }\) to the sum of squares of \(AB^{\prime }\cup AC^{\prime }\); SS sc 2 - contribution of subcluster \(Z_{2}^{\prime }\) to the sum of squares of \(AB^{\prime }\cup AC^{\prime }\).

When the angle \(\measuredangle C^{\prime }AB^{\prime }\) was decreased (Γ transformation), the distances between points within both subsets \(Z_{1}^{\prime },Z_{2}^{\prime }\), as well as between the two subsets, decreased. So did the distance between the gravity center of the entire data set \(AB^{\prime }\cup AC^{\prime }\) and the gravity center of the second subset \(Z_{2}^{\prime }\), as visible in the row μ sc 2 of Table 2. However, the distance between the gravity center of the entire data set \(AB^{\prime }\cup AC^{\prime }\) and the gravity center of the first subset \(Z_{1}^{\prime }\) was increasing, as visible in the row μ sc 1 of Table 2. Also the contribution of this subset to the overall sum of squares of the entire set was increasing, as visible from the row SS sc 1 of Table 2. This demonstrates that the Γ transformation, though decreasing the distances between cluster data points, does not necessarily decrease the distances between sub-cluster centers and the cluster center, which results in the inconsistency of k-means under Kleinberg’s Γ transformation.

7.3 Theorem 7 related experiments

Experiments were also performed referring to Theorem 7; the results are summarized in Table 3. The following metrics were used.

  • α - the contraction coefficient from Theorem 7

  • wrong α - number of k-means clustering errors compared to the original clustering before Γ-gravitational transformation of the AB,AC cluster.

Table 3 Validation of Theorem 7. Explanations of row labels are provided in the text

The experiments were performed on the same data as in the previous subsection. The Γ-gravitational transformation was performed for the (original) cluster \(AB\cup AC\) with α as indicated in the row α. The choice of α was based on the requirement that the Γ transformation and the Γ-gravitational transformation should yield a resulting cluster with the same variance of the data points after transformation. As visible in the row wrong α, no error in data clustering was induced by the Γ-gravitational transformation, as expected from Theorem 7.

8 Conclusions

In this paper, we have provided a definite answer to the problem of whether or not the k-means algorithm possesses the consistency property. The answer is negative, except for one-dimensional space. Settling this problem was necessary because Kleinberg’s proof concerning this property was inappropriate for the real application areas of k-means, that is, fixed-dimensional Euclidean space. The result precludes the usage of the consistency axiom as a generator of test examples for the k-means clustering function (except for one-dimensional data) and implies the need to seek alternatives.

We proposed gravitational consistency, generalized gravitational consistency and dataset consistency as alternatives to Kleinberg’s consistency property. The Γ-gravitational transformation, as an alternative to the Γ transformation, preserves the k-means clustering, but it is a bit rigid, because it keeps the proportions between distances within a single cluster. The generalized Γ-gravitational transformation does not have this disadvantage, though there is still some rigidity as far as the changes in distances are concerned. The dataset consistency transformation is more flexible but requires quite large distances between the clusters. We believe, however, that these three alternatives can still generate a sufficient set of datasets for software tests. Note that an orientation towards k-means is not too serious a limitation of usefulness, as quite a large number of modern clustering algorithms encompass k-means clustering, to mention just the whole branch of spectral clustering.

Kleinberg’s consistency has been the subject of strong criticism, and new variants were proposed, such as monotonic consistency (Strazzeri and Sánchez-García, 2018) or MST-consistency (Zadeh, 2010). See also the criticism in Carlsson and Mémoli (2010) and Correa-Morris (2013). The mentioned new definitions of consistency are apparently restrictions of Γ-consistency, and therefore Theorem 4 remains valid for them. Monotonic consistency seems not to impose restrictions affecting Kleinberg’s proof of k-means violating consistency. Therefore, in those cases, the consistency of k-means under higher dimensionality needs to be investigated. Note that we have also challenged the result of Wei (2017), who claims that Kleinberg’s consistency may be achieved by k-means with random initialization (see our Theorem 5). The shift of axioms from the clustering function to the quality measure (Ben-David and Ackerman, 2008) was suggested as a response to the problems with consistency, but this approach fails to tell what the outcome of clustering should be, which makes it not useful for the mentioned test generator application.

It should be noted that, besides the Kleinberg axiomatic system, other axiomatic frameworks have been proposed, which may serve as foundations for the development of new test data sets from existing ones. For example, for unsharp partitioning an axiomatic system was proposed by Wright (1973), for graph clustering by van Laarhoven and Marchiori (2014), for cost function driven algorithms by Ben-David and Ackerman (2009), for linkage algorithms by Ackerman et al. (2010), for hierarchical algorithms by Carlsson and Mémoli (2010), Gower (1990), and Thomann et al. (2015), for multiscale clustering by Carlsson and Mémoli (2008), for settings with increasing sample sizes by Hopcroft and Kannan (2012), for community detection by Zeng et al. (2016), and for pattern clustering by Shekar (1988). They were not investigated here and are a bit hard to compare, because they were proposed for different classes of clustering algorithms that do not cover the setting relevant for k-means, that is, the embedding in Euclidean space and the partitioning not only of the sample but of the whole sample space.