Issues in clustering algorithm consistency in fixed dimensional spaces. Some solutions for k-means

Kleinberg introduced an axiomatic system for clustering functions. Of the three axioms he proposed, two (scale invariance and consistency) are concerned with data transformations that should produce the same clustering under the same clustering function. The so-called consistency axiom provides the broadest range of transformations of the data set. Kleinberg claims that one of the most popular clustering algorithms, k-means, does not have the property of consistency. We challenge this claim by pointing out an invalid assumption of his proof (unlimited dimensionality) and show that in one-dimensional Euclidean space the k-means algorithm has the consistency property. We also prove that in higher-dimensional spaces k-means is, in fact, inconsistent. This result is of practical importance when choosing testbeds for implementations of clustering algorithms, as it tells under which circumstances clustering after a consistency transformation shall return the same clusters. Two types of remedy are proposed: the gravitational consistency property and the dataset consistency property, both of which hold for k-means and hence are suitable when developing the mentioned testbeds.


Introduction
In his heavily cited paper (Kleinberg, 2002), Kleinberg introduced an axiomatic system for clustering functions. (This paper is an extended version of a conference paper, Kłopotek & Kłopotek, 2020.) A clustering function f applied to a distance function d on a dataset S produces a partition Γ. Kleinberg's so-called consistency axiom refers to the following transformation of the distance function.
Property 1 Let Γ be a partition of S. A distance function d' is a Γ-transformation of d if (a) for all i, j ∈ S belonging to the same cluster of Γ, d'(i, j) ≤ d(i, j), and (b) for all i, j ∈ S belonging to different clusters of Γ, d'(i, j) ≥ d(i, j). The clustering function f has the consistency property if for each distance function d and its Γ-transformation d' the following holds: if f(d) = Γ, then f(d') = Γ.
Subsequently, we will use the terms Γ-transformation, Γ-based consistency transformation, and consistency transformation interchangeably. Let us also mention the other clustering-preservation axiom of Kleinberg, that is, the scale-invariance axiom.
Property 2 A function f has the scale-invariance property if for any distance function d and any α > 0, we have f (d) = f (α · d).
The validity or non-validity of any clustering-preserving axiom for a given clustering function is of vital practical importance, as it may serve as a foundation for a testbed of the correctness of the function. Any modern software development firm creates tests for its software in order to ensure its quality. Generators providing versatile test data are therefore significant, because they may detect errors unforeseen by the developers. Thus the consistency axiom may be used to generate new test data from existing data, knowing a priori what the true result of clustering should be. The scale-invariance axiom may be used too, but obviously the diversity of the derived sets is much smaller.
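Such a generator can be sketched directly from the definition of the Γ-transformation; the following Python sketch (function name and parameters are ours, purely illustrative) derives a new distance matrix from a given one and a known partition:

```python
def gamma_transform(dist, labels, shrink=0.5, grow=2.0):
    """Produce a Gamma-transformation of the distance matrix `dist`
    (symmetric list of lists, zero diagonal) w.r.t. the partition
    encoded by `labels`: within-cluster distances are reduced,
    between-cluster distances are enlarged, so a consistent
    clustering function must return the same partition."""
    n = len(dist)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            factor = shrink if labels[i] == labels[j] else grow
            out[i][j] = out[j][i] = dist[i][j] * factor
    return out
```

A test case then pairs the transformed matrix with the original partition as the expected clustering output.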
Kleinberg defined a class of clustering functions, called the centroid functions, as follows: for any natural number k ≥ 2 and any continuous, non-decreasing, and unbounded function g : R+ → R+, the (k; g)-centroid clustering consists of: (1) choosing the set of k centroid points T ⊆ S for which the objective function $g_d(T) = \sum_{i \in S} g(d(i, T))$ is minimized, where $d(i, T) = \min_{j \in T} d(i, j)$.
(2) assigning each point to the element of T closest to it, which yields a partition of S into k clusters. He claims that the objective function underlying k-means clustering is obtained by setting g(d) = d^2. This is not quite correct, because cluster centers in k-means do not necessarily belong to S, although for a dense set S the approximation may be relatively good. It would be more appropriate for Kleinberg to speak of the k-medoid algorithm. Note that his distance definition (his Def. 2) is neither Euclidean nor even metric, as he himself stresses. This is of vital importance, because based on it he formulates and proves a theorem (his Theorem 4.1):
Theorem 1 (Theorem 4.1 from Kleinberg, 2002) For every k ≥ 2 and every function g [...] and for [data set size] n sufficiently large relative to k, the (k; g)-centroid clustering function [this term encompassing k-means] does not satisfy the Consistency property.
which we claim is wrong with respect to k-means for a number of reasons, as we will show below. The reasons are:
- The objective function underlying k-means clustering is not obtained by setting g(d) = d^2, contrary to Kleinberg's assumption (the k-medoid objective is obtained instead).
- k-means always works in a fixed-dimensional space, while his proof relies on a space of unlimited dimensionality.
- Unlimited dimensionality implies a serious software-testing problem, because the algorithm's correctness cannot be established by testing, as the number of required tests is too vast.
- The consistency property holds for k-means in one-dimensional space.
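The first point can be made concrete in code. Below is a brute-force sketch (our own illustrative implementation) of Kleinberg's (k; g)-centroid function for one-dimensional data: since the centers T are forced to be data points, setting g(d) = d^2 yields the k-medoid objective, not the k-means one.

```python
from itertools import combinations

def centroid_clustering(points, k, g=lambda d: d * d):
    """Exhaustive (k; g)-centroid clustering as defined by Kleinberg:
    pick the k-subset T of the data minimizing sum_i g(d(i, T)),
    then assign every point to its nearest element of T.
    With g(d) = d^2 this is the k-medoid objective, not k-means,
    since centers must be data points."""
    def d(a, b):
        return abs(a - b)  # 1-D Euclidean distance for simplicity
    best_T, best_cost = None, float("inf")
    for T in combinations(points, k):
        cost = sum(g(min(d(x, t) for t in T)) for x in points)
        if cost < best_cost:
            best_T, best_cost = T, cost
    # assignment step: label each point with the index of its nearest centroid
    labels = [min(range(k), key=lambda j: d(x, best_T[j])) for x in points]
    return best_T, labels
```

The exhaustive search over subsets makes the (k; g)-centroid definition usable only on tiny inputs, but it matches Kleinberg's definition exactly.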
The last result opens the problem of whether or not the consistency also holds for higher dimensions.
We begin our presentation by recalling the basics of the k-means algorithm in Section 2. We recall Kleinberg's proof of k-means inconsistency and point out its weak points in Section 3. Then we investigate the impact of dimensionality on k-means consistency in Section 4. In Section 5 we discuss the reasons for inconsistency in multi-dimensional spaces and propose a remedy in terms of gravitational consistency and generalized gravitational consistency. In Section 6, we suggest still another way around the problem by proposing the dataset consistency property. Section 7 reports on some experiments illustrating selected insights from the paper. Conclusions are presented in Section 8.

k-Means algorithm
The popular clustering algorithm k-means (MacQueen, 1967) strives to minimize the partition quality function (also called the partition cost function)

$$Q(U, M) = \sum_{i=1}^{m} \sum_{j=1}^{k} u_{ij} \, \|x_i - \mu_j\|^2 \qquad (1)$$

where x_i, i = 1, ..., m are the data points, M is the matrix of cluster centers μ_j, j = 1, ..., k, and U is the cluster membership indicator matrix, consisting of entries u_ij, where u_ij is equal to 1 if μ_j is the closest of all cluster (gravity) centers to x_i, and is 0 otherwise.
It can be rewritten in various ways; the following are of interest to us here. Let Γ = {C_1, ..., C_k} be a partition of the data set into k clusters C_1, ..., C_k. Then

$$Q(\Gamma) = \sum_{j=1}^{k} \sum_{x_i \in C_j} \|x_i - \mu(C_j)\|^2 \qquad (2)$$

where μ(C) = (1/|C|) Σ_{x_i ∈ C} x_i is the gravity center of the cluster C. The above can also be presented as

$$Q(\Gamma) = \sum_{j=1}^{k} \frac{1}{2|C_j|} \sum_{x, y \in C_j} \|x - y\|^2 \qquad (3)$$

The problem of seeking the pair (U, M) minimizing Q from equation (1) is called the k-means problem. This problem is known to be NP-hard. We will call k-means-ideal an algorithm that finds a pair (U, M) minimizing Q from equation (1). Practical implementations of k-means usually find some local minimum of Q(). There exist various variants of this algorithm; for an overview of many of them see, e.g., Wierzchoń and Kłopotek (2018). An algorithm is said to be from the k-means family if it has the structure described by Algorithm 1. We will use a version with random initialization (randomly chosen initial seeds) as well as an artificial one initialized close to the true cluster centers, which mimics k-means-ideal.
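The structure of the k-means family (Algorithm 1) can be sketched as follows; this is a minimal one-dimensional illustration with random seeding, not the exact implementation used later in the experiments:

```python
import random

def lloyd_kmeans(xs, k, iters=100, seed=0):
    """Sketch of the k-means family structure: initialization, then
    alternating assignment and center-update steps, here for 1-D data.
    Finds a local minimum of Q(U, M); the k-means problem itself is NP-hard."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)          # random initial seeds
    for _ in range(iters):
        # assignment step: u_ij = 1 for the closest center
        clusters = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda j: (x - centers[j]) ** 2)
            clusters[j].append(x)
        # update step: mu_j becomes the gravity center of cluster j
        new_centers = [sum(c) / len(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:       # converged to a local minimum
            break
        centers = new_centers
    return sorted(centers)
```

Restarting from many random seeds, as done in the experiments below, mitigates (but does not eliminate) convergence to poor local minima.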

Kleinberg's proof of Theorem 1 and its unlimited dimensionality deficiency
Kleinberg's proof, delimited to the case of k = 2 only, runs as follows. Consider a set S = X ∪ Y in which all distances within X equal r, all distances within Y lie between ε and ε + δ, and all distances between X and Y equal γ, where δ > 0 and δ is "small". By choosing γ, ε, r, δ appropriately, the optimal choice of k = 2 centroids will consist of one point from X and one from Y. The resulting partition is Γ = {X, Y}. Let us divide X into X = X_0 ∪ X_1 with X_0, X_1 of equal cardinality. Reduce the distances so that ∀_{c=0,1} ∀_{i,j ∈ X_c} d'(i, j) = r' < r and d' = d otherwise. If r' is "sufficiently small", then the optimal choice of two centroids for S will now consist of one point from each X_c, yielding a different partition of S. But d' is a Γ-transform of d, so a violation of consistency occurs. So far the proof of Kleinberg of Theorem 1.
The proof cited above is a bit eccentric, because the clusters are heavily unbalanced (k-means tends to produce rather balanced clusters). Furthermore, the distance function is awkward: Kleinberg's counter-example would require an embedding in a very high-dimensional space, non-typical for k-means applications. It needs to be mentioned that Kleinberg's proof, sketchy in nature, omits many details. Kleinberg uses a distance definition that is broader than Euclidean, and therefore he does not consider space dimensionality. k-means, on the other hand, in its basic version explicitly assumes a Euclidean space. This is the reason why we consider Kleinberg's proof in the light of a Euclidean space embedding.
We claim in brief: Theorem 2 Kleinberg's proof of Kleinberg (2002) Theorem 4.1 that k-means (k = 2) is not consistent, is not valid in R p for data sets of cardinality n > 2(p + 1).
Proof In terms of the concepts used in Kleinberg's proof, either the set X or the set Y is of cardinality p + 2 or higher. Kleinberg requires that the distances between these p + 2 points all be identical, which is impossible in R^p (at most p + 1 points can be pairwise equidistant).
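The p + 1 bound is tight: the standard basis vectors of R^{p+1} give p + 1 pairwise-equidistant points lying in a p-dimensional affine hyperplane. A quick numerical check (our own illustrative code, not from the paper):

```python
import math

def simplex_vertices(p):
    """p + 1 pairwise-equidistant points realizable in R^p:
    the standard basis vectors of R^(p+1), which lie in the
    p-dimensional affine hyperplane sum(x) = 1."""
    return [[1.0 if i == j else 0.0 for j in range(p + 1)]
            for i in range(p + 1)]

def pairwise_distances(pts):
    """Set of all distinct pairwise distances (rounded for float safety)."""
    return {round(math.dist(a, b), 12)
            for i, a in enumerate(pts) for b in pts[i + 1:]}
```

All pairwise distances equal sqrt(2), confirming that p + 1 equidistant points exist in R^p; adding a (p + 2)-nd equidistant point would require one more dimension.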
We see immediately that:
Theorem 3 Kleinberg's target function does not match the real k-means target function.

The impact of dimensionality on the consistency property
As visible from Theorem 2, the dimensionality of the space impacts the validity of Kleinberg's proof of the inconsistency of k-means. However, this does not answer the question whether or not k-means is actually consistent in a fixed-dimensional space. In this section we will show that k-means is in fact consistent in one-dimensional space (Theorem 4), but that it is inconsistent in three or more dimensions (Theorem 5) and also in two dimensions (Theorem 6).

Theorem 4 k-means is consistent in one dimensional Euclidean space.
The proof is postponed to Appendix A.1.
But what about higher dimensions?
Theorem 5 k-means in 3D is not consistent.
The proof, by example, is postponed to Appendix A.2. The example used in that proof is more realistic (balanced, in Euclidean space) than Kleinberg's and shows that the inconsistency of k-means in R^m is a real problem. The example demonstrates not only consistency violation but also refinement-consistency violation, and not only in 3D but also in higher dimensions (as the 3D example may always be embedded in n dimensions, n > 3). So what about the case of two dimensions, 2D?
Theorem 6 k-means in 2D is not consistent.
Proof The proof of Theorem 6 uses a less realistic example than that of Theorem 5; hence Theorem 5 was worth considering despite being implied by Theorem 6. Imagine a unit circle with data points arranged as follows.

Reasons for multidimensional inconsistency
In order to investigate the reasons for k-means inconsistency in higher dimensions, in analogy to the proof of Theorem 4 from Section 4, let us consider two alternative partitions in a multi-dimensional space:
- the partition Γ_1 = {C_{1.}, ..., C_{k.}}, which will be the base for the Γ-transform,
- and the competing partition Γ_2 = {C_{.1}, ..., C_{.k}}.
Assume further that C_ij = C_{i.} ∩ C_{.j} are the non-empty intersections of clusters C_{i.} ∈ Γ_1 and C_{.j} ∈ Γ_2 of both partitions. Define minind(C_{i.}), resp. maxind(C_{i.}), as the minimal/maximal index j such that C_ij is not empty. Q(Γ_1) will be the sum of the centered sums of squares over all C_ij, plus the squared distances of the centers of all C_ij to the center of C_{i.}, times the cardinality of C_ij.
We can derive the formula for Q(Γ_1) in the same way as in the proof of Theorem 4 in Appendix A.1:

$$Q(\Gamma_1) = \sum_{i=1}^{k} \sum_{j} \Big( \sum_{x \in C_{ij}} \|x - \mu(C_{ij})\|^2 + |C_{ij}| \, \|\mu(C_{ij}) - \mu(C_{i\cdot})\|^2 \Big) \qquad (8)$$

Q(Γ_2) can be derived analogously to equation (8) in the proof of Theorem 4 in Appendix A.1, starting from formula (3). Its summands of the form Σ_{x,y ∈ C_ij} ||x − y||^2 decrease upon the Γ_1-based consistency transformation, because the distances between elements of C_ij decrease, as they are all in the same cluster C_{i.}. The corresponding summands of Q(Γ_1) will therefore also decrease. But Q(Γ_2) does not decrease by the same absolute value, since its remaining summands, of the form Σ_{x ∈ C_ij, y ∈ C_i'j} ||x − y||^2 with i ≠ i', will increase, because x, y stem from different clusters of Γ_1. If Γ_1 was the optimal clustering for the k-means cost function prior to the Γ_1 transformation, it would remain so afterward provided that the distances ||μ(C_ij) − μ(C_ij')||^2 between subcluster centers do not increase. However, in a multidimensional space this is not granted anymore, because ||μ(C_ij) − μ(C_ij')||^2 may increase even when the points of the cluster C_{i.} are getting closer to one another. An immediate remedy would be to require that for any two convex subsets C_ij, C_ij' of C_{i.}, ||μ(C_ij) − μ(C_ij')||^2 be non-increasing upon the Γ_1 transformation. This condition is not easy to check. However, if one decreases all distances within one cluster C_{i.} by the very same factor, then this condition holds. It also holds if, within an orthogonal coordinate system, one decreases all distances within one cluster C_{i.} along each dimension by a factor specific to the dimension and the cluster; under such circumstances, the distances within a cluster will not necessarily be changed by the same factor. So, define gravitational consistency as follows:

Property 3 Let Γ be a partition of S, and d and d' two distance functions on S. We say that d' is a Γ-gravitational-transformation of d if (a) for all i, j ∈ S belonging to the same cluster C of Γ, d'(i, j) = α_C d(i, j) for some cluster-specific factor 0 < α_C ≤ 1, and (b) for all i, j ∈ S belonging to different clusters of Γ, d'(i, j) ≥ d(i, j). A clustering function f has the gravitational consistency property if f(d) = Γ implies f(d') = Γ for each Γ-gravitational-transformation d' of d.

Theorem 7 k-means-ideal has the gravitational consistency property.

Proof Straightforward from the above.
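In coordinates, the Γ-gravitational transformation amounts to contracting each cluster toward its gravity center by a cluster-specific factor, which rescales all within-cluster distances by exactly that factor. A sketch (our own illustrative helper):

```python
def gravitational_transform(points, labels, alpha):
    """Gamma-gravitational transformation sketch: contract every cluster
    toward its own gravity center by a cluster-specific factor
    alpha[c] <= 1, which scales all within-cluster distances by exactly
    alpha[c].  `points` is a list of coordinate lists."""
    # gravity centers of the clusters
    centers = {}
    for c in set(labels):
        member = [p for p, l in zip(points, labels) if l == c]
        centers[c] = [sum(coord) / len(member) for coord in zip(*member)]
    # move each point toward its cluster center
    return [[m + alpha[l] * (x - m) for x, m in zip(p, centers[l])]
            for p, l in zip(points, labels)]
```

Between-cluster distances must additionally be kept non-decreasing, e.g., by translating whole clusters apart, for the result to qualify as a Γ-gravitational transformation.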
Define also the generalized gravitational consistency as follows:

Property 4 Let Γ be a partition of S, and d and d' two distance functions on S. We say that d' is a Γ-generalized-gravitational-transformation of d if (a) for all i, j ∈ S belonging to the same cluster C of Γ, with μ(C) being its gravity center and for an orthogonal coordinate system CS specific to this cluster, for each coordinate axis a ∈ CS the coordinate differences between i and j along a are rescaled by a factor 0 < α_{C,a} ≤ 1 specific to the cluster and the axis, and (b) for all i, j ∈ S belonging to different clusters of Γ, d'(i, j) ≥ d(i, j). A clustering function f has the generalized gravitational consistency property if f(d) = Γ implies f(d') = Γ for each such d'.

Theorem 8 k-means-ideal has the generalized gravitational consistency property.

Proof Straightforward from the above.

Dataset consistency
The gravitational consistency may be viewed as too rigid, as it imposes a very strict limitation on how the distances between data elements may change. Though generalized gravitational consistency is less restrictive, the variations of distances within a cluster are nonetheless quite restricted, determined by only as many factors as there are dimensions. Note that so far we have considered the case when arbitrary data was clustered by the clustering algorithm. Let us now investigate whether or not we can define dataset properties under which Kleinberg's consistency property would hold for k-means. We would then speak of dataset consistency.
The idea we present here is quite simplistic, but nonetheless, it demonstrates that clustering algorithm properties may be implied by data set properties.
Assume we know what properties a dataset needs to possess so that we know in advance the partition Γ_0 at which the absolute minimum of the k-means quality function Q(Γ) (3) is attained. Assume that this property depends, among others, on the distances between cluster centers. When performing a Γ-transformation, each cluster center can move by at most the distance between the cluster center and the most distant point of the cluster. So it is sufficient to add to the distances between the clusters the maximum relocation for each cluster. Hence, after the Γ-transformation, the distances are still sufficient to ensure the absolute minimum of the k-means target function.
The only task now is to identify this property of a dataset, allowing one to know in advance the aforementioned absolute minimum of the k-means Q-function.
So we will investigate below under what circumstances it is possible to tell, without an exhaustive check, that well-separated clusters constitute the global minimum of k-means. We will see that the ratio between the largest and the smallest cluster cardinality plays an important role here.
Definition 3 There is a gap g between two clusters A, B, if the distance between (hyper)balls centered at gravity centers of these clusters and enclosing each cluster amounts to g.
Let us consider a set of clusters Γ = {C_1, ..., C_k}, where k is the number of clusters, n_i is the number of elements in cluster C_i, r_i is the radius of the (hyper)ball centered at the gravity center of cluster C_i and containing all the data points of C_i, M = max_i n_i, and m = min_i n_i. Let g be the gap between every two clusters C_i, C_j, fulfilling conditions (6) and (7), the latter being:

$$\forall_{p,q;\; p \neq q;\; p,q = 1,\dots,k} \quad g \ge k \sqrt{\frac{(n_p + n_q + n) \sum_{i=1}^{k} n_i r_i^2}{n_p n_q}} \qquad (7)$$

where n = Σ_{i=1}^k n_i.

Theorem 9 A clustering Γ_0 for which conditions (6) and (7), imposing constraints on the gap g between clusters, hold is an optimal clustering, that is, one with the lowest value of Q(Γ) among all partitions of the same cardinality as Γ_0.
The proof has been postponed to Appendix A.3. Therefore we may call the above-mentioned well-separatedness absolute clustering.
Definition 4 A clustering is called absolute if conditions (6) and (7) imposing constraints on the gap between clusters g hold.
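The gap bound of condition (7) is easy to evaluate for given cluster cardinalities and radii; the following sketch (our own helper, with the formula reconstructed from the proof of Theorem 9) returns the bound maximized over all pairs p ≠ q:

```python
import math

def gap_bound(cardinalities, radii):
    """Lower bound on the gap g between every pair of cluster-enclosing
    balls, per condition (7):
        g >= k * sqrt((n_p + n_q + n) * sum_i n_i r_i^2 / (n_p n_q)),
    maximized over all pairs p != q."""
    k = len(cardinalities)
    n = sum(cardinalities)
    s = sum(ni * ri * ri for ni, ri in zip(cardinalities, radii))
    return max(
        k * math.sqrt((np_ + nq + n) * s / (np_ * nq))
        for p, np_ in enumerate(cardinalities)
        for q, nq in enumerate(cardinalities) if p != q)
```

A dataset generator for absolute clusterings would place cluster centers at pairwise ball-to-ball distances of at least this value.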
One sees immediately that inner cluster consistency is kept, this time in terms of global optimum, under the restraint to k clusters.
Theorem 10 k-means-ideal, applied to a dataset with gaps between intrinsic clusters amounting to g plus the radii of the clusters between which the gap is measured, has Kleinberg's consistency property.
The proof is straightforward.

Theorem 4 related experiments
Experiments have been performed to check whether Theorem 4, which contradicts Kleinberg's findings for one-dimensional space, really holds. Samples were generated from the uniform distribution (sample sizes 100, 200, 400, 1000, 2000, 4000, 10000), and each sample was clustered into k clusters for each k = 2, ..., floor(√samplesize) using k-means (the R implementation) with 100k restarts. Subsequently, a Γ-transformation was performed in which the distances within a cluster were decreased by a randomly chosen factor (a separate factor for each pair of neighboring data points), and at the same time the clusters were moved apart so that the distance between elements of distinct clusters did not decrease. Then k-means clustering was performed with 100k restarts in two variants. The first variant used random initialization. The second variant was initialized at the midpoints of the original (rescaled) cluster intervals. Additionally, for control purposes, the original samples were reclustered. The number of partitions for which errors in restoring the original clustering were observed was counted. Experiments were repeated ten times. Table 1 presents the averaged results obtained.
In this table, looking at the errors for variant 1, we see that more errors are committed as the sample size (and hence the maximal k) increases. This contrasts with variant 2, where the number of errors is negligible. The second variant differs from the first in that the seeds are distributed so that there is one in each intrinsic cluster.
Clearly, Theorem 4 holds (as visible from variant 2). At the same time, however, the table shows that k-means with random initialization is unable to initialize properly for a larger number k of clusters, in spite of the large number of restarts (variant 1). This is confirmed by the experiments with reclustering the original data.
This study also shows how a test data generator may work when comparing variants of the k-means algorithm (for one-dimensional data).
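The one-dimensional transformation used in this experiment can be sketched as follows (our own illustrative generator): every within-cluster gap between neighboring points is shrunk by its own random factor, and each cluster is shifted right by the accumulated shrinkage, which guarantees that no between-cluster distance decreases.

```python
import random

def gamma_transform_1d(clusters, seed=0):
    """1-D consistency transformation in the spirit of the experiment:
    each gap between neighboring points of a cluster is shrunk by its
    own random factor, and each cluster is shifted right by the
    accumulated shrinkage so no between-cluster distance decreases.
    `clusters` is a list of sorted lists of floats, left to right."""
    rng = random.Random(seed)
    out, shift = [], 0.0
    for c in clusters:
        gaps = [(b - a) * rng.uniform(0.1, 0.9) for a, b in zip(c, c[1:])]
        shrinkage = (c[-1] - c[0]) - sum(gaps)
        shift += shrinkage               # compensates the contraction
        xs = [c[0] + shift]
        for gp in gaps:
            xs.append(xs[-1] + gp)
        out.append(xs)
    return out
```

By Theorem 4, a k-means-ideal run on the transformed sample should restore the original partition, which is exactly the a-priori-known expected output a test generator needs.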

Theorem 5 related experiments
A simulation was performed concerning the relocation of points of the line segments AB, AC from the proof of Theorem 5.
The results are presented in Table 2. The top row, named ∠C'AB, gives the angle between the line segments AC and AB after the rotation of the AB and AC line segments upon the Γ-transformation. The effects of this rotating transformation are measured by the following quantity:
- wrong Γ: the number of k-means clustering errors compared to the original clustering before the Γ-transformation (consisting in the rotation of AB, AC), out of 4000 data points in both clusters.
Initially, the angle CAB between the line segments AB, AC was a right angle (π/2). As shown in Table 2, the angle between these line segments was decreased in steps of π/20 down to π/20 and the clustering using k-means (with 50 restarts) was performed.
The k-means algorithm, applied to the data set AB ∪ AC ∪ DE ∪ DF, returned, as expected, two clusters: AB ∪ AC and DE ∪ DF (the column π/2). As visible in the row wrong Γ, the number of clustering errors compared to the original clustering increased with rotation, up to over 4% of data points being misclassified. It is apparent that k-means is in fact not consistent in three dimensions, as claimed in Theorem 5.
In order to better illustrate the importance of the concept of gravitational consistency, an experiment was performed related to equation (5) (first line). As previously, a data set related to the AB ∪ AC subsets for appropriate rotations of the line segments AB, AC was considered. This data set was split into two parts: (1) subcluster Z_1, consisting of points with distance to A not higher than 20, and (2) subcluster Z_2, consisting of the remaining points; Z_1 ∪ Z_2 = AB ∪ AC. While the rotation was performed, the following statistics of Z'_1, Z'_2, the images of Z_1, Z_2 after rotation, were observed:
- μ_sc1: the distance between the mean of the cluster AB ∪ AC and the mean of subcluster Z_1,
- μ_sc2: the distance between the mean of the cluster AB ∪ AC and the mean of subcluster Z_2,
- SS_sc1: the contribution of subcluster Z_1 to the sum of squares of AB ∪ AC,
- SS_sc2: the contribution of subcluster Z_2 to the sum of squares of AB ∪ AC.
When the angle ∠C'AB was decreased (the Γ-transformation), the distances between points within both subsets Z_1, Z_2, as well as between the two subsets, decreased. So did the distance between the gravity center of the entire data set A'B' ∪ A'C' and the gravity center of the second subset Z_2, as visible in the row μ_sc2 of Table 2. However, the distance between the gravity center of the entire data set A'B' ∪ A'C' and the gravity center of the first subset Z_1 was increasing, as visible in the row μ_sc1 of Table 2. Also the contribution of this subset to the overall sum of squares of the entire set was increasing, as visible from the row SS_sc1 of Table 2. This demonstrates that the Γ-transformation, though decreasing the distances between cluster data points, does not necessarily decrease the distance between subcluster centers and the cluster center, which results in the inconsistency of k-means under Kleinberg's Γ-transformation.

Theorem 7 related experiments
Experiments were also performed referring to the Theorem 7 and the results are summarized in Table 3. The following metrics were used.
- α: the contraction coefficient from Theorem 7,
- wrong α: the number of k-means clustering errors compared to the original clustering before the Γ-gravitational transformation of the AB, AC cluster.
The experiments were performed on the same data as in the previous subsection. The Γ-gravitational transformation was performed for the (original) cluster AB ∪ AC with α as indicated in the row α. The choice of α was based on the requirement that the Γ-transformation and the Γ-gravitational transformation should yield a resulting cluster with the same variance of the data points after transformation. As visible in the row wrong α, no error in data clustering was induced by the Γ-gravitational transformation, as expected from Theorem 7.

Conclusions
In this paper, we have provided a definite answer to the problem of whether or not the k-means algorithm possesses the consistency property. The answer is negative, except for one-dimensional space. Settling this problem was necessary because Kleinberg's proof concerning this property was inappropriate for the real application area of k-means, that is, a fixed-dimensional Euclidean space. The result precludes usage of the consistency axiom as a generator of test examples for the k-means clustering function (except for one-dimensional data) and implies the need to seek alternatives. We proposed gravitational consistency, generalized gravitational consistency, and dataset consistency as alternatives to Kleinberg's consistency property. The Γ-gravitational transformation, as an alternative to the Γ-transformation, preserves the k-means clustering, but it is a bit rigid, because it keeps the proportions between distances within a single cluster. The generalized Γ-gravitational transformation does not have this disadvantage, though there is still some rigidness as far as the changes in distances are concerned. The dataset consistency transformation is more flexible but requires quite large distances between the clusters. We believe, however, that these three alternatives can still generate a sufficient set of datasets for software tests. Note that an orientation on k-means is not too serious a limitation of usefulness, as quite a large number of modern clustering algorithms encompass k-means clustering, just to mention the whole branch of spectral clustering.
Kleinberg's consistency was the subject of strong criticism, and new variants were proposed, like monotonic consistency (Strazzeri & Sánchez-García, 2018) or MST-consistency (Zadeh, 2010); see also the criticism in Carlsson and Mémoli (2010) and Correa-Morris (2013). The mentioned new definitions of consistency are apparently restrictions of Γ-consistency, and therefore Theorem 4 remains valid for them. Monotonic consistency seems not to impose restrictions affecting Kleinberg's proof of k-means violating consistency; therefore, in those cases, the consistency of k-means under higher dimensionality needs to be investigated. Note that we have also challenged the result of Wei (2017), who claims that Kleinberg's consistency may be achieved by k-means with random initialization (see our Theorem 5). The shift of axioms from the clustering function to the quality measure (Ben-David & Ackerman, 2008) was suggested as a way around the problems with consistency, but this approach fails to tell what the outcome of clustering should be, which makes it not useful for the mentioned test-generator application.
It should be noted that, besides the Kleinberg axiomatic system, other axiomatic frameworks have been proposed which may serve as foundations for deriving new test data sets from existing ones. For example, for unsharp partitioning there was a proposal of an axiomatic system by Wright (1973), for graph clustering by van Laarhoven and Marchiori (2014), for cost-function-driven algorithms by Ben-David and Ackerman (2009), for linkage algorithms by Ackerman et al. (2010), for hierarchical algorithms by Carlsson and Mémoli (2010), Gower (1990), and Thomann et al. (2015), for multiscale clustering by Carlsson and Mémoli (2008), for settings with increasing sample sizes by Hopcroft and Kannan (2012), for community detection by Zeng et al. (2016), and for pattern clustering by Shekar (1988). They were not investigated here and are a bit hard to compare, because they were proposed for different classes of clustering algorithms that do not cover the setting relevant for k-means, that is, the embedding in Euclidean space and the partitioning not only of the sample but of the sample space.
A.1 Proof of Theorem 4
Proof Due to the nature of k-means, let each cluster of each partition after the Γ-transform be represented as an interval not intersecting with any other cluster of the same partition. For Γ_1, this holds before the transform; therefore, it holds afterward. Γ_2 shall be the competing optimal partition after the transform; therefore, it holds for Γ_2 afterward for sure. We intend to demonstrate that under the Γ_1 transformation, that is, assuming that the intrinsic partition is Γ_1, the target function of k-means for Γ_1 will decrease by not less than that for Γ_2. For simplicity, assume that the indices of the clusters grow with the growing value of the cluster center.
For this purpose, assume that C_ij = C_{i.} ∩ C_{.j} are the non-empty intersections of clusters C_{i.} ∈ Γ_1 and C_{.j} ∈ Γ_2 of both partitions. Define minind(C_{i.}), resp. maxind(C_{i.}), as the minimal/maximal index j such that C_ij is not empty. Q(Γ_1) will be the sum of the centered sums of squares over all C_ij, plus the squared distances of the centers of all C_ij to the center of C_{i.}, times the cardinality of C_ij (easily derived from formula (2)).
Q(Γ_2) can be computed analogously, but let us follow a slightly different path (starting from formula (3)).
Both summand types of Q(Γ_1), that is Σ_{x ∈ C_ij} (x − μ(C_ij))^2 and the pairwise terms (μ(C_ij) − μ(C_ij'))^2 into which the between-subcluster part decomposes, will decrease upon the Γ_1-based consistency transformation. Each Σ_{x ∈ C_ij} (x − μ(C_ij))^2 decreases because the distance between each pair of elements of C_ij decreases, as they are all in the same cluster C_{i.}. Each (μ(C_ij) − μ(C_ij'))^2 decreases because all the elements constituting C_ij and C_ij' belong to the same cluster C_{i.}. Hereby, in one dimension, there is always an extreme data point P_ij ∈ C_ij separating it from C_ij'. As the points of both C_ij and C_ij' get closer to P_ij under the Γ_1 transformation, so the centers of both C_ij and C_ij' get closer to P_ij, so that they move closer to each other. As far as the summands of Q(Γ_2) are concerned, those of the form Σ_{x,y ∈ C_ij} (x − y)^2 will also decrease upon the Γ_1 transformation. But Q(Γ_2) does not decrease by the same absolute value as Q(Γ_1), since its remaining summands, Σ_{x ∈ C_ij, y ∈ C_i'j} (x − y)^2 with i ≠ i', will increase, because x, y stem from different clusters of Γ_1. Therefore, if Γ_1 was the optimal clustering for the k-means cost function prior to the Γ_1 transformation, it will remain so afterward.

A.2 Proof of Theorem 5
Proof Let A, B, C, D, E, F be points in three-dimensional space with coordinates: A(1, 0, 0), B(33, 32, 0), C(33, −32, 0), D(−1, 0, 0), E(−33, 0, −32), F(−33, 0, 32). Let S_AB, S_AC, S_DE, S_DF be sets of, say, 1000 points each, randomly uniformly distributed over the line segments (except for endpoints) AB, AC, DE, DF resp. Let X = S_AB ∪ S_AC ∪ S_DE ∪ S_DF. k-means with k = 2 applied to X yields the partition Γ = {S_AB ∪ S_AC, S_DE ∪ S_DF}, as expected (see Fig. 3, left). Let us perform a Γ-transformation consisting of rotating the line segments AB, AC around the point A, in the plane spanned by the first two coordinates (X and Y), towards the first coordinate axis (the X axis), so that the angle between this axis and each of AB and AC is, say, one degree. To verify that this is a Γ-transformation, consider a point P on the line segment AB and a point Q on the line segment AC. Their distance amounts to |PQ| = sqrt(|PA|^2 + |AQ|^2 − 2 cos(∠BAC) |PA| |AQ|). Let the images of P, Q be P', Q' resp., whereby obviously |P'A| = |PA|, |AQ'| = |AQ|, and ∠B'AC' < ∠BAC. Therefore |P'Q'| < |PQ|, as expected for a Γ-transformation for points of the same cluster. Let us now consider a point R on the line segment DE and the distance |RP| between points from two different clusters. Let R_x, P_x be the orthogonal projections of R, P onto the X axis, resp. R and P lie in two orthogonal planes, spanned by the X, Z and X, Y axes, resp. Therefore |RP| = sqrt(|RR_x|^2 + |R_xP_x|^2 + |P_xP|^2), and under the rotation towards the X axis the gap |R_xP_x| grows by more than enough to offset the shrinking of |P_xP|, so that |RP'| ≥ |RP|, as required for points from different clusters. Now k-means with k = 2 yields a different partition, splitting the line segments AB and AC (see Fig. 3, right).

Fig. 3 Inconsistency of k-means in 3D Euclidean space. Left picture: data partition before the consistency transform. Right picture: data partition after the consistency transform.
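The geometry of this proof can be checked numerically. The following sketch (our own parameterization of the construction, with A = (1, 0, 0) and segments at angle θ to the X axis) lets one verify that the rotation shrinks within-cluster distances, preserves distances along each segment, and does not decrease between-cluster distances:

```python
import math

def seg_point(t, theta, sign):
    """Point at arc length t along a segment from A(1,0,0) making angle
    theta with the X axis in the XY plane (sign = +1 for AB, -1 for AC)."""
    return (1 + t * math.cos(theta), sign * t * math.sin(theta), 0.0)

def de_point(s):
    """Point at arc length s along DE, from D(-1,0,0) toward E(-33,0,-32);
    the direction (-32, 0, -32) normalizes to (-1/sqrt(2), 0, -1/sqrt(2))."""
    c = 1 / math.sqrt(2)
    return (-1 - s * c, 0.0, -s * c)
```

Originally each segment makes an angle of π/4 with the X axis (so ∠CAB = π/2); the transformation reduces this angle to one degree.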

A.3 Proof of Theorem 9
Proof In particular, let us consider the set of k clusters Γ = {C 1 , . . . , C k } of cardinalities n 1 , . . . , n k and with radii of balls enclosing the clusters (with centers located at cluster centers) r 1 , . . . , r k .
We are interested in a gap g between clusters such that it does not pay to split each cluster C_i into subclusters C_i1, ..., C_ik and to combine them into a set of new clusters S = {S_1, ..., S_k} such that S_j = ∪_{i=1}^k C_ij. We seek a g such that the highest possible central sum of squares combined over the clusters C_i is lower than the lowest conceivable combined sum of squares around the respective centers of the clusters S_j. Let Var(C) be the variance of the cluster C (the average squared distance to the cluster gravity center), and let n_ij = |C_ij|. Let r_ij be the distance of the center of subcluster C_ij to the center of cluster C_i. Let v_ilj be the distance of the center of subcluster C_ij to the center of subcluster C_lj. The total k-means cost function for the set of clusters (C_1, ..., C_k) amounts to:

$$Q(C) = \sum_{i=1}^{k} \sum_{j=1}^{k} \left( n_{ij} Var(C_{ij}) + n_{ij} r_{ij}^2 \right) \qquad (10)$$

And the total k-means cost function for the set of clusters (S_1, ..., S_k) amounts to:

$$Q(S) = \sum_{j=1}^{k} \left( \sum_{i=1}^{k} n_{ij} Var(C_{ij}) + \frac{1}{\sum_{i=1}^{k} n_{ij}} \sum_{i<l} n_{ij} n_{lj} v_{ilj}^2 \right) \qquad (11)$$

Should (C_1, ..., C_k) constitute the absolute minimum of the k-means target function, then Q(S) ≥ Q(C) should hold. This implies:

$$\sum_{j=1}^{k} \frac{1}{\sum_{i=1}^{k} n_{ij}} \sum_{i<l} n_{ij} n_{lj} v_{ilj}^2 \ge \sum_{i=1}^{k} \sum_{j=1}^{k} n_{ij} r_{ij}^2 \qquad (12)$$

To maximize Σ_{j=1}^k n_ij r_ij^2 for a single cluster C_i of enclosing-ball radius r_i, note that one should set r_ij to r_i. Let m_j = argmax_{j ∈ {1,...,k}} n_ij. If we set r_ij = r_i for all j except m_j, then the maximal r_im_j is delimited by the relation Σ_{j≠m_j} n_ij r_ij ≥ n_im_j r_im_j (the weighted subcluster centers must balance around the cluster center). So:

$$\sum_{j=1}^{k} n_{ij} r_{ij}^2 \le 2 n_i r_i^2 \qquad (13)$$

So if we can guarantee that the gap between the cluster balls (of clusters from Γ) amounts to g, then surely:

$$\sum_{j=1}^{k} \frac{1}{\sum_{i=1}^{k} n_{ij}} \sum_{i<l} n_{ij} n_{lj} v_{ilj}^2 \ge g^2 \sum_{j=1}^{k} \frac{1}{\sum_{i=1}^{k} n_{ij}} \sum_{i<l} n_{ij} n_{lj} \qquad (14)$$

because in such a case g ≤ v_ilj for all i, l, j. By combining inequalities (12), (13) and (14), we see that the global minimum is granted if the following holds:

$$g^2 \sum_{j=1}^{k} \frac{1}{\sum_{i=1}^{k} n_{ij}} \sum_{i<l} n_{ij} n_{lj} \ge 2 \sum_{i=1}^{k} n_i r_i^2 \qquad (15)$$

One can distinguish two cases: either (1) there exists a cluster S_t containing two subclusters C_pt, C_qt such that t = argmax_j n_pj and t = argmax_j n_qj (maximum-cardinality subclusters of their respective original clusters C_p, C_q), or (2) not.
Consider the first case. Let C_p, C_q be the two clusters for which C_pt and C_qt are the subclusters of highest cardinality within C_p, C_q resp. This implies that n_pt ≥ n_p/k and n_qt ≥ n_q/k. It also implies that for i ≠ p and i ≠ q, n_it ≤ n_i/2. Hence:

$$\sum_{j=1}^{k} \frac{1}{\sum_{i=1}^{k} n_{ij}} \sum_{i<l} n_{ij} n_{lj} \ge \frac{n_{pt} n_{qt}}{n_p/2 + n_q/2 + \sum_{i=1}^{k} n_i/2} = \frac{n_{pt} n_{qt}}{n_p/2 + n_q/2 + n/2} \ge \frac{1}{k^2} \frac{n_p n_q}{n_p/2 + n_q/2 + n/2}$$

So, in order to fulfil inequality (15), it is sufficient to require that:

$$g \ge \sqrt{\frac{2 \sum_{i=1}^{k} n_i r_i^2}{\frac{1}{k^2} \frac{n_p n_q}{n_p/2 + n_q/2 + n/2}}} = k \sqrt{\frac{(n_p/2 + n_q/2 + n/2) \cdot 2 \sum_{i=1}^{k} n_i r_i^2}{n_p n_q}} = k \sqrt{\frac{(n_p + n_q + n) \sum_{i=1}^{k} n_i r_i^2}{n_p n_q}} \qquad (16)$$

This is, of course, maximized over all combinations of p, q.