Introduction

Clusterability: The Intuition and the Failed Formalizations

It is a commonly observed phenomenon that the optimality criteria of most practically used clustering algorithms (like k-means) are computationally hard in theory (NP-hard to optimize). Nevertheless, these algorithms perform quite well in practice, converging quickly enough to yield usable output in many (though not all) applications. Apparently, some data sets must be better clusterable than others.

Though a number of attempts have been made to formally capture the intuition behind clusterability, none of these efforts seems to have been successful, as Ben-David shows in depth in [10]. He points at three important shortcomings of the current state of the art: clusterability cannot be checked prior to applying a potentially NP-hard clustering algorithm to the data, known clusterability criteria impose strong (impractical) separation constraints, and the research hardly addresses popular algorithms. A recent paper by Ackerman et al. [1] partially eliminates some of these problems, but regrettably at the expense of introducing user-defined parameters that do not seem to be intuitive (in terms of one's imagination of what well-clusterable data are).

What Shall be Excluded from/Included into a Clusterability Criterion

Therefore, a different approach to defining what well-clusterable data are is attempted in this paper. As Ben-David mentioned, research in the area does not address popular algorithms, except for the \(\epsilon\)-separatedness clusterability criterion related to k-means proposed by Ostrovsky et al. [20]. This paper is intended to contribute to the applicability of clusterability criteria in that, following Ostrovsky's example, k-means is considered, and in particular its special version called k-means++, as in fact Ostrovsky did. Footnote 1

Furthermore, Ben-David expressed the concern that shifting the NP-hardness from the clustering algorithm to the clusterability-checking algorithm is no solution, because the problem only becomes worse. This paper addresses this concern. Last but not least, the issue of the impractically large gaps imposed by the clusterability criteria in the literature has been raised. Ben-David argues in [10] (see his "Smaller Gaps Between Clusters" section) that the efforts in clusterability research to support the hypothesis that "clustering is hard only if the clustering does not matter" have apparently failed, mainly because the gaps between clusters required by these criteria are too large for practical applications, while popular algorithms behave reasonably even with significantly smaller gaps. But a closer look at the k-means objective shows (see "Failure of Gaps as a Criterion for Clusterability" section) that it does not make sense to build a clusterability criterion solely on the size of the gaps, because the k-means criterion depends not only on the gaps between clusters, but also on their cardinalities and their internal spread, which cannot be known in advance. If the cardinalities change, k-means may prefer to split a larger cluster and merge its parts with smaller, though clearly separated, clusters. If the k-means criterion is to coincide with a separation of clusters, the gaps need to be large. Therefore, instead of seeking smaller gaps, this research concentrates on redefining the goal of clusterability research. It is proposed here to change the perspective: instead of (or in addition to) seeking conditions under which clustering a given data set is easy, one should look for a definition of a data set (or a data set generator) for which the optimal clustering is known in advance and an algorithm returns this ground truth nearly for sure. This change of perspective leads to a practical application of the clusterability concept, namely testing the behaviour of algorithms under varying degrees of violation of the clusterability conditions.

Having been freed from the need to seek the smallest possible gaps, one can also weaken the problem of the NP-hardness of deciding the clusterability of a data set. In particular, one should not require that it be decided beforehand (before clustering) whether or not the data is well-clusterable. Instead, one should require that it can be decided a posteriori, in polynomial time, whether or not the data is well-clusterable according to the assumed well-clusterability criteria. Note that this is substantial progress over the clusterability criteria defined so far. None of the clusterability criteria discussed by Ben-David [10] fits this requirement, and the criterion proposed in [1] can be shown to be invalid for simple data sets (see “Failure of Multi-modality Detection for Detection of Gaps Between Data” section, Fig. 1). In this way, the paper resolves a serious bottleneck in clusterability research. The issue of measuring the deviation from well-separatedness is not covered here. Nonetheless, the presented clusterability criterion opens a way to attack this issue as well, because the criterion is verifiable ex post and the data can be checked for clusterability by at least one popular algorithm.

Fig. 1 Illustration of a special case where Ackerman’s method [1] would falsely recognize clustering structure in the data: a the data, b the histogram of pair-wise distances—two modes visible

Clusterability Challenge in the k-Means Algorithm Family

This study is restricted to the k-means family of algorithms. The restriction is not too serious, as the algorithms of this family are broadly used and exist in a large number of variants, starting with the early work of Lloyd, Forgy, MacQueen and Hartigan–Wong, through k-means++, spherical k-means, their fuzzified versions, kernel variants, k-harmonic means, and many others. Footnote 2 k-means is also used as a component of other algorithms, such as those of spectral analysis [23].

While studying clusterability, we concentrate on the initialization phase of the k-means++ algorithm, introduced by Arthur and Vassilvitskii [5]. It is a well-known problem of the basic k-means version (with random initialization) that it tends to get stuck in local minima if the initial seeding is not very good. Ways to overcome this shortcoming have been sought, like the mentioned k-means++ algorithm, which ensures better seeding. Other approaches, like k-harmonic-means by Zhang et al. [26], power-k-means by Xu et al. [25] and entropy-regularized-power-k-means by Chakraborty et al. [13], attempt to reduce the dependence of the performance on the initialization. Their advantage is that they smooth the landscape of the optimized function so that the chance of getting stuck in local optima is reduced. These approaches are, however, not suitable for our purposes. Entropy-regularized-power-k-means has a target function (quality function or optimization criterion) that is definitely different from that of k-means; although many interesting theoretical properties, like strong consistency, have been shown, these properties do not guarantee reaching its global optimum. The k-harmonic-means target function also differs from that of k-means, though it may be used as an initial phase of the algorithm (instead of a clever seeding) followed by proper k-means, and it was shown empirically that better target function values are found. Only power-k-means has a target function that agrees with that of k-means. However, its complex mathematical form makes it difficult to derive a clusterability criterion that would be superior to the one proposed here for k-means++, though this would be worth investigating. Nonetheless, any data set clusterable for k-means++ should also turn out to be clusterable for power-k-means, as the latter is less sensitive to bad initialization (which can still happen with k-means++).

The following challenges have to be faced within the k-means family according to Ben-David [10]:

  • the clusterability criteria in the literature (e.g. Ostrovsky et al. [20]) refer to the optimal cost function value of k-means [see Eq. (1)]—but the actual value of this optimal solution is not known

  • people are accustomed to associating well-clusterable data with data exhibiting large gaps between clusters—but the optimal cost function of k-means is also influenced by cluster sizes, so that a gap sufficient for one set of clusters will prove insufficient for another (see “Failure of Gaps as a Criterion for Clusterability” section)

  • the cost function of k-means usually has multiple local minima, and real-world k-means algorithms usually get stuck at some local minimum (see e.g. [24, Chapter 3]).

For these reasons, when comparing the results of various k-means brands on real data, it is hard to distil the reason why their results differ: is it because the data are not clusterable, because the cost function optimum does not agree with a common-sense split into well-separated clusters, or because the algorithm is unable to discover the optimal clustering (i.e. systematically misses it)?

Therefore, a clusterability criterion is sought in this paper such that:

  • it is based on the gap size between clusters and other cluster characteristics, which can be computed by inspecting an obtained clustering (without referring to the optimal one),

  • if the clustering obtained meets the clusterability criteria, then this is the real optimal clustering,

  • if a special algorithm (k-means++ is meant here) fails to find a clustering meeting the clusterability criteria, then with high probability the data is not well-clusterable at all by any algorithm,

  • it is possible to generate data sets matching the clusterability conditions for various constellations of cluster sizes (cardinality, spread), dimensionalities, numbers of clusters, etc.

Given such a tool at one's disposal, one can investigate an algorithm's capability to find the optimal clustering in the easy case, compare algorithms on their performance in this easy case, and then compare their relative performance when the clusterability property degenerates, for example via decreasing the size of the gap between clusters.

Our Contribution

This research is confined to providing the tool in terms of the new clusterability criterion, and only a small demonstration is made of how the degenerative behaviour of algorithms may be studied. Our contribution encompasses:

  • Two brands of well-clusterability criteria for data to be clustered via k-means algorithm, that can be verified ex post (both positively and negatively) without great computational burden [inequalities (2) and (3) in “Our Basic Approach to Clusterability” section, and inequalities (15) and (16) in “Smaller Gaps Between Clusters” section].

  • Demonstration that the structure of well-clusterable data (according to these criteria) is easy to recover [see Theorems 1(i) and 5(i)].

  • Demonstration that if a well-clusterable data structure (in that sense) was not discovered by k-means++, then there is no such structure in the data [with high probability—see Theorems 1(ii) and 5(ii)].

  • Demonstration that large gaps between data clusters are not sufficient to ensure well-clusterability by k-means (see “Failure of Gaps as a Criterion for Clusterability” section).

Paper Structure

The structure of this paper is as follows: in “The Problem of Clusterability in the Previous Work” section, the previous work on the topic of clusterability is recalled and a brief introduction to the k-means algorithm and its special case k-means++ is given. “Failure of Gaps as a Criterion for Clusterability” section shows that large gaps are not sufficient for well-clusterability. “Our Basic Approach to Clusterability” section introduces the first version of well-clusterability concept and shows that data well-clustered in this sense are easily learnable via k-means++. This concept has the drawback that no data points (outliers) can lie in wide areas between the clusters. Therefore “Core-Based Approach to Clusterability” section proposes a core-based well-clusterability concept and shows that data well-clustered in this sense are also easily learnable via k-means++. The concept of cluster core itself is introduced and investigated in “Smaller Gaps Between Clusters” section and a method determining proper gap size under these new conditions is derived in “Core-Based Global k-Means Minimum” section. In “Experimental Results” section, some experimental results are reported concerning performance of various brands of k-means algorithms for data fulfilling the clusterability criteria proposed in this paper. “Discussion” section contains a brief comparison of our clusterability criteria with those discussed by Ben-David [10]. In “Conclusion” section some conclusions are drawn.

The Problem of Clusterability in the Previous Work

Clusterability Property: Easy to Use, Hard to Verify

Intuitively, clusterability, as suggested by Ackerman et al. [2], should be a function taking a set of points and returning a real value saying how “strong” or “conclusive” the clustering structure of the data is. This intuition, however, has not been formalized in a uniform way. A large number of formal definitions have been proposed. Ackerman and Ben-David [2] studied several of them. They concluded that two phenomena co-occur across the various formalizations: on the one hand, well-clusterable data sets (with a high “clusterability” value) are computationally easy to cluster (in polynomial time), but on the other hand, identifying whether or not the data is well-clusterable is NP-hard.

Major Brands of Clusterability Criteria and Their Shortcomings

Ben-David [10] performed an investigation of the concepts of clusterability from the point of view of the capability of “not too complex” algorithms to discover the cluster structure. He verified negatively the working hypothesis that “Clustering is difficult only when it does not matter” (the CDNM thesis).

He considered the following notions of clusterability, present in the literature:

  • Perturbation Robustness meaning that small perturbations of distances/positions in space of set elements do not result in a change of the optimal clustering for that data set. Two brands may be distinguished: additive, proposed by Ackerman and Ben-David [2], and multiplicative ones, proposed by Bilu and Linial [12] (the limit of perturbation is upper-bounded either by an absolute value or by a coefficient).

  • \(\epsilon\)-Separatedness meaning that the cost of optimal clustering into k clusters is less than \(\epsilon ^2\) times the cost of optimal clustering into \(k-1\) clusters, by Ostrovsky et al. [20]—here an explicit reference to the k-means objective is made.

  • \((c, \epsilon )\)-Approximation-Stability by Balcan et al. [8] meaning that if the cost function values of two partitions differ by the factor c, then the distance (in some space) between the partitions is at most \(\epsilon\). As Ben-David recalls, this implies the uniqueness of optimal solution.

  • \(\alpha\)-Centre Stability by Awasthi et al. [7] meaning, for any centric clustering, that the distance of an element to its cluster centre is \(\alpha\) times smaller than the distance to any other cluster centre under optimal clustering.

  • \((1+\alpha )\) Weak Deletion Stability by Awasthi et al. [6] meaning that given an optimal cost function value \(\mathrm{OPT}\) for k centric clusters, then the cost function of a clustering obtained by deleting one of the cluster centres and assigning elements of that cluster to one of the remaining clusters should be bigger than \((1+\alpha )\cdot \mathrm{OPT}\).

Under these notions of clusterability, algorithms have been developed that cluster the data nearly optimally in polynomial time, provided the mentioned parameters satisfy certain constraints. However, these conditions seem to be rather extreme. Ben-David [10] points at concrete numerical disadvantages. For example, given \((c, \epsilon )\)-Approximation-Stability [8], polynomial-time clustering requires that, in the optimal clustering (besides its uniqueness), all but an \(\epsilon\)-fraction of the elements are 20 times closer to their own cluster centre than to every other cluster centre. \(\epsilon\)-Separatedness requires that an element be at least 200 times closer to its own cluster centre than to every other cluster element [20], and this is still insufficient if the clusters are not balanced; a ratio of \(10^7\) is deemed sufficient by these authors. \((1+\alpha )\) Weak Deletion Stability [6] demands that distances to other clusters be \(\log (k)\) times the “average radius” of the own cluster. Perturbational stability [2] induces an exponential dependence on the sample size.

Gap as a Component of Clusterability Criteria

One can draw an important conclusion from the concepts of clusterability mentioned above: people agree that a data set is well clusterable if each cluster is distant (widely separated) from the other clusters. This idea occurs in many other clusterability concepts. Epter et al. [15] consider the data as clusterable when the minimum between-cluster separation exceeds the maximum in-cluster distance (called elsewhere “perfect separation”). Footnote 3 Balcan et al. [9] propose to consider data as clusterable if each element is closer to all elements in its cluster than to all other data (called also “nice separation”). Footnote 4 k-means reflects the Balcan concept “on average”, that is, each element’s average squared distance to elements of the same cluster is smaller than the minimum (over other clusters) average squared distance to elements of a different cluster. Kumar and Kannan [18], explicitly concentrating on the k-means objective, define clusterability via a proximity condition as follows: any point projected onto the line connecting its own cluster centre and some other cluster centre should be closer to its own cluster centre by a “sufficiently large” gap depending on the number of clusters and inverted squared cluster cardinalities.
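These two separation notions can be checked directly on a labelled data set. The following sketch (our own illustration, not code from the cited papers) tests perfect separation in the sense of Epter et al. [15] and nice separation in the sense of Balcan et al. [9]; it assumes a NumPy array X of points and an integer label vector.

```python
import numpy as np

def pairwise_distances(X):
    """Full matrix of Euclidean distances between rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def perfect_separation(X, labels):
    """Epter et al. [15]: the minimum between-cluster distance exceeds
    the maximum within-cluster distance."""
    labels = np.asarray(labels)
    D = pairwise_distances(X)
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(X), dtype=bool)
    max_within = D[same & off_diag].max()
    min_between = D[~same].min()
    return min_between > max_within

def nice_separation(X, labels):
    """Balcan et al. [9]: every point is closer to each point of its own cluster
    than to any point of a different cluster."""
    labels = np.asarray(labels)
    D = pairwise_distances(X)
    n = len(X)
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        other = labels != labels[i]
        if own.any() and other.any() and D[i, own].max() >= D[i, other].min():
            return False
    return True
```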

Kushagra et al. [19] consider clusterability from the point of view of structure in the data. They allow for noise in the data, but insist that the noise does not create structures by itself. They refrain from optimizing a cost function. They show that discovery of clusters is not possible without the assumption of structure in the data or without the assumption of structureless noise. Ackerman et al. [3] consider clusterability from the perspective of distortion of clusters by malicious points. It turns out that from this perspective k-means performs better than various other algorithms. Relevant to our research, they also insist that the proportions between cluster sizes play a significant role in ensuring proper clustering.

Clusterability as a Function of Both Data and the Algorithm

Ackerman and Dasgupta [4] shift the focus from clusterability as a property of the data alone to a property of the pair (data type, algorithm type). In that paper, they are interested in incremental algorithms only and show that an incremental version of k-means performs poorly under perfect and nice separation. In the same spirit, Ben-David and Haghtala [11] investigated clusterability by k-centroidal algorithms (a class of algorithms including k-means), robustifying an algorithm against noise in the data by either clustering the noise into separate clusters or cutting off too distant points. Tang [22] investigates a clusterability criterion for his own version of k-means, based on the requirement that the cluster centres are separated by some distance, which depends on the ground-truth optimal clustering. Recently, Ackerman et al. [1] derived a method for testing clusterability of data based on the large-gap assumption. They investigate the histogram of (all) mutual dissimilarities between data points. If there is no structure in the data, the distribution should be unimodal. If there are distant clusters, then one mode will occur for short distances (within clusters) and at least one for long distances (between clusters). Hence, to detect clusterability, they apply tests of multi-modality, namely the Dip and Silverman tests, as suggested by Hartigan and Hartigan [16] and Silverman [21], respectively. But the criterion of a sufficiently large gap between clusters is not reflected in various clustering objective functions; for example, k-means may reach its optimum at poorly separated clusters despite the existence of an alternative partition of the data with a clear separation between clusters, as demonstrated in “Failure of Gaps as a Criterion for Clusterability” section. It will also be demonstrated in “Failure of Multi-modality Detection for Detection of Gaps Between Data” section that multimodal distributions can be detected by Ackerman’s method even if there is no structure in the data.

Clusterability as Agreement Between Global and Local Clustering

Cohen-Addad [14] raises the claim that data are clusterable (in terms of various stability criteria) if the global clustering can be well approximated by a local one. Our work can be perceived in this spirit in that we try to achieve coincidence of separability-based clusters with the global cost function minimum.

Clusterability Criteria not Verifiable a Priori

Ben-David [10] raises a further important point: it is impossible to verify a priori whether the data fulfils the clusterability criterion for practically all abovementioned methods (except [1], which has a flaw of its own, see “Failure of Multi-modality Detection for Detection of Gaps Between Data” section). The reason is that these criteria refer either to all possible clusterings or to the optimal clustering, so that one cannot verify in polynomial time whether or not the data set is clusterable before one starts clustering (computing the optimum is usually NP-hard). In this paper, however, we stress that the situation is even worse. Even upon termination of the clustering algorithm, one is unable to say whether or not the clustered data set turned out to be well-clusterable. For example, the \(\epsilon\)-Separatedness criterion requires that nearly optimal solutions for clustering into k and \(k-1\) clusters are known. While one can usually obtain upper approximations of the cost function in both cases, one actually needs a lower approximation for \(k-1\) in order to decide ex post whether the data was well-clusterable, and hence whether one has approximated the correct solution in some way. But this lower approximation for \(k-1\) can only be obtained if \(k-1=1\). Therefore, the issue is not decidable for \(k>2\). Tang’s [22] criterion is certainly better, though also based on a solution of the optimality criterion, because one can sometimes decide ex post that the clusterability criterion was fulfilled (the distance between clusters needs to be greater than a product of the optimal clustering cost function and inverted square roots of cluster cardinalities, which may be upper-bounded by the actual clustering cost function and the number 2). But even if the algorithm actually finds the optimal clustering, one cannot be sure that the solution is optimal even if the clusterability criterion is met.

The issue of an ex post decision on clusterability seems nevertheless to be simpler to solve than the a priori one; therefore, we attack it in this paper. We are unaware of this issue ever having been raised in the past. Though the criteria of [9, 15] can clearly be applied ex post to see if the clusterability criteria hold in the resulting clustering, these approaches fail to solve the inverse issue: what if the clusterability criteria are not matched by the resulting clustering? Is the data then non-clusterable? Could no other algorithm discover a clusterable structure? One should note at this point that the approach in [1] differs in this respect. Compared to methods requiring finding the optimum first, Ackerman’s approach seems to fulfil Ben-David’s requirement that one can see whether there is a clusterable structure in the data before starting the clustering process. Ackerman’s clusterability criterion can be computed in polynomial time, because the computation of the histogram of dissimilarities is quadratic in the sample size. But on closer investigation, Ackerman’s clusterability determination method misses one important point: it requires a user-defined parameter, and the user may or may not make the right guess. Furthermore, even if clusterability is confirmed by Ackerman’s tests, it is still uncertain whether the k-means algorithm will find a clustering that fits Ackerman’s clusterability criterion. Besides, as visible in Fig. 1, one can easily find counterexamples to their concept of clusterability: the left image shows a single cluster, but the histogram to the right has two modes, indicating a two-cluster structure.

In summary, the question of an a posteriori determination of whether the data were clusterable remains open. Therefore, it seems justified to restrict oneself to a problem as simple as possible in order to show that the issue is solvable at all. In this paper, we limit ourselves to the issue of clusterability for the purposes of the k-means algorithm. Footnote 5 Furthermore, we restrict ourselves to determining cases where clusterability is decidable “for sure”.

The first problem to solve is to remove the dependence on the undecidability of the optimality of the obtained solution.

A Brief Introduction to the k-Means Family

Before proceeding let us recall the definition of k-means cost function \(Q({\mathcal {C}})\) for the clustering \({\mathcal {C}}\).

$$\begin{aligned} Q({\mathcal {C}})=\sum _{i=1}^m\sum _{j=1}^k u_{ij}\Vert \mathbf{x }_i - \varvec{\mu }_j\Vert ^2 =\sum _{j=1}^k \frac{1}{2n_j} \sum _{{\mathbf {x}}_i\in C_j} \sum _{{\mathbf {x}}_l \in C_j} \Vert {\mathbf {x}}_i - {\mathbf {x}}_l\Vert ^2 \end{aligned}$$
(1)

for a dataset \({\mathbf {X}}\) under some partition \({\mathcal {C}}=\{C_1,\dots ,C_k\}\) into the predefined number k of clusters, \(C_1\cup \cdots \cup C_k={\mathbf {X}}\), where \(u_{ij}\) is an indicator of the membership of data point \(\mathbf{x }_i\) in the cluster \(C_j\), \(n_j=|C_j|\) is the cardinality of \(C_j\), and \(\varvec{\mu }_j=\frac{1}{|C_j|}\sum _{\mathbf{x }\in C_j}\mathbf{x }\) is its centre.

The k-means algorithm starts with some initial guess of the positions of \(\varvec{\mu }_j\) for \(j=1,\dots , k\) and then alternates two steps, cluster assignment and centre update, until some convergence criterion is reached, e.g. no changes in cluster membership. The cluster assignment step updates the \(u_{ij}\) values so that each element \(\mathbf{x }_i\) is assigned to the cluster represented by the closest \(\varvec{\mu }_j\). The centre update step uses the update formula \(\varvec{\mu }_j=\frac{1}{|C_j|}\sum _{\mathbf{x }\in C_j}\mathbf{x }\).
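A minimal sketch of these two alternating steps and of the cost of Eq. (1) (our own illustration; the convergence test is the simple “no membership change” rule mentioned above):

```python
import numpy as np

def kmeans_cost(X, labels, centres):
    """Q(C) of Eq. (1): sum of squared distances of points to their assigned centres."""
    return float(((X - centres[labels]) ** 2).sum())

def lloyd(X, centres, max_iter=100):
    """Alternate assignment and centre-update steps until memberships stop changing."""
    labels = None
    for _ in range(max_iter):
        # assignment step: each point joins the cluster of the closest centre
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # update step: each centre moves to the mean of its current cluster
        for j in range(centres.shape[0]):
            members = X[labels == j]
            if len(members) > 0:
                centres[j] = members.mean(axis=0)
    return labels, centres
```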

The k-means++ algorithm is a special case of k-means in which the initial guess of the cluster centres proceeds as follows. \(\varvec{\mu }_1\) is set to a data point uniformly sampled from \({\mathbf {X}}\). Each subsequent cluster centre is a data point picked from \({\mathbf {X}}\) with probability proportional to the squared distance to the closest cluster centre chosen so far. For details, see the paper of Arthur and Vassilvitskii [5]. Note that the algorithm proposed by Ostrovsky et al. [20] differs from k-means++ only by the non-uniform choice of the first cluster centre (the first pair of cluster centres should be distant, and this pair is chosen with probability proportional to the squared distances between data elements).
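The seeding itself can be sketched as follows (a minimal illustration of the D²-sampling described above; function and variable names are ours). Combined with the Lloyd iterations sketched earlier, this yields a basic k-means++ run.

```python
import numpy as np

def kmeanspp_seeding(X, k, rng=None):
    """Pick k seeds: the first uniformly at random, each further seed with probability
    proportional to the squared distance to the closest seed chosen so far."""
    rng = rng or np.random.default_rng()
    n = X.shape[0]
    seeds = [X[rng.integers(n)]]
    for _ in range(k - 1):
        S = np.array(seeds)
        d2 = ((X[:, None, :] - S[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        seeds.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(seeds)
```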

Non-suitability of Gap-Based Clusterability Criteria for k-Means

Failure of Multi-modality Detection for Detection of Gaps Between Data

Let us discuss more closely the relationship between the gap-based well-clusterability concepts developed in the literature and the actual optimality criterion of k-means. Specifically let us consider the approaches to clusterability of [1, 7, 9, 15].

Intuitively, if groups of data points occur in the data and there are large spaces between these groups, then it is these groups that should be chosen as the actual clustering. On the other hand, if there are no gaps between the groups of data points, then one would expect the data not to be considered well clusterable. Furthermore, if the data is well clusterable, one would expect a reasonable clustering algorithm to discover such a well-clusterable structure easily.

However, these intuitions prove wrong in the case of k-means. Let us first point out that the method of [1] may indicate a clear bimodal structure in data in which there are no gaps at all. We are unaware of anybody having pointed at this weakness of the well-clusterability concept of [1]: imagine a thin ring uniformly covered with data points (see Fig. 1a). One would be reluctant to say that there is a clustering structure in such data. Nonetheless, two obvious modes are visible in the distance histogram of such data. The thinner the ring (the closer to a circle), the more obvious the reason for the multi-modality becomes, as one gets closer and closer to the following limiting situation. Consider an angle \(\alpha\) centred at the centre of the circle (“thin ring”). To calculate distances between points, one can restrict oneself to angles with measure \(0^{\circ }\le \alpha \le 180^{\circ }\) (or \(0\le \alpha \le \pi\)). The number of elements within the angle is approximately proportional to this angle. The distance between the points at which the arms of this angle cut the circle of radius r amounts to \(x=2r\sin \frac{\alpha }{2}\); consequently, \(\alpha =2\arcsin \frac{x}{2r}\). To determine the density of distances, the derivative has to be computed: \(\frac{{\mathrm{d}}\alpha }{{\mathrm{d}}x} =\frac{1}{{\mathrm{d}}x/{\mathrm{d}}\alpha } = \frac{1}{{\mathrm{d}}\left( 2r\sin \frac{\alpha }{2}\right) /{\mathrm{d}}\alpha } =\frac{1}{r\cos \frac{\alpha }{2} } = \frac{1}{r\sqrt{1-\sin ^2 \frac{\alpha }{2} }} = \frac{1}{r\sqrt{1- \frac{x^2}{4r^2} }}.\) This density grows without bound as x approaches 2r, producing a mode of the distance distribution at large distances; this is reflected in the histogram in Fig. 1b, where \(r=5\) is assumed. As the points lie not on a circle but on a ring of non-zero thickness, an excess of distances close to zero also occurs in the histogram, producing the second mode.
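The counterexample is easy to reproduce numerically (our own sketch, not the code used to produce Fig. 1): sample points uniformly from a thin ring of mean radius r = 5 and inspect the histogram of all pairwise distances.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n, r, thickness = 1000, 5.0, 0.2

# points spread uniformly over a thin ring of mean radius r
angles = rng.uniform(0.0, 2.0 * np.pi, n)
radii = rng.uniform(r - thickness / 2, r + thickness / 2, n)
X = np.column_stack((radii * np.cos(angles), radii * np.sin(angles)))

# histogram of all pairwise distances (quadratic in the sample size)
diff = X[:, None, :] - X[None, :, :]
dists = np.sqrt((diff ** 2).sum(axis=2))[np.triu_indices(n, k=1)]

plt.hist(dists, bins=100)
plt.xlabel("pairwise distance")
plt.ylabel("count")
plt.show()  # cf. Fig. 1b: a pronounced peak near 2r and an excess of short distances
```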

Failure of Gaps as a Criterion for Clusterability

On the other hand, even if there are gaps between groups of data, for example those required by [7, 9, 15], the k-means optimum may not lie at the partition exhibiting the gap-based well-clusterability property, despite its existence, and this holds not only for these gaps, but also for gaps arbitrarily many times larger. The criterion of [15] may be considered a special case of [9], and Awasthi et al. [7] may in turn be viewed as a strengthening of the concept of [15]. Therefore, let us discuss a situation in which the perfect and the nice separation criteria coincide, that is, the case of two clusters. It will be shown that whatever \(\alpha\) is assumed in the \(\alpha\)-stability concept, k-means fails to be optimal under sufficiently unequal class cardinalities. Let these clusters, \(C_A,C_B\), each be enclosed in a ball of radius r, and let the distance between the ball centres be at least 4r. It has been demonstrated in [17] that under these circumstances the clustering of the data into \(C_A,C_B\) is a local minimum of the k-means cost function. But it is not the global minimum, as will be shown subsequently. So the criteria of Epter et al. [15], Balcan et al. [9] and Awasthi et al. [7] cannot be viewed as realistic definitions of well-clusterability, at least for k-means. Subsequently, whenever we say that a cluster is enclosed in a ball of radius r, we mean at the same time that the ball is centred at the gravity centre of the cluster.

Assume, for purposes of demonstration, that the two clusters have different cardinalities \(n_A, n_B\) with \(n_A> n_B\). It will be shown that, whatever the distance between the clusters, one can find a proportion \(n_A/n_B\) for which the clustering into \(C_A,C_B\) is not optimal. Consider a d-dimensional space and select the dimension that contributes most to the variance of cluster \(C_A\); the variance along this direction amounts to at least the overall variance divided by d. Denote this variance component by \(V_d\). Let the coordinate axis responsible for \(V_d\) have its origin at the cluster centre of \(C_A\), and project all points of cluster \(C_A\) onto this axis. The variance of the projected points is exactly \(V_d\). Split the projected data set into two parts \(P_1,P_3\): the points with coordinate above 0 and the rest. Let \(P_1',P_3'\) be the original data points that were projected onto \(P_1,P_3\), respectively. Then \(C_A= P_1' \cup P_3'\) and \(P_1'\cap P_3'=\emptyset\). Let the centres of \(P_1,P_3\) lie \(x_1,x_3\) away from the cluster centre. Let \(n_1\) data points of \(P_1\) be at most \(x_1\) distant from the origin, and \(n_2\) more than \(x_1\) from the origin. Similarly, let \(n_3\) data points of \(P_3\) be at most \(x_3\) distant from the origin, and \(n_4\) more than \(x_3\) from the origin. Obviously, \(n_1+n_2+n_3+n_4=n_A\), \(|P_1|=n_1+n_2\), \(|P_3|=n_3+n_4\). As zero is assumed to be the \(C_A\) cluster centre on this line, also \(x_1 \cdot (n_1+n_2)=x_3 \cdot (n_3+n_4)\) holds. Furthermore, as the cluster is enclosed in a ball of radius r centred at its gravity centre, both \(x_1\le r\) and \(x_3\le r\). Under these circumstances, let us ask whether a given \(V_d\) implies some minimal values of \(x_1,x_3\). If so, then by splitting the cluster \(C_A\) into \(P_1',P_3'\) and increasing the cardinality of \(C_A\), the split of the data into \(P_1', P_3'\cup C_B\) will deliver a lower Q value, so that the clustering into \(C_A,C_B\) will certainly not be optimal.

Note that \(V_d=({\mathrm{Var}}(P_1)+x_1^2)\cdot (n_1+n_2)+ ({\mathrm{Var}}(P_3)+x_3^2)\cdot (n_3+n_4))/n_A\). The \(n_1\) points of \(P_1\) closer to origin than \(x_1\) are necessarily not more than \(x_1\) distant from \(P_1\) gravity centre. Therefore, the remaining \(n_2\) points cannot be more distant than \(x_1\frac{n_1}{n_2}\). Hence \({\mathrm{Var}}(P_1)\le x_1^2 n_1+\left( x_1\frac{n_1}{n_2}\right) ^2n_2\). By analogy \({\mathrm{Var}}(P_3)\le x_3^2 n_3+\left( x_3\frac{n_3}{n_4}\right) ^2n_4\).

It is visible that

$$\begin{aligned} V_d\le \frac{ \frac{x_1^2 (n_1 +n_1^2/n_2 +n_1+n_2)}{ n_1+n_2}\cdot (n_1+n_2)+ \frac{x_3^2\cdot (n_3 +n_3^2/n_4 +n_3+n_4)}{(n_3+n_4)} \cdot (n_3+n_4) }{n_A} \end{aligned}$$

that is

$$\begin{aligned} V_d\le \frac{ x_1^2\cdot \left( n_1 +n_1^2/n_2 +n_1+n_2\right) + x_3^2\cdot \left( n_3 +n_3^2/n_4 +n_3+n_4\right) }{n_A}. \end{aligned}$$

Note that one can delimit \(n_2, n_4\) from below due to the relationship: \((r-x_1)\cdot n_2\ge n_1\cdot x_1\), \((r-x_3)\cdot n_4\ge n_3\cdot x_3\).

Therefore

$$\begin{aligned} V_d & {}\le \left( x_1^2\cdot (n_1 +n_1^2\cdot (r-x_1)/n_1/x_1 +n_1+n_2) \right. \\&\left. \quad +\,x_3^2\cdot (n_3 +n_3^2\cdot (r-x_3)/n_3/x_3 +n_3+n_4) \right) / n_A. \end{aligned}$$

Hence

$$\begin{aligned} V_d& {} \le \left( x_1^2\cdot (2\cdot n_1+n_2) +n_1^2\cdot (r-x_1)\cdot x_1/n_1 \right. \\&\left. \quad +\,x_3^2\cdot (2\cdot n_3+n_4) +n_3^2\cdot (r-x_3)\cdot x_3/n_3\right) / n_A \\ V_d& {} \le \left( x_1^2\cdot (2\cdot n_1+n_2) +n_1\cdot (r-x_1)\cdot x_1\right. \\&\left. \quad +\,x_3^2\cdot (2\cdot n_3+n_4) +n_3\cdot (r-x_3)\cdot x_3 \right) / n_A \\ V_d& {} \le \left( x_1^2\cdot (n_1+n_2) +n_1\cdot r\cdot x_1\right. \\&\left. \quad +\,x_3^2\cdot (n_3+n_4) +n_3\cdot r\cdot x_3 \right) / n_A. \end{aligned}$$

Recall that \(x_1 \cdot (n_1+n_2)=x_3 \cdot (n_3+n_4)\). Therefore

$$\begin{aligned} V_d& {} \le \left( x_1^2\cdot (n_1+n_2) +n_1\cdot r\cdot x_1 \right. \\&\left. \quad +\,(x_1\cdot (n_1+n_2)/(n_3+n_4)) ^2\cdot (n_3+n_4) +n_3\cdot r\cdot x_3 \right) / n_A \end{aligned}$$

which is equivalent to

$$\begin{aligned} V_d& {} \le \left( x_1^2\cdot (n_1+n_2) +n_1\cdot r\cdot x_1 \right. \\&\quad \left. +\,x_1^2\cdot (n_1+n_2)^2/(n_3+n_4) +n_3\cdot r\cdot x_3 \right) / n_A \end{aligned}$$

By rearranging the terms one gets:

$$\begin{aligned} V_d& {} \le \left( x_1^2\cdot (n_1+n_2)\cdot n_A/(n_3+n_4)\right. \\&\left. \quad +\,n_1\cdot r\cdot x_1 +n_3\cdot r\cdot x_3 \right) / n_A. \end{aligned}$$

Let us increase the right-hand side by adding \(n_2 \cdot r\cdot x_1 +n_4 \cdot r\cdot x_3\) to the numerator. This implies

$$\begin{aligned} V_d& {} \le \left( x_1^2 \cdot (n_1+n_2)\cdot n_A/(n_3+n_4) +(n_1+n_2)\cdot r\cdot x_1\right. \\&\quad \left. +\,(n_3+n_4)\cdot r\cdot x_3 \right) / n_A. \end{aligned}$$

Let us substitute \(x_1=\frac{x_3 \cdot (n_3+n_4)}{n_1+n_2}\). Then we obtain

$$\begin{aligned} V_d& {} \le \left( \left( x_3\cdot (n_3+n_4)/(n_1+n_2)\right) ^2\cdot (n_1+n_2)\cdot n_A/(n_3 \right. \\&\left. \quad +\,n_4)+2\cdot (n_3+n_4)\cdot r\cdot x_3 \right) / n_A. \end{aligned}$$

Hence

$$\begin{aligned} V_d& {} \le \left( x_3^2 (n_3+n_4)^2/(n_1+n_2) \cdot n_A/(n_3+n_4) \right. \\&\left. \quad +\,2\cdot (n_3+n_4)\cdot r\cdot x_3 \right) / n_A. \end{aligned}$$

One can delimit \(n_1+n_2\) from below due to relationship \(x_3 \cdot (n_3+n_4)=(n_1+n_2) \cdot x_1 \le (n_1+n_2)\cdot r\) because \(x_1\le r\). It implies that \(\frac{1}{n_1+n_2}\le \frac{ r}{x_3 \cdot\,(n_3+n_4)}\). Therefore

$$\begin{aligned} V_d& {} \le \left( x_3^2\cdot (n_3+n_4)^2\cdot r/x_3/(n_3+n_4) \cdot n_A/(n_3+n_4) \right. \\&\left. \quad +\,2\cdot (n_3+n_4)\cdot r\cdot x_3 \right) / n_A \end{aligned}$$

which simplifies to

$$\begin{aligned}&V_d\le \left( x_3 \cdot r \cdot n_A +2\cdot (n_3+n_4)\cdot r\cdot x_3 \right) / n_A \\&V_d\le x_3 \cdot r \cdot (n_A +2\cdot (n_3+n_4) )/ n_A \end{aligned}$$

Clearly \(n_3+n_4<n_A\), which implies \(V_d\le 3\cdot x_3 \cdot r\). This means that

$$\begin{aligned} x_3\ge V_d/3/r \end{aligned}$$

Now let us show that, when scaling up \(n_A\), it pays off to split the first cluster and attach the contents of the second one to one of the parts of the first. Let us increase the cardinality of \(C_A\) b times, simply by replacing each data element by b data elements collocated at the same place in space. In this way, \(V_d\) is kept constant while \(|C_A|\) increases. So the sum of squared distances between the centre and the elements of cluster \(C_A\), \({\mathrm{SSC}}(C_A)\), is kept below \(V_d \cdot d \cdot n_A b\) (\({\mathrm{SSC}}(C_A)\le V_d \cdot d \cdot n_A b\)).

Let \(n_1+n_2\) be the minority among the data points; then \(x_1\) is the larger and \(x_3\) the smaller of the two, because of \(x_1 \cdot (n_1+n_2)=x_3 \cdot (n_3+n_4)\). Let \(P'_1,P'_3\) be the subsets of \(C_A\) yielding, upon the aforementioned projection, the sets \(P_1,P_3\). Then, if one splits \(C_A\) into \(P'_1, P'_3\), the sum of squared distances to the respective cluster centres of \(P'_1, P'_3\) decreases by at least \(x_3^2 n_A b\): in the projection, \({\mathrm{SSC}}(P_1\cup P_3)-x_3^2 n_A b \ge {\mathrm{SSC}}(P_1\cup P_3)-x_1^2(n_1+n_2)b-x_3^2(n_3+n_4) b \ge {\mathrm{SSC}}(P_1)+{\mathrm{SSC}}(P_3)\), and the distances between elements of \(P'_1\) and \(P'_3\) (and so between the respective gravity centres) are at least as big as between \(P_1\) and \(P_3\), so that \({\mathrm{SSC}}(C_A)-x_3^2 n_A b = {\mathrm{SSC}}(P'_1\cup P'_3)-x_3^2 n_A b \ge {\mathrm{SSC}}(P'_1)+{\mathrm{SSC}}(P'_3)\).

On the other hand, combining \(P'_1, P'_3\) with disjoint parts \(P'_6,P'_7\) of \(C_B\) will increase the sum of squared distances by at most \(n_B x_5^2\), where \(x_5\) is the distance between extreme elements of \(C_A\) and \(C_B\): \({\mathrm{SSC}}(P'_1\cup P'_6)+{\mathrm{SSC}}(P'_3\cup P'_7)\le {\mathrm{SSC}}(P'_1)+| P_6|x_5^2+{\mathrm{SSC}}(P'_3)+| P'_7|x_5^2 ={\mathrm{SSC}}(P'_1)+ {\mathrm{SSC}}(P'_3)+n_Bx_5^2\).

Combining these two relations, one gets

$$\begin{aligned} {\mathrm{SSC}}(P'_1\cup P'_6)+{\mathrm{SSC}}(P'_3\cup P'_7)\le {\mathrm{SSC}}(C_A)-x_3^2 n_A b +n_Bx_5^2. \end{aligned}$$

Therefore, as soon as one sets \(b \ge \frac{n_Bx_5^2}{(V_d/3/r)^2 n_A} \ge \frac{n_Bx_5^2}{x_3^2 n_A}\), one will obtain

$$\begin{aligned}&Q(\{P'_1\cup P'_6, P'_3\cup P'_7\}) \\&\quad = {\mathrm{SSC}}(P'_1\cup P'_6)+{\mathrm{SSC}}(P'_3\cup P'_7)\le {\mathrm{SSC}}(C_A) \\&\quad \le {\mathrm{SSC}}(C_A)+{\mathrm{SSC}}(C_B)=Q(\{C_A,C_B\}) \end{aligned}$$

that is, for suitably large b it pays off to split \(C_A\) and merge \(C_B\) with one part of \(C_A\), because the optimum lies at a partition other than the one exhibiting well-separatedness in terms of a big distance between the centres of the cluster-enclosing balls. See also the discussion of Table 1 in “Experimental Results” section.
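A small numerical illustration of this effect (our own sketch, separate from the experiments reported later): two balls of radius r whose centres are 4r apart, the first one heavily over-populated; comparing the cost of the “natural” two-cluster partition with the cost of splitting the large cluster and absorbing the small one shows that the split wins.

```python
import numpy as np

rng = np.random.default_rng(1)
r, centre_dist = 1.0, 4.0      # ball radius and distance between ball centres (= 4r)
n_B, b = 50, 200               # small-cluster size and blow-up factor of the large cluster

def ball(centre, n):
    """n points sampled uniformly from a 2D ball of radius r around centre."""
    phi = rng.uniform(0, 2 * np.pi, n)
    rad = r * np.sqrt(rng.uniform(0, 1, n))
    return centre + np.column_stack((rad * np.cos(phi), rad * np.sin(phi)))

C_A = ball(np.array([0.0, 0.0]), 50 * b)        # large cluster
C_B = ball(np.array([centre_dist, 0.0]), n_B)   # small, well-separated cluster

def ssc(P):
    """Sum of squared distances of P to its gravity centre."""
    return float(((P - P.mean(axis=0)) ** 2).sum())

Q_natural = ssc(C_A) + ssc(C_B)                       # partition {C_A, C_B}
left, right = C_A[C_A[:, 0] <= 0], C_A[C_A[:, 0] > 0]
Q_split = ssc(left) + ssc(np.vstack((right, C_B)))    # split C_A, merge C_B into one part

print(Q_natural, Q_split)   # for large enough b, Q_split < Q_natural
```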

Table 1 Dependence of the number of errors on M to m proportion under fixed gap \(g/r_{\mathrm{max}}= 2\)

Our Basic Approach to Clusterability

Let us stress at this point that well-clusterability is not only a theoretical issue; it is of practical interest too, for example when one intends to create synthetic data sets for investigating the suitability of various clustering algorithms, but also after the clustering process has been performed with whatever method one has. The important question to be answered is whether or not the obtained clustering meets the expectations of the analyst. These expectations may be divided into several categories:

  • matching business goals,

  • matching underlying algorithm assumptions,

  • proximity to the optimal solutions.

Business goals of the clustering may be difficult to express in terms of data for an algorithm, may not fit the algorithm domain, or the data may be too expensive to collect prior to performing an approximate clustering. For example, when one seeks a clustering that would enable efficient collection of cars to be scrapped (a disassembly network), one has to match multiple goals, like covering the whole country, a maximum distance from client to the disassembly station, and of course the number of prospective clients, which is known only with some degree of uncertainty. The distances to the clients are frequently not Euclidean in nature (due to geographical obstacles like rivers, mountains, etc.), while the preferred k-means algorithm works best with geometrical distances, no upper distance limit can be imposed, etc. Other algorithms may pose the same or different problems. So one has to check a posteriori whether the obtained solution meets all criteria, does not violate constraints and is stable under fluctuation of the actual set of clients. The other two problem categories are related to one another. For example, one may have clustered a subsample of the proper data set, and the question may be raised how close the subsample cluster centres are to the cluster centres of the proper data set. Known methods allow one to estimate this discrepancy, given that the cluster sizes do not differ too much. So prior to evaluating the correctness of cluster centre estimation, one has to check whether the cluster proportions are within the required range (or whether the subsample size is adequate for such a verification). As another example, there exist methods of estimating the closeness to the optimal clustering solution under some general data distributions (like for k-means++ [5]), but the guarantees are quite loose. At the same time, the guarantees can be much tighter if the clusters are well separated in some sense. So if one wants to be sure, with a reasonable probability, that the obtained solution is sufficiently close to the optimum, one needs to check whether the obtained clusters are well separated in the defined sense.

With this in mind, as mentioned, a number of researchers developed the concept of data clusterability. The notion of clusterability should intuitively reflect the following idea: if it is easy to see that there are clear-cut clusters in the data, then one would say that the data set is clusterable. “Easy to see” may mean either a visual inspection or some algorithm that quickly identifies the clusters. A well-established notion of clusterability would improve our understanding of the concept of the cluster itself: a well-defined clustering would be a clustering of clusterable points. It would also be a foundation for an objective evaluation of clustering algorithms. An algorithm should perform well on well-clusterable data; when the clusterability condition is violated to some degree, the performance is allowed to deteriorate as well, but the quality of the algorithm would then be measured by how strongly the clusterability violation impacts this deterioration.

However, the issue turns out not to be that simple. As is well known, each algorithm seeking to discover a clustering may somehow be fooled into failing to discover a clustering structure that is visible upon human inspection of the data. So instead of trying to reflect the human perception of clusterability of a data set independently of the algorithm, let us rather concentrate on finding a concept of clusterability that reflects both the human perception and the minimum of the cost function of a concrete algorithm, in our case k-means. This paper concentrates particularly on its version called k-means++.

Let us define:

Definition 1

A data set is well-clusterable with respect to k-means cost function [Eq. (1)] for a given k (in brief: is well-clusterable) if (a) the data points may be split into subsets that are clearly separated by an appropriately chosen gap such that (b) the global minimum of k-means cost function coincides with this split and (c) with high probability (over 0.95) the k-means++ algorithm discovers this split and (d) if the split was found, it may be verified that the data subsets are separated by the abovementioned gap and (e) if the k-means++ did not discover a split of the data fulfilling the requirement of the existence of the gap, then with high probability the split described by points (a) and (b) does not exist.

Conditions, under which one can ensure that the minimum of k-means cost function is related to a clustering with (wide) gaps between clusters, were investigated in the paper [17].

The conditions for a clusterable data set therein are rather rigid, but they serve the purpose of demonstrating that it is possible to define properties of the data set which ensure this behaviour of the k-means minimum. Let us recall below the main result in this respect.

Assume that the data set, encompassing n data points, consists of k subsets such that each subset \(i=1,\dots ,k\) can be enclosed in a ball of radius \(r_i\). Let the gap (the distance between the surfaces of the enclosing balls) between each pair of subsets amount to at least g, where g satisfies the conditions below, proposed in [17] to ensure that the split of the data into these subsets is the global optimum of k-means.

$$\begin{aligned} g \ge r_{\mathrm{max}} \sqrt{k\frac{M+n}{ m} } \end{aligned}$$
(2)

and

$$\begin{aligned} g\ge k r_{\mathrm{max}} \sqrt{n_p/2+n_q/2+n/2} \sqrt{ \frac{2 n}{ n_{p}n_{q} } } \end{aligned}$$
(3)

for any \(p,q=1,\dots ,k;\,p\ne q\), where \(n_i, i=1,\dots ,k\) is the cardinality of cluster i, \(M=\max _i n_i\), and \(m=\min _i n_i\). These conditions are taken from [17, formulas (21) and (22)], and they ensure that the global optimum agrees with the split into clusters separated by the mentioned gap.
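Data sets satisfying these conditions are straightforward to generate, which is exactly the practical use of the criterion advocated in the introduction. The sketch below (our own illustration) places k equal-cardinality clusters, each inside a ball of radius r, on a line with surface-to-surface gaps equal to the larger of the two bounds; for equal cardinalities the bounds reduce to the simpler forms given later in Eqs. (4) and (5), which the sketch uses.

```python
import numpy as np

def generate_well_clusterable(k, n_per_cluster, r=1.0, dim=2, seed=0):
    """k clusters of equal cardinality, each inside a ball of radius r,
    separated by gaps satisfying conditions (2) and (3)."""
    rng = np.random.default_rng(seed)
    # required gap for equal cardinalities, cf. Eqs. (4) and (5)
    g = r * max(np.sqrt(k * (k + 1)), k * np.sqrt(2 * k + k ** 2))
    X, labels = [], []
    for j in range(k):
        centre = np.zeros(dim)
        centre[0] = j * (2 * r + g)          # ball surfaces exactly g apart
        # uniform sampling inside a dim-dimensional ball of radius r
        v = rng.normal(size=(n_per_cluster, dim))
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        rad = r * rng.uniform(0, 1, n_per_cluster) ** (1.0 / dim)
        X.append(centre + v * rad[:, None])
        labels += [j] * n_per_cluster
    return np.vstack(X), np.array(labels)
```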

Please note that the quotient of the cardinality of the largest to the smallest cluster increases the size of the required gap, as may be expected from “Failure of Gaps as a Criterion for Clusterability” section. It is visible from formula (2) that both the ratio M/m and the ratio n/m matter. This formula gives the impression that the dependence may be like the square root of the sum of the two. But note that g is also controlled by formula (3), where the dependence of g on n/m may become close to linear, while that on M/m remains close to a square root. As visible from “Failure of Gaps as a Criterion for Clusterability” section, the sums of squared distances to the cluster centre within the cluster and between clusters decide at which point the shift of the minimal cost occurs as the disproportion between cluster sizes grows. Hence g needs to grow as the square root of this disproportion M/m. The impact of n/m should rather be viewed in the context of the number of clusters k, since with fixed m and growing n the ratio n/m may be deemed a reflection of k. If one looks at formula (5), one sees that g depends approximately quadratically on k. This probably relates to the fact that the number of possible misassignments between clusters grows quadratically with k.

It is claimed in [17] that the optimum of k-means objective is reached when splitting the data into the aforementioned subsets. The most fundamental implication is that the problem is decidable.

Theorem 1

(i) If the data set is well-clusterable (into k clusters) with a gap defined by formulas (2) and (3), then with high probability k-means++ (after an appropriately chosen number of repetitions) will discover the respective clustering. (ii) If k-means++ (after an appropriately chosen number of repetitions) does not discover a clustering matching formulas (2) and (3), then with high probability the data set is not well clusterable with a gap defined by formulas (2) and (3).

The rest of this section is devoted to the proof of the claims of this theorem, which is proposed in the current paper.

If one has obtained the split, then for each cluster one is able to compute the cluster centre and the radius of the ball containing all the data points of the cluster, and finally one can check whether the gaps between the clusters meet the requirements of formulas (2) and (3). If these requirements are met, then the data set is definitely well clusterable.
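This ex post check can be sketched as follows (our own illustration; X is the data matrix and labels is the clustering returned by k-means++):

```python
import numpy as np
from itertools import combinations

def well_clusterable(X, labels):
    """Ex post verification of the gap conditions (2) and (3) for a given clustering."""
    labels = np.asarray(labels)
    ks = np.unique(labels)
    k, n = len(ks), len(X)
    centres = np.array([X[labels == j].mean(axis=0) for j in ks])
    radii = np.array([np.linalg.norm(X[labels == j] - centres[i], axis=1).max()
                      for i, j in enumerate(ks)])
    sizes = np.array([(labels == j).sum() for j in ks])
    M, m, r_max = sizes.max(), sizes.min(), radii.max()
    bound2 = r_max * np.sqrt(k * (M + n) / m)                       # inequality (2)
    for p, q in combinations(range(k), 2):
        # gap = distance between the surfaces of the enclosing balls
        gap = np.linalg.norm(centres[p] - centres[q]) - radii[p] - radii[q]
        bound3 = k * r_max * np.sqrt(sizes[p] / 2 + sizes[q] / 2 + n / 2) \
                 * np.sqrt(2 * n / (sizes[p] * sizes[q]))           # inequality (3)
        if gap < max(bound2, bound3):
            return False
    return True
```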

Let us look at claim (i). As shown in [17, Section 12], the global minimum of k-means coincides with the separation by the abovementioned gaps. Hence, if there exists a positive probability that k-means++ discovers the appropriate split, then the probability of finding the global minimum is increased by repeating independent runs of k-means++ and picking the split minimizing the k-means cost function. It will be shown that the number of repetitions needed is known in advance if the value of the quotient M/m is bounded from above by a known number.

First consider the easiest case of all clusters being of equal sizes (\(M=m\)). Then the above Eqs. (2) and (3) can be reduced to (\(r=r_{\mathrm{max}}\))

$$\begin{aligned} g \ge r \sqrt{k(k+1) } \end{aligned}$$
(4)
$$\begin{aligned} g\ge r k\sqrt{2k +k^2}. \end{aligned}$$
(5)

A diagram of dependence of g/r on k is depicted in Fig. 2.

Fig. 2 Dependence of the relative gap g/r on k for clusters of equal radius and equal cardinalities

Now let us turn to k-means++ seeding. During initialization of the algorithm, the first cluster centre \({\varvec{\mu }}_1\) is selected randomly (with uniform distribution) from among all the data points \({\mathbf {X}}\). The subsequent seeds are selected randomly from the (remaining) data points \({\mathbf {x}}\in {\mathbf {X}}\) with probability \({\mathrm{Prob}}({\mathbf {x}})=\frac{ D({\mathbf {x}}, \{{\varvec{\mu }}_1,\dots ,{\varvec{\mu }}_i\} )^2 }{\sum _{{\mathbf {y}}\in {\mathbf {X}}} D({\mathbf {y}}, \{{\varvec{\mu }}_1,\dots ,{\varvec{\mu }}_i\} )^2 }\), where \(D({\mathbf {x}}, \{{\varvec{\mu }}_1,\dots ,{\varvec{\mu }}_i\} )\) denotes the distance of \({\mathbf {x}}\) to the closest point in the seed set \(\{{\varvec{\mu }}_1,\dots ,{\varvec{\mu }}_i\}\), if i distinct clusters have already been seeded. Recall that with k clusters we have km data points, as all the clusters are of equal size; \((k-i)m\) of them are in clusters that do not have a seed yet, and im are in those with a seed. The data points \({\mathbf {x}}\) in clusters without a seed have a distance to the closest seed of at least g. On the other hand, the data points \(\mathbf {x'}\) in clusters with a seed have a distance to the closest seed of at most 2r. Let \({\mathbf {S}}\subset {\mathbf {X}}\) be the set of data points in seeded clusters. Then the probability that a new cluster will be seeded amounts to \(\sum _{{\mathbf {x}}\in {\mathbf {X}}-{\mathbf {S}}} {\mathrm{Prob}}({\mathbf {x}})= \sum _{{\mathbf {x}}\in {\mathbf {X}}-{\mathbf {S}}} \frac{ D({\mathbf {x}}, \{{\varvec{\mu }}_1,\dots ,{\varvec{\mu }}_i\} )^2 }{\sum _{{\mathbf {y}}\in {\mathbf {X}}} D({\mathbf {y}}, \{{\varvec{\mu }}_1,\dots ,{\varvec{\mu }}_i\} )^2 } = \frac{ \sum _{{\mathbf {x}}\in {\mathbf {X}}-{\mathbf {S}}} D({\mathbf {x}}, \{{\varvec{\mu }}_1,\dots ,{\varvec{\mu }}_i\} )^2 }{ \sum _{{\mathbf {x}}\in {\mathbf {X}}-{\mathbf {S}}} D({\mathbf {x}}, \{{\varvec{\mu }}_1,\dots ,{\varvec{\mu }}_i\} )^2 + \sum _{{\mathbf {y}}\in {\mathbf {S}}} D({\mathbf {y}}, \{{\varvec{\mu }}_1,\dots ,{\varvec{\mu }}_i\} )^2 } \ge \frac{ \sum _{{\mathbf {x}}\in {\mathbf {X}}-{\mathbf {S}}} D({\mathbf {x}}, \{{\varvec{\mu }}_1,\dots ,{\varvec{\mu }}_i\} )^2 }{ \sum _{{\mathbf {x}}\in {\mathbf {X}}-{\mathbf {S}}} D({\mathbf {x}}, \{{\varvec{\mu }}_1,\dots ,{\varvec{\mu }}_i\} )^2 +im(2r)^2 } \ge \frac{(k-i)mg^2}{(k-i)mg^2 +im(2r)^2 }= \frac{(k-i)g^2}{(k-i)g^2 +4ir^2 }\). The last inequalities are justified, first, by replacing a positive term in the denominator by a larger one, and then by replacing, both in the numerator and in the denominator (which is bigger than the numerator), the same quantity by the same smaller one.

Note that the inequality (5) implies \(g \ge 2r \sqrt{k(k+1) }\) if \(k\ge 2\) which we assume.

$$\begin{aligned}&\frac{(k-i) g^2}{(k-i) g^2+4i r^2} \ge \frac{(k-i) 4r^2 k^2 (k+1)^2}{(k-i) 4r^2 k^2 (k+1)^2+i 4r^2} \\&\quad = \frac{(k-i) k^2 (k+1)^2}{(k-i) k^2 (k+1)^2+i } \ge \frac{ k^2 (k+1)^2}{ k^2 (k+1)^2+(k-1) } .\end{aligned}$$

Hence the probability of accurate seeding (\({\mathrm{PAS}}(k)\)) amounts to

$$\begin{aligned} {\mathrm{PAS}}(k)\ge \left( \frac{ k^2 (k+1)^2}{ k^2 (k+1)^2+(k-1) }\right) ^{k-1} \end{aligned}$$

The diagram of dependence of this expression on k is depicted in Fig. 3.

Fig. 3 Probability of seeding each cluster on k for clusters of equal radius and equal cardinalities

Let us denote by \(Pr_{\mathrm{succ}}\) the required probability of success in finding the global minimum. To ensure that the seeding is successful in a fraction \(Pr_{\mathrm{succ}}\) (e.g. 95%) of the cases, k-means++ has to be rerun at least R times, with R given by

$$\begin{aligned} (1-{\mathrm{PAS}}(k))^R& {} \le \left( 1-\left( \frac{ k^2 (k+1)^2}{ k^2 (k+1)^2+(k-1) }\right) ^{k-1}\right) ^R \\& {} < 1- Pr_{\mathrm{succ}} \\ R& {} \ge \frac{\log (1- Pr_{\mathrm{succ}} )}{\log \left( 1-\left( \frac{ k^2 (k+1)^2}{ k^2 (k+1)^2+(k-1) }\right) ^{k-1}\right) } \end{aligned}$$

R does not depend on the sample size. But look at the following relationship:

$$\begin{aligned}&\left( \frac{ k^2 (k+1)^2}{ k^2 (k+1)^2+(k-1) }\right) ^{k-1} \\&\quad =\left( 1-\frac{ k-1 }{ k^2 (k+1)^2+(k-1) }\right) ^{k-1} \\&\quad =\left( 1-\frac{ (k-1)^2 }{ k^2 (k+1)^2+(k-1) }\frac{1}{k-1}\right) ^{k-1} \\&\quad \approx e^{-\left( \frac{ (k-1)^2 }{ k^2 (k+1)^2+(k-1) }\right) }. \end{aligned}$$

The exponent of the last expression rapidly approaches zero, so that with increasing k the optimum is reached within a single pass of k-means++. In fact, already for \(k=2\) an error below 3% is obtained, for \(k=8\) below 1%, and for \(k=30\) below 0.1%. See Fig. 3 for an illustration.
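These error levels and the implied number of restarts R are easy to verify numerically (our own sketch, using the lower bound on \({\mathrm{PAS}}(k)\) derived above and \(Pr_{\mathrm{succ}}=0.95\)):

```python
import math

def pas_lower_bound(k):
    """Lower bound on the probability that k-means++ seeds every cluster
    (equal radii and equal cardinalities)."""
    a = k ** 2 * (k + 1) ** 2
    return (a / (a + (k - 1))) ** (k - 1)

def repetitions(k, p_succ=0.95):
    """Number of independent k-means++ restarts needed for success probability p_succ."""
    pas = pas_lower_bound(k)
    if pas >= p_succ:
        return 1
    return math.ceil(math.log(1 - p_succ) / math.log(1 - pas))

for k in (2, 8, 30):
    print(k, 1 - pas_lower_bound(k), repetitions(k))
# errors: ~0.027 for k=2, ~0.009 for k=8, ~0.001 for k=30; a single run suffices
```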

Let us now discuss clusters of the same radius but different cardinalities. Let m be the minimum cluster cardinality and M the maximum. With \(r=r_{\mathrm{max}}\), conditions (2) and (3) then read

$$\begin{aligned} g \ge r \sqrt{k\frac{M+n}{ m} } \end{aligned}$$
(6)
$$\begin{aligned} g \ge k r \sqrt{ \frac{ n (n_p +n_q +n )}{ n_{p}n_{q} } } \end{aligned}$$
(7)

for any \(p,q=1,\dots ,k;\,p\ne q\), where \(n_i, i=1,\dots ,k\) is the cardinality of cluster i, \(M=\max _i n_i\), and \(m=\min _i n_i\). Worst-case g/r values are illustrated in Fig. 4.

Fig. 4 Dependence of the gap g for \(k=5\) for clusters of equal radius when varying cluster cardinalities

Now let us turn to k-means++ seeding. Note that for \(k\ge 4\) the above relations imply \(g \ge 2r \sqrt{k\frac{M+n}{ m} }\). If i distinct clusters have already been seeded, then the probability that a new cluster will be seeded (under our assumptions) amounts to at least

$$\begin{aligned}&\frac{(k-i) m g^2}{(k-i) m g^2+i M 4r^2} \\&\quad \ge \frac{(k-i) m k^2 n (1/m +1/m +n/m^2 ) }{(k-i) m k^2 n (1/m +1/m +n/m^2 ) +i M } \\&\quad = \frac{(k-i) k^2 n (2 +n/m ) }{(k-i) k^2 n (2 +n/m ) +i M } \\&\quad \ge \frac{ k^2 n (2 +n/m ) }{ k^2 n (2 +n/m ) +(k-1) M }. \end{aligned}$$

So again the probability of successful seeding will amount to at least:

$$\begin{aligned}&\left( \frac{ k^2 n (2 +n/m ) }{ k^2 n (2 +n/m ) +(k-1) M } \right) ^{k-1} \\&\quad =\left( 1- \frac{ (k-1) M }{ k^2 n (2 +n/m ) +(k-1) M } \right) ^{k-1} \\&\quad =\left( 1- \frac{ (k-1)^2 M }{ k^2 n (2 +n/m ) +(k-1) M }\frac{1}{k-1} \right) ^{k-1} \\&\quad \approx {\mathrm{exp}} \left( - \frac{ (k-1)^2 M }{ k^2 n (2 +n/m ) +(k-1) M } \right) . \end{aligned}$$

Even if M is 20 times as large as m, the convergence to 1 is so rapid that already for \(k=2\) the seeding succeeds with \(95\%\) probability in a single run. An illustration is given in Fig. 5.
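This bound is easy to evaluate numerically. A small sketch, under the illustrative worst-case assumption that one cluster has cardinality M and the remaining \(k-1\) clusters have cardinality m:

```r
# Sketch: lower bound on the probability of accurate seeding for clusters of
# equal radius but unequal cardinalities (m = minimum, M = maximum size);
# n is taken for the assumed worst case of one large and k-1 small clusters.
seed_prob_lower <- function(k, m, M) {
  n <- (k - 1) * m + M
  (k^2 * n * (2 + n / m) / (k^2 * n * (2 + n / m) + (k - 1) * M))^(k - 1)
}
seed_prob_lower(k = 2, m = 50, M = 1000)   # M/m = 20: still above 0.95
```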

Fig. 5: Dependence of the accurate seeding on the quotient M/m for \(k=5\)

It has been shown that if the data are well clusterable, then practically within a single clustering run the seeding will have the property that each cluster obtains exactly one seed. But what about the rest of the k-means run? Since in all these cases \(g \ge 2r\), then, as shown in [17], the cluster centres will never switch to balls encompassing other clusters, so that eventually the true cluster structure is detected and the minimum of Q is reached. This completes the proof of claim (i). The demonstration of claim (ii) is straightforward. Note that if a clustering discovered by k-means fulfils the conditions of well-clusterability, then the data set is clusterable for sure, by definition. If the data were not well-clusterable, then k-means++ would certainly not find a clustering with the property of being well-clusterable, because no such clustering exists. If the data were well clusterable, then k-means++ would have failed to identify it with probability of at most \(1-{\mathrm{Pr}}_{\mathrm{succ}}\).

So denote by W the event that the data are well-clusterable, and by D the event that the k-means++ algorithm reports that the data are well-clusterable. It will be shown below that the probability \(P(\lnot W | \lnot D)\) is high.

$$\begin{aligned}&Pr(\lnot W | \lnot D) \\&\quad = \frac{Pr(\lnot D | \lnot W) Pr( \lnot W) }{Pr(\lnot D | \lnot W) Pr( \lnot W) + Pr(\lnot D | W) Pr( W) } \\&\quad = \frac{ Pr( \lnot W) }{ Pr( \lnot W) + Pr(\lnot D | W) Pr( W) } \\&\quad = \frac{1 }{ 1 + Pr(\lnot D | W) \frac{Pr( W)}{Pr( \lnot W)} } \\&\quad \ge \frac{1 }{ 1 + (1-{\mathrm{Pr}}_{\mathrm{succ}}) \frac{Pr( W)}{Pr( \lnot W)} } \\&\quad \ge 1 - (1-{\mathrm{Pr}}_{\mathrm{succ}}) \frac{Pr( W)}{Pr( \lnot W)} \ge {\mathrm{Pr}}_{\mathrm{succ}} \end{aligned}$$

The last inequality holds because well-clusterable data sets are extremely rare in practice, and certainly less frequent than not well-clusterable ones, so that \(Pr(W)/Pr(\lnot W)\le 1\).

Please note that a kind of worst-case analysis was presented above. It is obvious from this discussion that the probability of seeding k distinct clusters depends on the characteristics of the data. The smallest gap between clusters was referred to above, but it may turn out that some clusters are separated more strongly. This automatically increases their probability of being hit, so that the overall probability of hitting not-yet-seeded clusters increases significantly.

Smaller Gaps Between Clusters

Well-clusterability was considered in the previous section under the assumption of large areas between clusters in which no data points of any cluster occur. It will be shown subsequently that this assumption may be relaxed so that spurious points are allowed between the major concentrations of cluster points. But to ensure that the presence of such points does not lead the k-means procedure astray, core parts of the clusters will be distinguished, and the subsequent Theorem 3 demonstrates that once a cluster core is hit by the k-means initialization procedure, the cluster is preserved over the subsequent k-means iterations.

It has been proven in [17, Theorem 17] that

Theorem 2

Let A and B be cluster centres. Let \(\rho _{AB}\) be the radius of a ball centred at A and enclosing its cluster, and also the radius of a ball centred at B and enclosing its cluster. If the distance between the cluster centres A and B amounts to \(2\rho _{AB}+g\), \(g>0\) (g being the “gap” between the clusters), and one picks any two points, X from the cluster of A and Y from the cluster of B, and reclusters both clusters around X and Y, then the new clusters will preserve the balls of radius g/2 centred at A and B (subsequently called “cores”): the new cluster around X preserves the core of A, and the new cluster around Y the core of B.

The validity of a complementary theorem will be demonstrated here.

Theorem 3

Let A and B be cluster centres. Let \(\rho _{AB}\) be the radius of a ball centred at A and enclosing its cluster, and also the radius of a ball centred at B and enclosing its cluster. Let \(r_{cAB}\) be the radius of a ball centred at A and enclosing the “vast majority” of its cluster, and also the radius of a ball centred at B and enclosing the “vast majority” of its cluster. If the distance between the cluster centres A and B amounts to \(2\rho _{AB}+g\), \(g>0\) (\(g=2 r_{cAB}\) being the “gap” between the clusters), and one picks any two points, X from the ball \(B(A,r_{cAB})\) and Y from the ball \(B(B,r_{cAB})\), and reclusters both clusters around X and Y, then the new clusters will be identical to the original clusters around A and B.

Definition 2

If the gap between each pair of clusters fulfils the condition of either of the above two theorems, then we say that we have core-clustering.

Proof

For the illustration of the proof see Fig. 6.

Fig. 6: An illustrative figure for the proof of cluster preservation under a gap between the cluster-enclosing balls. The figure represents a projection of two clusters centred at A and B onto a plane containing the line AB. The inner circles centred at A and B contain the projections of the cores of the respective clusters, and the outer circles contain the projections of the whole clusters. The gap between the clusters is represented by the circle with centre at O. It is proven that if the cluster centres change their positions (via the initialization procedure of k-means, for example) to any other place within the clusters A and B, then the core of A will remain in A, and the core of B will remain in B, in spite of the fact that other elements of the clusters may change cluster membership. Notation is explained in detail in the text

The proof does not differ much from the previous one; in fact, the previous Theorem 2 is a special case of Theorem 3.

Consider the two points A and B being the centres of two double balls. The inner ball represents the core, of radius \(r_{cAB}=g/2\); the outer ball, of radius \(\rho\) (\(\rho =\rho _{AB}\)), encloses the whole cluster. Consider two points, X and Y, one in each core ball (presumably the cluster centres at some stage of the k-means algorithm). To represent their distances faithfully, at most a three-dimensional space is needed.

Let us consider the plane established by the line AB and parallel to the line XY. Let \(X'\) and \(Y'\) be the orthogonal projections of X and Y onto this plane. Now let us establish that the hyperplane \(\pi\) orthogonal to XY and passing through the midpoint of the line segment XY, that is, the hyperplane containing the boundary between the clusters centred at X and Y, does not cut either of the balls centred at A and B. This hyperplane is orthogonal to the plane of Fig. 6 and so it manifests itself as an intersecting line l that should not cross the outer circles around A and B, these being the projections of the respective balls. Let us draw two solid lines k and m between the circles \(O(A,\rho _{AB})\) and \(O(B,\rho _{AB})\), tangential to each of them. Line l should lie between these lines, in which case no cluster centre will jump to the other ball.

Let the line \(X'Y'\) intersect the circles \(O(A,r_{cAB})\) and \(O(B,r_{cAB})\) at the points C, D, E, F, as in the figure.

Obviously, the line l would get closer to circle A if the points \(X', Y'\) lay closer to C and E, and closer to circle B if they lay closer to D and F.

Therefore, to show that it does not cut the circle \(O(A,\rho )\), it is sufficient to consider \(X'=C\) and \(Y'=E\) (the case of the ball around B is symmetrical).

Let O be the midpoint of the line segment AB. Let us draw through this point a line parallel to CE that cuts the circles at the points \(C', D', E'\) and \(F'\). Now notice that the central symmetry through the point O transforms the circles \(O(A,r_{cAB})\), \(O(B,r_{cAB})\) into one another, and the point \(C'\) into \(F'\) and \(D'\) into \(E'\). Let \(E^*\) and \(F^*\) be the images of the points E and F under this symmetry.

In order for the line l to lie between m and k, the midpoint of the line segment CE must lie between these lines.

Let us introduce a planar coordinate system centred at O with the \({\mathcal {X}}\) axis parallel to the lines m and k, such that A has both coordinates non-negative and B non-positive. Let us denote by \(\alpha\) the angle between the lines AB and k. Given that the distance between A and B equals \(2\rho +2r_{cAB}\), the distance between the lines k and m amounts to \(2((\rho +r_{cAB})\sin (\alpha )-\rho )\). Hence the \({\mathcal {Y}}\) coordinate of the line k equals \((\rho +r_{cAB})\sin (\alpha )-\rho\).

So the \({\mathcal {Y}}\) coordinate of the midpoint of the line segment CE must not be higher than this. Let us express this in the coordinate system:

$$\begin{aligned} (y_{OC}+y_{OE})/2 \le (\rho +r_{cAB})\sin (\alpha )-\rho \end{aligned}$$

where \(y_{OC}\) is the y-coordinate of the vector \(\overrightarrow{OC}\), etc.

Note, however, that

$$\begin{aligned} y_{OC}+y_{OE}& {} = y_{OA}+y_{AC}+y_{OB}+y_{BE}= y_{AC}+y_{BE} \\& {} = y_{AC}-y_{AE^*} =y_{AC}+y_{E^*A}. \end{aligned}$$

So let us examine the circle with centre at A. Note that the lines CD and \(E^*F^*\) are at the same distance from the line \(C'D'\). Note also that the absolute values of the slopes of the tangent lines to circle A at \(C'\) and \(D'\) are identical. As the line CD gets closer to A (and the two lines move further apart), \(y_{AC}\) gets bigger and \(y_{E^*A}\) gets smaller. But from the properties of the circle it is visible that \(y_{AC}\) increases at a decreasing rate, while \(y_{E^*A}\) decreases at an increasing rate. So the sum \(y_{AC}+y_{E^*A}\) attains its biggest value when C is identical with \(C'\), and it only has to be proven that

$$\begin{aligned} (y_{AC'}+y_{D'A} )/2=y_{AC'} \le ((\rho +r_{cAB})\sin (\alpha )-\rho ) .\end{aligned}$$

Let M denote the middle point of the line segment \(C'D'\). As point A has the coordinates \(((\rho +r_{cAB}) \cos (\alpha ), (\rho +r_{cAB}) \sin (\alpha )),\) the point M is at distance of \((\rho +r_{cAB}) \cos (\alpha )\) from A. But \(C'M^2= r_{cAB}^2-((\rho +r_{cAB}) \cos (\alpha ))^2\).

Hence, it is necessary to show that

$$\begin{aligned} r_{cAB}^2-((\rho +r_{cAB}) \cos (\alpha ))^2 \le ((\rho +r_{cAB})\sin (\alpha )-\rho )^2 \end{aligned}$$

But this condition is equivalent to:

$$\begin{aligned}&r_{cAB}^2-((\rho +r_{cAB}) \cos (\alpha ))^2 \le ((\rho +r_{cAB})\sin (\alpha ))^2 +\rho ^2 -2(\rho +r_{cAB}) \rho \sin (\alpha ) \\&r_{cAB}^2 \le (\rho +r_{cAB})^2+\rho ^2 -2(\rho +r_{cAB}) \rho \sin (\alpha ) \\&r_{cAB}^2-\rho ^2 \le (\rho +r_{cAB})^2 -2(\rho +r_{cAB})\rho \sin (\alpha ) \\&(r_{cAB} -\rho )(r_{cAB} +\rho ) \le (\rho +r_{cAB})^2 -2(\rho +r_{cAB})(\rho )\sin (\alpha ) \\&(r_{cAB} -\rho ) \le (\rho +r_{cAB}) -2 \rho \sin (\alpha ) \\&0 \le 2\rho -2 \rho \sin (\alpha ) \\&0 \le 1- \sin (\alpha ) \end{aligned}$$

which is obviously true, as \(\sin\) never exceeds 1. \(\square\)
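The statement of Theorem 3 can also be checked empirically. The following small 2D simulation is only an illustration under the theorem's assumptions (all names are ours): two spherical clusters with centres at distance \(2\rho +2r_{c}\) are generated, arbitrary points X and Y are drawn from the two cores, the data are re-assigned to the nearer of X and Y, and the original partition is recovered.

```r
# Sketch: empirical check of Theorem 3 in two dimensions (illustration only).
set.seed(1)
rho <- 1; r_c <- 0.3                        # enclosing radius and core radius
A <- c(0, 0); B <- c(2 * rho + 2 * r_c, 0)  # centres at distance 2*rho + g, g = 2*r_c

runif_ball <- function(n, centre, radius) { # uniform points in a disk
  phi <- runif(n, 0, 2 * pi); rad <- radius * sqrt(runif(n))
  cbind(centre[1] + rad * cos(phi), centre[2] + rad * sin(phi))
}
cluster_A <- runif_ball(500, A, rho); cluster_B <- runif_ball(500, B, rho)
X <- runif_ball(1, A, r_c); Y <- runif_ball(1, B, r_c)   # arbitrary core points

sqdist <- function(P, q) rowSums((P - matrix(q, nrow(P), 2, byrow = TRUE))^2)
all(sqdist(cluster_A, X) < sqdist(cluster_A, Y))  # TRUE: cluster of A stays with X
all(sqdist(cluster_B, Y) < sqdist(cluster_B, X))  # TRUE: cluster of B stays with Y
```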

Core-Based Global k-Means Minimum

In [17], conditions have been investigated under which one can ensure that the minimum of the k-means cost function corresponds to a clustering with (wide) gaps between clusters.

Based on the result of the preceding “Smaller Gaps Between Clusters” section, these conditions will be weakened so that large gaps are required only between the cluster cores, while the clusters themselves are separated by much smaller gaps, equal to the size of the core.

In particular, let us consider a set of k clusters \(\overline{{\mathcal {C}}}=\{\overline{C_1},\dots ,\overline{C_k}\}\) of cardinalities \(\overline{n_1},\dots ,\overline{n_k}\) and with radii of the balls enclosing the clusters (with centres located at the cluster centres) equal to \(\overline{r_1},\dots , \overline{r_k}\). Let each of these clusters \(\overline{C_i}\) have a core \(C_i\), of radius \(r_i\) and cardinality \(n_i\), around the centre of \(\overline{C_i}\), such that for \({\mathfrak {p}}\in [0,1)\)

$$\begin{aligned} Q(\{C_i\})/Q(\{\overline{C_i}\})\ge 1-{\mathfrak {p}}. \end{aligned}$$

One is interested in a gap g between cluster cores \(C_1,\dots ,C_k\) such that it does not make sense to split each cluster \(\overline{C_i}\) into subclusters \(\overline{C_{i1}},\dots , \overline{C_{ik}}\) and to combine them into a set of new clusters \({\mathcal {S}}=\{S_1,\dots ,S_k\}\) such that \(S_j=\cup _{i=1}^k \overline{C_{ij}}\).

A g is to be found such that the highest possible central sum of squares combined over the clusters \(\overline{C_i}\) is lower than the lowest conceivable combined sum of squares around the respective centres of the clusters \(S_j\). Let \({\mathrm{Var}}(C)\) be the variance of the cluster C (the average squared distance to the gravity centre of the set C), with one exception, however: when referring to the core of any of the clusters \(\overline{C_i}\), the gravity centre of the whole cluster \(\overline{C_i}\) is used, not that of the core \(C_i\); the same convention applies to the Q function. Let \(C_{ij}=\overline{C_{ij}}\cap C_i\) be the core part of the subcluster \(\overline{C_{ij}}\). Let \(r_{ij}\) be the distance of the centre of the core subcluster \(C_{ij}\) to the centre of the cluster \(\overline{C_i}\). Let \(v_{ilj}\) be the distance of the centre of the core subcluster \(C_{ij}\) to the centre of the core subcluster \(C_{lj}\). So the total k-means cost function for the set of clusters \(\overline{{\mathcal {C}}}\) amounts to:

$$\begin{aligned} Q(\overline{{\mathcal {C}}}) =\frac{1}{1-{\mathfrak {p}}} Q( {\mathcal {C}} ) =\frac{1}{1-{\mathfrak {p}}} \sum _{i=1}^k \sum _{j=1}^k (n_{ij}{\mathrm{Var}}(C_{ij})+n_{ij}r_{ij}^2) \end{aligned}$$
(8)

And the total k-means cost function for the set of clusters \((S_1,\dots ,S_k)\) is bounded from below as follows:

$$\begin{aligned} Q({\mathcal {S}})\ge \sum _{j=1}^k \left( \left( \sum _{i=1}^k n_{ij}{\mathrm{Var}}(C_{ij})\right) + \left( {\sum _{i=1}^k n_{ij}} \right) \left( \sum _{i=1}^{k-1} \sum _{l=i+1}^k \frac{n_{ij}}{\sum _{i=1}^k n_{ij}}\frac{n_{lj}}{\sum _{i=1}^k n_{ij}} v_{ilj}^2 \right) \right) . \end{aligned}$$
(9)

If \((\overline{C_1},\dots ,\overline{C_k})\) is to constitute the absolute minimum of the k-means target function, then \(Q({\mathcal {S}})\ge Q(\overline{{\mathcal {C}}})\) must hold, which is fulfilled if:

$$\begin{aligned}&\sum _{j=1}^k \left( \left( \sum _{i=1}^k n_{ij}{\mathrm{Var}}(C_{ij})\right) + \left( {\sum _{i=1}^k n_{ij}} \right) \left( \sum _{i=1}^{k-1} \sum _{l=i+1}^k \frac{n_{ij}}{\sum _{i=1}^k n_{ij}}\frac{n_{lj}}{\sum _{i=1}^k n_{ij}} v_{ilj}^2 \right) \right) \\&\quad \ge \frac{1}{1-{\mathfrak {p}}}\sum _{i=1}^k \sum _{j=1}^k (n_{ij} {\mathrm{Var}}(C_{ij})+n_{ij}r_{ij}^2). \end{aligned}$$

Note that on the left-hand side of the inequality, the portion of the data outside of the cores has been ignored. This portion of the data would have made the left-hand side even bigger.

The above inequality is implied by:

$$\begin{aligned} \sum _{j=1}^k \left( \sum _{i=1}^{k-1} \sum _{l=i+1}^k \frac{n_{ij}n_{lj}}{\sum _{i=1}^k n_{ij}} v_{ilj}^2 \right) \ge \frac{1}{1-{\mathfrak {p}}}\sum _{i=1}^k \sum _{j=1}^k ({\mathfrak {p}}n_{ij}{\mathrm{Var}}(C_{ij})+n_{ij}r_{ij}^2). \end{aligned}$$
(10)

Note that \({\mathrm{Var}}(C_{ij})\le r_{ij}^2\), so

$$\begin{aligned} \frac{1}{1-{\mathfrak {p}}}\sum _{i=1}^k \sum _{j=1}^k ({\mathfrak {p}}n_{ij}{\mathrm{Var}}(C_{ij})+n_{ij}r_{ij}^2)&\le \frac{1}{1-{\mathfrak {p}}}\sum _{i=1}^k \sum _{j=1}^k (1+{\mathfrak {p}}) n_{ij}r_{ij}^2 \nonumber \\&= \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k \sum _{j=1}^k n_{ij}r_{ij}^2. \end{aligned}$$
(11)

To maximise \(\sum _{j=1}^k n_{ij}r_{ij}^2\) for a single cluster \(C_i\) with enclosing ball radius \(r_i\), note that one should set \(r_{ij}\) to \(r_i\). Let \(m_j=\arg \max _{j \in \{1,\dots ,k\}} n_{ij}\) (for the given i). If one sets \(r_{ij}=r_i\) for all j except \(m_j\), then the maximal \(r_{i{m_j}}\) is limited by the relation \(\sum _{j=1; j\ne m_j}^k n_{ij}r_{ij}\ge n_{i{m_j}}r_{i{m_j}}\). So

$$\begin{aligned} \sum _{j=1}^k n_{ij}r_{ij}^2&\le \left( \sum _{j=1; j\ne m_j}^k n_{ij}\right) r_i^2\min \left( 2,\left( 1+\frac{\sum _{j=1; j\ne m_j}^k n_{ij}}{n_{i{m_j}}} \right) \right) \nonumber \\&\le 2 \left( \sum _{j=1; j\ne m_j}^k n_{ij}\right) r_i^2. \end{aligned}$$
(12)

So if one can guarantee that the gap between the balls of the cores from \({\mathcal {C}}\) amounts to at least g, then surely

$$\begin{aligned} \sum _{j=1}^k \left( \sum _{i=1}^{k-1} \sum _{l=i+1}^k \frac{n_{ij}n_{lj}}{\sum _{i=1}^k n_{ij}} v_{ilj}^2 \right) \ge g^2 \sum _{j=1}^k \left( \sum _{i=1}^{k-1} \sum _{l=i+1}^k \frac{n_{ij}n_{lj}}{\sum _{i=1}^k n_{ij}} \right) \end{aligned}$$
(13)

because in that case \(g\le v_{ilj}\) for all i, l, j.

By combining inequalities (10)–(13), one sees that the global minimum is guaranteed if the following holds:

$$\begin{aligned} g^2 \sum _{j=1}^k \left( \sum _{i=1}^{k-1} \sum _{l=i+1}^k \frac{n_{ij}n_{lj}}{\sum _{i=1}^k n_{ij}} \right) \ge 2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k \left( \sum _{j=1; j\ne m_j}^k n_{ij}\right) r_i^2. \end{aligned}$$
(14)

One can distinguish two cases: either (1) there exists a cluster \(S_t\) containing two subclusters \(C_{pt}\), \(C_{qt}\) such that \(t=\arg \max _j |C_{pj}|\) and \(t=\arg \max _j |C_{qj}|\) (i.e. the maximum-cardinality subclusters of their respective original clusters \(C_p, C_q\)), or (2) no such cluster exists.

Consider the first case. Let \(C_p,C_q\) be the two clusters such that \(C_{pt}\) and \(C_{qt}\) are the subclusters of highest cardinality within \(C_p\) and \(C_q\), respectively. This implies that \(n_{pt}\ge \frac{1}{k} n_p\) and \(n_{qt}\ge \frac{1}{k} n_q\). It also implies that \(n_{it}\le n_i/2\) for \(i\ne p, i\ne q\).

$$\begin{aligned}&\sum _{j=1}^k \sum _{i=1}^{k-1} \sum _{l=i+1}^k \frac{n_{ij}n_{lj}}{\sum _{i=1}^k n_{ij}} \ge \sum _{i=1}^{k-1} \sum _{l=i+1}^k \frac{n_{it}n_{lt}}{\sum _{i=1}^k n_{it}} \ge \frac{n_{pt}n_{qt}}{\sum _{i=1}^k n_{it}} \\&\quad \ge \frac{n_{pt}n_{qt}}{n_p/2+n_q/2+\sum _{i=1}^k n_{i}/2 } = \frac{n_{pt}n_{qt}}{n_p/2+n_q/2+n/2} \\&\quad \ge \frac{1}{k^2} \frac{n_{p}n_{q}}{n_p/2+n_q/2+n/2}. \end{aligned}$$

Note that

$$\begin{aligned} 2 \sum _{i=1}^k \left( \sum _{j=1; j\ne m_j}^k n_{ij}\right) r_i^2 \le 2 \sum _{i=1}^k n_{i} r_i^2 \end{aligned}$$

So, in order to fulfil inequality (14), it is sufficient to require that

$$\begin{aligned} g&\ge \sqrt{ \frac{2\frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}} \sum _{i=1}^k n_{i} r_i^2 }{ \frac{1}{k^2} \frac{n_{p}n_{q}}{n_p/2+n_q/2+n/2} } } \nonumber \\&= k\sqrt{n_p/2+n_q/2+n/2} \sqrt{ \frac{2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k n_{i} r_i^2 }{ n_{p}n_{q} } } \nonumber \\&= k\sqrt{n_p +n_q +n } \sqrt{ \frac{ \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}} \sum _{i=1}^k n_{i} r_i^2 }{ n_{p}n_{q} } } \end{aligned}$$
(15)

This must of course be maximized over all combinations of p, q.

Let us proceed to the second case. Here each cluster \(S_j\) contains the maximum-cardinality subcluster of a different cluster \(C_i\). As the relation between \(S_j\) and \(C_i\) is unique, one can reindex the \(S_j\) in such a way that \(S_j\) contains the maximum-cardinality subcluster \(C_{jj}\) of \(C_j\). Let us rewrite inequality (14).

$$\begin{aligned} g^2 \sum _{j=1}^k \left( \sum _{i=1}^{k-1} \sum _{l=i+1}^k \frac{n_{ij}n_{lj}}{\sum _{i=1}^k n_{ij}} \right) - 2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k \left( \sum _{j=1; j\ne m_j}^k n_{ij}\right) r_i^2 \ge 0. \end{aligned}$$

This is met if

$$\begin{aligned} g^2 \sum _{j=1}^k \left( \sum _{i=1}^{j-1} \frac{n_{ij}n_{jj}}{\sum _{i=1}^k n_{ij}} + \sum _{l=j+1}^k \frac{n_{jj}n_{lj}}{\sum _{i=1}^k n_{ij}} \right) - 2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k (n_i- n_{ii}) r_i^2 \ge 0 \end{aligned}$$

This is the same as:

$$\begin{aligned} g^2 \sum _{j=1}^k \left( \sum _{i=1,\dots , {j-1},{j+1},\dots ,k} \frac{n_{ij}n_{jj}}{\sum _{i=1}^k n_{ij}} \right) - 2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k (n_i- n_{ii}) r_i^2 \ge 0. \end{aligned}$$

This is fulfilled if:

$$\begin{aligned} g^2 \sum _{j=1}^k \left( \sum _{i=1,\dots , {j-1},{j+1},\dots ,k} \frac{n_{ij}n_{j}/k}{n_j/2+\sum _{i=1}^k n_{i}/2} \right) - 2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k (n_i- n_{ii}) r_i^2 \ge 0. \end{aligned}$$

Let M be the maximum over \(n_1,\dots ,n_k\). The above holds if

$$\begin{aligned} g^2 \sum _{j=1}^k \left( \sum _{i=1,\dots , {j-1},{j+1},\dots ,k} \frac{n_{ij}n_{j}/k}{M/2+n/2} \right) - 2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k (n_i- n_{ii}) r_i^2 \ge 0 \end{aligned}$$

Let m be the minimum over \(n_1,\dots ,n_k\). The above holds if

$$\begin{aligned} g^2 \sum _{j=1}^k \left( \sum _{i=1,\dots , {j-1},{j+1},\dots ,k} \frac{n_{ij}m/k}{M/2+n/2} \right) - 2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k (n_i- n_{ii}) r_i^2 \ge 0 \end{aligned}$$

This is the same as

$$\begin{aligned}&g^2 \frac{m/k}{M/2+n/2} \left( \sum _{j=1}^k \sum _{i=1,\dots , {j-1},{j+1},\dots ,k} {n_{ij} } \right) - 2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k (n_i- n_{ii}) r_i^2 \ge 0 \\&g^2 \frac{m/k}{M/2+n/2} \sum _{j=1}^k \left( \left( \sum _{i=1}^k {n_{ij} } \right) - n_{jj} \right) - 2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k (n_i- n_{ii}) r_i^2 \ge 0 \\&g^2 \frac{m/k}{M/2+n/2} \left( \left( \sum _{j=1}^k\sum _{i=1}^k {n_{ij} } \right) - \left( \sum _{j=1}^k n_{jj} \right) \right) - 2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k (n_i- n_{ii}) r_i^2 \ge 0 \\&g^2 \frac{m/k}{M/2+n/2} \left( \left( \sum _{i=1}^k {n_{i} } \right) - \left( \sum _{j=1}^k n_{jj} \right) \right) - 2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k (n_i- n_{ii}) r_i^2 \ge 0 \\&g^2 \frac{m/k}{M/2+n/2} \left( \sum _{i=1}^k \left( {n_{i} -n_{ii}} \right) \right) - 2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\sum _{i=1}^k (n_i- n_{ii}) r_i^2 \ge 0 \\&\sum _{i=1}^k \left( {n_{i} -n_{ii}} \right) \left( g^2 \frac{m/k}{M/2+n/2} - 2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}} r_i^2 \right) \ge 0 \end{aligned}$$

The above will hold, if for every \(i=1,\dots ,k\)

$$\begin{aligned} g& {} \ge r_i \sqrt{\frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\frac{2}{ \frac{m/k}{M/2+n/2} }} \nonumber \\ g& {} \ge r_i \sqrt{k\frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\frac{M+n}{ m} }. \end{aligned}$$
(16)

So inequality (14) is fulfilled if both inequality (15) and inequality (16) hold for an appropriately chosen g.
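For later reference, the two gap conditions can be evaluated directly from the core cardinalities and radii. The sketch below (the function name is ours) returns the smallest gap satisfying both (15) and (16):

```r
# Sketch: smallest gap g satisfying conditions (15) and (16), given core
# cardinalities n_i, core radii r_i and the share p of variance outside cores.
required_gap <- function(n_i, r_i, p) {
  k <- length(n_i); n <- sum(n_i); fac <- (1 + p) / (1 - p)
  # condition (16), maximised over i (equivalently, using the maximal r_i)
  g16 <- max(r_i) * sqrt(k * fac * (max(n_i) + n) / min(n_i))
  # condition (15), maximised over all pairs p != q
  g15 <- 0
  for (a in 1:(k - 1)) for (b in (a + 1):k) {
    g15 <- max(g15, k * sqrt(n_i[a] + n_i[b] + n) *
                    sqrt(fac * sum(n_i * r_i^2) / (n_i[a] * n_i[b])))
  }
  max(g15, g16)
}
```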

In summary, it has been shown that

Theorem 4

Let \(\overline{{\mathcal {C}}}=\{\overline{C_1},\dots ,\overline{C_k}\}\) be a partition of a data set into k clusters of cardinalities \(\overline{n_1},\dots ,\overline{n_k}\) and with radii of the balls enclosing the clusters (with centres located at the cluster centres) equal to \(\overline{r_1},\dots , \overline{r_k}\). Let each of these clusters \(\overline{C_i}\) have a core \(C_i\) of radius \(r_i\) and cardinality \(n_i\) around the cluster centre such that for \({\mathfrak {p}}\in [0,1)\)

$$\begin{aligned} Q(\{C_i\})/Q(\{\overline{C_i}\})\ge 1-{\mathfrak {p}} \end{aligned}$$

Then, if the gap g between the cluster cores \(C_1,\dots ,C_k\) fulfils the conditions expressed in formulas (15) and (16), the partition \(\overline{{\mathcal {C}}}\) coincides with the global minimum of the k-means cost function for the data set.

Core-Based Approach to Clusterability

After the preceding preparatory work, a theorem analogous to Theorem 1 will be proven, but now allowing for smaller gaps between clusters.

Theorem 5

(i) If the data set is well-clusterable with a gap defined by formulas (16) and (15), with \(r_i\) replaced by their maxima, then with high probability k-means++ (after an appropriately chosen number of repetitions) will discover the respective clustering. (ii) If k-means++ (after an appropriately chosen number of repetitions) does not discover a clustering matching formulas (16) and (15) (with \(r_i\) replaced by their maxima), then with high probability the data set is not well clusterable with a gap defined by formulas (16) and (15).

The rest of the current section is devoted to the proof of the claims of this theorem.

Let us first note that, once a split has been obtained, one is able to compute for each cluster the cluster centre and the radius of the ball containing all data points of the cluster except the most distant ones (those contributing at most a share \({\mathfrak {p}}\) of the quality function of the cluster), and finally one can check whether the gaps between the cluster cores meet the requirements of formulas (15) and (16). So one is able to decide a posteriori that the data set has been found to be well-clusterable.
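A possible sketch of this a posteriori check is given below (the function and variable names are ours; it reuses required_gap() from the previous section and treats the gap between two cores as the distance between the cluster centres minus the two core radii):

```r
# Sketch of an a posteriori well-clusterability check: X is the data matrix,
# labels the clustering found, p the admissible share of per-cluster cost
# attributed to points outside the core.
is_well_clusterable <- function(X, labels, p) {
  ks <- sort(unique(labels))
  centres <- t(sapply(ks, function(j) colMeans(X[labels == j, , drop = FALSE])))
  core_r <- core_n <- numeric(length(ks))
  for (j in seq_along(ks)) {
    Cj <- X[labels == ks[j], , drop = FALSE]
    d2 <- sort(rowSums((Cj - matrix(centres[j, ], nrow(Cj), ncol(X), byrow = TRUE))^2))
    core <- which(cumsum(d2) >= (1 - p) * sum(d2))[1]  # smallest core keeping >= (1-p) of Q
    core_r[j] <- sqrt(d2[core]); core_n[j] <- core
  }
  gaps <- as.matrix(dist(centres)) - outer(core_r, core_r, `+`)
  diag(gaps) <- Inf
  min(gaps) >= required_gap(core_n, core_r, p)
}
```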

So let us look at claim (i). As demonstrated in the preceding “Core-Based Global k-Means Minimum” section, the global minimum of k-means coincides with the separation by the abovementioned gaps. Hence, if there exists a positive probability that k-means++ discovers the appropriate split, then by repeating independent runs of k-means++ and picking the split minimizing the k-means cost function one increases the probability of finding the global minimum. It will be shown that the number of repetitions needed can be known in advance, if an upper bound on the quotient M/m is assumed.

Let us assume it is guaranteed that

$$\begin{aligned} g \ge r \sqrt{k\frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}}\frac{M+n}{ m} } \end{aligned}$$
(17)

for any \(i=1,\dots ,k\)

$$\begin{aligned} g\ge k r \sqrt{\frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}} \frac{ n (n_p +n_q +n )}{ n_{p}n_{q} } } \end{aligned}$$
(18)

for any \(p,q=1,\dots ,k;\,p\ne q\), where \(n_i\), \(i=1,\dots ,k\), is the cardinality of cluster i, \(M=\max _i n_i\) and \(m=\min _i n_i\). For an illustration of this dependence see Fig. 7.

Fig. 7: Dependence of g/r for \(k=5\) on the value of \({\mathfrak {p}}\)

So let us turn to k-means++ seeding. If i distinct cluster cores have already been seeded, then the probability that a new cluster core will be seeded (under our assumptions, with \(k\ge 4\) as previously) amounts to at least

$$\begin{aligned}&\frac{(k-i) m (1-{\mathfrak {p}}) g^2}{(k-i) m g^2+i M \frac{1}{1-{\mathfrak {p}}}4r^2} \\&\quad \ge \frac{(k-i) m k^2 (1-{\mathfrak {p}}) \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}} n (1/m +1/m +n/m^2 ) }{(k-i) m k^2 \frac{1+{\mathfrak {p}}}{1-{\mathfrak {p}}} n (1/m +1/m +n/m^2 ) +i M \frac{1}{1-{\mathfrak {p}}} } \\&\quad = \frac{(k-i) k^2 (1-{\mathfrak {p}}) (1+{\mathfrak {p}}) n (2 +n/m ) }{(k-i) k^2 (1+{\mathfrak {p}}) n (2 +n/m ) +i M } \\&\quad \ge \frac{ k^2 (1-{\mathfrak {p}}) (1+{\mathfrak {p}}) n (2 +n/m ) }{ k^2 (1+{\mathfrak {p}}) n (2 +n/m ) +(k-1) M } \end{aligned}$$

So again the probability of successful seeding will amount to at least:

$$\begin{aligned}&\left( \frac{ k^2 (1-{\mathfrak {p}}) (1+{\mathfrak {p}}) n (2 +n/m ) }{ k^2 (1+{\mathfrak {p}}) n (2 +n/m ) +(k-1) M } \right) ^{k-1} \\&\quad =(1-{\mathfrak {p}})^{k-1}\left( 1- \frac{ (k-1) M }{ k^2 (1+{\mathfrak {p}}) n (2 +n/m ) +(k-1) M } \right) ^{k-1} \\&\quad =(1-{\mathfrak {p}})^{k-1}\left( 1- \frac{ (k-1)^2 M }{ k^2 (1+{\mathfrak {p}}) n (2 +n/m ) +(k-1) M } \frac{1}{k-1} \right) ^{k-1} \\&\quad \approx (1-{\mathfrak {p}})^{k-1} {\mathrm{exp}} \left( - \frac{ (k-1)^2 M }{ k^2 (1+{\mathfrak {p}}) n (2 +n/m ) +(k-1) M } \right) . \end{aligned}$$

For an illustration of this dependence see Fig. 8.

Fig. 8: Dependence of the accurate seeding on the share \({\mathfrak {p}}\) of variance outside the core for \(k=5\) and \(M/m=4\)

Apparently, in the limit the above expression is approximately \((1-{\mathfrak {p}})^{k-1}\).

So to achieve the identification of the clustering with probability of at least \({\mathrm{Pr}}_{\mathrm{succ}}\) (e.g. \(95\%\)), one will need R runs of k-means++ where

$$\begin{aligned} R=\frac{\log (1-{\mathrm{Pr}}_{\mathrm{succ}})}{\log (1-(1-{\mathfrak {p}})^{k-1})}. \end{aligned}$$

Note that

$$\begin{aligned} 1-(1-{\mathfrak {p}})^{k-1}\approx 1-e^{-{\mathfrak {p}}(k-1)}\approx 1- e^{-{\mathfrak {p}}k}. \end{aligned}$$

The effect of doubling k is

$$\begin{aligned} \frac{1- e^{-2{\mathfrak {p}}k}}{1- e^{-{\mathfrak {p}}k}} = \frac{(1- e^{-{\mathfrak {p}} k})\,(1+ e^{-{\mathfrak {p}}k})}{1- e^{-{\mathfrak {p}}k}} =1+ e^{-{\mathfrak {p}}k} \end{aligned}$$

that is, the effect of doubling k on the expression \(1-(1-{\mathfrak {p}})^{k-1}\) is sublinear; hence R, being reciprocal to the logarithm of this expression, grows only slowly with k and \({\mathfrak {p}}\). For an illustration of this relation, see Fig. 9.

Fig. 9: Repetitions needed to get accurate seeding on the share \({\mathfrak {p}}\) of variance outside the core for \(k=5\) and \(M/m=4\)
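The number of repetitions R implied by the formula above can be tabulated directly; a small sketch, using \((1-{\mathfrak {p}})^{k-1}\) as the single-run success probability:

```r
# Sketch: number of k-means++ repetitions for a target success probability,
# using (1 - p)^(k - 1) as the single-run success probability.
reps_needed <- function(k, p, pr_succ = 0.95) {
  ceiling(log(1 - pr_succ) / log(1 - (1 - p)^(k - 1)))
}
outer(c(2, 5, 10), c(0.05, 0.1, 0.2), reps_needed)   # rows: k, columns: p
```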

So far we have concentrated on showing that if the data are well clusterable, then within a practically reasonable number of k-means++ runs the seeding will have the property that each cluster obtains exactly one seed. But what about the rest of the k-means run? As shown in the “Smaller Gaps Between Clusters” section, the cluster centres will never switch to balls encompassing other clusters, so that eventually the true cluster structure is detected and the minimum of Q is reached. This completes the proof of claim (i). The demonstration of claim (ii) is straightforward. If the data were well clusterable, then k-means++ would have failed to identify this with probability of at most \(1-{\mathrm{Pr}}_{\mathrm{succ}}\). As well-clusterable data are extremely rare in practice, the failure of the algorithm to identify a well-clusterable structure implies, with probability of at least \({\mathrm{Pr}}_{\mathrm{succ}}\), that no such structure exists in the data. A detailed proof follows the reasoning of the last part of the proof of Theorem 1.

Experimental Results

Goals of Experiments

In order to illustrate the issues raised in this paper, three types of experiments were performed. The first experiment, performed on synthetic data, is devoted to the mismatch between gaps separating subsets of the data and the clusterings obtained by common clustering algorithms. The results are shown in Table 1. The second, performed on synthetic data (Table 2), and the third, performed on real data (Table 3), are devoted to demonstrating that k-means++ is able to discover well-clusterable data. In particular, it is shown that: (1) if a dataset is well clusterable as defined in Theorem 1 or 5 (based on Definition 1), then k-means++ is able to identify the best clustering (both for real-world and synthetic datasets); (2) if k-means++ cannot find a clustering satisfying well-clusterability, then, with high probability, there is no good clustering structure fitting those definitions hidden in the data for any k-means-style algorithm.

Table 2 Reconstruction of the clusters in synthetic data. Notation in text
Table 3 Reconstruction of the clusters in real data from R library datasets

Problem of Gap Insufficiency

In the “Failure of Gaps as a Criterion for Clusterability” section, we drew attention to the fact that fixing the gap alone is an insufficient criterion to define well-clusterability for k-means-like (or centre-based) algorithms. An unbalanced sample can shift the position of the clustering optimum. To confirm this, the experiment reported in Table 1 was performed on synthetic data for a number of algorithms from the k-means family and some others. The k-means implementation kmeans from the R package stats was used with the methods “Hartigan–Wong”, “Lloyd”, “Forgy” and “MacQueen”. The algorithms were run with the nstart parameter equal to 1, 10 and 20. Additionally, experiments with the function cmeans (implementing the Fuzzy c-means algorithm) from the package e1071 were performed with the methods “cmeans” and “ufcl”. For comparison, the single link algorithm as implemented in the hclust function of the package stats was run. We implemented our own version of the initialization of the k-means++ algorithm and used it in combination with the standard kmeans of R. Two variants of k-means++ were used: one with a single start and one with two starts.
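For orientation, the kinds of calls involved look roughly as follows (a sketch; parameter values other than those named above are illustrative, X denotes the data matrix and k the number of clusters; kmeanspp_seed() is the seeding sketch given earlier):

```r
# Sketch of the kinds of calls used in this experiment (values illustrative).
library(e1071)                                     # provides cmeans()
km   <- kmeans(X, centers = k, nstart = 20, algorithm = "MacQueen")
fcm  <- cmeans(X, centers = k, method = "ufcl")
sl   <- cutree(hclust(dist(X), method = "single"), k = k)
kmpp <- kmeans(X, centers = kmeanspp_seed(X, k))   # k-means++ initialization
```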

We fix the gap parameter to \(g/r_{\mathrm{max}}=2\), as proposed, e.g., in the so-called perfect clusterability criterion.

With increasing spread between cluster sizes, all algorithms but single link have increasing difficulty in detecting the “perfect” clusters. The additionally displayed indicator relQ (relative quality, the quotient of the achieved k-means cost function value and the smallest one) explains the reason: k-means optimizes the k-means cost function, and single link does not.

k-Means++ Discovering Well-Clusterable Data for Synthetic Data

In Tables 2, 3 and 4, illustrating the second and third experiment sets, the following abbreviations are used in the column titles: Errors k-means means the number of errors (not recognizing the correct clustering) per 100 runs of the k-means algorithm with random initialization and 1 start. Errors k-means++ means the same for k-means++. WC disc. means the number of times per 100 runs that k-means++ discovered the well-clusterable structure in the data, whereas WC not disc. means the number of times k-means++ did not discover such a structure. Both are split into two categories: cc (correct clustering in the sense of the assumed cluster structure) and wc (wrong clustering in the same sense).

Table 4 Reconstruction of the clusters in real data from the kddcup

The set-up of the second experiment is as follows: a generator with parameters k (number of clusters), M, m (maximum and minimum cluster size), d (dimensionality), \({\mathfrak {p}}\) (share of variance outside of the core) and gp (the generated gap size relative to the one required to certify that a well-clusterable structure was found) is used. Some parameters are fixed for this experiment: \(M= 75, m= 45, d= 10\), and only \(k, {\mathfrak {p}}\) and gp are varied. In each run, a new sample is generated.
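A possible shape of such a generator is sketched below (our own illustration, not necessarily the generator used for Table 2; it reuses required_gap() from the “Core-Based Global k-Means Minimum” section and places the clusters along the first coordinate axis):

```r
# Sketch of a cluster generator: k clusters of sizes between m and M in d
# dimensions, consecutive cluster balls separated by gp times the required gap.
gen_clusters <- function(k, m, M, d, gp, r = 1) {
  sizes <- sample(m:M, k, replace = TRUE)
  gap   <- gp * required_gap(sizes, rep(r, k), p = 0)
  X <- NULL; labels <- NULL
  for (i in 1:k) {
    centre <- c((i - 1) * (2 * r + gap), rep(0, d - 1))
    dirs <- matrix(rnorm(sizes[i] * d), ncol = d)                       # uniform points
    dirs <- dirs / sqrt(rowSums(dirs^2)) * r * runif(sizes[i])^(1 / d)  # in a d-ball
    X <- rbind(X, sweep(dirs, 2, centre, `+`))
    labels <- c(labels, rep(i, sizes[i]))
  }
  list(X = X, labels = labels)
}
```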

One sees from Table 2 that k-means++ almost always discovers the well-clusterable structure in the data when it exists (it failed once in 1800 experiments), and when it failed, the discovered structure was a non-well-clusterable one. In the 1800 experiments in which the well-clusterable structure did not exist, it never reported a well-clusterable structure. It is worth noting that even when the gap between the clusters is only half the well-clusterability gap, k-means++ discovered the intrinsic structure in the data nearly always (except for 4 cases). Interestingly, k-means discovers the intrinsic structure in the data when \(k=2\). But it fails when the number of clusters is larger, and increasing the gaps between clusters does not improve its performance.

k-Means++ Discovering Well-Clusterable Data for Real Data

The set-up of the third experiment is as follows. Datasets from the R library datasets with at least 100 records were selected (“DNase”, “iris”, “Theoph”, “attenu”, “faithful”, “infert”, “quakes”, “randu”). Non-numeric columns were removed, the numerical columns were normalized (mean \(=\) 0, SD \(=\) 1), and a clustering via k-means++ into k (= 2, 3, 5) clusters with 200 starts was performed; this was taken as the “ground truth” clustering. Then the clustering experiments were performed 100 times, computing the same statistics as in the second experiment, for the original data set and for the data set modified so that the gaps between clusters equal g/2, g and 2g, where g is the gap required by the well-clusterability condition.

In an analogous way, the kddcup data downloaded from the website https://sci2s.ugr.es/keel/category.php?cat=clas&order=ins#sub2 were processed. Not the entire set was used, but rather some 10,000 randomly selected records from it. This extra data set was used to show the performance of the clustering for a larger number of clusters (2, 4, 8, 16, 32, 64 and 128). As can be seen, the algorithm handles the increasing number of clusters correctly.

One sees that both k-means++ and k-means do not perform well on the original data sets. But if the gaps between clusters amount to at least half the well-clusterability gap, k-means++ discovers the intrinsic clustering; k-means does so only when \(k=2\). k-means++ always discovers the well-clusterable structure when it exists and never reports one when it does not exist. For \(k>2\), k-means is not helped by increasing the gap between clusters.

Summarizing, it has been demonstrated empirically that: (1) the gap size alone does not guarantee the discovery of clusters seen by human inspection for the class of k-means-like algorithms; (2) if a dataset is well-clusterable as defined in Theorem 1 or 5 (based on Definition 1), then k-means++ is able to identify the best clustering (both for real-world and synthetic datasets); (3) if k-means++ cannot find a clustering satisfying well-clusterability, then, with high probability, there is no good clustering structure fitting those definitions hidden in the data for any k-means-style algorithm.

Impact of k, Data Dimensionality and Gap on Clustering Performance

A number of further experiments have been performed on synthetic data in order to show the behaviour of several algorithms from the k-means family on data that were generated with k-means++ clusterability in mind.

The impact of the parameters k (the number of clusters), d (the number of dimensions), the cluster size and the quotient of the maximum to the minimum cluster size was investigated. Additionally, the impact of reducing the gap between clusters on the performance of the clustering was studied (the breakgap parameter indicates by what number the calculated clusterability gap was divided).

The impact was measured as follows: 10 different cluster sets were generated randomly for each value of the parameter (e.g. for \(k=\) 2, 4, 8 and 16 clusters), separated by an appropriate gap. Then each of the algorithms was applied 1000 times. Each outcome was classified as either correct or incorrect; the outcome was correct if exactly the predefined set of clusters was discovered by the run of the algorithm, otherwise an error was reported. Each cell in the tables contains four numbers: the average number of errors over the 10 cluster sets (number of errors out of 1000 runs), the standard deviation, the time in seconds taken by the 1000 runs of the algorithm and the average obtained relative quality. The relative quality was measured as the quotient of the achieved Q cost function value and the minimal one obtained by any algorithm for the given cluster set.

Both the case of gap separation of the clusters (see Tables 5, 6, 7, 8, 9) and the case of gap separation of the cores (see Tables 10, 11, 12, 13, 14, 15) were investigated. In the case of cores, only the correct classification of the core elements was considered.

Table 5 Dependence of the number of errors on k
Table 6 Dependence of the number of errors on d
Table 7 Dependence of the number of errors on M to m proportion
Table 8 Dependence of the number of errors on cluster size, M and m kept proportional
Table 9 Dependence of the number of errors on breakgap
Table 10 Dependence of the number of errors on k
Table 11 Dependence of the number of errors on d
Table 12 Dependence of the number of errors on M to m proportion
Table 13 Dependence of the number of errors on cluster size, M and m kept proportional
Table 14 Dependence of the number of errors on breakgap
Table 15 Dependence of the number of errors on \({\mathfrak {p}}\)

Two basic insights can be gained from all these tables. The well-clusterability concept defined in this paper is well suited to the single link algorithm: single link is designed to identify data where there are large gaps between clusters. It is also significantly quicker than k-means++ here. k-means++ is the second best performing algorithm, but at the same time the slowest one. The poor speed comparison results from the fact that only small data sets were used: the running time of single link grows quadratically with the size of the data set, while that of k-means depends quadratically on k and linearly on the size of the data set. Besides, the memory consumption of single link is quadratic in the size of the data, while that of k-means++ is linear. Hence single link is a competitor for toy examples only.

Table 5 demonstrates that the two-cluster task is easy for all the algorithms except for the ufcl variant of Fuzzy c-means. However, with an increase in the number of clusters, the capability of detecting the optimal clustering deteriorates strongly, except for the two algorithms mentioned above (single link and k-means++). The cmeans variant of Fuzzy c-means is the least affected among the algorithms 1:14. Understandably, the variants of k-means with more restarts perform better than those with fewer restarts. The same conclusions can be drawn from the core-based clusterability experiment summarized in Table 10, though of course the off-core elements contribute to a worsening of the performance.

The impact of dimensionality is shown in Tables 6 and 11. Increasing the dimensionality deteriorates the performance only slightly (compared to the impact of the number of clusters), and in the case of cmeans it even seems to improve it.

Tables 8 and 13 present the influence of the cluster sizes on the performance. Increasing the cluster size negatively influences the performance, except for cmeans, which seems to take advantage of the increased sample size.

The spread of cluster sizes, as illustrated by Tables 7 and 12, apparently does not affect the performance (though again cmeans takes advantage of it in the case represented by Table 7). This would mean that the spread is captured quite well by our formula for g.

Decreasing the gap size below the value determined by our formulas, as shown in Tables 9 and 14, surprisingly worsens the performance of cmeans and k-means++ only. Traditional k-means algorithms with 20 restarts then apparently outperform k-means++. This would mean that for small gaps k-means++ loses its advantage of hitting distant points with high probability, because these distant points may lie within the same cluster rather than in distinct clusters.

As visible from Table 15, the share of off-core variance seems not to affect the performance of the algorithms, except for a deterioration of the cmeans performance.

Discussion

Ben-David [10] investigated several clusterability notions under the following criteria:

  • practical relevance—most real-world data sets should be clusterable;

  • existence of quick clustering algorithms;

  • existence of quick clusterability testing algorithms;

  • some popular algorithms should perform well on clusterable data.

As far as practical relevance is concerned, Ben-David found that the clusterability criteria he investigated are usually not met because an impractically large gap between cluster centres is required. Our proposed clusterability criteria are as impractical as the ones he investigated: large gaps are required, and their size grows with the spread of cluster sizes, the number of clusters and the cluster radius. This is due, among other things, to the fact that k-means does not directly exploit the gaps between clusters. In this respect, our proposal does not constitute progress.

As far as algorithmic complexity is concerned, k-means++ has no computational disadvantage compared to the algorithms Ben-David considered. The gain is that usually one iteration of k-means suffices to establish the final result (the whole complexity resides in the initialization stage).

With respect to efficient testability of the clusterability conditions, Ben-David points to the problem that checking a clusterability condition requires knowledge of the optimal clustering, which is NP-hard to find. Here our proposal makes major progress. Given an output clustering, if one checks that the distances between clusters match the requirements imposed on the gap g, one is assured that one has found the optimal clustering. On the other hand, if the data are clusterable in our sense, then it is highly unlikely that the standard k-means++ algorithm will not find it.

With respect to the last requirement, one can say that k-means++ is actually a popular algorithm and it behaves reasonably.

Clusterable Data Sets for Varying Values of k

The described clusterability criteria are based on the knowledge of the k value to check for. The following questions can be raised in this context:

  • Does there exist another k for which the clusterability criterion holds?

  • Can one “guess” beforehand which k one should check for potential clusterability?

Let us restrict ourselves here to the non-core case only.

The first question has one trivial and some non-trivial answers. If k is equal to the sample size, then the clusterability condition is trivially fulfilled.

But given that a data set is clusterable for k, what can one say about its clusterability for \(k+k'\) (\(k'\ge 1\))? Assume that the optimal clustering for \(k+k'\), \({\mathcal {C}}^{(k+k')}\), involves clusters \(C_j^{(k+k')},\,j=1,\dots ,\,k+k'\), each of which contains objects from different clusters of the optimal clustering for k, \({\mathcal {C}}^{(k)}\). One might then have the case that the centres of the clusters \(C_j^{(k+k')}\) are, for each element of a cluster \(C_i\in {\mathcal {C}}^{(k)}\) of the k-clustering, at least \(r^{(k)}_i=r_i\) away from the border of the ball enclosing \(C_i\). This, however, is impossible, because in such a case the clustering \({\mathcal {C}}^{(k)}\) would have a cost function value lower than the clustering \({\mathcal {C}}^{(k+k')}\). Therefore, consider a cluster \(C_j^{(k+k')}\) with centre at distance at least \(r_i\) and at most \(2r_i\) from the cluster \(C_i\in {\mathcal {C}}^{(k)}\). It has to have a radius of at least \(g-r_i>r_i\), where g is the gap for k-clusterability of the data set. Each element of the cluster \(C_i\) is then at most \(3r_i\) away from the centre of \(C_j^{(k+k')}\). The required gap \(g'\) for \((k+k')\)-clusterability would amount to more than \(2(g-r_i)>2r_i\), which means that all elements of the cluster \(C_i\) must belong to \(C_j^{(k+k')}\). Under these circumstances, only \(k-1\) clusters of the clustering \({\mathcal {C}}^{(k+k')}\) would be non-empty, which is a contradiction. If one of the clusters of \({\mathcal {C}}^{(k+k')}\) is identical with one from \({\mathcal {C}}^{(k)}\), then one can drop it and reason as above.

Hence, if a data set is k and \(k+k'\) clusterable, then each of the clusters in \({\mathcal {C}}^{(k+k')}\) must be contained in one of the clusters from \({\mathcal {C}}^{(k)}\).

Furthermore, the cluster with the largest radius needs to be split for sure.

Now consider a \((k+k')\)-means++ clustering when the data set is k-clusterable. As each of the clusters of \({\mathcal {C}}^{(k+k')}\) is a subset of one of the clusters from \({\mathcal {C}}^{(k)}\), the smallest distance between clusters in \({\mathcal {C}}^{(k+k')}\) would surely be smaller than the gap g required for the \({\mathcal {C}}^{(k)}\) clusters. Hence, the search for \(k<K\) clusterability could proceed as follows (see the sketch below): perform K-means++; check for clusterability; if it fails, remove one of the centres of the two closest clusters (\(K:=K-1\)); recluster with the remaining centres as seeds; and repeat the procedure until clusterability is found or \(K=1\).
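A sketch of this top-down search, reusing kmeanspp_seed() and is_well_clusterable() from the earlier sketches (again an illustration only):

```r
# Sketch of the top-down search for a clusterable k < K.
search_clusterable_k <- function(X, K, p = 0) {
  fit <- kmeans(X, centers = kmeanspp_seed(X, K))          # K-means++ run
  while (K > 1) {
    if (is_well_clusterable(X, fit$cluster, p)) return(list(k = K, fit = fit))
    d <- as.matrix(dist(fit$centers)); diag(d) <- Inf
    drop <- which(d == min(d), arr.ind = TRUE)[1, 1]       # one of the two closest centres
    K <- K - 1
    fit <- kmeans(X, centers = fit$centers[-drop, , drop = FALSE])  # recluster with remaining seeds
  }
  NULL                                                     # no clusterable k found
}
```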

Clusterability Criteria for Other Clustering Algorithms and Cluster Models

Algorithms from the k-means family are known to work well for Gaussian-like data, where clusters are generated based on compactness. However, for data distributed differently (e.g. concentric rings), k-means usually fails. Therefore, in such situations, clustering algorithms such as spectral clustering or kernel-function-based clustering are claimed to handle such data better by leveraging connectivity information.

This picture needs to be modified slightly. Notably, k-means also underpins several spectral clustering algorithms, e.g. those based on combinatorial, normalised and random-walk Laplacians. There also exist kernel-based variants of the k-means algorithm. Therefore, the results about cluster separation presented in this research apply in those cases as well. This means that the questions about easiness of clustering for spectral methods may be studied in a way similar to the one discussed in this paper, though not in the original data space but rather in the spectral space (spanned by the eigenvectors associated with the lowest eigenvalues). Of course, an investigation would be necessary into how well-separatedness in the spectral space translates into properties of the clusters in the original space. The very same applies to kernel k-means, where the clustering in the feature space follows the paradigm of classical k-means, so that the results of this paper apply to the feature space. Again, a study of the various kernel functions typically applied would be necessary in order to determine the relationship between well-separatedness in the feature space and the cluster shapes and/or separation in the original space.

Note also that, as shown in this paper (see the “Failure of Gaps as a Criterion for Clusterability” section), the claim that k-means works well for Gaussian distributions also needs to be verified in the case of unbalanced clusters. Visually separated clusters can be clumped together, and a visually uniform cluster may be split into parts already by the k-means cost function itself. Furthermore, with high probability, the k-means algorithm with random initialization will not find the intrinsic clustering even if the clusters are well separated and the cost function points at the clustering into well-separated clusters (see the “Experimental Results” section).

Conclusions

We have defined the notion of a well-clusterable data set from the point of view of the objective of the k-means clustering algorithm and of common sense, in two variants: without any data points in the large gaps between clusters, and with data points there. The novelty introduced here, compared to other work on this topic, is that one can check a posteriori (after running k-means) whether the data set is well-clusterable or not.

Let us compare our results to those of other investigators in the realm of well-clusterability, in particular those presented in [1, 2, 7, 8, 9, 15, 20].

If the data are well clusterable according to the criteria of Perturbation Robustness, \(\epsilon\)-Separatedness, \((c, \epsilon )\)-Approximation-Stability, \(\alpha\)-Centre Stability, \((1+\alpha )\) Weak Deletion Stability or Perfect Separation, one can reconstruct the well-clustered structure using an appropriate algorithm. But only in the case of Perfect Separation or Nice Separation can one decide that one has found the structure, once it has been found. Note that there is no guarantee that Nice Separation will be found if it is present. And for none of these ways of understanding well-clusterability is one able to decide (neither a priori nor a posteriori) that the data are not well-clusterable if the well-clustered structure was not found (unless by brute force).

Formally, the only exception is the method of Multi-modality Detection, as proposed in [1], which tells us a priori whether the data are or are not well-clusterable. However, as we have demonstrated by the example in Fig. 1 in the “Failure of Multi-modality Detection for Detection of Gaps Between Data” section, data can easily be found that fool this method, so that it claims well-clusterability where there is none.

Under the definitions of well-clusterability presented in this paper, one gets a completely new situation. It is guaranteed that if the well-clusterable structure is there, it will be detected with high probability. One can check a posteriori, with 100% certainty, that the structure found is a well-clusterable structure, if it is so. Furthermore, if the (k-means++) algorithm did not find a well-clusterable structure, then with high probability no such structure is present in the data.

The paper contains a couple of other, minor contributions. The concept of cluster cores has been introduced such that, once a k-means seed hits each core, it is guaranteed that none of the cluster centres will ever leave its cluster. It has been shown that the number of reruns of k-means++ is small when a desired probability of success in detecting well-clusterability is targeted. Numerical examples show that gaps between clusters several orders of magnitude smaller than in [20] suffice to state that the data are well clusterable, and still the probability of detecting the well-clusterable structure is much higher (even close to one in a single k-means++ run).

The procedure elaborated for constructing a well-clusterable data set, which ensures that the absolute minimum of the k-means cost function is attained at a predefined data partition, may find applications in testing procedures for clustering algorithms.

Of course, a number of research questions related to the topic of this paper remain open. The first is the issue of constructing tight (or at least tighter) bounds for the required gaps between clusters. The second is an investigation of how violations of these minimum values influence the capability of k-means algorithms to detect either the absolute minimum of their cost function or a partition that humans intuitively consider a “good clustering”.