Even Faster Exact k-Means Clustering

. A na¨ıve implementation of k -means clustering requires computing for each of the n data points the distance to each of the k cluster centers, which can result in fairly slow execution. However, by storing distance information obtained by earlier computations as well as information about distances between cluster centers, the triangle inequality can be exploited in diﬀerent ways to reduce the number of needed distance computations, e.g. [3–5,7,11]. In this paper I present an improvement of the Exponion method [11] that generally accelerates the computations. Furthermore, by evaluating several methods on a fairly wide range of artiﬁcial data sets, I derive a kind of map, for which data set parameters which method (often) yields the lowest execution times.


Introduction
The k-means algorithm [9] is, without doubt, the best known and (among) the most popular clustering algorithm(s), mainly because of its simplicity.However, a naïve implementation of the k-means algorithm requires O(nk) distance computations in each update step, where n is the number of data points and k is the number of clusters.This can be a severe obstacle if clustering is to be carried out on truly large data sets with hundreds of thousands or even millions of data points and hundreds to thousands of clusters, especially in high dimensions.
Hence, in our "big data" age, considerable effort was spent on trying to accelerate the computations, mainly by reducing the number of needed distance computations.This led to several very clever approaches, including [3][4][5]7,11].These methods exploit that for assigning data points to cluster centers knowing actual distances is not essential (in contrast to e.g.fuzzy c-means clustering [2]).All one really needs to know is which center is closest.This, however, can sometimes be determined without actually computing (all) distances.
A core idea is to maintain, for each data point, bounds on its distance to different centers, especially to the closest center.These bounds are updated by exploiting the triangle inequality, and can enable us to ascertain that the center that was closest before the most recent update step is still closest.Furthermore, by maintaining additional information, tightening these bounds can sometimes be done by looking at only a subset of the cluster centers.
In this paper I present an improvement of one of the most sophisticated of such schemes: the Exponion method [11].In addition, by comparing my new approach to other methods on several (artificial) data sets with a wide range of number of dimensions and number of clusters, I derive a kind of map, for which data set parameters which method (often) yields the lowest execution times.

k-Means Clustering
The k-means algorithm is a very simple, yet effective clustering scheme that finds a user-specified number k of clusters in a given data set.This data set is commonly required to consist of points in a metric space.The algorithm starts by choosing an initial set of k cluster centers, which may naïvely be obtained by sampling uniformly at random from the given data points.In the subsequent cluster center optimization phase, two steps are executed alternatingly: (1) each data point is assigned to the cluster center that is closest to it (that is, closer than any other cluster center) and (2) the cluster centers are recomputed as the vector means of the data points assigned to them (to enable these mean computations, the data points are supposed to live in a metric space).
Using ν m (x) to denote the cluster center m-th closest to a point x in the data space, this update scheme can be written (for n data points x 1 , . . ., x n ) as ∀i; , where the indices t and t + 1 indicate the update step and the function 1(φ) yields 1 if φ is true and 0 otherwise.Here ν t 1 (x j ) represents the assignment step and the fraction computes the mean of the data points assigned to center c i .
It can be shown that this update scheme must converge, that is, must reach a state in which another execution of the update step does not change the cluster centers anymore [14].However, there is no guarantee that the obtained result is optimal in the sense that it yields the smallest sum of squared distances between the data points and the cluster centers they are assigned to.Rather, it is very likely that the optimization gets stuck in a local optimum.It has even been shown that k-means clustering is NP-hard for 2-dimensional data [10].
Furthermore, the quality of the obtained result can depend heavily on the choice of the initial centers.A poor choice can lead to inferior results due to a local optimum.However, improvements of naïvely sampling uniformly at random from the data points are easily found, for example the Maximin method [8] and the k-means++ procedure [1], which has become the de facto standard.

Bounds-Based Exact k-Means Clustering
Some approaches to accelerate the k-means algorithm rely on approximations, which may lead to different results, e.g.[6,12,13].Here, however, I focus on methods to accelerate exact k-means clustering, that is, methods that, starting from the same initialization, produce the same result as a naïve implementation.The core idea of these methods is to compute for each update step the distance each center moved, that is, the distance between the new and the old location of the center.Applying the triangle inequality one can then derive how close or how far away an updated center can be from a data point in the worst possible case.For this we distinguish between the center closest (before the update) to a data point x j on the one hand and all other centers on the other.
k Distance Bounds.The first approach along these lines was developed in [5] and maintains one distance bound for each of the k cluster centers.
For the center closest to a data point x j an upper bound u t j on its distance is updated as shown in Fig. 1(a): If we know before the update that the distance between x j and its closest center c t j1 = ν t 1 (x j ) is (at most) u t j , and the update moved the center c t j1 to the new location c t * j1 , then the distance d(x j , c t * j1 ) between the data point and the new location of this center1 cannot be greater than ).This bound is actually reached if before the update the bound was tight and the center c t j1 moves away from the data point x j on the straight line through x j and c t j1 (that is, if the triangle is "flat").For all other centers, that is, centers that are not closest to the point x j , lower bounds ji , i = 2, . . ., k, are updated as shown in Fig. 1(b): If we know before the update that the distance between x j and a center c t ji = ν t i (x j ), is (at least) t ji , and the update moved the center c t ji to the new location c t * ji , then the distance d(x j , c t * ji ) between the data point and the new location of this center cannot be less than t+1 ji . This bound is actually reached if before the update the bound was tight and the center c t ji moves towards the data point x j on the straight line through x j and c t ji ("flat" triangle).
These bounds are easily exploited to avoid distance computations for a data point x j : If we find that u t+1 j ji , that is, if the upper bound on the distance to the center that was closest before the update (in step t) is less than the smallest lower bound on the distances to any other center, the center that was closest before the update must still be closest after the update (that is, in step t + 1).Intuitively: even if the worst possible case happens, namely if the formerly closest center moves straight away from the data point and the other centers move straight towards it, no other center can have been brought closer than the one that was already closest before the update.
And even if this test fails, one first computes the actual distance between the data point x j and c t * j1 .That is, one tightens the bound u t+1 j to the actual distance and then reevaluates the test.If it succeeds now, the center that was closest before the update must still be closest.Only if the test fails also with the tightened bound, the distances between the data point and the remaining cluster centers have to be computed in order to find the closest center and to reinitialize the bounds (all of which are tight after such a computation).
This scheme leads to considerable acceleration, because the cost of computing the distances between the new and the old locations of the cluster centers as well as the cost of updating the bounds is usually outweighed by the distance computations that are saved in those cases in which the test succeeds.
2 Distance Bounds.A disadvantage of the scheme just described is that k bound updates are needed for each data point.In order to reduce this cost, in [7] only two bounds are kept per data point: u t j and t j , that is, all non-closest centers are captured by a single lower bound.This bound is updated according to . Even though this leads to worse lower bounds for the non-closest centers (since they are all treated as if they moved by the maximum of the distances any one of them moved), the fact that only two bounds have to be updated leads to faster execution, at least in many cases.
YinYang Algorithm.Instead of having either one distance bound for each center (k bounds) or capturing all non-closest centers by a single bound (2 bounds), one may consider a hybrid approach that maintains lower bounds for subsets of the non-closest centers.This improves the quality of bounds over the 2 bounds approach, because bounds are updated only by the maximum distance a center in the corresponding group moved (instead of the global maximum).On the other hand, (considerably) fewer than k bounds have to be updated.This is the idea of the YinYang algorithm [4], which forms the groups of centers by clustering the initial centers with k-means clustering.The number of groups is chosen as k/10 in [4], but other factors may be tried.The groups found initially are maintained, that is, there is no re-clustering after an update.
However, apart from fewer bounds (compared to k bounds) and better bounds (compared to 2 bounds), grouping the centers has yet another advantage: If the bounds test fails, even with a tightened bound u t j , the groups and their bounds may be used to limit the centers for which a distance recomputation is needed.Because if the test succeeds for some group, one can infer that the closest center  cannot be in that group.Only centers in groups, for which the group-specific test fails, need to be considered for recomputation.
Cluster to Cluster Distances.The described bounds test can be improved by not only computing the distance each center moved, but also the distances between (updated) centers, to find for each center another center that is closest to it [5].With my notation I can denote such a center as ν t+1 2 (c t * j1 ), that is, the center that is second closest )).If this is the case, the center that was closest to the data point x j before the update must still be closest after, as , because a center is certainly the center closest to itself.
is illustrated in Fig. 2 for the worst possible case (namely x j , c t * ji and ν t+1 2 (c t * j1 ) lie on a straight line with c t * ji and ν t+1 2 (c t * j1 ) on opposite sides of x j ).Note that this second test can be used with k as well as with 2 bounds.However, it should also be noted that, although it can lead to an acceleration, if used in isolation it may also make an algorithm slower, because of the O(k 2 ) distance computations needed to find the Annular Algorithm.With the YinYang algorithm an idea appeared on the scene that is at the focus of all following methods: try to limit the centers that need to be considered in the recomputations if the tests fail even with a tightened bound u t+1 j .Especially, if one uses the 2 bounds approach, significant gains may be obtained: all we need to achieve in this case is to find c t+1 i1 = ν t+1 1 (x j ) and c t+1 i2 = ν t+1 2 (x j ), that is, the two centers closest to x j , because these are all that is needed for the assignment step as well as for the (tight) bounds u t+1 j and t+1 j .One such approach is the Annular algorithm [3].For its description, as generally in the following, I drop the time step indices t + 1 in order to simplify the notation.The Annular algorithm relies on the following idea: if the tests described above fail with a tightened bound u j , we cannot infer that c t * ji is still the center closest to x j .But we know that the closest center must lie in (hyper-) ball with radius u j around x j (darkest circle in Fig. 3).Any center outside this (hyper-)ball cannot be closest to x j , because c t * ji is closer.Furthermore, if we know the distance to another center closest to c t * ji , that is, ν 2 (c t * j1 ), we know that even in the worst possible case (which is depicted in Fig. 3: x j , c t * ji and ν 2 (c t * j1 ) lie on a straight line), the two closest centers must lie in a (hyper-)ball with radius u j + δ j around x j , where δ j = d(c t * i1 , ν 2 (c t * j1 )) (medium circle in Fig. 3), because we already know two centers that are this close, namely c t * ji and ν 2 (c t * j1 ).Therefore, if we know the distances of the centers from the origin, we can easily restrict the recomputations to those centers that lie in a (hyper-)annulus (hence the name of this algorithm) around the origin with c t * j1 in the middle and thickness 2θ j , where θ j = 2u j + δ j with δ j = d(c t * i1 , ν 2 (c t * j1 )) (see Fig. 3, light gray ring section, origin in the bottom left corner; note that the green line is perpendicular to the red/blue lines only by accident/for drawing convenience).
Exponion Algorithm.The Exponion algorithm [11] improves over the Annular algorithm by switching from annuli around the origin to (hyper-)balls around the (updated) formerly closest center c t * j1 .Again we know that the center closest to x j must lie in a (hyper-)ball with radius u j around x j (darkest circle in Fig. 4) and that the two closest centers must lie in a (hyper-)ball with radius u j + δ j around x j , where δ j = d(c t * i1 , ν 2 (c t * j1 )) (medium circle in Fig. 4).Therefore, if we know the pairwise distances between the (updated) centers, we can easily restrict the recomputations to those centers that lie in the (hyper-)ball with radius r j = 2u j + δ j around c t * j1 (lightest circle in Fig. 4).The Exponion algorithm also relies on a scheme with which it is avoided having to sort, for each cluster center, the lists of the other centers by their distance.For this concentric annuli, one set centered at a each center, are created, with each annulus further out containing twice as many centers as the preceding one.Clearly this creates an onion-like structure, with an exponentially increasing number of centers in each layer (hence the name of the algorithm).
However, avoiding the sorting comes at a price, namely that more centers may have to be checked (although at most twice as many [11]) for finding the two closest centers and thus additional distance computations ensue.In my implementation I avoided this complication and simply relied on sorting the distances, since the gains achievable by concentric annuli over sorting are somewhat unclear (in [11] no comparisons of sorting versus concentric annuli are provided).Shallot Algorithm.The Shallot algorithm is the main contribution of this paper.It starts with the same considerations as the Exponion algorithm, but adds two improvements.In the first place, not only the closest center c j1 and the two bounds u j and j are maintained for each data point (as for Exponion), but also the second closest center c j2 .This comes at practically no cost (apart from having to store an additional integer per data point), because the second closest center has to be determined anyway in order to set the bound j .
If a recomputation is necessary, because the tests fail even for a tightened u j , it is not automatically assumed that c t * j1 is the best center z for a (hyper-)ball to search.As it is plausible that the formerly second closest center c t * j2 may now be closer to x j than c t * j1 , the center c t * j2 is processed first among the centers c t * ji ,

Experiments
In order to evaluate the performance of the different exact k-means algorithms I generated a large number of artificial data sets.Standard benchmark data sets proved to be too small to measure performance differences reliably and would also not have permitted drawing "performance maps" (see below).I fixed the number of data points in these data sets at n = 100 000.Anything smaller renders the time measurements too unreliable, anything larger requires an unpleasantly long time to run all benchmarks.Thus I varied only the dimensionality m of the data space, namely as m ∈ {2, 3, 4, 5, 6, 8, 10, 15, 20, 25, 30, 35, 40, 45, 50}, and the number k of clusters, from 20 to 300 in steps of 20.For each parameter combination I generated 10 data sets, with clusters that are (roughly, due to random deviations) equally populated with data points and that may vary in size by a factor of at most ten per dimension.All clusters were modeled as isotropic normal (or Gaussian) distributions.Each data set was then processed 10 times with different initializations.All optimization algorithms started from the same initializations, thus making the comparison as fair as possible.
The clustering program is written in C (however, there is also a Python version, see the link to the source code below).All implementations of the different algorithms are entirely my own and use the same code to read the data and to write the clustering results.This adds to the fairness of the comparison, as in this way any differences in execution time can only result from differences of the actual algorithms.The test systems was an Intel Core 2 Quad Q9650@3GHz with 8 GB of RAM running Ubuntu Linux 18.04 64bit.W.r.t.distance computations there is no question who is the winner: the Shallot algorithm wins all parameter combinations, some with a considerable margin.W.r.t.execution times, there is also a clear region towards more dimensions and more clusters, but for fewer clusters and fewer dimensions the diagram looks a bit speckled.This is a somewhat strange result, as a smaller number of distance computations should lead to lower execution times, because the effort spent on organizing the search, which is also carried out in exactly the same situations, is hardly different between the Shallot and the Exponion algorithm.
The reason for this speckled look could be that the benchmarks were carried out with heavy parallelization (in order to minimize the total time), which may have distorted the measurements.As a test of this hypothesis, Fig. 8 shows the standard deviation of the execution times relative to their mean.White means no variation, fully saturated blue indicates a standard deviation half as large as the mean value.The left diagram refers to the Shallot, the right diagram to the Exponion algorithm.Clearly, for a smaller number of dimensions and especially for a smaller number of clusters the execution times vary more (this may be, at least in part, due to the generally lower execution times for these parameter combinations).It is plausible to assume that this variability is the explanation for the speckled look of the diagrams in Fig. 6 and in Fig. 7 on the right.
Finally, Fig. 9 shows, again on the same grid, a comparison of the number of distance computations (left) and the execution times (right) of the Shallot algorithm and the YinYang algorithm (using the test based on cluster to cluster distances, although a pure YinYang algorithm performs very similarly).The relative performance is color-coded in the same way as in Fig. 7. Clearly, the smaller number of distance computations explains why the YinYang algorithm is superior for more clusters and more dimensions.
The reason is likely that grouping the centers leads to better bounds.This hypothesis is confirmed by the fact that the Elkan algorithm (k distance bounds) always needs the fewest distance computations (not shown as a grid) and loses on time only due to having to update so many distance bounds.

Conclusion
In this paper I introduced the Shallot algorithm, which adds two improvements to the Exponion algorithm [11], both of which can potentially shrink the (hyper-) ball that has to be searched for the two closest centers if recomputation becomes necessary.This leads to a measurable, sometimes even fairly large speedup compared to the Exponion algorithm due to fewer distance computations.However, for high-dimensional data and large numbers of clusters the YinYang algorithm [4] (with or without the cluster to cluster distance test) is superior to both algorithms.Yet, since clustering in high dimensions is problematic anyway due to the curse of dimensionality, it may be claimed reasonably confidently that the Shallot algorithm is the best choice for standard clustering tasks.
Software.My implementation of the described methods (C and Python), with which I conducted the experiments, can be obtained under the MIT License at http://www.borgelt.net/cluster.html.

Fig. 1 .
Fig. 1.Using the triangle inequality to update the distance bounds for a data point xj.

Fig. 3 .
Fig. 3. Annular algorithm[3]: If even after the upper bound uj for the distance from data point xj to its (updated) formerly closest center c t * j1 has been made tight, the lower bound j for distances to other centers is still lower, it is necessary to recompute the two closest centers.Exploiting information about the distance between c t * j1 and another center ν2(c t * j1 ) closest to it, these two centers are searched in a (hyper-)annulus around the origin (dot in the bottom left corner) with c t * j1 in the middle and thickness 2θj, where θj = 2uj + δj and δj = d(c t * i1 , ν2(c t * j1 )).(Color figure online)

Fig. 4 .
Fig. 4. Exponion algorithm[11]: If even after the upper bound uj for the distance from a data point xj to its (updated) formerly closest center c t * j1 has been made tight, the lower bound j for distance to other centers is still lower, it is necessary to recompute the two closest centers.Exploiting information about the distance between c t * j1 and another center ν2(c t * j1 ) closest to it, these two centers are searched in a (hyper-)sphere around center c t * j1 with radius rj = 2uj + δj where δj = d(c t * j1 , ν2(c t * j1 )).(Color figure online)

Fig. 6 .
Fig.6.Map of the algorithms that produced the best execution times over number of dimensions (horizontal) and number of clusters (vertical), showing fairly clear regions of algorithm superiority.Enjoyably, the Shallot algorithm that was developed in this paper yields the best results for the largest number of parameter combinations.

Fig. 7 .
Fig. 7. Relative comparison between the Shallot algorithm and the Exponion algorithm.The left diagram refers to the number of distance the right diagram to execution time.means that Shallot is better, red that is better.(Color figure online) Figure 6   shows on a grid spanned by the number of dimensions (horizontal axis) and the number of clusters inducted into the data set (vertical axis) which algorithm performed best (in terms of execution time) for each combination.Clearly, the Shallot algorithm wins most parameter combinations.Only for larger numbers of dimensions and larger numbers of clusters the YinYang algorithm is superior.In order to get deeper insights, Fig.7shows on the same grid a comparison of the number of distance computations (left) and the execution times (right) of the Shallot algorithm and the Exponion algorithm.The relative performance 8. Variation of the execution times over number of dimensions (horizontal) and number of clusters (vertical).The left diagram refers to the Shallot algorithm, the right diagram to the Exponion algorithm.The larger variation for fewer clusters and fewer dimensions may explain the speckled look of Figs. 6 and 7.

Fig. 9 .
Fig. 9. Relative comparison between the Shallot algorithm and the YinYang algorithm using the cluster to cluster distance test (pure YinYang is very similar, though).The left diagram refers to the number of distance computations, the right diagram to execution time.Blue means that Shallot is better, red that YinYang is (Color figure online) Results.A table with the complete experimental results I obtained can be retrieved as a simple text table at http://www.borgelt.net/docs/clsbench.txt.More maps comparing the performance of the algorithms can be found at http://www.borgelt.net/docs/clsbench.pdf.