Efficiency of random swap clustering
Abstract
Random swap algorithm aims at solving clustering by a sequence of prototype swaps, finetuning their exact locations by kmeans. This randomized search strategy is simple to implement and efficient. It reaches good-quality clustering relatively fast, and if iterated longer, it finds the correct clustering with high probability. In this paper, we analyze the expected number of iterations needed to find the correct clustering. Using this result, we derive the expected time complexity of the random swap algorithm. The main results are that the expected time complexity has (1) linear dependency on the number of data vectors, (2) quadratic dependency on the number of clusters, and (3) inverse dependency on the size of the neighborhood. Experiments also show that the algorithm is clearly more efficient than kmeans and almost never gets stuck in an inferior local minimum.
Keywords
Clustering, Random swap, Kmeans, Local search, Efficiency
Introduction
The aim of clustering is to group a set of N data vectors {x_{i}} in D-dimensional space into k clusters by optimizing a given objective function f. Each cluster is represented by its prototype, which is usually the centroid of the cluster. Kmeans performs the clustering by minimizing the distances of the vectors to their cluster prototypes. This objective function is called sum-of-squared errors (SSE), which corresponds to minimizing the within-cluster variances. The output of clustering is the set of cluster labels {p_{i}} and the set of prototypes {c_{i}}.
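As a concrete illustration, the SSE objective and the centroid prototype can be computed as follows. This is a minimal sketch in plain Python; the function names are ours, not from the paper:

```python
def sse(vectors, labels, prototypes):
    """Sum of squared Euclidean distances from each vector
    to the prototype of its assigned cluster."""
    total = 0.0
    for x, p in zip(vectors, labels):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, prototypes[p]))
    return total

def centroid(cluster_vectors):
    """Prototype of a cluster: the component-wise mean of its vectors."""
    n = len(cluster_vectors)
    d = len(cluster_vectors[0])
    return [sum(v[i] for v in cluster_vectors) / n for i in range(d)]
```

For example, with prototypes (1, 0) and (11, 0), each vector contributes the squared distance to its own cluster prototype, and the centroid of a cluster minimizes that cluster's contribution.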
Kmeans was originally defined for numerical data only. Since then, it has also been applied to other types of data. The key is to define the distance or similarity between the data vectors, and to be able to define the prototype (center). It is not trivial how to do it, but if properly solved, then kmeans can be applied. In case of categorical data, several alternatives were compared including kmedoids, kmodes, and kentropies [1].
Quality of clustering depends on several factors. The first step is to choose the attributes and the objective function according to the data. They have the biggest influence on the clustering result, and their choice is the most important challenge for practitioners. The next step is to deal with missing attributes and noisy data. If the number of missing attributes is small, we can simply exclude these data vectors from the process. Otherwise some data imputation technique should be used to predict the missing values; for some alternatives see [2].
Noise and outliers can also bias the clustering especially with the SSE objective function. Detection of outliers is typically considered as a separate preprocessing step. Another approach is to perform the clustering first, and then label points that did not fit into any cluster as outliers. Outlier removal can also be integrated in the clustering directly by modifying the objective function [3]. After the preprocessing steps, the main challenge is to optimize the clustering so that the objective function would be minimized. In this paper, we focus on this problem.
On the other hand, the correct allocation of the prototypes can be solved by a sequence of prototype swaps, leaving the finetuning of their exact locations to kmeans. In Fig. 2, only one swap is needed to fix the solution. An important observation is that it is not even necessary to swap one of the redundant prototypes; removing any prototype in their immediate neighborhood is enough, since kmeans can finetune the exact locations locally. Likewise, the exact location where the prototype is relocated is not important, as long as it is in the immediate neighborhood of where the prototype is needed.
Several swapbased clustering algorithms have been considered in literature. Deterministic swap selects the prototype to be swapped as the one that increases the objective function value f least [5, 6, 7], or by merging two existing clusters [8, 9] following the spirit of agglomerative clustering. The new location of the prototype can be chosen by considering all possible data vectors [7], splitting an existing cluster [7, 10], or by using some heuristic such as selecting the cluster with the largest variance [5]. The swapbased approach has also been used for solving pmedian problem [11].
The main drawback of these methods is their computational complexity. A much simpler but effective approach is the random swap strategy: select the prototype to be removed randomly, and relocate it to the position of a randomly selected data vector. This trial-and-error approach was first used in the tabu search algorithm presented in [12], and later simplified to a method called randomized local search (RLS) [13]. The main observation was that virtually the same clustering quality is reached regardless of the initial solution. The same conclusion was later confirmed in [14]. CLARANS is another variant of this technique, using medoids instead of centroids [15].
The main reason why the random swap approach is not more widely used is the lack of theoretical results on its properties. Our experiments, however, have shown that it works much better than kmeans in practice, and in most cases is also more efficient [16, 17, 18, 19, 20]. Its main limitation is that there is no clear rule for how long the algorithm should be iterated, and this parameter needs to be selected experimentally.
In this paper, we formulate the random swap (RS) as a probabilistic algorithm. We show that the expected time complexity to find the correct cluster allocation of the prototypes is polynomial. The processing time of the algorithm depends on how many iterations (trial swaps) are needed, and how much time each iteration takes. We will show that for a given probability of failure (q), the time complexity of the algorithm is upper bounded by a function that has linear O(N) dependency on the number of data vectors, quadratic O(k^{2}) dependency on the number of clusters, and inverse dependency on the size of neighborhood.
The main advantage of random swap clustering is that it is extremely simple to implement. If kmeans can be implemented for the data, random swap is only a small extension. Kmeans consists of two steps (partition step and centroid step), and the random swap method has only one additional step: the prototype swap. In most cases, this step is independent of the data and the objective function. It is also trivial to implement, which makes random swap highly useful for practical applications.
Besides the theoretical upper bound, we compare the efficiency experimentally against kmeans, repeated kmeans, kmeans++, xmeans, global kmeans, agglomerative clustering and a genetic algorithm. With our clustering benchmark data sets, we compare the results to the known ground truth and observe that random swap finds the correct cluster allocation every time. In case of image data, we use the genetic algorithm (GA) as the gold standard, as it is the most accurate algorithm known.
The rest of the paper is organized as follows. In “Random swap clustering” section, we recall kmeans and random swap algorithms and analyze their time complexities. In “Number of iterations” section, we study the number of iterations and derive the expected time complexity of the algorithm. The optimality of the algorithm is also discussed. In “Neighborhood size” section, we define the concept of neighborhood. Experimental studies are given in “Experiments” section to demonstrate the results in practice. Conclusions are drawn in “Conclusions” section.
Random swap clustering
These steps are iteratively performed for a fixed number of iterations, or until convergence.
Random swap algorithm
This generates a global change in the clustering structure. An alternative implementation of the swap would be to create the new cluster by first choosing an existing cluster randomly, and then by selecting a random data vector within this cluster. We use the first approach for simplicity but the second approach is useful for the analysis of the swap.
The new solution is then modified by two iterations of kmeans to adjust the partition borders locally. The overall process is a trial-and-error approach: a new solution is accepted only if it improves the objective function (1).
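The whole procedure can be sketched as follows. This is a simplified illustration of the idea in plain Python (brute-force nearest-prototype search, helper names of our own choosing), not the authors' optimized implementation with reduced search:

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(data, protos):
    # Partition step: nearest prototype for every vector.
    return [min(range(len(protos)), key=lambda j: dist2(x, protos[j])) for x in data]

def update(data, labels, k, old):
    # Centroid step: replace each prototype by its cluster centroid.
    protos = []
    for j in range(k):
        members = [x for x, p in zip(data, labels) if p == j]
        if members:
            d = len(members[0])
            protos.append(tuple(sum(v[i] for v in members) / len(members) for i in range(d)))
        else:
            protos.append(old[j])  # keep the prototype of an empty cluster
    return protos

def sse(data, labels, protos):
    return sum(dist2(x, protos[p]) for x, p in zip(data, labels))

def random_swap(data, k, iterations=200, kmeans_iters=2, seed=0):
    rng = random.Random(seed)
    protos = [tuple(x) for x in rng.sample(data, k)]
    labels = assign(data, protos)
    best = sse(data, labels, protos)
    for _ in range(iterations):
        cand = list(protos)
        cand[rng.randrange(k)] = tuple(rng.choice(data))  # random prototype swap
        cl = assign(data, cand)
        for _ in range(kmeans_iters):                     # local finetuning by kmeans
            cand = update(data, cl, k, cand)
            cl = assign(data, cand)
        f = sse(data, cl, cand)
        if f < best:                                      # accept only improvements
            protos, labels, best = cand, cl, f
    return protos, labels, best
```

On well-separated data, the trial swaps quickly relocate any redundant prototype, and the two kmeans iterations then finetune the partition borders locally.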
There are also a video lecture (youtube), presentation material (ppt), a flash animation (animator) and a web page (Clusterator) where anyone can upload data in text format and obtain a quick clustering result in just 5 s, or alternatively use a longer 5 min option for higher quality. The service is currently limited to numerical data, but we plan to extend it to other data types in the future.
Time complexity of a swap
1. Swap of the prototype.
2. Removal of the old cluster.
3. Creation of the new cluster.
4. Updating affected prototypes.
5. Kmeans iterations.
Step 1 consists of two random number generations and one copy operation, which take O(1) time. For simplicity, we assume here that the dimensionality is constant d = O(1). In case of very high dimensions, the complexities should be multiplied by d due to the distance calculations.
In step 2, a new partition is found for every vector in the removed cluster. The time complexity depends on the size of the removed cluster. In total, there are N data vectors divided into k clusters. Since the cluster is selected randomly, its expected size is N/k. Processing a vector requires k distance calculations and k comparisons. This multiplies to 2k·N/k = 2N. Note that the expected time complexity is independent of the size of the cluster.
In step 3, the distance of every data vector to the newly created prototype is calculated. There are N vectors to be processed, each requiring one distance calculation and one comparison. Thus, the time complexity of this step also sums up to 2N, independent of the size of the cluster.
In step 4, new prototypes are generated by calculating cumulative sums of the vectors in each cluster. To simplify the implementation, the cumulative sums are calculated already during steps 2 and 3. One addition and one subtraction are needed for each vector that changes its cluster. The sums of the affected clusters (the removed, the new and their neighbors) are then divided by the size of the cluster. There are N/k vectors both in the removed and in the new cluster, on average. Thus, the number of calculations sums up to 2N/k + 2N/k + 2α = O(N/k), where α denotes the neighborhood size (see “Neighborhood size” section).
The time complexity of the kmeans iterations is less trivial to analyze, but the rule of thumb is that only local changes appear due to the swap. In a straightforward implementation, O(Nk) time would be required for every kmeans iteration. However, we use the reduced search variant [21], where full search is needed only for the vectors in the affected clusters. For the rest of the vectors, it is enough to compute distances only to the prototypes of the affected clusters. We estimate that the number of affected clusters equals the size of the neighborhood of the removed and the added clusters, which is 2α. The expected number of vectors in those clusters is 2α·(N/k). The time complexity of one kmeans iteration is therefore 2α·(N/k)·k for the full searches, and (N − 2α·(N/k))·2α ≤ N·2α for the rest. These sum up roughly to 4αN = O(αN) for two kmeans iterations.
Time complexity estimations of the steps of the random swap algorithm, and the observed numbers as the averages over the first 500 iterations for data set Bridge (N = 4096, k = 256, N/k = 16, α ≈ 8)
| Step | Time complexity | Observed at iter. 50 | Iter. 100 | Iter. 500 |
| --- | --- | --- | --- | --- |
| Prototype swap | 2 | 2 | 2 | 2 |
| Cluster removal | 2N | 7526 | 8448 | 10,137 |
| Cluster addition | 2N | 8192 | 8192 | 8192 |
| Prototype update | 4N/k + 2α | 53 | 61 | 60 |
| Kmeans iterations | ≤ 4αN | 300,901 | 285,555 | 197,327 |
| Total | O(αN) | 316,674 | 302,258 | 215,718 |
Figure 5 shows the distribution of the processing time between steps 1–4 (swap + local repartition) and step 5 (kmeans). The proportion of processing time required by kmeans is somewhat higher in the early iterations. The reason is the same as above: more prototypes move in the early stage, but the movements soon reduce to the expected level. The time complexity function predicts that the kmeans step takes a 4αN/(2N + 2N + 4αN) = 89% proportion of the total processing time with Bridge. The observed number of steps gives 197,327/215,718 = 91% at 500 iterations, and the actual measured processing times of kmeans sum to 0.53 + 0.39 = 0.92, i.e. 92%. For BIRCH_{1}, the time complexity predicts an 85% proportion whereas the actual is 97%, but it drops to 94% around 500 iterations, after all the prototypes have found their correct locations.
Further speedup of kmeans could be obtained by using the activitybased approach jointly with kdtree [22], or by exploiting the activity information together with triangular inequality rule for eliminating candidates in the nearest neighbor search [23]. This kind of speedup works well for data where the vectors are concentrated along the diagonal but generalizes poorly when the data is distributed uniformly [21, 24]. It is also possible to eliminate those prototypes from the full search whose activity is smaller than a given threshold and provide further speedup at the cost of decreased quality [25, 26].
Number of iterations

- Trial.
- Accepted.
- Successful.
One trial swap is made in every iteration, but only a swap that improves the objective function is called an accepted swap. However, in the following analysis we are interested not in all accepted swaps but only in those that also reduce the CIvalue; in other words, swaps that correct one error in the global prototype allocation. Minor finetunings do not count in this analysis. We therefore define a swap as successful if it reduces the CIvalue.
Successful swaps

1. Select a proper prototype to be removed.
2. Select a proper location for the prototype.
3. Perform local finetuning successfully.
The first two are more important, but we will show that the finetuning can also play a role in finding a successful swap. We analyze next the expected number of iterations to fix one prototype location, and then generalize the result to the case of multiple swaps.
For a successful swap to happen, we must remove one prototype from an overpartitioned region and relocate it to an underpartitioned region, and the finetuning must move the prototype so that it fills in one real cluster (pigeonhole). All of this must happen during the same trial swap.
Assume that CI = 1. This means that one real cluster is missing a prototype, and another cluster is overcrowded by having two prototypes. We therefore have two favorable prototypes to be relocated. The probability of selecting one of these prototypes by random choice is 2/k, as there are k prototypes to choose from. To select the new location, we have N data vectors to choose from, and the desired location is within the real cluster lacking a prototype. Assume that all clusters are of the same size, and that the mass of the desirable cluster is twice that of the others (it covers two real clusters). With this assumption, the probability that a randomly selected vector belongs to the desired cluster is 2(N/k)/N = 2/k. The exact choice within the cluster is not important because kmeans will tune the prototype locations locally within the cluster.
At first sight, the probability of a successful swap appears to be 2/k·2/k = O(1/k^{2}). However, the finetuning capability of kmeans is not limited to within the cluster; it can also move prototypes across neighboring clusters. We define that two clusters are kmeans neighbors if kmeans can move prototypes from one cluster to the other. For this to happen, the clusters must both be spatial neighbors and be in the vicinity of each other. This concept of neighborhood is discussed in more detail in “Neighborhood size” section.
Note that it is also possible that a swap solves one allocation problem but creates another one elsewhere. However, this is not considered a successful swap because it does not change the CIvalue. It is even possible that the CIvalue occasionally increases even when the objective function decreases, but this is also not important for the analysis. In fact, by accepting any swap that decreases the objective function, we guarantee that the algorithm will eventually converge, even if the algorithm itself does not know when this happens. We will study this in more detail in “Experiments” section.
Probability of successful swap
To keep the analysis simple, we introduce a datadependent variable α to represent the size of the kmeans neighborhood (including the vector itself). Although it is difficult to calculate exactly, it provides a useful abstraction that helps to analyze the behavior of the algorithm. Mainly α depends on the dimensionality (d) and structure of the data, but also on the size (N) and number of clusters (k). The worst case is when all clusters are isolated (α = 1). An obvious upper limit is α ≤ k.
In total, there are O(α) clusters to choose from, but both the removal and the addition must be made within the neighborhood. This probability becomes lower when the number of clusters (k) increases, but higher when the dimensionality (d) increases. The exact dependency on dimensionality is not trivial to analyze. Results from the literature imply that the number of spatial neighbors increases exponentially with the dimensionality: α = O(2^{ d }) [27]. However, the data is expected to be clustered and to have some structure; it usually has a lower intrinsic dimensionality than its nominal dimensionality.
Analysis for the number of iterations
Bounds for the number of iterations
We derive tight bounds for the required number of iterations for (8) as follows.
Proof for upper limit
Proof for lower limit
Since the same function is both the upper (11) and lower bound (12), the theorem is proven.□
Multiple swaps
The above analysis covers only the case where one prototype is incorrectly located. In case of the S_{1}–S_{4} datasets, 1 swap is needed in 60% of the cases of a random initialization, and 2 swaps in 38% of the cases. Only very rarely (< 2%) are three or more swaps required.
The only difference to the case of a single swap is the logarithmic term (log_{2} w), which depends only on the number of swaps needed. Even this is too pessimistic an estimate, since the probability of the first successful swap is up to w^{2} times higher than that of the last swap. This is because there are potentially w times more choices for both the successful removal and the addition. Experimental observations show that 2.7 swaps are required with S_{1}–S_{4}, on average, and the number of iterations is multiplied roughly by a factor of 1.34 compared to the case of a single swap. However, the main problem of using Eq. (13) in practice is that the number of swaps (w) is unknown.
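As a sanity check of the arithmetic, the expected iteration count for the last successful swap is roughly (k/α)^{2}, and the log_{2} w term multiplies it to cover all w swaps. A small sketch, with the S_{1} values (k = 15, α = 4.1, w = 2.8) taken from the experiments reported later:

```python
import math

def expected_iterations(k, alpha, w):
    """Expected trial swaps: (k/alpha)^2 for the last successful swap,
    multiplied by log2(w) to cover all w swaps."""
    last = (k / alpha) ** 2
    return last, last * math.log2(w)
```

For S_{1} this gives roughly 13 iterations for the last swap and 20 in total, matching the expected values reported in Table 4.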
Overall time complexity

- Linear dependency on N.
- Quadratic dependency on k.
- Logarithmic dependency on w.
- Inverse dependency on α.
The main result is that the time complexity has only linear dependency on the size of the data, but quadratic dependency on the number of clusters (k). Although k is relatively small in typical clustering problems, the algorithm can become slow in case of a large number of clusters.
The size of the neighborhood affects the algorithm in two ways. On one hand, it increases the time required by the kmeans iterations. On the other hand, it also increases the probability of finding successful swaps and thereby reduces the total number of iterations. Since the latter effect dominates the time complexity, the final conclusion is somewhat surprising: the larger the neighborhood (α), the faster the algorithm.
Furthermore, since α increases with the dimensionality, the algorithm has inverse dependency on dimensionality: higher dimensionality implies a faster algorithm. In this sense, data with low intrinsic dimensionality represents the worst case. For example, BIRCH_{2} (see “Experiments” section) has a one-dimensional structure (D = 1) and a small neighborhood size (α = 3).
Optimality of the random swap
So far we have assumed that the algorithm finds the correct cluster allocation every time (CI = 0). This can be argued by the pigeonhole principle: there are k pigeons and k pigeonholes. In a correct solution, exactly one pigeon (prototype) occupies one pigeonhole (cluster). The solution can be found by a sequence of swaps: by swapping pigeons from overcrowded holes to the empty ones. It is just a matter of how many trial swaps are needed to make this happen. We do not have a formal proof that it always happens, but it is unlikely that the algorithm would get stuck in a suboptimal solution with a wrong cluster allocation (CI > 0). With all our data having known ground truth and a unique global minimum, random swap always found the correct allocation in our experiments.
A more relevant question is what happens with real-world data that does not necessarily have a clear clustering structure. Can the algorithm also find the global optimum minimizing SSE? According to our experiments, this is not the case. Higher-dimensional image data can have, not exactly multiple optima, but multiple plateaus with virtually the same SSEvalues. In such cases, random swap ends up in one of the alternative plateaus, all having near-optimal SSEvalues. Indirect evidence in [4] showed that two highly optimized solutions with the same cluster allocation (CI = 0) also had virtually the same objective function value (difference in SSE only 0.1%) despite significantly different locations of the prototypes (5%).
To sum up: a proof of optimality (or nonoptimality) of the cluster-level allocation (CIvalue) remains an open problem. For minimizing SSE, the algorithm finds either the optimal solution (if unique) or one of the alternative near-optimal solutions, all having virtually equal quality.
Neighborhood size
In the previous analysis, we introduced the concept of kmeans neighborhood. The size of this neighborhood (including the vector itself) is denoted by α. It is defined as the average over the entire data set. In practice, it is not possible to calculate or even estimate α accurately without actually performing kmeans. The value is bounded to the range α∈[1,k], and the worst case is α = 1 when all clusters are isolated from each other. We next discuss how the size of this neighborhood could be analyzed in the context of multidimensional data.
Voronoi neighbors
For 2D data, an upper limit for the number of Voronoi surfaces has been shown to be 3k − 6 [30]. Since every Voronoi surface separates two neighbor partitions, we can derive an upper limit for the average number of neighbors as 2·(3k − 6)/k = 6 − 12/k, which approaches 6 when k becomes large. In our 2D data sets (S_{1}–S_{4}), there are 4 Voronoi neighbors on average, varying from 1 to 10. For D-dimensional data, the number of Voronoi surfaces is upper bounded by O(k^{⌈D/2⌉}) [30]. Respectively, an upper bound for the number of Voronoi neighbors is O(2·k^{⌈D/2⌉}/k) = O(2·k^{⌈D/2⌉−1}).
- Theory for 2-dim: α ≤ 6 − 12/k
- Theory for D-dim: α ≤ O(2·k^{⌈D/2⌉−1})
- Data limit: α ≤ k
| Data | Dim | k | Theory | Reality |
| --- | --- | --- | --- | --- |
| S_{2} | 2 | 15 | 5.2 | 4.5 |
| Bridge | 16 | 256 | 10^{17} | 5.4 |
| DIM32 | 32 | 16 | 10^{19} | 1.1 |
| KDD04Bio | 74 | 2000 | 10^{118} | 33.3 |
To sum up, the number of Voronoi neighbors is significantly higher than the size of the kmeans neighborhood. A better estimator for the size of neighborhood is therefore needed.
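The theoretical bounds used in the comparison above can be reproduced directly; a sketch under the stated assumptions (6 − 12/k in 2-D, the ⌈D/2⌉-exponent bound otherwise):

```python
import math

def voronoi_neighbor_bound(D, k):
    """Upper bound on the average number of Voronoi neighbors:
    6 - 12/k in 2-D, and 2 * k^(ceil(D/2) - 1) in D dimensions."""
    if D == 2:
        return 6 - 12 / k
    return 2 * k ** (math.ceil(D / 2) - 1)
```

For S_{2} (D = 2, k = 15) this gives 5.2, and for Bridge (D = 16, k = 256) about 1.4·10^{17}, matching the orders of magnitude listed in the table.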
Estimation algorithm
For the sake of analysis, we introduce an algorithm to estimate the average size of the kmeans neighborhood. We use it merely to approximate the expected number of iterations (14) for a given data set in the experiments in “Experiments” section. The idea is first to find the Voronoi neighbors and then analyze whether they are also kmeans neighbors. However, the Voronoi diagram can be computed fast, in O(k·log k) time, only in case of 2-dimensional data; in higher dimensions it takes O(k^{⌈D/2⌉}) time [31]. We therefore apply the following procedure.
1. Calculate the half point of the prototypes: hp ← (c_{a} + c_{b})/2.
2. Find the nearest prototype (nc) for hp.
3. If nc = c_{a} or nc = c_{b} then (a, b) are neighbors.
Every pair of prototypes that passes this test is detected as spatial neighbors. We then calculate all vector distances across the two clusters. If any distance is smaller than the distance of the corresponding vector to its own cluster prototype, it is evidence that kmeans has the potential to operate between the clusters. Accordingly, we define the clusters as kmeans neighbors. Although this does not give any guarantee, it is a reasonable indicator for our purpose.
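The half-point test of steps 1–3 can be sketched as follows (brute-force nearest-prototype search; a minimal illustration, not the authors' implementation):

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def are_spatial_neighbors(protos, a, b):
    """Half-point test: clusters a and b are spatial neighbors if the
    midpoint of their prototypes is nearest to one of them."""
    hp = tuple((x + y) / 2 for x, y in zip(protos[a], protos[b]))
    nc = min(range(len(protos)), key=lambda j: dist2(hp, protos[j]))
    return nc in (a, b)
```

For example, with prototypes at (0, 0), (2, 0) and (10, 0), the first two pass the test, while (0, 0) and (10, 0) fail because their midpoint (5, 0) is closer to (2, 0).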
Experiments
Summary of the Data Sets
| Data set | Ref. | Type of data | Vectors (N) | Clusters (k) | Vectors per cluster | Dimension (d) |
| --- | --- | --- | --- | --- | --- | --- |
| Bridge | [35] | Grayscale image | 4096 | 256 | 16 | 16 |
| House^{a} | [35] | RGB image | 34,112 | 256 | 133 | 3 |
| Miss America | [35] | Residual vectors | 6480 | 256 | 25 | 16 |
| Europe | – | Diff. coordinates | 169,673 | 256 | 663 | 2 |
| BIRCH_{1}–BIRCH_{3} | [33] | Artificial | 100,000 | 100 | 1000 | 2 |
| S_{1}–S_{4} | [6] | Artificial | 5000 | 15 | 333 | 2 |
| Unbalance | [42] | Artificial | 6500 | 8 | 821 | 2 |
| Dim16–Dim1024 | [24] | Artificial | 1024 | 16 | 64 | 16–1024 |
| KDD04Bio | [34] | DNA sequences | 145,751 | 2000 | 73 | 74 |
Ground truth clustering results exist for all the generated data. For the rest, we iterate the random swap algorithm for 1 million iterations and use the final result as the reference solution (gold standard) when applicable. To measure the goodness of a result, we calculate the centroid index (CI) against the ground truth (or reference) solution. The value CI = 0 indicates that the cluster-level structure is correct.
All tests have been performed on a Dell PowerEdge R920 Server with four Xeon E74860 v2 processors having 1 TB memory and using RedHat Enterprise Linux 7 operating system.
Estimating α and T
Neighborhood size (α) estimated from the full data and from the clustering result after T iterations, together with the estimated numbers of iterations T for three failure probabilities q

| Dataset | Full data | Initial (T = 0) | Early (T = 5) | Final (T = 5000) | T (q = 10%) | T (q = 1%) | T (q = 0.1%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bridge | 69.8 | 8.7 | 5.4 | 4.6 | 33,595 | 67,910 | 100,785 |
| House | 15.4 | 6.7 | 8.3 | 8.2 | 13,381 | 26,761 | 40,142 |
| Miss America | 346 | 34.2 | 17.1 | 11.9 | 3593 | 7078 | 10,617 |
| Europe | (5.0) | 4.8 | 6.3 | 6.3 | 26,699 | 53,398 | 80,098 |
| BIRCH_{1} | 5.0 | 4.5 | 5.8 | 5.6 | 2908 | 5815 | 8723 |
| BIRCH_{2} | (4.7) | 3.1 | 3.1 | 2.9 | 10,524 | 21,048 | 31,572 |
| BIRCH_{3} | (4.9) | 4.1 | 4.9 | 5.0 | 4508 | 9016 | 13,523 |
| S_{1} | 4.8 | 3.7 | 4.1 | 4.2 | 46 | 92 | 137 |
| S_{2} | 4.9 | 3.7 | 4.5 | 4.7 | 37 | 73 | 110 |
| S_{3} | 4.9 | 3.9 | 4.4 | 4.3 | 38 | 77 | 115 |
| S_{4} | 4.9 | 3.9 | 4.8 | 5.0 | 32 | 64 | 97 |
| Unbalance | 3.4 | 2.3 | 2.3 | 2.0 | 56 | 111 | 167 |
| Dim32 | 26.8 | 1.5 | 1.1 | 1.0 | 920 | 1839 | 2759 |
| Dim64 | 37.1 | 1.9 | 1.1 | 1.0 | 920 | 1839 | 2759 |
| Dim128 | 47.3 | 1.4 | 1.0 | 1.0 | 1135 | 2271 | 3406 |
| KDD04Bio | – | 286.2 | 33.3 | 30.4 | 72,800 | 145,600 | 218,401 |
We estimate the number of iterations (trial swaps) required to find the clustering for three given probabilities of failure (10, 1, 0.1%). The estimated number of iterations is small for data with few clusters (S_{1}–S_{4}, Unbalance). Even the highest confidence level (q = 0.1%) requires only 97–137 iterations for S_{1}–S_{4}, and 167 for Unbalance. The clusters in the Dim datasets are isolated, which makes the size of the neighborhood very small (1.1); these datasets therefore have larger estimates (about 1700). For the image datasets the estimates are significantly higher because of the large number of clusters.
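The iteration estimates T in the table follow from bounding the failure probability: with success probability p ≈ (α/k)^{2} per trial swap, roughly T = ln(1/q)·(k/α)^{2}·log_{2} w iterations keep the overall failure probability below q. A sketch of the arithmetic, with the S_{1} values (k = 15, α = 4.1, and w ≈ 2.8 swaps) taken from the experiments:

```python
import math

def iterations_for_confidence(k, alpha, w, q):
    """Trial swaps needed so that the probability of not finding
    the correct clustering stays below q."""
    return math.log(1 / q) * (k / alpha) ** 2 * math.log2(w)
```

For S_{1} this gives about 46 iterations at q = 10% and about 137 at q = 0.1%, matching the values in the table.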
Expected number of iterations (Exp) for the last successful swap, estimated as 1/p = (k/α)^{2}, which is multiplied by log_{2} w to obtain the value for all swaps
| Dataset | α | w | Exp (last) | Real^{2} (last) | Real^{10} (last) | Exp (all) | Real^{2} (all) | Real^{10} (all) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bridge | 5.4 | 90 | 2247 | – | – | 14,590 | – | – |
| House | 8.3 | 69 | 951 | – | – | 5811 | – | – |
| Miss America | 17.1 | 116 | 224 | – | – | 1537 | – | – |
| Europe | 6.3 | 130 | 1651 | – | – | 11,595 | – | – |
| BIRCH_{1} | 5.8 | 19 | 297 | 440 | 121 | 1263 | 637 | 197 |
| BIRCH_{2} | 3.1 | 21 | 1041 | 761 | 548 | 4571 | 1246 | 924 |
| BIRCH_{3} | 4.9 | 26 | 416 | – | – | 1958 | – | – |
| S_{1} | 4.1 | 2.8 | 13 | 26 | 18 | 20 | 33 | 23 |
| S_{2} | 4.5 | 2.7 | 11 | 19 | 9 | 16 | 25 | 12 |
| S_{3} | 4.4 | 2.7 | 12 | 17 | 7 | 17 | 22 | 10 |
| S_{4} | 4.8 | 2.7 | 10 | 19 | 9 | 14 | 25 | 11 |
| Unbalance | 2.3 | 4.0 | 12 | 58 | 54 | 24 | 122 | 110 |
| Dim32 | 1.1 | 3.7 | 212 | 52 | 52 | 399 | 73 | 76 |
| Dim64 | 1.1 | 3.7 | 212 | 59 | 64 | 399 | 83 | 88 |
| Dim128 | 1.0 | 3.8 | 256 | 56 | 65 | 493 | 91 | 98 |
| KDD04Bio | 33.3 | 435 | 3607 | – | – | 31,617 | – | – |
The worst case behavior occurs when α = 1, meaning that all clusters are isolated. If we left α out of the equations completely, this would lead to estimates of T ≈ 700 for S_{1}–S_{4} even with q = 10%. With this number of iterations we always found the correct clustering; the maximum number of iterations ever observed was 322 (S_{1}) and 248 (S_{4}), at the end of the tail of the distribution. To sum up, the upper bounds hold very well in practice even if α is ignored.
Expected number of iterations
Next we study how well the Eq. (16) can predict the expected number of iterations. We use the same test setup and run the algorithm as long as needed to reach the correct result (CI = 0). Usually the last successful swap is the most time consuming, so we study separately how many iterations are needed from CI = 1 to 0 (last swap), and how many in total (all swaps). The estimated and the observed numbers are reported in Table 4. The number of successful swaps needed (w) is experimentally obtained from the data sets.
a. Estimated iterations
The expected numbers are again somewhat optimistic compared to reality, but still well within the order of magnitude of the time complexity results. For the S sets, the algorithm is iterated about 50% longer (26, 29, 35) than estimated (18, 20, 21, 24). For the BIRCH data, the estimates for the last swap are slightly too pessimistic, overestimating the iterations by about 30%. The difference becomes bigger in case of all swaps. Unbalance has the biggest difference, almost 5 times, so let us focus on it a little more.
The creation of a new cluster assumes that all clusters are roughly of the same size. However, this assumption does not hold for the Unbalance data, which has three big clusters of size 2000 and five small ones of size 100. By random sampling, it is much more likely to allocate a prototype in a bigger cluster. On average, random initialization allocates 7 prototypes within the three big clusters, and only one within the five small clusters. Starting from this initial solution, a successful swap must select a vector from one of the small clusters, because the big clusters are too far away and are not kmeans neighbors with the small ones. The probability for this is 500/6500 = 7.7% in the beginning (when w = 4 swaps are needed), and 200/6500 = 3% for the last swap. The estimate is 1/k = 1/8 = 12.5%.
Despite this inaccuracy, the balanced cluster assumption itself is usually fair because the time complexity result is still within the right order of magnitude. We could make an even more relaxed assumption by considering cluster sizes following the arithmetic series cN, 2cN, …, kcN, where c is a constant in the range [0,1]. The analysis in [36] shows that the time complexity result holds both with the balance assumption and with the arithmetic case. The extreme case still exists though: a cluster size distribution of (1, 1, 1, …, N − k) would lead to a higher time complexity of O(Nk^{3}) instead of O(Nk^{2}). However, the balance assumption is still fair because such tiny clusters are usually outliers.
b. K-means iterations
Another source of inaccuracy is that we apply only two k-means iterations. It is possible that k-means sometimes fails to make the necessary fine-tuning if the number of iterations is fixed too low. This can cause an overestimation of α. However, since the algorithm is iterated for a long time, many more trial swaps can be tested within the same time. A few additional failures during the process are also compensated by the fact that k-means tuning also happens when a swap is accepted, even if the theory does not count it as a successful swap.
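To make the role of the two k-means iterations concrete, the following is a minimal sketch of the random swap procedure (our own toy code, not the author's reference implementation; function names and the test data are ours):

```python
import numpy as np

def assign(X, C):
    """Map each vector to its nearest prototype."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)

def sse(X, C, labels):
    """Sum-of-squared errors of the current partition."""
    return float(((X - C[labels]) ** 2).sum())

def tune(X, C, iterations=2):
    """Fine-tune prototypes with a fixed, small number of k-means iterations."""
    for _ in range(iterations):
        labels = assign(X, C)
        for j in range(len(C)):
            members = X[labels == j]
            if len(members) > 0:
                C[j] = members.mean(axis=0)
    return C

def random_swap(X, k, T=200, kmeans_iterations=2, seed=0):
    """T trial swaps, each fine-tuned by k-means; accept only improvements."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    best = sse(X, C, assign(X, C))
    for _ in range(T):
        trial = C.copy()
        # Swap: replace a randomly chosen prototype by a random data vector.
        trial[rng.integers(k)] = X[rng.integers(len(X))]
        trial = tune(X, trial, kmeans_iterations)
        cost = sse(X, trial, assign(X, trial))
        if cost < best:
            C, best = trial, cost
    return C, best
```

Even with only two k-means iterations per trial, the accept-if-improved loop steadily removes bad prototype allocations on data with clear clusters.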
We also tested the algorithm using 10 k-means iterations to see whether this affects the estimates of T; see column Real^{10} in Table 4. The estimates are closer to the reality for the S sets, whose cluster overlap allows k-means to operate better between the clusters. However, the estimates are not significantly more accurate, and for datasets like Dim and Unbalance, where the clusters are well separated, the difference is negligible. The number of k-means iterations is therefore not considered important.
c. Number of swaps
Total number of iterations required to reach a solution with certain number of swaps remaining (0–4)
Dataset  Iterations to reach CI-value  Factor  log w
         Total  0  1  2  3  4
BIRCH_{1}  637  440  78  44  24  14  1.45  4.2
BIRCH_{2}  1246  761  191  84  51  33  1.64  4.4
S_{1}  33  26  7  2  1  1  1.34  1.5
S_{2}  25  19  4  1  1  1  1.30  1.4
S_{3}  22  17  4  1  1  1  1.34  1.4
S_{4}  25  19  3  1  1  1  1.27  1.4
Unbalance  122  58  29  23  13  1  2.09  2.0
Dim32  73  52  13  6  3  2  1.42  1.9
Dim64  83  59  13  7  4  2  1.41  1.9
Dim128  91  56  20  9  4  3  1.61  1.9
The overall trend is that the log w term is a slightly too high estimate compared to the reality. In the case of the S sets, the log w value indicates a factor of 1.5 of total work, whereas the reality is between 1.27 and 1.34. For example, S_{1} requires 33 iterations in total, of which 26 are spent on the last swap. In the BIRCH datasets, the difference is much more visible. About 20 swaps are needed, which would indicate extra work by a factor of about 4; in reality, only 50% more is required. The corresponding numbers for Unbalance (2.09 vs. 2.0) and the Dim sets (1.41–1.61 vs. 1.9) show only mild differences.
Overall, we conclude that for datasets with a clear clustering structure, the last swap is the bottleneck, and the additional work for all the previous swaps at most doubles the total workload. Knowing the exact number of swaps needed is therefore not critical for obtaining a good estimate of the required number of iterations.
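As a hypothetical reconstruction of the Factor column (our reading; the table does not state the formula explicitly), for the BIRCH sets it matches the ratio of the total iteration count to the iterations spent on the last swap alone:

```python
# (total iterations, iterations spent on the last swap), from the table above.
birch = {"BIRCH1": (637, 440), "BIRCH2": (1246, 761)}

for name, (total, last) in birch.items():
    factor = total / last
    print(f"{name}: {factor:.2f}")  # 1.45 and 1.64, matching the Factor column
```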
d. Effect of N and k
Time-distortion efficiency in practice
We next compare the time-distortion performance of the random swap (RS) algorithm to repeated k-means (RKM) in practice. It is well known that k-means often gets stuck in an inferior local minimum, and because of this, most practitioners repeat the algorithm. The number of repeats is a parameter similar to the number of iterations in RS. We consider 10–100 repeats, but this could easily be extended much higher. Open questions are how many repeats one should use, and whether RKM will eventually find the correct global allocation.
The results show that RS is significantly more efficient than RKM. It usually achieves the same quality as fast as k-means, and outperforms it when iterated longer. In the case of artificial data with clear clusters (Birch_{1}, Birch_{2}, Unbalance), RS finds the correct clustering (CI = 0) in all cases. For the same data, k-means results in (CI = 7, CI = 18, CI = 3) and repeated k-means in (CI = 3, CI = 9, CI = 1). RS achieves the correct clustering in less than 1 min using (606, 2067, 98) trial swaps. Repeated k-means is successful only with the Unbalance dataset, where it finds the correct result in 60% of the trials when repeated 20,000 times. On average, RKM took 17,538 repeats to reach CI = 0.
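For reference, repeated k-means simply keeps the best of several independent k-means runs; a minimal sketch (our own toy code, not the benchmarked implementation):

```python
import numpy as np

def kmeans(X, k, iterations=20, rng=None):
    """Plain Lloyd's k-means from a uniform random initialization."""
    rng = rng if rng is not None else np.random.default_rng()
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iterations):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return C, float(((X - C[labels]) ** 2).sum())

def repeated_kmeans(X, k, repeats=10, seed=0):
    """Run k-means `repeats` times and keep the solution with the lowest SSE."""
    rng = np.random.default_rng(seed)
    return min((kmeans(X, k, rng=rng) for _ in range(repeats)),
               key=lambda result: result[1])
```

Unlike random swap, each repeat starts from scratch, so no work is reused between restarts; this is what makes RKM comparatively wasteful on hard instances.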
Multidimensional image datasets do not have a clear clustering structure. The clustering can therefore end up with significantly different cluster allocations (10% of the clusters differing) despite having virtually the same SSE value (see “Optimality of the random swap” section). It is therefore less intuitive what the number of missing swaps means in practice, but the superiority of RS over repeated k-means is still clear. The toughest dataset is Europe, for which RS outperforms RKM only after running for several minutes (after about 2000 swaps).
The expected number of iterations (dashed vertical line) gives a reasonable estimate of when the algorithm has succeeded in the case of the S and BIRCH datasets. For the image datasets, it is just like a line drawn into water, without any specific meaning. Since these datasets lack a clear clustering structure, the algorithm keeps on searching, and also keeps on finding, better allocations of the prototypes. It does not seem to reach any local or global minimum. Additional tests revealed that it seems to stabilize somewhere before 10 M iterations, which corresponds to roughly 3 weeks of processing, way beyond any practical significance.
Finally, we tested the optimality of random swap using the datasets with known ground truth (S, Birch, Dim, Unbalance). We let the algorithm run until it reached the correct clustering (CI = 0), and then restarted it from scratch. We repeated this 10,000 times for the Birch sets and 1,000,000 times for the S, Dim and Unbalance sets. The algorithm found the CI = 0 result every time and never got stuck in a suboptimal solution.
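The CI-value used throughout is the centroid index of [4]; a sketch of its one-directional variant (our own simplification — the full measure takes the maximum over both mapping directions): map every prototype to its nearest ground-truth centroid and count the ground-truth centroids that receive no prototype.

```python
import numpy as np

def centroid_index(C, G):
    """One-directional centroid index CI(C -> G): the number of ground-truth
    centroids in G that no prototype in C maps to (the "orphan" clusters).
    CI = 0 means every real cluster got exactly one prototype."""
    nearest = ((C[:, None, :] - G[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    orphans = set(range(len(G))) - set(nearest.tolist())
    return len(orphans)
```

For example, a solution with two prototypes in one real cluster and none in another yields CI = 1, regardless of how small the SSE difference is.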
Comparison to other algorithms
Summary of the processing times and clustering quality
Dataset  KM  RKM  KM++  XM  AC  RS_{5000}  RS_{x}  GKM  GA 

Processing time (s)  
BIRCH_{1}  3.7  374  3.2  107  141  276  420  –  297
BIRCH_{2}  1.1  114  1.0  12  144  92  537  –  256
S_{1}  < 1  1.1  < 1  < 1  < 1  5  < 1  74  3
S_{2}  < 1  1.4  < 1  < 1  < 1  6  < 1  95  2
S_{3}  < 1  1.8  < 1  < 1  < 1  6  < 1  109  3
S_{4}  < 1  2.5  < 1  < 1  < 1  7  < 1  117  3
Unbalance  < 1  2.6  < 1  < 1  < 1  12  < 1  152  2
Dim32  < 1  < 1  < 1  < 1  < 1  3  1.5  6  1
Dim64  < 1  < 1  < 1  < 1  < 1  5  3  11  2
Dim128  < 1  < 1  < 1  < 1  < 1  8  5  19  3
Centroid index (CI)  
BIRCH_{1}  6.6  2.9  4.0  1.6  0  0  0  –  0
BIRCH_{2}  16.9  10.5  7.6  1.7  0  0  0  –  0
S_{1}  1.8  0.0  1.1  0.3  0  0  0  0  0
S_{2}  1.5  0.0  1.0  0.2  0  0  0  0  0
S_{3}  1.1  0.0  0.9  0.3  0  0  0  0  0
S_{4}  0.8  0.0  0.9  0.4  1  0  0  0  0
Unbalance  3.9  2.0  0.5  1.7  0  0  0  0  0
Dim32  3.8  1.1  0.5  2.7  0  0  0  0  0
Dim64  3.7  1.1  0.0  4.0  0  0  0  0  0
Dim128  4.0  1.4  0.0  4.2  0  0  0  0  0
The results show that all k-means variants (KM, RKM, KM++, XM) fail to find the correct result in more than 50% of the cases. K-means++ and x-means work better than k-means but not significantly better than RKM. The better algorithms (AC, RS, GKM, GA) are successful in all cases, with the exception of AC, which makes one error with S_{4}. RS is the simplest to implement. AC is also relatively simple but requires the fast variant from [35], since a straightforward schoolbook or Matlab implementation would be an order of magnitude slower. GA [39] is composed of the same AC and k-means components.
The downside of the better algorithms is their slower running time. Three of them (AC, RS, GA) work in reasonable time for datasets of size N = 5000, but already require several minutes for the Birch sets of size N = 100,000. GKM is slow in all cases.
Conclusions

We analyzed the expected time complexity of the random swap algorithm. The main results are the following dependencies:
Linear O(N) dependency on the size of data.

Quadratic O(k^{2}) dependency on the number of clusters.

Inverse O(1/α) dependency on the neighborhood size.

Logarithmic O(log w) dependency on the number of successful swaps needed.
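If these four dependencies are simply multiplied together (our reading of the bullet points; the exact constants are derived earlier in the paper), the expected processing time grows as O(N·k²·log w / α). A toy growth-rate check, with constants deliberately omitted:

```python
from math import log2

def expected_growth(N, k, alpha, w):
    """Relative growth of the expected time, assuming the four dependencies
    above combine multiplicatively (an assumption for illustration only)."""
    return N * k ** 2 * max(log2(w), 1.0) / alpha

base = expected_growth(100_000, 100, 0.5, 4)
# Doubling N doubles the time; doubling k quadruples it; halving the
# neighborhood size alpha doubles it; squaring w doubles the log term.
```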
The main limitation is that no practical stopping criterion can be reliably derived from the theory. Previously, a fixed number of iterations has been used, such as T = 5000, or another rule of thumb such as T = N. Here we used w = 1 and estimated α from the preliminary clustering after T = 5 iterations. This works reasonably well in practice, but a better estimate would still be desirable. Nevertheless, if the quality of the clustering is the main concern, one can simply iterate the algorithm for as long as time permits.
The worst case of the algorithm is when the clusters are isolated (α = 1), which leads to O(Nk^{2}) time for one swap. The number of successful swaps needed (w) is unknown, but it adds at most a logarithmic cost. Empirical results indicate that the last swap is the most time-consuming. The theoretical upper limit would be O(Nk^{2} log k), assuming that w = k swaps were needed. Due to the quadratic dependency on k, a faster algorithm such as deterministic swap [20] or divisive clustering [10] might be more useful when the number of clusters is very high.
An implementation of random swap is publicly available in C, Matlab, Java, Javascript, R and Python. We also provide a web page (http://cs.uef.fi/sipu/clusterator/) where anyone can upload data in text format and obtain a quick clustering result in 5 s, or alternatively use a longer 5-min option. As future work, we plan to extend the Clusterator to solve the number of clusters, and to support non-numerical data. Porting random swap to other machine-learning platforms such as Spark MLlib will also be considered, including a parallel variant to better support big data.
Notes
Authors’ contributions
The author is solely responsible for the entire work. The author read and approved the final manuscript.
Authors’ information
Pasi Fränti received the M.Sc. and Ph.D. degrees from the University of Turku, 1991 and 1994 in Computer Science. Since 2000, he has been a professor of Computer Science at the University of Eastern Finland. He has published 72 journal and 159 conference papers, including 14 IEEE transaction papers. His current research interests are in clustering and locationaware applications.
Acknowledgements
The author thanks Dr. Olli Virmajoki for his efforts on the earlier drafts: let’s shed more sweat and tears in future!
Competing interests
The author declares that there are no competing interests.
Data
All data is publicly available here: http://cs.uef.fi/sipu/datasets/.
Duplication
The content of the manuscript has not been published, or submitted for publication elsewhere.
Ethics approval and consent to participate
Not required.
Funding
No funding to report.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Hautamäki V, Pöllänen A, Kinnunen T, Lee KA, Li H, Fränti P. A comparison of categorical attribute data clustering. In: Joint international workshop on structural, syntactic, and statistical pattern recognition (S+SSPR 2014), LNCS 8621, Joensuu, Finland; 2014. p. 55–64.
2. Tuikkala J, Elo LL, Nevalainen OS, Aittokallio T. Missing value imputation improves clustering and interpretation of gene expression microarray data. BMC Bioinformatics. 2008;9(1):202.
3. Ott L, Pang L, Ramos F, Chawla S. On integrated clustering and outlier detection. In: Advances in neural information processing systems (NIPS); 2014. p. 1359–67.
4. Fränti P, Rezaei M, Zhao Q. Centroid index: cluster level similarity measure. Pattern Recognit. 2014;47(9):3034–45.
5. Fritzke B. The LBG-U method for vector quantization: an improvement over LBG inspired from neural networks. Neural Process Lett. 1997;5(1):35–45.
6. Fränti P, Virmajoki O. Iterative shrinking method for clustering problems. Pattern Recognit. 2006;39(5):761–5.
7. Likas A, Vlassis N, Verbeek JJ. The global k-means clustering algorithm. Pattern Recognit. 2003;36:451–61.
8. Kaukoranta T, Fränti P, Nevalainen O. Iterative split-and-merge algorithm for VQ codebook generation. Opt Eng. 1998;37(10):2726–32.
9. Frigui H, Krishnapuram R. Clustering by competitive agglomeration. Pattern Recognit. 1997;30(7):1109–19.
10. Fränti P, Kaukoranta T, Nevalainen O. On the splitting method for vector quantization codebook generation. Opt Eng. 1997;36(11):3043–51.
11. Resende MGC, Werneck RF. A fast swap-based local search procedure for location problems. Ann Oper Res. 2007;150(1):205–30.
12. Fränti P, Kivijärvi J, Nevalainen O. Tabu search algorithm for codebook generation in VQ. Pattern Recognit. 1998;31(8):1139–48.
13. Fränti P, Kivijärvi J. Randomised local search algorithm for the clustering problem. Pattern Anal Appl. 2000;3(4):358–69.
14. Kanungo T, Mount DM, Netanyahu N, Piatko C, Silverman R, Wu AY. A local search approximation algorithm for k-means clustering. Comput Geom. 2004;28(1):89–112.
15. Ng RT, Han J. CLARANS: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng. 2002;14(5):1003–16.
16. Fränti P, Gyllenberg HH, Gyllenberg M, Kivijärvi J, Koski T, Lund T, Nevalainen O. Minimizing stochastic complexity using GLA and local search with applications to classification of bacteria. Biosystems. 2000;57(1):37–48.
17. Rus C, Astola J. Legend based elevation layers extraction from a color-coded relief scanned map. In: IEEE international conference on image processing (ICIP), vol. 2, Genova, Italy; 2005. p. 237–40.
18. Güngör Z, Ünler A. K-harmonic means data clustering with simulated annealing heuristic. Appl Math Comput. 2007;184(2):199–209.
19. Nosovskiy GV, Liu D, Sourina O. Automatic clustering and boundary detection algorithm based on adaptive influence function. Pattern Recognit. 2008;41(9):2757–76.
20. Fränti P, Tuononen M, Virmajoki O. Deterministic and randomized local search algorithms for clustering. In: IEEE international conference on multimedia and expo (ICME’08), Hannover, Germany; 2008. p. 837–40.
21. Kaukoranta T, Fränti P, Nevalainen O. A fast exact GLA based on code vector activity detection. IEEE Trans Image Process. 2000;9(8):1337–42.
22. Lai JZC, Liaw YC. Improvement of the k-means clustering filtering algorithm. Pattern Recognit. 2008;41(12):3677–81.
23. Elkan C. Using the triangle inequality to accelerate k-means. In: International conference on machine learning (ICML’03), Washington, DC, USA; 2003. p. 147–53.
24. Fränti P, Virmajoki O, Hautamäki V. Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Trans Pattern Anal Mach Intell. 2006;28(11):1875–81.
25. Jin X, Kim S, Han J, Cao L, Yin Z. A general framework for efficient clustering of large datasets based on activity detection. Stat Anal Data Min. 2011;4:11–29.
26. Lai JZC, Liaw YC, Liu J. A fast VQ codebook generation algorithm using codeword displacement. Pattern Recognit. 2008;41:315–9.
27. Pestov V. On the geometry of similarity search: dimensionality curse and concentration of measure. Inf Process Lett. 2000;73:47–51.
28. Cormen T, Leiserson C, Rivest R. Introduction to algorithms. 2nd ed. New York: MIT Press; 1990. p. 54.
29. Aurenhammer F. Voronoi diagrams: a survey of a fundamental geometric data structure. ACM Comput Surv. 1991;23(3):345–405.
30. Preparata F, Shamos M. Computational geometry. New York: Springer; 1985.
31. Barber CB, Dobkin DP, Huhdanpää HT. The quickhull algorithm for convex hulls. ACM Trans Math Softw. 1996;22(4):469–83.
32. Fränti P, Mariescu-Istodor R, Zhong C. XNN graph. In: Joint international workshop on structural, syntactic, and statistical pattern recognition (S+SSPR 2016), Merida, Mexico, LNCS 10029; 2016. p. 207–17.
33. Zhang T, Ramakrishnan R, Livny M. BIRCH: a new data clustering algorithm and its applications. Data Min Knowl Discov. 1997;1(2):141–82.
34. KDD Cup 2004. bio_train.dat, training data for the protein homology task. 2004. http://osmot.cs.cornell.edu/kddcup.
35. Fränti P, Kaukoranta T, Shen DF, Chang KS. Fast and memory efficient implementation of the exact PNN. IEEE Trans Image Process. 2000;9(5):773–7.
36. Zhong C, Malinen MI, Miao D, Fränti P. A fast minimum spanning tree algorithm based on K-means. Inf Sci. 2015;295:1–17.
37. Arthur D, Vassilvitskii S. K-means++: the advantages of careful seeding. In: ACM-SIAM symposium on discrete algorithms (SODA’07), New Orleans, LA; 2007. p. 1027–35.
38. Pelleg D, Moore A. X-means: extending k-means with efficient estimation of the number of clusters. In: International conference on machine learning (ICML’00), Stanford, CA, USA; 2000.
39. Fränti P. Genetic algorithm with deterministic crossover for vector quantization. Pattern Recognit Lett. 2000;21(1):61–8.
40. Wu X. On convergence of Lloyd’s method I. IEEE Trans Inf Theory. 1992;38(1):171–4.
41. Selim SZ, Ismail MA. K-means-type algorithms: a generalized convergence theorem and characterization of local optimality. IEEE Trans Pattern Anal Mach Intell. 1984;6(1):81–7.
42. Rezaei M, Fränti P. Set matching measures for external cluster validity. IEEE Trans Knowl Data Eng. 2016;28(8):2173–86.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.