An empirical comparison between stochastic and deterministic centroid initialisation for K-Means variations

K-Means is one of the most used algorithms for data clustering and the usual clustering method for benchmarking. Despite its wide application it is well-known that it suffers from a series of disadvantages, such as the positions of the initial clustering centres (centroids), which can greatly affect the clustering solution. Over the years many K-Means variations and initialisations techniques have been proposed with different degrees of complexity. In this study we focus on common K-Means variations and deterministic initialisation techniques and we first show that more sophisticated initialisation methods reduce or alleviates the need of complex K-Means clustering, and secondly, that deterministic methods can achieve equivalent or better performance than stochastic methods. These conclusions are obtained through extensive benchmarking using different model data sets from various studies as well as clustering data sets.


Introduction
The most well-known algorithm in the field of clustering analysis is the K-Means algorithm. Its implementation simplicity, versatility and efficiency makes it popular to many different research fields [1,2]. Despite its reputation and success in many different studies it is an established fact that it has a series of disadvantages such that it can detect only spherical and well-separated clusters, it is sensitive to outliers, highly dependent on the features (attributes) of the data set under analysis and it converges to local minima [2]. Over the years a number of K-Means variations (Lloyd's K-Means [3], Hartigan-Wong's K-Means [4]), K-Means inspired algorithms (K-Medians [5]) and K-Means initialisation methods [6] have been proposed in order to overcome these issues and also enhance K-Means with extra properties such as feature selection mechanisms [7,8] and outliers robustification [9].
In the literature there are various studies regarding the importance of the initial cluster centroids for the performance of the K-Means algorithm [2] and extensive testing on various initialisation techniques [6,10], but a detailed comparison on the effects on these techniques on common K-Means variations is not directly available. We hypothesize that sophisticated initialisation methods alleviate the need for complex clustering and, if deterministic, they could lead to arXiv:1908.09946v2 [cs.LG] 6 Sep 2019 satisfactory solutions within a single execution of the clustering algorithm. Consequently, they would alleviate the need of executing a stochastic method multiple times and picking the best clustering based on some criterion.
In order to investigate this hypothesis we compare five different clustering initialisation methods, namely Random (MacQueen) [11], K-Means++ [12], ROBust INitialisation (ROBIN) [13], Kaufman [14] and Density K-Means++ (DK-Means++) [15], and their effects on common K-Means variations, Lloyd's [2] K-Means, Hartigan-Wong's K-Means [16,4] and K-Medians [5]. We show that more sophisticated initialisation methods can reduce the performance difference among the K-Means implementations and that the deterministic DK-Means++ method can achieve the same or better performance than the one obtained using stochastic methods, which require multiple executions and selection of the clustering which had the best performance. We also show that the ROBIN algorithm can become deterministic (D-ROBIN) with a minor change and we propose D-ROBIN as an alternative to DK-Means++ when memory requirements are important. A similar study comparing many different intialisation methods has been performed by Celebi et al. [6] but it is focused on algorithms of linear complexity without considering a specific K-Means implementation. Recently, another study by Fránti and Sieranoja [10] was performed on stochastic initialization heuristics for K-Means and how much the algorithm can be improved by repetition. They base their conclusions using the clustering basic benchmark [17,10] which contains standalone data sets with different properties and they showed that K-Means performance is in general poor on unbalanced data sets and that the algorithm is not affected by high dimensionality while more iterations can improve its performance on overlapping clusters. In our case we performed a more extensive benchmarking by taking into consideration data set generation models as well as standalone data sets. The models gave us the ability to perform hypothesis testing in order to strengthen our conclusions and to account for variability.

The K-Means algorithm
Given K initial centroids, the K-Means algorithm [2] assigns the data points of a data set into K clusters which minimizes the within cluster sum of squares (WCSS): where K is the number of clusters, n the number of data points (observations) and p the dimensionality (number of features) of a given dataset; x ij is the values of the j-th feature of the i-th data point and x i: the values of its p features; m k: specifies the location of the k-th cluster centroid which is computed by taking the average of the data points which belong to that cluster considering all the dimensions. This problem is equivalent to maximizing the between cluster sum of squares which is given by [7]: where M 1: specifies the global centroid assuming that all the data points belong to one cluster.
The steady state analysis of 1 leads to the computation of the cluster center as follows: ∂W CSS ∂m k: = ∂ ∂m k: where n k is the number of points in cluster k.

Lloyd's K-Means
Lloyd's method is the most commonly used K-Means and the standard clustering method in many programming languages such as MATLAB [18] and Python [19]. The steps of this algorithm are as follows [2]: 1. Generate K centroids for the clusters C = {c 1 , · · · , c k }. This is normally done by choosing K data points as cluster centroids at random from the data set (MacQueen method [11]). 2 2. Assign each data point to its nearer cluster based on the minimum squared euclidean distance between the point and the cluster centroids, i.e the cluster of the i-th data point will be k * = argmin k {||x i: − m k: ||}.
3. Recompute the cluster centroids by taking the mean of the data points assigned to them, i.e for the k-th cluster, if it contains n k elements its centroid is computed as m k: = 1 n k n k i=1 x i: , ∀x i: ∈ c k . 4. Repeat steps 2 and 3 until converge, i.e. there are no more data points reassignments.

Hartigan-Wong's K-Means
Hartigan-Wong's K-Means algorithm is an alternative to Lloyd's K-Means and the default K-Means of R language [20]. In the study of [3] it is shown that this method has lower probability of converging to a local minima solution compared to Lloyd's method in exchange of extra complexity. The algorithm starts by executing the first two steps of Lloyd's K-means algorithm and then proceeds as follows [4,3] (d) Recompute the centroid of c ξ given its new element x i: by taking the mean of the data points assigned to them, i.e m ξ: = 1

The K-Medians algorithm
The K-Medians algorithm [5] is similar to the K-Means but uses the median instead of the mean to calculate the cluster centroid. For every dimension the objection function of the algorithm is given by the equation: wherem kj specifies the location of the k-th cluster centroid in the j-th dimension which is computed by taking the median of the data points x ij belonging to that cluster on the j-th dimension. K-Medians corresponds to the L 1 -norm as opposed to the L 2 -norm of K-Means [5]. The use of median in place of the mean makes the K-Median algorithm robust to outliers [21] since the median has a breaking point of 0.5, i.e. even if half of the data set is corrupted by outliers the median of the corrupted data set will be similar to the median of the original data set [22]. The study of [23] contains a list of K-Median algorithms and common implementations consider a similar algorithm to Lloyd's K-Means where in the 3rd step of Lloyd's algorithm the median is used to compute the new centroids locations instead of the mean. It should be noted that during the 2nd step of the algorithm (the assignment step) the use of Euclidean or Euclidean 2 will have qualitatively the same result.

Random
The initialisation method of MacQueen [11] proposes a random selection of data points from the data set which will be the initial centroids. This is one of the earliest clustering initialisation techniques and an improvement of Jancey's method [24]. The latter study suggested the centroids to be at random locations within the minimum hypersphere of the data set but this might result to empty clusters to be generated after the execution of the K-Means algorithm.

K-Means++
K-Means++ [12] is a standard clustering initialisation technique to many programming languages such as MATLAB and Python. It uses a probabilistic approach in order to select as initial centroids data points that are far away from each other. The steps of this algorithm are as follows: 1. Select randomly a data point x i: as the first centroid m 1: uniformly from the data set. 3. Repeat the previous step until K data points have been selected as centroids.

Kaufman
Kaufman and Rousseeuw [14] proposed a deterministic method for centroids initialisation. Their method is as follows [1]: 1. Select as the first centroid the closest data point to the global centroid of the data set.
2. For every two non-selected data points x i: and x j: calculate C ji = max min where c k is one of the selected data points as centroid. 3. Select as the next centroid the data point x * i: which maximizes j C ji 4. Repeat the previous two steps until we have K centroids.
Kaufman's and Rousseeuw's algorithm has quadratic complexity O(N 2 ) because of the computation of the pairwise distances [6].

ROBust INitialisation (ROBIN)
ROBIN [13] is a robust to outliers initialisation method. It uses the Local Outlier Factor (LOF) [25] in order to select as initial centroids data points that are far away from each other and also representative points of dense regions in the data set. It is not deterministic since one reference point should be picked at random from the data set. In addition it requires one more tuning parameter which is the number of neighboring data points mp to be consider when computing the LOF of each data point. The steps of this algorithm are as follows: 1. Compute the LOF of each data point in the data set given the following formulas [13], ard(x i: , mp) = density(x i: , mp) LOF (x i: , mp) = 1 ard(x i: , mp) (7) where N N (x i: , mp) refers to the set of mp nearest data points, i is the index of the i -th data point in N N (x i: , mp) and |N N (x i: , mp)| ≥ mp. ard stands for the average readability distance.
2. Pick a data point x r: at random from the data set. 3. Sort the data points of the data set in decreasing order of their distance from x r: 4. Find the first of the sorted data point x r : for which LOF (x r : , mp) ≈ 1 and pick this data point as a cluster centroid. A LOF score close to 1 indicates a data point inside a dense region of the feature space. 5. Sort the data points of the data set in decreasing order of their distance from x r : 6. Repeat the previous two steps until we have K centroids, by replacing x r : with the point x c : whose coordinates have been calculated in the following way: let {x c: } be a set containing the data points selected as centroids, c be the index on the selected data points, x cj be a data point from the set with p dimensionality and j is the index on the dimensions. Then x c : = min c {x cj }∀j. This is equivalent on taking the minimum distances from the elements in {x c: } over all the dimensions.
ROBIN has complexity of O(N log N ) [6]. Regarding the 4 th step of the algorithm, in an R implementation (refer to the study of [9]) the formula LOF (x r : , mp) < 1.05 was used but since the LOF score can also be lower than 1, in our experiments, we used the formula 1 − crit < LOF (x r : , mp) < 1 + crit where crit was set to 0.05. 4

Density K-Means++ (DK-Means++)
DK-Means++ [15] is a deterministic method for centroids initialisation based on data density. It is an improved method of [26,27] since it requires only the parameter k to be defined and utilizes a heuristic to detect dense regions in the data set in order to select optimal centroids. The steps of this algorithm are as follows: 1. Compute an appropriate value for ε which is the radius of the hypersphere centered around each data point. In order to select a value for ε the following formula is proposed: ε = 3 · IQR(Λ) + 75 th percentile(Λ), where IQR stands for the Inter Quartile Range and Λ are the weights of edges of the Minimum Spanning Tree of the distance matrix of the data set. This equation was inspired based on statistical methods for detecting outliers [28] and is considered a good approximation for ε [15].

2.
For each x i: define the ε-neighbors(x i: ) as the data points falling under the hypersphere with centroid at x i: and radius ε. Compute the local density p(x i: ) using Equation 8: 3. Normalize p(x i: ) using the min-max normalization.
4. The first cluster centroid m 1: is the data point x * i: for which p(x * i: ) = max{p(x)}. Then C = {x * i: }. 5. Until K centroids are selected, for the rest of the data points compute their prospectiveness φ( The complexity of DK-Means++ varies from O(n log n) to O (2n − 1) log n depending on the algorithm used to compute the minimum spanning tree of the distance matrix (e.g. Kruskal's or Prim's algorithm) [29]. Since the distance matrix is a fully connected graph it will have vertices V = n and edges E = n − 1, where n is the number of data points in the data set. In case of Kruskal's algorithm, which has complexity O (n − 1) log n , the total algorithm complexity will be dominated by the sorting required in step 5 of the algorithm thus O(n log n) [6]. In case of Prim's algorithm the complexity can reach O (2n − 1) log n depending on its implementation. If the distance matrix is unknown then the algorithm complexity is dominated by the distance matrix computation which is O 1 2 n(n − 1) .

Deterministic ROBIN (D-ROBIN)
Both ROBIN [13] and the DK-Means++ [15] detect dense regions in the feature space and pick the most representative points in those regions as initial centroids. Their main difference is that ROBIN is picking good enough candidates for initial centroids while DK-Means++ is trying to find the optimal ones. This can be seen from the 4th step of ROBIN where the first data point x r : , which satisfies LOF (x r : , mp) ≈ 1, is picked as an initial centroid; this does not guarantee that x is the optimal candidate. Another difference is that the 2nd step of ROBIN is making it stochastic. In order to overcome the issue of stochasticity of ROBIN the reference point x r: can be selected as the one that has LOF score closest to 1 which means that this point is the most dense point in the data set. By performing this action it is guarantee that an almost optimal candidate will be selected as the first centroid; this can be seen from the 3rd to 5th steps of ROBIN (see section 2.2.4).

Clustering evaluation
In the literature there are many indexes for assessing the clustering performance [30,17]. Here, in order to evaluate the clustering solutions, we assume that our data belong to different classes l and it is possible that one cluster might contain more than one classes. We then define the purity of each cluster as P ci = 1 nc i max(n l ci ) and the overall clustering purity as: where i equals 1 to the number of available clusters K, n ci is the number of data points in cluster c i and n l ci is the number of data points in cluster c i with class label l [30]. Purity is bound between [0 1]; larger values of purity correspond to better performance accuracy and purity equals to 1 specifies an accuracy of 100% meaning that each cluster has data points from only one class. This performance indicator is ideal for our experiments as we know the class labels of our data. 5 Table 1: Gap [31] and weighted gap statistic [32] data sets models. Points the number of data points per cluster, or indicates that a random number was selected among the specified numbers for each cluster, to indicates that a random number was selected between the specified numbers for each cluster; D: number of features or attributes of the data set (dimensions); C: number of generated clusters. Gaussian models: clusters of low dimensionality generated from Gaussian distributions. 10-D Gaussian models: clusters of higher dimensionality generated from Gaussian distributions. Elongated models: clusters generated by adding Gaussian noise across lines. Misc models: (1) data sets containing Gaussian clusters of very different sizes, (2) non-Gaussain clusters generated from the exponential distribution. Visualization (when possible) of a data set from each model is available in Figure 1. Gaussian

Benchmarks
In our experiments we use the synthetic data sets models from the studies of Tibshirani et al. (gap statistic) [31], Yan and Ye (weighted gap statistic) [32] and Brodinova et al. [9]. From the first two studies we use the models without further modifications. A summary of the models can be found in Table 1 grouped by specific properties of the models. For more information refer to the relevant studies and also to Figure 1 for sample visualization of a data sets. We exclude model 1 from the gap statistic study [31] since it contains one cluster. The Brodinova et al. [9] generator was used to generate high-dimensional data sets consisting of informative and non-informative features. No noise injection (attributes with noise contamination) was considered in the current study. To avoid situations of overlapping clusters the minimum Euclidean distance between any two points in different clusters was set to 3 and data sets violating this rule were re-generated. A summary of the models can be found in Table 2.
Next we use our own synthetic data sets models consisted of clusters with mixed properties. We will refer to these models as mixed (refer to Figure 1 for a sample visualization of these models).
• model 1 generates 3-dimensional clusters, 1 spherical and 2 elongated. The spherical cluster is an 80 points Gaussian cluster at the origin with standard deviation of 0.1. The two elongated clusters have 100 points each and are generated as follows:   [9]. The minimum allowed Euclidean distance between two data points in different clusters was set to 3 and no noise injection was considered. Name: name of the model; Data points: the total number of data points in the data set; Dimensions: number of features or attributes of the data set, Informative (+) indicates attributes that are required to describe the data set while Non-informative (-) indicates variables that should be ignored; Clusters: number of generated clusters. These models are creating high-dimensional Gaussian clusters of different shapes using two different distributions, one for the informative and one for the uninformative variables. Left table (Brodinova (1)): Each cluster contains 40 data points. The parameters of these models are selected to test the performance of the clustering algorithm in data sets with different degrees of informative and/or uninformative features Right table (Brodinova (2)): The first four models create higher-dimensional balanced clusters (clusters of equal sizes) and the last two higher-dimensional unbalanced clusters each with number of points randomly selected between 50 to 100. The parameters of these models are selected to test the performance of the clustering algorithm in balanced and unbalanced data sets of higher dimensionality with increasing number of clusters. The input space was selected to be sparse, i.e. a few hundred points in 1000 or 1500 dimensions to avoid the slow computation of the Kaufman algorithm. We also consider the S-sets [33] and the A-sets [34] obtained from the clustering basic benchmark which was used in the studies of [17,10]. The aforementioned studies were dedicated to the K-Means properties and variations on various synthetic clustering data sets. Both models contains 2-dimensional data; S-sets contains 4 data sets with 5000 data points distributed among 15 Gaussian clusters with different degree of clustering overlap [33] and A-sets contains 3 data sets with 50, 35 and 50 clusters and 150 data points per cluster [34]. For more information about these data sets refer to the relevant studies.

Results
We test the performance of the K-Means variations, Lloyd's [2] Hartigan-Wong's [16,4] and K-Medians [5] initialised using the six different clustering initialisation methods named: Random [11], K-Means++ [12], ROBIN [13], Kaufman [14], DK-Means++ [15] and our proposed D-ROBIN. For the ROBIN and D-ROBIN the mp parameter specifying the number of neighbor data points was set to 10 as in the original study [13]. For the Hartigan-Wong's algorithm NAG's implementation was used [35]. In our experiments we use the synthetic data sets models from the studies of gap statistic [31] and weighted gap statistic [32] (refer to Table 1), Brodinova [9] (refer to Table 2) and other four custom data sets models (refer to Methods and Figure 1). From each model we generated 40 data sets and for each data set the stochastic methods were executed 25 times; the minimum, the maximum, the average and the standard deviation of the purity index of the clustering solutions were calculated. We also use the clustering data sets (S-sets [33] and A-sets [34]) from the studies of [17,10]. For each of these data sets we use the same set up of executing the stochastic methods 25 times.

Comparissons among Stochastic and Deterministic methods
We compare the different initialisation methods and their effects on K-Means clustering for the stochastic and the deterministic methods separately. Based on the results on the stochastic methods (Random, K-Means++ and ROBIN) and for the gap, weighted gap, Brodinova, mixed models, more sophisticated initialisation techniques lead in better clustering and K-Means initialized with ROBIN achieves, on average, the best performance (refer to Figure 2 where in 23/26 cases there is statistical significant difference on the average performance between ROBIN and the other two methods from these cases 22/23 times ROBIN is better than Random and K-Means++). Similar observations can be obtained from the clustering data sets (see Figure 6). Based on the results on the deterministic methods (Kaufman, DK-Means++ and D-ROBIN) K-Means initialized with DK-Means++ achieves the best performance followed closely by D-ROBIN (refer to Figure 3 where (a) there is statistical significant difference between Kaufman and DK-Means++ (D) Mixed models: these are the additional models proposed in our study that contain clusters with mixed properties such as different sizes (unbalanced) and/or generated from Gaussian and non-Gaussian distributions.
in 13/26 cases and from them in 10/13 cases DK-Means++ achieves better performance than Kaufman; (b) there is statistical significant difference between Kaufman and D-ROBIN in 12/26 cases and from them in 7/12 cases D-ROBIN has significantly better performance than Kaufman; (c) there is statistical statistical significant difference between D-ROBIN and DK-Means++ in 5/26 cases and DK-Means++ wins in all of them). Similar observations can be obtained from the clustering data sets (see Figure 6). It should be noted that similar conclusions are obtained regardless on the K-Means variation (Hartigan-Wong's, Lloyd's or K-Medians) however we observe that in the case of K-Medians Kaufman is better than both D-ROBIN and DK-Means++ considering only the cases where there is statistical difference.
In the cases of clustering data sets DK-Means++ and D-ROBIN are both better than Kaufman (see Figure 6) regardless the clustering algorithm (refer to the tables in the supplementary material).
Furthermore, we compare the initialisation methods of ROBIN and DK-Means++ which are the best performers of the stochastic and deterministic methods correspondingly. In general DK-Means++ achieves the best performance compared to ROBIN (see Figure 4 where in 6/26 cases there is statistical significant difference between the two methods and DK-Means++ wins in all of them) regardless of the K-Means variation (Hartigan-Wong's, Lloyd's or K-Medians). Interestingly in the case of mixed model 4 ROBIN performance drops because it spawns a centroid on one of the edges of the elongated cluster (purple cluster in Figure 1) while DK-Means++ selects a more central point as centroid for the same cluster. This indicates that the minimum spanning tree heuristic of DK-Means++ is more robust to elongated clusters compared to the LOF score of ROBIN. Regarding the clustering data sets (see Figure 6) similar again DK-Means++ is superior to ROBIN. These results suggest that deterministic methods are preferable over multiple executions of stochastic methods.
Next we compare ROBIN with D-ROBIN (see Figure 5) in order to examine if we have loss of performance by turning ROBIN into a deterministic algorithm. Based on the results ROBIN and D-ROBIN are equivalent in most cases or almost equivalent indicating that one execution of D-ROBIN can replace multiple executions of ROBIN. Similar conclusions are obtained in the cases of clustering data sets (see Figure 6) and regardless on the K-Means variation (Hartigan-Wong's, Lloyd's or K-Medians).
8 Table 3: Summary of performance comparison on K-Means variations using stochastic and deterministic methods. In this summary only the gap, weighted gap, Brodinova and mixed models were considered (26 models in total) and only the cases where there is a significant difference in the performance of any two K-Means variations (p < 0.05). HW = Hartigan-Wong's K-Means, Ll = Lloyd's K-Means, KMed = K-Medians. The column Significant better performance counts the number of times that the significant difference on the performance was in favor of each algorithm. Based on the results deterministic methods lowers the performance differences among the K-Means variations except in the case of Lloyd's vs K-Medians. Furthermore Hartigan-Wong's K-Means is superior to Lloyd's either when stochastic or deterministic methods are considered and superior to K-Medians only when stochastic methods are considered. K-Medians is better than Lloyd's. These results suggest that deterministic methods lowers the performance difference among the K-Means variations but not in the case of Loyd's K-Means vs K-Medians. They also suggest that K-Medians is a better algorithm than both Loyd's and Hartigan-Wong's K-Means since we have established from Figure 4 that between the best stochastic (ROBIN) and the best deterministic (DK-Means++) methods the latter has better performance and K-Medians wins the other algorithms on the deterministic methods.

Discussion
K-Means clustering remains one of the most common clustering techniques in many different research fields and frequently it is a component of more complex algorithms (e.g. hierarchical clustering [2]). Following similar benchmark studies on K-Means [6,17,10], in this study we compare stochastic and deterministic initialisation methods on K-Means variations. We particularly investigated the methods of ROBIN and DK-Means++ since to the best of our knowledge they have not been studied as extensively as other initialisation methods. Experimentally we showed: 9 Figure 2: The average performance of K-Means algorithm increases by using more sophisticated initialisation methods. K-Means performance initialized with different stochastic initialisation methods. Each plot shows the performance of the Hartigan-Wong's K-Means clustering solution using the purity index (y-axis) on different data sets models (x-axis) and initialized with different methods. gap, weighted gap and Brodinova models: For each model 40 data sets were generated and each initialisation method was executed 25 times on each model. The mean purity of the 25 executions were obtained for each data set and each bar shows the average mean purity across the 40 data sets. The errorbars are showing the minimum and maximum mean purity. Significant differences were obtained based on the Paired Samples Wilcoxon Test along the mean purity across the 40 data sets (#p-value ≥ 0.05; * p-value < 0.05; * * p-value < 0.01; * * * p-value < 0.001; * * * * p-value < 0.0001). Based on the graphs and considering only the cases with significant difference, the ROBIN initialisation method achieves the best performance compared with Random and K-Means++ except in the cases of weighted gap model 4 (wgap4) and mixed model 4. As expected K-Means++ is superior to Random [6,12]. For Bradinova (2)    showing the minimum and maximum mean purity. Significant differences were obtained based on the Paired Samples Wilcoxon Test along the mean purity across the 40 data sets (#p-value ≥ 0.05; * p-value < 0.05; * * p-value < 0.01; * * * p-value < 0.001; * * * * p-value < 0.0001). Based on the graphs DK-Means++ initialisation achieves equally good or better performance than ROBIN which is the best stochastic method (refer to Figure 2). In more detail in the cases of Brodinova models 2 and 4 and mixed model 4 DK-Means outperforms ROBIN while there is also a significant different between the two methods on the mixed model 3 but in that case the performance between the two is almost equivalent. These results indicate that it is possible to prefer one execution DK-Means++ over multiple executions of ROBIN given also the fact that both algorithms have the same complexity (O (N log N )) in the distance matrix is given. • More sophisticated stochastic initialisation methods can lead to better clustering regardless the K-Means variation and the ROBIN algorithm can achieve the best performance compared with Random and K-Means++ (refer to Figures 2). It also has the smallest variability among the clustering solutions (refer to the tables in the supplementary material).  Table 3) without compromising their performance (refer to the comparison figures in the supplementary material). • Hartigan-Wong's outperforms Lloyd's K-Means when performance differences are considered either with stochastic or deterministic initialisation methods (refer to Table 3). It outperforms K-Medians only when stochastic methods are considered. • K-Medians outperforms Lloyd's when performance differences are considered either with stochastic or deterministic initialisation methods (refer to Table 3).
These conclusions were based on extensive benchmarking considering many different clustering models from other studies: Gaussian, high-dimensional (10 dimensions), elongated, unbalanced, non-Gaussian from the studies of [31] and [32]; high-dimensional (20 dimensions) containing informative and uninformative features and higher-dimensional (1000 and 1500 dimensions) with varying number of clusters (3, 10, 50 clusters) and cluster sizes (50-100 points) [9]. We also used our own mixed models which contain clusters with different properties (unbalanced, elongated and Gaussian; unbalanced Gaussian and non-Gaussian; unbalanced, Gaussian with different standard deviation among their dimensions).
With the use of synthetic data set generators we had the ability to generate multiple data sets and run hypothesis testing to further strengthen our conclusions but we also considered standalone data sets. The clustering data sets S-sets [33] and A-sets [34] were selected from the studies of [17,10] because both are containing more clusters and data points than the generated ones and also because in the case of the S-sets the clusters are having different overlap degrees. The conclusions we obtained from the data set generators match with the conclusions of the stand-alone S-sets and A-sets clustering data.
Based on the previous studies [17,10] the authors have clearly demonstrated that K-Means performs worse when there is large number of clusters and that dimensionality does not have a direct effect on the performance of the algorithm.
In our experiments using the Brodinova models (refer to Figures 2 and 3 Brodinova graphs) we observe that indeed the performance of all the methods drops when the number of clusters is increased regardless of the dimensionality, especially in the case of Brodinova (2) model 4 where we generate data sets having 50 clusters. Apart from the last extreme case, we observed that multiple executions of stochastic methods improves the performance of K-Means but also does the deterministic method of Kaufman. It should also be noted that the deterministic DK-Means++ method achieves high performance on the clustering basic benchmark [17,10] A-sets and also in the S-sets 1 and 2 (refer to Figure 6) even thought these data sets have high number of clusters (A-sets: 20, 30, 50; S-sets: 15); its performance drops only in the S-sets 3 and 4 and this is possible due to the increased overlapping of these data sets. Nevertheless it still achieves better performance than the stochastic methods.
The same authors ( [17,10]) also demonstrated that strong cluster unbalances affect negatively the K-Means clustering.
In our experiments (refer to Figure 3) we observed that data sets with unbalance clusters does not cause any issues with D-ROBIN and DK-Means++ initialization methods except again in the case of Brodinova (2) model 6 possibly because of the number of clusters. However, D-ROBIN (and also its predecessor ROBIN) has impaired performance on mixed model 4 which contains four unbalanced mixed Gaussian clusters, one of which is particularly elongated; this happens because an initial centroid can spawn in one of the edges of this cluster which is relatively dense.
In order to show application to "real world problems" previous studies have chosen to use standard classification data sets as benchmarks for clustering. While this approach is commonly used, in these data the mapping from classes to clusters is somehow forced: it is possible that data from one class belong to different clusters, and assuming that number of clusters equals number of classes is likely to underestimate the true number of clusters. Instead we chose to work with benchmark models that allows us to generate multiple samples and evaluate the statistical significance of the results. At the same time, we considered a broad combination of different clusters, in terms of normality (Gaussian, non-Gaussian), shape (spherical, elongated) and size (clusters with different number of data points) including high dimensional data, as found in real world applications such as bioinformatics [36]. It should also be noted that many clustering frameworks designed to deal with complex data sets (e.g. sub-clustering [37], or sparse clustering [7,8,9]) are using the K-Means or some variant of it and are dependent on good clustering initialisation. Our experimental work revealed that there are deterministic methods (DK-Means++ [15] and D-ROBIN) that can potentially improve the performance of such frameworks or reduce their computational requirements by only one execution of the K-Means component.

Supplementary material
A Summary Tables   Table 1: Detailed comparison of K-Means variations with different initialisation methods using the gap models.
For each model 40 data sets were generated and the non-deterministic methods (Random, K-Means++ and ROBIN) were executed 25 times for each data set. For these methods the average purity of minimum, maximum, mean and standard deviation purity were obtained for each data set and then the average of these metrics were computed over all the data sets. Using the same metrics, for the deterministic methods (Kaufman, DK-Means++ and D-ROBIN) the average minimum, maximum and mean purity over all the data sets is the same and the average standard deviation purity is always 0.  Table 2: Detailed comparison of K-Means variations with different initialisation methods using the weighted gap models. For each model 40 data sets were generated and the non-deterministic methods (Random, K-Means++ and ROBIN) were executed 25 times for each data set. For these methods the average purity of minimum, maximum, mean and standard deviation purity were obtained for each data set and then the average of these metrics were computed over all the data sets. Using the same metrics, for the deterministic methods (Kaufman, DK-Means++ and D-ROBIN) the average minimum, maximum and mean purity over all the data sets is the same and the average standard deviation purity is always 0.  Table 3: Detailed comparison of K-Means variations with different initialisation methods using the Brodinova (1) models. For each model 40 data sets were generated and the non-deterministic methods (Random, K-Means++ and ROBIN) were executed 25 times for each data set. For these methods the average purity of minimum, maximum, mean and standard deviation purity were obtained for each data set and then the average of these metrics were computed over all the data sets. Using the same metrics, for the deterministic methods (Kaufman, DK-Means++ and D-ROBIN) the average minimum, maximum and mean purity over all the data sets is the same and the average standard deviation purity is always 0.  Table 4: Detailed comparison of K-Means variations with different initialisation methods using the Brodinova (2) models. For each model 40 data sets were generated and the non-deterministic methods (Random, K-Means++ and ROBIN) were executed 25 times for each data set. For these methods the average purity of minimum, maximum, mean and standard deviation purity were obtained for each data set and then the average of these metrics were computed over all the data sets. Using the same metrics, for the deterministic methods (Kaufman, DK-Means++ and D-ROBIN) the average minimum, maximum and mean purity over all the data sets is the same and the average standard deviation purity is always 0.  Table 5: Detailed comparison of K-Means variations with different initialisation methods using the mixed models. For each model 40 data sets were generated and the non-deterministic methods (Random, K-Means++ and ROBIN) were executed 25 times for each data set. For these methods the average purity of minimum, maximum, mean and standard deviation purity were obtained for each data set and then the average of these metrics were computed over all the data sets. Using the same metrics, for the deterministic methods (Kaufman, DK-Means++ and D-ROBIN) the average minimum, maximum and mean purity over all the data sets is the same and the average standard deviation purity is always 0.  Table 6: Detailed comparison of K-Means variations with different initialisation methods using the clustering data sets. For each data set the non-deterministic methods (Random, K-Means++ and ROBIN) were executed 25 times. For these methods the average purity of minimum, maximum, mean and standard deviation purity were obtained. Using the same metrics, for the deterministic methods (Kaufman, DK-Means++ and D-ROBIN) the average minimum, maximum and mean purity over all the data sets is the same and the average standard deviation purity is always 0.
K (1) the Gaussian clusters models, (2) the high-dimensions clusters models, (3) the elongated clusters models. gap, weighted gap and Brodinova models: For each model 40 data sets were generated and each initialisation method was executed 25 times on each model. The mean purity of the 25 executions were obtained for each data set and each bar shows the average mean purity across the 40 data sets. Significant differences were obtained based on the Paired Samples Wilcoxon Test along the mean purity across the 40 data sets (#p-value ≥ 0.05; * p-value < 0.05; * * p-value < 0.01; * * * p-value < 0.001; * * * * p-value < 0.0001). Hartigan-Wong is better and negative value that K-Medians is better. gap and weighted gap: (1) the Gaussian clusters models, (2) the high-dimensions clusters models, (3) the elongated clusters models. gap, weighted gap and Brodinova models: For each model 40 data sets were generated and each initialisation method was executed 25 times on each model. The mean purity of the 25 executions were obtained for each data set and each bar shows the average mean purity across the 40 data sets. Significant differences were obtained based on the Paired Samples Wilcoxon Test along the mean purity across the 40 data sets (#p-value ≥ 0.05; * p-value < 0.05; * * p-value < 0.01; * * * p-value < 0.001; * * * * p-value < 0.0001). Hartigan-Wong is better and negative value that K-Medians is better. gap and weighted gap: (1) the Gaussian clusters models, (2) the high-dimensions clusters models, (3) the elongated clusters models. gap, weighted gap and Brodinova models: For each model 40 data sets were generated and each initialisation method was executed 25 times on each model. The mean purity of the 25 executions were obtained for each data set and each bar shows the average mean purity across the 40 data sets. Significant differences were obtained based on the Paired Samples Wilcoxon Test along the mean purity across the 40 data sets (#p-value ≥ 0.05; * p-value < 0.05; * * p-value < 0.01; * * * p-value < 0.001; * * * * p-value < 0.0001).  : Comparison between Lloyd's clustering and K-Medians clustering using stochastic initialisation methods. Each plot shows the goodness of the Lloyd's K-Means clustering solution (from bottom every first line) and K-Medians clustering solution (from bottom every second line), using the purity index (y-axis), on different data sets and initialized with different initialisation methods. The number next to each bar represent the difference of the average purity between Lloyd's K-Means and K-Medians; positive value indicates that Lloyd's is better and negative value that K-Medians is better. gap and weighted gap: (1) the Gaussian clusters models, (2) the high-dimensions clusters models, (3) the elongated clusters models. gap, weighted gap and Brodinova models: For each model 40 data sets were generated and each initialisation method was executed 25 times on each model. The mean purity of the 25 executions were obtained for each data set and each bar shows the average mean purity across the 40 data sets. Significant differences were obtained based on the Paired Samples Wilcoxon Test along the mean purity across the 40 data sets (#p-value ≥ 0.05; * p-value < 0.05; * * p-value < 0.01; * * * p-value < 0.001; * * * * p-value < 0.0001). Figure 8: Comparison between Lloyd's clustering and K-Medians clustering using deterministic initialisation methods. Each plot shows the goodness of the Lloyd's K-Means clustering solution (from bottom every first line) and K-Medians clustering solution (from bottom every second line), using the purity index (y-axis), on different data sets and initialized with different initialisation methods. The number next to each bar represent the difference of the average purity between Lloyd's K-Means and K-Medians; positive value indicates that Lloyd's is better and negative value that K-Medians is better. gap and weighted gap: (1) the Gaussian clusters models, (2) the high-dimensions clusters models, (3) the elongated clusters models. gap, weighted gap and Brodinova models: For each model 40 data sets were generated and each initialisation method was executed 25 times on each model. The mean purity of the 25 executions were obtained for each data set and each bar shows the average mean purity across the 40 data sets. Significant differences were obtained based on the Paired Samples Wilcoxon Test along the mean purity across the 40 data sets (#p-value ≥ 0.05; * p-value < 0.05; * * p-value < 0.01; * * * p-value < 0.001; * * * * p-value < 0.0001). Figure 9: Comparison between Lloyd's K-Means and K-Medians clustering using different Stochastic and Deterministic initialisation methods. Each plot shows the goodness of the Lloyd's K-Means clustering solution (from bottom every first line) and K-Medians clustering solution (from bottom every second line), using the purity index (y-axis), on different data sets and initialized with different initialisation methods. The number next to each bar represent the difference of the average purity between Lloyd's K-Means and K-Medians; positive value indicates that Lloyd's K-Means is better and negative value that K-Medians is better. The mean purity of the 25 executions were obtained for each data set.