Abstract
The analysis of continuously larger datasets is a task of major importance in a wide variety of scientific fields. Therefore, the development of efficient and parallel algorithms to perform such an analysis is a crucial topic in unsupervised learning. Cluster analysis algorithms are a key element of exploratory data analysis and, among them, the K-means algorithm stands out as the most popular approach due to its ease of implementation, straightforward parallelizability and relatively low computational cost. Unfortunately, the K-means algorithm also has some drawbacks that have been extensively studied, such as its high dependency on the initial conditions, as well as the fact that it might not scale well on massive datasets. In this article, we propose a recursive and parallel approximation to the K-means algorithm that scales well in the number of instances of the problem, without affecting the quality of the approximation. In order to achieve this, instead of analyzing the entire dataset, we work on small weighted sets of representative points that are distributed in such a way that more importance is given to those regions where it is harder to determine the correct cluster assignment of the original instances. In addition to different theoretical properties, which explain the reasoning behind the algorithm, experimental results indicate that our method outperforms the state of the art in terms of the trade-off between the number of distance computations and the quality of the solution obtained.
Notes
Algorithm A is an \(\alpha \) factor approximation of the K-means problem, if \(E^D(C') \le \alpha \cdot \min \limits _{C \subseteq {\mathbb {R}}^d, |C|=K} E^{D}(C)\), for any output \(C'\) of A.
A weighted set of points W is a (K, \(\varepsilon \))-coreset if, for every set of centroids C, \(|F^{W}(C)-E^{D}(C)|\le \varepsilon \cdot E^{D}(C)\), where \(F^{W}(C)=\sum \limits _{\mathbf{y }\in W} w(\mathbf{y })\cdot \Vert \mathbf{y }-\mathbf{c }_{\mathbf{y }}\Vert ^2\) and \(w(\mathbf{y })\) is the weight associated with a representative \(\mathbf{y } \in W\).
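To make the coreset definition concrete, the following sketch (using a hypothetical toy dataset and a naive random partition into blocks, not the construction proposed in the article) compares the weighted cost \(F^{W}(C)\) of a summary against the full cost \(E^{D}(C)\) for one candidate set of centroids:

```python
import numpy as np

def full_cost(D, C):
    # E^D(C): sum of squared distances from each instance to its closest centroid
    d2 = ((D[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).sum()

def weighted_cost(W, w, C):
    # F^W(C): weighted sum of squared distances for the representatives
    d2 = ((W[:, None, :] - C[None, :, :]) ** 2).sum(-1)
    return (w * d2.min(axis=1)).sum()

rng = np.random.default_rng(0)
D = rng.normal(size=(1000, 2))   # toy dataset (hypothetical)
C = rng.normal(size=(3, 2))      # a candidate set of K = 3 centroids

# Naive weighted summary: centers of mass of a random partition of D,
# weighted by the number of instances each part represents
parts = np.array_split(rng.permutation(len(D)), 50)
W = np.stack([D[p].mean(axis=0) for p in parts])
w = np.array([len(p) for p in parts], dtype=float)

# Empirical relative error for this particular C
eps_hat = abs(weighted_cost(W, w, C) - full_cost(D, C)) / full_cost(D, C)
print(f"relative error for this C: {eps_hat:.3f}")
```

A (K, \(\varepsilon \))-coreset guarantees that this relative error stays below \(\varepsilon \) for every C, not just for one candidate as in this sketch.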
A partition \({\mathcal {P}}'\) of the dataset is thinner than \({\mathcal {P}}\) if each subset of \({\mathcal {P}}\) can be written as the union of subsets of \({\mathcal {P}}'\).
From now on, we will refer to each \(B \in {\mathcal {B}}\) as a block of the spatial partition \({\mathcal {B}}\).
Data sets with an enormous number of instances and a low number of dimensions.
From now on, we assume each block \(B \in {\mathcal {B}}\) to be a hyperrectangle aligned with the coordinate axes.
Additionally, in “Appendix C”, we comment on the grid based \(\hbox {RP}K\hbox {M}\).
The output of such an initialization is presented as KM++_init.
Similar values were used in the original paper (Sculley 2010).
References
Aloise D, Deshpande A, Hansen P, Popat P (2009) NP-hardness of Euclidean sum-of-squares clustering. Mach Learn 75(2):245–248
Arthur D, Vassilvitskii S (2007) k-means++: The Advantages of Careful Seeding. In: Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms, pp 1027–1035
Äyrämö S, Kärkkäinen T (2006) Introduction to Partitioning-Based Clustering Methods with a Robust Example. Reports of the Department of Mathematical Information Technology Series C, Software engineering and computational intelligence 1/2006
Bachem O, Lucic M, Hassani H, Krause A (2016) Fast and Provably Good Seedings for K-means. In: Advances in neural information processing systems, pp 55–63
Bachem O, Lucic M, Krause A (2018) Scalable K-means Clustering via Lightweight Coresets. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1119–1127
Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable K-means++. Proc VLDB Endow 5(7):622–633
Balcan MFF, Ehrlich S, Liang Y (2013) Distributed K-means and K-median clustering on general topologies. In: Advances in neural information processing systems, pp 1995–2003
Berkhin P et al (2006) A survey of clustering data mining techniques. Group Multidimens Data 25:71
Bottou L, Bengio Y (1995) Convergence Properties of the K-means Algorithms. In: Advances in neural information processing systems, pp 585–592
Boutsidis C, Drineas P, Mahoney MW (2009) Unsupervised Feature Selection for the K-means Clustering Problem. In: Advances in neural information processing systems, pp 153–161
Boutsidis C, Zouzias A, Drineas P (2010) Random Projections for K-means clustering. In: Advances in neural information processing systems, pp 298–306
Bradley PS, Fayyad UM (1998) Refining initial points for K-means clustering. In: Proceedings of the 15th international conference on machine learning, vol 98, pp 91–99
Capó M, Pérez A, Lozano JA (2017) An efficient approximation to the K-means clustering for massive data. Knowl-Based Syst 117:56–69
Cohen MB, Elder S, Musco C, Musco C, Persu M (2015) Dimensionality reduction for K-means clustering and low rank approximation. In: Proceedings of the 47th annual ACM symposium on theory of computing. ACM, pp 163–172
Davidson I, Satyanarayana A (2003) Speeding up K-means clustering by bootstrap averaging. In: IEEE data mining workshop on clustering large data sets
Ding C, He X (2004) K-means Clustering via Principal Component Analysis. In: Proceedings of the 21st international conference on Machine learning. ACM, p 29
Ding Y, Zhao Y, Shen X, Musuvathi M, Mytkowicz T (2015) Yinyang k-means: a drop-in replacement of the classic K-means with consistent speedup. In: International conference on machine learning, pp 579–587
Drake J, Hamerly G (2012) Accelerated K-means with adaptive distance bounds. In: 5th NIPS workshop on optimization for machine learning, pp 42–53
Elkan C (2003) Using the triangle inequality to accelerate K-means. In: Proceedings of the 20th international conference on machine learning, pp 147–153
Feldman D, Monemizadeh M, Sohler C (2007) A PTAS for K-means clustering based on weak coresets. In: Proceedings of the 23rd annual symposium on computational geometry, pp 11–18
Forgy EW (1965) Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics 21:768–769
Hamerly G (2010) Making K-means even faster. In: Proceedings of the SIAM international conference on data mining, pp 130–140
Har-Peled S, Mazumdar S (2004) On Coresets for K-means and K-median Clustering. In: Proceedings of the 36th ACM symposium on theory of computing, pp 291–300
Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recogn Lett 31(8):651–666
Jain AK, Dubes RC (1988) Algorithms for Clustering Data. Prentice-Hall, Inc, Upper Saddle River
Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323
Jordan M (2013) Frontiers in massive data analysis. Committee on the Analysis of Massive Data, Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences and Their Applications, Division on Engineering and Physical Sciences, National Research Council
Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY (2002a) An efficient K-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Mach Intell 24(7):881–892
Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY (2002b) A local search approximation algorithm for K-means clustering. In: Proceedings of the 18th annual symposium on computational geometry, pp 10–18
Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. North-Holland, Amsterdam
Kumar A, Sabharwal Y, Sen S (2004) A simple linear time (1 + \(\varepsilon \))-approximation algorithm for K-means clustering in any dimensions. In: Proceedings of the 45th annual IEEE symposium on foundations of computer science, pp 454–462
Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137
Lucic M, Bachem O, Krause A (2016) Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In: Artificial intelligence and statistics, pp 1–9
Mahajan M, Nimbhorkar P, Varadarajan K (2009) The Planar k-means problem is NP-hard. In: International workshop on algorithms and computation, pp 274–285
Manning CD, Raghavan P, Schütze H (2008) Evaluation in information retrieval. In: Introduction to information retrieval, pp 151–175
Matoušek J (2000) On approximate geometric K-clustering. Discret Comput Geom 24(1):61–84
Newling J, Fleuret F (2016) Nested mini-batch K-means. In: Advances in neural information processing systems, pp 1352–1360
Peña JM, Lozano JA, Larrañaga P (1999) An empirical comparison of four initialization methods for the K-means algorithm. Pattern Recogn Lett 20(10):1027–1040
Redmond SJ, Heneghan C (2007) A method for initialising the K-means clustering algorithm using KD-trees. Pattern Recogn Lett 28(8):965–973
Sculley D (2010) Web-scale K-means clustering. In: Proceedings of the 19th international conference on world wide web, pp 1177–1178
Shen X, Liu W, Tsang I, Shen F, Sun QS (2017) Compressed K-means for large-scale clustering. In: 31st AAAI conference on artificial intelligence
Steinley D, Brusco MJ (2007) Initializing K-means batch clustering: a critical evaluation of several techniques. J Classif 24(1):99–121
Vattani A (2011) K-means requires exponentially many iterations even in the plane. Discret Comput Geom 45(4):596–616
Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37
Zhao W, Ma H, He Q (2009) Parallel K-means clustering based on MapReduce. In: IEEE international conference on cloud computing, pp 674–679
Acknowledgements
Marco Capó and Aritz Pérez are partially supported by the Basque Government through the BERC 2014-2017 program and the ELKARTEK program, and by the Spanish Ministry of Economy and Competitiveness MINECO: BCAM Severo Ochoa excellence accreditation SVP-2014-068574 and SEV-2017-0718, and through the project TIN2017-82626-R funded by (AEI/FEDER, UE). Jose A. Lozano is partially supported by the Basque Government (IT1244-19), and Spanish Ministry of Economy and Competitiveness MINECO (TIN2016-78365-R).
Additional information
Responsible editor: Aristides Gionis.
Appendices
Additional remarks on \(\hbox {BW}K\hbox {M}\)
In this section, we discuss additional features of the \(\hbox {BW}K\hbox {M}\) algorithm, such as the selection of its initialization parameters. We also comment on different possible stopping criteria, together with their corresponding computational costs and theoretical guarantees.
1.1 Parameter selection
The construction of the initial space partition and the corresponding induced dataset partition of \(\hbox {BW}K\hbox {M}\) (see Algorithm 2 and Step 1 of Algorithm 5) depends on the parameters m, \(m'\), r, s, K and D, while the core of \(\hbox {BW}K\hbox {M}\) (Step 2 and Step 3) only depends on K and D. In this section, we propose how to select the parameters m, \(m'\), r and s, keeping in mind the following objectives: (i) to guarantee that \(\hbox {BW}K\hbox {M}\) has a computational complexity equal to or lower than \({\mathcal {O}}(n\cdot K \cdot d)\), which corresponds to the cost of Lloyd’s algorithm, and (ii) to obtain an initial spatial partition with a large amount of well assigned blocks.
In order to ensure that the computational complexity of \(\hbox {BW}K\hbox {M}\)’s initialization is, even in the worst case, \({\mathcal {O}}(n\cdot K \cdot d)\), we must take m, \(m'\), r and s such that \(r \cdot s \cdot m^{2}\), \(r \cdot m^{2} \cdot K \cdot d\) and \(n \cdot m\) are \({\mathcal {O}}(n \cdot K \cdot d)\). On the other hand, as we want such an initial partition to minimize the number of blocks that may not be well assigned, we must consider the following facts: (i) the larger the diagonal of a block \(B \in {\mathcal {B}}\), the more likely it is that B is not well assigned, (ii) as the number of clusters K increases, any block \(B \in {\mathcal {B}}\) has more chances of containing instances with different cluster affiliations, and (iii) as s increases, the cutting probabilities become better indicators for detecting those blocks that are not well assigned.
Taking into consideration these observations, and assuming that r is a predefined small integer satisfying \(r \ll n/s\), we propose the use of \(m={\mathcal {O}}(\sqrt{K \cdot d})\) and \(s={\mathcal {O}}(\sqrt{n})\). Not only does such a choice satisfy the complexity constraints that we just mentioned (see Theorem 5 in “Appendix B”), but also, in this case, the size of the initial partition increases with respect to both the dimensionality of the problem and the number of clusters: since, at each iteration, we divide a block along only one of its sides, as we increase the dimensionality we need more cuts (number of blocks) to have a sufficient reduction of its diagonal (observation (i)). Analogously, the number of blocks and the size of the sampling increase with respect to the number of clusters and the actual size of the dataset, respectively (observations (ii) and (iii)). In particular, in the experimental section, Sect. 3, we used \(m=10\cdot \sqrt{K\cdot d}\), \(s=\sqrt{n}\) and \(r=5\).
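As a sketch, the parameter choice used in the experiments can be written as follows (the function name is ours, and the constants 10 and 5 are the values reported above):

```python
import math

def bwkm_init_params(n, d, K, r=5):
    # m = 10*sqrt(K*d): grows with the dimensionality and number of clusters,
    # s = sqrt(n): the sampling size grows with the dataset size,
    # r: a small predefined integer satisfying r << n/s.
    m = int(10 * math.sqrt(K * d))
    s = int(math.sqrt(n))
    return m, s, r

# With these choices, r*s*m^2 = O(sqrt(n)*K*d), r*m^2*K*d = O((K*d)^2)
# and n*m = O(n*sqrt(K*d)), all of which are O(n*K*d) whenever K*d = O(n).
m, s, r = bwkm_init_params(n=10**6, d=10, K=50)
print(m, s, r)
```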
1.2 Stopping criterion
As we commented in Sect. 1.3, one of the advantages of constructing spatial partitions with only well assigned blocks is that our algorithm, under this setting, converges to a local minimum of the K-means problem over the entire dataset and, therefore, there is no need to execute any further run of the \(\hbox {BW}K\hbox {M}\) algorithm, as the set of centroids will remain the same for any thinner partition:
Theorem 3
If C is a fixed point of the weighted K-means algorithm for a spatial partition \({\mathcal {B}}\), for which all of its blocks are well assigned, then C is a fixed point of the K-means algorithm on D.Footnote 16
To verify this criterion, we can make use of the concept of the boundary of a spatial partition (Definition 4). In particular, observe that if \({\mathcal {F}}_{C,D}({\mathcal {B}})=\emptyset \), then one can guarantee that all the blocks of \({\mathcal {B}}\) are well assigned with respect to both C and D. To check this, we just need to scan the misassignment function value for each block, i.e., the cost is just \({\mathcal {O}}(|{\mathcal {P}}|)\). In addition to this criterion, in this section we propose three other stopping criteria:
- A practical computational criterion. We could set, in advance, the amount of computational resources that we are willing to use and stop when we exceed them. In particular, as the computation of distances is the most expensive step of the algorithm, we could set a maximum number of distance computations as a stopping criterion.
- A Lloyd’s algorithm type criterion. As we mentioned in Sect. 1.2, the common practice is to run Lloyd’s algorithm until the reduction of the error, after a certain iteration, is small, see Eq. 2. As in our weighted approximation we do not have access to the error \(E^{D}(C)\), a similar approach is to stop the algorithm when the displacement of the set of centroids, in consecutive iterations, is smaller than a fixed threshold, \({\varepsilon }_{w}\). We can actually set this threshold in such a way that the stopping criterion of Lloyd’s algorithm is satisfied. For instance, for \({\varepsilon }_{w}=\sqrt{l^{2}+\frac{\varepsilon ^2}{n^2}}-l\), if \(\Vert C-C'\Vert _{\infty } \le {\varepsilon }_{w}\), then the criterion in Eq. 2 is satisfied.Footnote 17 However, this would imply additional \({\mathcal {O}}(K\cdot d)\) computations at each iteration.
- A criterion based on the accuracy of the weighted error. We could also consider the bound obtained in Theorem 2 and stop when it is lower than a predefined threshold. This tells us how accurate the current weighted error is with respect to the error over the entire dataset. All the information in this bound is obtained from the weighted Lloyd iteration and the information of the blocks, and its computation is just \({\mathcal {O}}(|{\mathcal {P}}|)\).
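A minimal sketch of the second criterion, under the assumption that the sets of centroids are stored as arrays of K points (Theorem 4 justifies the choice of \({\varepsilon }_{w}\)):

```python
import numpy as np

def centroid_displacement_stop(C, C_new, l, eps, n):
    # eps_w = sqrt(l^2 + eps^2/n^2) - l; if ||C - C'||_inf <= eps_w, then
    # |E^D(C) - E^D(C')| <= eps (Theorem 4), so the usual Lloyd stopping
    # criterion holds without ever computing the full error E^D.
    eps_w = np.sqrt(l**2 + (eps / n) ** 2) - l
    displacement = np.linalg.norm(C - C_new, axis=1).max()
    return displacement <= eps_w

C = np.zeros((3, 2))
C_new = C + 1e-6          # a tiny update of the centroids
print(centroid_displacement_stop(C, C_new, l=1.0, eps=1.0, n=100))
```

Note that the check itself costs \({\mathcal {O}}(K\cdot d)\) per iteration, as stated above.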
Proofs
In Theorem 1, we prove the cutting criterion that we use in \(\hbox {BW}K\hbox {M}\). It consists of an inequality that, using only information related to the partition of the dataset and the weighted Lloyd’s algorithm, allows us to guarantee that a block is well assigned.
Theorem 1
Given a set of K centroids, C, a dataset, \(D \subseteq {\mathbb {R}}^d\), and a block B, if \(\epsilon _{C,D}(B) = 0\), then \(\mathbf{c }_{\mathbf{x }}= \mathbf{c }_{{\overline{P}}}\) for all \(\mathbf{x } \in P=B(D)\ne \emptyset \).
Proof
From the triangle inequality, we know that \(\Vert \mathbf{x }-\mathbf{c }_{{\overline{P}}}\Vert \le \Vert \mathbf{x }-{\overline{P}}\Vert +\Vert {\overline{P}}-\mathbf{c }_{{\overline{P}}}\Vert \). Moreover, observe that \({\overline{P}}\) is contained in the block B, since B is a convex polytope. Then \(\Vert \mathbf{x }-{\overline{P}}\Vert \le l_{B}\).
For this reason, \(\Vert \mathbf{x }-\mathbf{c }_{{\overline{P}}}\Vert \le l_{B} - \delta _{P}(C) + \Vert {\overline{P}}-\mathbf{c }\Vert \le (2\cdot l_{B} - \delta _{P}(C)) + \Vert \mathbf{x }-\mathbf{c }\Vert \) holds. As \(\epsilon _{C,D}(B)=\max \{0,2\cdot l_{B}-\delta _{P}(C)\}=0\), then \(2\cdot l_{B}-\delta _{P}(C) \le 0\) and, therefore, \(\Vert \mathbf{x }-\mathbf{c }_{{\overline{P}}}\Vert \le \Vert \mathbf{x }-\mathbf{c }\Vert \) for all \(\mathbf{c } \in C\). In other words, \(\mathbf{c }_{{\overline{P}}}=\mathop {{{\,\mathrm{arg\,min}\,}}}\nolimits _{\mathbf{c } \in C} \Vert \mathbf{x }-\mathbf{c }\Vert \) for all \(\mathbf{x } \in P\). \(\square \)
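The cutting criterion of Theorem 1 can be sketched numerically. The function name is ours, and we assume, based on the proof above, that \(\delta _{P}(C)\) is the margin between the distances from \({\overline{P}}\) to its second-closest and closest centroids:

```python
import numpy as np

def misassignment(P, C, l_B):
    # epsilon_{C,D}(B) = max{0, 2*l_B - delta_P(C)}: when it equals 0, every
    # instance in the block shares the cluster assignment of the block's
    # center of mass, so the block is well assigned (Theorem 1).
    p_bar = P.mean(axis=0)
    dists = np.sort(np.linalg.norm(C - p_bar, axis=1))
    delta = dists[1] - dists[0]   # margin: second-closest minus closest
    return max(0.0, 2 * l_B - delta)

P = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1]])   # a small, compact block
C = np.array([[0.0, 0.0], [10.0, 0.0]])              # two distant centroids
print(misassignment(P, C, l_B=0.2))   # 0.0: the block is well assigned
```

A block with a large diagonal \(l_{B}\) relative to the margin would instead yield a positive misassignment value, flagging it as a candidate for cutting.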
The two following results show some properties of the error function when having well assigned blocks.
Lemma 1
If \(\mathbf{c }_{\mathbf{x }}=\mathbf{c }_{{\overline{P}}}\) and \(\mathbf{c }_{\mathbf{x }}'=\mathbf{c }_{{\overline{P}}}'\) for all \(\mathbf{x } \in P\), where \(P \subseteq D\) and C, \(C'\) are a pair of sets of centroids, then \(E^{P}(C)-E^{\{ P \}}(C)=E^{P}(C')-E^{\{ P \}}(C')\).
Proof
From Lemma 1 in Capó et al (2017), the function \(f(\mathbf{c })= |P|\cdot \Vert {\overline{P}}- \mathbf{c } \Vert ^2 - \sum _{\mathbf{x } \in P} \Vert \mathbf{x }- \mathbf{c } \Vert ^2 \) is constant for \(\mathbf{c } \in {\mathbb {R}}^d\). In particular, since \(f({\overline{P}})=- \sum _{\mathbf{x } \in P} \Vert \mathbf{x }- {\overline{P}} \Vert ^2\), we have that \(|P|\cdot \Vert {\overline{P}}- \mathbf{c }_{{\overline{P}}} \Vert ^2 =\sum _{\mathbf{x } \in P} \Vert \mathbf{x }- \mathbf{c }_{{\overline{P}}} \Vert ^2 - \sum _{\mathbf{x } \in P} \Vert \mathbf{x }- {\overline{P}} \Vert ^2\), and so we can express the weighted error of a dataset partition, \({\mathcal {P}}\), as follows
In particular, for \(P \in {\mathcal {P}}\), we have
\(\square \)
From the previous result we observe that, if all the instances are correctly assigned in each block, then the difference between the weighted error and the error over the entire dataset is the same for both sets of centroids. In other words, if all the blocks of a given partition are correctly assigned, not only can we guarantee a monotone descent of the entire error function for our approximation, a property that cannot be guaranteed for the typical coreset-type approximations of K-means, but we also know exactly the reduction of this error after a weighted Lloyd iteration.
Lemma 2
Given two sets of centroids C, \(C'\), where \(C'\) is obtained after a weighted Lloyd’s iteration (on a partition \({\mathcal {P}}\)) over C, if \(\mathbf{c }_{\mathbf{x }}=\mathbf{c }_{{\overline{P}}}\) and \(\mathbf{c }_{\mathbf{x }}'=\mathbf{c }_{{\overline{P}}}'\) for all \(\mathbf{x } \in P\) and \(P \in {\mathcal {P}}\), then \(E^{D}(C') \le E^{D}(C)\).
Proof
Using Lemma 1 over all the subsets \(P \in {\mathcal {P}}\), we know that \(E^{D}(C')-E^{D}(C)=\sum _{P \in {\mathcal {P}}} (E^{P}(C')-E^{P}(C))\)\(= \sum _{P \in {\mathcal {P}}} (E^{\{ P \}}(C')-E^{\{ P \}}(C))=E^{{\mathcal {P}}}(C')-E^{{\mathcal {P}}}(C)\). Moreover, from the chain of inequalities A.1 in Capó et al (2017), we know that \(E^{{\mathcal {P}}}(C')\le E^{{\mathcal {P}}}(C)\) at any weighted Lloyd iteration over a given partition \({\mathcal {P}}\), thus \(E^{D}(C')\le E^{D}(C)\). \(\square \)
Up to this point, most of the quality results assume that all the blocks are well assigned. However, in order to achieve this, many \(\hbox {BW}K\hbox {M}\) iterations might be required. In the following result, we provide a bound on the weighted error with respect to the full error. This result shows that our weighted representation improves as more blocks of the partition satisfy the criterion in Algorithm 1 and/or the diagonals of the blocks become smaller.
Theorem 2
Given a dataset, D, a set of K centroids C and a spatial partition \({\mathcal {B}}\) of the dataset D, the following inequality is satisfied:
where \(P=B(D)\) and \({\mathcal {P}}= {\mathcal {B}}(D)\). Furthermore, for a well assigned partition \({\mathcal {P}}\), if \(C_{OPT}^{{\mathcal {P}}}=\mathop {{{\,\mathrm{arg\,min}\,}}}\nolimits _{C \subset {\mathbb {R}}^d, |C|=K}E^{{\mathcal {P}}}(C)\) and \(C_{OPT}=\mathop {{{\,\mathrm{arg\,min}\,}}}\nolimits _{C \subset {\mathbb {R}}^d, |C|=K}E^{D}(C)\), then
where \(l=\max \limits _{B \in {\mathcal {B}}} l_{B}\).
Proof
Using Eq. 6 in Lemma 1, we know that \(|E^{D}(C)-E^{{\mathcal {P}}}(C)| \le \sum \limits _{P \in {\mathcal {P}}} \sum \limits _{\mathbf{x } \in P} \Vert \mathbf{x }- \mathbf{c }_{{\overline{P}}} \Vert ^2 -\Vert \mathbf{x }- \mathbf{c }_{\mathbf{x }} \Vert ^2 + \Vert \mathbf{x }- {\overline{P}} \Vert ^2\).
Observe that, for a certain instance \(\mathbf{x } \in P\), where \(\epsilon _{C,D}(B)=\max \{0,2\cdot l_{B}-\delta _{P}(C)\}=0\), \(\Vert \mathbf{x }- \mathbf{c }_{{\overline{P}}} \Vert ^2 -\Vert \mathbf{x }- \mathbf{c }_{\mathbf{x }} \Vert ^2=0\), as \(\mathbf{c }_{\mathbf{x }}=\mathbf{c }_{{\overline{P}}}\) by Theorem 1. On the other hand, if \(\epsilon _{C,D}(B) > 0\), we have the following inequalities:
Using both inequalities, we have \(\Vert \mathbf{x }- \mathbf{c }_{{\overline{P}}} \Vert ^2 -\Vert \mathbf{x }- \mathbf{c }_{\mathbf{x }} \Vert ^2 \le 2 \cdot \epsilon _{C,D}(B) \cdot (2\cdot l_{B}+\Vert {\overline{P}}- \mathbf{c }_{{\overline{P}}} \Vert )\). On the other hand, observe that \(\sum \limits _{\mathbf{x } \in P} \Vert \mathbf{x }- {\overline{P}} \Vert ^2= \frac{1}{|P|} \cdot \sum \limits _{\mathbf{x },\mathbf{y } \in P} \Vert \mathbf{x }- \mathbf{y } \Vert ^2 \le \frac{1}{|P|} \cdot \frac{|P|\cdot (|P|-1)}{2} \cdot l_{B}^2=\frac{|P|-1}{2} \cdot l_{B}^2\).
Furthermore, if the partition is well assigned, then \(\epsilon _{C,D}(B)=0\) for all \(B \in {\mathcal {B}}\) and so,
\(\square \)
In Theorem 3, we show an interesting property of the \(\hbox {BW}K\hbox {M}\) algorithm. We verify that a fixed point of the weighted Lloyd’s algorithm, over a partition with only well assigned blocks, is also a fixed point of Lloyd’s algorithm over the entire dataset D.
Theorem 3
If C is a fixed point of the weighted K-means algorithm for a spatial partition \({\mathcal {B}}\), for which all of its blocks are well assigned, then C is a fixed point of the K-means algorithm on D.
Proof
\(C=\{\mathbf{c }_1, \ldots , \mathbf{c }_K\}\) is a fixed point of the weighted K-means algorithm, on a partition \({\mathcal {P}}\), if and only if, when applying an additional iteration of the weighted K-means algorithm on \({\mathcal {P}}\), the generated clusterings \({\mathcal {G}}_{1}({\mathcal {P}}), \ldots , {\mathcal {G}}_{K}({\mathcal {P}})\), i.e., \({\mathcal {G}}_{i}({\mathcal {P}}) := \{P \in {\mathcal {P}}: \mathbf{c }_{i}=\mathop {{{\,\mathrm{arg\,min}\,}}}\nolimits _{\mathbf{c } \in C} \Vert {\overline{P}}-\mathbf{c }\Vert \}\), satisfy \(\mathbf{c }_{i}=\frac{\sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} |P| \cdot {\overline{P}}}{\sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} |P|}\) for all \(i=\{1, \ldots , K\}\) (1).
Since all the blocks \(B \in {\mathcal {B}}\) are well assigned, then the clusterings of C in D, \({\mathcal {G}}_{i}(D) := \{\mathbf{x } \in D: \mathbf{c }_{i}=\mathop {{{\,\mathrm{arg\,min}\,}}}\nolimits _{\mathbf{c } \in C} \Vert \mathbf{x }-\mathbf{c }\Vert \}\), satisfy \(|{\mathcal {G}}_{i}(D)| = \sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} |P|\) (2) and \(\sum \limits _{\mathbf{x } \in {\mathcal {G}}_{i}(D)} \mathbf{x }=\sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} \sum \limits _{\mathbf{x } \in P} \mathbf{x }\) (3). From (1), (2) and (3), we have
that is, C is a fixed point of the K-means algorithm on D. \(\square \)
As we do not have access to the error over the entire dataset, \(E^{D}(C)\), since its computation is expensive, in Algorithm 5 we propose a possible stopping criterion that bounds the displacement of the set of centroids. In the following result, we show a choice of this bound such that, if the proposed criterion is verified, then the common stopping criterion of Lloyd’s algorithm is also satisfied.
Theorem 4
Given two sets of centroids \(C=\{\mathbf{c }_{k}\}_{k=1}^{K}\) and \(C'=\{\mathbf{c }_{k}'\}_{k=1}^{K}\), if \(\Vert C-C'\Vert _{\infty }=\max \limits _{k=1, \ldots ,K} \Vert \mathbf{c }_{k}-\mathbf{c }_{k}' \Vert \le {\varepsilon }_{w}\), where \({\varepsilon }_{w}=\sqrt{l^{2}+\frac{\varepsilon ^2}{n^2}}-l\), then \(|E^{D}(C)-E^{D}(C')|\le \varepsilon \).
Proof
Initially, we bound the following terms: \(\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert +\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert \) and \(|\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert -\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert |\) for any \(\mathbf{x }\in D\).
If we set j and t as the indexes satisfying \(\mathbf{c }_{j}=\mathbf{c }_{\mathbf{x }}\) and \(\mathbf{c }_{t}'=\mathbf{c }_{\mathbf{x }}'\), then we have \(\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert +\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert =\Vert \mathbf{x }-\mathbf{c }_{j}\Vert +\Vert \mathbf{x }-\mathbf{c }_{t}'\Vert \le \Vert \mathbf{x }-\mathbf{c }_{t}\Vert +\Vert \mathbf{x }-\mathbf{c }_{t}'\Vert \le 2 \cdot \Vert \mathbf{x }-\mathbf{c }_{t}'\Vert +\varepsilon _{w}= 2 \cdot \Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert +\varepsilon _{w}\) (1). Analogously, applying the triangle inequality, we have \(|\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert -\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert |\le \varepsilon _{w}\) (2). In the following chain of inequalities, we will make use of (1) and (2):
\(\square \)
As can be seen in Sect. 2.2, there are different parameters that must be tuned. In the following result, we establish a criterion for choosing the initialization parameters of Algorithm 2 in such a way that its complexity, even in the worst case scenario, remains the same as that of Lloyd’s algorithm.
Theorem 5
Given an integer r, if \(m={\mathcal {O}}(\sqrt{K \cdot d})\) and \(s={\mathcal {O}}(\sqrt{n})\), then Algorithm 2 is \({\mathcal {O}}(n\cdot K \cdot d)\).
Proof
It is enough to verify the conditions presented before. Firstly, observe that \(r \cdot s \cdot m^2={\mathcal {O}}(\sqrt{n} \cdot K \cdot d)\) and \(n \cdot m={\mathcal {O}}(n \cdot \sqrt{K \cdot d})\). Moreover, as \(K\cdot d= {\mathcal {O}}(n)\), then \(r \cdot m^2={\mathcal {O}}(n)\). \(\square \)
Finally, we present a complementary property of the grid based \(\hbox {RP}K\hbox {M}\) proposed in Capó et al (2017). The output of each \(\hbox {RP}K\hbox {M}\) iteration can be proved to be a coreset, with an exponential decrease in the error with respect to the number of iterations. This result can also be used to bound the \(\hbox {BW}K\hbox {M}\) error, if we take i as the minimum number of cuts of any block of a given partition \({\mathcal {P}}\) generated by \(\hbox {BW}K\hbox {M}\).
Theorem 6
Given a set of points D in \({\mathbb {R}}^d\), the i-th iteration of the grid based \(\hbox {RP}K\hbox {M}\) produces a \((K,\varepsilon )\)-coreset with \(\varepsilon =\frac{1}{2^{i-1}}\cdot (1+\frac{1}{2^{i+2}} \cdot \frac{n-1}{n})\cdot \frac{n \cdot l^2}{OPT}\), where \(OPT=\min \limits _{C \subseteq {\mathbb {R}}^d, |C|=K} E^{D}(C)\) and l the length of the diagonal of the smallest bounding box containing D.
Proof
First, we denote by \(\mathbf{x }'\) the representative of \(\mathbf{x }\in D\) at the i-th grid based \(\hbox {RP}K\hbox {M}\) iteration, i.e., if \(\mathbf{x }\in P\) then \(\mathbf{x }'={\overline{P}}\), where P is a block of the corresponding dataset partition \({\mathcal {P}}\) of D. Observe that, at the i-th grid based \(\hbox {RP}K\hbox {M}\) iteration, the length of the diagonal of each cell is \(\frac{1}{2^i}\cdot l\), and we define c as the positive real number satisfying \(\frac{1}{2^i}\cdot l=\sqrt{c \cdot \frac{OPT}{n}}\). By the triangle inequality, we have
Analogously, observe that the following inequalities hold \(\Vert \mathbf{x }'-\mathbf{c }_{\mathbf{x }'}\Vert +\Vert \mathbf{x }-\mathbf{x }'\Vert \ge \Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert \) and \(\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert +\Vert \mathbf{x }-\mathbf{x }'\Vert \ge \Vert \mathbf{x }'-\mathbf{c }_{\mathbf{x }'}\Vert \). Thus, \(\Vert \mathbf{x }-\mathbf{x }'\Vert \ge |\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert -\Vert \mathbf{x }'-\mathbf{c }_{\mathbf{x }'}\Vert |\):
On the other hand, we know that \(\sum \limits _{\mathbf{x }\in D} \Vert \mathbf{x }-\mathbf{x }'\Vert ^2 \le \frac{n-1}{2^{2i+1}}\cdot l^{2}\) and that, as both \(\mathbf{x }\) and \(\mathbf{x }'\) must be located in the same cell, \(\Vert \mathbf{x }-\mathbf{x }'\Vert \le \frac{1}{2^i}\cdot l\). Therefore, as \(\mathbf{d}(\mathbf{x },C) \le l\),
In other words, the i-th \(\hbox {RP}K\hbox {M}\) iteration is a \((K,\varepsilon )\)- coreset with \(\varepsilon =(\frac{1}{2^{i+2}} \cdot \frac{n-1}{n}+1) \cdot 2^{i+1} \cdot c=\frac{1}{2^{i-1}}\cdot (1+\frac{1}{2^{i+2}} \cdot \frac{n-1}{n})\cdot \frac{n \cdot l^2}{OPT}\). \(\square \)
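The bound in Theorem 6 can be evaluated directly; the sketch below (the function name is ours) shows the roughly halving decrease of \(\varepsilon \) with the iteration counter i:

```python
def rpkm_coreset_eps(i, n, l, opt):
    # epsilon = (1/2^(i-1)) * (1 + (1/2^(i+2)) * (n-1)/n) * n*l^2/opt,
    # where l is the diagonal of the smallest bounding box containing D
    # and opt is the optimal K-means error over the entire dataset.
    return (1 / 2 ** (i - 1)) * (1 + (1 / 2 ** (i + 2)) * (n - 1) / n) * n * l**2 / opt

# epsilon shrinks (roughly halving) as the grid is refined:
eps_values = [rpkm_coreset_eps(i, n=10**6, l=1.0, opt=10**4) for i in range(1, 6)]
print(eps_values)
```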
About the grid based \(\hbox {RP}K\hbox {M}\)
In the experimental section of Capó et al (2017), the partition sequence used (grid based \(\hbox {RP}K\hbox {M}\)) consisted of sequentially constructing a new spatial partition by dividing each block of the previous partition into \(2^{d}\) new blocks, i.e., \({\mathcal {P}}\) can have up to \(2^{i \cdot d}\) representatives after i iterations. In this section, we provide some additional results in which we compare the performance of the grid based \(\hbox {RP}K\hbox {M}\) with respect to the methods and datasets presented in Sect. 3, for \(K \in \{3,5,10,25,50\}\).
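The exponential growth of the grid based \(\hbox {RP}K\hbox {M}\) partition just described can be sketched as follows (the helper name is ours):

```python
def grid_rpkm_max_reps(i, d, n):
    # After i grid iterations each block has been split i times into 2^d
    # children, so the partition can hold up to 2^(i*d) representatives,
    # never more than the n original instances.
    return min(2 ** (i * d), n)

# In low dimension the count approaches n gradually; in higher dimension it
# saturates almost immediately, which hints at why the grid based RPKM
# struggles on higher dimensional datasets.
print(grid_rpkm_max_reps(10, 2, 10**6))   # 2^20 capped at n = 10^6
print(grid_rpkm_max_reps(3, 18, 10**6))   # saturates at n after 3 iterations
```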
As in Capó et al (2017), we fix a maximum number of iterations, M, as the stopping criterion for the grid based \(\hbox {RP}K\hbox {M}\). Initially, we considered \(M=10\); however, only for the CIF and 3RN datasets (case (i)) did the grid based \(\hbox {RP}K\hbox {M}\) manage to converge before reaching the running time limit (24 hours). For the HPC and WUY datasets (case (ii)), we obtained results for \(M=5\) and, unfortunately, for the datasets with the largest dimensionality (GS and SUSY), the grid based \(\hbox {RP}K\hbox {M}\) failed to provide any output (Table 3). The obtained results are summarized in Figs. 9, 10, 11, 12, 13, 14, 15 and 16.
For the datasets of case (i), we have a better view of the evolution of the number of representatives of the grid based \(\hbox {RP}K\hbox {M}\) with respect to the number of iterations. In Figs. 10 and 12 and Table 4, we observe that the number of representatives of the grid based \(\hbox {RP}K\hbox {M}\), after 10 iterations, is close to the number of instances for both the CIF and 3RN datasets, while, for \(\hbox {BW}K\hbox {M}\), we observe a much slower growth in the number of representatives. In particular, for the 3RN dataset, the number of representatives, for the different numbers of clusters and after 100 iterations, is still under \(13\%\) of the number of instances, while generating approximations of similar or better quality than those of the grid based \(\hbox {RP}K\hbox {M}\). Furthermore, we observe that, for all the datasets, the number of representatives of \(\hbox {BW}K\hbox {M}\) reaches a plateau well before the final number of iterations, meaning that, after a small number of iterations, most of the blocks generated by \(\hbox {BW}K\hbox {M}\) are well assigned. On the other hand, as the number of representatives of the grid based \(\hbox {RP}K\hbox {M}\) for the datasets in case (ii) is smaller than in the previous case, we observe in Figs. 13 and 15 that the quality of the approximation of the grid based \(\hbox {RP}K\hbox {M}\) is commonly much less competitive than the solutions obtained via \(\hbox {BW}K\hbox {M}\): the grid based \(\hbox {RP}K\hbox {M}\) commonly has over 10% relative error with respect to \(\hbox {BW}K\hbox {M}\).
In Table 4, we present the proportion of cases in which \(\hbox {BW}K\hbox {M}\) generates a well assigned partition, verified via the misassignment function of Theorem 1. As we commented in Sect. A.2, this is a sufficient condition for the solution obtained via \(\hbox {BW}K\hbox {M}\) to also be a fixed point of Lloyd’s algorithm for the entire dataset. We observe that, especially for a low number of clusters, \(\hbox {BW}K\hbox {M}\) is very likely to converge to a local minimum of the K-means problem. For instance, for the WUY dataset and \(K \in \{3,5\}\), \(\hbox {BW}K\hbox {M}\) always generated well assigned partitions, which is quite remarkable as the number of representatives in these cases is under 0.6% of the number of instances. As expected, as the number of clusters increases, it becomes harder to verify this condition; however, we must remember that it is only a sufficient condition, since we are using Theorem 1 rather than computing all the pairwise instance-centroid distances.
From the results presented in this section, it is clear that \(\hbox {BW}K\hbox {M}\) alleviates the main drawback of the grid based \(\hbox {RP}K\hbox {M}\), as it controls the growth of the number of representatives, which, in the worst case scenario, grows only linearly. This is an important factor, as it allows \(\hbox {BW}K\hbox {M}\) to scale better with respect to both the dimensionality and the number of iterations. Furthermore, \(\hbox {BW}K\hbox {M}\) is still an \(\hbox {RP}K\hbox {M}\)-type approach, meaning that, besides the theoretical guarantees developed throughout the article and the results just presented, \(\hbox {BW}K\hbox {M}\) also inherits the guarantees of the grid based \(\hbox {RP}K\hbox {M}\).
Capó, M., Pérez, A. & Lozano, J.A. An efficient K-means clustering algorithm for tall data. Data Min Knowl Disc 34, 776–811 (2020). https://doi.org/10.1007/s10618-020-00678-9