
An efficient K-means clustering algorithm for tall data


Abstract

The analysis of continuously larger datasets is a task of major importance in a wide variety of scientific fields. Therefore, the development of efficient and parallel algorithms to perform such an analysis is a crucial topic in unsupervised learning. Cluster analysis algorithms are a key element of exploratory data analysis and, among them, the K-means algorithm stands out as the most popular approach due to its ease of implementation, straightforward parallelizability and relatively low computational cost. Unfortunately, the K-means algorithm also has some drawbacks that have been extensively studied, such as its high dependency on the initial conditions, as well as the fact that it might not scale well on massive datasets. In this article, we propose a recursive and parallel approximation to the K-means algorithm that scales well in the number of instances of the problem, without affecting the quality of the approximation. In order to achieve this, instead of analyzing the entire dataset, we work on small weighted sets of representative points that are distributed in such a way that more importance is given to those regions where it is harder to determine the correct cluster assignment of the original instances. In addition to different theoretical properties, which explain the reasoning behind the algorithm, experimental results indicate that our method outperforms the state-of-the-art in terms of the trade-off between the number of distance computations and the quality of the solution obtained.


Notes

  1. Algorithm A is an \(\alpha \) factor approximation of the K-means problem, if \(E^D(C') \le \alpha \cdot \min \limits _{C \subseteq {\mathbb {R}}^d, |C|=K} E^{D}(C)\), for any output \(C'\) of A.

  2. A weighted set of points W is a (K, \(\varepsilon \))-coreset if, for every set of centroids C, \(|F^{W}(C)-E^{D}(C)|\le \varepsilon \cdot E^{D}(C)\), where \(F^{W}(C)=\sum \limits _{\mathbf{y }\in W} w(\mathbf{y })\cdot \Vert \mathbf{y }-\mathbf{c }_{\mathbf{y }}\Vert ^2\) and \(w(\mathbf{y })\) is the weight associated with a representative \(\mathbf{y } \in W\).

  3. A partition of the dataset \({\mathcal {P}}'\) is thinner than \({\mathcal {P}}\), if each subset of \({\mathcal {P}}\) can be written as the union of subsets of \({\mathcal {P}}'\).

  4. From now on, we will refer to each \(B \in {\mathcal {B}}\) as a block of the spatial partition \({\mathcal {B}}\).

  5. See Theorem 6 in “Appendix B”.

  6. Data sets with an enormous number of instances and a low number of dimensions.

  7. See Lemma 1 in “Appendix B”.

  8. See Theorem 2 in “Appendix B”.

  9. See Theorem 3 in “Appendix B”.

  10. From now on, we assume each block \(B \in {\mathcal {B}}\) to be a hyperrectangle aligned with the coordinate axes.

  11. The proof of Theorem 1 is in “Appendix B”.

  12. The proof of Theorem 2 is in “Appendix B”.

  13. Additionally, in “Appendix C”, we comment on the grid based \(\hbox {RP}K\hbox {M}\).

  14. The output of such an initialization is presented as KM++_init.

  15. Similar values were used in the original paper (Sculley 2010).

  16. The proof of Theorem 3 is in “Appendix B”.

  17. See Theorem 4 in “Appendix B”.

References

  • Aloise D, Deshpande A, Hansen P, Popat P (2009) NP-hardness of Euclidean sum-of-squares clustering. Mach Learn 75(2):245–248

  • Arthur D, Vassilvitskii S (2007) k-means++: The Advantages of Careful Seeding. In: Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms, pp 1027–1035

  • Äyrämö S, Kärkkäinen T (2006) Introduction to Partitioning-Based Clustering Methods with a Robust Example. Reports of the Department of Mathematical Information Technology Series C, Software engineering and computational intelligence 1/2006

  • Bachem O, Lucic M, Hassani H, Krause A (2016) Fast and Provably Good Seedings for K-means. In: Advances in neural information processing systems, pp 55–63

  • Bachem O, Lucic M, Krause A (2018) Scalable K-means Clustering via Lightweight Coresets. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1119–1127

  • Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable K-means++. Proc VLDB Endow 5(7):622–633

  • Balcan MFF, Ehrlich S, Liang Y (2013) Distributed K-means and K-median clustering on general topologies. In: Advances in neural information processing systems, pp 1995–2003

  • Berkhin P et al (2006) A survey of clustering data mining techniques. Group Multidimens Data 25:71

  • Bottou L, Bengio Y (1995) Convergence Properties of the K-means Algorithms. In: Advances in neural information processing systems, pp 585–592

  • Boutsidis C, Drineas P, Mahoney MW (2009) Unsupervised Feature Selection for the K-means Clustering Problem. In: Advances in neural information processing systems, pp 153–161

  • Boutsidis C, Zouzias A, Drineas P (2010) Random Projections for K-means clustering. In: Advances in neural information processing systems, pp 298–306

  • Bradley PS, Fayyad UM (1998) Refining initial points for K-means clustering. In: Proceedings of the 15th international conference on machine learning, vol 98, pp 91–99

  • Capó M, Pérez A, Lozano JA (2017) An efficient approximation to the K-means clustering for massive data. Knowl-Based Syst 117:56–69

  • Cohen MB, Elder S, Musco C, Musco C, Persu M (2015) Dimensionality reduction for K-means clustering and low rank approximation. In: Proceedings of the 47th annual ACM symposium on theory of computing. ACM, pp 163–172

  • Davidson I, Satyanarayana A (2003) Speeding up K-means clustering by bootstrap averaging. In: IEEE data mining workshop on clustering large data sets

  • Ding C, He X (2004) K-means Clustering via Principal Component Analysis. In: Proceedings of the 21st international conference on Machine learning. ACM, p 29

  • Ding Y, Zhao Y, Shen X, Musuvathi M, Mytkowicz T (2015) Yinyang k-means: a drop-in replacement of the classic K-means with consistent speedup. In: International conference on machine learning, pp 579–587

  • Drake J, Hamerly G (2012) Accelerated K-means with adaptive distance bounds. In: 5th NIPS workshop on optimization for machine learning, pp 42–53

  • Elkan C (2003) Using the triangle inequality to accelerate K-means. In: Proceedings of the 20th international conference on machine learning, pp 147–153

  • Feldman D, Monemizadeh M, Sohler C (2007) A PTAS for K-means clustering based on weak coresets. In: Proceedings of the 23rd annual symposium on computational geometry, pp 11–18

  • Forgy EW (1965) Cluster analysis of multivariate data: efficiency vs interpretability of classifications. Biometrics 21:768–769

  • Hamerly G (2010) Making K-means even faster. In: Proceedings of the SIAM international conference on data mining, pp 130–140

  • Har-Peled S, Mazumdar S (2004) On Coresets for K-means and K-median Clustering. In: Proceedings of the 36th ACM symposium on theory of computing, pp 291–300

  • Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recogn Lett 31(8):651–666

  • Jain AK, Dubes RC (1988) Algorithms for Clustering Data. Prentice-Hall, Inc, Upper Saddle River

  • Jain AK, Murty MN, Flynn PJ (1999) Data clustering: a review. ACM Comput Surv 31(3):264–323

  • Jordan M (2013) Frontiers in massive data analysis. Committee on the Analysis of Massive Data, Committee on Applied and Theoretical Statistics, Board on Mathematical Sciences and Their Applications, Division on Engineering and Physical Sciences, National Research Council

  • Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY (2002a) An efficient K-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Mach Intell 24(7):881–892

  • Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY (2002b) A local search approximation algorithm for K-means clustering. In: Proceedings of the 18th annual symposium on computational geometry, pp 10–18

  • Kaufman L, Rousseeuw P (1987) Clustering by means of medoids. North-Holland, Amsterdam

  • Kumar A, Sabharwal Y, Sen S (2004) A simple linear time (1 + \(\varepsilon \))-approximation algorithm for K-means clustering in any dimensions. In: Proceedings of the 45th annual IEEE symposium on foundations of computer science, pp 454–462

  • Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137

  • Lucic M, Bachem O, Krause A (2016) Strong coresets for hard and soft Bregman clustering with applications to exponential family mixtures. In: Artificial intelligence and statistics, pp 1–9

  • Mahajan M, Nimbhorkar P, Varadarajan K (2009) The Planar k-means problem is NP-hard. In: International workshop on algorithms and computation, pp 274–285

  • Manning CD, Raghavan P, Schütze H (2008) Evaluation in information retrieval. In: Introduction to information retrieval pp 151–175

  • Matoušek J (2000) On approximate geometric K-clustering. Discret Comput Geom 24(1):61–84

  • Newling J, Fleuret F (2016) Nested mini-batch K-means. In: Advances in neural information processing systems, pp 1352–1360

  • Peña JM, Lozano JA, Larrañaga P (1999) An empirical comparison of four initialization methods for the K-means algorithm. Pattern Recogn Lett 20(10):1027–1040

  • Redmond SJ, Heneghan C (2007) A method for initialising the K-means clustering algorithm using KD-trees. Pattern Recogn Lett 28(8):965–973

  • Sculley D (2010) Web-scale K-means clustering. In: Proceedings of the 19th international conference on world wide web, pp 1177–1178

  • Shen X, Liu W, Tsang I, Shen F, Sun QS (2017) Compressed K-means for large-scale clustering. In: 31st AAAI conference on artificial intelligence

  • Steinley D, Brusco MJ (2007) Initializing K-means batch clustering: a critical evaluation of several techniques. J Classif 24(1):99–121

  • Vattani A (2011) K-means requires exponentially many iterations even in the plane. Discret Comput Geom 45(4):596–616

  • Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1):1–37

  • Zhao W, Ma H, He Q (2009) Parallel K-means clustering based on MapReduce. In: IEEE international conference on cloud computing, pp 674–679

Acknowledgements

Marco Capó and Aritz Pérez are partially supported by the Basque Government through the BERC 2014-2017 program and the ELKARTEK program, and by the Spanish Ministry of Economy and Competitiveness MINECO: BCAM Severo Ochoa excellence accreditation SVP-2014-068574 and SEV-2017-0718, and through the project TIN2017-82626-R funded by (AEI/FEDER, UE). Jose A. Lozano is partially supported by the Basque Government (IT1244-19), and by the Spanish Ministry of Economy and Competitiveness MINECO (TIN2016-78365-R).

Author information


Correspondence to Marco Capó.

Additional information

Responsible editor: Aristides Gionis.


Appendices

Additional remarks on \(\hbox {BW}K\hbox {M}\)

In this section, we discuss additional features of the \(\hbox {BW}K\hbox {M}\) algorithm, such as the selection of its initialization parameters. We also comment on different possible stopping criteria, together with their corresponding computational costs and theoretical guarantees.

1.1 Parameter selection

The construction of the initial space partition and the corresponding induced dataset partition of \(\hbox {BW}K\hbox {M}\) (see Algorithm 2 and Step 1 of Algorithm 5) depends on the parameters m, \(m'\), r, s, K and D, while the core of \(\hbox {BW}K\hbox {M}\) (Step 2 and Step 3) only depends on K and D. In this section, we propose how to select the parameters m, \(m'\), r and s, keeping in mind the following objectives: (i) to guarantee that \(\hbox {BW}K\hbox {M}\) has a computational complexity equal to or lower than \({\mathcal {O}}(n\cdot K \cdot d)\), which corresponds to the cost of Lloyd’s algorithm, and (ii) to obtain an initial spatial partition with a large number of well assigned blocks.

In order to ensure that the computational complexity of \(\hbox {BW}K\hbox {M}\)’s initialization is, even in the worst case, \({\mathcal {O}}(n\cdot K \cdot d)\), we must take m, \(m'\), r and s such that \(r \cdot s \cdot m^{2}\), \(r \cdot m^{2} \cdot K \cdot d\) and \(n \cdot m\) are \({\mathcal {O}}(n \cdot K \cdot d)\). On the other hand, as we want such an initial partition to minimize the number of blocks that may not be well assigned, we must consider the following facts: (i) the larger the diagonal of a certain block \(B \in {\mathcal {B}}\) is, the more likely it is for B not to be well assigned, (ii) as the number of clusters K increases, any block \(B \in {\mathcal {B}}\) has more chances of containing instances with different cluster affiliations, and (iii) as s increases, the cutting probabilities become better indicators for detecting those blocks that are not well assigned.

Taking these observations into consideration, and assuming that r is a predefined small integer satisfying \(r \ll n/s\), we propose the use of \(m={\mathcal {O}}(\sqrt{K \cdot d})\) and \(s={\mathcal {O}}(\sqrt{n})\). Not only does such a choice satisfy the complexity constraints that we just mentioned (see Theorem 5 in “Appendix B”), but also, in this case, the size of the initial partition increases with respect to both the dimensionality of the problem and the number of clusters: since, at each iteration, we divide a block along only one of its sides, as the dimensionality increases we need more cuts (a larger number of blocks) to obtain a sufficient reduction of its diagonal (observation (i)). Analogously, the number of blocks and the size of the sampling increase with respect to the number of clusters and the actual size of the dataset, respectively (observations (ii) and (iii)). In particular, in the experimental section, Sect. 3, we used \(m=10\cdot \sqrt{K\cdot d}\), \(s=\sqrt{n}\) and \(r=5\).
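As an illustration, a minimal Python sketch of this parameter choice could look as follows; the function name and its interface are assumptions of this sketch, not part of the paper's implementation:

```python
import math

def bwkm_init_params(n, d, K, r=5):
    """Initialization parameters following the heuristics above:
    m = 10*sqrt(K*d) initial blocks, a sample of s = sqrt(n) instances,
    and a small predefined number of cutting steps r."""
    m = int(math.ceil(10 * math.sqrt(K * d)))
    s = int(math.ceil(math.sqrt(n)))
    return m, s, r

# Example: n = 10**6 instances, d = 10 features and K = 25 clusters give
# m = 159 blocks and a sample of s = 1000 instances.
```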

1.2 Stopping criterion

As we commented in Sect. 1.3, one of the advantages of constructing spatial partitions with only well assigned blocks is that, under this setting, our algorithm converges to a local minimum of the K-means problem over the entire dataset and, therefore, there is no need to execute any further run of the \(\hbox {BW}K\hbox {M}\) algorithm, as the set of centroids will remain the same for any thinner partition:

Theorem 3

If C is a fixed point of the weighted K-means algorithm for a spatial partition \({\mathcal {B}}\), for which all of its blocks are well assigned, then C is a fixed point of the K-means algorithm on D.Footnote 16

To verify this criterion, we can make use of the concept of the boundary of a spatial partition (Definition 4). In particular, observe that if \({\mathcal {F}}_{C,D}({\mathcal {B}})=\emptyset \), then one can guarantee that all the blocks of \({\mathcal {B}}\) are well assigned with respect to both C and D. To check this, we just need to scan the misassignment function value of each block, which is only \({\mathcal {O}}(|{\mathcal {P}}|)\). In addition to this criterion, in this section we propose three other stopping criteria:

  • A practical computational criterion We could set, in advance, the amount of computational resources that we are willing to use and stop when we exceed them. In particular, as the computation of distances is the most expensive step of the algorithm, we could set a maximum number of distance computations as a stopping criterion.

  • A Lloyd’s algorithm type criterion As we mentioned in Sect. 1.2, the common practice is to run Lloyd’s algorithm until the reduction of the error after a given iteration is small, see Eq. 2. As in our weighted approximation we do not have access to the error \(E^{D}(C)\), a similar approach is to stop the algorithm when the displacement of the set of centroids between consecutive iterations is smaller than a fixed threshold, \({\varepsilon }_{w}\). We can actually set this threshold in such a way that the stopping criterion of Lloyd’s algorithm is satisfied. For instance, for \({\varepsilon }_{w}=\sqrt{l^{2}+\frac{\varepsilon ^2}{n^2}}-l\), if \(\Vert C-C'\Vert _{\infty } \le {\varepsilon }_{w}\), then the criterion in Eq. 2 is satisfied (see the sketch after this list).Footnote 17 However, this would imply an additional \({\mathcal {O}}(K\cdot d)\) computation at each iteration.

  • A criterion based on the accuracy of the weighted error We could also consider the bound obtained in Theorem 2 and stop when it is lower than a predefined threshold. This lets us know how accurate our current weighted error is with respect to the error over the entire dataset. All the information in this bound is obtained from the weighted Lloyd iteration and the information of the blocks, and its computation is only \({\mathcal {O}}(|{\mathcal {P}}|)\).
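As a rough sketch of the Lloyd’s algorithm type criterion above, the following Python function checks the centroid displacement against the threshold \({\varepsilon }_{w}\) of Theorem 4; the array-based interface and the choice of l are assumptions of this sketch, not part of the algorithm's specification.

```python
import numpy as np

def displacement_stop(C_old, C_new, l, eps, n):
    """Stop when the largest centroid displacement ||C - C'||_inf falls below
    eps_w = sqrt(l^2 + eps^2 / n^2) - l, the threshold used in Theorem 4.
    C_old, C_new: (K, d) arrays of centroids from consecutive iterations.
    l: a constant bounding the point-to-centroid distances (e.g., the diagonal
    of the dataset's bounding box); eps: target error tolerance; n: number of
    instances."""
    eps_w = np.sqrt(l ** 2 + (eps / n) ** 2) - l
    displacement = np.max(np.linalg.norm(C_new - C_old, axis=1))
    return displacement <= eps_w
```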

Proofs

In Theorem 1, we prove the cutting criterion that we use in \(\hbox {BW}K\hbox {M}\). It consists of an inequality that, using only information regarding the partition of the dataset and the weighted Lloyd’s algorithm, allows us to guarantee that a block is well assigned.

Theorem 1

Given a set of K centroids, C, a dataset, \(D \subseteq {\mathbb {R}}^d\), and a block B, if \(\epsilon _{C,D}(B) = 0\), then \(\mathbf{c }_{\mathbf{x }}= \mathbf{c }_{{\overline{P}}}\) for all \(\mathbf{x } \in P=B(D)\ne \emptyset \).

Proof

From the triangular inequality, we know that \(\Vert \mathbf{x }-\mathbf{c }_{{\overline{P}}}\Vert \le \Vert \mathbf{x }-{\overline{P}}\Vert +\Vert {\overline{P}}-\mathbf{c }_{{\overline{P}}}\Vert \). Moreover, observe that \({\overline{P}}\) is contained in the block B, since B is a convex polytope. Then \(\Vert \mathbf{x }-{\overline{P}}\Vert \le l_{B}\).

For this reason, \(\Vert \mathbf{x }-\mathbf{c }_{{\overline{P}}}\Vert \le l_{B} - \delta _{P}(C) + \Vert {\overline{P}}-\mathbf{c }\Vert \le (2\cdot l_{B} - \delta _{P}(C)) + \Vert \mathbf{x }-\mathbf{c }\Vert \) holds. As \(\epsilon _{C,D}(B)=\max \{0,2\cdot l_{B}-\delta _{P}(C)\}=0\), then \(2\cdot l_{B}-\delta _{P}(C) \le 0\) and, therefore, \(\Vert \mathbf{x }-\mathbf{c }_{{\overline{P}}}\Vert \le \Vert \mathbf{x }-\mathbf{c }\Vert \) for all \(\mathbf{c } \in C\). In other words, \(\mathbf{c }_{{\overline{P}}}=\mathop {{{\,\mathrm{arg\,min}\,}}}\nolimits _{\mathbf{c } \in C} \Vert \mathbf{x }-\mathbf{c }\Vert \) for all \(\mathbf{x } \in P\). \(\square \)
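The cutting criterion can be sketched in a few lines of Python; here we assume, following the proof above, that \(\delta _{P}(C)\) is the gap between the distances from \({\overline{P}}\) to its second-closest and closest centroids.

```python
import numpy as np

def misassignment(P, centroids, l_B):
    """epsilon_{C,D}(B) = max{0, 2*l_B - delta_P(C)} for a block B, where, as in
    the proof above, delta_P(C) is taken as the gap between the distances from
    the block mean P_bar to its second-closest and closest centroids.
    P: (|P|, d) array of instances falling in B; centroids: (K, d), K >= 2;
    l_B: length of the diagonal of B."""
    P_bar = P.mean(axis=0)
    dists = np.sort(np.linalg.norm(centroids - P_bar, axis=1))
    delta = dists[1] - dists[0]        # margin between the two closest centroids
    return max(0.0, 2.0 * l_B - delta)

# A block is guaranteed to be well assigned whenever misassignment(...) == 0:
# by Theorem 1, every x in the block then shares the cluster of the block mean.
```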

The following two results show some properties of the error function when all the blocks are well assigned.

Lemma 1

If \(\mathbf{c }_{\mathbf{x }}=\mathbf{c }_{{\overline{P}}}\) and \(\mathbf{c }_{\mathbf{x }}'=\mathbf{c }_{{\overline{P}}}'\) for all \(\mathbf{x } \in P\), where \(P \subseteq D\) and C, \(C'\) are a pair of sets of centroids, then \(E^{P}(C)-E^{\{ P \}}(C)=E^{P}(C')-E^{\{ P \}}(C')\).

Proof

From Lemma 1 in Capó et al (2017), we know that the function \(f(\mathbf{c })= |P|\cdot \Vert {\overline{P}}- \mathbf{c } \Vert ^2 - \sum _{\mathbf{x } \in P} \Vert \mathbf{x }- \mathbf{c } \Vert ^2 \) is constant for \(\mathbf{c } \in {\mathbb {R}}^d\). In particular, since \(f({\overline{P}})=- \sum _{\mathbf{x } \in P} \Vert \mathbf{x }- {\overline{P}} \Vert ^2\), we have that \(|P|\cdot \Vert {\overline{P}}- \mathbf{c }_{{\overline{P}}} \Vert ^2 =\sum _{\mathbf{x } \in P} \Vert \mathbf{x }- \mathbf{c }_{{\overline{P}}} \Vert ^2 - \sum _{\mathbf{x } \in P} \Vert \mathbf{x }- {\overline{P}} \Vert ^2\), so we can express the weighted error of a dataset partition, \({\mathcal {P}}\), as follows

$$\begin{aligned} E^{{\mathcal {P}}}(C)= \sum \limits _{P \in {\mathcal {P}}} \sum _{\mathbf{x } \in P} \left( \Vert \mathbf{x }- \mathbf{c }_{{\overline{P}}} \Vert ^2 - \Vert \mathbf{x }- {\overline{P}} \Vert ^2\right) \end{aligned}$$
(6)

In particular, for \(P \in {\mathcal {P}}\), we have

$$\begin{aligned} E^{P}(C)-E^{\{ P \}}(C)&=\sum _{\mathbf{x } \in P} \left( \Vert \mathbf{x }- \mathbf{c }_{\mathbf{x }} \Vert ^2 -\Vert \mathbf{x }- \mathbf{c }_{{\overline{P}}} \Vert ^2 + \Vert \mathbf{x }- {\overline{P}} \Vert ^2\right) \\&= \sum _{\mathbf{x } \in P} \Vert \mathbf{x }- {\overline{P}} \Vert ^2 \\&=\sum _{\mathbf{x } \in P} \left( \Vert \mathbf{x }- \mathbf{c }_{\mathbf{x }}' \Vert ^2 -\Vert \mathbf{x }- \mathbf{c }_{{\overline{P}}}' \Vert ^2 + \Vert \mathbf{x }- {\overline{P}} \Vert ^2\right) \\&= E^{P}(C')-E^{\{ P \}}(C') \end{aligned}$$

\(\square \)

In the previous result we observe that, if all the instances are correctly assigned in each block, then the difference between the weighted error and the error over the entire dataset is the same for both sets of centroids. In other words, if all the blocks of a given partition are correctly assigned, not only can we guarantee a monotone descent of the error function over the entire dataset for our approximation, a property that cannot be guaranteed for the typical coreset-type approximations of K-means, but we also know exactly the reduction of this error after a weighted Lloyd iteration.

Lemma 2

Given two sets of centroids C and \(C'\), where \(C'\) is obtained after a weighted Lloyd’s iteration (on a partition \({\mathcal {P}}\)) over C, if \(\mathbf{c }_{\mathbf{x }}=\mathbf{c }_{{\overline{P}}}\) and \(\mathbf{c }_{\mathbf{x }}'=\mathbf{c }_{{\overline{P}}}'\) for all \(\mathbf{x } \in P\) and \(P \in {\mathcal {P}}\), then \(E^{D}(C') \le E^{D}(C)\).

Proof

Using Lemma 1 over all the subsets \(P \in {\mathcal {P}}\), we know that \(E^{D}(C')-E^{D}(C)=\sum _{P \in {\mathcal {P}}} (E^{P}(C')-E^{P}(C))\)\(= \sum _{P \in {\mathcal {P}}} (E^{\{ P \}}(C')-E^{\{ P \}}(C))=E^{{\mathcal {P}}}(C')-E^{{\mathcal {P}}}(C)\). Moreover, from the chain of inequalities A.1 in Capó et al (2017), we know that \(E^{{\mathcal {P}}}(C')\le E^{{\mathcal {P}}}(C)\) at any weighted Lloyd iteration over a given partition \({\mathcal {P}}\), thus \(E^{D}(C')\le E^{D}(C)\). \(\square \)
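For concreteness, a minimal sketch of the weighted Lloyd iteration referred to in Lemmas 1 and 2, acting on the block representatives \({\overline{P}}\) with weights \(|P|\), could look as follows; this is a generic weighted K-means step, not the full \(\hbox {BW}K\hbox {M}\) procedure.

```python
import numpy as np

def weighted_lloyd_iteration(reps, weights, centroids):
    """One weighted Lloyd (weighted K-means) iteration over block representatives:
    assign each representative P_bar (with weight |P|) to its closest centroid and
    recompute every centroid as the weighted mean of its representatives.
    reps: (R, d) block means; weights: (R,) block sizes; centroids: (K, d)."""
    # assignment step: squared distances and closest centroid per representative
    d2 = ((reps[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # update step: weighted mean of the representatives assigned to each centroid
    new_centroids = centroids.copy()
    for k in range(centroids.shape[0]):
        mask = labels == k
        if mask.any():
            w = weights[mask]
            new_centroids[k] = (w[:, None] * reps[mask]).sum(axis=0) / w.sum()
    return new_centroids, labels
```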

Up to this point, most of the quality results assume that all the blocks are well assigned. However, in order to achieve this, many \(\hbox {BW}K\hbox {M}\) iterations might be required. In the following result, we provide a bound on the difference between the weighted error and the error over the entire dataset. This result shows that our weighted representation improves as more blocks of our partition satisfy the criterion in Algorithm 1 and/or the diagonals of the blocks become smaller.

Theorem 2

Given a dataset, D, a set of K centroids C and a spatial partition \({\mathcal {B}}\) of the dataset D, the following inequality is satisfied:

$$\begin{aligned} \Big |E^{D}(C)-E^{{\mathcal {P}}}(C)\Big | \le \sum \limits _{B \in {\mathcal {B}}} \left[ 2 \cdot |P| \cdot \epsilon _{C,D}(B) \cdot \left( 2 \cdot l_{B}+ \Vert {\overline{P}}- \mathbf{c }_{{\overline{P}}} \Vert \right) + \frac{|P|-1}{2} \cdot l_{B}^2 \right] , \end{aligned}$$

where \(P=B(D)\) and \({\mathcal {P}}= {\mathcal {B}}(D)\). Furthermore, for a well assigned partition \({\mathcal {P}}\), if \(C_{OPT}^{{\mathcal {P}}}=\mathop {{{\,\mathrm{arg\,min}\,}}}\nolimits _{C \subset {\mathbb {R}}^d, |C|=K}E^{{\mathcal {P}}}(C)\) and \(C_{OPT}=\mathop {{{\,\mathrm{arg\,min}\,}}}\nolimits _{C \subset {\mathbb {R}}^d, |C|=K}E^{D}(C)\), then

$$\begin{aligned} E^{D}\Big (C_{OPT}^{{\mathcal {P}}}\Big )\le E^{D}(C_{OPT})+ (n-|{\mathcal {P}}|)\cdot l^{2}, \end{aligned}$$

where \(l=\max \limits _{B \in {\mathcal {B}}} l_{B}\).

Proof

Using Eq. 6 in Lemma 1, we know that \(|E^{D}(C)-E^{{\mathcal {P}}}(C)| \le \sum \limits _{P \in {\mathcal {P}}} \sum \limits _{\mathbf{x } \in P} \left( \Vert \mathbf{x }- \mathbf{c }_{{\overline{P}}} \Vert ^2 -\Vert \mathbf{x }- \mathbf{c }_{\mathbf{x }} \Vert ^2 + \Vert \mathbf{x }- {\overline{P}} \Vert ^2 \right) \).

Observe that, for a certain instance \(\mathbf{x } \in P\), where \(\epsilon _{C,D}(B)=\max \{0,2\cdot l_{B}-\delta _{P}(C)\}=0\), \(\Vert \mathbf{x }- \mathbf{c }_{{\overline{P}}} \Vert ^2 -\Vert \mathbf{x }- \mathbf{c }_{\mathbf{x }} \Vert ^2=0\), as \(\mathbf{c }_{\mathbf{x }}=\mathbf{c }_{{\overline{P}}}\) by Theorem 1. On the other hand, if \(\epsilon _{C,D}(B) > 0\), we have the following inequalities:

$$\begin{aligned} \Vert \mathbf{x }- \mathbf{c }_{{\overline{P}}} \Vert -\Vert \mathbf{x }- \mathbf{c }_{\mathbf{x }} \Vert&\le 2\cdot \Big \Vert \mathbf{x }- {\overline{P}} \Big \Vert -\left( \Big \Vert {\overline{P}}- \mathbf{c }_{\mathbf{x }} \Big \Vert -\Big \Vert {\overline{P}}- \mathbf{c }_{{\overline{P}}} \Big \Vert \right) \\&\le \epsilon _{C,D}(B)\\ \Big \Vert \mathbf{x }- \mathbf{c }_{{\overline{P}}} \Big \Vert +\Big \Vert \mathbf{x }- \mathbf{c }_{\mathbf{x }} \Big \Vert&\le 2\cdot \Big \Vert \mathbf{x }- {\overline{P}} \Big \Vert + \Big \Vert {\overline{P}}- \mathbf{c }_{\mathbf{x }} \Big \Vert +\Big \Vert {\overline{P}}- \mathbf{c }_{{\overline{P}}} \Big \Vert \\&< 2\cdot l_{B}+\left( 2\cdot l_{B}+\Big \Vert {\overline{P}}- \mathbf{c }_{{\overline{P}}} \Big \Vert \right) \\&\quad + \Big \Vert {\overline{P}}- \mathbf{c }_{{\overline{P}}} \Big \Vert \\&= 2\cdot \left( 2\cdot l_{B}+\Big \Vert {\overline{P}}- \mathbf{c }_{{\overline{P}}} \Big \Vert \right) \end{aligned}$$

Using both inequalities, we have \(\Vert \mathbf{x }- \mathbf{c }_{{\overline{P}}} \Vert ^2 -\Vert \mathbf{x }- \mathbf{c }_{\mathbf{x }} \Vert ^2 \le 2 \cdot \epsilon _{C,D}(B) \cdot (2\cdot l_{B}+\Vert {\overline{P}}- \mathbf{c }_{{\overline{P}}} \Vert )\). On the other hand, observe that \(\sum \limits _{\mathbf{x } \in P} \Vert \mathbf{x }- {\overline{P}} \Vert ^2= \frac{1}{|P|} \cdot \sum \limits _{\mathbf{x },\mathbf{y } \in P} \Vert \mathbf{x }- \mathbf{y } \Vert ^2 \le \frac{1}{|P|} \cdot \frac{|P|\cdot (|P|-1)}{2} \cdot l_{B}^2=\frac{|P|-1}{2} \cdot l_{B}^2\).

Furthermore, if the partition is well assigned, then \(\epsilon _{C,D}(B)=0\) for all \(B \in {\mathcal {B}}\) and so,

$$\begin{aligned} E^{D}(C_{OPT}^{{\mathcal {P}}})&\le E^{{\mathcal {P}}}(C_{OPT}^{{\mathcal {P}}})+\sum \limits _{\begin{array}{c} B \in {\mathcal {B}} \end{array}} \frac{|P|-1}{2} \cdot l_{B}^2 \\&\le E^{D}(C_{OPT})+2 \cdot \sum \limits _{\begin{array}{c} B \in {\mathcal {B}} \end{array}} \frac{|P|-1}{2} \cdot l_{B}^2 \\&\le E^{D}(C_{OPT})+(n-|{\mathcal {P}}|)\cdot l^2 \end{aligned}$$

\(\square \)
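A direct, if naive, way of evaluating the bound of Theorem 2 from per-block statistics is sketched below; the representation of each block as a pair (P, l_B) is an assumption of this sketch.

```python
import numpy as np

def theorem2_bound(blocks, centroids):
    """Accumulate the right-hand side of Theorem 2 over the blocks:
    sum_B [ 2*|P|*eps_B*(2*l_B + ||P_bar - c_{P_bar}||) + (|P|-1)/2 * l_B^2 ].
    `blocks` is assumed to be a list of (P, l_B) pairs, with P the (|P|, d)
    array of instances in block B and l_B the length of its diagonal;
    `centroids` is a (K, d) array, K >= 2."""
    bound = 0.0
    for P, l_B in blocks:
        P_bar = P.mean(axis=0)
        dists = np.sort(np.linalg.norm(centroids - P_bar, axis=1))
        eps_B = max(0.0, 2.0 * l_B - (dists[1] - dists[0]))  # misassignment value
        bound += 2 * len(P) * eps_B * (2 * l_B + dists[0]) + (len(P) - 1) / 2 * l_B ** 2
    return bound
```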

In Theorem 3, we show an interesting property of the \(\hbox {BW}K\hbox {M}\) algorithm. We verify that a fixed point of the weighted Lloyd’s algorithm, over a partition with only well assigned blocks, is also a fixed point of Lloyd’s algorithm over the entire dataset D.

Theorem 3

If C is a fixed point of the weighted K-means algorithm for a spatial partition \({\mathcal {B}}\), for which all of its blocks are well assigned, then C is a fixed point of the K-means algorithm on D.

Proof

\(C=\{\mathbf{c }_1, \ldots , \mathbf{c }_K\}\) is a fixed point of the weighted K-means algorithm, on a partition \({\mathcal {P}}\), if and only if, when applying an additional iteration of the weighted K-means algorithm on \({\mathcal {P}}\), the generated clusterings \({\mathcal {G}}_{1}({\mathcal {P}}), \ldots , {\mathcal {G}}_{K}({\mathcal {P}})\), i.e., \({\mathcal {G}}_{i}({\mathcal {P}}) := \{P \in {\mathcal {P}}: \mathbf{c }_{i}=\mathop {{{\,\mathrm{arg\,min}\,}}}\nolimits _{\mathbf{c } \in C} \Vert {\overline{P}}-\mathbf{c }\Vert \}\), satisfy \(\mathbf{c }_{i}=\frac{\sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} |P| \cdot {\overline{P}}}{\sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} |P|}\) for all \(i \in \{1, \ldots , K\}\) (1).

Since all the blocks \(B \in {\mathcal {B}}\) are well assigned, then the clusterings of C in D, \({\mathcal {G}}_{i}(D) := \{\mathbf{x } \in D: \mathbf{c }_{i}=\mathop {{{\,\mathrm{arg\,min}\,}}}\nolimits _{\mathbf{c } \in C} \Vert \mathbf{x }-\mathbf{c }\Vert \}\), satisfy \(|{\mathcal {G}}_{i}(D)| = \sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} |P|\) (2) and \(\sum \limits _{\mathbf{x } \in {\mathcal {G}}_{i}(D)} \mathbf{x }=\sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} \sum \limits _{\mathbf{x } \in P} \mathbf{x }\) (3). From (1), (2) and (3), we have

$$\begin{aligned} \mathbf{c }_{i}= & {} \frac{\sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} |P| \cdot {\overline{P}}}{\sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} |P|}= \frac{\sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} |P| \cdot \sum \limits _{\mathbf{x } \in P} \frac{\mathbf{x }}{|P|}}{\sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} |P|}\\= & {} \frac{\sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} \sum \limits _{\mathbf{x } \in P} \mathbf{x }}{\sum \limits _{P \in {\mathcal {G}}_{i}({\mathcal {P}})} |P|}=\frac{\sum \limits _{\mathbf{x } \in {\mathcal {G}}_{i}(D)} \mathbf{x }}{|{\mathcal {G}}_{i}(D)|} \forall \ i \in {1,\ldots , K}, \end{aligned}$$

that is, C is a fixed point of the K-means algorithm on D. \(\square \)
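The identity used in steps (1), (2) and (3) can also be checked numerically: when the blocks partition the instances of a cluster, the weighted mean of the block representatives equals the mean of the underlying instances. A small sketch with hypothetical data:

```python
import numpy as np

# Minimal numeric check: for blocks whose instances all belong to the same
# cluster, the weighted mean of the block representatives equals the mean of
# the underlying instances. The data below is purely illustrative.
rng = np.random.default_rng(0)
blocks = [rng.normal(size=(m, 2)) for m in (4, 7, 9)]       # instances per block
reps = np.array([B.mean(axis=0) for B in blocks])           # block means P_bar
weights = np.array([len(B) for B in blocks], dtype=float)   # block sizes |P|

weighted_mean = (weights[:, None] * reps).sum(axis=0) / weights.sum()
full_mean = np.vstack(blocks).mean(axis=0)
assert np.allclose(weighted_mean, full_mean)
```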

As we do not have access to the error over the entire dataset, \(E^{D}(C)\), since its computation is expensive, in Algorithm 5 we propose a possible stopping criterion that bounds the displacement of the set of centroids. In the following result, we show a choice of this bound such that, if the proposed criterion is verified, then the common stopping criterion of Lloyd’s algorithm is also satisfied.

Theorem 4

Given two sets of centroids \(C=\{\mathbf{c }_{k}\}_{k=1}^{K}\) and \(C'=\{\mathbf{c }_{k}'\}_{k=1}^{K}\), if \(\Vert C-C'\Vert _{\infty }=\max \limits _{k=1, \ldots ,K} \Vert \mathbf{c }_{k}-\mathbf{c }_{k}' \Vert \le {\varepsilon }_{w}\), where \({\varepsilon }_{w}=\sqrt{l^{2}+\frac{\varepsilon ^2}{n^2}}-l\), then \(|E^{D}(C)-E^{D}(C')|\le \varepsilon \).

Proof

Initially, we bound the following terms: \(\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert +\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert \) and \(|\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert -\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert |\) for any \(\mathbf{x }\in D\).

If we set j and t as the indexes satisfying \(\mathbf{c }_{j}=\mathbf{c }_{\mathbf{x }}\) and \(\mathbf{c }_{t}'=\mathbf{c }_{\mathbf{x }}'\), then we have \(\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert +\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert =\Vert \mathbf{x }-\mathbf{c }_{j}\Vert +\Vert \mathbf{x }-\mathbf{c }_{t}'\Vert \le \Vert \mathbf{x }-\mathbf{c }_{t}\Vert +\Vert \mathbf{x }-\mathbf{c }_{t}'\Vert \le 2 \cdot \Vert \mathbf{x }-\mathbf{c }_{t}'\Vert +\varepsilon _{w}= 2 \cdot \Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert +\varepsilon _{w}\) (1). Analogously, applying the triangular inequality, we have \(|\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert -\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert |\le \varepsilon _{w}\) (2). In the following chain of inequalities, we will make use of (1) and (2):

$$\begin{aligned} |E^{D}(C)-E^{D}(C')|\le & {} \Big |\sum \limits _{\mathbf{x }\in D} \Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert ^2-\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert ^2\Big |\\\le & {} \sum \limits _{\mathbf{x }\in D} \Big |\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert ^2-\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert ^2\Big |\\\le & {} \sum \limits _{\mathbf{x }\in D} \Big (\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert +\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert \Big )\cdot \\&\Big |\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert -\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert \Big | \\\le & {} \sum \limits _{\mathbf{x }\in D} \varepsilon _{w} \cdot \Big (2 \cdot \Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert +\varepsilon _{w}\Big )\\\le & {} n \cdot \varepsilon _{w}^{2} + 2 \cdot n \cdot \max \limits _{\mathbf{x } \in D}\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}'\Vert \cdot \varepsilon _{w}\\\le & {} n \cdot \varepsilon _{w}^{2} + 2 \cdot n \cdot l\cdot \varepsilon _{w}= \varepsilon \end{aligned}$$

\(\square \)

As can be seen in Sect. 2.2, there are different parameters that must be tuned. In the following result, we set a criterion for choosing the initialization parameters of Algorithm 2 in such a way that its complexity, even in the worst case scenario, remains the same as that of Lloyd’s algorithm.

Theorem 5

Given an integer r, if \(m={\mathcal {O}}(\sqrt{K \cdot d})\) and \(s={\mathcal {O}}(\sqrt{n})\), then Algorithm 2 is \({\mathcal {O}}(n\cdot K \cdot d)\).

Proof

It is enough to verify the conditions presented before. Firstly, observe that \(r \cdot s \cdot m^2={\mathcal {O}}(\sqrt{n} \cdot K \cdot d)\) and \(n \cdot m={\mathcal {O}}(n \cdot \sqrt{K \cdot d})\). Moreover, as \(K\cdot d= {\mathcal {O}}(n)\), then \(r \cdot m^2={\mathcal {O}}(n)\). \(\square \)

Finally, we present a complementary property of the grid based \(\hbox {RP}K\hbox {M}\) proposed in Capó et al (2017). Each iteration of the grid based \(\hbox {RP}K\hbox {M}\) can be shown to produce a coreset whose error decreases exponentially with the number of iterations. This result could also be used to bound the \(\hbox {BW}K\hbox {M}\) error, taking i as the minimum number of cuts of any block of a given partition \({\mathcal {P}}\) generated by \(\hbox {BW}K\hbox {M}\).

Theorem 6

Given a set of points D in \({\mathbb {R}}^d\), the i-th iteration of the grid based \(\hbox {RP}K\hbox {M}\) produces a \((K,\varepsilon )\)-coreset with \(\varepsilon =\frac{1}{2^{i-1}}\cdot (1+\frac{1}{2^{i+2}} \cdot \frac{n-1}{n})\cdot \frac{n \cdot l^2}{OPT}\), where \(OPT=\min \limits _{C \subseteq {\mathbb {R}}^d, |C|=K} E^{D}(C)\) and l is the length of the diagonal of the smallest bounding box containing D.

Proof

Firstly, we denote by \(\mathbf{x }'\) the representative of \(\mathbf{x }\in D\) at the i-th grid based \(\hbox {RP}K\hbox {M}\) iteration, i.e., if \(\mathbf{x }\in P\) then \(\mathbf{x }'={\overline{P}}\), where P is a block of the corresponding dataset partition \({\mathcal {P}}\) of D. Observe that, at the i-th grid based \(\hbox {RP}K\hbox {M}\) iteration, the length of the diagonal of each cell is \(\frac{1}{2^i}\cdot l\), and we define a positive constant c as the real number satisfying \(\frac{1}{2^i}\cdot l=\sqrt{c \cdot \frac{OPT}{n}}\). By the triangular inequality, we have

$$\begin{aligned} \Big |E^{D}(C)-E^{{\mathcal {P}}}(C)\Big |&\le \sum \limits _{\mathbf{x }\in D} \Big |\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert ^2-\Vert \mathbf{x }'-\mathbf{c }_{\mathbf{x }'}\Vert ^2\Big |\\&\le \sum \limits _{\mathbf{x }\in D} \Big |\big (\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert -\Vert \mathbf{x }'-\mathbf{c }_{\mathbf{x }'}\Vert \big )\big (\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert +\Vert \mathbf{x }'-\mathbf{c }_{\mathbf{x }'}\Vert \big )\Big | \end{aligned}$$

Analogously, observe that the following inequalities hold \(\Vert \mathbf{x }'-\mathbf{c }_{\mathbf{x }'}\Vert +\Vert \mathbf{x }-\mathbf{x }'\Vert \ge \Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert \) and \(\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert +\Vert \mathbf{x }-\mathbf{x }'\Vert \ge \Vert \mathbf{x }'-\mathbf{c }_{\mathbf{x }'}\Vert \). Thus, \(\Vert \mathbf{x }-\mathbf{x }'\Vert \ge |\Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert -\Vert \mathbf{x }'-\mathbf{c }_{\mathbf{x }'}\Vert |\):

$$\begin{aligned} \Big |E^{D}(C)-E^{{\mathcal {P}}}(C)\Big |\le \sum \limits _{\mathbf{x }\in D} \Vert \mathbf{x }-\mathbf{x }'\Vert \cdot \left( 2\cdot \Vert \mathbf{x }-\mathbf{c }_{\mathbf{x }}\Vert +\Vert \mathbf{x }-\mathbf{x }'\Vert \right) \end{aligned}$$

On the other hand, we know that \(\sum \limits _{\mathbf{x }\in D} \Vert \mathbf{x }-\mathbf{x }'\Vert ^2 \le \frac{n-1}{2^{2i+1}}\cdot l^{2}\) and that, as both \(\mathbf{x }\) and \(\mathbf{x }'\) must be located in the same cell, \(\Vert \mathbf{x }-\mathbf{x }'\Vert \le \frac{1}{2^i}\cdot l\). Therefore, as \(\mathbf{d}(\mathbf{x },C) \le l\),

$$\begin{aligned} \Big |E^{D}(C)-E^{{\mathcal {P}}}(C)\Big |\le & {} \left( \frac{n-1}{2^{2i+1}}+\frac{n}{2^{i-1}}\right) \cdot l^{2} \\\le & {} \left( \frac{n-1}{2^{2i+1}}+\frac{n}{2^{i-1}}\right) \cdot 2^{2i} \cdot c \cdot \frac{OPT}{n}\\\le & {} \left( \frac{1}{2^{i+2}} \cdot \frac{n-1}{n}+1\right) \cdot 2^{i+1} \cdot c \cdot E^{D}(C) \end{aligned}$$

In other words, the i-th grid based \(\hbox {RP}K\hbox {M}\) iteration produces a \((K,\varepsilon )\)-coreset with \(\varepsilon =(\frac{1}{2^{i+2}} \cdot \frac{n-1}{n}+1) \cdot 2^{i+1} \cdot c=\frac{1}{2^{i-1}}\cdot (1+\frac{1}{2^{i+2}} \cdot \frac{n-1}{n})\cdot \frac{n \cdot l^2}{OPT}\). \(\square \)

About the grid based \(\hbox {RP}K\hbox {M}\)

In the experimental section of Capó et al (2017), the partition sequence used (grid based \(\hbox {RP}K\hbox {M}\)) consisted of sequentially constructing a new spatial partition by dividing each block of the previous partition into \(2^{d}\) new blocks, i.e., after i iterations \({\mathcal {P}}\) can have up to \(2^{i \cdot d}\) representatives. In this section, we provide some additional results in which we compare the performance of the grid based \(\hbox {RP}K\hbox {M}\) against the methods and datasets presented in Sect. 3, for \(K \in \{3,5,10,25,50\}\).
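To make the construction concrete, the following sketch assigns each instance to its grid cell at iteration i, assuming the bounding box of D is split into \(2^i\) equal segments per axis; the helper below is hypothetical and not the original implementation.

```python
import numpy as np

def grid_cell_indices(D, i):
    """Assign each instance to its grid cell at iteration i of the grid based
    RPKM: the bounding box of D is split into 2**i equal segments per axis, so
    each block of the previous iteration is divided into 2**d new blocks.
    D: (n, d) array. Returns an (n, d) integer array of per-axis cell indices."""
    lo, hi = D.min(axis=0), D.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)     # avoid division by zero on flat axes
    idx = np.floor((D - lo) / span * (2 ** i)).astype(int)
    return np.clip(idx, 0, 2 ** i - 1)         # points on the upper boundary

# The representatives and weights of the induced partition P are then the mean
# and the count of the instances sharing a cell index.
```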

As in Capó et al (2017), we fix a maximum number of iterations, M, as the stopping criterion for the grid based \(\hbox {RP}K\hbox {M}\). Initially, we considered \(M=10\); however, only for the CIF and 3RN datasets, case (i), did the grid based \(\hbox {RP}K\hbox {M}\) manage to converge before reaching the running time limit (24 hours). Moreover, for the HPC and WUY datasets, case (ii), we obtained results for \(M=5\); unfortunately, for the datasets with the largest dimensionality (GS and SUSY), the grid based \(\hbox {RP}K\hbox {M}\) failed to provide any output (Table 3). The obtained results are summarized in Figs. 9, 10, 11, 12, 13, 14, 15 and 16.

Table 3 Proportion of final number of representatives to number of instances for the different datasets and numbers of clusters
Fig. 9 Relative distance computations versus relative error on the CIF dataset

Fig. 10 Proportion representatives/instances with respect to the number of iterations on the CIF dataset

Fig. 11 Relative distance computations versus relative error on the 3RN dataset

Fig. 12 Proportion representatives/instances with respect to the number of iterations on the 3RN dataset

Fig. 13 Relative distance computations versus relative error on the HPC dataset

Fig. 14 Proportion representatives/instances with respect to the number of iterations on the HPC dataset

Fig. 15 Relative distance computations versus relative error on the WUY dataset

Fig. 16 Proportion representatives/instances with respect to the number of iterations on the WUY dataset

For the datasets of case (i), we have a better view of the evolution of the number of representatives of the grid based \(\hbox {RP}K\hbox {M}\) with respect to the number of iterations. In Figs. 10, 12 and Table 4, we observe that the number of representatives of the grid based \(\hbox {RP}K\hbox {M}\), after 10 iterations, is close to the number of instances in both the CIF and 3RN datasets, while, for \(\hbox {BW}K\hbox {M}\), we observe a much slower growth in the number of representatives. In particular, for the 3RN dataset, the number of representatives for the different numbers of clusters, after 100 iterations, is still under \(13\%\) of the number of instances, while generating approximations of similar or better quality than those of the grid based \(\hbox {RP}K\hbox {M}\). Furthermore, we observe that, for all the datasets, the number of representatives of \(\hbox {BW}K\hbox {M}\) reaches a plateau well before the final number of iterations, meaning that, after a small number of iterations, most of the blocks generated by \(\hbox {BW}K\hbox {M}\) are well assigned. On the other hand, as the number of representatives of the grid based \(\hbox {RP}K\hbox {M}\) for the datasets in case (ii) is smaller than in the previous case, we observe in Figs. 13 and 15 that the quality of the approximation of the grid based \(\hbox {RP}K\hbox {M}\) is commonly much less competitive than that of the solutions obtained via \(\hbox {BW}K\hbox {M}\): the grid based \(\hbox {RP}K\hbox {M}\) commonly has over 10% relative error with respect to \(\hbox {BW}K\hbox {M}\).

In Table 4, we present the proportion of cases in which \(\hbox {BW}K\hbox {M}\) generates a well assigned partition, verified via the misassignment function of Theorem 1. As we commented in Sect. A.2, this is a sufficient condition for the solution obtained via \(\hbox {BW}K\hbox {M}\) to also be a fixed point of Lloyd’s algorithm for the entire dataset. We observe that, especially for a low number of clusters, \(\hbox {BW}K\hbox {M}\) is very likely to converge to a local minimum of the K-means problem. For instance, for the WUY dataset and \(K \in \{3,5\}\), \(\hbox {BW}K\hbox {M}\) always generated well assigned partitions, which is quite remarkable as the number of representatives in these cases is under 0.6% of the number of instances. As expected, as the number of clusters increases, it becomes harder to verify this condition; however, we must remember that it is only a sufficient condition, since we use Theorem 1 rather than computing all the pairwise instance-centroid distances.

Table 4 Proportion of cases in which the spatial partition obtained by \(\hbox {BW}K\hbox {M}\) satisfies Theorem 3, verified via the misassignment function of Theorem 1

From the results presented in this section, it is clear that \(\hbox {BW}K\hbox {M}\) alleviates the main drawback of the grid based \(\hbox {RP}K\hbox {M}\), as it controls the growth of the number of representatives, which, in the worst case scenario, is only linear. This is an important factor, as it allows \(\hbox {BW}K\hbox {M}\) to scale better with respect to both the dimensionality and the number of iterations. Furthermore, \(\hbox {BW}K\hbox {M}\) is still an \(\hbox {RP}K\hbox {M}\) type approach, meaning that, besides the theoretical guarantees developed throughout this article and the results just discussed, \(\hbox {BW}K\hbox {M}\) also inherits the guarantees of the grid based \(\hbox {RP}K\hbox {M}\).

Cite this article

Capó, M., Pérez, A. & Lozano, J.A. An efficient K-means clustering algorithm for tall data. Data Min Knowl Disc 34, 776–811 (2020). https://doi.org/10.1007/s10618-020-00678-9
