A Simple D²-Sampling Based PTAS for k-Means and Other Clustering Problems

Abstract

Given a set of points \(P \subset\mathbb{R}^{d}\), the k-means clustering problem is to find a set of k centers \(C = \{ c_{1},\ldots,c_{k}\}\), \(c_{i} \in\mathbb{R}^{d}\), minimizing the objective function \(\sum_{x \in P} e(x, C)^{2}\), where \(e(x, C)\) denotes the Euclidean distance between x and the closest center in C. This is one of the most prominent objective functions studied in clustering.
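As a minimal illustration of the objective (a sketch, not code from the paper; the point set and centers below are hypothetical):

```python
# k-means objective: sum over x in P of the squared Euclidean
# distance from x to its nearest center in C.
def kmeans_cost(P, C):
    return sum(
        min(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in C)
        for x in P
    )

# Toy instance: three points on a line, two centers.
P = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
C = [(0.5, 0.0), (10.0, 0.0)]
print(kmeans_cost(P, C))  # 0.25 + 0.25 + 0.0 = 0.5
```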

\(D^2\)-sampling (Arthur and Vassilvitskii, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’07, pp. 1027–1035, SIAM, Philadelphia, 2007) is a simple non-uniform sampling technique for choosing points from a set of points. It works as follows: given a set of points \(P \subset\mathbb{R}^{d}\), the first point is chosen uniformly at random from P. Subsequently, each point of P is chosen as the next sample with probability proportional to the square of its distance to the nearest previously sampled point.
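The procedure just described can be sketched as follows (an illustrative implementation, not the paper's code; function and variable names are ours):

```python
import random

def d2_sample(points, k, rng=random):
    """D^2-sampling: pick k points from `points` (tuples in R^d).
    First point uniform; each subsequent point is picked with
    probability proportional to its squared distance to the
    nearest previously sampled point."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    centers = [rng.choice(points)]  # first sample: uniform
    for _ in range(k - 1):
        # weight of each point = squared distance to nearest sample so far
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

# Two well-separated pairs: the two samples almost surely
# come from different pairs.
pts = [(0.0, 0.0), (0.0, 0.1), (100.0, 0.0), (100.0, 0.1)]
print(d2_sample(pts, 2, rng=random.Random(0)))
```

Note how the distance-squared weighting strongly favors points far from the current samples, which is what spreads the chosen centers across well-separated clusters.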

\(D^2\)-sampling has been shown to have nice properties with respect to the k-means clustering problem. Arthur and Vassilvitskii (Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’07, pp. 1027–1035, SIAM, Philadelphia, 2007) show that k points chosen as centers from P using \(D^2\)-sampling give an O(log k) approximation in expectation. Ailon et al. (NIPS, pp. 10–18, 2009) and Aggarwal et al. (Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques, pp. 15–28, Springer, Berlin, 2009) extended these results to show that O(k) points chosen as centers using \(D^2\)-sampling give an O(1) approximation to the k-means objective function with high probability. In this paper, we further demonstrate the power of \(D^2\)-sampling by giving a simple randomized (1+ϵ)-approximation algorithm that uses \(D^2\)-sampling at its core.

Algorithm 1
Algorithm 2

Notes

  1. The \(\tilde{O}\) notation hides an O(log k/ϵ) factor, which simplifies the expression.

  2. It can be used in conjunction with the result of Chen [14] to obtain a superior running time, but at the cost of the simplicity of our approach.

  3. It turns out that even minor perturbations from the uniform distribution can be catastrophic, and indeed in this paper we had to work around this issue.

References

  1. Ackermann, M.R.: Algorithms for the Bregman k-Median Problem. Ph.D. Thesis, University of Paderborn (2010)

  2. Ackermann, M.R., Blömer, J.: Coresets and approximate clustering for Bregman divergences. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’09, pp. 1088–1097. SIAM, Philadelphia (2009)

  3. Ackermann, M.R., Blömer, J.: Bregman clustering for separable instances. In: Proceedings of the 12th Scandinavian Conference on Algorithm Theory, SWAT’10, pp. 212–223. Springer, Berlin (2010)

  4. Ackermann, M.R., Blömer, J., Sohler, C.: Clustering for metric and nonmetric distance measures. ACM Trans. Algorithms 6, 59 (2010)

  5. Aggarwal, A., Deshpande, A., Kannan, R.: Adaptive sampling for k-means clustering. In: Dinur, I., Jansen, K., Naor, J., Rolim, J. (eds.) Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Lecture Notes in Computer Science, vol. 5687, pp. 15–28. Springer, Berlin (2009)

  6. Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. In: NIPS, pp. 10–18 (2009)

  7. Arthur, D., Manthey, B., Röglin, H.: Smoothed analysis of the k-means method. J. ACM 19, 1 (2011)

  8. Arthur, D., Vassilvitskii, S.: How slow is the k-means method? In: Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, SCG’06, pp. 144–153. ACM, New York (2006)

  9. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’07, pp. 1027–1035. SIAM, Philadelphia (2007)

  10. Awasthi, P., Blum, A., Sheffet, O.: Stability yields a PTAS for k-median and k-means clustering. In: Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS’10, pp. 309–318. IEEE Comput. Soc., Los Alamitos (2010)

  11. Bădoiu, M., Har-Peled, S., Indyk, P.: Approximate clustering via core-sets. In: Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, STOC’02, pp. 250–257. ACM, New York (2002)

  12. Banerjee, A., Merugu, S., Dhillon, I.S., Ghosh, J.: Clustering with Bregman divergences. J. Mach. Learn. Res. 6, 1705–1749 (2005)

  13. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Comput. Netw. ISDN Syst. 29(8–13), 1157–1166 (1997)

  14. Chen, K.: On k-median clustering in high dimensions. In: Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’06, pp. 1177–1185. ACM, New York (2006)

  15. Dasgupta, S.: The hardness of k-means clustering. Technical Report CS2008-0916, Department of Computer Science and Engineering, University of California, San Diego (2008)

  16. Fernandez de la Vega, W., Karpinski, M., Kenyon, C., Rabani, Y.: Approximation schemes for clustering problems. In: Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing, STOC’03, pp. 50–58. ACM, New York (2003)

  17. Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, A.R.: Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, 6 (1990)

  18. Faloutsos, C., Barber, R., Flickner, M., Hafner, J., Niblack, W., Petkovic, D., Equitz, W.: Efficient and effective querying by image content. J. Intell. Inf. Syst. 3(3–4), 231–262 (1994)

  19. Feldman, D., Langberg, M.: A unified framework for approximating and clustering data. In: Proceedings of the 43rd Annual ACM Symposium on Theory of Computing, STOC’11, pp. 569–578. ACM, New York (2011)

  20. Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Proceedings of the Twenty-Third Annual Symposium on Computational Geometry, SCG’07, pp. 11–18. ACM, New York (2007)

  21. Feldman, D., Schmidt, M., Sohler, C.: Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering. In: Proceedings of the Twenty-Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’13. SIAM, Philadelphia (2013)

  22. Feldman, D., Schulman, L.J.: Data reduction for weighted and outlier-resistant clustering. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’12, pp. 1343–1354. SIAM, Philadelphia (2012)

  23. Frahling, G., Sohler, C.: Coresets in dynamic geometric data streams. In: Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, STOC’05, pp. 209–217. ACM, New York (2005)

  24. Har-Peled, S., Kushal, A.: Smaller coresets for k-median and k-means clustering. In: Proceedings of the Twenty-First Annual Symposium on Computational Geometry, SCG’05, pp. 126–134. ACM, New York (2005)

  25. Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, STOC’04, pp. 291–300. ACM, New York (2004)

  26. Har-Peled, S., Sadri, B.: How fast is the k-means method? In: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’05, pp. 877–885. SIAM, Philadelphia (2005)

  27. Inaba, M., Katoh, N., Imai, H.: Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract). In: Proceedings of the Tenth Annual Symposium on Computational Geometry, SCG’94, pp. 332–339. ACM, New York (1994)

  28. Kumar, A., Sabharwal, Y., Sen, S.: Linear-time approximation schemes for clustering problems in any dimensions. J. ACM 5(2), 1–32 (2010)

  29. Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)

  30. Matoušek, J.: On approximate geometric k-clustering. Discrete Comput. Geom. 24(1), 61–84 (2000)

  31. Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. J. ACM 28, 1 (2013)

  32. Swain, M., Ballard, D.: Color indexing. Int. J. Comput. Vis. 7(1), 11–32 (1991)

  33. Vattani, A.: k-means requires exponentially many iterations even in the plane. In: Proceedings of the 25th Annual Symposium on Computational Geometry, SCG’09, pp. 324–332. ACM, New York (2009)

Author information

Corresponding author

Correspondence to Ragesh Jaiswal.

Appendix: A Sampling Lemma

Here we argue that the approximation analysis of Algorithm 1 in Sect. 4 carries over to Algorithm 2. The following sampling lemma will be useful for this.

Lemma 14

(Sampling Lemma)

Let \(M \leq N\) be positive integers and 0<α≤1. Let P be a finite set of elements. Consider multisets of size N consisting of elements from P, and let \(\mathcal{S}\) denote a set of such multisets. Let \(\mathcal{D}\) be an arbitrary distribution on the elements of \(\mathcal{S}\); sampling an element from this distribution is denoted by \(S \leftarrow\mathcal{D}\). Fix an ordering of the elements of P, which induces an ordering of the elements in the multisets of \(\mathcal{S}\), and fix an ordering of the subsets of size M of any set of size N, so that for all \(i \in\{1, \ldots, \binom{N}{M}\}\) the ith subset of a multiset \(S \in\mathcal{S}\) is well defined. Let \(f: \mathcal{S} \times\{1, \ldots, \binom{N}{M}\} \rightarrow\{0, 1\}\) be any function denoting some relation between S and its ith subset, and let T be any subset of \(\{1, \ldots, \binom{N}{M}\}\). Consider the following two randomized procedures:

Procedure A: sample \(S \leftarrow\mathcal{D}\) once; output 1 if there exists an \(i \in T\) with f(S,i)=1, and 0 otherwise.

Procedure B: for each \(i \in T\), sample a fresh \(S \leftarrow\mathcal{D}\); output 1 if f(S,i)=1 in some iteration i, and 0 otherwise.

If Pr[A outputs 1] ≥ α, then Pr[B outputs 1] ≥ 1−e^{−α}.

Proof

Let n=|T|, and let \(\mathcal{D}(S)\) denote the probability of sampling S from \(\mathcal{S}\). Consider a complete bipartite graph whose nodes on the left are the members of \(\mathcal{S}\) and whose nodes on the right are the elements of T. We label an edge (S,i) red if f(S,i)=1 and black otherwise, and give edge (S,i) weight \(\frac{\mathcal{D}(S)}{n}\), so that the weights of all edges in this complete bipartite graph sum to 1. We first observe that the total weight of the red edges is at least α/n: every S with f(S,i)=1 for some i contributes at least one red edge of weight \(\mathcal{D}(S)/n\), and \(\sum_{S \in\mathcal{S}, \exists i\ f(S, i)=1} \mathcal{D}(S) = \mathbf{Pr}[A \mbox{ outputs } 1]\geq \alpha\). Now let w_i denote the total weight of the red edges incident on node i on the right. The probability that iteration i of B does not output 1 is (1−n·w_i). So, the probability that B outputs 0 is at most

$$\prod_{i=1}^{n} (1 - n\cdot w_i) \leq e^{-n \cdot\sum_{i=1}^{n} w_i} = e^{-n \cdot{\scriptsize\mbox{(total weight of red edges)}}} \leq e^{-n \cdot (\alpha/n)} = e^{-\alpha} $$

So, Pr[B outputs 1] ≥ 1−e^{−α}. □

The following argument shows that the analysis of Algorithm 1 extends to Algorithm 2. In each step of Algorithm 1, we sample a multiset S with \(D^2\)-sampling and then iterate over all subsets of S of size M. Suppose the probability that there is an index i such that the ith subset of S is good (in the sense that its centroid is close to an optimal center) is at least 3/4; we indeed show this in Sect. 4. In Algorithm 2, on the other hand, we iterate over all indices i, and in the iteration corresponding to index i we sample a fresh multiset S with \(D^2\)-sampling. We succeed in finding a good center if there is an index i such that the ith subset of the multiset sampled in iteration i is good. We note that if the success probability in the first randomized procedure is large, then the probability of success in the second procedure is also large: the two procedures correspond exactly to procedures A and B in the Sampling Lemma above. Using Lemma 14, we get that the success probability in any iteration of step 1(b) of Algorithm 2 is at least (1−e^{−3/4}) ≥ 1/2. So, the probability that Algorithm 2 succeeds in finding a set of k good centers is at least 1/2^k. Since we repeat 2^k times, we get that Algorithm 2 succeeds with high probability.
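The relationship between the two procedures can be checked empirically on a toy instance (a sketch with hypothetical parameters, not the paper's setting: the "multisets" are abstract items, \(\mathcal{D}\) is uniform, and f is an arbitrary fixed 0/1 relation):

```python
import math
import random

random.seed(1)
items = list(range(8))          # stand-ins for the multisets in S
T = list(range(4))              # the index set T
# arbitrary fixed relation f: (item, index) -> {0, 1}
f = {(s, i): random.random() < 0.3 for s in items for i in T}

def proc_A():
    # sample S once; succeed if SOME index i in T works for it
    s = random.choice(items)
    return any(f[(s, i)] for i in T)

def proc_B():
    # a fresh sample per index i; succeed if iteration i works
    return any(f[(random.choice(items), i)] for i in T)

trials = 100_000
pA = sum(proc_A() for _ in range(trials)) / trials
pB = sum(proc_B() for _ in range(trials)) / trials
# Lemma 14 predicts: Pr[B succeeds] >= 1 - e^(-Pr[A succeeds])
print(pA, pB, 1 - math.exp(-pA))
```

The lemma's bound holds for any distribution and any relation f; uniform sampling here is only for simplicity of the demonstration.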

About this article

Cite this article

Jaiswal, R., Kumar, A. & Sen, S. A Simple D²-Sampling Based PTAS for k-Means and Other Clustering Problems. Algorithmica 70, 22–46 (2014). https://doi.org/10.1007/s00453-013-9833-9


Keywords

  • k-Means clustering
  • PTAS
  • Sampling
  • k-Median