A Simple D2-Sampling Based PTAS for k-Means and other Clustering Problems

  • Ragesh Jaiswal
  • Amit Kumar
  • Sandeep Sen
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7434)


Given a set of points P ⊂ ℝ^d, the k-means clustering problem is to find a set of k centers C = {c_1, ..., c_k}, c_i ∈ ℝ^d, such that the objective function ∑_{x ∈ P} d(x, C)², where d(x, C) denotes the distance between x and the closest center in C, is minimized. This is one of the most prominent objective functions that have been studied with respect to clustering.
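The objective above is straightforward to compute directly; a minimal sketch (the function name `kmeans_cost` is ours, not from the paper):

```python
def kmeans_cost(points, centers):
    """k-means objective: sum over all points of the squared
    Euclidean distance to the nearest center in `centers`."""
    total = 0.0
    for x in points:
        # squared distance to the closest center
        total += min(sum((xi - ci) ** 2 for xi, ci in zip(x, c))
                     for c in centers)
    return total
```

For example, with points {(0,0), (1,0), (10,0)} and centers {(0,0), (10,0)}, only the middle point contributes, giving a cost of 1.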

D²-sampling [1] is a simple non-uniform sampling technique for choosing points from a set of points. It works as follows: given a set of points P ⊆ ℝ^d, the first point is chosen uniformly at random from P. Subsequently, each point of P is chosen as the next sample with probability proportional to the square of its distance to the nearest previously sampled point.
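The procedure just described can be sketched as follows; this is an illustrative implementation (the name `d2_sample` and the incremental distance update are ours), not code from the paper:

```python
import random

def d2_sample(points, k, rng=random):
    """D^2-sampling: pick the first point uniformly at random, then
    pick each subsequent point with probability proportional to its
    squared distance to the nearest already-sampled point."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    centers = [rng.choice(points)]
    # dist2[i] = squared distance from points[i] to its nearest sampled point
    dist2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        total = sum(dist2)
        if total == 0:  # every point coincides with a sampled point
            centers.append(rng.choice(points))
            continue
        # draw an index with probability dist2[i] / total
        r = rng.uniform(0, total)
        acc = 0.0
        for i, w in enumerate(dist2):
            acc += w
            if acc >= r:
                break
        centers.append(points[i])
        # update each point's distance to its nearest sampled point
        dist2 = [min(d, sq_dist(p, points[i])) for d, p in zip(dist2, points)]
    return centers
```

With k = 2, a point far from the first sample is much more likely to be chosen second, which is exactly the bias that makes this seeding effective.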

D²-sampling has been shown to have nice properties with respect to the k-means clustering problem. Arthur and Vassilvitskii [1] show that k points chosen as centers from P using D²-sampling give an O(log k) approximation in expectation. Ailon et al. [2] and Aggarwal et al. [3] extended the results of [1] to show that O(k) points chosen as centers using D²-sampling give an O(1) approximation to the k-means objective function with high probability. In this paper, we further demonstrate the power of D²-sampling by giving a simple randomized (1 + ε)-approximation algorithm that uses D²-sampling at its core.






  1. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035 (2007)
  2. Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. In: Advances in Neural Information Processing Systems, vol. 22, pp. 10–18 (2009)
  3. Aggarwal, A., Deshpande, A., Kannan, R.: Adaptive sampling for k-means clustering. In: Dinur, I., Jansen, K., Naor, J., Rolim, J. (eds.) APPROX 2009. LNCS, vol. 5687, pp. 15–28. Springer, Heidelberg (2009)
  4. Broder, A., Glassman, S., Manasse, M., Zweig, G.: Syntactic clustering of the web
  5. Faloutsos, C., Barber, R., Flickner, M., Hafner, J.: Efficient and effective querying by image content. Journal of Intelligent Information Systems (1994)
  6. Deerwester, S., Dumais, S., Landauer, T., Furnas, G., Harshman, A.: Indexing by latent semantic analysis. Journal of the American Society for Information Science (1990)
  7. Swain, M., Ballard, D.: Color indexing. International Journal of Computer Vision (1991)
  8. Dasgupta, S.: The hardness of k-means clustering. Technical Report CS2008-0916, Department of Computer Science and Engineering, University of California San Diego (2008)
  9. Lloyd, S.: Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137 (1982)
  10. Arthur, D., Vassilvitskii, S.: How slow is the k-means method? In: Proc. 22nd Annual Symposium on Computational Geometry, pp. 144–153 (2006)
  11. Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. In: Proc. 47th IEEE FOCS, pp. 165–176 (2006)
  12. Ackermann, M.R., Blömer, J.: Coresets and approximate clustering for Bregman divergences. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 1088–1097 (2009)
  13. Chen, K.: On k-median clustering in high dimensions. In: SODA, pp. 1177–1185 (2006)
  14. Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Symposium on Computational Geometry, pp. 11–18 (2007)
  15. Inaba, M., Katoh, N., Imai, H.: Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In: Proceedings of the Tenth Annual Symposium on Computational Geometry, pp. 332–339 (1994)
  16. Matousek, J.: On approximate geometric k-clustering. Discrete and Computational Geometry (2000)
  17. Badoiu, M., Har-Peled, S., Indyk, P.: Approximate clustering via core-sets. In: STOC, pp. 250–257 (2002)
  18. de la Vega, W.F., Karpinski, M., Kenyon, C., Rabani, Y.: Approximation schemes for clustering problems. In: ACM Symposium on Theory of Computing, pp. 50–58 (2003)
  19. Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: ACM Symposium on Theory of Computing, pp. 291–300 (2004)
  20. Kumar, A., Sabharwal, Y., Sen, S.: Linear-time approximation schemes for clustering problems in any dimensions. J. ACM 57(2) (2010)
  21. Awasthi, P., Blum, A., Sheffet, O.: Stability yields a PTAS for k-median and k-means clustering. In: FOCS, pp. 309–318 (2010)
  22. Har-Peled, S., Sadri, B.: How fast is the k-means method? In: ACM-SIAM Symposium on Discrete Algorithms, pp. 877–885 (2005)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Ragesh Jaiswal (1)
  • Amit Kumar (1)
  • Sandeep Sen (1)
  1. Department of Computer Science and Engineering, Indian Institute of Technology Delhi, India
