COCOON 2012: Computing and Combinatorics pp 13-24

# A Simple D2-Sampling Based PTAS for k-Means and other Clustering Problems

• Ragesh Jaiswal
• Amit Kumar
• Sandeep Sen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7434)

## Abstract

Given a set of points P ⊂ ℝ^d, the k-means clustering problem is to find a set of k centers C = {c₁, …, c_k}, cᵢ ∈ ℝ^d, such that the objective function ∑_{x ∈ P} d(x, C)², where d(x, C) denotes the distance between x and the closest center in C, is minimized. This is one of the most prominent objective functions that have been studied with respect to clustering.
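To make the objective concrete, the cost of a candidate center set can be computed directly from this definition. The sketch below is illustrative only (the function name `kmeans_cost` is not from the paper):

```python
import numpy as np

def kmeans_cost(P, C):
    """k-means objective: sum over x in P of squared distance to nearest center in C."""
    # Pairwise squared distances, shape (|P|, |C|).
    d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    # For each point, keep the distance to its closest center, then sum.
    return d2.min(axis=1).sum()

P = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
C = np.array([[0.0, 0.0], [10.0, 0.0]])
print(kmeans_cost(P, C))  # 1.0: only the middle point pays cost 1
```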

D²-sampling [1] is a simple non-uniform sampling technique for choosing points from a set of points. It works as follows: given a set of points P ⊆ ℝ^d, the first point is chosen uniformly at random from P. Subsequently, a point from P is chosen as the next sample with probability proportional to the square of its distance to the nearest previously sampled point.
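The procedure above can be sketched in a few lines; this is a minimal illustration of the sampling rule, not the authors' implementation (the function name `d2_sample` is hypothetical):

```python
import numpy as np

def d2_sample(P, k, rng=None):
    """Pick k points from P by D^2-sampling: first uniform, then
    proportional to squared distance to the nearest sample so far."""
    rng = np.random.default_rng(rng)
    n = len(P)
    centers = [P[rng.integers(n)]]  # first sample: uniform over P
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        C = np.array(centers)
        d2 = ((P[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next point with probability proportional to d2.
        centers.append(P[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centers)

P = np.array([[0.0, 0.0], [1.0, 0.0], [100.0, 0.0]])
C = d2_sample(P, 2, rng=0)  # the far-away point is very likely picked
```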

D²-sampling has been shown to have nice properties with respect to the k-means clustering problem. Arthur and Vassilvitskii [1] show that k points chosen as centers from P using D²-sampling give an O(log k) approximation in expectation. Ailon et al. [2] and Aggarwal et al. [3] extended the results of [1] to show that O(k) points chosen as centers using D²-sampling give an O(1) approximation to the k-means objective function with high probability. In this paper, we further demonstrate the power of D²-sampling by giving a simple randomized (1 + ε)-approximation algorithm that uses D²-sampling at its core.

## Keywords

Mahalanobis distance, clustering problem, dissimilarity measure, optimal cluster, randomized algorithm
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


## References

1. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035 (2007)
2. Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. In: Advances in Neural Information Processing Systems, vol. 22, pp. 10–18 (2009)
3. Aggarwal, A., Deshpande, A., Kannan, R.: Adaptive sampling for k-means clustering. In: Dinur, I., Jansen, K., Naor, J., Rolim, J. (eds.) APPROX 2009. LNCS, vol. 5687, pp. 15–28. Springer, Heidelberg (2009)
4. Broder, A., Glassman, S., Manasse, M., Zweig, G.: Syntactic clustering of the web
5. Faloutsos, C., Barber, R., Flickner, M., Hafner, J.: Efficient and effective querying by image content. Journal of Intelligent Information Systems (1994)
6. Deerwester, S., Dumais, S., Landauer, T., Furnas, G., Harshman, A.: Indexing by latent semantic analysis. Journal of the American Society for Information Science (1990)
7. Swain, M., Ballard, D.: Color indexing. International Journal of Computer Vision (1991)
8. Dasgupta, S.: The hardness of k-means clustering. Technical Report CS2008-0916, Department of Computer Science and Engineering, University of California San Diego (2008)
9. Lloyd, S.: Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137 (1982)
10. Arthur, D., Vassilvitskii, S.: How slow is the k-means method? In: Proc. 22nd Annual Symposium on Computational Geometry, pp. 144–153 (2006)
11. Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. In: Proc. 47th IEEE FOCS, pp. 165–176 (2006)
12. Ackermann, M.R., Blömer, J.: Coresets and approximate clustering for Bregman divergences. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 1088–1097 (2009)
13. Chen, K.: On k-median clustering in high dimensions. In: SODA, pp. 1177–1185 (2006)
14. Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Symposium on Computational Geometry, pp. 11–18 (2007)
15. Inaba, M., Katoh, N., Imai, H.: Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In: Proceedings of the Tenth Annual Symposium on Computational Geometry, pp. 332–339 (1994)
16. Matousek, J.: On approximate geometric k-clustering. Discrete and Computational Geometry (2000)
17. Badoiu, M., Har-Peled, S., Indyk, P.: Approximate clustering via core-sets. In: STOC, pp. 250–257 (2002)
18. de la Vega, W.F., Karpinski, M., Kenyon, C., Rabani, Y.: Approximation schemes for clustering problems. In: ACM Symposium on Theory of Computing, pp. 50–58 (2003)
19. Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: ACM Symposium on Theory of Computing, pp. 291–300 (2004)
20. Kumar, A., Sabharwal, Y., Sen, S.: Linear-time approximation schemes for clustering problems in any dimensions. J. ACM 57(2) (2010)
21. Awasthi, P., Blum, A., Sheffet, O.: Stability yields a PTAS for k-median and k-means clustering. In: FOCS, pp. 309–318 (2010)
22. Har-Peled, S., Sadri, B.: How fast is the k-means method? In: ACM-SIAM Symposium on Discrete Algorithms, pp. 877–885 (2005)

## Copyright information

© Springer-Verlag Berlin Heidelberg 2012

## Authors and Affiliations

• Ragesh Jaiswal¹
• Amit Kumar¹
• Sandeep Sen¹

1. Department of Computer Science and Engineering, Indian Institute of Technology Delhi, India