Abstract
Given a set of points $P \subset \mathbb{R}^d$, the k-means clustering problem is to find a set of k centers $C = \{c_1, \ldots, c_k\}$, $c_i \in \mathbb{R}^d$, such that the objective function $\sum_{x \in P} d(x, C)^2$ is minimized, where $d(x, C)$ denotes the distance between x and the closest center in C. This is one of the most prominent objective functions that have been studied with respect to clustering.
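As an illustration of the objective just defined, the following minimal sketch evaluates the k-means cost of a candidate set of centers. It assumes Euclidean distance and points stored as tuples of floats; the names `sq_dist` and `kmeans_cost` are hypothetical helpers chosen for this example and do not come from the paper.

```python
from typing import Sequence

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def kmeans_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum over all points of the squared distance to the closest center."""
    if not centers:
        raise ValueError("at least one center is required")
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


# Four points forming two well-separated pairs (illustrative data only).
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]

# Every point lies at squared distance 1 from its nearest center.
print(kmeans_cost(points, centers))  # 4.0
```

Evaluating the cost this way takes time proportional to the number of points times the number of centers times the dimension, which is adequate for checking small examples.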
$D^2$-sampling [1] is a simple non-uniform sampling technique for choosing points from a set of points. It works as follows: given a set of points $P \subseteq \mathbb{R}^d$, the first point is chosen uniformly at random from P. Subsequently, a point from P is chosen as the next sample with probability proportional to the square of its distance to the nearest previously sampled point.
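The sampling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance, points stored as tuples, and a hypothetical function name `d2_sample`. The fallback to uniform sampling when every point coincides with an already chosen sample is an assumption added here so that the sketch always terminates; the abstract does not address that case.

```python
import random
from typing import List, Optional, Sequence

Point = Sequence[float]


def sq_dist(x: Point, c: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def d2_sample(points: Sequence[Point], k: int,
              rng: Optional[random.Random] = None) -> List[Point]:
    """Choose k points from `points` by D^2-sampling."""
    if not points:
        raise ValueError("points must be non-empty")
    if k < 1:
        raise ValueError("k must be at least 1")
    if rng is None:
        rng = random.Random()

    # First sample: uniform over the input points.
    samples = [points[rng.randrange(len(points))]]
    # Squared distance from each point to its nearest sample so far.
    d2 = [sq_dist(x, samples[0]) for x in points]

    while len(samples) < k:
        if sum(d2) == 0.0:
            # Assumption: all points coincide with chosen samples,
            # so fall back to a uniform draw.
            nxt = points[rng.randrange(len(points))]
        else:
            # Draw with probability proportional to squared distance.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        samples.append(nxt)
        # Update nearest-sample distances with the new sample.
        d2 = [min(old, sq_dist(x, nxt)) for old, x in zip(d2, points)]
    return samples


# Illustrative data: two coincident points and one far-away point.
data = [(0.0, 0.0), (0.0, 0.0), (5.0, 5.0)]
print(sorted(set(d2_sample(data, 2, random.Random(0)))))
```

In this small example the second sample always differs from the first, because points already coinciding with a chosen sample have weight zero. Maintaining the array of nearest-sample distances keeps each round linear in the number of points.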
$D^2$-sampling has been shown to have useful properties with respect to the k-means clustering problem. Arthur and Vassilvitskii [1] show that k points chosen as centers from P using $D^2$-sampling give an $O(\log k)$ approximation in expectation. Ailon et al. [2] and Aggarwal et al. [3] extended the results of [1] to show that $O(k)$ points chosen as centers using $D^2$-sampling give an $O(1)$ approximation to the k-means objective function with high probability. In this paper, we further demonstrate the power of $D^2$-sampling by giving a simple randomized $(1 + \varepsilon)$-approximation algorithm that uses $D^2$-sampling at its core.
References
1. Arthur, D., Vassilvitskii, S.: k-means++: the advantages of careful seeding. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 1027–1035 (2007)
2. Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. In: Advances in Neural Information Processing Systems, vol. 22, pp. 10–18 (2009)
3. Aggarwal, A., Deshpande, A., Kannan, R.: Adaptive Sampling for k-Means Clustering. In: Dinur, I., Jansen, K., Naor, J., Rolim, J. (eds.) APPROX 2009. LNCS, vol. 5687, pp. 15–28. Springer, Heidelberg (2009)
4. Broder, A., Glassman, S., Manasse, M., Zweig, G.: Syntactic clustering of the web
5. Faloutsos, C., Barber, R., Flickner, M., Hafner, J.: Efficient and effective querying by image content. Journal of Intelligent Information Systems (1994)
6. Deerwester, S., Dumais, S., Landauer, T., Furnas, G., Harshman, A.: Indexing by latent semantic analysis. Journal of the American Society for Information Science (1990)
7. Swain, M., Ballard, D.: Color indexing. International Journal of Computer Vision (1991)
8. Dasgupta, S.: The hardness of k-means clustering. Technical Report CS2008-0916, Department of Computer Science and Engineering, University of California San Diego (2008)
9. Lloyd, S.: Least squares quantization in PCM. IEEE Transactions on Information Theory 28(2), 129–137 (1982)
10. Arthur, D., Vassilvitskii, S.: How slow is the k-means method? In: Proc. 22nd Annual Symposium on Computational Geometry, pp. 144–153 (2006)
11. Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. In: Proc. 47th IEEE FOCS, pp. 165–176 (2006)
12. Ackermann, M.R., Blömer, J.: Coresets and approximate clustering for Bregman divergences. In: ACM-SIAM Symposium on Discrete Algorithms, pp. 1088–1097 (2009)
13. Chen, K.: On k-median clustering in high dimensions. In: SODA, pp. 1177–1185 (2006)
14. Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Symposium on Computational Geometry, pp. 11–18 (2007)
15. Inaba, M., Katoh, N., Imai, H.: Applications of weighted Voronoi diagrams and randomization to variance based k-clustering. In: Proceedings of the Tenth Annual Symposium on Computational Geometry, pp. 332–339 (1994)
16. Matousek, J.: On approximate geometric k-clustering. In: Discrete and Computational Geometry (2000)
17. Badoiu, M., Har-Peled, S., Indyk, P.: Approximate clustering via core-sets. In: STOC, pp. 250–257 (2002)
18. de la Vega, W.F., Karpinski, M., Kenyon, C., Rabani, Y.: Approximation schemes for clustering problems. In: ACM Symposium on Theory of Computing, pp. 50–58 (2003)
19. Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: ACM Symposium on Theory of Computing, pp. 291–300 (2004)
20. Kumar, A., Sabharwal, Y., Sen, S.: Linear-time approximation schemes for clustering problems in any dimensions. J. ACM 57(2) (2010)
21. Awasthi, P., Blum, A., Sheffet, O.: Stability yields a PTAS for k-median and k-means clustering. In: FOCS, pp. 309–318 (2010)
22. Har-Peled, S., Sadri, B.: How fast is the k-means method? In: ACM-SIAM Symposium on Discrete Algorithms, pp. 877–885 (2005)
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jaiswal, R., Kumar, A., Sen, S. (2012). A Simple $D^2$-Sampling Based PTAS for k-Means and other Clustering Problems. In: Gudmundsson, J., Mestre, J., Viglas, T. (eds) Computing and Combinatorics. COCOON 2012. Lecture Notes in Computer Science, vol 7434. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32241-9_2
DOI: https://doi.org/10.1007/978-3-642-32241-9_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32240-2
Online ISBN: 978-3-642-32241-9
eBook Packages: Computer Science, Computer Science (R0)