Improved Spectral-Norm Bounds for Clustering

Awasthi, Pranjal; Sheffet, Or

doi:10.1007/978-3-642-32512-0_4

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7408)

Included in the following conference series:

International Workshop on Approximation Algorithms for Combinatorial Optimization
International Workshop on Randomization and Approximation Techniques in Computer Science

Abstract

Aiming to unify known results about clustering mixtures of distributions under separation conditions, Kumar and Kannan [1] introduced a deterministic condition for clustering datasets. They showed that this single deterministic condition encompasses many previously studied clustering assumptions. More specifically, their proximity condition requires that in the target k-clustering, the projection of a point x onto the line joining its cluster center μ and some other center μ′ is closer to μ than to μ′ by a large additive factor. This additive factor can be roughly described as k times the spectral norm of the matrix representing the differences between the given (known) dataset and the means of the (unknown) target clustering. Clearly, the proximity condition implies center separation: the distance between any two centers must be at least as large as the above-mentioned bound.
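
The quantities in the proximity condition described above can be computed directly once a candidate clustering is fixed. The following is a minimal sketch, not the authors' procedure: the function names are hypothetical, and the exact normalization of the additive factor is not stated in this abstract, so the code only reports the spectral norm and the per-point margin and leaves the comparison against a threshold to the caller.

```python
import numpy as np

def spectral_norm_of_residual(A, labels, k):
    """Return (centers, ||A - C||), where row i of C is the mean of
    the cluster that point i belongs to. Names are illustrative."""
    centers = np.vstack([A[labels == r].mean(axis=0) for r in range(k)])
    C = centers[labels]                      # one center row per data point
    return centers, np.linalg.norm(A - C, 2)  # ord=2 gives the largest singular value

def proximity_margins(A, labels, centers):
    """For each point x with center mu, project x onto the line through mu
    and every other center mu', and return the smallest value of
    (distance from projection to mu') - (distance from projection to mu)."""
    n, k = A.shape[0], centers.shape[0]
    margins = np.full(n, np.inf)
    for i in range(n):
        mu = centers[labels[i]]
        for s in range(k):
            if s == labels[i]:
                continue
            direction = centers[s] - mu
            dist = np.linalg.norm(direction)
            t = (A[i] - mu) @ (direction / dist)  # coordinate along the line: 0 at mu, dist at mu'
            margins[i] = min(margins[i], abs(t - dist) - abs(t))
    return margins

# Toy data (an assumption for illustration): two well-separated planar clusters.
rng = np.random.default_rng(0)
A = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(20.0, 1.0, (50, 2))])
labels = np.repeat([0, 1], 50)

centers, norm = spectral_norm_of_residual(A, labels, 2)
margins = proximity_margins(A, labels, centers)
```

A point satisfies a proximity-style condition with additive factor `delta` when `margins[i] >= delta`; choosing `delta` as a multiple of `norm` mirrors the rough description in the abstract, but the precise constant and scaling are assumptions that must be taken from the paper itself.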

In this paper we improve upon the work of Kumar and Kannan [1] along several axes. First, we weaken the center separation bound by a factor of \(\sqrt{k}\), and second, we weaken the proximity condition by a factor of k (in other words, the revised separation condition is independent of k). Using these weaker bounds we still achieve the same guarantees when all points satisfy the proximity condition. Under the same weaker bounds, we achieve even better guarantees when only a (1 − ε)-fraction of the points satisfy the condition. Specifically, we correctly cluster all but an (ε + O(1/c⁴))-fraction of the points, compared to the O(k²ε)-fraction of [1], which is meaningful even in the particular setting when ε is a constant and k = ω(1). Most importantly, we greatly simplify the analysis of Kumar and Kannan. In fact, in the bulk of our analysis we ignore the proximity condition and use only center separation, along with the simple triangle and Markov inequalities. Yet these basic tools suffice to produce a clustering which (i) is correct on all but a constant fraction of the points, (ii) has k-means cost comparable to the k-means cost of the target clustering, and (iii) has centers very close to the target centers.
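
To see why the comparison of the two error bounds matters in the regime of constant ε and growing k, the following short calculation may help. It is an informal reading of the abstract, with the hidden constants in the O(·) terms treated as fixed positive numbers (an assumption):

\[
\varepsilon \text{ fixed},\quad k = \omega(1)
\;\Longrightarrow\;
k^{2}\varepsilon \to \infty ,
\qquad\text{whereas}\qquad
\varepsilon + O\!\left(\frac{1}{c^{4}}\right) \text{ does not depend on } k .
\]

A bound on the fraction of misclustered points is informative only while it stays below 1. Under this reading, a bound of order \(k^{2}\varepsilon\) eventually exceeds 1 and says nothing, while a bound of \(\varepsilon + O(1/c^{4})\) remains below 1 whenever ε is a small enough constant and c is large enough.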

Our improved separation condition allows us to match the results of the Planted Partition Model of McSherry [2], improve upon the results of Ostrovsky et al. [3], and improve separation results for mixture-of-Gaussians models in a particular setting.

References

1. Kumar, A., Kannan, R.: Clustering with spectral norm and the k-means algorithm. In: FOCS (2010)
2. McSherry, F.: Spectral partitioning of random graphs. In: FOCS (2001)
3. Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. In: FOCS, pp. 165–176 (2006)
4. Dasgupta, S.: Learning mixtures of Gaussians. In: FOCS (1999)
5. Dasgupta, S., Schulman, L.: A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. J. Mach. Learn. Res. (2007)
6. Arora, S., Kannan, R.: Learning mixtures of arbitrary Gaussians. In: STOC (2001)
7. Vempala, S., Wang, G.: A spectral algorithm for learning mixtures of distributions. Journal of Computer and System Sciences (2002)
8. Achlioptas, D., McSherry, F.: On Spectral Learning of Mixtures of Distributions. In: Auer, P., Meir, R. (eds.) COLT 2005. LNCS (LNAI), vol. 3559, pp. 458–469. Springer, Heidelberg (2005)
9. Chaudhuri, K., Rao, S.: Learning mixtures of product distributions using correlations and independence. In: COLT (2008)
10. Kannan, R., Salmasian, H., Vempala, S.: The spectral method for general mixture models. SIAM J. Comput. (2008)
11. Dasgupta, A., Hopcroft, J., Kannan, R., Mitra, P.: Spectral clustering with limited independence. In: SODA (2007)
12. Brubaker, S.C., Vempala, S.: Isotropic PCA and affine-invariant clustering. In: FOCS (2008)
13. Coja-Oghlan, A.: Graph partitioning via adaptive spectral techniques. Comb. Probab. Comput. 19, 227–284 (2010)
14. Awasthi, P., Sheffet, O.: Improved spectral-norm bounds for clustering, full version (2012), http://arxiv.org/abs/1206.3204
15. Kumar, A., Sabharwal, Y., Sen, S.: A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions. In: FOCS (2004)
16. Cohen, W.W., Richman, J.: Learning to match and cluster large high-dimensional data sets for data integration. In: KDD, pp. 475–480 (2002)
17. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247(4), 536–540 (1995)
18. Kannan, R., Vempala, S.: Spectral algorithms. Found. Trends Theor. Comput. Sci. (March 2009)
19. Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn. Johns Hopkins University Press, Baltimore (1996)
20. Chaudhuri, K., Rao, S.: Beyond Gaussians: Spectral methods for learning mixtures of heavy-tailed distributions. In: COLT (2008)
21. Kalai, A.T., Moitra, A., Valiant, G.: Efficiently learning mixtures of two Gaussians. In: STOC 2010, pp. 553–562 (2010)
22. Moitra, A., Valiant, G.: Settling the polynomial learnability of mixtures of Gaussians. In: FOCS 2010 (2010)
23. Belkin, M., Sinha, K.: Polynomial learning of distribution families. Computing Research Repository abs/1004.4, 103–112 (2010)
24. Schulman, L.J.: Clustering for edge-cost minimization (extended abstract). In: STOC, pp. 547–555 (2000)
25. Balcan, M.F., Blum, A., Gupta, A.: Approximate clustering without the approximation. In: SODA, pp. 1068–1077 (2009)

Author information

Authors and Affiliations

Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA, 15213, USA
Pranjal Awasthi & Or Sheffet

Editor information

Editors and Affiliations

Department of Computer Science, Carnegie Mellon University, Gates Building 7203, 15213, Pittsburgh, PA, USA
Anupam Gupta
Department of Computer Science, University Kiel, Olshausenstraße 40, 24098, Kiel, Germany
Klaus Jansen
Centre Universitaire d’Informatique, University of Geneva, Battelle A, 7 route de Drize, 1227, Carouge, Switzerland
José Rolim
Department of Computer Science, Foundation School of Engineering and Applied Science, Columbia University, Amsterdam Avenue 1214, 10027-7003, New York, NY, USA
Rocco Servedio

About this paper

Cite this paper

Awasthi, P., Sheffet, O. (2012). Improved Spectral-Norm Bounds for Clustering. In: Gupta, A., Jansen, K., Rolim, J., Servedio, R. (eds) Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. APPROX RANDOM 2012. Lecture Notes in Computer Science, vol 7408. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32512-0_4

DOI: https://doi.org/10.1007/978-3-642-32512-0_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32511-3
Online ISBN: 978-3-642-32512-0
eBook Packages: Computer Science, Computer Science (R0)
