Abstract
Recently, Bandeira (C R Math, 2015) introduced a new type of algorithm (the so-called probably certifiably correct algorithm) that combines fast solvers with the optimality certificates provided by convex relaxations. In this paper, we devise such an algorithm for the problem of k-means clustering. First, we prove that Peng and Wei’s semidefinite relaxation of k-means (SIAM J Optim 18(1):186–205, 2007) is tight with high probability under a distribution of planted clusters called the stochastic ball model. Our proof follows from a new dual certificate for integral solutions of this semidefinite program. Next, we show how to test the optimality of a proposed k-means solution using this dual certificate in quasilinear time. Finally, we analyze a version of spectral clustering from Peng and Wei (SIAM J Optim 18(1):186–205, 2007) that is designed to solve k-means in the case of two clusters. In particular, we show that this quasilinear-time method typically recovers planted clusters under the stochastic ball model.
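The setting of the main results can be illustrated with a minimal numerical sketch of the stochastic ball model with two planted clusters. The code below is not the authors' implementation, and all parameter choices (dimension, separation, sample size) are hypothetical; it only demonstrates that well-separated unit balls make the planted clustering easy to recover and to compare against alternatives via the k-means objective.

```python
import numpy as np

def stochastic_ball_model(centers, n, rng):
    """Draw n points uniformly from the unit ball around each center."""
    m = centers.shape[1]
    points, labels = [], []
    for a, c in enumerate(centers):
        # uniform sampling from the unit m-ball: random direction times
        # a radius distributed as U^(1/m)
        d = rng.standard_normal((n, m))
        d /= np.linalg.norm(d, axis=1, keepdims=True)
        r = rng.random(n) ** (1.0 / m)
        points.append(c + d * r[:, None])
        labels.append(np.full(n, a))
    return np.vstack(points), np.concatenate(labels)

def kmeans_value(X, labels):
    """k-means objective: total squared distance to cluster centroids."""
    val = 0.0
    for a in np.unique(labels):
        P = X[labels == a]
        val += ((P - P.mean(axis=0)) ** 2).sum()
    return val

rng = np.random.default_rng(0)
centers = np.array([[-3.0, 0.0], [3.0, 0.0]])  # separation 6 > 2: disjoint balls
X, planted = stochastic_ball_model(centers, n=50, rng=rng)

# With disjoint balls, thresholding the first coordinate recovers the
# planted clusters exactly.
recovered = (X[:, 0] > 0).astype(int)
assert np.array_equal(recovered, planted)
```

Moving any single point to the wrong cluster strictly increases `kmeans_value`, which is the kind of comparison an optimality certificate makes rigorous for all alternative clusterings at once.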
References
Abbe, E., Bandeira, A.S., Hall, G.: Exact recovery in the stochastic block model. IEEE Trans. Inf. Theory 62(1), 471–487 (2016)
Abbe, E., Sandon, C.: Community detection in general stochastic block models: fundamental limits and efficient algorithms for recovery. In: IEEE 56th Annual Symposium on Foundations of Computer Science, FOCS 2015, Berkeley, CA, USA, pp. 670–688, 17–20 October 2015
Arthur, D., Vassilvitskii, S.: k-Means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (2007)
Awasthi, P., Bandeira, A.S., Charikar, M., Krishnaswamy, R., Villar, S., Ward, R.: Relax, no need to round: integrality of clustering formulations. In: Proceedings of the 2015 Conference on Innovations in Theoretical Computer Science, pp. 191–200. ACM (2015)
Bandeira, A.S.: A note on probably certifiably correct algorithms. C. R. Math. 354(3), 329–333 (2015)
Chen, H., Peng, J.: 0–1 Semidefinite programming for graph-cut clustering: modelling and approximation. In: Data Mining and Mathematical Programming. CRM Proceedings and Lecture Notes of the American Mathematical Society, pp. 15–40 (2008)
Dhillon, I.S., Guan, Y., Kulis, B.: Kernel k-means: spectral clustering and normalized cuts. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 551–556. ACM (2004)
Dhillon, I.S., Guan, Y., Kulis, B.: Weighted graph cuts without eigenvectors a multilevel approach. IEEE Trans. Pattern Anal. Mach. Intell. 29(11), 1944–1957 (2007)
Elhamifar, E., Sapiro, G., Vidal, R.: Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery. In: Advances in Neural Information Processing Systems, pp. 19–27 (2012)
Golub, G.H., Van Loan, C.F.: Matrix Computations, vol. 3. JHU Press, Baltimore (2012)
Grant, M., Boyd, S., Ye, Y.: Graph implementations for nonsmooth convex programs. In: Blondel, V., Boyd, S., Kimura, H. (eds.) Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences, pp. 95–110. Springer, London (2008)
Grant, M., Boyd, S.: CVX: MATLAB software for disciplined convex programming, version 2.1 (2014). http://cvxr.com/cvx
Iguchi, T., Mixon, D.G., Peterson, J., Villar, S.: On the tightness of an SDP relaxation of k-means. arXiv preprint arXiv:1505.04778 (2015)
Jain, K., Mahdian, M., Saberi, A.: A new greedy approach for facility location problems. In: Proceedings of the 34th Annual ACM Symposium on Theory of Computing (2002)
Laurent, B., Massart, P.: Adaptive estimation of a quadratic functional by model selection. Ann. Stat. 28, 1302–1338 (2000)
Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
Mallat, S.G., Zhang, Z.: Matching pursuits with time-frequency dictionaries. IEEE Trans. Sig. Process. 41(12), 3397–3415 (1993)
Mixon, D.G.: Cone programming cheat sheet. Short, Fat Matrices (weblog) (2015)
Nellore, A., Ward, R.: Recovery guarantees for exemplar-based clustering. Inf. Comput. 245, 165–180 (2015)
Nesterov, Y., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming, vol. 13. SIAM, Philadelphia (1994). doi:10.1137/1.9781611970791
Ostrovsky, R., Rabani, Y., Schulman, L., Swamy, C.: The effectiveness of Lloyd-type methods for the k-means problem. In: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (2006)
Peng, J., Wei, Y.: Approximating k-means-type clustering via semidefinite programming. SIAM J. Optim. 18(1), 186–205 (2007)
Tropp, J.A.: User-friendly tail bounds for sums of random matrices. Found. Comput. Math. 12(4), 389–434 (2012)
Vershynin, R.: Introduction to the non-asymptotic analysis of random matrices. arXiv:1011.3027v7 (2011)
Vinayak, R.K., Hassibi, B.: Similarity clustering in the presence of outliers: Exact recovery via convex program. In: IEEE International Symposium on Information Theory, ISIT 2016, Barcelona, Spain, pp. 91–95, 10–15 July 2016
Wang, H., Song, M.: Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming. R J. 3(2), 29–33 (2011)
Acknowledgements
The authors thank the anonymous referees, whose suggestions significantly improved this paper’s presentation and literature review. The authors also thank Afonso S. Bandeira and Nicolas Boumal for interesting discussions and valuable comments on an earlier version of this manuscript, and Xiaodong Li and Yang Li for interesting comments on our dual certificate. DGM was supported by an AFOSR Young Investigator Research Program award, NSF Grant No. DMS-1321779, and AFOSR Grant No. F4FGA05076J002. SV was supported by Rachel Ward’s NSF CAREER award and AFOSR Young Investigator Research Program award. The views expressed in this article are those of the authors and do not reflect the official policy or position of the United States Air Force, Department of Defense, or the U.S. Government.
Appendices
Appendix 1: Proof of Proposition 5
Proof
(a) \(\Leftrightarrow \) (b): By complementary slackness, (a) is equivalent to having both
and
Since \(Q\succeq 0\), we have
with equality if and only if \(Q1_a=0\) for every \(a\in \{1,\ldots ,k\}\). Next, we recall that \(y=z\oplus \alpha \oplus (-\beta )\), \(b-A(X)\in L=0\oplus 0\oplus \mathbb {R}_{\ge 0}^{N(N+1)/2}\), and \(b=k\oplus 1\oplus 0\). As such, (22) is equivalent to \(\beta \) having disjoint support with \(\{\langle X,\frac{1}{2}(e_ie_j^\top +e_je_i^\top )\rangle \}_{i,j=1,i\le j}^N\), i.e., \(\beta ^{(a,a)}=0\) for every cluster a.
(b) \(\Rightarrow \) (c): Take any solution to the dual SDP (8), and note that
where the 1 vectors in the second line are \(n_a\)-dimensional (instead of N-dimensional, as in the first line), and similarly for \(e_i\) (instead of \(e_{t,i}\)). We now consider each entry of \(Q^{(a,a)}1\), which is zero by assumption:
As one might expect, these \(n_a\) linear equations determine the variables \(\{\alpha _{a,i}\}_{i\in a}\). To solve this system, we first observe
and so rearranging gives
We use this identity to continue (23):
and rearranging yields the desired formula for \(\alpha _{a,r}\).
(c) \(\Rightarrow \) (a): Take any solution to the dual SDP (8). Then by assumption, the dual objective at this point is given by
i.e., the primal objective (3) evaluated at X. Since X is feasible in the primal SDP, we conclude that X is optimal by strong duality. \(\square \)
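For context, the weak-duality computation behind this last step can be sketched in generic conic form (the symbols below are generic placeholders, not the paper's (3) and (8)): for a primal \(\min \{\langle C,X\rangle : \mathcal {A}(X)=b,\ X\succeq 0\}\) and any dual-feasible \(y\) with \(C-\mathcal {A}^*(y)\succeq 0\),
\[\langle C,X\rangle - \langle b,y\rangle = \langle C,X\rangle - \langle \mathcal {A}(X),y\rangle = \langle C-\mathcal {A}^*(y),\,X\rangle \ge 0,\]
so exhibiting a dual point whose objective matches the primal value of \(X\) certifies that \(X\) is optimal.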
Appendix 2: Proof of Corollary 8
It suffices to have
We will bound the terms in (24) separately and then combine the bounds to derive a sufficient condition for Theorem 7. To bound the first term in (24), let \(\nu \) be the \(N\times 1\) vector whose (a, i)th entry is \(\Vert x_{a,i}\Vert _2^2\), and let \(\Phi \) be the \(m\times N\) matrix whose (a, i)th column is \(x_{a,i}\). Then
meaning \(D=\nu 1^\top -2\Phi ^\top \Phi +1\nu ^\top \). With this, we appeal to the blockwise definition of M (10):
For the second term in (24), we first write the decomposition
where \(H_{(a,b)}:\mathbb {R}^{n_a\times n_b}\rightarrow \mathbb {R}^{N\times N}\) produces a matrix whose (a, b)th block is the input matrix, and is otherwise zero. Then
and so the triangle inequality gives
where the last equality can be verified by considering the spectrum of the square:
At this point, we use the definition of B (13) to get
Recalling the definition of \(u_{(a,b)}\) (13) and combining these estimates then produces the result.
Appendix 3: Proof of Theorem 9
In this section, we apply the certificate from Corollary 8 to the \((\mathcal {D},\gamma ,n)\)-stochastic ball model (see Definition 2) to prove our main result. We will prove Theorem 9 with the help of several lemmas.
Lemma 16
Denote
Then the \((\mathcal {D},\gamma ,n)\)-stochastic ball model satisfies the following estimates:
Proof
Since \(\mathbb {E}r=0\) and \(\Vert r\Vert _2^2\le 1\) almost surely, one may lift
and apply the Matrix Hoeffding inequality [23] to conclude that
Taking \(t:=\epsilon n\) then gives (25). For (26) and (27), notice that the random variables in each sum are iid and confined to an interval almost surely, and so the result follows from Hoeffding’s inequality. \(\square \)
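For reference, the scalar form of Hoeffding's inequality invoked here states: if \(X_1,\ldots ,X_n\) are independent random variables with \(X_i\in [a_i,b_i]\) almost surely, then for every \(t>0\),
\[\mathbb {P}\bigg (\Big |\sum _{i=1}^n (X_i-\mathbb {E}X_i)\Big |\ge t\bigg )\le 2\exp \bigg (-\frac{2t^2}{\sum _{i=1}^n (b_i-a_i)^2}\bigg ),\]
which yields exponentially small failure probabilities once \(t\) is taken proportional to \(n\), as above.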
Lemma 17
Under the \((\mathcal {D},\gamma ,n)\)-stochastic ball model, we have \(D^{(a,b)}1-D^{(a,a)}1=4np+q\), where
and \(|q_i|\le (6+2\Delta _{ab})n\epsilon \) with probability \(1-e^{-\Omega _{m,\Delta _{ab},\epsilon }(n)}\).
Proof
Add and subtract \(O_{ab}\) and then expand the squares to get
Add and subtract \(\gamma _a-\gamma _b\) to \(c_a-c_b\) and distribute over the resulting sum to obtain
Distributing and identifying \(\Vert \gamma _a-O_{ab}\Vert _2^2=\Delta _{ab}^2/4\) explains the definition of p. To show \(|q_i|\le (6+2\Delta _{ab})n\epsilon \), apply triangle and Cauchy–Schwarz to obtain
To finish the argument, apply (25) to the first term while adding and subtracting
from the second and apply (27). \(\square \)
Lemma 18
Under the \((\mathcal {D},\gamma ,n)\)-stochastic ball model, we have
Proof
Add and subtract \(\gamma _a\) and expand the square to get
The triangle and Cauchy–Schwarz inequalities then give
where the last step occurs with probability \(1-e^{-\Omega _{\Delta _{ab},\epsilon }(n)}\) by a union bound over (26) and (25). \(\square \)
Lemma 19
Under the \((\mathcal {D},\gamma ,n)\)-stochastic ball model, we have
Proof
Lemma 17 gives
Cauchy–Schwarz along with (25) then gives the result. \(\square \)
Lemma 20
Under the \((\mathcal {D},\gamma ,n)\)-stochastic ball model, there exists \(C=C(\gamma )\) such that
where \(\displaystyle {\Delta :=\mathop {\mathop {\min }\limits _{a,b\in \{1,\ldots ,k\}}}\limits _{a\ne b}\Delta _{ab}}\).
Proof
Fix a and b. Then by Lemma 17, the following holds with probability \(1-e^{-\Omega _{m,\Delta _{ab},\epsilon }(n)}\):
where the last step is by Cauchy–Schwarz. Taking a union bound with Lemma 18 then gives
with probability \(1-e^{-\Omega _{\Delta _{ab},\epsilon }(n)}\). The result then follows from a union bound over a and b. \(\square \)
Lemma 21
Suppose \(\epsilon \le 1\). Then there exists \(C=C(\Delta _{ab},m)\) such that under the \((\mathcal {D},\gamma ,n)\)-stochastic ball model, we have
with probability \(1-e^{-\Omega _{m,\Delta _{ab},\epsilon }(n)}\).
Proof
First, a quick calculation reveals
from which it follows that
As such, we have
To bound the first term, we apply the triangle inequality over Lemma 17:
We proceed by bounding \(\Vert p\Vert _2\). To this end, note that the \(p_i\)’s are iid random variables whose outcomes lie in a finite interval (of width determined by \(\Delta _{ab}\)) with probability 1. As such, Hoeffding’s inequality gives
With this, we then have
in the same event. To determine \(\mathbb {E}p_1^2\), first take \(r_1:=e_1^\top r\). Then since the distribution of r is rotation invariant, we may write
where the second equality above is equality in distribution. We then have
We also note that \(1\ge \mathbb {E}\Vert r\Vert _2^2=m\mathbb {E}r_1^2\) by linearity of expectation, and so
Combining (29), (30), (31) and (32) then gives
To bound the second term of (28), first note that
Lemma 19 then gives
with probability \(1-e^{-\Omega _{m,\Delta _{ab},\epsilon }(n)}\). Using (28) to combine (33) with (34) and (35) then gives the result. \(\square \)
Lemma 22
There exists \(C=C(\gamma )\) such that under the \((\mathcal {D},\gamma ,n)\)-stochastic ball model, we have
Proof
Recall from (13) that
To bound the first term, we leverage Lemma 19:
with probability \(1-e^{-\Omega _{m,\Delta _{ab},\epsilon }(n)}\). To bound the second term in (36), note from Lemma 18 that
with probability \(1-e^{-\Omega _{\Delta _{ab},\epsilon }(n)}\). Next, Lemma 17 gives
By assumption, we know \(\Vert r\Vert _2\ge 1-\epsilon \) with positive probability regardless of \(\epsilon >0\). It then follows that
with some (\(\epsilon \)-dependent) positive probability. As such, we may conclude that
Combining these estimates then gives
Performing a union bound over a and b then gives
Combining these estimates then gives the result. \(\square \)
Lemma 23
Under the \((\mathcal {D},\gamma ,n)\)-stochastic ball model, we have
where \(\sigma ^2:=\mathbb {E}\Vert r\Vert _2^2\) for \(r\sim \mathcal {D}\).
Proof
Let R denote the matrix whose (a, i)th column is \(r_{a,i}\). Then
and so the triangle inequality gives
where the last estimate passes to the Frobenius norm. For the first term, since \(\mathcal {D}\) is rotation invariant, we may apply Theorem 5.41 in [24]:
For the second term, apply (25). The union bound then gives the result. \(\square \)
Proof of Theorem 9
First, we combine Lemmas 21, 22 and 23: For every \(\delta >0\), there exists an \(\epsilon >0\) such that
with probability \(1-e^{-\Omega _{\mathcal {D},\gamma ,\epsilon }(n)}\). Next, the uniform bound \(\Delta _{ab}\ge \Delta \) implies
Combining this with (37) and considering Lemma 20, it then suffices to have
Rearranging then gives
which is implied by the hypothesis since \(\Delta \ge 2\). \(\square \)
Appendix 4: Proof of Theorem 14
Put \(g=\gamma /\Vert \gamma \Vert _2\) and let z have unit 2-norm. Since \(\Vert \Phi _0^\top z\Vert _2\ge \Vert \Phi _0^\top g\Vert _2\), then considering Lemma 15, it suffices to show that the containment
holds with probability \(1-e^{-\Omega _{m,\Delta }(N)}\). To this end, we will first show that each \(v\in S_1\) is also a member of \(S_2\) with high probability, and then we will perform a union bound over an \(\epsilon \)-net of \(S_1\).
We start by considering \(\Vert \Phi ^\top v\Vert _2\) and \(\Vert \Phi ^\top g\Vert _2\). Decompose \(x_i\) as either \(\gamma +r_i\) or \(-\gamma +r_i\) depending on whether \(x_i\) belongs to the ball centered at \(\gamma \) or \(-\gamma \). Let w with \(\Vert w\Vert _2=1\) be arbitrary. Then
and so \(\mathbb {E}(x_i^\top w)^2=(\gamma ^\top w)^2+\mathbb {E}(e_1^\top r)^2\). Linearity of expectation then gives
Since \(|(x_i^\top g)^2-(x_i^\top v)^2|\le 2(1+\Delta /2)^2\) almost surely, we may apply Hoeffding’s inequality to get
For a properly chosen t, rearranging gives that \(\Vert \Phi ^\top v\Vert _2<\Vert \Phi ^\top g\Vert _2\). Instead, we will use (38) to prove the closely related inequality \(\Vert \Phi _0^\top v\Vert _2<\Vert \Phi _0^\top g\Vert _2\). Letting \(\mu \) denote the centroid of the columns of \(\Phi \), we know by (25) that \(\Vert \mu \Vert _2\le \delta \) with probability \(1-e^{-\Omega _{m,\delta }(N)}\). In this event, every w with \(\Vert w\Vert _2=1\) satisfies
Furthermore,
where the last inequality follows from Cauchy–Schwarz along with the fact that \(\Vert x_i\Vert _2\le \Delta /2+1\) for every i. Taking a supremum over w then gives
In (38), pick \(s=(N/2)(1-4/\Delta ^2)=:c_1(\Delta )N\). Then taking a union bound with (39) gives
with probability \(1-e^{-\Omega _{m,\Delta ,\delta }(N)}\). Expanding both sides and rearranging then gives
where the last step follows from (40). Thus, picking \(\delta =\delta (\Delta )\) sufficiently small ensures \(c_2(\Delta )>0\). Since \(c_2(\Delta )N\le \Vert \Phi _0^\top g\Vert _2^2-\Vert \Phi _0^\top v\Vert _2^2=(\Vert \Phi _0^\top g\Vert _2+\Vert \Phi _0^\top v\Vert _2)(\Vert \Phi _0^\top g\Vert _2-\Vert \Phi _0^\top v\Vert _2)\), we further have
where the last inequality takes \(c_3(\Delta ):=c_2(\Delta )/(\Delta /2+1+\delta )\), following (40).
At this point, we know that if \(v\in S_1\), then \(v\in S_2\) with probability \(1-e^{-\Omega _{m,\Delta }(N)}\). It remains to perform a union bound over an \(\epsilon \)-net of \(S_1\) to conclude that \(S_1\subseteq S_2\) with high probability. To this end, pick \(\epsilon <c_3(\Delta )/(\Delta /2+1+\delta )\), consider an \(\epsilon \)-net \(\mathcal {N}_\epsilon \) of \(S_1\), and suppose
Then for every \(x\in S_1\), there exists \(v\in \mathcal {N}_\epsilon \) such that \(\Vert x-v\Vert _2\le \epsilon \), and so (40) gives
as desired. To measure the probability of the success event (41), a standard volume comparison argument establishes the existence of an \(\epsilon \)-net of size \(|\mathcal {N}_\epsilon |\le (1+2/\epsilon )^m\); see Lemma 5.2 in [24]. As such, the union bound gives that (41) occurs with probability \(1-e^{-\Omega _{m,\Delta }(N)}\).
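The volume comparison behind the bound \(|\mathcal {N}_\epsilon |\le (1+2/\epsilon )^m\) runs as follows (a standard sketch, not the cited lemma verbatim): take \(\mathcal {N}_\epsilon \) to be a maximal \(\epsilon \)-separated subset of the unit sphere. Maximality makes it an \(\epsilon \)-net, while separation makes the balls of radius \(\epsilon /2\) about its points disjoint and contained in the ball of radius \(1+\epsilon /2\). Comparing volumes in \(\mathbb {R}^m\) gives
\[|\mathcal {N}_\epsilon |\cdot (\epsilon /2)^m \le (1+\epsilon /2)^m, \quad \text {i.e.,}\quad |\mathcal {N}_\epsilon |\le (1+2/\epsilon )^m.\]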
Iguchi, T., Mixon, D.G., Peterson, J. et al. Probably certifiably correct k-means clustering. Math. Program. 165, 605–642 (2017). https://doi.org/10.1007/s10107-016-1097-0