Abstract
Semidefinite programs (SDPs) with diagonal constraints arise in many optimization problems, such as Max-Cut, community detection and group synchronization. Although SDPs can be solved to arbitrary precision in polynomial time, generic convex solvers do not scale well with the dimension of the problem. To address this issue, Burer and Monteiro (Math Program 95(2):329–357, 2003) proposed reducing the dimension of the problem via a low-rank factorization and solving the resulting non-convex problem instead. In this paper, we present coordinate-ascent-based methods that solve this non-convex problem with provable convergence guarantees. More specifically, we prove that the block-coordinate maximization algorithm applied to the non-convex Burer–Monteiro formulation globally converges to a first-order stationary point at a sublinear rate without any assumptions on the problem. We further show that this algorithm converges linearly around a local maximum provided that the objective function exhibits quadratic decay. We establish that this condition generically holds when the rank of the factorization is sufficiently large. Furthermore, by incorporating the Lanczos method into the block-coordinate maximization, we propose an algorithm that is guaranteed to return a solution that provides a \(1-{\mathcal {O}}\left( 1/r\right) \) approximation to the original SDP without any assumptions, where r is the rank of the factorization. This approximation ratio is known to be optimal (up to constants) under the unique games conjecture, and we can explicitly quantify the number of iterations required to obtain such a solution.
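To make the setting concrete, the following is a minimal sketch of cyclic block-coordinate maximization (BCM) on the Burer–Monteiro factorization of a diagonally constrained SDP, assuming a symmetric cost matrix with zero diagonal; the function and parameter names are illustrative and do not reproduce the authors' implementation.

```python
import numpy as np

def bcm_burer_monteiro(A, r, n_sweeps=100, seed=0):
    """Cyclic BCM sweeps for: maximize <A, sigma sigma^T> s.t. ||sigma_i|| = 1.
    Assumes A is symmetric with zero diagonal (e.g., a Max-Cut cost matrix);
    a minimal illustrative sketch, not the authors' code."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    sigma = rng.standard_normal((n, r))
    sigma /= np.linalg.norm(sigma, axis=1, keepdims=True)  # rows live on the unit sphere S^{r-1}
    for _ in range(n_sweeps):
        for i in range(n):
            g_i = A[i] @ sigma                  # g_i = sum_j A_ij sigma_j
            norm = np.linalg.norm(g_i)
            if norm > 0:
                sigma[i] = g_i / norm           # exact block maximizer over the sphere
    return sigma                                # objective value: np.sum(A * (sigma @ sigma.T))
```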
Notes
Note that the dimension of \({\mathcal {V}}_{\varvec{\sigma }}\) depends on the rank of \(\varvec{\sigma }\), and hence the quotient space is not a manifold.
References
Absil, P.-A., Baker, C.G., Gallivan, K.A.: Trust-region methods on Riemannian manifolds. Found. Comput. Math. 7(3), 303–330 (2007)
Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2007)
Alizadeh, F., Haeberly, J.-P.A., Overton, M.L.: Complementarity and nondegeneracy in semidefinite programming. Math. Program. 77(1), 111–128 (1997)
Anitescu, M.: Degenerate nonlinear programming with a quadratic growth condition. SIAM J. Optim. 10(4), 1116–1135 (2000)
Arora, S., Hazan, E., Kale, S.: Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In: Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, FOCS'05, pp. 339–348 (2005)
Bandeira, A.S., Boumal, N., Voroninski, V.: On the low-rank approach for semidefinite programs arising in synchronization and community detection. arXiv preprint arXiv:1602.04426 (2016)
Barvinok, A.I.: Problems of distance geometry and convex properties of quadratic maps. Discrete Comput. Geom. 13(2), 189–202 (1995)
Bonnans, J.F., Ioffe, A.: Second-order sufficiency and quadratic growth for nonisolated minima. Math. Oper. Res. 20(4), 801–817 (1995)
Boumal, N., Absil, P.-A., Cartis, C.: Global rates of convergence for nonconvex optimization on manifolds. arXiv preprint arXiv:1605.08101 (2016)
Boumal, N., Mishra, B., Absil, P.-A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 15, 1455–1459 (2014)
Boumal, N., Voroninski, V., Bandeira, A.S.: The non-convex Burer–Monteiro approach works on smooth semidefinite programs. In: Advances in Neural Information Processing Systems, pp. 2757–2765 (2016)
Boumal, N., Voroninski, V., Bandeira, A.S.: Deterministic guarantees for Burer–Monteiro factorizations of smooth semidefinite programs. arXiv preprint arXiv:1804.02008 (2018)
Briat, C.: Linear Parameter-Varying and Time-Delay Systems. Springer (2014)
Briët, J., de Oliveira Filho, F.M., Vallentin, F.: The positive semidefinite Grothendieck problem with rank constraint. In: Automata, Languages and Programming, pp. 31–42 (2010)
Burer, S., Monteiro, R.D.C.: A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math. Program. 95(2), 329–357 (2003)
Burer, S., Monteiro, R.D.C.: Local minima and convergence in low-rank semidefinite programming. Math. Program. 103(3), 427–444 (2005)
Cifuentes, D., Moitra, A.: Polynomial time guarantees for the Burer–Monteiro method. arXiv preprint arXiv:1912.01745 (2019)
Coakley, E.S., Rokhlin, V.: A fast divide-and-conquer algorithm for computing the spectra of real symmetric tridiagonal matrices. Appl. Comput. Harmon. Anal. 34(3), 379–414 (2013)
Erdogdu, M.A., Deshpande, Y., Montanari, A.: Inference in graphical models via semidefinite programming hierarchies. In: Advances in Neural Information Processing Systems, pp. 416–424 (2017)
Gamarnik, D., Li, Q.: On the max-cut of sparse random graphs. arXiv preprint arXiv:1411.1698 (2014)
Garber, D., Hazan, E.: Approximating semidefinite programs in sublinear time. In: Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pp. 1080–1088 (2011)
Goemans, M.X., Williamson, D.P.: Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42(6), 1115–1145 (1995)
Gurbuzbalaban, M., Ozdaglar, A., Parrilo, P.A., Vanli, N.D.: When cyclic coordinate descent outperforms randomized coordinate descent. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 30, pp. 6999–7007. Curran Associates, Inc. (2017)
Gurbuzbalaban, M., Ozdaglar, A., Vanli, N.D., Wright, S.J.: Randomness and permutations in coordinate descent methods. Math. Program. 181, 03 (2018)
Javanmard, A., Montanari, A., Ricci-Tersenghi, F.: Phase transitions in semidefinite relaxations. Proc. Natl. Acad. Sci. 113(16), E2218–E2223 (2016)
Journee, M., Bach, F., Absil, P.-A., Sepulchre, R.: Low-rank optimization on the cone of positive semidefinite matrices. SIAM J. Optim. 20(5), 2327–2351 (2010)
Klein, P., Lu, H.-I.: Efficient approximation algorithms for semidefinite programs arising from MAX CUT and COLORING. In: Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC’96, pp. 338–347. ACM, New York, NY, USA (1996)
Kuczyński, J., Woźniakowski, H.: Estimating the largest eigenvalues by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl. 13(4), 1094–1122 (1992)
Lee, C.-P., Wright, S.J.: Random permutations fix a worst case for cyclic coordinate descent. IMA J. Numer. Anal. 39, 07 (2016)
Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. In: 29th Annual Conference on Learning Theory, vol. 49, pp. 1246–1257. PMLR (2016)
Lu, Z., Xiao, L.: Randomized block coordinate non-monotone gradient method for a class of nonlinear programming. Technical Report MSR-TR-2013-66 (2013)
Mei, S., Misiakiewicz, T., Montanari, A., Oliveira, R.I.: Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality. arXiv preprint arXiv:1703.08729 (2017)
Montanari, A.: A Grothendieck-type inequality for local maxima. arXiv preprint arXiv:1603.04064 (2016)
Parrilo, P.A.: Semidefinite programming relaxations for semialgebraic problems. Math. Program. 96(2), 293–320 (2003)
Pataki, G.: On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Math. Oper. Res. 23(2), 339–358 (1998)
Patrascu, A., Necoara, I.: Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization. J. Glob. Optim. 61, 05 (2013)
Pumir, T., Jelassi, S., Boumal, N.: Smoothed analysis of the low-rank approach for smooth semidefinite programs. In: Advances in Neural Information Processing Systems, pp. 2281–2290 (2018)
Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144, 07 (2011)
Steurer, D.: Fast SDP algorithms for constraint satisfaction problems. In: Proceedings of the Twenty-First Annual ACM–SIAM Symposium on Discrete Algorithms, pp. 684–697 (2010)
Tropp, J.A., Yurtsever, A., Udell, M., Cevher, V.: Practical sketching algorithms for low-rank matrix approximation. SIAM J. Matrix Anal. Appl. 38(4), 1454–1485 (2017)
Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. 117, 387–423 (2009)
Vandenberghe, L., Boyd, S.: Semidefinite programming. SIAM Rev. 38(1), 49–95 (1996)
Wang, P.-W., Chang, W.-C., Kolter, J.Z.: The mixing method: coordinate descent for low-rank semidefinite programming. arXiv preprint arXiv:1706.00476 (2017)
Additional information
Part of this work previously appeared in the ICML 2018 Workshop on Modern Trends in Nonconvex Optimization for Machine Learning.
Appendices
Proof of Corollary 1
Similar to the proof of Theorem 1, from Proposition 1, we have
where the inequality follows from the Cauchy–Schwarz inequality, since \(\Vert {\sigma _{i_k}^k}\Vert = 1\) implies \(\Vert {g_{i_k}^k}\Vert \ge {\langle \sigma _{i_k}^k,g_{i_k}^k \rangle }\). Letting \(\mathbb {E}_k\) denote the expectation over \(i_k\) given \(\varvec{\sigma }^k\), we have
In particular, when \(p_i=\frac{1}{n}\), for all \(i \in [n]\) (i.e., in the uniform sampling case), we have
since \(\Vert {g_i^k}\Vert \le \Vert {{\varvec{A}}}\Vert _1\), for all \(i \in [n]\) by (11). Therefore, we have
On the other hand, when \(p_i=\frac{\Vert {g_i^k}\Vert }{\sum _{j=1}^n \Vert {g_j^k}\Vert }\) (i.e., in the importance sampling case), we have
Letting \(\Vert {{\varvec{A}}}\Vert _{1,1} = \sum _{i,j=1}^n |{\varvec{A}}_{ij}|\) denote the \(L_{1,1}\) norm of matrix \({\varvec{A}}\), we observe that \(\sum _{j=1}^n \Vert {g_j^k}\Vert \le \Vert {{\varvec{A}}}\Vert _{1,1}\), which in the above inequality yields
In order to prove (13), which corresponds to the uniform sampling case, we assume to the contrary that \(\mathbb {E}\Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 > \epsilon \), for all \(k\in [K-1]\). Then, using the boundedness of f, we get
Using the expected functional ascent of BCM in (48) above, we get
where the last inequality follows by the assumption. Then, by contradiction, the algorithm returns a solution with \(\mathbb {E}\Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 \le \epsilon \), for some \(k\in [K-1]\), provided that
The proof of (14), which corresponds to the importance sampling case, can be obtained by using (49) (instead of (48)) in (50), and is hence omitted.
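For illustration, here is a minimal sketch of the two block-selection rules compared in Corollary 1, uniform sampling with \(p_i = 1/n\) and importance sampling with \(p_i = \Vert {g_i^k}\Vert / \sum _{j=1}^n \Vert {g_j^k}\Vert \); the helper below is a hypothetical stand-alone routine, not part of the paper.

```python
import numpy as np

def sample_block(A, sigma, rng, importance=False):
    """Draw the block index i_k for one randomized BCM step.
    Uniform sampling: p_i = 1/n.  Importance sampling: p_i proportional to
    ||g_i|| with g_i = sum_j A_ij sigma_j.  Illustrative sketch only."""
    n = A.shape[0]
    if not importance:
        return rng.integers(n)
    g_norms = np.linalg.norm(A @ sigma, axis=1)   # ||g_i^k|| for every block i
    total = g_norms.sum()
    if total == 0:                                # stationary point: fall back to uniform
        return rng.integers(n)
    return rng.choice(n, p=g_norms / total)
```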
Rest of the Proof of Theorem 2
In order to quantify how close \(\varvec{\sigma }^0\) and \(\varvec{\sigma }\) should be so that this convergence rate holds, we need to derive explicit bounds on the higher order terms in (21) and (23), which we do in the following. The Taylor expansion of \(\varvec{\sigma }^k\) around \(\varvec{\sigma }\) yields
Using this expansion, we can compute \(f(\varvec{\sigma }^k) = \sum _{i,j=1}^n A_{ij} {\langle \sigma _i^k, \sigma _j^k \rangle }\). The first three terms in the expansion are already given in (22) as follows
where \(\beta _f\) represents the higher-order terms. In order to find an upper bound on \(|\beta _f|\), we apply the Cauchy–Schwarz inequality to the higher-order terms in the expansion of \(f(\varvec{\sigma }^k)\), which yields the following bound
As \(\Vert {\varvec{u}}\Vert _{\mathrm {F}}=1\), we have \(\Vert {u_i}\Vert \le 1\) for all \(i\in [n]\), which implies
where we note that t denotes the geodesic distance between \(\varvec{\sigma }^k\) and \([{\bar{\varvec{\sigma }}}]\) as highlighted in (19). Assuming that \(t\le 1\), we obtain the following upper bound
Using the inequality \(\sum _{\ell =3}^\infty \frac{2^\ell }{\ell !} = e^2-5 \le 5/2\) above, we get
Plugging this value back in (51), we obtain
Considering the same expansion for \(\Vert {\mathrm {grad}}f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 = 2 \sum _{i=1}^n (\Vert {g_i^k}\Vert ^2 - {\langle \sigma _i^k,g_i^k \rangle }^2)\), we get the following (see (21)):
where \(\beta _g\) represents the higher-order terms. Upper bounding each higher-order term using the Cauchy–Schwarz inequality as follows, we obtain
Using the fact that \(\Vert {u_i}\Vert \le 1\) for all \(i\in [n]\), we get the following upper bound
Using the upper bound \(\sum _{j,m=1}^n |A_{ij}| |A_{im}| \le \Vert {{\varvec{A}}}\Vert _1^2\) above, we obtain
Introducing a change of variables in the last sum, we get
Assuming that \(t\le 1\), we obtain the following upper bound
Using the inequality \(\sum _{\ell =3}^\infty \frac{2^\ell +4^\ell }{\ell !} = e^2+e^4-18 \le 44\) above, we get
Plugging this value back in (53), we obtain
Using the same bounding technique as in (24), we get
Therefore, in order for (25) to hold, we need
which can be equivalently rewritten as follows
As \(f(\varvec{\sigma }^k)\) is a monotonically non-decreasing sequence, as soon as \(\varvec{\sigma }^0\) is sufficiently close to \([{\bar{\varvec{\sigma }}}]\) in the sense that
then the linear convergence rate presented in (26) holds.
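As a quick numerical sanity check of the two series bounds invoked in this appendix, the snippet below evaluates \(e^2-5\) and \(e^2+e^4-18\) directly.

```python
import math

# sum_{l>=3} 2^l / l!         = e^2 - 5          (approx. 2.389 <= 5/2)
# sum_{l>=3} (2^l + 4^l) / l! = e^2 + e^4 - 18   (approx. 43.99 <= 44)
print(math.e**2 - 5)
print(math.e**2 + math.e**4 - 18)
```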
Proof of Theorem 7
Before presenting the proof of Theorem 7, we first introduce the following theorem that characterizes the convergence rate of the Lanczos method with random initialization.
Theorem 8
[28, Theorem 4.2] Let \({\varvec{A}}\in {{\mathbb {R}}}^{n \times n}\) be a positive semidefinite matrix, \(b\in {{\mathbb {R}}}^n\) be an arbitrary vector and \(\lambda _L^\ell ({\varvec{A}},b)\) denote the output of the Lanczos algorithm after \(\ell \) iterations when applied to find the leading eigenvalue of \({\varvec{A}}\) (denoted by \(\lambda _1({\varvec{A}})\)) with initialization b. In particular,
Assume that b is uniformly distributed over the set \(\{b\in {{\mathbb {R}}}^n : \Vert {b}\Vert =1\}\) and let \(\epsilon \in [0,1)\). Then, the probability that the Lanczos algorithm does not return an \(\epsilon \)-approximation to the leading eigenvalue of \({\varvec{A}}\) exponentially decreases as follows
Using this result, Theorem 7 is proven as follows. Since the tangent space \(T_{\varvec{\sigma }}{\mathcal {M}}_r\) has dimension \(n(r-1)\), we can define a symmetric matrix \({\varvec{H}}\in {{\mathbb {R}}}^{n(r-1) \times n(r-1)}\) (where we drop the notational dependency on \(\varvec{\sigma }\) for simplicity) that represents the linear operator \({\mathrm {Hess}}f(\varvec{\sigma })\) in a basis \(\{ {\varvec{u}}^1,\dots ,{\varvec{u}}^{n(r-1)} \}\) such that \(\mathrm {span}({\varvec{u}}^1,\dots ,{\varvec{u}}^{n(r-1)}) = T_{\varvec{\sigma }}{\mathcal {M}}_r\). In particular, letting \(H_{ij} = {\langle {\varvec{u}}^i, {\mathrm {Hess}}f(\varvec{\sigma })[{\varvec{u}}^j] \rangle }\) yields the desired matrix \({\varvec{H}}\), and the Lanczos algorithm is run to find the leading eigenvalue of this matrix. It is important to note that \({\varvec{H}}\) is not a positive semidefinite matrix, so \({\varvec{H}}\) must be shifted by a large enough multiple of the identity matrix to guarantee that the resulting matrix is positive semidefinite. In particular, by inspecting the definition of \({\mathrm {Hess}}f(\varvec{\sigma })\) in (5), it is easy to observe that \(\Vert {{\mathrm {Hess}}f(\varvec{\sigma })}\Vert _{\text {op}} \le 4\Vert {{\varvec{A}}}\Vert _1\). Therefore, it is sufficient to run the Lanczos algorithm to find the leading eigenvalue of \({\widetilde{{\varvec{H}}}} = {\varvec{H}}+4\Vert {{\varvec{A}}}\Vert _1 {\varvec{I}}\), where \({\varvec{I}}\) denotes the identity matrix of the appropriate size. We initialize the Lanczos algorithm with a random vector \({\varvec{u}}\) of unit norm (i.e., \(\Vert {\varvec{u}}\Vert _{\mathrm {F}}=1\)) in the tangent space \(T_{\varvec{\sigma }}{\mathcal {M}}_r\). Notice that \({\varvec{u}}\) can equivalently be represented as a vector \(b\in {{\mathbb {R}}}^{n(r-1)}\) in the basis \(\{ {\varvec{u}}^1,\dots ,{\varvec{u}}^{n(r-1)} \}\) as \({\varvec{u}}= \sum _{i=1}^{n(r-1)} b_i {\varvec{u}}^i\) with \(\Vert {b}\Vert =1\). Then, by Theorem 8, we have
Letting \(\lambda _1({\varvec{H}})\) denote the leading eigenvalue of \({\varvec{H}}\), we run the Lanczos algorithm to obtain a vector \(b^*\) such that \(\Vert {b^*}\Vert =1\) and \({\langle b^*, {\varvec{H}}b^* \rangle } \ge \lambda _1({\varvec{H}})/2\). Thus, we want \(\mathbb {P}\left( \lambda _L^\ell ({\widetilde{{\varvec{H}}}},b) < 4\Vert {{\varvec{A}}}\Vert _1 + \lambda _1({\varvec{H}})/2 \right) \) to be small. Setting \(\epsilon ^* = \frac{\lambda _1({\varvec{H}})}{16\Vert {{\varvec{A}}}\Vert _1}\), we can observe that
where the inequality follows since \(\lambda _1({\varvec{H}}) \le 4\Vert {{\varvec{A}}}\Vert _1\). Consequently, we have
By Theorem 6, we know that the Lanczos method is called at most \(\left\lceil 675 n \Vert {{\varvec{A}}}\Vert _1^2 / \varepsilon ^2 \right\rceil \) times while searching for an \(\varepsilon \)-approximate concave point, and any point that is not an \(\varepsilon \)-approximate concave point satisfies \(\lambda _1({\varvec{H}}) \ge \varepsilon \) by definition. Then, by using a union bound over all calls to the Lanczos method, we conclude that when the Lanczos method is run for \(\ell \) iterations, we have the following guarantee
In order to set this probability to some \(\delta \in (0,1)\), we let
where the tilde hides poly-logarithmic factors in \(\Vert {{\varvec{A}}}\Vert _1 / \varepsilon \). Since the Lanczos algorithm is guaranteed to return the leading eigenvalue with probability 1 in at most \(n(r-1)\) iterations, running each Lanczos subroutine for \(\min (\ell ^*,n(r-1))\) iterations guarantees that Algorithm 2+3 returns an \(\varepsilon \)-approximate concave point with probability at least \(1-\delta \).
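The shift-then-Lanczos step used in this proof can be sketched as follows; SciPy's Lanczos-type solver eigsh is used here as a stand-in for the Lanczos routine analyzed in Theorem 8, and the function and argument names are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def leading_hessian_eigenpair(hess_vec, dim, shift, n_iter, seed=0):
    """Estimate the leading eigenvalue of the (indefinite) Hessian matrix H
    represented by the matrix-vector product hess_vec, via Lanczos applied to
    the shifted operator H_tilde = H + shift * I (the proof takes
    shift = 4 * ||A||_1 so that H_tilde is positive semidefinite).
    Illustrative sketch only."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(dim)
    b /= np.linalg.norm(b)                        # random unit-norm start, as in Theorem 8
    H_tilde = LinearOperator((dim, dim), matvec=lambda v: hess_vec(v) + shift * v)
    vals, vecs = eigsh(H_tilde, k=1, which='LA', v0=b, maxiter=n_iter)
    return vals[0] - shift, vecs[:, 0]            # undo the shift to recover lambda_1(H)
```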
Keywords
- Semidefinite programming
- Burer–Monteiro method
- Coordinate descent
- Non-convex optimization
- Large-scale optimization