Convergence rate of block-coordinate maximization Burer–Monteiro method for solving large SDPs

  • Full Length Paper
  • Series A
  • Mathematical Programming

Abstract

Semidefinite programs (SDPs) with diagonal constraints arise in many optimization problems, such as Max-Cut, community detection and group synchronization. Although SDPs can be solved to arbitrary precision in polynomial time, generic convex solvers do not scale well with the dimension of the problem. In order to address this issue, Burer and Monteiro (Math Program 95(2):329–357, 2003) proposed to reduce the dimension of the problem by appealing to a low-rank factorization and solving the resulting non-convex problem instead. In this paper, we present coordinate ascent based methods to solve this non-convex problem with provable convergence guarantees. More specifically, we prove that the block-coordinate maximization algorithm applied to the non-convex Burer–Monteiro method globally converges to a first-order stationary point at a sublinear rate without any assumptions on the problem. We further show that this algorithm converges linearly around a local maximum provided that the objective function exhibits quadratic decay. We establish that this condition generically holds when the rank of the factorization is sufficiently large. Furthermore, by incorporating the Lanczos method into block-coordinate maximization, we propose an algorithm that is guaranteed to return a solution that provides a \(1-{\mathcal {O}}\left( 1/r\right) \) approximation to the original SDP without any assumptions, where r is the rank of the factorization. This approximation ratio is known to be optimal (up to constants) under the unique games conjecture, and we explicitly quantify the number of iterations needed to obtain such a solution.

Notes

  1. Note that the dimension of \({\mathcal {V}}_{\varvec{\sigma }}\) depends on the rank of \(\varvec{\sigma }\), and hence the quotient space is not a manifold.

References

  1. Absil, P.-A., Baker, C.G., Gallivan, K.A.: Trust-region methods on Riemannian manifolds. Found. Comput. Math. 7(3), 303–330 (2007)

  2. Absil, P.-A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton (2007)

  3. Alizadeh, F., Haeberly, J.-P.A., Overton, M.L.: Complementarity and nondegeneracy in semidefinite programming. Math. Program. 77(1), 111–128 (1997)

  4. Anitescu, M.: Degenerate nonlinear programming with a quadratic growth condition. SIAM J. Optim. 10(4), 1116–1135 (2000)

  5. Arora, S., Hazan, E., Kale, S.: Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In: Proceedings of the 46th Annual IEEE Symposium on Foundations of Computer Science, FOCS’05, pp. 339–348 (2005)

  6. Bandeira, A.S., Boumal, N., Voroninski, V.: On the low-rank approach for semidefinite programs arising in synchronization and community detection. arXiv:1602.04426 (2016)

  7. Barvinok, A.I.: Problems of distance geometry and convex properties of quadratic maps. Discrete Comput. Geom. 13(2), 189–202 (1995)

  8. Bonnans, J.F., Ioffe, A.: Second-order sufficiency and quadratic growth for nonisolated minima. Math. Oper. Res. 20(4), 801–817 (1995)

  9. Boumal, N., Absil, P.-A., Cartis, C.: Global rates of convergence for nonconvex optimization on manifolds. arXiv preprint arXiv:1605.08101 (2016)

  10. Boumal, N., Mishra, B., Absil, P.-A., Sepulchre, R.: Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 15, 1455–1459 (2014)

  11. Boumal, N., Voroninski, V., Bandeira, A.S.: The non-convex Burer–Monteiro approach works on smooth semidefinite programs. In: Advances in Neural Information Processing Systems, pp. 2757–2765 (2016)

  12. Boumal, N., Voroninski, V., Bandeira, A.S.: Deterministic guarantees for Burer–Monteiro factorizations of smooth semidefinite programs. arXiv preprint arXiv:1804.02008 (2018)

  13. Briat, C.: Linear Parameter-Varying and Time-Delay Systems. Springer (2014)

  14. Briët, J., de Oliveira Filho, F.M., Vallentin, F.: The positive semidefinite Grothendieck problem with rank constraint. In: Automata, Languages and Programming, pp. 31–42 (2010)

  15. Burer, S., Monteiro, R.D.C.: A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math. Program. 95(2), 329–357 (2003)

  16. Burer, S., Monteiro, R.D.C.: Local minima and convergence in low-rank semidefinite programming. Math. Program. 103(3), 427–444 (2005)

  17. Cifuentes, D., Moitra, A.: Polynomial time guarantees for the Burer–Monteiro method. arXiv preprint arXiv:1912.01745 (2019)

  18. Coakley, E.S., Rokhlin, V.: A fast divide-and-conquer algorithm for computing the spectra of real symmetric tridiagonal matrices. Appl. Comput. Harmon. Anal. 34(3), 379–414 (2013)

  19. Erdogdu, M.A., Deshpande, Y., Montanari, A.: Inference in graphical models via semidefinite programming hierarchies. In: Advances in Neural Information Processing Systems, pp. 416–424 (2017)

  20. Gamarnik, D., Li, Q.: On the max-cut of sparse random graphs. arXiv preprint arXiv:1411.1698 (2014)

  21. Garber, D., Hazan, E.: Approximating semidefinite programs in sublinear time. In: Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS’11, pp. 1080–1088 (2011)

  22. Goemans, M.X., Williamson, D.P.: Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42(6), 1115–1145 (1995)

  23. Gurbuzbalaban, M., Ozdaglar, A., Parrilo, P.A., Vanli, N.D.: When cyclic coordinate descent outperforms randomized coordinate descent. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems, volume 30, pp. 6999–7007. Curran Associates, Inc. (2017)

  24. Gurbuzbalaban, M., Ozdaglar, A., Vanli, N.D., Wright, S.J.: Randomness and permutations in coordinate descent methods. Math. Program. 181, 03 (2018)

  25. Javanmard, A., Montanari, A., Ricci-Tersenghi, F.: Phase transitions in semidefinite relaxations. Proc. Natl. Acad. Sci. 113(16), E2218–E2223 (2016)

  26. Journee, M., Bach, F., Absil, P.-A., Sepulchre, R.: Low-rank optimization on the cone of positive semidefinite matrices. SIAM J. Optim. 20(5), 2327–2351 (2010)

  27. Klein, P., Lu, H.-I.: Efficient approximation algorithms for semidefinite programs arising from MAX CUT and COLORING. In: Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC’96, pp. 338–347. ACM, New York, NY, USA (1996)

  28. Kuczyński, J., Woźniakowski, H.: Estimating the largest eigenvalues by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl. 13(4), 1094–1122 (1992)

  29. Lee, C.-P., Wright, S.J.: Random permutations fix a worst case for cyclic coordinate descent. IMA J. Numer. Anal. 39, 07 (2016)

  30. Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent only converges to minimizers. In: 29th Annual Conference on Learning Theory, vol. 49, pp. 1246–1257. PMLR (2016)

  31. Lu, Z., Xiao, L.: Randomized block coordinate non-monotone gradient method for a class of nonlinear programming. Technical Report MSR-TR-2013-66 (2013)

  32. Mei, S., Misiakiewicz, T., Montanari, A., Oliveira, R.I.: Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality. arXiv preprint arXiv:1703.08729 (2017)

  33. Montanari, A.: A Grothendieck-type inequality for local maxima. arXiv preprint arXiv:1603.04064 (2016)

  34. Parrilo, P.A.: Semidefinite programming relaxations for semialgebraic problems. Math. Program. 96(2), 293–320 (2003)

  35. Pataki, G.: On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Math. Oper. Res. 23(2), 339–358 (1998)

  36. Patrascu, A., Necoara, I.: Efficient random coordinate descent algorithms for large-scale structured nonconvex optimization. J. Glob. Optim. 61, 05 (2013)

  37. Pumir, T., Jelassi, S., Boumal, N.: Smoothed analysis of the low-rank approach for smooth semidefinite programs. In: Advances in Neural Information Processing Systems, pp. 2281–2290 (2018)

  38. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144, 07 (2011)

  39. Steurer, D.: Fast SDP algorithms for constraint satisfaction problems. In: Proceedings of the Twenty-First Annual ACM–SIAM Symposium on Discrete Algorithms, pp. 684–697 (2010)

  40. Tropp, J.A., Yurtsever, A., Udell, M., Cevher, V.: Practical sketching algorithms for low-rank matrix approximation. SIAM J. Matrix Anal. Appl. 38(4), 1454–1485 (2017)

  41. Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. 117, 387–423 (2009)

  42. Vandenberghe, L., Boyd, S.: Semidefinite programming. SIAM Rev. 38(1), 49–95 (1996)

  43. Wang, P.-W., Chang, W.-C., Kolter, J.Z.: The mixing method: coordinate descent for low-rank semidefinite programming. arXiv preprint arXiv:1706.00476 (2017)

Author information

Corresponding author

Correspondence to Nuri Denizcan Vanli.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Part of this work has previously appeared in ICML 2018 Workshop on Modern Trends in Nonconvex Optimization for Machine Learning.

Appendices

Proof of Corollary 1

Similar to the proof of Theorem 1, from Proposition 1, we have

$$\begin{aligned} f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k)&= 2 \left( \Vert {g_{i_k}^k}\Vert - \langle \sigma _{i_k}^k, g_{i_k}^k \rangle \right) , \nonumber \\&= \frac{2 \Vert {g_{i_k}^k}\Vert \left( \Vert {g_{i_k}^k}\Vert - \langle \sigma _{i_k}^k, g_{i_k}^k \rangle \right) }{\Vert {g_{i_k}^k}\Vert } ,\nonumber \\&\ge \frac{ \Vert {g_{i_k}^k}\Vert ^2 - \langle \sigma _{i_k}^k, g_{i_k}^k \rangle ^2 }{\Vert {g_{i_k}^k}\Vert }, \end{aligned}$$
(47)

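where the inequality follows since \(\Vert {g_{i_k}^k}\Vert \ge {\langle \sigma _{i_k}^k,g_{i_k}^k \rangle }\) for every unit-norm \(\sigma _{i_k}^k \in {{\mathbb {R}}}^{r}\) (by the Cauchy-Schwarz inequality), so that \(\Vert {g_{i_k}^k}\Vert ^2 - \langle \sigma _{i_k}^k, g_{i_k}^k \rangle ^2 = (\Vert {g_{i_k}^k}\Vert - \langle \sigma _{i_k}^k, g_{i_k}^k \rangle )(\Vert {g_{i_k}^k}\Vert + \langle \sigma _{i_k}^k, g_{i_k}^k \rangle ) \le 2\Vert {g_{i_k}^k}\Vert (\Vert {g_{i_k}^k}\Vert - \langle \sigma _{i_k}^k, g_{i_k}^k \rangle )\). Letting \(\mathbb {E}_k\) denote the expectation over \(i_k\) given \(\varvec{\sigma }^k\), we have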

$$\begin{aligned} \mathbb {E}_k f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \ge \sum _{i=1}^n p_i \frac{ \Vert {g_i^k}\Vert ^2 - \langle \sigma _i^k, g_i^k \rangle ^2 }{\Vert {g_i^k}\Vert }. \end{aligned}$$
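The per-step ascent identity in (47) can be checked numerically. The following is a minimal sketch, assuming that \(g_i = \sum _{j \ne i} A_{ij} \sigma _j\), that \({\varvec{A}}\) is symmetric, and that each row \(\sigma _i\) has unit norm; the definition of \(g_i\) is taken from context rather than restated in this appendix, and the helper names `objective`, `bcm_step` and `random_instance` are hypothetical.

```python
import numpy as np

def objective(A, sigma):
    """f(sigma) = sum_{i,j} A_ij <sigma_i, sigma_j>, with sigma_i the rows of sigma."""
    return float(np.sum(A * (sigma @ sigma.T)))

def bcm_step(A, sigma, i):
    """Maximize f over row i while all other rows stay fixed.

    Assumes g_i = sum_{j != i} A_ij sigma_j (an assumption of this sketch)
    and g_i != 0. Returns an updated copy of sigma together with the
    ascent 2 (||g_i|| - <sigma_i, g_i>) predicted by (47).
    """
    g = A[i] @ sigma - A[i, i] * sigma[i]
    g_norm = np.linalg.norm(g)
    predicted = 2.0 * (g_norm - sigma[i] @ g)
    new_sigma = sigma.copy()
    new_sigma[i] = g / g_norm          # unit vector aligned with g_i
    return new_sigma, float(predicted)

def random_instance(n, r, seed=0):
    """Random symmetric A and a random point with unit-norm rows (illustrative only)."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((n, n))
    A = (B + B.T) / 2.0
    sigma = rng.standard_normal((n, r))
    sigma /= np.linalg.norm(sigma, axis=1, keepdims=True)
    return A, sigma
```

Comparing `objective(A, new_sigma) - objective(A, sigma)` with the returned prediction for a few rows gives a quick sanity check of the first line of (47).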

In particular, when \(p_i=\frac{1}{n}\) for all \(i \in [n]\) (i.e., in the uniform sampling case), we have

$$\begin{aligned} \mathbb {E}_k f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \ge \frac{1}{n \Vert {{\varvec{A}}}\Vert _1} \, \sum _{i=1}^n \left( \Vert {g_i^k}\Vert ^2 - \langle \sigma _i^k, g_i^k \rangle ^2 \right) , \end{aligned}$$

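since \(\Vert {g_i^k}\Vert \le \Vert {{\varvec{A}}}\Vert _1\) for all \(i \in [n]\) by (11). Therefore, using \(\Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 = 2 \sum _{i=1}^n (\Vert {g_i^k}\Vert ^2 - \langle \sigma _i^k, g_i^k \rangle ^2)\), we have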

$$\begin{aligned} \mathbb {E}_k f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \ge \frac{\Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2}{2n \Vert {{\varvec{A}}}\Vert _1}. \end{aligned}$$
(48)

On the other hand, when \(p_i=\frac{\Vert {g_i^k}\Vert }{\sum _{j=1}^n \Vert {g_j^k}\Vert }\) (i.e., in the importance sampling case), we have

$$\begin{aligned} \mathbb {E}_k f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \ge \frac{ \sum _{i=1}^n \left( \Vert {g_i^k}\Vert ^2 - \langle \sigma _i^k, g_i^k \rangle ^2 \right) }{\sum _{j=1}^n \Vert {g_j^k}\Vert } = \frac{ \Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2}{2 \sum _{j=1}^n \Vert {g_j^k}\Vert }. \end{aligned}$$

Letting \(\Vert {{\varvec{A}}}\Vert _{1,1} = \sum _{i,j=1}^n |{\varvec{A}}_{ij}|\) denote the \(L_{1,1}\) norm of matrix \({\varvec{A}}\), we observe that \(\sum _{j=1}^n \Vert {g_j^k}\Vert \le \Vert {{\varvec{A}}}\Vert _{1,1}\), which in the above inequality yields

$$\begin{aligned} \mathbb {E}_k f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \ge \frac{\Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2}{2\Vert {{\varvec{A}}}\Vert _{1,1}}. \end{aligned}$$
(49)
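To compare the two sampling rules on a concrete instance, the right-hand sides of the uniform and importance sampling bounds can be evaluated directly. The sketch below is illustrative only: it assumes \(g_i = \sum _{j \ne i} A_{ij} \sigma _j\), a symmetric \({\varvec{A}}\), unit-norm rows, and that \(\Vert {{\varvec{A}}}\Vert _1\) is the maximum absolute column sum; the function name `sampling_lower_bounds` is hypothetical.

```python
import numpy as np

def sampling_lower_bounds(A, sigma):
    """Evaluate the expected-ascent lower bounds for both sampling rules.

    Assumptions of this sketch: g_i = sum_{j != i} A_ij sigma_j, rows of
    sigma have unit norm, and ||A||_1 is the maximum absolute column sum.
    """
    G = A @ sigma - np.diag(A)[:, None] * sigma   # row i holds g_i
    g_norms = np.linalg.norm(G, axis=1)
    inner = np.sum(sigma * G, axis=1)             # <sigma_i, g_i>
    gaps = g_norms ** 2 - inner ** 2              # nonnegative by Cauchy-Schwarz
    norm_1 = np.abs(A).sum(axis=0).max()
    norm_11 = np.abs(A).sum()                     # L_{1,1} norm
    return {
        "uniform": gaps.sum() / (A.shape[0] * norm_1),
        "importance": gaps.sum() / norm_11,
        "g_norms": g_norms,
        "norm_1": norm_1,
        "norm_11": norm_11,
    }
```

Because \(\Vert {{\varvec{A}}}\Vert _{1,1} \le n \Vert {{\varvec{A}}}\Vert _1\), the importance sampling value returned here is never smaller than the uniform one.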

In order to prove (13), which corresponds to the uniform sampling case, we assume to the contrary that \(\mathbb {E}\Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 > \epsilon \), for all \(k\in [K-1]\). Then, using the boundedness of f, we get

$$\begin{aligned} f^* - f(\varvec{\sigma }^0)\ge & {} \mathbb {E}f(\varvec{\sigma }^K) - f(\varvec{\sigma }^0) = \sum _{k=0}^{K-1} \mathbb {E}\left[ f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \right] \\= & {} \sum _{k=0}^{K-1} \mathbb {E}\left[ \mathbb {E}_k f(\varvec{\sigma }^{k+1}) - f(\varvec{\sigma }^k) \right] . \end{aligned}$$

Using the expected functional ascent of BCM in (48) above, we get

$$\begin{aligned} f^* - f(\varvec{\sigma }^0) \ge \sum _{k=0}^{K-1} \frac{\mathbb {E}\Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2}{2n\Vert {{\varvec{A}}}\Vert _1} > \frac{K \epsilon }{2n\Vert {{\varvec{A}}}\Vert _1}, \end{aligned}$$
(50)

where the last inequality follows by the assumption. Then, by contradiction, the algorithm returns a solution with \(\mathbb {E}\Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 \le \epsilon \), for some \(k\in [K-1]\), provided that

$$\begin{aligned} K \ge \frac{2n\Vert {{\varvec{A}}}\Vert _1 (f^* - f(\varvec{\sigma }^0))}{\epsilon }. \end{aligned}$$

The proof of (14), which corresponds to the importance sampling case, follows by using (49) (instead of (48)) in (50), and is hence omitted.
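As a worked illustration of the iteration bound, the helper below returns the smallest integer \(K\) with \(K \ge 2 c (f^* - f(\varvec{\sigma }^0)) / \epsilon \). It is a sketch with a hypothetical function name; the choice \(c = n\Vert {{\varvec{A}}}\Vert _1\) matches the uniform sampling bound above, while \(c = \Vert {{\varvec{A}}}\Vert _{1,1}\) is what substituting (49) for (48) in (50) would suggest for importance sampling.

```python
import math

def bcm_iteration_bound(f_star, f_init, eps, scale):
    """Smallest integer K with K >= 2 * scale * (f_star - f_init) / eps.

    scale = n * ||A||_1 corresponds to uniform sampling; scale = ||A||_{1,1}
    is the analogous constant obtained from (49) (an inference of this sketch).
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    return math.ceil(2.0 * scale * (f_star - f_init) / eps)
```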

Rest of the Proof of Theorem 2

In order to quantify how close \(\varvec{\sigma }^0\) and \(\varvec{\sigma }\) must be for this convergence rate to hold, we derive explicit bounds on the higher-order terms in (21) and (23). The Taylor expansion of \(\varvec{\sigma }^k\) around \(\varvec{\sigma }\) yields

$$\begin{aligned} \sigma _i^k&= \sigma _i \cos (\Vert {u_i}\Vert t) + \frac{u_i}{\Vert {u_i}\Vert } \sin (\Vert {u_i}\Vert t), \\&= \sigma _i \left[ \sum _{\ell =0}^\infty \frac{(-1)^\ell }{(2\ell )!} \left( \Vert {u_i}\Vert t \right) ^{2\ell } \right] + \frac{u_i}{\Vert {u_i}\Vert } \left[ \sum _{\ell =0}^\infty \frac{(-1)^\ell }{(2\ell +1)!} \left( \Vert {u_i}\Vert t \right) ^{2\ell +1} \right] . \end{aligned}$$
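The first line of this expansion is a row-wise great-circle parametrization, and a short numerical sketch makes its properties easy to inspect. The code assumes unit-norm rows \(\sigma _i\) and tangent directions with \({\langle u_i,\sigma _i \rangle }=0\) and \(\Vert {\varvec{u}}\Vert _{\mathrm {F}}=1\); the function names are hypothetical.

```python
import numpy as np

def geodesic_point(sigma, u, t):
    """Row-wise sigma_i cos(||u_i|| t) + (u_i / ||u_i||) sin(||u_i|| t).

    Assumes each sigma_i has unit norm and each u_i is orthogonal to sigma_i.
    Rows with u_i = 0 are left unchanged.
    """
    norms = np.linalg.norm(u, axis=1, keepdims=True)
    safe = np.where(norms > 0, norms, 1.0)        # avoid division by zero
    return sigma * np.cos(norms * t) + (u / safe) * np.sin(norms * t)

def random_tangent(sigma, rng):
    """Random direction with <u_i, sigma_i> = 0 for every row and ||u||_F = 1."""
    u = rng.standard_normal(sigma.shape)
    u -= np.sum(u * sigma, axis=1, keepdims=True) * sigma
    return u / np.linalg.norm(u)
```

Under these assumptions every row of `geodesic_point(sigma, u, t)` keeps unit norm for all `t`, and `t = 0` returns `sigma` itself.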

Using this expansion, we can compute \(f(\varvec{\sigma }^k) = \sum _{i,j=1}^n A_{ij} {\langle \sigma _i^k, \sigma _j^k \rangle }\). The first three terms in the expansion are already given in (22) as follows

$$\begin{aligned} f(\varvec{\sigma }^k) = f(\varvec{\sigma }) + t^2 \sum _{i=1}^n \left( {\langle u_i,v_i \rangle } - \Vert {u_i}\Vert ^2 \Vert {g_i}\Vert \right) + \beta _f, \end{aligned}$$
(51)

where \(\beta _f\) represents the higher-order terms. In order to find an upper bound on \(|\beta _f|\), we apply the Cauchy-Schwarz inequality to the higher-order terms in the expansion of \(f(\varvec{\sigma }^k)\), which yields the following bound

$$\begin{aligned} |\beta _f| \le \sum _{i,j=1}^n |A_{ij}| \left( \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} ( \Vert {u_i}\Vert + \Vert {u_j}\Vert )^\ell \right) . \end{aligned}$$

As \(\Vert {\varvec{u}}\Vert _{\mathrm {F}}=1\), we have \(\Vert {u_i}\Vert \le 1\) for all \(i\in [n]\), which implies

$$\begin{aligned} |\beta _f| \le \sum _{i,j=1}^n |A_{ij}| \left( \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} 2^\ell \right) , \end{aligned}$$

where we note that t denotes the geodesic distance between \(\varvec{\sigma }^k\) and \([{\bar{\varvec{\sigma }}}]\) as highlighted in (19). Assuming that \(t\le 1\), we obtain the following upper bound

$$\begin{aligned} |\beta _f| \le t^3 n \Vert {{\varvec{A}}}\Vert _1 \left( \sum _{\ell =3}^\infty \frac{2^\ell }{\ell !} \right) . \end{aligned}$$

Using the inequality \(\sum _{\ell =3}^\infty \frac{2^\ell }{\ell !} = e^2-5 \le 5/2\) above, we get

$$\begin{aligned} |\beta _f| \le \frac{5n \Vert {{\varvec{A}}}\Vert _1t^3 }{2}. \end{aligned}$$

Plugging this value back in (51), we obtain

$$\begin{aligned} f(\varvec{\sigma }^k) \le f(\varvec{\sigma }) + t^2 \sum _{i=1}^n \left( {\langle u_i,v_i \rangle } - \Vert {u_i}\Vert ^2 \Vert {g_i}\Vert \right) + \frac{5n \Vert {{\varvec{A}}}\Vert _1t^3 }{2}. \end{aligned}$$
(52)

Considering the same expansion for \(\Vert {\mathrm {grad}}f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 = 2 \sum _{i=1}^n (\Vert {g_i^k}\Vert ^2 - {\langle \sigma _i^k,g_i^k \rangle }^2)\), we get the following (see (21)):

$$\begin{aligned} \Vert {\mathrm {grad}}f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 = 2 t^2 \sum _{i=1}^n \left( \Vert {u_i}\Vert \Vert {g_i}\Vert - {\langle \frac{u_i}{\Vert {u_i}\Vert },v_i \rangle } \right) ^2 + \beta _g, \end{aligned}$$
(53)

where \(\beta _g\) represents the higher-order terms. Upper bounding each higher-order term using the Cauchy-Schwarz inequality, we obtain

$$\begin{aligned} |\beta _g|\le & {} 2 \sum _{i=1}^n \left[ \sum _{j,m=1}^n |A_{ij}| |A_{im}| \left( \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} ( \Vert {u_j}\Vert + \Vert {u_m}\Vert )^\ell \right) \right. \\&\left. + \sum _{j,m=1}^n |A_{ij}| |A_{im}| \left( \sum _{\begin{array}{c} \ell ,s=0 \\ \ell +s\ge 3 \end{array}}^\infty \frac{t^{\ell +s}}{\ell ! s!} ( \Vert {u_i}\Vert + \Vert {u_j}\Vert )^{\ell +s} \right) \right] . \end{aligned}$$

Using the fact that \(\Vert {u_i}\Vert \le 1\) for all \(i\in [n]\), we get the following upper bound

$$\begin{aligned} |\beta _g| \le 2 \sum _{i=1}^n \left[ \sum _{j,m=1}^n |A_{ij}| |A_{im}| \left( \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} 2^\ell \right) + \sum _{j,m=1}^n |A_{ij}| |A_{im}| \left( \sum _{\begin{array}{c} \ell ,s=0 \\ \ell +s\ge 3 \end{array}}^\infty \frac{t^{\ell +s}}{\ell ! s!} 2^{\ell +s} \right) \right] . \end{aligned}$$

Using the upper bound \(\sum _{j,m=1}^n |A_{ij}| |A_{im}| \le \Vert {{\varvec{A}}}\Vert _1^2\) above, we obtain

$$\begin{aligned} |\beta _g| \le 2 \Vert {{\varvec{A}}}\Vert _1^2 \sum _{i=1}^n \left[ \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} 2^\ell + \sum _{\begin{array}{c} \ell ,s=0 \\ \ell +s\ge 3 \end{array}}^\infty \frac{t^{\ell +s}}{\ell ! s!} 2^{\ell +s} \right] . \end{aligned}$$

Introducing a change of variables in the last sum, we get

$$\begin{aligned} |\beta _g|&\le 2 \Vert {{\varvec{A}}}\Vert _1^2 \sum _{i=1}^n \left[ \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} 2^\ell + \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} 2^\ell \left( \sum _{s=0}^\ell \frac{\ell !}{s!(\ell -s)!} \right) \right] ,\\&= 2 \Vert {{\varvec{A}}}\Vert _1^2 \sum _{i=1}^n \left[ \sum _{\ell =3}^\infty \frac{t^\ell }{\ell !} \left( 2^\ell + 4^\ell \right) \right] . \end{aligned}$$

Assuming that \(t\le 1\), we obtain the following upper bound

$$\begin{aligned} |\beta _g| \le 2 \Vert {{\varvec{A}}}\Vert _1^2 t^3 \sum _{i=1}^n \left[ \sum _{\ell =3}^\infty \frac{1}{\ell !} \left( 2^\ell + 4^\ell \right) \right] . \end{aligned}$$

Using the inequality \(\sum _{\ell =3}^\infty \frac{2^\ell +4^\ell }{\ell !} = e^2+e^4-18 \le 44\) above, we get

$$\begin{aligned} |\beta _g| \le 88 n \Vert {{\varvec{A}}}\Vert _1^2 t^3. \end{aligned}$$
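The two numerical constants used above, \(e^2-5 \le 5/2\) and \(e^2+e^4-18 \le 44\), can be confirmed by summing the series directly. The following is a small sketch; the helper `exp_tail` is a hypothetical name, and truncating after 60 terms is an arbitrary choice that is far more than sufficient at these arguments.

```python
import math

def exp_tail(x, start=3, terms=60):
    """Partial sum of x**l / l! for l = start, ..., start + terms - 1."""
    return sum(x ** l / math.factorial(l) for l in range(start, start + terms))

c_f = exp_tail(2.0)                    # compare with e^2 - 5
c_g = exp_tail(2.0) + exp_tail(4.0)    # compare with e^2 + e^4 - 18
```

The factor \(4^\ell \) in the second constant comes from the binomial identity \(\sum _{s=0}^\ell \frac{\ell !}{s!(\ell -s)!} = 2^\ell \) applied after the change of variables.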

Plugging this value back in (53), we obtain

$$\begin{aligned} \Vert {\mathrm {grad}}f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2 \ge 2 t^2 \sum _{i=1}^n \left( \Vert {u_i}\Vert \Vert {g_i}\Vert - {\langle \frac{u_i}{\Vert {u_i}\Vert },v_i \rangle } \right) ^2 - 88 n \Vert {{\varvec{A}}}\Vert _1^2 t^3. \end{aligned}$$
(54)

Using the same bounding technique as in (24), and bounding \(5\mu /2\) by \(3\mu \) in the second step, we get

$$\begin{aligned} \Vert \mathrm {grad} f(\varvec{\sigma }^k)\Vert _{\mathrm {F}}^2&\ge \frac{\mu }{n} \left( f({\bar{\varvec{\sigma }}}) - f(\varvec{\sigma }^k) - \frac{5n \Vert {{\varvec{A}}}\Vert _1t^3 }{2} \right) - 88 n \Vert {{\varvec{A}}}\Vert _1^2 t^3, \\&\ge \frac{\mu }{n} \left( f({\bar{\varvec{\sigma }}}) - f(\varvec{\sigma }^k) \right) - t^3 \Vert {{\varvec{A}}}\Vert _1 \left( 3\mu + 88 n \Vert {{\varvec{A}}}\Vert _1 \right) . \end{aligned}$$

Therefore, in order for (25) to hold, we need

$$\begin{aligned} t^3 \Vert {{\varvec{A}}}\Vert _1 \left( 3\mu + 88 n \Vert {{\varvec{A}}}\Vert _1 \right) \le \frac{\mu }{2n} \left( f({\bar{\varvec{\sigma }}}) - f(\varvec{\sigma }^k) \right) , \end{aligned}$$

which can be equivalently rewritten as follows

$$\begin{aligned} t^3 \le \frac{\mu (f({\bar{\varvec{\sigma }}}) - f(\varvec{\sigma }^k))}{2n \Vert {{\varvec{A}}}\Vert _1 \left( 3\mu + 88 n \Vert {{\varvec{A}}}\Vert _1 \right) }. \end{aligned}$$

Since \(f(\varvec{\sigma }^k)\) is a monotonically non-decreasing sequence, as soon as \(\varvec{\sigma }^0\) is sufficiently close to \([{\bar{\varvec{\sigma }}}]\) in the sense that

$$\begin{aligned} \mathrm {dist}(\varvec{\sigma }^0, [{\bar{\varvec{\sigma }}}]) \le \left( \frac{\mu (f({\bar{\varvec{\sigma }}}) - f(\varvec{\sigma }^k))}{2n \Vert {{\varvec{A}}}\Vert _1 \left( 3\mu + 88 n \Vert {{\varvec{A}}}\Vert _1 \right) } \right) ^{1/3}, \end{aligned}$$

the linear convergence rate presented in (26) holds.
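For a sense of scale, the closeness threshold can be evaluated for given problem constants. The helper below is a sketch with a hypothetical name: `gap` stands for \(f({\bar{\varvec{\sigma }}}) - f(\varvec{\sigma }^k)\), `mu` for the constant \(\mu \), and `norm_1` for \(\Vert {{\varvec{A}}}\Vert _1\).

```python
def local_radius(mu, gap, n, norm_1):
    """Cube root of mu * gap / (2 n ||A||_1 (3 mu + 88 n ||A||_1)).

    All inputs are assumed positive; gap denotes f(sigma_bar) - f(sigma^k).
    """
    if min(mu, gap, n, norm_1) <= 0:
        raise ValueError("all inputs must be positive")
    denom = 2.0 * n * norm_1 * (3.0 * mu + 88.0 * n * norm_1)
    return (mu * gap / denom) ** (1.0 / 3.0)
```

With `mu = gap = n = norm_1 = 1` the denominator is 182, so the threshold is the cube root of 1/182, which illustrates how conservative the explicit constants are.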

Proof of Theorem 7

Before presenting the proof of Theorem 7, we first introduce the following theorem that characterizes the convergence rate of the Lanczos method with random initialization.

Theorem 8

[28, Theorem 4.2] Let \({\varvec{A}}\in {{\mathbb {R}}}^{n \times n}\) be a positive semidefinite matrix, \(b\in {{\mathbb {R}}}^n\) be an arbitrary vector and \(\lambda _L^\ell ({\varvec{A}},b)\) denote the output of the Lanczos algorithm after \(\ell \) iterations when applied to find the leading eigenvalue of \({\varvec{A}}\) (denoted by \(\lambda _1({\varvec{A}})\)) with initialization b. In particular,

$$\begin{aligned} \lambda _L^\ell ({\varvec{A}},b) = \max \left\{ \frac{{\langle x,{\varvec{A}}x \rangle }}{{\langle x,x \rangle }} : 0 \ne x \in \mathrm {span}(b,\dots ,{\varvec{A}}^{\ell -1}b) \right\} . \end{aligned}$$

Assume that b is uniformly distributed over the set \(\{b\in {{\mathbb {R}}}^n : \Vert {b}\Vert =1\}\) and let \(\epsilon \in [0,1)\). Then, the probability that the Lanczos algorithm does not return an \(\epsilon \)-approximation to the leading eigenvalue of \({\varvec{A}}\) decays exponentially in \(\ell \) as follows

$$\begin{aligned} \mathbb {P}\left( \lambda _L^\ell ({\varvec{A}},b)< (1-\epsilon ) \lambda _1({\varvec{A}}) \right) {\left\{ \begin{array}{ll} \le 1.648 \sqrt{n} e^{-\sqrt{\epsilon }(2\ell -1)}, &{} \text {if } 0<\ell <n, \\ = 0, &{} \text {if } \ell \ge n. \end{array}\right. } \end{aligned}$$
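The quantity \(\lambda _L^\ell ({\varvec{A}},b)\) defined in Theorem 8 can be computed for small matrices as follows. This is a sketch rather than a production Lanczos routine: it builds an orthonormal basis of the Krylov subspace with full reorthogonalization instead of the three-term recurrence, which yields the same Rayleigh-Ritz value in exact arithmetic. The function name and the breakdown tolerance are assumptions of this sketch.

```python
import numpy as np

def krylov_ritz_value(A, b, num_iters):
    """Largest Rayleigh quotient of A over span(b, A b, ..., A^(num_iters-1) b).

    Uses Gram-Schmidt with full reorthogonalization (not the three-term
    Lanczos recurrence) to build an orthonormal Krylov basis.
    """
    basis = []
    v = b / np.linalg.norm(b)
    for _ in range(num_iters):
        basis.append(v)
        w = A @ v
        for _pass in range(2):                 # two passes for numerical stability
            for q in basis:
                w = w - (q @ w) * q
        norm_w = np.linalg.norm(w)
        if norm_w < 1e-10:                     # Krylov subspace is already invariant
            break
        v = w / norm_w
    Q = np.column_stack(basis)
    # Leading eigenvalue of the projected matrix equals the maximal Rayleigh quotient.
    return float(np.linalg.eigvalsh(Q.T @ A @ Q)[-1])
```

Since the Krylov subspaces are nested, the returned value is non-decreasing in `num_iters`, and for a generic start vector it reaches \(\lambda _1({\varvec{A}})\) once `num_iters` equals the dimension of the matrix.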

Using this result, Theorem 7 is proven as follows. Since the tangent space \(T_{\varvec{\sigma }}{\mathcal {M}}_r\) has dimension \(n(r-1)\), we can define a symmetric matrix \({\varvec{H}}\in {{\mathbb {R}}}^{n(r-1) \times n(r-1)}\) (where we drop the notational dependency on \(\varvec{\sigma }\) for simplicity) that represents the linear operator \({\mathrm {Hess}}f(\varvec{\sigma })\) in a basis \(\{ {\varvec{u}}^1,\dots ,{\varvec{u}}^{n(r-1)} \}\) such that \(\mathrm {span}({\varvec{u}}^1,\dots ,{\varvec{u}}^{n(r-1)}) = T_{\varvec{\sigma }}{\mathcal {M}}_r\). In particular, letting \(H_{ij} = {\langle {\varvec{u}}^i, {\mathrm {Hess}}f(\varvec{\sigma })[{\varvec{u}}^j] \rangle }\) yields the desired matrix \({\varvec{H}}\), and the Lanczos algorithm is run to find the leading eigenvalue of this matrix. Here, it is important to note that \({\varvec{H}}\) is not necessarily positive semidefinite, so \({\varvec{H}}\) must be shifted by a large enough multiple of the identity matrix to guarantee that the resulting matrix is positive semidefinite. In particular, by inspecting the definition of \({\mathrm {Hess}}f(\varvec{\sigma })\) in (5), it is easy to observe that \(\Vert {{\mathrm {Hess}}f(\varvec{\sigma })}\Vert _{\text {op}} \le 4\Vert {{\varvec{A}}}\Vert _1\). Therefore, it is sufficient to run the Lanczos algorithm to find the leading eigenvalue of \({\widetilde{{\varvec{H}}}} = {\varvec{H}}+4\Vert {{\varvec{A}}}\Vert _1 {\varvec{I}}\), where \({\varvec{I}}\) denotes the identity matrix of appropriate size. We initialize the Lanczos algorithm with a random vector \({\varvec{u}}\) of unit norm (i.e., \(\Vert {\varvec{u}}\Vert _{\mathrm {F}}=1\)) in the tangent space \(T_{\varvec{\sigma }}{\mathcal {M}}_r\). Notice that, for an orthonormal basis, \({\varvec{u}}\) can equivalently be represented as a vector \(b\in {{\mathbb {R}}}^{n(r-1)}\) in the basis \(\{ {\varvec{u}}^1,\dots ,{\varvec{u}}^{n(r-1)} \}\) as \({\varvec{u}}= \sum _{i=1}^{n(r-1)} b_i {\varvec{u}}^i\) such that \(\Vert {b}\Vert =1\). Then, by Theorem 8, we have

$$\begin{aligned} \mathbb {P}\left( \lambda _L^\ell ({\widetilde{{\varvec{H}}}},b) < (1-\epsilon ) \lambda _1({\widetilde{{\varvec{H}}}}) \right) \le 1.648 \sqrt{n(r-1)} e^{-\sqrt{\epsilon }(2\ell -1)}. \end{aligned}$$

Letting \(\lambda _1({\varvec{H}})\) denote the leading eigenvalue of \({\varvec{H}}\), we run the Lanczos algorithm to obtain a vector \(b^*\) such that \(\Vert {b^*}\Vert =1\) and \({\langle b^*, {\varvec{H}}b^* \rangle } \ge \lambda _1({\varvec{H}})/2\). Thus, we want \(\mathbb {P}\left( \lambda _L^\ell ({\widetilde{{\varvec{H}}}},b) < 4\Vert {{\varvec{A}}}\Vert _1 + \lambda _1({\varvec{H}})/2 \right) \) to be small. Setting \(\epsilon ^* = \frac{\lambda _1({\varvec{H}})}{16\Vert {{\varvec{A}}}\Vert _1}\), we can observe that

$$\begin{aligned} \left( 1-\epsilon ^* \right) \lambda _1({\widetilde{{\varvec{H}}}})&= \left( 1-\frac{\lambda _1({\varvec{H}})}{16\Vert {{\varvec{A}}}\Vert _1} \right) \left( 4\Vert {{\varvec{A}}}\Vert _1 + \lambda _1({\varvec{H}}) \right) , \\&= 4\Vert {{\varvec{A}}}\Vert _1 + \frac{3\lambda _1({\varvec{H}})}{4} - \frac{(\lambda _1({\varvec{H}}))^2}{16\Vert {{\varvec{A}}}\Vert _1}, \\&\ge 4\Vert {{\varvec{A}}}\Vert _1 + \frac{\lambda _1({\varvec{H}})}{2}, \end{aligned}$$

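where the inequality follows since \(0 < \lambda _1({\varvec{H}}) \le 4\Vert {{\varvec{A}}}\Vert _1\) (the case of interest, in which the current point is not yet an approximate concave point) gives \(\frac{(\lambda _1({\varvec{H}}))^2}{16\Vert {{\varvec{A}}}\Vert _1} \le \frac{\lambda _1({\varvec{H}})}{4}\). Consequently, we have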

$$\begin{aligned}&\mathbb {P}\left( \lambda _L^\ell ({\widetilde{{\varvec{H}}}},b)< 4\Vert {{\varvec{A}}}\Vert _1 + \lambda _1({\varvec{H}})/2 \right) \\&\quad \le \mathbb {P}\left( \lambda _L^\ell ({\widetilde{{\varvec{H}}}},b) < (1-\epsilon ^*) \lambda _1({\widetilde{{\varvec{H}}}}) \right) \le 1.648 \sqrt{n(r-1)} e^{-\sqrt{\epsilon ^*}(2\ell -1)}. \end{aligned}$$
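The choice \(\epsilon ^* = \frac{\lambda _1({\varvec{H}})}{16\Vert {{\varvec{A}}}\Vert _1}\) can be sanity-checked by evaluating the slack in the inequality \((1-\epsilon ^*)\lambda _1({\widetilde{{\varvec{H}}}}) \ge 4\Vert {{\varvec{A}}}\Vert _1 + \lambda _1({\varvec{H}})/2\) over a range of values. A minimal sketch, with a hypothetical function name, where `lam` stands for \(\lambda _1({\varvec{H}})\) and `norm_1` for \(\Vert {{\varvec{A}}}\Vert _1\):

```python
def shifted_threshold_gap(lam, norm_1):
    """Slack (1 - eps_star) (4||A||_1 + lam) - (4||A||_1 + lam / 2).

    Here eps_star = lam / (16 ||A||_1); algebraically the slack equals
    lam / 4 - lam**2 / (16 ||A||_1).
    """
    eps_star = lam / (16.0 * norm_1)
    return (1.0 - eps_star) * (4.0 * norm_1 + lam) - (4.0 * norm_1 + lam / 2.0)
```

The slack is nonnegative on \(0 \le \lambda _1({\varvec{H}}) \le 4\Vert {{\varvec{A}}}\Vert _1\), vanishes at both endpoints, and becomes negative beyond \(4\Vert {{\varvec{A}}}\Vert _1\), which is why the operator-norm bound on the Hessian is needed.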

By Theorem 6, we know that the Lanczos method is called at most \(\left\lceil 675 n \Vert {{\varvec{A}}}\Vert _1^2 / \varepsilon ^2 \right\rceil \) times to search for an \(\varepsilon \)-approximate concave point and for any non-desired solution we have \(\lambda _1({\varvec{H}}) \ge \varepsilon \) by the definition of \(\varepsilon \)-approximate concave point. Then, by using a union bound over all calls to the Lanczos method, we conclude that when the Lanczos method is run for \(\ell \) iterations, we have the following guarantee

$$\begin{aligned}&\mathbb {P}\left( \text {Algorithm 2+3 fails to return an } \varepsilon \text {-approximate concave point} \right) \\&\quad \le \left\lceil \frac{675 n \Vert {{\varvec{A}}}\Vert _1^2}{\varepsilon ^2} \right\rceil 1.648 \sqrt{n(r-1)} e^{-\sqrt{\frac{\varepsilon }{16\Vert {{\varvec{A}}}\Vert _1}}(2\ell -1)}. \end{aligned}$$

In order to bound this probability by some \(\delta \in (0,1)\), we let

$$\begin{aligned} \ell ^*= & {} \left\lceil \left( \frac{1}{2} + 2 \sqrt{\frac{\Vert {{\varvec{A}}}\Vert _1}{\varepsilon }} \right) \log \left( \frac{\left\lceil \frac{675 n \Vert {{\varvec{A}}}\Vert _1^2}{\varepsilon ^2} \right\rceil 1.648 \sqrt{n(r-1)}}{\delta } \right) \right\rceil \\= & {} \widetilde{{\mathcal {O}}}\left( \sqrt{\frac{\Vert {{\varvec{A}}}\Vert _1}{\varepsilon }} \log \left( \frac{n\sqrt{n(r-1)}}{\delta } \right) \right) , \end{aligned}$$

where the tilde is used to hide poly-logarithmic factors in \(\Vert {{\varvec{A}}}\Vert _1 / \varepsilon \). Since the Lanczos algorithm is guaranteed to return the leading eigenvalue with probability 1 in at most \(n(r-1)\) iterations, running each Lanczos subroutine for \(\min (\ell ^*,n(r-1))\) iterations guarantees that Algorithm 2+3 returns an \(\varepsilon \)-approximate concave point with probability at least \(1-\delta \).
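To illustrate the resulting iteration count, the sketch below evaluates \(\ell ^*\) and the union bound on the failure probability for given parameters. The function names are hypothetical, `norm_1` stands for \(\Vert {{\varvec{A}}}\Vert _1\), and the formulas are transcribed from the two displays above.

```python
import math

def lanczos_iterations(n, r, norm_1, eps, delta):
    """Return (min(l_star, n (r - 1)), l_star) for the formula for l_star above."""
    calls = math.ceil(675.0 * n * norm_1 ** 2 / eps ** 2)   # bound on Lanczos calls
    dim = n * (r - 1)                                        # tangent space dimension
    log_term = math.log(calls * 1.648 * math.sqrt(dim) / delta)
    l_star = math.ceil((0.5 + 2.0 * math.sqrt(norm_1 / eps)) * log_term)
    return min(l_star, dim), l_star

def failure_bound(n, r, norm_1, eps, num_iters):
    """Union bound on the failure probability after num_iters Lanczos iterations."""
    calls = math.ceil(675.0 * n * norm_1 ** 2 / eps ** 2)
    rate = math.sqrt(eps / (16.0 * norm_1))
    return calls * 1.648 * math.sqrt(n * (r - 1)) * math.exp(-rate * (2 * num_iters - 1))
```

Evaluating `failure_bound` at the returned `l_star` gives a value no larger than `delta` whenever the logarithmic term is at least 1, which is the regime covered by the argument above.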

About this article

Cite this article

Erdogdu, M.A., Ozdaglar, A., Parrilo, P.A. et al. Convergence rate of block-coordinate maximization Burer–Monteiro method for solving large SDPs. Math. Program. 195, 243–281 (2022). https://doi.org/10.1007/s10107-021-01686-3

  • DOI: https://doi.org/10.1007/s10107-021-01686-3
