
Linear convergence of Frank–Wolfe for rank-one matrix recovery without strong convexity

  • Full Length Paper
  • Series A
  • Mathematical Programming

Abstract

We consider convex optimization problems which are widely used as convex relaxations for low-rank matrix recovery problems. In particular, in several important problems, such as phase retrieval and robust PCA, the underlying assumption in many cases is that the optimal solution is rank-one. In this paper we consider a simple and natural sufficient condition on the objective so that the optimal solution to these relaxations is indeed unique and rank-one. Mainly, we show that under this condition, the standard Frank–Wolfe method with line-search (i.e., without any tuning of parameters whatsoever), which only requires a single rank-one SVD computation per iteration, finds an \(\epsilon \)-approximated solution in only \(O(\log {1/\epsilon })\) iterations (as opposed to the previous best known bound of \(O(1/\epsilon )\)), despite the fact that the objective is not strongly convex. We consider several variants of the basic method with improved complexities, as well as an extension motivated by robust PCA, and finally, an extension to nonsmooth problems.
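To make the algorithmic setting concrete, the following is a minimal Python sketch of standard Frank–Wolfe with line-search over the spectrahedron \(\mathcal {S}_n=\{{\mathbf {X}}\succeq 0,~\text {Tr}({\mathbf {X}})=1\}\). It is an illustrative sketch only, not the paper's implementation: the function name, the numerical bounded line-search, and the toy quadratic objective at the end are our own choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def frank_wolfe_spectrahedron(f, grad, X0, max_iters=500, tol=1e-8):
    """Sketch of Frank-Wolfe with line-search over the spectrahedron
    S_n = {X symmetric PSD, Tr(X) = 1}.

    The linear minimization oracle over S_n is v v^T, where v is a unit
    eigenvector of the smallest eigenvalue of grad f(X)."""
    X = X0.copy()
    for _ in range(max_iters):
        G = grad(X)
        w, U = np.linalg.eigh(G)             # eigenvalues in ascending order
        v = U[:, 0]                          # eigenvector of smallest eigenvalue
        D = np.outer(v, v) - X               # Frank-Wolfe direction V - X
        gap = -np.sum(G * D)                 # duality gap <X - V, grad f(X)>
        if gap <= tol:                       # gap upper-bounds f(X) - f(X*)
            break
        # line-search for eta in [0, 1] (numerical; closed-form for quadratics)
        eta = minimize_scalar(lambda t: f(X + t * D),
                              bounds=(0.0, 1.0), method='bounded').x
        X = X + eta * D
    return X

# Toy usage (illustrative objective, not from the paper):
# f(X) = 0.5 * ||X - M||_F^2 with M a rank-one matrix in S_n.
n = 30
rng = np.random.default_rng(0)
u = rng.standard_normal(n); u /= np.linalg.norm(u)
M = np.outer(u, u)
f = lambda X: 0.5 * np.linalg.norm(X - M, 'fro') ** 2
grad = lambda X: X - M
X_hat = frank_wolfe_spectrahedron(f, grad, np.eye(n) / n)
print("final error:", np.linalg.norm(X_hat - M, 'fro'))
```

In large-scale settings the full eigendecomposition in the sketch would be replaced by a single approximate smallest-eigenvector computation (e.g., a Lanczos-type method), matching the single rank-one SVD per iteration discussed in the abstract.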


Notes

  1. Here we note that while some problems, such as phase retrieval, are usually formulated as optimization over matrices with complex entries, our results are applicable in a straightforward manner to optimization over the corresponding spectrahedron \(\{{\mathbf {X}}\in {\mathbb {C}}^{n\times n} ~|~{\mathbf {X}}\succeq 0,~\text {Tr}({\mathbf {X}})=1\}\). However, for simplicity of presentation we focus on matrices with real entries.

  2. In close proximity to an optimal solution it is quite plausible that only low-rank SVD computations will be needed to compute the proximal step; see for instance our recent work [16].

  3. Extending this discussion to the case in which these eigenvalues are only approximated up to sufficient precision is straightforward.

  4. This quantity is known as the duality gap and it is indeed an upper bound on the approximation error since \(f(\cdot )\) is convex; see [23].

  5. Here we make an implicit assumption that it is computationally efficient to compute Euclidean projections onto the set \(\mathcal {K}\).

  6. Recall that according to the previous lemma the gradient vector is constant over the set of optimal solutions and thus, this is equivalent to assuming the eigen-gap holds for some optimal solution.

  7. The bound on the approximation error is verified by computing the duality gap, which is an upper bound on the approximation error w.r.t. the function value (see for instance [23]).

  8. We note this is a common initialization for Frank–Wolfe; in fact, it is equivalent to initializing Frank–Wolfe with \(\tau \cdot {\mathbf {x}}{\mathbf {x}}^{\top }\) and running for one iteration with the classical step-size rule \(\eta _t = \frac{2}{t+1}\) (see the short check following this list).
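To make footnote 8 concrete, here is the one-line check (written with the standard Frank–Wolfe update \({\mathbf {X}}_{t} = {\mathbf {X}}_{t-1} + \eta _t({\mathbf {V}}_t - {\mathbf {X}}_{t-1})\), where \({\mathbf {V}}_t\) is the vertex returned by the linear minimization oracle; the notation here is ours): for \(t=1\) the classical rule gives \(\eta _1 = \frac{2}{1+1} = 1\), and hence

$$\begin{aligned} {\mathbf {X}}_1 = (1-\eta _1)\,\tau {\mathbf {x}}{\mathbf {x}}^{\top } + \eta _1{\mathbf {V}}_1 = {\mathbf {V}}_1, \end{aligned}$$

independently of the initial point \(\tau {\mathbf {x}}{\mathbf {x}}^{\top }\), so the stated initialization coincides with the outcome of that single iteration.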

References

  1. Allen-Zhu, Z., Hazan, E., Hu, W., Li, Y.: Linear convergence of a Frank–Wolfe type algorithm over trace-norm balls. Adv. Neural Inf. Process. Syst. 30, 6192–6201 (2017)

  2. Beck, A.: First-Order Methods in Optimization. SIAM (2017)

  3. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  4. Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. 22(2), 557–580 (2012)

  5. Bhojanapalli, S., Neyshabur, B., Srebro, N.: Global optimality of local search for low rank matrix recovery. In: Advances in Neural Information Processing Systems, pp. 3873–3881 (2016)

  6. Candès, E.J., Eldar, Y.C., Strohmer, T., Voroninski, V.: Phase retrieval via matrix completion. SIAM Rev. 57(2), 225–251 (2015)

  7. Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3), 11 (2011)

  8. Candès, E.J., Recht, B.: Exact matrix completion via convex optimization. Found. Comput. Math. 9(6), 717–772 (2009)

  9. Chen, Y., Wainwright, M.J.: Fast low-rank estimation by projected gradient descent: general statistical and algorithmic guarantees. arXiv:1509.03025 (2015)

  10. Demyanov, V.F., Rubinov, A.M.: Approximate Methods in Optimization Problems. Elsevier, Amsterdam (1970)

  11. Ding, L., Fei, Y., Xu, Q., Yang, C.: Spectral Frank–Wolfe algorithm: strict complementarity and linear convergence. In: International Conference on Machine Learning, pp. 2535–2544. PMLR (2020)

  12. Drusvyatskiy, D., Lewis, A.S.: Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43(3), 919–948 (2018)

  13. Freund, R.M., Grigas, P., Mazumder, R.: An extended Frank–Wolfe method with in-face directions, and its application to low-rank matrix completion. SIAM J. Optim. 27(1), 319–346 (2017)

  14. Garber, D.: Faster projection-free convex optimization over the spectrahedron. In: Advances in Neural Information Processing Systems, pp. 874–882 (2016)

  15. Garber, D.: Linear convergence of Frank–Wolfe for rank-one matrix recovery without strong convexity (2019)

  16. Garber, D.: On the convergence of projected-gradient methods with low-rank projections for smooth convex minimization over trace-norm balls and related problems. arXiv:1902.01644 (2019)

  17. Garber, D., Hazan, E.: Faster rates for the Frank–Wolfe method over strongly-convex sets. In: 32nd International Conference on Machine Learning, ICML 2015 (2015)

  18. Garber, D., Hazan, E., Ma, T.: Online learning of eigenvectors. In: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, pp. 560–568 (2015)

  19. Garber, D., Kaplan, A.: Fast stochastic algorithms for low-rank and nonsmooth matrix problems. In: The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, Naha, Okinawa, Japan, pp. 286–294 (2019)

  20. Garber, D., Sabach, S., Kaplan, A.: Fast generalized conditional gradient method with applications to matrix recovery problems. arXiv:1802.05581 (2018)

  21. Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: Advances in Neural Information Processing Systems, pp. 2973–2981 (2016)

  22. Golub, G.H., Van Loan, C.F.: Matrix Computations. JHU Press, Baltimore (2012)

  23. Jaggi, M.: Revisiting Frank–Wolfe: projection-free sparse convex optimization. In: Proceedings of the 30th International Conference on Machine Learning, ICML (2013)

  24. Jaggi, M., Sulovsky, M.: A simple algorithm for nuclear norm regularized problems. In: Proceedings of the 27th International Conference on Machine Learning, ICML (2010)

  25. Jain, P., Meka, R., Dhillon, I.: Guaranteed rank minimization via singular value projection. In: Advances in Neural Information Processing Systems, pp. 937–945 (2010)

  26. Jain, P., Tewari, A., Kar, P.: On iterative hard thresholding methods for high-dimensional M-estimation. In: Advances in Neural Information Processing Systems, pp. 685–693 (2014)

  27. Laue, S.: A hybrid algorithm for convex semidefinite optimization. In: Proceedings of the 29th International Conference on Machine Learning, pp. 1083–1090. Omnipress (2012)

  28. Mu, C., Zhang, Y., Wright, J., Goldfarb, D.: Scalable robust matrix recovery: Frank–Wolfe meets proximal methods. SIAM J. Sci. Comput. 38(5), A3291–A3317 (2016)

  29. Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 175(1–2), 69–107 (2019)

  30. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Springer, Berlin (2013)

  31. Netrapalli, P., Jain, P., Sanghavi, S.: Phase retrieval using alternating minimization. In: Advances in Neural Information Processing Systems, pp. 2796–2804 (2013)

  32. Netrapalli, P., Niranjan, U.N., Sanghavi, S., Anandkumar, A., Jain, P.: Non-convex robust PCA. In: Advances in Neural Information Processing Systems, pp. 1107–1115 (2014)

  33. Recht, B.: A simpler approach to matrix completion. J. Mach. Learn. Res. 12, 3413–3430 (2011)

  34. Richard, E., Savalle, P.-A., Vayatis, N.: Estimation of simultaneously sparse and low rank matrices. In: Proceedings of the 29th International Conference on Machine Learning (2012)

  35. Tropp, J.A.: Convex recovery of a structured signal from independent random linear measurements. In: Sampling Theory, a Renaissance, pp. 67–101. Springer (2015)

  36. Wright, J., Ganesh, A., Rao, S., Peng, Y., Ma, Y.: Robust principal component analysis: exact recovery of corrupted low-rank matrices via convex optimization. In: Advances in Neural Information Processing Systems, pp. 2080–2088 (2009)

  37. Yi, X., Park, D., Chen, Y., Caramanis, C.: Fast algorithms for robust PCA via gradient descent. In: Advances in Neural Information Processing Systems, pp. 4152–4160 (2016)

  38. Yurtsever, A., Udell, M., Tropp, J.A., Cevher, V.: Sketchy decisions: convex low-rank matrix optimization with optimal storage. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, Fort Lauderdale, FL, USA, pp. 1188–1196 (2017)

  39. Zhou, Z., So, A.M.-C.: A unified approach to error bounds for structured convex optimization problems. Math. Program. 165(2), 689–728 (2017)


Acknowledgements

We would like to thank both of the anonymous referees whose many excellent comments and suggestions have significantly improved the presentation of this paper.

Author information

Corresponding author

Correspondence to Dan Garber.


Appendix

A. Proof of Lemma 2

The lemma is an adaptation of Lemma 3 in [16] (which considers optimization over trace-norm balls). We restate and prove a slightly more general version of the lemma.

Lemma 10

Let \(f:\mathbb {S}^n\rightarrow \mathbb {R}\) be \(\beta \)-smooth and convex. Let \({\mathbf {X}}^*\in \mathcal {S}_n\) be an optimal solution of rank r to the optimization problem \(\min _{{\mathbf {X}}\in \mathcal {S}_n}f({\mathbf {X}})\). Let \(\lambda _1,\dots ,\lambda _n\) denote the eigenvalues of \(\nabla {}f({\mathbf {X}}^*)\) in non-increasing order. Let \(\zeta \) be a non-negative scalar. It holds that

$$\begin{aligned} \text {rank}(\varPi _{(1+\zeta )\mathcal {S}_n}[{\mathbf {X}}^*-\beta ^{-1}\nabla {}f({\mathbf {X}}^*)])> r \quad \Longleftrightarrow \quad \zeta > r\beta ^{-1}(\lambda _{n-r}-\lambda _n), \end{aligned}$$

where \((1+\zeta )\mathcal {S}_n = \{(1+\zeta ){\mathbf {X}}~|~{\mathbf {X}}\in \mathcal {S}_n\}\), and \(\varPi _{(1+\zeta )\mathcal {S}_n}[\cdot ]\) denotes the Euclidean projection onto the convex set \((1+\zeta )\mathcal {S}_n\).

Proof

Let us write the eigen-decomposition of \({\mathbf {X}}^*\) as \({\mathbf {X}}^*=\sum _{i=1}^r\lambda _i^*{\mathbf {v}}_i{\mathbf {v}}_i^{\top }\). It follows from the optimality of \({\mathbf {X}}^*\) that for all \(i\in [r]\), \({\mathbf {v}}_i\) is also an eigenvector of \(\nabla {}f({\mathbf {X}}^*)\) which corresponds to the smallest eigenvalue \(\lambda _n\) (see Lemma 7 in [16]). Thus, if we let \(\rho _1,\dots ,\rho _n\) denote the eigenvalues (in non-increasing order) of \({\mathbf {Y}}:= {\mathbf {X}}^*-\beta ^{-1}\nabla {}f({\mathbf {X}}^*)\), it holds that

$$\begin{aligned} \forall i\in [r]:&\quad \rho _i = \lambda _i^* - \beta ^{-1}\lambda _n;\\ \forall i>r:&\quad \rho _i = \lambda _i^* - \beta ^{-1}\lambda _{n-i+1}. \end{aligned}$$

Recall that \(\sum _{i=1}^r\lambda _i^* =1\) and \(\lambda _i^* = 0\) for all \(i>r\).

It is well known that for any matrix \({\mathbf {M}}\in \mathbb {S}^n\) with eigen-decomposition \({\mathbf {M}}=\sum _{i=1}^n\sigma _i{\mathbf {u}}_i{\mathbf {u}}_i^{\top }\), the projection of \({\mathbf {M}}\) onto the set \((1+\zeta )\mathcal {S}_n\), for any \(\zeta \ge 0\) is given by

$$\begin{aligned} \varPi _{(1+\zeta )\mathcal {S}_n}[{\mathbf {M}}] = \sum _{i=1}^n\max \{0,~\sigma _i-\sigma \}{\mathbf {u}}_i{\mathbf {u}}_i^{\top }, \end{aligned}$$

where \(\sigma \in \mathbb {R}\) is the unique scalar such that \(\sum _{i=1}^n\max \{0,~\sigma _i-\sigma \} = 1+\zeta \).
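As an aside, this eigenvalue-thresholding projection is simple to implement. The following is a minimal NumPy sketch (our own illustrative code; the function name and the sorted cumulative-sum search for the threshold \(\sigma \) are implementation choices not taken from the paper):

```python
import numpy as np

def project_scaled_spectrahedron(M, zeta=0.0):
    """Euclidean projection of a symmetric matrix M onto
    (1 + zeta) * S_n = {X PSD, Tr(X) = 1 + zeta}:
    eigen-decompose M and soft-threshold its spectrum at the unique
    sigma with sum_i max(0, sigma_i - sigma) = 1 + zeta."""
    radius = 1.0 + zeta
    sig, U = np.linalg.eigh(M)
    sig, U = sig[::-1], U[:, ::-1]              # non-increasing eigenvalues
    css = np.cumsum(sig)
    k = np.arange(1, sig.size + 1)
    rho = k[sig - (css - radius) / k > 0][-1]   # largest active index
    sigma = (css[rho - 1] - radius) / rho       # the threshold from the text
    lam = np.maximum(sig - sigma, 0.0)
    return (U * lam) @ U.T                      # U diag(lam) U^T
```

For \(\zeta =0\) this reduces to the usual Euclidean projection onto \(\mathcal {S}_n\).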

Now, we can see that \(\text {rank}(\varPi _{(1+\zeta )\mathcal {S}_n}[{\mathbf {Y}}]) \le r\) if and only if \(\sigma \ge \rho _{r+1} = -\beta ^{-1}\lambda _{n-r}\). Thus, if \(\text {rank}(\varPi _{(1+\zeta )\mathcal {S}_n}[{\mathbf {Y}}]) \le r\) then it must hold that \(\sigma \ge -\beta ^{-1}\lambda _{n-r}\) which implies that

$$\begin{aligned} 1 + \zeta&= \sum _{i=1}^n\max \{0,~\rho _i-\sigma \} \nonumber \\&= \sum _{i=1}^r\max \{0,~\rho _i-\sigma \} \le \sum _{i=1}^r\max \{0,~\rho _i-(-\beta ^{-1}\lambda _{n-r})\} \nonumber \\&= \sum _{i=1}^r(\rho _i-(-\beta ^{-1}\lambda _{n-r})) = \sum _{i=1}^r(\lambda _i^* +\beta ^{-1}(\lambda _{n-r}-\lambda _n)) = 1 + \beta ^{-1}{}r(\lambda _{n-r}-\lambda _{n}). \end{aligned}$$
(37)

However, (37) can hold only if \(\zeta \le \beta ^{-1}{}r(\lambda _{n-r}-\lambda _n)\). Thus, we have \(\text {rank}(\varPi _{(1+\zeta )\mathcal {S}_n}[{\mathbf {Y}}]) \le r \Longrightarrow \zeta \le \beta ^{-1}{}r(\lambda _{n-r}-\lambda _n)\).

On the other hand, if \(\text {rank}(\varPi _{(1+\zeta )\mathcal {S}_n}[{\mathbf {Y}}]) > r\) then it must hold that \(\sigma < -\beta ^{-1}\lambda _{n-r}\) which, using the same arguments as above, implies that

$$\begin{aligned} 1 + \zeta&= \sum _{i=1}^n\max \{0,~\rho _i-\sigma \} \nonumber \\&> \sum _{i=1}^r\max \{0,~\rho _i-(-\beta ^{-1}\lambda _{n-r})\} =1 + \beta ^{-1}{}r(\lambda _{n-r}-\lambda _{n}). \end{aligned}$$
(38)

We see that (38) can hold only if \(\zeta > \beta ^{-1}{}r(\lambda _{n-r}-\lambda _n)\). Thus, we also have \(\text {rank}(\varPi _{(1+\zeta )\mathcal {S}_n}[{\mathbf {Y}}])> r \Longrightarrow \zeta > \beta ^{-1}{}r(\lambda _{n-r}-\lambda _n)\), and the lemma follows. \(\square \)
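As a quick sanity check (an illustrative specialization, not quoted from the paper): for \(r=1\) and \(\zeta =0\) the condition \(\zeta > \beta ^{-1}(\lambda _{n-1}-\lambda _n)\) can never hold, since \(\lambda _{n-1}\ge \lambda _n\), and so the lemma asserts that \(\varPi _{\mathcal {S}_n}[{\mathbf {X}}^*-\beta ^{-1}\nabla {}f({\mathbf {X}}^*)]\) has rank at most one whenever \({\mathbf {X}}^*\) is a rank-one optimal solution.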

B. Proof of Lemma 3

We first restate the lemma and then prove it.

Lemma 11

Let \(f:\mathbb {S}^n\rightarrow \mathbb {R}\) be \(\beta \)-smooth and convex. Suppose that Assumption 1 holds w.r.t. \(f(\cdot )\) with some parameter \(\delta >0\). Let \({\tilde{f}}:\mathbb {S}^n\rightarrow \mathbb {R}\) be differentiable and convex, and suppose that \(\sup _{{\mathbf {X}}\in \mathcal {S}_n}\Vert {\nabla {}f({\mathbf {X}}) - \nabla {\tilde{f}}({\mathbf {X}})}\Vert _F \le \nu \), for some \(\nu > 0\). Then, for \(\nu < \frac{1}{2}(1+\frac{2\beta }{\delta })^{-1}\delta \), Assumption 1 holds w.r.t. the function \({\tilde{f}}(\cdot )\) with parameter \({\tilde{\delta }} = \delta - 2\nu (1+\frac{2\beta }{\delta }) > 0\).

Proof

Let \({\mathbf {X}}^*\) and \({\tilde{{\mathbf {X}}}}^*\) denote minimizers of \(f(\cdot )\) and \({\tilde{f}}(\cdot )\) over \(\mathcal {S}_n\), respectively. Since Assumption 1 holds w.r.t. \(f(\cdot )\), using the quadratic growth result of Lemma 4 we have that

$$\begin{aligned} \Vert {{\tilde{{\mathbf {X}}}}^* - {\mathbf {X}}^*}\Vert _F^2&\le \frac{2}{\delta }\left( {f({\tilde{{\mathbf {X}}}}^*)-f({\mathbf {X}}^*)}\right) \underset{(a)}{\le } \frac{2}{\delta }\langle {{\tilde{{\mathbf {X}}}}^*-{\mathbf {X}}^*, \nabla {}f({\tilde{{\mathbf {X}}}}^*)}\rangle \\&= \frac{2}{\delta }\left( {\langle {{\tilde{{\mathbf {X}}}}^*-{\mathbf {X}}^*, \nabla {}{\tilde{f}}({\tilde{{\mathbf {X}}}}^*)}\rangle + \langle {{\tilde{{\mathbf {X}}}}^*-{\mathbf {X}}^*, \nabla {}f({\tilde{{\mathbf {X}}}}^*)-\nabla {}{\tilde{f}}({\tilde{{\mathbf {X}}}}^*)}\rangle }\right) \\&\underset{(b)}{\le } \frac{2}{\delta } \langle {{\tilde{{\mathbf {X}}}}^*-{\mathbf {X}}^*, \nabla {}f({\tilde{{\mathbf {X}}}}^*) - \nabla {}{\tilde{f}}({\tilde{{\mathbf {X}}}}^*)\rangle } \underset{(c)}{\le } \frac{2\nu }{\delta }\Vert {{\tilde{{\mathbf {X}}}}^*-{\mathbf {X}}^*}\Vert _F, \end{aligned}$$

where (a) follows from convexity of \(f(\cdot )\), (b) follows from optimality of \({\tilde{{\mathbf {X}}}}^*\) w.r.t. \({\tilde{f}}(\cdot )\), and (c) follows from the Cauchy-Schwarz inequality and the assumption of the lemma that \(\sup _{{\mathbf {X}}\in \mathcal {S}_n}\Vert {\nabla {}f({\mathbf {X}}) - \nabla {\tilde{f}}({\mathbf {X}})}\Vert _F \le \nu \).

Thus, we get that \(\Vert {{\tilde{{\mathbf {X}}}}^*-{\mathbf {X}}^*}\Vert _F \le \frac{2\nu }{\delta }\).

Using Weyl's inequality for the eigenvalues (and Assumption 1 w.r.t. \(f(\cdot )\) for the second inequality below) we have that

$$\begin{aligned} \lambda _{n}(\nabla {}{\tilde{f}}({\tilde{{\mathbf {X}}}}^*))&\le \lambda _n(\nabla {}f({\mathbf {X}}^*)) + \Vert {\nabla {}{\tilde{f}}({\tilde{{\mathbf {X}}}}^*) - \nabla {}f({\mathbf {X}}^*)}\Vert _F \nonumber \\&\le \lambda _{n-1}(\nabla {}f({\mathbf {X}}^*)) - \delta + \Vert {\nabla {}{\tilde{f}}({\tilde{{\mathbf {X}}}}^*) - \nabla {}f({\mathbf {X}}^*)}\Vert _F \nonumber \\&\le \lambda _{n-1}(\nabla {}{\tilde{f}}({\tilde{{\mathbf {X}}}}^*)) - \delta + 2\Vert {\nabla {}{\tilde{f}}({\tilde{{\mathbf {X}}}}^*) - \nabla {}f({\mathbf {X}}^*)}\Vert _F. \end{aligned}$$
(39)

Using the smoothness of \(f(\cdot )\) and the assumption \(\sup _{{\mathbf {X}}\in \mathcal {S}_n}\Vert {\nabla {}f({\mathbf {X}}) - \nabla {\tilde{f}}({\mathbf {X}})}\Vert _F \le \nu \), we have that

$$\begin{aligned} \Vert {\nabla {}{\tilde{f}}({\tilde{{\mathbf {X}}}}^*) - \nabla {}f({\mathbf {X}}^*)}\Vert _F&\le \Vert {\nabla {}{\tilde{f}}({\tilde{{\mathbf {X}}}}^*) - \nabla {}f({\tilde{{\mathbf {X}}}}^*)}\Vert _F + \Vert {\nabla {}f({\tilde{{\mathbf {X}}}}^*) - \nabla {}f({\mathbf {X}}^*)}\Vert _F \nonumber \\&\le \nu + \beta \Vert {{\tilde{{\mathbf {X}}}}^*-{\mathbf {X}}^*}\Vert _F \le \nu \left( {1+\frac{2\beta }{\delta }}\right) . \end{aligned}$$
(40)

Plugging (40) into (39) and rearranging, we obtain

$$\begin{aligned} \lambda _{n-1}(\nabla {}{\tilde{f}}({\tilde{{\mathbf {X}}}}^*)) - \lambda _{n}(\nabla {}{\tilde{f}}({\tilde{{\mathbf {X}}}}^*)) \ge \delta - 2\nu \left( {1+\frac{2\beta }{\delta }}\right) . \end{aligned}$$

Thus, Assumption 1 indeed holds w.r.t. \({\tilde{f}}(\cdot )\) whenever \(\nu < \frac{1}{2}(1+\frac{2\beta }{\delta })^{-1}\delta \). \(\square \)
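For a concrete instance (illustrative, not from the paper): if the gradient perturbation level is \(\nu = \frac{\delta }{4}\left( {1+\frac{2\beta }{\delta }}\right) ^{-1} = \frac{\delta ^2}{4(\delta +2\beta )}\), then the lemma guarantees that Assumption 1 holds w.r.t. \({\tilde{f}}(\cdot )\) with parameter

$$\begin{aligned} {\tilde{\delta }} = \delta - 2\nu \left( {1+\frac{2\beta }{\delta }}\right) = \delta - \frac{\delta }{2} = \frac{\delta }{2}, \end{aligned}$$

i.e., at most half of the eigen-gap is lost at this perturbation level.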

About this article

Cite this article

Garber, D. Linear convergence of Frank–Wolfe for rank-one matrix recovery without strong convexity. Math. Program. 199, 87–121 (2023). https://doi.org/10.1007/s10107-022-01821-8
