
Fast Gradient Method for Low-Rank Matrix Estimation


Abstract

Projected gradient descent and its Riemannian variant form a typical class of methods for low-rank matrix estimation. This paper proposes a new Nesterov’s Accelerated Riemannian Gradient algorithm using an efficient orthographic retraction and tangent space projection. The subspace relationship between the iterative and extrapolated sequences on the low-rank matrix manifold provides computational convenience. With perturbation analysis of the truncated singular value decomposition and of two retractions, we systematically analyze the local convergence of gradient algorithms and Nesterov’s variants in the Euclidean and Riemannian settings. Theoretically, we give the exact rate of local linear convergence under different parameters via the spectral radius in closed form, together with the optimal convergence rate and the corresponding momentum parameter. When this parameter is unknown, the adaptive restart scheme avoids the oscillation problem caused by high momentum and thus approaches the optimal convergence rate. Extensive numerical experiments confirm the estimated convergence rates and demonstrate that the proposed algorithm is competitive with first-order methods for matrix completion and matrix sensing.

Data Availability

Enquiries about data availability should be directed to the authors.

Code Availability

The code used to perform the experiments in this paper is available at https://github.com/pxxyyz/FastGradient.

References

1. Absil, P.A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22(1), 135–158 (2012)

2. Absil, P.A., Oseledets, I.V.: Low-rank retractions: a survey and new results. Comput. Optim. Appl. 62(1), 5–29 (2015)

3. Ahn, K., Sra, S.: From Nesterov’s estimate sequence to Riemannian acceleration. In: Conference on Learning Theory, pp. 84–118. PMLR (2020)

4. Boumal, N.: An Introduction to Optimization on Smooth Manifolds. Cambridge University Press (2023)

5. Cai, J.F., Wei, K.: Exploiting the structure effectively and efficiently in low-rank matrix recovery. In: Handbook of Numerical Analysis, vol. 19, pp. 21–51. Elsevier (2018)

6. Chen, Y., Chi, Y.: Harnessing structures in big data via guaranteed low-rank matrix estimation: recent theory and fast algorithms via convex and nonconvex optimization. IEEE Sign. Process. Mag. 35(4), 14–31 (2018)

7. Chen, Y., Chi, Y., Fan, J., Ma, C.: Gradient descent with random initialization: fast global convergence for nonconvex phase retrieval. Math. Program. 176, 5–37 (2019)

8. Chen, Y., Chi, Y., Fan, J., Ma, C., et al.: Spectral methods for data science: a statistical perspective. Found. Trends Mach. Learn. 14(5), 566–806 (2021)

9. Chi, Y., Lu, Y.M., Chen, Y.: Nonconvex optimization meets low-rank matrix factorization: an overview. IEEE Trans. Sign. Process. 67(20), 5239–5269 (2019)

10. Chunikhina, E., Raich, R., Nguyen, T.: Performance analysis for matrix completion via iterative hard-thresholded SVD. In: 2014 IEEE Workshop on Statistical Signal Processing (SSP), pp. 392–395. IEEE (2014)

11. Davenport, M.A., Romberg, J.: An overview of low-rank matrix recovery from incomplete observations. IEEE J. Sel. Top. Sign. Process. 10(4), 608–622 (2016)

12. Duruisseaux, V., Leok, M.: A variational formulation of accelerated optimization on Riemannian manifolds. SIAM J. Math. Data Sci. 4(2), 649–674 (2022)

13. Gonzaga, C.C., Schneider, R.M.: On the steepest descent algorithm for quadratic functions. Comput. Optim. Appl. 63, 523–542 (2016)

14. Huang, J., Zhou, J.: A direct proof and a generalization for a Kantorovich type inequality. Linear Algebra Appl. 397, 185–192 (2005)

15. Huang, W., Wei, K.: An extension of fast iterative shrinkage-thresholding algorithm to Riemannian optimization for sparse principal component analysis. Numer. Linear Algebra Appl. 29(1), e2409 (2022)

16. Huang, Y., Dai, Y.H., Liu, X.W., Zhang, H.: On the asymptotic convergence and acceleration of gradient methods. J. Sci. Comput. 90, 1–29 (2022)

17. Jain, P., Meka, R., Dhillon, I.: Guaranteed rank minimization via singular value projection. Adv. Neural Inf. Process. Syst. 23 (2010)

18. Kim, D., Fessler, J.A.: Adaptive restart of the optimized gradient method for convex optimization. J. Optim. Theory Appl. 178(1), 240–263 (2018)

19. Kim, J., Yang, I.: Nesterov acceleration for Riemannian optimization. arXiv preprint arXiv:2202.02036 (2022)

20. Kyrillidis, A., Cevher, V.: Matrix recipes for hard thresholding methods. J. Math. Imag. Vis. 48, 235–265 (2014)

21. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)

22. Li, H., Fang, C., Lin, Z.: Accelerated first-order optimization algorithms for machine learning. Proc. IEEE 108(11), 2067–2082 (2020)

23. Li, H., Lin, Z.: Accelerated alternating direction method of multipliers: an optimal O(1/k) nonergodic analysis. J. Sci. Comput. 79, 671–699 (2019)

24. Liang, J., Fadili, J., Peyré, G.: Activity identification and local linear convergence of forward-backward-type methods. SIAM J. Optim. 27(1), 408–437 (2017)

25. Liang, J., Luo, T., Schönlieb, C.B.: Improving “fast iterative shrinkage-thresholding algorithm”: faster, smarter, and greedier. SIAM J. Sci. Comput. 44(3), A1069–A1091 (2022)

26. Luenberger, D.G., Ye, Y.: Linear and Nonlinear Programming, vol. 228. Springer Nature (2021)

27. Nesterov, Y.E.: A method of solving a convex programming problem with convergence rate \(O\left(\frac{1}{k^{2}}\right)\). In: Doklady Akademii Nauk, vol. 269, pp. 543–547. Russian Academy of Sciences (1983)

28. O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15, 715–732 (2015)

29. Park, J.: Accelerated additive Schwarz methods for convex optimization with adaptive restart. J. Sci. Comput. 89(3), 58 (2021)

30. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

31. Tanner, J., Wei, K.: Normalized iterative hard thresholding for matrix completion. SIAM J. Sci. Comput. 35(5), S104–S125 (2013)

32. Tong, T., Ma, C., Chi, Y.: Accelerating ill-conditioned low-rank matrix estimation via scaled gradient descent. J. Mach. Learn. Res. 22(1), 6639–6701 (2021)

33. Vandereycken, B.: Low-rank matrix completion by Riemannian optimization. SIAM J. Optim. 23(2), 1214–1236 (2013)

34. Vu, T., Raich, R.: Accelerating iterative hard thresholding for low-rank matrix completion via adaptive restart. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2917–2921. IEEE (2019)

35. Vu, T., Raich, R.: On local convergence of iterative hard thresholding for matrix completion. arXiv preprint arXiv:2112.14733 (2021)

36. Vu, T., Raich, R.: On asymptotic linear convergence of projected gradient descent for constrained least squares. IEEE Trans. Sign. Process. 70, 4061–4076 (2022)

37. Wang, D., He, Y., De Sterck, H.: On the asymptotic linear convergence speed of Anderson acceleration applied to ADMM. J. Sci. Comput. 88(2), 38 (2021)

38. Wang, H., Cai, J.F., Wang, T., Wei, K.: Fast Cadzow’s algorithm and a gradient variant. J. Sci. Comput. 88(2), 41 (2021)

39. Wang, R., Zhang, C., Wang, L., Shao, Y.: A stochastic Nesterov’s smoothing accelerated method for general nonsmooth constrained stochastic composite convex optimization. J. Sci. Comput. 93(2), 52 (2022)

40. Wei, K., Cai, J.F., Chan, T.F., Leung, S.: Guarantees of Riemannian optimization for low rank matrix recovery. SIAM J. Matrix Anal. Appl. 37(3), 1198–1222 (2016)

41. Wei, K., Cai, J.F., Chan, T.F., Leung, S.: Guarantees of Riemannian optimization for low rank matrix completion. Inverse Probl. Imag. 14(2), 233–265 (2020)

42. Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016)

43. Zhang, H., Sra, S.: Towards Riemannian accelerated gradient methods. arXiv preprint arXiv:1806.02812 (2018)

44. Zhang, T., Yang, Y.: Robust PCA by manifold optimization. J. Mach. Learn. Res. 19(1), 3101–3139 (2018)


Acknowledgements

The authors would like to thank the anonymous reviewers for their review and helpful comments.

Funding

This work was supported by the National Natural Science Foundation of China (Grant no. 61771001).

Author information

Correspondence to Chengwei Pan or Di Zhao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Auxiliary Lemmas

1.1 A.1 Relationship of Matrix Eigenvalues

Lemma 4

Let \(\varvec{\varTheta }\) be a symmetric positive semi-definite matrix, and \({\varvec{P}}\in {{\mathbb {R}}}^{n\times n}\) be an orthogonal projection matrix. Denote \({\varvec{P}}^{\perp }=\varvec{I}-{\varvec{P}}\). Then, for each non-zero eigenvalue \(\lambda \) of \(({\varvec{I}}-\mu \varvec{\varTheta } ) {\varvec{P}}^\perp \), there exists an index \(i\) such that

$$\begin{aligned} \lambda _{i}(\mu \varvec{\varTheta } {\varvec{P}}^\perp +\varvec{P})=\lambda _{i}(\mu \varvec{\varTheta } \varvec{P}^\perp )=1-\lambda . \end{aligned}$$

Proof

Assume \(\text {rank}({\varvec{P}})=r\). Since \({\varvec{P}}\) and \({\varvec{P}}^{\perp }\) are idempotent, their eigenvalues are of the form

$$\begin{aligned} \lambda ({\varvec{P}})=\{\underbrace{1,\ldots , 1}_{r},\underbrace{0,\ldots , 0}_{n-r}\},\lambda ({\varvec{P}}^{\perp })=\{\underbrace{1,\ldots , 1}_{n-r},\underbrace{0,\ldots , 0}_{r}\}. \end{aligned}$$

Here, \({\varvec{u}}_i\) and \({\varvec{v}}_j\) denote the eigenvectors of \({\varvec{P}}\) corresponding to the eigenvalues \(1\) and \(0\), respectively. From the orthogonal relation between \({\varvec{P}}\) and \({\varvec{P}}^{\perp }\), we have \(\varvec{P}{\varvec{u}}_i={\varvec{u}}_i,{\varvec{P}}\varvec{v}_j={\varvec{0}},{\varvec{P}}^\perp {\varvec{u}}_i=\varvec{0},{\varvec{P}}^\perp {\varvec{v}}_j={\varvec{v}}_j\). Further, we have

$$\begin{aligned} (\mu \varvec{\varTheta } {\varvec{P}}^\perp +\varvec{P}){\varvec{u}}_i={\varvec{u}}_i, (\mu \varvec{\varTheta } {\varvec{P}}^\perp ){\varvec{u}}_i={\varvec{0}}, (\mu \varvec{\varTheta } {\varvec{P}}^\perp +{\varvec{P}}){\varvec{v}}_j=(\mu \varvec{\varTheta } {\varvec{P}}^\perp ){\varvec{v}}_j=\mu \varvec{\varTheta } {\varvec{v}}_j.\nonumber \\ \end{aligned}$$
(29)

Hence \({\varvec{u}}_i\) is an eigenvector of \(\mu \varvec{\varTheta } {\varvec{P}}^\perp +{\varvec{P}}\) with eigenvalue \(1\) and an eigenvector of \(\mu \varvec{\varTheta } {\varvec{P}}^\perp \) with eigenvalue \(0\). Besides, if \({\varvec{v}}_j\) happens to be an eigenvector of \(\mu \varvec{\varTheta }\), then \(\varvec{v}_j\) is also an eigenvector of \(\mu \varvec{\varTheta } \varvec{P}^\perp +{\varvec{P}}\) and of \(\mu \varvec{\varTheta } \varvec{P}^\perp \). This observation suggests that the eigendecompositions of the three matrices above are related. To this end, assume that there exists a non-zero vector \({\varvec{x}}\) such that

$$\begin{aligned} ({\varvec{I}}-\mu \varvec{\varTheta } ) {\varvec{P}}^\perp {\varvec{x}}=\lambda {\varvec{x}}, \end{aligned}$$

then

$$\begin{aligned} (\mu \varvec{\varTheta } {\varvec{P}}^\perp +\varvec{P}){\varvec{x}}=(1-\lambda ){\varvec{x}}. \end{aligned}$$

For \(\lambda \ne 0\), we will discuss \({\varvec{P}}\varvec{x}=(1-\lambda ){\varvec{x}}-\mu \varvec{\varTheta } \varvec{P}^\perp {\varvec{x}}\) case by case:

Case 1: when \({\varvec{P}}{\varvec{x}}={\varvec{0}}\), i.e., \({\varvec{x}}\) is a linear combination of the \({\varvec{v}}_j\), we have \({\varvec{P}}^\perp {\varvec{x}}={\varvec{x}}\) and therefore \(\mu \varvec{\varTheta } {\varvec{P}}^\perp {\varvec{x}}=(1-\lambda ){\varvec{x}}\), so \(1-\lambda \) is an eigenvalue of the matrix \(\mu \varvec{\varTheta } {\varvec{P}}^\perp \).

Case 2: when \({\varvec{P}}{\varvec{x}}\ne {\varvec{0}}\), note first that \({\varvec{P}}^\perp {\varvec{x}}\ne {\varvec{0}}\); otherwise \(\lambda {\varvec{x}}=({\varvec{I}}-\mu \varvec{\varTheta } ){\varvec{P}}^\perp {\varvec{x}}={\varvec{0}}\) would contradict \(\lambda \ne 0\). Applying \({\varvec{P}}^\perp \) to both sides of \(({\varvec{I}}-\mu \varvec{\varTheta } ){\varvec{P}}^\perp {\varvec{x}}=\lambda {\varvec{x}}\) and using \(({\varvec{P}}^{\perp })^2={\varvec{P}}^{\perp }\) yields

$$\begin{aligned} {\varvec{P}}^\perp \mu \varvec{\varTheta } {\varvec{P}}^\perp ({\varvec{P}}^\perp {\varvec{x}})=(1-\lambda ){\varvec{P}}^\perp {\varvec{x}}, \end{aligned}$$

so \(1-\lambda \) is an eigenvalue of \({\varvec{P}}^\perp (\mu \varvec{\varTheta } {\varvec{P}}^\perp )\). Since \({\varvec{P}}^\perp (\mu \varvec{\varTheta } {\varvec{P}}^\perp )\) and \((\mu \varvec{\varTheta } {\varvec{P}}^\perp ){\varvec{P}}^\perp =\mu \varvec{\varTheta } {\varvec{P}}^\perp \) share the same non-zero eigenvalues, and both are singular, \(1-\lambda \) is an eigenvalue of \(\mu \varvec{\varTheta } {\varvec{P}}^\perp \) in this case as well. \(\square \)
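
A minimal NumPy sketch that checks Lemma 4 on a random instance; the dimension \(n\), rank \(r\), and stepsize \(\mu \) below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, mu = 8, 3, 0.1

# Random symmetric PSD Theta and a rank-r orthogonal projector P
A = rng.standard_normal((n, n))
Theta = A @ A.T                      # symmetric positive semi-definite
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
P = Q @ Q.T                          # orthogonal projection of rank r
P_perp = np.eye(n) - P

lam = np.linalg.eigvals((np.eye(n) - mu * Theta) @ P_perp)
ev1 = np.linalg.eigvals(mu * Theta @ P_perp + P)
ev2 = np.linalg.eigvals(mu * Theta @ P_perp)

# Every non-zero eigenvalue lambda of (I - mu*Theta) P_perp should reappear
# as 1 - lambda in the spectra of mu*Theta*P_perp + P and of mu*Theta*P_perp.
ok = all(np.any(np.isclose(ev1, 1 - l, atol=1e-7)) and
         np.any(np.isclose(ev2, 1 - l, atol=1e-7))
         for l in lam[np.abs(lam) > 1e-10])
print("every non-zero lambda yields 1 - lambda in both spectra:", ok)
```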

1.2 A.2 Perturbation Analysis of Subspaces

Lemma 5

(Wedin’s \(\sin \varTheta \) Theorem [8]) Let \({\varvec{X}}_t={\varvec{U}}_t \varvec{\varSigma }_t {\varvec{V}}_t^\top \) and \({\varvec{X}}_\star =\varvec{U}_\star \varvec{\varSigma }_\star {\varvec{V}}_\star ^\top \) be the SVD of \({\varvec{X}}_t, {\varvec{X}}_\star \in {{\mathbb {M}}}_r\), respectively. If \(\Vert {\varvec{X}}_t-\varvec{X}_\star \Vert <\sigma _r({\varvec{X}}_\star )\), there is an upper bound for the perturbation of the singular subspace as follows

$$\begin{aligned} \max \{\Vert P_{{\varvec{U}}_t}^\perp -P_{\varvec{U}_\star }^\perp \Vert ,\Vert P_{{\varvec{V}}_t}^\perp -P_{\varvec{V}_\star }^\perp \Vert \}\le \frac{2\Vert {\varvec{X}}_t-\varvec{X}_\star \Vert }{\sigma _r({\varvec{X}}_\star )}. \end{aligned}$$

Lemma 6

(Perturbation of subspace projection [40]) Let \({\varvec{X}}_t=\varvec{U}_t \varvec{\varSigma }_t {\varvec{V}}_t^\top \) and \(\varvec{X}_\star ={\varvec{U}}_\star \varvec{\varSigma }_\star \varvec{V}_\star ^\top \) be the SVD of \({\varvec{X}}_t, \varvec{X}_\star \in {{\mathbb {M}}}_r\), respectively. If \(\Vert \varvec{X}_t-{\varvec{X}}_\star \Vert <\sigma _r({\varvec{X}}_\star )\), then the following inequality is satisfied

$$\begin{aligned} \Vert P_{{\varvec{U}}_\star }^\perp {\varvec{X}}_t P_{\varvec{V}_\star }^\perp \Vert _F\le \frac{\Vert {\varvec{X}}_t-\varvec{X}_\star \Vert _F^2}{\sigma _r({\varvec{X}}_\star )}. \end{aligned}$$
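
Both perturbation bounds can be spot-checked numerically. The sketch below, assuming NumPy, draws a random rank-\(r\) pair \(({\varvec{X}}_t,{\varvec{X}}_\star )\) with an arbitrary small perturbation level and evaluates the inequalities of Lemma 5 and Lemma 6.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, r = 20, 15, 3

def svd_r(X):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]

# Rank-r ground truth X_star and a nearby rank-r iterate X_t
U, s, Vt = svd_r(rng.standard_normal((n1, n2)))
X_star = U @ np.diag(s) @ Vt
Ut, st, Vtt = svd_r(X_star + 0.01 * rng.standard_normal((n1, n2)))
X_t = Ut @ np.diag(st) @ Vtt

sigma_r = s[-1]
E = X_t - X_star
PU_perp = np.eye(n1) - U @ U.T          # P_{U_star}^perp
PV_perp = np.eye(n2) - Vt.T @ Vt        # P_{V_star}^perp
PUt_perp = np.eye(n1) - Ut @ Ut.T
PVt_perp = np.eye(n2) - Vtt.T @ Vtt

# Lemma 5 (Wedin): subspace perturbation bounded by 2*||E|| / sigma_r(X_star)
lhs5 = max(np.linalg.norm(PUt_perp - PU_perp, 2), np.linalg.norm(PVt_perp - PV_perp, 2))
print("Lemma 5 holds:", lhs5 <= 2 * np.linalg.norm(E, 2) / sigma_r)

# Lemma 6: the part of X_t outside both star subspaces is second order in ||E||_F
lhs6 = np.linalg.norm(PU_perp @ X_t @ PV_perp, 'fro')
print("Lemma 6 holds:", lhs6 <= np.linalg.norm(E, 'fro') ** 2 / sigma_r)
```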

B Proof of Theorem 1

Proof

Let the residual matrix \({\varvec{E}}_t=\varvec{X}_t-{\varvec{X}}_\star \). According to the iteration, we have

$$\begin{aligned} \begin{aligned} {\varvec{E}}_{t+1}&={\varvec{X}}_{t+1}-{\varvec{X}}_\star \\&={\mathcal {P}}_r({\varvec{X}}_{t}-\mu _t\nabla f({\varvec{X}}_t))-{\varvec{X}}_\star \\&={\mathcal {P}}_r({\varvec{X}}_\star +{\varvec{X}}_{t}-{\varvec{X}}_\star -\mu _t\nabla f({\varvec{X}}_t))-{\varvec{X}}_\star \\&={\mathcal {P}}_r({\varvec{X}}_\star +{\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))-{\varvec{X}}_\star \\&{\mathop {=}\limits ^{(a)}}({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))-P_{{\varvec{U}}_\star }^\perp ({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))P_{{\varvec{V}}_\star }^\perp +{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2),\\ \end{aligned} \end{aligned}$$
(30)

where \((a)\) is the first-order expansion (7) of the truncated SVD. Since \(\Vert {\mathcal {I}}-\mu _t\mathcal {A}^*{\mathcal {A}}\Vert \le 1\), we have \(\Vert {\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t)\Vert _F=\Vert ({\mathcal {I}}-\mu _t{\mathcal {A}}^*\mathcal {A})({\varvec{E}}_t)\Vert _F\le \Vert \varvec{E}_t\Vert _F\le \sigma _{{r}}({\varvec{X}}_\star )/2\), which verifies that the condition of Lemma 1 holds. After vectorizing, with \(\varvec{e}_{t+1}=\text {vec}({\varvec{E}}_{t+1})\), we get

$$\begin{aligned} \begin{aligned} {\varvec{e}}_{t+1}&=\text {vec}(({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))-P_{{\varvec{U}}_\star }^\perp ({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))P_{{\varvec{V}}_\star }^\perp )+{\mathcal {O}}(\Vert {\varvec{E}}_t\Vert _F^2)\\&{\mathop {=}\limits ^{(a)}}({\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )\text {vec}({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))+{\mathcal {O}}(\Vert {\varvec{E}}_t\Vert _F^2)\\&{\mathop {=}\limits ^{(b)}}\underbrace{({\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )({\varvec{I}}-\mu _t \varvec{\varTheta })}_{{\varvec{H}}(\mu _t)}{\varvec{e}}_{t}+{\mathcal {O}}(\Vert {\varvec{e}}_{t}\Vert _2^2),\\ \end{aligned} \end{aligned}$$

where \((a)\) uses the vectorization identity for the Kronecker product, i.e., \(\text {vec}({\varvec{A}}{\varvec{B}}{\varvec{C}})=(\varvec{C}^\top \otimes {\varvec{A}})\text {vec}({\varvec{B}})\), and \((b)\) is based on (3) and \(\Vert \varvec{E}_t\Vert _F=\Vert {\varvec{e}}_{t}\Vert _2\). The convergence rate with the constant stepsize \(\mu _t\equiv \mu \) is determined by the spectral radius of the matrix \({\varvec{H}}={\varvec{H}}(\mu )\):

$$\begin{aligned} \rho ({\varvec{H}})=\max _{i}|\lambda _{i}(\varvec{H})|=\max {(|\lambda _{\max }(\varvec{H})|,|\lambda _{\min }({\varvec{H}})|)}. \end{aligned}$$

Thus, the maximum and minimum eigenvalues of \({\varvec{H}}\) should be compared. Taking MS as an example, we compute the largest eigenvalue.

$$\begin{aligned} \begin{aligned} \lambda _{\max }({\varvec{H}}_{{\textsf{M}}{\textsf{S}}})&=1-\lambda _{\min }(\mu \varvec{\varTheta }+P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp -\mu (P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )\varvec{\varTheta })\\&{\mathop {=}\limits ^{(a)}}1-\lambda _{\min }(\mu \varvec{\varTheta }-\mu (P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )\varvec{\varTheta })\\&=1-\mu \lambda _{\min }(({\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )\varvec{\varTheta }),\\ \end{aligned} \end{aligned}$$

where \((a)\) is based on Lemma 4. Similarly, the minimum eigenvalue results are as follows:

$$\begin{aligned} \lambda _{\min }(\varvec{H}_{{\textsf{M}}{\textsf{S}}})=1-\mu \lambda _{\max }(({\varvec{I}}-P_{\varvec{V}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )\varvec{\varTheta }). \end{aligned}$$

Obviously, the optimal spectral radius occurs when \(\lambda _{\max }(\varvec{H}_{{\textsf{M}}{\textsf{S}}})=-\lambda _{\min }({\varvec{H}}_{{\textsf{M}}{\textsf{S}}})\), i.e., \(1-\mu \lambda _{\min }=\mu \lambda _{\max }-1\). The corresponding stepsize is \(\mu _\dagger =\frac{2}{\lambda _{\min }+\lambda _{\max }}\). Since \({\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp \) is an orthogonal projector, we have \(\Vert {\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{\varvec{U}_\star }^\perp \Vert =1\). It is easy to check that

$$\begin{aligned} \rho ({\varvec{H}})\le \Vert {\varvec{I}}-P_{\varvec{V}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp \Vert \Vert \varvec{I}-\mu _t \varvec{\varTheta }\Vert \le \Vert {\varvec{I}}-\mu _t \varvec{\varTheta }\Vert \le 1, \end{aligned}$$

which gives (9). In particular, for MC, as shown in (6), we obtain a simplified result similar to [34]:

$$\begin{aligned} \begin{aligned} \lambda _{\max }({\varvec{H}}_{{\textsf{M}}{\textsf{C}}})&{\mathop {=}\limits ^{(a)}}1-\mu \lambda _{\min }( \varvec{S}_{\varOmega }^\top (I-P_{{\varvec{V}}}^\perp \otimes P_{\varvec{U}}^\perp ){\varvec{S}}_{\varOmega }) {\mathop {=}\limits ^{(b)}}1-\mu \lambda _{\min }( {\varvec{I}}- {\varvec{S}}_{\varOmega }^\top (P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp ){\varvec{S}}_{\varOmega })\\&=1-\mu (1-\lambda _{\max }( {\varvec{S}}_{\varOmega }^\top (P_{\varvec{V}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )\varvec{S}_{\varOmega })) {\mathop {=}\limits ^{(c)}}1-\mu (1-\lambda _{\max }({\varvec{S}}_{\varOmega }{\varvec{S}}_{\varOmega }^\top (P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )))\\&{\mathop {=}\limits ^{(d)}}1-\mu (\lambda _{\min }(P_{\varvec{V}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp -\varvec{S}_{\varOmega }{\varvec{S}}_{\varOmega }^\top (P_{\varvec{V}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp ))) {\mathop {=}\limits ^{(e)}}1-\mu (\lambda _{\min }({\varvec{S}}_{{\bar{\varOmega }}}{\varvec{S}}_{{\bar{\varOmega }}}^\top (P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )))\\&{\mathop {=}\limits ^{(f)}}1-\mu (\lambda _{\min }(\varvec{S}_{{\bar{\varOmega }}}^\top (P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp ){\varvec{S}}_{{\bar{\varOmega }}})) =1-\mu (\sigma _{\min }^2({\varvec{S}}_{{\bar{\varOmega }}}^\top (\varvec{V}_{_\star \perp } \otimes {\varvec{U}}_{_\star \perp }))), \end{aligned} \end{aligned}$$

where \((a)\), \((c)\) and \((f)\) are based on the fact that \({\varvec{A}}{\varvec{B}}\) and \({\varvec{B}}{\varvec{A}}\) have the same non-zero eigenvalues, \((b)\) and \((e)\) correspond to the properties of the sampling matrix in (6), and \((d)\) uses Lemma 4. Similarly, the minimum eigenvalue is as follows

$$\begin{aligned} \lambda _{\min }(\varvec{H}_{{\textsf{M}}{\textsf{C}}})=1-\mu (\sigma _{\max }^2(\varvec{S}_{{\bar{\varOmega }}}^\top ({\varvec{V}}_{_\star \perp }\otimes \varvec{U}_{_\star \perp }))). \end{aligned}$$

We can estimate the convergence rate of Algorithm 1. \(\square \)
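
As an illustration of the linearization (30), the following sketch builds a small Gaussian matrix-sensing instance, forms \({\varvec{H}}(\mu )\) explicitly via Kronecker products, and checks that one projected-gradient step satisfies \({\varvec{e}}_{t+1}={\varvec{H}}{\varvec{e}}_{t}+{\mathcal {O}}(\Vert {\varvec{e}}_{t}\Vert _2^2)\). The problem sizes and the stepsize \(\mu =1/m\) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, r, m = 8, 6, 2, 120
mu = 1.0 / m                          # illustrative stepsize (assumption)

# Rank-r ground truth and a Gaussian sensing operator A (rows = vec(A_i)^T)
U0, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n2, r)))
X_star = U0 @ np.diag([3.0, 1.0]) @ V0.T
A = rng.standard_normal((m, n1 * n2))
Theta = A.T @ A                       # matrix representation of A^*A (column-major vec)

def trunc_svd(X):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :], U[:, :r], Vt[:r, :].T

_, U_star, V_star = trunc_svd(X_star)
PUp = np.eye(n1) - U_star @ U_star.T
PVp = np.eye(n2) - V_star @ V_star.T
H = (np.eye(n1 * n2) - np.kron(PVp, PUp)) @ (np.eye(n1 * n2) - mu * Theta)

vec = lambda M: M.reshape(-1, order='F')
for eps in [1e-2, 1e-3, 1e-4]:
    Xt, _, _ = trunc_svd(X_star + eps * rng.standard_normal((n1, n2)))
    e_t = vec(Xt - X_star)
    grad = (Theta @ e_t).reshape(n1, n2, order='F')   # nabla f(X_t) = A^*A(X_t - X_star)
    X_next, _, _ = trunc_svd(Xt - mu * grad)          # one projected-gradient (IHT) step
    residual = np.linalg.norm(vec(X_next - X_star) - H @ e_t)
    print(f"eps={eps:.0e}: ||e_(t+1) - H e_t|| = {residual:.2e}  (should shrink like eps^2)")
```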

C Proof of Proposition 1

Proof

Vectorizing (12) yields \(\nabla _{\mathcal {R}} f({\varvec{x}}_t)={\varvec{P}}\nabla f({\varvec{x}}_t)\), where \({\varvec{P}}=({\varvec{I}}-P_{{\varvec{V}}}^\perp \otimes P_{{\varvec{U}}}^\perp )\) is the orthogonal projection matrix. Substituting (13) into the loss function gives

$$\begin{aligned} \begin{aligned} f({\varvec{x}}_{t+1})&=\frac{1}{2}({\varvec{x}}_{t+1}-{\varvec{x}}_\star )^\top \varvec{\varTheta } ({\varvec{x}}_{t+1}-{\varvec{x}}_\star )\\&=\frac{1}{2}({\varvec{x}}_t -\mu _t \nabla _{\mathcal {R}} f({\varvec{x}}_t)-{\varvec{x}}_\star )^\top \varvec{\varTheta } ({\varvec{x}}_t -\mu _t \nabla _{\mathcal {R}} f({\varvec{x}}_t)-{\varvec{x}}_\star )+{\mathcal {O}}(\Vert {\varvec{x}}_t -{\varvec{x}}_\star \Vert _2^2)\\&=f({\varvec{x}}_t)-\mu _t\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top \nabla f({\varvec{x}}_t)+\frac{\mu _t^2}{2}\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top \varvec{\varTheta } \nabla _{\mathcal {R}} f({\varvec{x}}_t)+{\mathcal {O}}(\Vert {\varvec{x}}_t -{\varvec{x}}_\star \Vert _2^2)\\&=f({\varvec{x}}_t)-\frac{(\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top \nabla f({\varvec{x}}_t))^2}{2\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top \varvec{\varTheta } \nabla _{\mathcal {R}} f({\varvec{x}}_t)}+{\mathcal {O}}(\Vert {\varvec{x}}_t -{\varvec{x}}_\star \Vert _2^2)\\&{\mathop {=}\limits ^{(a)}}\left( 1-\frac{(\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top \nabla f({\varvec{x}}_t))^2}{(\nabla f({\varvec{x}}_t)^\top ({\varvec{P}}\varvec{\varTheta } {\varvec{P}}) \nabla f({\varvec{x}}_t))(\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top ({\varvec{P}}\varvec{\varTheta } {\varvec{P}})^{+} \nabla _{\mathcal {R}} f({\varvec{x}}_t))}\right) f({\varvec{x}}_t)\\&\quad +{\mathcal {O}}(\Vert {\varvec{x}}_t -{\varvec{x}}_\star \Vert _2^2)\\&{\mathop {\le }\limits ^{(b)}}\left( 1-\frac{4}{\frac{\lambda _{\max }({\varvec{P}}\varvec{\varTheta } {\varvec{P}})}{\lambda _{\min }({\varvec{P}}\varvec{\varTheta } {\varvec{P}})} + 2+\frac{\lambda _{\min }({\varvec{P}}\varvec{\varTheta } {\varvec{P}})}{\lambda _{\max }({\varvec{P}}\varvec{\varTheta } {\varvec{P}})}}\right) f({\varvec{x}}_t)+{\mathcal {O}}(\Vert {\varvec{x}}_t -{\varvec{x}}_\star \Vert _2^2)\\&{\mathop {\le }\limits ^{(c)}}\left( \frac{\kappa -1}{\kappa +1}\right) ^2 f({\varvec{x}}_t)+{\mathcal {O}}(\Vert {\varvec{x}}_t -{\varvec{x}}_\star \Vert _2^2),\\ \end{aligned} \end{aligned}$$

where (a) follows from \(f({\varvec{x}}_t)=\frac{1}{2}(\varvec{x}_t-{\varvec{x}}_\star )^\top \varvec{\varTheta } (\varvec{x}_t-{\varvec{x}}_\star )=\frac{1}{2}\nabla _{\mathcal {R}} f(\varvec{x}_t)^\top ({\varvec{P}}\varvec{\varTheta } {\varvec{P}})^{+} \nabla _{\mathcal {R}} f({\varvec{x}}_t)\). Furthermore, since \(\frac{|\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top \nabla f({\varvec{x}}_t)|}{\Vert \nabla _{\mathcal {R}} f({\varvec{x}}_t)\Vert _2 \Vert \nabla f({\varvec{x}}_t)\Vert _2}\ge 0\), we apply the generalized Kantorovich type inequality [14] in Lemma 7 to get (b). To prove (c), we only need to show

$$\begin{aligned} \begin{aligned} \frac{\lambda _{\max }({\varvec{P}}\varvec{\varTheta } \varvec{P})}{\lambda _{\min }({\varvec{P}}\varvec{\varTheta } {\varvec{P}})}&=\Vert {\varvec{P}}\varvec{\varTheta } {\varvec{P}}\Vert \Vert (\varvec{P}\varvec{\varTheta })^+{\varvec{P}}^+\Vert \le \Vert {\varvec{P}}\varvec{\varTheta }\Vert \Vert {\varvec{P}}\Vert \Vert {\varvec{P}}^+\Vert \Vert ({\varvec{P}}\varvec{\varTheta })^+\Vert \\&=\Vert {\varvec{P}}\varvec{\varTheta }\Vert \Vert ({\varvec{P}}\varvec{\varTheta })^+\Vert = \frac{\lambda _{\max }({\varvec{P}}\varvec{\varTheta })}{\lambda _{\min }({\varvec{P}}\varvec{\varTheta })}:=\kappa , \end{aligned} \end{aligned}$$

here, \(\lambda _{\min }(\cdot )\) means the smallest non-zero eigenvalue. \(\square \)

Lemma 7

(Kantorovich inequality [14]) Let \({\varvec{A}}\) be a symmetric (semi-) positive definite matrix, and let \(\lambda _{\max }\) and \(\lambda _{\min }\) be its largest and smallest non-zero eigenvalues, respectively. If \(\varvec{x},{\varvec{y}}\in {{\mathbb {R}}}^n\) satisfy \(\frac{|{\varvec{x}}^\top {\varvec{y}}|}{\Vert {\varvec{x}}\Vert _2 \Vert \varvec{y}\Vert _2}\ge \cos \theta \) with \(0\le \theta \le \frac{\pi }{2}\), then

$$\begin{aligned} \frac{({\varvec{x}}^\top {\varvec{y}})^2}{({\varvec{x}}^\top {\varvec{A}}{\varvec{x}})({\varvec{y}}^\top \varvec{A}^{+}{\varvec{y}})}\ge \frac{4}{\kappa + 2+\kappa ^{-1}}, \end{aligned}$$

where \(\kappa =\frac{\lambda _{\max }}{\lambda _{\min }}\frac{1+\sin \theta }{1-\sin \theta }\) and \((\cdot )^{+}\) is the Moore-Penrose inverse. When \({\varvec{A}}\) is positive definite and \({\varvec{x}}={\varvec{y}}\), i.e., \({\varvec{A}}^{+}={\varvec{A}}^{-1}\) and \(\theta =0\), the above inequality degenerates into the traditional form.
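
A quick numerical spot-check of the stated inequality; for simplicity the sketch takes \({\varvec{A}}\) positive definite (so that \({\varvec{A}}^{+}={\varvec{A}}^{-1}\)) and chooses \(\theta \) as the actual angle between two random vectors, both illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
B = rng.standard_normal((n, n))
A = B @ B.T + 0.1 * np.eye(n)         # symmetric positive definite
evals = np.linalg.eigvalsh(A)
lam_min, lam_max = evals[0], evals[-1]

x = rng.standard_normal(n)
y = rng.standard_normal(n)
cos_t = abs(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
sin_t = np.sqrt(1.0 - cos_t ** 2)     # theta = actual angle between x and y

kappa = (lam_max / lam_min) * (1 + sin_t) / (1 - sin_t)
lhs = (x @ y) ** 2 / ((x @ A @ x) * (y @ np.linalg.inv(A) @ y))
rhs = 4.0 / (kappa + 2.0 + 1.0 / kappa)
print(f"lhs = {lhs:.4e} >= rhs = {rhs:.4e}: {lhs >= rhs}")
```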

D Proof of Theorem 2

Proof

According to Algorithm 3, we calculate the error as follows.

$$\begin{aligned} \begin{aligned} {\varvec{E}}_{t+1}&={\varvec{X}}_{t+1}-{\varvec{X}}_\star \\&={\mathcal {P}}_r({\varvec{Y}}_t-\mu _t \nabla f({\varvec{Y}}_t))-{\varvec{X}}_\star \\&={\mathcal {P}}_r({\varvec{X}}_\star +{\varvec{Y}}_t-{\varvec{X}}_\star -\mu _t \nabla f({\varvec{Y}}_t))-{\varvec{X}}_\star \\&=({\varvec{Y}}_t-{\varvec{X}}_\star -\mu _t \nabla f({\varvec{Y}}_t))-P_{{\varvec{U}}_\star }^\perp ({\varvec{Y}}_t-{\varvec{X}}_\star -\mu _t \nabla f({\varvec{Y}}_t))P_{{\varvec{V}}_\star }^\perp +{\mathcal {O}}(\Vert {\varvec{Y}}_t-{\varvec{X}}_\star \Vert _F^2).\\ \end{aligned} \end{aligned}$$

After vectorizing, we have

$$\begin{aligned} \begin{aligned} {\varvec{e}}_{t+1}&=\underbrace{({\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )({\varvec{I}}-\mu _t \varvec{\varTheta })}_{{\varvec{H}}_t={\varvec{H}}(\mu _t)}\text {vec}({\varvec{Y}}_t-{\varvec{X}}_\star )+{\mathcal {O}}(\Vert {\varvec{Y}}_t-{\varvec{X}}_\star \Vert _F^2)\\&=(1+\eta _t){\varvec{H}}_t{\varvec{e}}_{t}-\eta _t {\varvec{H}}_t{\varvec{e}}_{t-1}+{\mathcal {O}}(\Vert {\varvec{e}}_{t}\Vert _2^2).\\ \end{aligned} \end{aligned}$$

Stacking the errors of two adjacent iterations, we get the recursive form

$$\begin{aligned} \begin{aligned} \begin{pmatrix} {\varvec{e}}_{t+1}\\ {\varvec{e}}_{t} \end{pmatrix}=\underbrace{\begin{pmatrix} (1+\eta _t){\varvec{H}}_t &{}-\eta _t{\varvec{H}}_t\\ {\varvec{I}}&{}{\varvec{0}} \end{pmatrix}}_{{\varvec{T}}} \begin{pmatrix} {\varvec{e}}_{t}\\ {\varvec{e}}_{t-1} \end{pmatrix}. \end{aligned} \end{aligned}$$

The convergence rate depends on the spectral radius \(\rho ({\varvec{T}})\) of \({\varvec{T}}\in {\mathbb {R}}^{2n_1n_2\times 2n_1n_2}\). According to the eigendecomposition in [34], \({\varvec{T}}\) is similar to a block diagonal matrix composed of \(2\times 2\) blocks \({\varvec{T}}_j\), i.e., \({\varvec{T}}\sim \text {bldiag}({\varvec{T}}_1,{\varvec{T}}_2,\ldots ,\varvec{T}_{n_1n_2})\), where each block \({\varvec{T}}_j\in {\mathbb {R}}^{2\times 2}\) is of the form

$$\begin{aligned} {\varvec{T}}_j=\begin{pmatrix} (1+\eta _t)(1-\mu _t\lambda _j) &{}-\eta _t(1-\mu _t\lambda _j)\\ 1&{}0 \end{pmatrix}. \end{aligned}$$

where \(\lambda _j\) is an eigenvalue of the matrix \((\varvec{I}-P_{{\varvec{V}}_\star }^\perp \otimes P_{\varvec{U}_\star }^\perp )\varvec{\varTheta }\). Next, we find the eigenvalues of \({\varvec{T}}_j\) from its characteristic polynomial

$$\begin{aligned} r^2-(1+\eta _t)(1-\mu _t\lambda _j)r+\eta _t(1-\mu _t\lambda _j)=0. \end{aligned}$$
(31)

According to the quadratic formula, set the discriminant \(\varDelta (\lambda _j,\mu _t,\eta _t)=(1+\eta _t)^2(1-\mu _t\lambda _j)^2-4\eta _t(1-\mu _t\lambda _j)\), then the solution to (31) is:

$$\begin{aligned} r^{\pm }(\lambda _j,\mu _t,\eta _t)=\frac{(1+\eta _t)(1-\mu _t\lambda _j)\pm \sqrt{\varDelta (\lambda _j,\mu _t,\eta _t)}}{2}, \end{aligned}$$
(32)

where the superscript \((\cdot )^{\pm }\) indicates addition or subtraction in the numerator. For a given \({\varvec{T}}\) with fixed \((\mu _t,\eta _t)\), \(\rho ({\varvec{T}})=\max _{\lambda _j} |r^{\pm }(\lambda _j,\mu _t,\eta _t)|\) is continuous and quasi-convex w.r.t. the eigenvalue \(\lambda _j\) [18, 21, 37]. Thus, the extremal value is attained on the boundary, i.e.

$$\begin{aligned} \rho ({\varvec{T}})=\max (|r^{\pm }(\lambda _{\max },\mu _t,\eta _t)|,|r^{\pm }(\lambda _{\min },\mu _t,\eta _t)|). \end{aligned}$$
(33)

As a whole, \(\rho ({\varvec{T}})\) is determined by the maximum modulus in (33). Let the surfaces \(\varPi _1\) and \(\varPi _2\) denote \(|r^{\pm }(\lambda _{\min },\mu _t,\eta _t)|\) and \(|r^{\pm }(\lambda _{\max },\mu _t,\eta _t)|\), viewed as functions of \((\mu _t,\eta _t)\), respectively.

Below we show how to determine the minimum spectral radius and the corresponding parameters. Returning to (32), \(|r^{\pm }(\lambda _j,\mu _t,\eta _t)|\ge |(1+\eta _t)(1-\mu _t\lambda _j)|/2\) holds with equality if and only if \(\varDelta (\lambda _j,\mu _t,\eta _t)=0\). In this case, we obtain a relationship between the parameters \((\mu _t,\eta _t)\)

$$\begin{aligned} \eta _t^-=\frac{1-\sqrt{\mu _t\lambda _j}}{1+\sqrt{\mu _t\lambda _j}},\eta _t^+=\frac{1+\sqrt{\mu _t\lambda _j}}{1-\sqrt{\mu _t\lambda _j}}. \end{aligned}$$
(34)

Obviously, \(0<\eta _t^-<1<\eta _t^+\). Given \(\mu _t\), there are three cases for \(\eta _t\).

  • (32) with \(\eta _t\in (0,\eta _{t}^-)\cup (\eta _{t}^+,\infty )\) has two different solutions.

  • (32) with \(\eta _t=\eta _{t}^\pm \) has a single solution.

  • (32) with \(\eta _t\in (\eta _{t}^-,\eta _{t}^+)\) has conjugate complex solutions.

If \(\eta _t\in [\eta _{t}^+,\infty )\), then \(r^{\pm }(\lambda _j,\mu _t,\eta _t^+)\ge |(1+\eta _t^+)(1-\mu _t\lambda _j)|/2=1+\sqrt{\mu _t\lambda _j}>1\), and \(\rho ({\varvec{T}})>1\) is obtained from (33). Conversely, when \(\eta _t=\eta _{t}^-\), \(r^{\pm }(\lambda _j,\mu _t,\eta _t^-)=|(1+\eta _t^-)(1-\mu _t\lambda _j)|/2=1-\sqrt{\mu _t\lambda _j}<1\). This is also why the parameter is selected as \(0<\eta \le 1\) in practice. When \(\eta _t\in (\eta _{t}^-,\eta _{t}^+)\), \(\rho ({\varvec{T}})=\max _{\lambda _j} \sqrt{\eta _t(1-\mu _t\lambda _j)}\) increases monotonically w.r.t. \(\eta _t\) and decreases monotonically w.r.t. \(\mu _t\). This describes the geometric behaviour of \(\rho ({\varvec{T}})\) w.r.t. \((\mu _t,\eta _t)\), and the condition \(\varDelta (\lambda _j,\mu _t,\eta _t^-)=0\) helps to find the theoretical lower bound of \(\rho ({\varvec{T}})\). The optimal parameter pair \((\mu _\flat ,\eta _\flat )\) is the intersection of the curve \(\eta _t^-=\frac{1-\sqrt{\mu _t\lambda _{\min }}}{1+\sqrt{\mu _t\lambda _{\min }}}\), on which \(r^{\pm }(\lambda _{\min },\mu _\flat ,\eta _\flat )\) coincide, with the surface \(\varPi _2\), i.e., \(|r^{-}(\lambda _{\max },\mu _t,\eta _t)|\). So it satisfies the following equation

$$\begin{aligned} (1+\eta _\flat )(1-\mu _\flat \lambda _{\min })=-(1+\eta _\flat )(1-\mu _\flat \lambda _{\max })+\sqrt{(1+\eta _\flat )^2(1-\mu _\flat \lambda _{\max })^2-4\eta _\flat (1-\mu _\flat \lambda _{\max })}. \end{aligned}$$

Substituting \(\eta _\flat =\frac{1-\sqrt{\mu _\flat \lambda _{\min }}}{1+\sqrt{\mu _\flat \lambda _{\min }}}\), it is not difficult to obtain the optimal convergence result \(\mu _{\flat }=\frac{4}{\lambda _{\min }+3\lambda _{\max }}\) and \(\rho _{\textsf{opt}}(\varvec{T})=1-\sqrt{\frac{4\lambda _{\min }}{\lambda _{\min }+3\lambda _{\max }}}\) in (18). Also, for \(\eta _t<1\), the intersection of \(\varPi _1\) and \(\varPi _2\) can be calculated according to monotonicity

$$\begin{aligned} r^{+}(\lambda _{\min },\mu _t,\eta _t)=-r^{-}(\lambda _{\max },\mu _t,\eta _t). \end{aligned}$$

If \(\eta _t=0\), it simplifies to \(\mu _t=2/(\lambda _{\min }+\lambda _{\max })=\mu _\dagger \) in (11). Due to the momentum, the optimal stepsizes satisfy \(\mu _\flat <\mu _\dagger \). In fact, setting \(\eta _t=0\) gives \({\varvec{e}}_t={\varvec{H}}{\varvec{e}}_{t-1}+\mathcal {O}(\Vert {\varvec{e}}_{t-1}\Vert _2^2)\), which is consistent with the non-accelerated iteration. Conversely, if \(\mu _t\ge \mu _\dagger \), then \(\eta _t=0\) is a good parameter choice, which means that NAG degenerates to Grad. When \(\eta _t\ne 0\), we have

$$\begin{aligned} \eta _t\mu _t^2(\lambda _{\max }-\lambda _{\min })^2+2(1+\eta _t)^2(1-\mu _t\lambda _{\max })(1-\mu _t\lambda _{\min })(2-\mu _t(\lambda _{\min }+\lambda _{\max }))=0. \end{aligned}$$

Despite the complicated form, we can solve this with symbolic computation for \(\mu _t\in (\mu _\flat ,\mu _\dagger )\):

$$\begin{aligned} \begin{aligned} \eta _{t\bowtie }&=[(-4\lambda _{\min }^2\lambda _{\max }\mu _t^3+5\lambda _{\min }^2\mu _t^2-4\lambda _{\min }\lambda _{\max }^2\mu _t^3+14\lambda _{\min }\lambda _{\max }\mu _t^2-12\lambda _{\min }\mu _t+5\lambda _{\max }^2\mu _t^2-12\lambda _{\max }\mu _t+8)\\&-\sqrt{\mu _t^2(-(\lambda _{\max }-\lambda _{\min })^2)(8\lambda _{\min }^2\lambda _{\max }\mu _t^3-9\lambda _{\min }^2\mu _t^2+8\lambda _{\min }\lambda _{\max }^2\mu _t^3-30\lambda _{\min }\mu _t^2+24\lambda _{\min }\mu _t-9\lambda _{\max }^2\mu _t^2+24\lambda _{\max }\mu _t-16)}]\\&/(4(\lambda _{\min }\mu _t-1)(\lambda _{\max }\mu _t-1)(\lambda _{\min }\mu _t+\lambda _{\max }\mu _t-2)). \end{aligned} \end{aligned}$$

Analyzing the spectral radius of \({\varvec{T}}\) in (33) case by case w.r.t. the pair \((\mu _t,\eta _{t})\) completes the proof. \(\square \)
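
The closed-form optimum can be checked against a brute-force evaluation of (33). The sketch below, with arbitrary illustrative values for \(\lambda _{\min }\) and \(\lambda _{\max }\), computes \(\rho ({\varvec{T}})\) from the roots of (31) at the two extreme eigenvalues, evaluates it at \((\mu _\flat ,\eta _\flat )\), and confirms that a coarse grid search over \((\mu _t,\eta _t)\) finds nothing better than \(\rho _{\textsf{opt}}({\varvec{T}})\).

```python
import numpy as np

lam_min, lam_max = 1.0, 10.0          # illustrative extreme eigenvalues of (I - P^perp (x) P^perp) Theta

def rho_T(mu, eta):
    """Spectral radius over the 2x2 companion blocks at lambda_min and lambda_max, cf. (31)-(33)."""
    radius = 0.0
    for lam in (lam_min, lam_max):
        a = 1.0 - mu * lam
        roots = np.roots([1.0, -(1.0 + eta) * a, eta * a])   # characteristic polynomial (31)
        radius = max(radius, np.max(np.abs(roots)))
    return radius

# Closed-form optimum from Theorem 2
mu_opt = 4.0 / (lam_min + 3.0 * lam_max)
eta_opt = (1.0 - np.sqrt(mu_opt * lam_min)) / (1.0 + np.sqrt(mu_opt * lam_min))
rho_opt = 1.0 - np.sqrt(4.0 * lam_min / (lam_min + 3.0 * lam_max))
print(f"closed form: rho = {rho_opt:.4f} at (mu, eta) = ({mu_opt:.4f}, {eta_opt:.4f})")
print(f"rho_T(mu_opt, eta_opt) = {rho_T(mu_opt, eta_opt):.4f}")

# A coarse grid search should not find anything better than rho_opt
grid_best = min(rho_T(mu, eta)
                for mu in np.linspace(0.01, 2.0 / lam_max, 60)
                for eta in np.linspace(0.0, 0.99, 60))
print(f"best rho on grid       = {grid_best:.4f}")
```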

E Proof in Sect. 4

1.1 E.1 Proof of Lemma 3

Proof

The first one obviously holds according to Lemma 1. From (25), we have

$$\begin{aligned} \begin{aligned} {\mathcal {R}}_{{\varvec{X}}}^{\textsf{orth}}({\varvec{N}})&=({\varvec{X}}+{\varvec{N}}){\varvec{V}}_{{\varvec{X}}}(\varvec{\varSigma }_{{\varvec{X}}}+{\varvec{U}}_{{\varvec{X}}}^\top {\varvec{N}} {\varvec{V}}_{{\varvec{X}}})^{-1}{\varvec{U}}_{{\varvec{X}}}^\top ({\varvec{X}}+{\varvec{N}})\\&{\mathop {=}\limits ^{(a)}}({\varvec{X}}+{\varvec{N}}){\varvec{V}}_{{\varvec{X}}}(\varvec{\varSigma }_{{\varvec{X}}}^{-1}-\varvec{\varSigma }_{{\varvec{X}}}^{-1}{\varvec{U}}_{{\varvec{X}}}^\top {\varvec{N}}{\varvec{V}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}}^{-1}){\varvec{U}}_{{\varvec{X}}}^\top ({\varvec{X}}+{\varvec{N}})+{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2)\\&=({\varvec{X}}+{\varvec{N}})({\varvec{V}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}}^{-1}{\varvec{U}}_{{\varvec{X}}}^\top -{\varvec{V}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}}^{-1}{\varvec{U}}_{{\varvec{X}}}^\top {\varvec{N}} {\varvec{V}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}}^{-1}{\varvec{U}}_{{\varvec{X}}}^\top )({\varvec{X}}+{\varvec{N}})+{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2)\\&=({\varvec{X}}+{\varvec{N}})({\varvec{X}}^{-\top }-{\varvec{X}}^{-\top }{\varvec{N}} {\varvec{X}}^{-\top })({\varvec{X}}+{\varvec{N}})+{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2)\\&{\mathop {=}\limits ^{(b)}}{\varvec{X}}{\varvec{X}}^{-\top }{\varvec{X}}+{\varvec{N}}{\varvec{X}}^{-\top }{\varvec{X}}+{\varvec{X}}{\varvec{X}}^{-\top }{\varvec{N}}-{\varvec{X}}{\varvec{X}}^{-\top }{\varvec{N}} {\varvec{X}}^{-\top }{\varvec{X}}+{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2)\\&{\mathop {=}\limits ^{(c)}}{\varvec{X}}+P_{{\varvec{U}}_{{\varvec{X}}}} {\varvec{N}}+{\varvec{N}} P_{{\varvec{V}}_{{\varvec{X}}}}-P_{{\varvec{U}}_{{\varvec{X}}}} {\varvec{N}} P_{{\varvec{V}}_{{\varvec{X}}}}+{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2)\\&={\varvec{X}}+{\varvec{N}}-P_{{\varvec{U}}_{{\varvec{X}}}}^\perp {\varvec{N}} P_{{\varvec{V}}_{{\varvec{X}}}}^\perp +{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2)\\&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}}}({\varvec{X}}+\varvec{N})+{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2), \end{aligned} \end{aligned}$$

where (a) is the perturbation expansion of the matrix inverse: as long as \(\Vert {\varvec{A}}^{-1}{\varvec{B}}\Vert <1\) or \(\Vert \varvec{B}{\varvec{A}}^{-1}\Vert <1\) holds, the Taylor expansion of the inverse of a matrix sum is

$$\begin{aligned} \begin{aligned} ({\varvec{A}}+{\varvec{B}})^{-1}&={\varvec{A}}^{-1} - {\varvec{A}}^{-1}{\varvec{B}}{\varvec{A}}^{-1} + {\varvec{A}}^{-1}({\varvec{B}}{\varvec{A}}^{-1})^2 - {\varvec{A}}^{-1}({\varvec{B}}{\varvec{A}}^{-1})^3 + \cdots \\&={\varvec{A}}^{-1}-{\varvec{A}}^{-1}{\varvec{B}}\varvec{A}^{-1}+{\mathcal {O}}(\Vert {\varvec{B}}\Vert _F^2). \end{aligned} \end{aligned}$$

Using the norm inequality \(\Vert {\varvec{A}}\varvec{B}\Vert \le \Vert {\varvec{A}}\Vert \Vert {\varvec{B}}\Vert \), combined with the condition \(\Vert {\varvec{N}}\Vert \le \Vert {\varvec{N}}\Vert _F< \sigma _{{r}}({\varvec{X}})/2\), we can verify that this invertibility condition holds:

$$\begin{aligned} \Vert \varvec{\varSigma }_{{\varvec{X}}}^{-1}({\varvec{U}}_{\varvec{X}}^\top {\varvec{N}} {\varvec{V}}_{\varvec{X}})\Vert \le \frac{\Vert {\varvec{U}}_{{\varvec{X}}}^\top {\varvec{N}} {\varvec{V}}_{{\varvec{X}}}\Vert }{\sigma _{{r}}({\varvec{X}})}\le \frac{\Vert \varvec{N}\Vert }{\sigma _{{r}}({\varvec{X}})}<1. \end{aligned}$$

(b) merges products of two or more copies of \({\varvec{N}}\) into the higher-order term, and (c) uses the SVD of \({\varvec{X}}\) to get

$$\begin{aligned} \begin{aligned}&{\varvec{X}}{\varvec{X}}^{-\top }={\varvec{U}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}} {\varvec{V}}_{{\varvec{X}}}^\top {\varvec{V}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}}^{-1}{\varvec{U}}_{{\varvec{X}}}^\top ={\varvec{U}}_{{\varvec{X}}}{\varvec{U}}_{{\varvec{X}}}^\top =P_{{\varvec{U}}_{{\varvec{X}}}},\\&{\varvec{X}}^{-\top }{\varvec{X}}={\varvec{V}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}}^{-1}{\varvec{U}}_{{\varvec{X}}}^\top {\varvec{U}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}} {\varvec{V}}_{{\varvec{X}}}^\top ={\varvec{V}}_{{\varvec{X}}}{\varvec{V}}_{{\varvec{X}}}^\top =P_{{\varvec{V}}_{{\varvec{X}}}},\\&{\varvec{X}}{\varvec{X}}^{-\top }{\varvec{X}}=P_{\varvec{U}_{{\varvec{X}}}} {\varvec{U}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}} {\varvec{V}}_{\varvec{X}}^\top ={\varvec{U}}_{{\varvec{X}}}\varvec{\varSigma }_{\varvec{X}} {\varvec{V}}_{{\varvec{X}}}^\top ={\varvec{X}}. \end{aligned} \end{aligned}$$

\(\square \)
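
The first-order agreement between the orthographic retraction (25) and the tangent-space projection can also be observed numerically. In the sketch below, the rank, dimensions, and perturbation sizes are illustrative; the gap should shrink quadratically with \(\Vert {\varvec{N}}\Vert _F\).

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2, r = 10, 8, 3

U, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V, _ = np.linalg.qr(rng.standard_normal((n2, r)))
Sigma = np.diag([5.0, 3.0, 1.0])
X = U @ Sigma @ V.T                    # rank-r base point with sigma_r(X) = 1

def retr_orth(N):
    """Orthographic retraction (25): (X+N) V (Sigma + U^T N V)^{-1} U^T (X+N)."""
    M = np.linalg.inv(Sigma + U.T @ N @ V)
    return (X + N) @ V @ M @ U.T @ (X + N)

def proj_tangent(Z):
    """Tangent-space projection at X: Z - P_U^perp Z P_V^perp."""
    PUp = np.eye(n1) - U @ U.T
    PVp = np.eye(n2) - V @ V.T
    return Z - PUp @ Z @ PVp

for eps in [1e-2, 1e-3, 1e-4]:
    N = eps * rng.standard_normal((n1, n2))
    gap = np.linalg.norm(retr_orth(N) - proj_tangent(X + N), 'fro')
    print(f"||N||_F = {np.linalg.norm(N, 'fro'):.2e}  ->  gap = {gap:.2e}  (O(||N||_F^2))")
```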

1.2 E.2 Convergence for Algorithm 4

Proof

According to Algorithm 4, we have

$$\begin{aligned} \begin{aligned} {\varvec{E}}_{t+1}&={\mathcal {R}}_{{\varvec{X}}_t}(-\mu _t\text {grad}f({\varvec{X}}_t))-{\varvec{X}}_\star \\&{\mathop {=}\limits ^{(a)}}{\mathcal {P}}_{{{\mathbb {T}}}_{X_t}{{\mathbb {M}}}_r}({\varvec{X}}_t-\mu _t\nabla f({\varvec{X}}_t))-{\varvec{X}}_\star +{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2)\\&{\mathop {=}\limits ^{(b)}}({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))-P_{{\varvec{U}}_t}^\perp ({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))P_{{\varvec{V}}_t}^\perp +{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2)\\&{\mathop {=}\limits ^{(c)}}({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))-P_{{\varvec{U}}_\star }^\perp ({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))P_{{\varvec{V}}_\star }^\perp +{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2),\\ \end{aligned} \end{aligned}$$

where \((a)\) uses Lemma 3, \((b)\) is based on the tangent space projection in (22), and \((c)\) uses the subspace perturbation in Lemma 5 to replace the subspace \({\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_t}{{\mathbb {M}}}_r}\) with \(\mathcal {P}_{{{\mathbb {T}}}_{{\varvec{X}}_\star }{{\mathbb {M}}}_r}\): for any matrix \({\varvec{A}}\) with \(\Vert {\varvec{A}}\Vert ={\mathcal {O}}(\Vert {\varvec{E}}_t\Vert _F)\),

$$\begin{aligned} \begin{aligned} \Vert P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{\varvec{V}_t}^\perp -P_{{\varvec{U}}_\star }^\perp {\varvec{A}}{\varvec{P}}_{\varvec{V}_\star }^\perp \Vert&=\Vert P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_t}^\perp -P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_\star }^\perp +P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_\star }^\perp -P_{{\varvec{U}}_\star }^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_\star }^\perp \Vert \\&\le \Vert P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_t}^\perp -P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_\star }^\perp \Vert +\Vert P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_\star }^\perp -P_{{\varvec{U}}_\star }^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_\star }^\perp \Vert \\&\le \Vert P_{{\varvec{U}}_t}^\perp \Vert \Vert {\varvec{A}}\Vert \Vert P_{{\varvec{V}}_t}^\perp -P_{{\varvec{V}}_\star }^\perp \Vert +\Vert P_{{\varvec{U}}_t}^\perp -P_{{\varvec{U}}_\star }^\perp \Vert \Vert {\varvec{A}}\Vert \Vert P_{{\varvec{V}}_\star }^\perp \Vert \\&={\mathcal {O}}(\Vert {\varvec{E}}_t\Vert _F^2), \end{aligned} \end{aligned}$$

The subsequent proof is consistent with the proof of Theorem 1 in Appendix 1. \(\square \)

1.3 E.3 Convergence for Algorithm 5

Proof

The proof is divided into three steps to analyse \(\varvec{X}_{t-1}\), \({\varvec{Y}}_t\) and \({\varvec{X}}_{t+1}\), respectively.

Step 1 Calculate the inverse orthographic retraction of \(\varvec{X}_{t-1}\) at \({\varvec{X}}_t\).

$$\begin{aligned} \begin{aligned} \textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{\varvec{X}_t}({\varvec{X}}_{t-1})&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_t}{{\mathbb {M}}}_r}({\varvec{X}}_{t-1}-{\varvec{X}}_t)\\&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_t}{{\mathbb {M}}}_r}({\varvec{X}}_{t-1})-{\varvec{X}}_t\\&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_\star }{{\mathbb {M}}}_r}({\varvec{X}}_{t-1})-{\varvec{X}}_t+{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2)\\&={\varvec{X}}_{t-1}-{\varvec{X}}_t+{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2+\Vert {\varvec{E}}_{t-1}\Vert _F^2).\\ \end{aligned} \end{aligned}$$

This gives a first-order approximation of \({\varvec{X}}_{t-1}-{\varvec{X}}_t\) in the tangent space \({{\mathbb {T}}}_{{\varvec{X}}_t}{{\mathbb {M}}}_r\).

Step 2 Similar to Appendix 1, we calculate the residual of \({\varvec{Y}}_{t}\)

$$\begin{aligned} \begin{aligned} {\varvec{Y}}_{t}-{\varvec{X}}_\star&={\mathcal {R}}^{\textsf{orth}}_{{\varvec{X}}_t}(-\eta _t \textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{X}}_t}({\varvec{X}}_{t-1}))-X_\star \\&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_t}{{\mathbb {M}}}_r}({\varvec{X}}_t-\eta _t \textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{X}}_t}({\varvec{X}}_{t-1}))-{\varvec{X}}_\star +{\mathcal {O}}(\Vert {\varvec{E}}_t\Vert _F^2)\\&={\varvec{X}}_t-\eta _t \textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{X}}_t}({\varvec{X}}_{t-1})-{\varvec{X}}_\star +{\mathcal {O}}(\Vert {\varvec{E}}_t\Vert _F^2)\\&={\varvec{X}}_t-{\varvec{X}}_\star +\eta _t ({\varvec{X}}_t-{\varvec{X}}_{t-1})+{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2+\Vert {\varvec{E}}_{t-1}\Vert _F^2)\\&={\varvec{E}}_t+\eta _t({\varvec{E}}_t-\varvec{E}_{t-1})+{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2+\Vert \varvec{E}_{t-1}\Vert _F^2). \end{aligned} \end{aligned}$$

Thus, to first order, \({\varvec{Y}}_t\) satisfies the same linear extrapolation as in the Euclidean setting.

Step 3 Compute \({\varvec{X}}_{t+1}-{\varvec{X}}_\star \) to get the recursive form

$$\begin{aligned} \begin{aligned} {\varvec{E}}_{t+1}&={\varvec{X}}_{t+1}-{\varvec{X}}_\star \\&={\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_t}(-\mu _t\text {grad}f({\varvec{Y}}_t))-{\varvec{X}}_\star \\&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{Y}}_t}{{\mathbb {M}}}_r}({\varvec{Y}}_t-\mu _t \nabla f({\varvec{Y}}_t))-{\varvec{X}}_\star +{\mathcal {O}}(\Vert {\varvec{Y}}_t-{\varvec{X}}_\star \Vert _F^2)\\&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_\star }{{\mathbb {M}}}_r}({\varvec{Y}}_t-\mu _t \nabla f({\varvec{Y}}_t))-{\varvec{X}}_\star +{\mathcal {O}}(\Vert {\varvec{Y}}_t-{\varvec{X}}_\star \Vert _F^2)\\&=({\varvec{Y}}_t-{\varvec{X}}_\star -\mu _t \nabla f({\varvec{Y}}_t))-P_{{\varvec{U}}_\star }^\perp ({\varvec{Y}}_t-{\varvec{X}}_\star -\mu _t \nabla f({\varvec{Y}}_t))P_{{\varvec{V}}_\star }^\perp +{\mathcal {O}}(\Vert {\varvec{Y}}_t-{\varvec{X}}_\star \Vert _F^2).\\ \end{aligned} \end{aligned}$$

The subsequent proof is consistent with the proof of Theorem 2 in Appendix 1. \(\square \)
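
For concreteness, the following self-contained sketch implements the accelerated Riemannian iteration analysed in this appendix (extrapolation through the inverse orthographic retraction, followed by a Riemannian gradient step with orthographic retraction) on a small Gaussian matrix-sensing instance. The stepsize, momentum, warm start, and problem sizes are illustrative assumptions rather than the tuned values used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2, r, m = 12, 10, 2, 300
mu, eta = 0.5 / m, 0.3                 # conservative illustrative stepsize and momentum

# Gaussian matrix sensing: y = A vec(X_star), f(X) = 0.5*||A vec(X) - y||^2
A = rng.standard_normal((m, n1 * n2))
U0, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n2, r)))
X_star = U0 @ np.diag([2.0, 1.0]) @ V0.T
vec = lambda M: M.reshape(-1, order='F')
mat = lambda v: v.reshape(n1, n2, order='F')
y = A @ vec(X_star)

def factors(X):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r], np.diag(s[:r]), Vt[:r, :].T

def proj_tangent(X, Z):                # projection onto the tangent space T_X M_r
    U, _, V = factors(X)
    PUp, PVp = np.eye(n1) - U @ U.T, np.eye(n2) - V @ V.T
    return Z - PUp @ Z @ PVp

def retr_orth(X, N):                   # orthographic retraction (25) at X
    U, S, V = factors(X)
    return (X + N) @ V @ np.linalg.inv(S + U.T @ N @ V) @ U.T @ (X + N)

def grad_f(X):                         # Euclidean gradient A^*(A(X) - y)
    return mat(A.T @ (A @ vec(X) - y))

# Accelerated Riemannian iteration with orthographic retraction (sketch)
X_prev = X_cur = retr_orth(X_star, 0.02 * rng.standard_normal((n1, n2)))  # warm start near X_star
for t in range(50):
    Y = retr_orth(X_cur, -eta * proj_tangent(X_cur, X_prev - X_cur))      # extrapolation on M_r
    X_next = retr_orth(Y, -mu * proj_tangent(Y, grad_f(Y)))               # Riemannian gradient step
    X_prev, X_cur = X_cur, X_next
print("final error:", np.linalg.norm(X_cur - X_star, 'fro'))
```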

1.4 E.4 Proof of Restart Condition Equivalence in (28)

Proof

When condition (8) holds, we have

$$\begin{aligned} \begin{aligned} \langle \nabla f({\varvec{Y}}_{t-1}), {\varvec{X}}_t-\varvec{X}_{t-1}\rangle&=\langle \text {grad}~f({\varvec{Y}}_{t-1})+\nabla f({\varvec{Y}}_{t-1})-\text {grad}~f({\varvec{Y}}_{t-1}), {\varvec{X}}_t-{\varvec{X}}_{t-1}\rangle \\&{\mathop {=}\limits ^{(a)}}\langle \text {grad}~f({\varvec{Y}}_{t-1}), \textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}({\varvec{X}}_{t})-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}({\varvec{X}}_{t-1})\rangle \\&\quad +\langle \nabla f({\varvec{Y}}_{t-1})-\text {grad}~f({\varvec{Y}}_{t-1}), {\varvec{X}}_t-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}({\varvec{X}}_{t})\rangle \\&\quad -\langle \nabla f({\varvec{Y}}_{t-1})-\text {grad}~f({\varvec{Y}}_{t-1}), {\varvec{X}}_{t-1}-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}({\varvec{X}}_{t-1})\rangle \\&{\mathop {\approx }\limits ^{(b)}}\langle \text {grad}~f({\varvec{Y}}_{t-1}), \textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}({\varvec{X}}_{t})-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}({\varvec{X}}_{t-1})\rangle ,\\ \end{aligned} \end{aligned}$$

where (a) uses the fact that the Riemannian gradient lies in the tangent space while \(\nabla f({\varvec{Y}}_{t-1})-\text {grad}~f({\varvec{Y}}_{t-1})\) is orthogonal to it. Based on the first-order expansion, we omit the higher-order terms in (a) to obtain the approximate relationship (b), which does not change the sign of the inner product. As mentioned in step 2 in Appendix 1, \(\nabla f(\varvec{Y}_{t-1})=\varvec{\varTheta } \text {vec}(\varvec{Y}_{t-1}-{\varvec{X}}_\star )\) and \(\text {grad}~f(\varvec{Y}_{t-1})\) are both first order w.r.t. the residual. According to Lemma 3, \(\varvec{X}_t-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{\varvec{Y}_{t-1}}({\varvec{X}}_{t})\) and \({\varvec{X}}_{t-1}-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}(\varvec{X}_{t-1})\) are second order. So the last two terms of (a) are third order, while the remaining inner product is second order. \(\square \)
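
The leading-order agreement of the two restart tests can be checked numerically as well. The sketch below uses the simple quadratic loss \(f({\varvec{X}})=\frac{1}{2}\Vert {\varvec{X}}-{\varvec{X}}_\star \Vert _F^2\) (i.e., \(\varvec{\varTheta }={\varvec{I}}\), an illustrative assumption) and compares the Euclidean and Riemannian inner products for rank-\(r\) points at decreasing distance from \({\varvec{X}}_\star \).

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n2, r = 10, 8, 2

U0, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n2, r)))
X_star = U0 @ np.diag([3.0, 1.0]) @ V0.T

def rank_r(X):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :], U[:, :r], Vt[:r, :].T

def proj_T(Y, Z):                       # tangent-space projection at Y
    _, U, V = rank_r(Y)
    return Z - (np.eye(n1) - U @ U.T) @ Z @ (np.eye(n2) - V @ V.T)

for eps in [1e-1, 1e-2, 1e-3]:
    X_prev, _, _ = rank_r(X_star + eps * rng.standard_normal((n1, n2)))
    X_cur,  _, _ = rank_r(X_star + eps * rng.standard_normal((n1, n2)))
    Y,      _, _ = rank_r(X_star + eps * rng.standard_normal((n1, n2)))
    g = Y - X_star                      # Euclidean gradient of the quadratic loss at Y
    lhs = np.sum(g * (X_cur - X_prev))  # Euclidean restart test
    rhs = np.sum(proj_T(Y, g) * (proj_T(Y, X_cur - Y) - proj_T(Y, X_prev - Y)))  # Riemannian test
    print(f"eps={eps:.0e}: lhs={lhs:+.3e}, rhs={rhs:+.3e}, |lhs-rhs|={abs(lhs - rhs):.1e}")
```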

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, H., Peng, Z., Pan, C. et al. Fast Gradient Method for Low-Rank Matrix Estimation. J Sci Comput 96, 41 (2023). https://doi.org/10.1007/s10915-023-02266-7
