
Fast Gradient Method for Low-Rank Matrix Estimation


Abstract

Projected gradient descent and its Riemannian variant form a typical class of methods for low-rank matrix estimation. This paper proposes a new Nesterov’s Accelerated Riemannian Gradient algorithm using an efficient orthographic retraction and tangent space projection. The subspace relationship between the iterative and extrapolated sequences on the low-rank matrix manifold provides computational convenience. With perturbation analysis of the truncated singular value decomposition and of two retractions, we systematically analyze the local convergence of gradient algorithms and Nesterov’s variants in the Euclidean and Riemannian settings. Theoretically, we give the exact rate of local linear convergence under different parameters via the spectral radius in closed form, together with the optimal convergence rate and the corresponding momentum parameter. When this parameter is unknown, the adaptive restart scheme avoids the oscillation problem caused by high momentum and thus approaches the optimal convergence rate. Extensive numerical experiments confirm the estimated convergence rates and demonstrate that the proposed algorithm is competitive with first-order methods for matrix completion and matrix sensing.

Data Availability

Enquiries about data availability should be directed to the authors.

Code Availability

The code used to perform the experiments in this paper is available at https://github.com/pxxyyz/FastGradient.

References

1. Absil, P.A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22(1), 135–158 (2012)

2. Absil, P.A., Oseledets, I.V.: Low-rank retractions: a survey and new results. Comput. Optim. Appl. 62(1), 5–29 (2015)

3. Ahn, K., Sra, S.: From Nesterov’s estimate sequence to Riemannian acceleration. In: Conference on Learning Theory, pp. 84–118. PMLR (2020)

4. Boumal, N.: An Introduction to Optimization on Smooth Manifolds. Cambridge University Press (2023)

5. Cai, J.F., Wei, K.: Exploiting the structure effectively and efficiently in low-rank matrix recovery. In: Handbook of Numerical Analysis, vol. 19, pp. 21–51. Elsevier (2018)

6. Chen, Y., Chi, Y.: Harnessing structures in big data via guaranteed low-rank matrix estimation: recent theory and fast algorithms via convex and nonconvex optimization. IEEE Sign. Process. Mag. 35(4), 14–31 (2018)

7. Chen, Y., Chi, Y., Fan, J., Ma, C.: Gradient descent with random initialization: fast global convergence for nonconvex phase retrieval. Math. Program. 176, 5–37 (2019)

8. Chen, Y., Chi, Y., Fan, J., Ma, C., et al.: Spectral methods for data science: a statistical perspective. Found. Trends Mach. Learn. 14(5), 566–806 (2021)

9. Chi, Y., Lu, Y.M., Chen, Y.: Nonconvex optimization meets low-rank matrix factorization: an overview. IEEE Trans. Sign. Process. 67(20), 5239–5269 (2019)

10. Chunikhina, E., Raich, R., Nguyen, T.: Performance analysis for matrix completion via iterative hard-thresholded SVD. In: 2014 IEEE Workshop on Statistical Signal Processing (SSP), pp. 392–395. IEEE (2014)

11. Davenport, M.A., Romberg, J.: An overview of low-rank matrix recovery from incomplete observations. IEEE J. Sel. Top. Sign. Process. 10(4), 608–622 (2016)

12. Duruisseaux, V., Leok, M.: A variational formulation of accelerated optimization on Riemannian manifolds. SIAM J. Math. Data Sci. 4(2), 649–674 (2022)

13. Gonzaga, C.C., Schneider, R.M.: On the steepest descent algorithm for quadratic functions. Comput. Optim. Appl. 63, 523–542 (2016)

14. Huang, J., Zhou, J.: A direct proof and a generalization for a Kantorovich type inequality. Linear Algebra Appl. 397, 185–192 (2005)

15. Huang, W., Wei, K.: An extension of fast iterative shrinkage-thresholding algorithm to Riemannian optimization for sparse principal component analysis. Numer. Linear Algebra Appl. 29(1), e2409 (2022)

16. Huang, Y., Dai, Y.H., Liu, X.W., Zhang, H.: On the asymptotic convergence and acceleration of gradient methods. J. Sci. Comput. 90, 1–29 (2022)

17. Jain, P., Meka, R., Dhillon, I.: Guaranteed rank minimization via singular value projection. Adv. Neural Inf. Process. Syst. 23 (2010)

18. Kim, D., Fessler, J.A.: Adaptive restart of the optimized gradient method for convex optimization. J. Optim. Theory Appl. 178(1), 240–263 (2018)

19. Kim, J., Yang, I.: Nesterov acceleration for Riemannian optimization. arXiv preprint arXiv:2202.02036 (2022)

20. Kyrillidis, A., Cevher, V.: Matrix recipes for hard thresholding methods. J. Math. Imag. Vis. 48, 235–265 (2014)

21. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)

22. Li, H., Fang, C., Lin, Z.: Accelerated first-order optimization algorithms for machine learning. Proc. IEEE 108(11), 2067–2082 (2020)

23. Li, H., Lin, Z.: Accelerated alternating direction method of multipliers: an optimal O(1/k) nonergodic analysis. J. Sci. Comput. 79, 671–699 (2019)

24. Liang, J., Fadili, J., Peyré, G.: Activity identification and local linear convergence of forward-backward-type methods. SIAM J. Optim. 27(1), 408–437 (2017)

25. Liang, J., Luo, T., Schönlieb, C.B.: Improving “fast iterative shrinkage-thresholding algorithm”: faster, smarter, and greedier. SIAM J. Sci. Comput. 44(3), A1069–A1091 (2022)

26. Luenberger, D.G., Ye, Y.: Linear and Nonlinear Programming, vol. 228. Springer Nature (2021)

27. Nesterov, Y.E.: A method of solving a convex programming problem with convergence rate \(O\left(\frac{1}{k^{2}}\right)\). In: Doklady Akademii Nauk, vol. 269, pp. 543–547. Russian Academy of Sciences (1983)

28. O’Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15, 715–732 (2015)

29. Park, J.: Accelerated additive Schwarz methods for convex optimization with adaptive restart. J. Sci. Comput. 89(3), 58 (2021)

30. Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

31. Tanner, J., Wei, K.: Normalized iterative hard thresholding for matrix completion. SIAM J. Sci. Comput. 35(5), S104–S125 (2013)

32. Tong, T., Ma, C., Chi, Y.: Accelerating ill-conditioned low-rank matrix estimation via scaled gradient descent. J. Mach. Learn. Res. 22(1), 6639–6701 (2021)

33. Vandereycken, B.: Low-rank matrix completion by Riemannian optimization. SIAM J. Optim. 23(2), 1214–1236 (2013)

34. Vu, T., Raich, R.: Accelerating iterative hard thresholding for low-rank matrix completion via adaptive restart. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2917–2921. IEEE (2019)

35. Vu, T., Raich, R.: On local convergence of iterative hard thresholding for matrix completion. arXiv preprint arXiv:2112.14733 (2021)

36. Vu, T., Raich, R.: On asymptotic linear convergence of projected gradient descent for constrained least squares. IEEE Trans. Sign. Process. 70, 4061–4076 (2022)

37. Wang, D., He, Y., De Sterck, H.: On the asymptotic linear convergence speed of Anderson acceleration applied to ADMM. J. Sci. Comput. 88(2), 38 (2021)

38. Wang, H., Cai, J.F., Wang, T., Wei, K.: Fast Cadzow’s algorithm and a gradient variant. J. Sci. Comput. 88(2), 41 (2021)

39. Wang, R., Zhang, C., Wang, L., Shao, Y.: A stochastic Nesterov’s smoothing accelerated method for general nonsmooth constrained stochastic composite convex optimization. J. Sci. Comput. 93(2), 52 (2022)

40. Wei, K., Cai, J.F., Chan, T.F., Leung, S.: Guarantees of Riemannian optimization for low rank matrix recovery. SIAM J. Matrix Anal. Appl. 37(3), 1198–1222 (2016)

41. Wei, K., Cai, J.F., Chan, T.F., Leung, S.: Guarantees of Riemannian optimization for low rank matrix completion. Inverse Probl. Imag. 14(2), 233–265 (2020)

42. Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016)

43. Zhang, H., Sra, S.: Towards Riemannian accelerated gradient methods. arXiv preprint arXiv:1806.02812 (2018)

44. Zhang, T., Yang, Y.: Robust PCA by manifold optimization. J. Mach. Learn. Res. 19(1), 3101–3139 (2018)


Acknowledgements

The authors would like to thank the anonymous reviewers for their review and helpful comments.

Funding

This work was supported by the National Natural Science Foundation of China (Grant no. 61771001).

Author information

Correspondence to Chengwei Pan or Di Zhao.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Auxiliary Lemmas

1.1 A.1 Relationship of Matrix Eigenvalues

Lemma 4

Let \(\varvec{\varTheta }\) be a symmetric positive semi-definite matrix, and \({\varvec{P}}\in {{\mathbb {R}}}^{n\times n}\) be an orthogonal projection matrix. Denote \({\varvec{P}}^{\perp }=\varvec{I}-{\varvec{P}}\). Then, for each non-zero eigenvalue \(\lambda \) of \(({\varvec{I}}-\mu \varvec{\varTheta } ) {\varvec{P}}^\perp \), there exists an index \(i\) such that

$$\begin{aligned} \lambda _{i}(\mu \varvec{\varTheta } {\varvec{P}}^\perp +\varvec{P})=\lambda _{i}(\mu \varvec{\varTheta } \varvec{P}^\perp )=1-\lambda . \end{aligned}$$

Proof

Assume \(\text {rank}({\varvec{P}})=r\). Since \({\varvec{P}}\) and \({\varvec{P}}^{\perp }\) are idempotent, their eigenvalues are of the form

$$\begin{aligned} \lambda ({\varvec{P}})=\{\underbrace{1,\ldots , 1}_{r},\underbrace{0,\ldots , 0}_{n-r}\},\lambda ({\varvec{P}}^{\perp })=\{\underbrace{1,\ldots , 1}_{n-r},\underbrace{0,\ldots , 0}_{r}\}. \end{aligned}$$

Here, \({\varvec{u}}_i\) and \({\varvec{v}}_j\) denote the eigenvectors of \({\varvec{P}}\) corresponding to the eigenvalues \(1\) and \(0\), respectively. From the orthogonal relation between \({\varvec{P}}\) and \({\varvec{P}}^{\perp }\), we have \(\varvec{P}{\varvec{u}}_i={\varvec{u}}_i,{\varvec{P}}\varvec{v}_j={\varvec{0}},{\varvec{P}}^\perp {\varvec{u}}_i=\varvec{0},{\varvec{P}}^\perp {\varvec{v}}_j={\varvec{v}}_j\). Further, we have

$$\begin{aligned} (\mu \varvec{\varTheta } {\varvec{P}}^\perp +\varvec{P}){\varvec{u}}_i={\varvec{u}}_i, (\mu \varvec{\varTheta } {\varvec{P}}^\perp ){\varvec{u}}_i={\varvec{0}}, (\mu \varvec{\varTheta } {\varvec{P}}^\perp +{\varvec{P}}){\varvec{v}}_j=(\mu \varvec{\varTheta } {\varvec{P}}^\perp ){\varvec{v}}_j=\mu \varvec{\varTheta } {\varvec{v}}_j.\nonumber \\ \end{aligned}$$
(29)

Hence \({\varvec{u}}_i\) is an eigenvector of \(\mu \varvec{\varTheta } {\varvec{P}}^\perp +{\varvec{P}}\) with eigenvalue \(1\) and an eigenvector of \(\mu \varvec{\varTheta } {\varvec{P}}^\perp \) with eigenvalue \(0\). Besides, if \({\varvec{v}}_j\) happens to be an eigenvector of \(\mu \varvec{\varTheta }\), then \(\varvec{v}_j\) is also an eigenvector of \(\mu \varvec{\varTheta } \varvec{P}^\perp +{\varvec{P}}\) and of \(\mu \varvec{\varTheta } \varvec{P}^\perp \). This observation suggests that the eigendecompositions of the three matrices above are related. To this end, assume that there exists a non-zero vector \({\varvec{x}}\) such that

$$\begin{aligned} ({\varvec{I}}-\mu \varvec{\varTheta } ) {\varvec{P}}^\perp {\varvec{x}}=\lambda {\varvec{x}}, \end{aligned}$$

then

$$\begin{aligned} (\mu \varvec{\varTheta } {\varvec{P}}^\perp +\varvec{P}){\varvec{x}}=(1-\lambda ){\varvec{x}}. \end{aligned}$$

For \(\lambda \ne 0\), we will discuss \({\varvec{P}}\varvec{x}=(1-\lambda ){\varvec{x}}-\mu \varvec{\varTheta } \varvec{P}^\perp {\varvec{x}}\) case by case:

Case 1: when \({\varvec{P}}{\varvec{x}}={\varvec{0}}\), i.e., \({\varvec{x}}\) is a linear combination of the \({\varvec{v}}_j\), we have \({\varvec{P}}^\perp {\varvec{x}}={\varvec{x}}\) and therefore \(\mu \varvec{\varTheta } {\varvec{P}}^\perp {\varvec{x}}=(1-\lambda ){\varvec{x}}\), so \(1-\lambda \) is an eigenvalue of the matrix \(\mu \varvec{\varTheta } {\varvec{P}}^\perp \).

Case 2: when \({\varvec{P}}{\varvec{x}}\ne {\varvec{0}}\), note first that \({\varvec{P}}^\perp {\varvec{x}}\ne {\varvec{0}}\); otherwise \(\lambda {\varvec{x}}=({\varvec{I}}-\mu \varvec{\varTheta } ){\varvec{P}}^\perp {\varvec{x}}={\varvec{0}}\) would contradict \(\lambda \ne 0\). Applying \({\varvec{P}}^\perp \) to both sides of \(({\varvec{I}}-\mu \varvec{\varTheta } ){\varvec{P}}^\perp {\varvec{x}}=\lambda {\varvec{x}}\) and using \(({\varvec{P}}^{\perp })^2={\varvec{P}}^{\perp }\) yields

$$\begin{aligned} {\varvec{P}}^\perp \mu \varvec{\varTheta } {\varvec{P}}^\perp ({\varvec{P}}^\perp {\varvec{x}})=(1-\lambda ){\varvec{P}}^\perp {\varvec{x}}, \end{aligned}$$

so \(1-\lambda \) is an eigenvalue of \({\varvec{P}}^\perp (\mu \varvec{\varTheta } {\varvec{P}}^\perp )\). Since \({\varvec{P}}^\perp (\mu \varvec{\varTheta } {\varvec{P}}^\perp )\) and \((\mu \varvec{\varTheta } {\varvec{P}}^\perp ){\varvec{P}}^\perp =\mu \varvec{\varTheta } {\varvec{P}}^\perp \) share the same non-zero eigenvalues, and both are singular, \(1-\lambda \) is an eigenvalue of \(\mu \varvec{\varTheta } {\varvec{P}}^\perp \) in this case as well. \(\square \)
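
A minimal NumPy sketch that checks Lemma 4 on a random instance; the dimension \(n\), rank \(r\), and stepsize \(\mu \) below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, mu = 8, 3, 0.1

# Random symmetric PSD Theta and a rank-r orthogonal projector P
A = rng.standard_normal((n, n))
Theta = A @ A.T                      # symmetric positive semi-definite
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
P = Q @ Q.T                          # orthogonal projection of rank r
P_perp = np.eye(n) - P

lam = np.linalg.eigvals((np.eye(n) - mu * Theta) @ P_perp)
ev1 = np.linalg.eigvals(mu * Theta @ P_perp + P)
ev2 = np.linalg.eigvals(mu * Theta @ P_perp)

# Every non-zero eigenvalue lambda of (I - mu*Theta) P_perp should reappear
# as 1 - lambda in the spectra of mu*Theta*P_perp + P and of mu*Theta*P_perp.
ok = all(np.any(np.isclose(ev1, 1 - l, atol=1e-7)) and
         np.any(np.isclose(ev2, 1 - l, atol=1e-7))
         for l in lam[np.abs(lam) > 1e-10])
print("every non-zero lambda yields 1 - lambda in both spectra:", ok)
```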

1.2 A.2 Perturbation Analysis of Subspaces

Lemma 5

(Wedin’s \(\sin \varTheta \) Theorem [8]) Let \({\varvec{X}}_t={\varvec{U}}_t \varvec{\varSigma }_t {\varvec{V}}_t^\top \) and \({\varvec{X}}_\star =\varvec{U}_\star \varvec{\varSigma }_\star {\varvec{V}}_\star ^\top \) be the SVD of \({\varvec{X}}_t, {\varvec{X}}_\star \in {{\mathbb {M}}}_r\), respectively. If \(\Vert {\varvec{X}}_t-\varvec{X}_\star \Vert <\sigma _r({\varvec{X}}_\star )\), there is an upper bound for the perturbation of the singular subspace as follows

$$\begin{aligned} \max \{\Vert P_{{\varvec{U}}_t}^\perp -P_{\varvec{U}_\star }^\perp \Vert ,\Vert P_{{\varvec{V}}_t}^\perp -P_{\varvec{V}_\star }^\perp \Vert \}\le \frac{2\Vert {\varvec{X}}_t-\varvec{X}_\star \Vert }{\sigma _r({\varvec{X}}_\star )}. \end{aligned}$$

Lemma 6

(Perturbation of subspace projection [40]) Let \({\varvec{X}}_t=\varvec{U}_t \varvec{\varSigma }_t {\varvec{V}}_t^\top \) and \(\varvec{X}_\star ={\varvec{U}}_\star \varvec{\varSigma }_\star \varvec{V}_\star ^\top \) be the SVD of \({\varvec{X}}_t, \varvec{X}_\star \in {{\mathbb {M}}}_r\), respectively. If \(\Vert \varvec{X}_t-{\varvec{X}}_\star \Vert <\sigma _r({\varvec{X}}_\star )\), then the following inequality is satisfied

$$\begin{aligned} \Vert P_{{\varvec{U}}_\star }^\perp {\varvec{X}}_t P_{\varvec{V}_\star }^\perp \Vert _F\le \frac{\Vert {\varvec{X}}_t-\varvec{X}_\star \Vert _F^2}{\sigma _r({\varvec{X}}_\star )}. \end{aligned}$$
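
Both perturbation bounds can be spot-checked numerically. The sketch below, assuming NumPy, draws a random rank-\(r\) pair \(({\varvec{X}}_t,{\varvec{X}}_\star )\) with an arbitrary small perturbation level and evaluates the inequalities of Lemma 5 and Lemma 6.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, r = 20, 15, 3

def svd_r(X):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r], s[:r], Vt[:r, :]

# Rank-r ground truth X_star and a nearby rank-r iterate X_t
U, s, Vt = svd_r(rng.standard_normal((n1, n2)))
X_star = U @ np.diag(s) @ Vt
Ut, st, Vtt = svd_r(X_star + 0.01 * rng.standard_normal((n1, n2)))
X_t = Ut @ np.diag(st) @ Vtt

sigma_r = s[-1]
E = X_t - X_star
PU_perp = np.eye(n1) - U @ U.T          # P_{U_star}^perp
PV_perp = np.eye(n2) - Vt.T @ Vt        # P_{V_star}^perp
PUt_perp = np.eye(n1) - Ut @ Ut.T
PVt_perp = np.eye(n2) - Vtt.T @ Vtt

# Lemma 5 (Wedin): subspace perturbation bounded by 2*||E|| / sigma_r(X_star)
lhs5 = max(np.linalg.norm(PUt_perp - PU_perp, 2), np.linalg.norm(PVt_perp - PV_perp, 2))
print("Lemma 5 holds:", lhs5 <= 2 * np.linalg.norm(E, 2) / sigma_r)

# Lemma 6: the part of X_t outside both star subspaces is second order in ||E||_F
lhs6 = np.linalg.norm(PU_perp @ X_t @ PV_perp, 'fro')
print("Lemma 6 holds:", lhs6 <= np.linalg.norm(E, 'fro') ** 2 / sigma_r)
```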

B Proof of Theorem 1

Proof

Let the residual matrix \({\varvec{E}}_t=\varvec{X}_t-{\varvec{X}}_\star \). According to the iteration, we have

$$\begin{aligned} \begin{aligned} {\varvec{E}}_{t+1}&={\varvec{X}}_{t+1}-{\varvec{X}}_\star \\&={\mathcal {P}}_r({\varvec{X}}_{t}-\mu _t\nabla f({\varvec{X}}_t))-{\varvec{X}}_\star \\&={\mathcal {P}}_r({\varvec{X}}_\star +{\varvec{X}}_{t}-{\varvec{X}}_\star -\mu _t\nabla f({\varvec{X}}_t))-{\varvec{X}}_\star \\&={\mathcal {P}}_r({\varvec{X}}_\star +{\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))-{\varvec{X}}_\star \\&{\mathop {=}\limits ^{(a)}}({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))-P_{{\varvec{U}}_\star }^\perp ({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))P_{{\varvec{V}}_\star }^\perp +{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2),\\ \end{aligned} \end{aligned}$$
(30)

where \((a)\) is the first-order expansion (7) of the truncated SVD. Since \(\Vert {\mathcal {I}}-\mu _t\mathcal {A}^*{\mathcal {A}}\Vert \le 1\), we have \(\Vert {\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t)\Vert _F=\Vert ({\mathcal {I}}-\mu _t{\mathcal {A}}^*\mathcal {A})({\varvec{E}}_t)\Vert _F\le \Vert \varvec{E}_t\Vert _F\le \sigma _{{r}}({\varvec{X}}_\star )/2\), which verifies that the condition of Lemma 1 holds. After vectorizing, with \(\varvec{e}_{t+1}=\text {vec}({\varvec{E}}_{t+1})\), we get

$$\begin{aligned} \begin{aligned} {\varvec{e}}_{t+1}&=\text {vec}(({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))-P_{{\varvec{U}}_\star }^\perp ({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))P_{{\varvec{V}}_\star }^\perp )+{\mathcal {O}}(\Vert {\varvec{E}}_t\Vert _F^2)\\&{\mathop {=}\limits ^{(a)}}({\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )\text {vec}({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))+{\mathcal {O}}(\Vert {\varvec{E}}_t\Vert _F^2)\\&{\mathop {=}\limits ^{(b)}}\underbrace{({\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )({\varvec{I}}-\mu _t \varvec{\varTheta })}_{{\varvec{H}}(\mu _t)}{\varvec{e}}_{t}+{\mathcal {O}}(\Vert {\varvec{e}}_{t}\Vert _2^2),\\ \end{aligned} \end{aligned}$$

where \((a)\) uses the vectorization identity for the Kronecker product, i.e., \(\text {vec}({\varvec{A}}{\varvec{B}}{\varvec{C}})=(\varvec{C}^\top \otimes {\varvec{A}})\text {vec}({\varvec{B}})\), and \((b)\) is based on (3) and \(\Vert \varvec{E}_t\Vert _F=\Vert {\varvec{e}}_{t}\Vert _2\). The convergence rate with the constant stepsize \(\mu _t\equiv \mu \) is determined by the spectral radius of the matrix \({\varvec{H}}={\varvec{H}}(\mu )\):

$$\begin{aligned} \rho ({\varvec{H}})=\max _{i}|\lambda _{i}(\varvec{H})|=\max {(|\lambda _{\max }(\varvec{H})|,|\lambda _{\min }({\varvec{H}})|)}. \end{aligned}$$

Thus, the maximum and minimum eigenvalues of \({\varvec{H}}\) should be compared. Taking MS as an example, we compute the largest eigenvalue.

$$\begin{aligned} \begin{aligned} \lambda _{\max }({\varvec{H}}_{{\textsf{M}}{\textsf{S}}})&=1-\lambda _{\min }(\mu \varvec{\varTheta }+P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp -\mu (P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )\varvec{\varTheta })\\&{\mathop {=}\limits ^{(a)}}1-\lambda _{\min }(\mu \varvec{\varTheta }-\mu (P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )\varvec{\varTheta })\\&=1-\mu \lambda _{\min }(({\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )\varvec{\varTheta }),\\ \end{aligned} \end{aligned}$$

where \((a)\) is based on Lemma 4. Similarly, the minimum eigenvalue results are as follows:

$$\begin{aligned} \lambda _{\min }(\varvec{H}_{{\textsf{M}}{\textsf{S}}})=1-\mu \lambda _{\max }(({\varvec{I}}-P_{\varvec{V}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )\varvec{\varTheta }). \end{aligned}$$

Obviously, the optimal spectral radius occurs when \(\lambda _{\max }(\varvec{H}_{{\textsf{M}}{\textsf{S}}})=-\lambda _{\min }({\varvec{H}}_{{\textsf{M}}{\textsf{S}}})\), i.e., \(1-\mu \lambda _{\min }=\mu \lambda _{\max }-1\). The corresponding stepsize is \(\mu _\dagger =\frac{2}{\lambda _{\min }+\lambda _{\max }}\). Since \({\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp \) is an orthogonal projector, we have \(\Vert {\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{\varvec{U}_\star }^\perp \Vert =1\). It is easy to check that

$$\begin{aligned} \rho ({\varvec{H}})\le \Vert {\varvec{I}}-P_{\varvec{V}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp \Vert \Vert \varvec{I}-\mu _t \varvec{\varTheta }\Vert \le \Vert {\varvec{I}}-\mu _t \varvec{\varTheta }\Vert \le 1, \end{aligned}$$

which gives (9). In particular, for MC, as shown in (6), we obtain a simplified result similar to [34]:

$$\begin{aligned} \begin{aligned} \lambda _{\max }({\varvec{H}}_{{\textsf{M}}{\textsf{C}}})&{\mathop {=}\limits ^{(a)}}1-\mu \lambda _{\min }( \varvec{S}_{\varOmega }^\top (I-P_{{\varvec{V}}}^\perp \otimes P_{\varvec{U}}^\perp ){\varvec{S}}_{\varOmega }) {\mathop {=}\limits ^{(b)}}1-\mu \lambda _{\min }( {\varvec{I}}- {\varvec{S}}_{\varOmega }^\top (P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp ){\varvec{S}}_{\varOmega })\\&=1-\mu (1-\lambda _{\max }( {\varvec{S}}_{\varOmega }^\top (P_{\varvec{V}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )\varvec{S}_{\varOmega })) {\mathop {=}\limits ^{(c)}}1-\mu (1-\lambda _{\max }({\varvec{S}}_{\varOmega }{\varvec{S}}_{\varOmega }^\top (P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )))\\&{\mathop {=}\limits ^{(d)}}1-\mu (\lambda _{\min }(P_{\varvec{V}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp -\varvec{S}_{\varOmega }{\varvec{S}}_{\varOmega }^\top (P_{\varvec{V}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp ))) {\mathop {=}\limits ^{(e)}}1-\mu (\lambda _{\min }({\varvec{S}}_{{\bar{\varOmega }}}{\varvec{S}}_{{\bar{\varOmega }}}^\top (P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )))\\&{\mathop {=}\limits ^{(f)}}1-\mu (\lambda _{\min }(\varvec{S}_{{\bar{\varOmega }}}^\top (P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp ){\varvec{S}}_{{\bar{\varOmega }}})) =1-\mu (\sigma _{\min }^2({\varvec{S}}_{{\bar{\varOmega }}}^\top (\varvec{V}_{_\star \perp } \otimes {\varvec{U}}_{_\star \perp }))), \end{aligned} \end{aligned}$$

where \((a)\), \((c)\) and \((f)\) are based on the fact that \({\varvec{A}}{\varvec{B}}\) and \({\varvec{B}}{\varvec{A}}\) have the same non-zero eigenvalues, \((b)\) and \((e)\) correspond to the properties of the sampling matrix in (6), and \((d)\) uses Lemma 4. Similarly, the minimum eigenvalue is as follows

$$\begin{aligned} \lambda _{\min }(\varvec{H}_{{\textsf{M}}{\textsf{C}}})=1-\mu (\sigma _{\max }^2(\varvec{S}_{{\bar{\varOmega }}}^\top ({\varvec{V}}_{_\star \perp }\otimes \varvec{U}_{_\star \perp }))). \end{aligned}$$

We can estimate the convergence rate of Algorithm 1. \(\square \)
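
As an illustration of the linearization (30), the following sketch builds a small Gaussian matrix-sensing instance, forms \({\varvec{H}}(\mu )\) explicitly via Kronecker products, and checks that one projected-gradient step satisfies \({\varvec{e}}_{t+1}={\varvec{H}}{\varvec{e}}_{t}+{\mathcal {O}}(\Vert {\varvec{e}}_{t}\Vert _2^2)\). The problem sizes and the stepsize \(\mu =1/m\) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, r, m = 8, 6, 2, 120
mu = 1.0 / m                          # illustrative stepsize (assumption)

# Rank-r ground truth and a Gaussian sensing operator A (rows = vec(A_i)^T)
U0, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n2, r)))
X_star = U0 @ np.diag([3.0, 1.0]) @ V0.T
A = rng.standard_normal((m, n1 * n2))
Theta = A.T @ A                       # matrix representation of A^*A (column-major vec)

def trunc_svd(X):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :], U[:, :r], Vt[:r, :].T

_, U_star, V_star = trunc_svd(X_star)
PUp = np.eye(n1) - U_star @ U_star.T
PVp = np.eye(n2) - V_star @ V_star.T
H = (np.eye(n1 * n2) - np.kron(PVp, PUp)) @ (np.eye(n1 * n2) - mu * Theta)

vec = lambda M: M.reshape(-1, order='F')
for eps in [1e-2, 1e-3, 1e-4]:
    Xt, _, _ = trunc_svd(X_star + eps * rng.standard_normal((n1, n2)))
    e_t = vec(Xt - X_star)
    grad = (Theta @ e_t).reshape(n1, n2, order='F')   # nabla f(X_t) = A^*A(X_t - X_star)
    X_next, _, _ = trunc_svd(Xt - mu * grad)          # one projected-gradient (IHT) step
    residual = np.linalg.norm(vec(X_next - X_star) - H @ e_t)
    print(f"eps={eps:.0e}: ||e_(t+1) - H e_t|| = {residual:.2e}  (should shrink like eps^2)")
```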

C Proof of Proposition 1

Proof

Vectorizing (12) yields \(\nabla _{\mathcal {R}} f({\varvec{x}}_t)={\varvec{P}}\nabla f({\varvec{x}}_t)\), where \({\varvec{P}}=({\varvec{I}}-P_{{\varvec{V}}}^\perp \otimes P_{{\varvec{U}}}^\perp )\) is the orthogonal projection matrix. Substituting (13) into the loss function gives

$$\begin{aligned} \begin{aligned} f({\varvec{x}}_{t+1})&=\frac{1}{2}({\varvec{x}}_{t+1}-{\varvec{x}}_\star )^\top \varvec{\varTheta } ({\varvec{x}}_{t+1}-{\varvec{x}}_\star )\\&=\frac{1}{2}({\varvec{x}}_t -\mu _t \nabla _{\mathcal {R}} f({\varvec{x}}_t)-{\varvec{x}}_\star )^\top \varvec{\varTheta } ({\varvec{x}}_t -\mu _t \nabla _{\mathcal {R}} f({\varvec{x}}_t)-{\varvec{x}}_\star )+{\mathcal {O}}(\Vert {\varvec{x}}_t -{\varvec{x}}_\star \Vert _2^2)\\&=f({\varvec{x}}_t)-\mu _t\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top \nabla f({\varvec{x}}_t)+\frac{\mu _t^2}{2}\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top \varvec{\varTheta } \nabla _{\mathcal {R}} f({\varvec{x}}_t)+{\mathcal {O}}(\Vert {\varvec{x}}_t -{\varvec{x}}_\star \Vert _2^2)\\&=f({\varvec{x}}_t)-\frac{(\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top \nabla f({\varvec{x}}_t))^2}{2\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top \varvec{\varTheta } \nabla _{\mathcal {R}} f({\varvec{x}}_t)}+{\mathcal {O}}(\Vert {\varvec{x}}_t -{\varvec{x}}_\star \Vert _2^2)\\&{\mathop {=}\limits ^{(a)}}\left( 1-\frac{(\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top \nabla f({\varvec{x}}_t))^2}{(\nabla f({\varvec{x}}_t)^\top ({\varvec{P}}\varvec{\varTheta } {\varvec{P}}) \nabla f({\varvec{x}}_t))(\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top ({\varvec{P}}\varvec{\varTheta } {\varvec{P}})^{+} \nabla _{\mathcal {R}} f({\varvec{x}}_t))}\right) f({\varvec{x}}_t)\\&\quad +{\mathcal {O}}(\Vert {\varvec{x}}_t -{\varvec{x}}_\star \Vert _2^2)\\&{\mathop {\le }\limits ^{(b)}}\left( 1-\frac{4}{\frac{\lambda _{\max }({\varvec{P}}\varvec{\varTheta } {\varvec{P}})}{\lambda _{\min }({\varvec{P}}\varvec{\varTheta } {\varvec{P}})} + 2+\frac{\lambda _{\min }({\varvec{P}}\varvec{\varTheta } {\varvec{P}})}{\lambda _{\max }({\varvec{P}}\varvec{\varTheta } {\varvec{P}})}}\right) f({\varvec{x}}_t)+{\mathcal {O}}(\Vert {\varvec{x}}_t -{\varvec{x}}_\star \Vert _2^2)\\&{\mathop {\le }\limits ^{(c)}}\left( \frac{\kappa -1}{\kappa +1}\right) ^2 f({\varvec{x}}_t)+{\mathcal {O}}(\Vert {\varvec{x}}_t -{\varvec{x}}_\star \Vert _2^2),\\ \end{aligned} \end{aligned}$$

where (a) follows from \(f({\varvec{x}}_t)=\frac{1}{2}(\varvec{x}_t-{\varvec{x}}_\star )^\top \varvec{\varTheta } (\varvec{x}_t-{\varvec{x}}_\star )=\frac{1}{2}\nabla _{\mathcal {R}} f(\varvec{x}_t)^\top ({\varvec{P}}\varvec{\varTheta } {\varvec{P}})^{+} \nabla _{\mathcal {R}} f({\varvec{x}}_t)\). Furthermore, since \(\frac{|\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top \nabla f({\varvec{x}}_t)|}{\Vert \nabla _{\mathcal {R}} f({\varvec{x}}_t)\Vert _2 \Vert \nabla f({\varvec{x}}_t)\Vert _2}\ge 0\), we apply the generalized Kantorovich type inequality [14] in Lemma 7 to get (b). To prove (c), we only need to show

$$\begin{aligned} \begin{aligned} \frac{\lambda _{\max }({\varvec{P}}\varvec{\varTheta } \varvec{P})}{\lambda _{\min }({\varvec{P}}\varvec{\varTheta } {\varvec{P}})}&=\Vert {\varvec{P}}\varvec{\varTheta } {\varvec{P}}\Vert \Vert (\varvec{P}\varvec{\varTheta })^+{\varvec{P}}^+\Vert \le \Vert {\varvec{P}}\varvec{\varTheta }\Vert \Vert {\varvec{P}}\Vert \Vert {\varvec{P}}^+\Vert \Vert ({\varvec{P}}\varvec{\varTheta })^+\Vert \\&=\Vert {\varvec{P}}\varvec{\varTheta }\Vert \Vert ({\varvec{P}}\varvec{\varTheta })^+\Vert = \frac{\lambda _{\max }({\varvec{P}}\varvec{\varTheta })}{\lambda _{\min }({\varvec{P}}\varvec{\varTheta })}:=\kappa , \end{aligned} \end{aligned}$$

here, \(\lambda _{\min }(\cdot )\) means the smallest non-zero eigenvalue. \(\square \)

Lemma 7

(Kantorovich inequality [14]) Let \({\varvec{A}}\) be a symmetric (semi-) positive definite matrix, and let \(\lambda _{\max }\) and \(\lambda _{\min }\) be its largest and smallest non-zero eigenvalues, respectively. If \(\varvec{x},{\varvec{y}}\in {{\mathbb {R}}}^n\) satisfy \(\frac{|{\varvec{x}}^\top {\varvec{y}}|}{\Vert {\varvec{x}}\Vert _2 \Vert \varvec{y}\Vert _2}\ge \cos \theta \) with \(0\le \theta \le \frac{\pi }{2}\), then

$$\begin{aligned} \frac{({\varvec{x}}^\top {\varvec{y}})^2}{({\varvec{x}}^\top {\varvec{A}}{\varvec{x}})({\varvec{y}}^\top \varvec{A}^{+}{\varvec{y}})}\ge \frac{4}{\kappa + 2+\kappa ^{-1}}, \end{aligned}$$

where \(\kappa =\frac{\lambda _{\max }}{\lambda _{\min }}\frac{1+\sin \theta }{1-\sin \theta }\) and \((\cdot )^{+}\) is the Moore-Penrose inverse. When \({\varvec{A}}\) is positive definite and \({\varvec{x}}={\varvec{y}}\), i.e., \({\varvec{A}}^{+}={\varvec{A}}^{-1}\) and \(\theta =0\), the above inequality degenerates into the traditional form.
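
A quick numerical spot-check of the stated inequality; for simplicity the sketch takes \({\varvec{A}}\) positive definite (so that \({\varvec{A}}^{+}={\varvec{A}}^{-1}\)) and chooses \(\theta \) as the actual angle between two random vectors, both illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10
B = rng.standard_normal((n, n))
A = B @ B.T + 0.1 * np.eye(n)         # symmetric positive definite
evals = np.linalg.eigvalsh(A)
lam_min, lam_max = evals[0], evals[-1]

x = rng.standard_normal(n)
y = rng.standard_normal(n)
cos_t = abs(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
sin_t = np.sqrt(1.0 - cos_t ** 2)     # theta = actual angle between x and y

kappa = (lam_max / lam_min) * (1 + sin_t) / (1 - sin_t)
lhs = (x @ y) ** 2 / ((x @ A @ x) * (y @ np.linalg.inv(A) @ y))
rhs = 4.0 / (kappa + 2.0 + 1.0 / kappa)
print(f"lhs = {lhs:.4e} >= rhs = {rhs:.4e}: {lhs >= rhs}")
```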

D Proof of Theorem 2

Proof

According to Algorithm 3, we calculate the error as follows.

$$\begin{aligned} \begin{aligned} {\varvec{E}}_{t+1}&={\varvec{X}}_{t+1}-{\varvec{X}}_\star \\&={\mathcal {P}}_r({\varvec{Y}}_t-\mu _t \nabla f({\varvec{Y}}_t))-{\varvec{X}}_\star \\&={\mathcal {P}}_r({\varvec{X}}_\star +{\varvec{Y}}_t-{\varvec{X}}_\star -\mu _t \nabla f({\varvec{Y}}_t))-{\varvec{X}}_\star \\&=({\varvec{Y}}_t-{\varvec{X}}_\star -\mu _t \nabla f({\varvec{Y}}_t))-P_{{\varvec{U}}_\star }^\perp ({\varvec{Y}}_t-{\varvec{X}}_\star -\mu _t \nabla f({\varvec{Y}}_t))P_{{\varvec{V}}_\star }^\perp +{\mathcal {O}}(\Vert {\varvec{Y}}_t-{\varvec{X}}_\star \Vert _F^2).\\ \end{aligned} \end{aligned}$$

After vectorizing, we have

$$\begin{aligned} \begin{aligned} {\varvec{e}}_{t+1}&=\underbrace{({\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp )({\varvec{I}}-\mu _t \varvec{\varTheta })}_{{\varvec{H}}_t={\varvec{H}}(\mu _t)}\text {vec}({\varvec{Y}}_t-{\varvec{X}}_\star )+{\mathcal {O}}(\Vert {\varvec{Y}}_t-{\varvec{X}}_\star \Vert _F^2)\\&=(1+\eta _t){\varvec{H}}_t{\varvec{e}}_{t}-\eta _t {\varvec{H}}_t{\varvec{e}}_{t-1}+{\mathcal {O}}(\Vert {\varvec{e}}_{t}\Vert _2^2).\\ \end{aligned} \end{aligned}$$

Stacking the errors of two adjacent iterations, we get the recursive form

$$\begin{aligned} \begin{aligned} \begin{pmatrix} {\varvec{e}}_{t+1}\\ {\varvec{e}}_{t} \end{pmatrix}=\underbrace{\begin{pmatrix} (1+\eta _t){\varvec{H}}_t &{}-\eta _t{\varvec{H}}_t\\ {\varvec{I}}&{}{\varvec{0}} \end{pmatrix}}_{{\varvec{T}}} \begin{pmatrix} {\varvec{e}}_{t}\\ {\varvec{e}}_{t-1} \end{pmatrix}. \end{aligned} \end{aligned}$$

The convergence rate depends on the spectral radius \(\rho ({\varvec{T}})\) of \({\varvec{T}}\in {\mathbb {R}}^{2n_1n_2\times 2n_1n_2}\). According to the eigendecomposition in [34], \({\varvec{T}}\) is similar to a block diagonal matrix composed of \(2\times 2\) blocks \({\varvec{T}}_j\), i.e., \({\varvec{T}}\sim \text {bldiag}({\varvec{T}}_1,{\varvec{T}}_2,\ldots ,\varvec{T}_{n_1n_2})\), where each block \({\varvec{T}}_j\in {\mathbb {R}}^{2\times 2}\) is of the form

$$\begin{aligned} {\varvec{T}}_j=\begin{pmatrix} (1+\eta _t)(1-\mu _t\lambda _j) &{}-\eta _t(1-\mu _t\lambda _j)\\ 1&{}0 \end{pmatrix}. \end{aligned}$$

where \(\lambda _j\) is an eigenvalue of the matrix \((\varvec{I}-P_{{\varvec{V}}_\star }^\perp \otimes P_{\varvec{U}_\star }^\perp )\varvec{\varTheta }\). Next, we find the eigenvalues of \({\varvec{T}}_j\) from its characteristic polynomial

$$\begin{aligned} r^2-(1+\eta _t)(1-\mu _t\lambda _j)r+\eta _t(1-\mu _t\lambda _j)=0. \end{aligned}$$
(31)

According to the quadratic formula, set the discriminant \(\varDelta (\lambda _j,\mu _t,\eta _t)=(1+\eta _t)^2(1-\mu _t\lambda _j)^2-4\eta _t(1-\mu _t\lambda _j)\), then the solution to (31) is:

$$\begin{aligned} r^{\pm }(\lambda _j,\mu _t,\eta _t)=\frac{(1+\eta _t)(1-\mu _t\lambda _j)\pm \sqrt{\varDelta (\lambda _j,\mu _t,\eta _t)}}{2}, \end{aligned}$$
(32)

where the superscript \((\cdot )^{\pm }\) indicates addition or subtraction in the numerator. For a given \({\varvec{T}}\) with fixed \((\mu _t,\eta _t)\), \(\rho ({\varvec{T}})=\max _{\lambda _j} |r^{\pm }(\lambda _j,\mu _t,\eta _t)|\) is continuous and quasi-convex w.r.t. the eigenvalue \(\lambda _j\) [18, 21, 37]. Thus, the extremal value is attained on the boundary, i.e.

$$\begin{aligned} \rho ({\varvec{T}})=\max (|r^{\pm }(\lambda _{\max },\mu _t,\eta _t)|,|r^{\pm }(\lambda _{\min },\mu _t,\eta _t)|). \end{aligned}$$
(33)

As a whole, \(\rho ({\varvec{T}})\) is determined by the maximum modulus in (33). Let the surfaces \(\varPi _1\) and \(\varPi _2\) denote \(|r^{\pm }(\lambda _{\min },\mu _t,\eta _t)|\) and \(|r^{\pm }(\lambda _{\max },\mu _t,\eta _t)|\), viewed as functions of \((\mu _t,\eta _t)\), respectively.

Below we show how to determine the minimum spectral radius and the corresponding parameters. Returning to (32), \(|r^{\pm }(\lambda _j,\mu _t,\eta _t)|\ge |(1+\eta _t)(1-\mu _t\lambda _j)|/2\) holds with equality if and only if \(\varDelta (\lambda _j,\mu _t,\eta _t)=0\). In this case, we obtain a relationship between the parameters \((\mu _t,\eta _t)\)

$$\begin{aligned} \eta _t^-=\frac{1-\sqrt{\mu _t\lambda _j}}{1+\sqrt{\mu _t\lambda _j}},\eta _t^+=\frac{1+\sqrt{\mu _t\lambda _j}}{1-\sqrt{\mu _t\lambda _j}}. \end{aligned}$$
(34)

Obviously, \(0<\eta _t^-<1<\eta _t^+\). Given \(\mu _t\), there are three cases for \(\eta _t\).

  • (32) with \(\eta _t\in (0,\eta _{t}^-)\cup (\eta _{t}^+,\infty )\) has two different solutions.

  • (32) with \(\eta _t=\eta _{t}^\pm \) has a single solution.

  • (32) with \(\eta _t\in (\eta _{t}^-,\eta _{t}^+)\) has conjugate complex solutions.

If \(\eta _t\in [\eta _{t}^+,\infty )\), then \(r^{\pm }(\lambda _j,\mu _t,\eta _t^+)\ge |(1+\eta _t^+)(1-\mu _t\lambda _j)|/2=1+\sqrt{\mu _t\lambda _j}>1\), and \(\rho ({\varvec{T}})>1\) is obtained from (33). Conversely, when \(\eta _t=\eta _{t}^-\), \(r^{\pm }(\lambda _j,\mu _t,\eta _t^-)=|(1+\eta _t^-)(1-\mu _t\lambda _j)|/2=1-\sqrt{\mu _t\lambda _j}<1\). This is also why the parameter is selected as \(0<\eta \le 1\) in practice. When \(\eta _t\in (\eta _{t}^-,\eta _{t}^+)\), \(\rho ({\varvec{T}})=\max _{\lambda _j} \sqrt{\eta _t(1-\mu _t\lambda _j)}\) increases monotonically w.r.t. \(\eta _t\) and decreases monotonically w.r.t. \(\mu _t\). This describes the geometric behaviour of \(\rho ({\varvec{T}})\) w.r.t. \((\mu _t,\eta _t)\), and the condition \(\varDelta (\lambda _j,\mu _t,\eta _t^-)=0\) helps to find the theoretical lower bound of \(\rho ({\varvec{T}})\). The optimal parameter pair \((\mu _\flat ,\eta _\flat )\) is the intersection of the curve \(\eta _t^-=\frac{1-\sqrt{\mu _t\lambda _{\min }}}{1+\sqrt{\mu _t\lambda _{\min }}}\), on which \(r^{\pm }(\lambda _{\min },\mu _\flat ,\eta _\flat )\) coincide, with the surface \(\varPi _2\), i.e., \(|r^{-}(\lambda _{\max },\mu _t,\eta _t)|\). So it satisfies the following equation

$$\begin{aligned} (1+\eta _\flat )(1-\mu _\flat \lambda _{\min })=-(1+\eta _\flat )(1-\mu _\flat \lambda _{\max })+\sqrt{(1+\eta _\flat )^2(1-\mu _\flat \lambda _{\max })^2-4\eta _\flat (1-\mu _\flat \lambda _{\max })}. \end{aligned}$$

Substituting \(\eta _\flat =\frac{1-\sqrt{\mu _\flat \lambda _{\min }}}{1+\sqrt{\mu _\flat \lambda _{\min }}}\), it is not difficult to obtain the optimal convergence result \(\mu _{\flat }=\frac{4}{\lambda _{\min }+3\lambda _{\max }}\) and \(\rho _{\textsf{opt}}(\varvec{T})=1-\sqrt{\frac{4\lambda _{\min }}{\lambda _{\min }+3\lambda _{\max }}}\) in (18). Also, for \(\eta _t<1\), the intersection of \(\varPi _1\) and \(\varPi _2\) can be calculated according to monotonicity

$$\begin{aligned} r^{+}(\lambda _{\min },\mu _t,\eta _t)=-r^{-}(\lambda _{\max },\mu _t,\eta _t). \end{aligned}$$

If \(\eta _t=0\), it simplifies to \(\mu _t=2/(\lambda _{\min }+\lambda _{\max })=\mu _\dagger \) in (11). Due to the momentum, the optimal stepsizes satisfy \(\mu _\flat <\mu _\dagger \). In fact, setting \(\eta _t=0\) gives \({\varvec{e}}_t={\varvec{H}}{\varvec{e}}_{t-1}+\mathcal {O}(\Vert {\varvec{e}}_{t-1}\Vert _2^2)\), which is consistent with the non-accelerated iteration. Conversely, if \(\mu _t\ge \mu _\dagger \), then \(\eta _t=0\) is a good parameter choice, which means that NAG degenerates to Grad. When \(\eta _t\ne 0\), we have

$$\begin{aligned} \eta _t\mu _t^2(\lambda _{\max }-\lambda _{\min })^2+2(1+\eta _t)^2(1-\mu _t\lambda _{\max })(1-\mu _t\lambda _{\min })(2-\mu _t(\lambda _{\min }+\lambda _{\max }))=0. \end{aligned}$$

Despite the complicated form, we can solve this with symbolic computation for \(\mu _t\in (\mu _\flat ,\mu _\dagger )\):

$$\begin{aligned} \begin{aligned} \eta _{t\bowtie }&=[(-4\lambda _{\min }^2\lambda _{\max }\mu _t^3+5\lambda _{\min }^2\mu _t^2-4\lambda _{\min }\lambda _{\max }^2\mu _t^3+14\lambda _{\min }\lambda _{\max }\mu _t^2-12\lambda _{\min }\mu _t+5\lambda _{\max }^2\mu _t^2-12\lambda _{\max }\mu _t+8)\\&-\sqrt{\mu _t^2(-(\lambda _{\max }-\lambda _{\min })^2)(8\lambda _{\min }^2\lambda _{\max }\mu _t^3-9\lambda _{\min }^2\mu _t^2+8\lambda _{\min }\lambda _{\max }^2\mu _t^3-30\lambda _{\min }\mu _t^2+24\lambda _{\min }\mu _t-9\lambda _{\max }^2\mu _t^2+24\lambda _{\max }\mu _t-16)}]\\&/(4(\lambda _{\min }\mu _t-1)(\lambda _{\max }\mu _t-1)(\lambda _{\min }\mu _t+\lambda _{\max }\mu _t-2)). \end{aligned} \end{aligned}$$

Analyzing the spectral radius of \({\varvec{T}}\) in (33) case by case w.r.t. the pair \((\mu _t,\eta _{t})\) completes the proof. \(\square \)
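
The closed-form optimum can be checked against a brute-force evaluation of (33). The sketch below, with arbitrary illustrative values for \(\lambda _{\min }\) and \(\lambda _{\max }\), computes \(\rho ({\varvec{T}})\) from the roots of (31) at the two extreme eigenvalues, evaluates it at \((\mu _\flat ,\eta _\flat )\), and confirms that a coarse grid search over \((\mu _t,\eta _t)\) finds nothing better than \(\rho _{\textsf{opt}}({\varvec{T}})\).

```python
import numpy as np

lam_min, lam_max = 1.0, 10.0          # illustrative extreme eigenvalues of (I - P^perp (x) P^perp) Theta

def rho_T(mu, eta):
    """Spectral radius over the 2x2 companion blocks at lambda_min and lambda_max, cf. (31)-(33)."""
    radius = 0.0
    for lam in (lam_min, lam_max):
        a = 1.0 - mu * lam
        roots = np.roots([1.0, -(1.0 + eta) * a, eta * a])   # characteristic polynomial (31)
        radius = max(radius, np.max(np.abs(roots)))
    return radius

# Closed-form optimum from Theorem 2
mu_opt = 4.0 / (lam_min + 3.0 * lam_max)
eta_opt = (1.0 - np.sqrt(mu_opt * lam_min)) / (1.0 + np.sqrt(mu_opt * lam_min))
rho_opt = 1.0 - np.sqrt(4.0 * lam_min / (lam_min + 3.0 * lam_max))
print(f"closed form: rho = {rho_opt:.4f} at (mu, eta) = ({mu_opt:.4f}, {eta_opt:.4f})")
print(f"rho_T(mu_opt, eta_opt) = {rho_T(mu_opt, eta_opt):.4f}")

# A coarse grid search should not find anything better than rho_opt
grid_best = min(rho_T(mu, eta)
                for mu in np.linspace(0.01, 2.0 / lam_max, 60)
                for eta in np.linspace(0.0, 0.99, 60))
print(f"best rho on grid       = {grid_best:.4f}")
```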

E Proof in Sect. 4

1.1 E.1 Proof of Lemma 3

Proof

The first one obviously holds according to Lemma 1. From (25), we have

$$\begin{aligned} \begin{aligned} {\mathcal {R}}_{{\varvec{X}}}^{\textsf{orth}}({\varvec{N}})&=({\varvec{X}}+{\varvec{N}}){\varvec{V}}_{{\varvec{X}}}(\varvec{\varSigma }_{{\varvec{X}}}+{\varvec{U}}_{{\varvec{X}}}^\top {\varvec{N}} {\varvec{V}}_{{\varvec{X}}})^{-1}{\varvec{U}}_{{\varvec{X}}}^\top ({\varvec{X}}+{\varvec{N}})\\&{\mathop {=}\limits ^{(a)}}({\varvec{X}}+{\varvec{N}}){\varvec{V}}_{{\varvec{X}}}(\varvec{\varSigma }_{{\varvec{X}}}^{-1}-\varvec{\varSigma }_{{\varvec{X}}}^{-1}{\varvec{U}}_{{\varvec{X}}}^\top {\varvec{N}}{\varvec{V}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}}^{-1}){\varvec{U}}_{{\varvec{X}}}^\top ({\varvec{X}}+{\varvec{N}})+{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2)\\&=({\varvec{X}}+{\varvec{N}})({\varvec{V}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}}^{-1}{\varvec{U}}_{{\varvec{X}}}^\top -{\varvec{V}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}}^{-1}{\varvec{U}}_{{\varvec{X}}}^\top {\varvec{N}} {\varvec{V}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}}^{-1}{\varvec{U}}_{{\varvec{X}}}^\top )({\varvec{X}}+{\varvec{N}})+{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2)\\&=({\varvec{X}}+{\varvec{N}})({\varvec{X}}^{-\top }-{\varvec{X}}^{-\top }{\varvec{N}} {\varvec{X}}^{-\top })({\varvec{X}}+{\varvec{N}})+{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2)\\&{\mathop {=}\limits ^{(b)}}{\varvec{X}}{\varvec{X}}^{-\top }{\varvec{X}}+{\varvec{N}}{\varvec{X}}^{-\top }{\varvec{X}}+{\varvec{X}}{\varvec{X}}^{-\top }{\varvec{N}}-{\varvec{X}}{\varvec{X}}^{-\top }{\varvec{N}} {\varvec{X}}^{-\top }{\varvec{X}}+{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2)\\&{\mathop {=}\limits ^{(c)}}{\varvec{X}}+P_{{\varvec{U}}_{{\varvec{X}}}} {\varvec{N}}+{\varvec{N}} P_{{\varvec{V}}_{{\varvec{X}}}}-P_{{\varvec{U}}_{{\varvec{X}}}} {\varvec{N}} P_{{\varvec{V}}_{{\varvec{X}}}}+{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2)\\&={\varvec{X}}+{\varvec{N}}-P_{{\varvec{U}}_{{\varvec{X}}}}^\perp {\varvec{N}} P_{{\varvec{V}}_{{\varvec{X}}}}^\perp +{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2)\\&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}}}({\varvec{X}}+\varvec{N})+{\mathcal {O}}(\Vert {\varvec{N}}\Vert _F^2), \end{aligned} \end{aligned}$$

where (a) is the perturbation expansion of the matrix inverse: as long as \(\Vert {\varvec{A}}^{-1}{\varvec{B}}\Vert <1\) or \(\Vert \varvec{B}{\varvec{A}}^{-1}\Vert <1\) holds, the Taylor expansion of the inverse of a matrix sum is

$$\begin{aligned} \begin{aligned} ({\varvec{A}}+{\varvec{B}})^{-1}&={\varvec{A}}^{-1} - {\varvec{A}}^{-1}{\varvec{B}}{\varvec{A}}^{-1} + {\varvec{A}}^{-1}({\varvec{B}}{\varvec{A}}^{-1})^2 - {\varvec{A}}^{-1}({\varvec{B}}{\varvec{A}}^{-1})^3 + \cdots \\&={\varvec{A}}^{-1}-{\varvec{A}}^{-1}{\varvec{B}}\varvec{A}^{-1}+{\mathcal {O}}(\Vert {\varvec{B}}\Vert _F^2). \end{aligned} \end{aligned}$$

Using the norm inequality \(\Vert {\varvec{A}}\varvec{B}\Vert \le \Vert {\varvec{A}}\Vert \Vert {\varvec{B}}\Vert \), combined with the condition \(\Vert {\varvec{N}}\Vert \le \Vert {\varvec{N}}\Vert _F< \sigma _{{r}}({\varvec{X}})/2\), we can verify that this invertibility condition holds:

$$\begin{aligned} \Vert \varvec{\varSigma }_{{\varvec{X}}}^{-1}({\varvec{U}}_{\varvec{X}}^\top {\varvec{N}} {\varvec{V}}_{\varvec{X}})\Vert \le \frac{\Vert {\varvec{U}}_{{\varvec{X}}}^\top {\varvec{N}} {\varvec{V}}_{{\varvec{X}}}\Vert }{\sigma _{{r}}({\varvec{X}})}\le \frac{\Vert \varvec{N}\Vert }{\sigma _{{r}}({\varvec{X}})}<1. \end{aligned}$$

(b) merges products of two or more copies of \({\varvec{N}}\) into the higher-order term, and (c) uses the SVD of \({\varvec{X}}\) to get

$$\begin{aligned} \begin{aligned}&{\varvec{X}}{\varvec{X}}^{-\top }={\varvec{U}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}} {\varvec{V}}_{{\varvec{X}}}^\top {\varvec{V}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}}^{-1}{\varvec{U}}_{{\varvec{X}}}^\top ={\varvec{U}}_{{\varvec{X}}}{\varvec{U}}_{{\varvec{X}}}^\top =P_{{\varvec{U}}_{{\varvec{X}}}},\\&{\varvec{X}}^{-\top }{\varvec{X}}={\varvec{V}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}}^{-1}{\varvec{U}}_{{\varvec{X}}}^\top {\varvec{U}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}} {\varvec{V}}_{{\varvec{X}}}^\top ={\varvec{V}}_{{\varvec{X}}}{\varvec{V}}_{{\varvec{X}}}^\top =P_{{\varvec{V}}_{{\varvec{X}}}},\\&{\varvec{X}}{\varvec{X}}^{-\top }{\varvec{X}}=P_{\varvec{U}_{{\varvec{X}}}} {\varvec{U}}_{{\varvec{X}}}\varvec{\varSigma }_{{\varvec{X}}} {\varvec{V}}_{\varvec{X}}^\top ={\varvec{U}}_{{\varvec{X}}}\varvec{\varSigma }_{\varvec{X}} {\varvec{V}}_{{\varvec{X}}}^\top ={\varvec{X}}. \end{aligned} \end{aligned}$$

\(\square \)
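
The first-order agreement between the orthographic retraction (25) and the tangent-space projection can also be observed numerically. In the sketch below, the rank, dimensions, and perturbation sizes are illustrative; the gap should shrink quadratically with \(\Vert {\varvec{N}}\Vert _F\).

```python
import numpy as np

rng = np.random.default_rng(4)
n1, n2, r = 10, 8, 3

U, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V, _ = np.linalg.qr(rng.standard_normal((n2, r)))
Sigma = np.diag([5.0, 3.0, 1.0])
X = U @ Sigma @ V.T                    # rank-r base point with sigma_r(X) = 1

def retr_orth(N):
    """Orthographic retraction (25): (X+N) V (Sigma + U^T N V)^{-1} U^T (X+N)."""
    M = np.linalg.inv(Sigma + U.T @ N @ V)
    return (X + N) @ V @ M @ U.T @ (X + N)

def proj_tangent(Z):
    """Tangent-space projection at X: Z - P_U^perp Z P_V^perp."""
    PUp = np.eye(n1) - U @ U.T
    PVp = np.eye(n2) - V @ V.T
    return Z - PUp @ Z @ PVp

for eps in [1e-2, 1e-3, 1e-4]:
    N = eps * rng.standard_normal((n1, n2))
    gap = np.linalg.norm(retr_orth(N) - proj_tangent(X + N), 'fro')
    print(f"||N||_F = {np.linalg.norm(N, 'fro'):.2e}  ->  gap = {gap:.2e}  (O(||N||_F^2))")
```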

1.2 E.2 Convergence for Algorithm 4

Proof

According to Algorithm 4, we have

$$\begin{aligned} \begin{aligned} {\varvec{E}}_{t+1}&={\mathcal {R}}_{{\varvec{X}}_t}(-\mu _t\text {grad}f({\varvec{X}}_t))-{\varvec{X}}_\star \\&{\mathop {=}\limits ^{(a)}}{\mathcal {P}}_{{{\mathbb {T}}}_{X_t}{{\mathbb {M}}}_r}({\varvec{X}}_t-\mu _t\nabla f({\varvec{X}}_t))-{\varvec{X}}_\star +{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2)\\&{\mathop {=}\limits ^{(b)}}({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))-P_{{\varvec{U}}_t}^\perp ({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))P_{{\varvec{V}}_t}^\perp +{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2)\\&{\mathop {=}\limits ^{(c)}}({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))-P_{{\varvec{U}}_\star }^\perp ({\varvec{E}}_{t}-\mu _t\nabla f({\varvec{X}}_t))P_{{\varvec{V}}_\star }^\perp +{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2),\\ \end{aligned} \end{aligned}$$

where \((a)\) uses Lemma 3, \((b)\) is based on the tangent space projection in (22), and \((c)\) uses the subspace perturbation in Lemma 5 to replace the subspace \({\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_t}{{\mathbb {M}}}_r}\) with \(\mathcal {P}_{{{\mathbb {T}}}_{{\varvec{X}}_\star }{{\mathbb {M}}}_r}\): for any matrix \({\varvec{A}}\) with \(\Vert {\varvec{A}}\Vert ={\mathcal {O}}(\Vert {\varvec{E}}_t\Vert _F)\),

$$\begin{aligned} \begin{aligned} \Vert P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{\varvec{V}_t}^\perp -P_{{\varvec{U}}_\star }^\perp {\varvec{A}}{\varvec{P}}_{\varvec{V}_\star }^\perp \Vert&=\Vert P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_t}^\perp -P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_\star }^\perp +P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_\star }^\perp -P_{{\varvec{U}}_\star }^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_\star }^\perp \Vert \\&\le \Vert P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_t}^\perp -P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_\star }^\perp \Vert +\Vert P_{{\varvec{U}}_t}^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_\star }^\perp -P_{{\varvec{U}}_\star }^\perp {\varvec{A}}{\varvec{P}}_{{\varvec{V}}_\star }^\perp \Vert \\&\le \Vert P_{{\varvec{U}}_t}^\perp \Vert \Vert {\varvec{A}}\Vert \Vert P_{{\varvec{V}}_t}^\perp -P_{{\varvec{V}}_\star }^\perp \Vert +\Vert P_{{\varvec{U}}_t}^\perp -P_{{\varvec{U}}_\star }^\perp \Vert \Vert {\varvec{A}}\Vert \Vert P_{{\varvec{V}}_\star }^\perp \Vert \\&={\mathcal {O}}(\Vert {\varvec{E}}_t\Vert _F^2), \end{aligned} \end{aligned}$$

The subsequent proof is consistent with the proof of Theorem 1 in Appendix 1. \(\square \)

1.3 E.3 Convergence for Algorithm 5

Proof

The proof is divided into three steps to analyse \(\varvec{X}_{t-1}\), \({\varvec{Y}}_t\) and \({\varvec{X}}_{t+1}\), respectively.

Step 1 Calculate the inverse orthographic retraction of \(\varvec{X}_{t-1}\) at \({\varvec{X}}_t\).

$$\begin{aligned} \begin{aligned} \textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{\varvec{X}_t}({\varvec{X}}_{t-1})&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_t}{{\mathbb {M}}}_r}({\varvec{X}}_{t-1}-{\varvec{X}}_t)\\&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_t}{{\mathbb {M}}}_r}({\varvec{X}}_{t-1})-{\varvec{X}}_t\\&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_\star }{{\mathbb {M}}}_r}({\varvec{X}}_{t-1})-{\varvec{X}}_t+{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2)\\&={\varvec{X}}_{t-1}-{\varvec{X}}_t+{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2+\Vert {\varvec{E}}_{t-1}\Vert _F^2).\\ \end{aligned} \end{aligned}$$

This gives a first-order approximation of \({\varvec{X}}_{t-1}-{\varvec{X}}_t\) in the tangent space \({{\mathbb {T}}}_{{\varvec{X}}_t}{{\mathbb {M}}}_r\).

Step 2 Similar to Appendix 1, we calculate the residual of \({\varvec{Y}}_{t}\)

$$\begin{aligned} \begin{aligned} {\varvec{Y}}_{t}-{\varvec{X}}_\star&={\mathcal {R}}^{\textsf{orth}}_{{\varvec{X}}_t}(-\eta _t \textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{X}}_t}({\varvec{X}}_{t-1}))-X_\star \\&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_t}{{\mathbb {M}}}_r}({\varvec{X}}_t-\eta _t \textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{X}}_t}({\varvec{X}}_{t-1}))-{\varvec{X}}_\star +{\mathcal {O}}(\Vert {\varvec{E}}_t\Vert _F^2)\\&={\varvec{X}}_t-\eta _t \textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{X}}_t}({\varvec{X}}_{t-1})-{\varvec{X}}_\star +{\mathcal {O}}(\Vert {\varvec{E}}_t\Vert _F^2)\\&={\varvec{X}}_t-{\varvec{X}}_\star +\eta _t ({\varvec{X}}_t-{\varvec{X}}_{t-1})+{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2+\Vert {\varvec{E}}_{t-1}\Vert _F^2)\\&={\varvec{E}}_t+\eta _t({\varvec{E}}_t-\varvec{E}_{t-1})+{\mathcal {O}}(\Vert {\varvec{E}}_{t}\Vert _F^2+\Vert \varvec{E}_{t-1}\Vert _F^2). \end{aligned} \end{aligned}$$

Thus, to first order, \({\varvec{Y}}_t\) satisfies the same linear extrapolation as in the Euclidean setting.

Step 3 Compute \({\varvec{X}}_{t+1}-{\varvec{X}}_\star \) to get the recursive form

$$\begin{aligned} \begin{aligned} {\varvec{E}}_{t+1}&={\varvec{X}}_{t+1}-{\varvec{X}}_\star \\&={\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_t}(-\mu _t\text {grad}f({\varvec{Y}}_t))-{\varvec{X}}_\star \\&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{Y}}_t}{{\mathbb {M}}}_r}({\varvec{Y}}_t-\mu _t \nabla f({\varvec{Y}}_t))-{\varvec{X}}_\star +{\mathcal {O}}(\Vert {\varvec{Y}}_t-{\varvec{X}}_\star \Vert _F^2)\\&={\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_\star }{{\mathbb {M}}}_r}({\varvec{Y}}_t-\mu _t \nabla f({\varvec{Y}}_t))-{\varvec{X}}_\star +{\mathcal {O}}(\Vert {\varvec{Y}}_t-{\varvec{X}}_\star \Vert _F^2)\\&=({\varvec{Y}}_t-{\varvec{X}}_\star -\mu _t \nabla f({\varvec{Y}}_t))-P_{{\varvec{U}}_\star }^\perp ({\varvec{Y}}_t-{\varvec{X}}_\star -\mu _t \nabla f({\varvec{Y}}_t))P_{{\varvec{V}}_\star }^\perp +{\mathcal {O}}(\Vert {\varvec{Y}}_t-{\varvec{X}}_\star \Vert _F^2).\\ \end{aligned} \end{aligned}$$

The subsequent proof is consistent with the proof of Theorem 2 in Appendix 1. \(\square \)
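
For concreteness, the following self-contained sketch implements the accelerated Riemannian iteration analysed in this appendix (extrapolation through the inverse orthographic retraction, followed by a Riemannian gradient step with orthographic retraction) on a small Gaussian matrix-sensing instance. The stepsize, momentum, warm start, and problem sizes are illustrative assumptions rather than the tuned values used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(5)
n1, n2, r, m = 12, 10, 2, 300
mu, eta = 0.5 / m, 0.3                 # conservative illustrative stepsize and momentum

# Gaussian matrix sensing: y = A vec(X_star), f(X) = 0.5*||A vec(X) - y||^2
A = rng.standard_normal((m, n1 * n2))
U0, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n2, r)))
X_star = U0 @ np.diag([2.0, 1.0]) @ V0.T
vec = lambda M: M.reshape(-1, order='F')
mat = lambda v: v.reshape(n1, n2, order='F')
y = A @ vec(X_star)

def factors(X):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r], np.diag(s[:r]), Vt[:r, :].T

def proj_tangent(X, Z):                # projection onto the tangent space T_X M_r
    U, _, V = factors(X)
    PUp, PVp = np.eye(n1) - U @ U.T, np.eye(n2) - V @ V.T
    return Z - PUp @ Z @ PVp

def retr_orth(X, N):                   # orthographic retraction (25) at X
    U, S, V = factors(X)
    return (X + N) @ V @ np.linalg.inv(S + U.T @ N @ V) @ U.T @ (X + N)

def grad_f(X):                         # Euclidean gradient A^*(A(X) - y)
    return mat(A.T @ (A @ vec(X) - y))

# Accelerated Riemannian iteration with orthographic retraction (sketch)
X_prev = X_cur = retr_orth(X_star, 0.02 * rng.standard_normal((n1, n2)))  # warm start near X_star
for t in range(50):
    Y = retr_orth(X_cur, -eta * proj_tangent(X_cur, X_prev - X_cur))      # extrapolation on M_r
    X_next = retr_orth(Y, -mu * proj_tangent(Y, grad_f(Y)))               # Riemannian gradient step
    X_prev, X_cur = X_cur, X_next
print("final error:", np.linalg.norm(X_cur - X_star, 'fro'))
```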

1.4 E.4 Proof of Restart Condition Equivalence in (28)

Proof

When condition (8) holds, we have

$$\begin{aligned} \begin{aligned} \langle \nabla f({\varvec{Y}}_{t-1}), {\varvec{X}}_t-\varvec{X}_{t-1}\rangle&=\langle \text {grad}~f({\varvec{Y}}_{t-1})+\nabla f({\varvec{Y}}_{t-1})-\text {grad}~f({\varvec{Y}}_{t-1}), {\varvec{X}}_t-{\varvec{X}}_{t-1}\rangle \\&{\mathop {=}\limits ^{(a)}}\langle \text {grad}~f({\varvec{Y}}_{t-1}), \textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}({\varvec{X}}_{t})-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}({\varvec{X}}_{t-1})\rangle \\&\quad +\langle \nabla f({\varvec{Y}}_{t-1})-\text {grad}~f({\varvec{Y}}_{t-1}), {\varvec{X}}_t-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}({\varvec{X}}_{t})\rangle \\&\quad -\langle \nabla f({\varvec{Y}}_{t-1})-\text {grad}~f({\varvec{Y}}_{t-1}), {\varvec{X}}_{t-1}-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}({\varvec{X}}_{t-1})\rangle \\&{\mathop {\approx }\limits ^{(b)}}\langle \text {grad}~f({\varvec{Y}}_{t-1}), \textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}({\varvec{X}}_{t})-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}({\varvec{X}}_{t-1})\rangle ,\\ \end{aligned} \end{aligned}$$

where (a) uses the fact that the Riemannian gradient lies in the tangent space while \(\nabla f({\varvec{Y}}_{t-1})-\text {grad}~f({\varvec{Y}}_{t-1})\) is orthogonal to it. Based on the first-order expansion, we omit the higher-order terms in (a) to obtain the approximate relationship (b), which does not change the sign of the inner product. As mentioned in step 2 in Appendix 1, \(\nabla f(\varvec{Y}_{t-1})=\varvec{\varTheta } \text {vec}(\varvec{Y}_{t-1}-{\varvec{X}}_\star )\) and \(\text {grad}~f(\varvec{Y}_{t-1})\) are both first order w.r.t. the residual. According to Lemma 3, \(\varvec{X}_t-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{\varvec{Y}_{t-1}}({\varvec{X}}_{t})\) and \({\varvec{X}}_{t-1}-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}(\varvec{X}_{t-1})\) are second order. So the last two terms of (a) are third order, while the remaining inner product is second order. \(\square \)
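
The leading-order agreement of the two restart tests can be checked numerically as well. The sketch below uses the simple quadratic loss \(f({\varvec{X}})=\frac{1}{2}\Vert {\varvec{X}}-{\varvec{X}}_\star \Vert _F^2\) (i.e., \(\varvec{\varTheta }={\varvec{I}}\), an illustrative assumption) and compares the Euclidean and Riemannian inner products for rank-\(r\) points at decreasing distance from \({\varvec{X}}_\star \).

```python
import numpy as np

rng = np.random.default_rng(6)
n1, n2, r = 10, 8, 2

U0, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n2, r)))
X_star = U0 @ np.diag([3.0, 1.0]) @ V0.T

def rank_r(X):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :], U[:, :r], Vt[:r, :].T

def proj_T(Y, Z):                       # tangent-space projection at Y
    _, U, V = rank_r(Y)
    return Z - (np.eye(n1) - U @ U.T) @ Z @ (np.eye(n2) - V @ V.T)

for eps in [1e-1, 1e-2, 1e-3]:
    X_prev, _, _ = rank_r(X_star + eps * rng.standard_normal((n1, n2)))
    X_cur,  _, _ = rank_r(X_star + eps * rng.standard_normal((n1, n2)))
    Y,      _, _ = rank_r(X_star + eps * rng.standard_normal((n1, n2)))
    g = Y - X_star                      # Euclidean gradient of the quadratic loss at Y
    lhs = np.sum(g * (X_cur - X_prev))  # Euclidean restart test
    rhs = np.sum(proj_T(Y, g) * (proj_T(Y, X_cur - Y) - proj_T(Y, X_prev - Y)))  # Riemannian test
    print(f"eps={eps:.0e}: lhs={lhs:+.3e}, rhs={rhs:+.3e}, |lhs-rhs|={abs(lhs - rhs):.1e}")
```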

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, H., Peng, Z., Pan, C. et al. Fast Gradient Method for Low-Rank Matrix Estimation. J Sci Comput 96, 41 (2023). https://doi.org/10.1007/s10915-023-02266-7
