Abstract
Projected gradient descent and its Riemannian variant form a typical class of methods for low-rank matrix estimation. This paper proposes a new Nesterov accelerated Riemannian gradient algorithm that uses an efficient orthographic retraction and tangent space projection. The subspace relationship between the iterative and extrapolated sequences on the low-rank matrix manifold provides computational convenience. Using perturbation analysis of the truncated singular value decomposition and of two retractions, we systematically analyze the local convergence of gradient algorithms and Nesterov's variants in both the Euclidean and Riemannian settings. Theoretically, we estimate the exact rate of local linear convergence under different parameters via the spectral radius in closed form, and we give the optimal convergence rate together with the corresponding momentum parameter. When the parameter is unknown, an adaptive restart scheme avoids the oscillation caused by excessive momentum and thus approaches the optimal convergence rate. Extensive numerical experiments confirm the convergence-rate estimates and demonstrate that the proposed algorithm is competitive with first-order methods for matrix completion and matrix sensing.
Data Availability
Enquiries about data availability should be directed to the authors.
Code Availability
The codes used to perform the experiments in this paper are available from https://github.com/pxxyyz/FastGradient.
References
Absil, P.A., Malick, J.: Projection-like retractions on matrix manifolds. SIAM J. Optim. 22(1), 135–158 (2012)
Absil, P.A., Oseledets, I.V.: Low-rank retractions: a survey and new results. Comput. Optim. Appl. 62(1), 5–29 (2015)
Ahn, K., Sra, S.: From Nesterov’s estimate sequence to Riemannian acceleration. In: Conference on Learning Theory, pp. 84–118. PMLR (2020)
Boumal, N.: An Introduction to Optimization on Smooth Manifolds. Cambridge University Press (2023)
Cai, J.F., Wei, K.: Exploiting the structure effectively and efficiently in low-rank matrix recovery. In: Handbook of Numerical Analysis, vol. 19, pp. 21–51. Elsevier (2018)
Chen, Y., Chi, Y.: Harnessing structures in big data via guaranteed low-rank matrix estimation: recent theory and fast algorithms via convex and nonconvex optimization. IEEE Sign. Process. Mag. 35(4), 14–31 (2018)
Chen, Y., Chi, Y., Fan, J., Ma, C.: Gradient descent with random initialization: fast global convergence for nonconvex phase retrieval. Math. Program. 176, 5–37 (2019)
Chen, Y., Chi, Y., Fan, J., Ma, C., et al.: Spectral methods for data science: a statistical perspective. Found. Trends Mach. Learn. 14(5), 566–806 (2021)
Chi, Y., Lu, Y.M., Chen, Y.: Nonconvex optimization meets low-rank matrix factorization: an overview. IEEE Trans. Sign. Process. 67(20), 5239–5269 (2019)
Chunikhina, E., Raich, R., Nguyen, T.: Performance analysis for matrix completion via iterative hard-thresholded SVD. In: 2014 IEEE Workshop on Statistical Signal Processing (SSP), pp. 392–395. IEEE (2014)
Davenport, M.A., Romberg, J.: An overview of low-rank matrix recovery from incomplete observations. IEEE J. Sel. Top. Sign. Process. 10(4), 608–622 (2016)
Duruisseaux, V., Leok, M.: A variational formulation of accelerated optimization on Riemannian manifolds. SIAM J. Math. Data Sci. 4(2), 649–674 (2022)
Gonzaga, C.C., Schneider, R.M.: On the steepest descent algorithm for quadratic functions. Comput. Optim. Appl. 63, 523–542 (2016)
Huang, J., Zhou, J.: A direct proof and a generalization for a Kantorovich type inequality. Linear Algebra Appl. 397, 185–192 (2005)
Huang, W., Wei, K.: An extension of fast iterative shrinkage-thresholding algorithm to Riemannian optimization for sparse principal component analysis. Numer. Linear Algebra Appl. 29(1), e2409 (2022)
Huang, Y., Dai, Y.H., Liu, X.W., Zhang, H.: On the asymptotic convergence and acceleration of gradient methods. J. Sci. Comput. 90, 1–29 (2022)
Jain, P., Meka, R., Dhillon, I.: Guaranteed rank minimization via singular value projection. Adv. Neural Inf. Process. Syst. 23 (2010)
Kim, D., Fessler, J.A.: Adaptive restart of the optimized gradient method for convex optimization. J. Optim. Theory Appl. 178(1), 240–263 (2018)
Kim, J., Yang, I.: Nesterov acceleration for Riemannian optimization. arXiv preprint arXiv:2202.02036 (2022)
Kyrillidis, A., Cevher, V.: Matrix recipes for hard thresholding methods. J. Math. Imag. Vis. 48, 235–265 (2014)
Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)
Li, H., Fang, C., Lin, Z.: Accelerated first-order optimization algorithms for machine learning. Proc. IEEE 108(11), 2067–2082 (2020)
Li, H., Lin, Z.: Accelerated alternating direction method of multipliers: an optimal \(O(1/K)\) nonergodic analysis. J. Sci. Comput. 79, 671–699 (2019)
Liang, J., Fadili, J., Peyré, G.: Activity identification and local linear convergence of forward-backward-type methods. SIAM J. Optim. 27(1), 408–437 (2017)
Liang, J., Luo, T., Schonlieb, C.B.: Improving "fast iterative shrinkage-thresholding algorithm": faster, smarter, and greedier. SIAM J. Sci. Comput. 44(3), A1069–A1091 (2022)
Luenberger, D.G., Ye, Y.: Linear and Nonlinear Programming, vol. 228. Springer Nature (2021)
Nesterov, Y.E.: A method of solving a convex programming problem with convergence rate \(O\left(\frac{1}{k^{2}}\right)\). In: Doklady Akademii Nauk, vol. 269, pp. 543–547. Russian Academy of Sciences (1983)
O'Donoghue, B., Candès, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15, 715–732 (2015)
Park, J.: Accelerated additive Schwarz methods for convex optimization with adaptive restart. J. Sci. Comput. 89(3), 58 (2021)
Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
Tanner, J., Wei, K.: Normalized iterative hard thresholding for matrix completion. SIAM J. Sci. Comput. 35(5), S104–S125 (2013)
Tong, T., Ma, C., Chi, Y.: Accelerating ill-conditioned low-rank matrix estimation via scaled gradient descent. J. Mach. Learn. Res. 22(1), 6639–6701 (2021)
Vandereycken, B.: Low-rank matrix completion by Riemannian optimization. SIAM J. Optim. 23(2), 1214–1236 (2013)
Vu, T., Raich, R.: Accelerating iterative hard thresholding for low-rank matrix completion via adaptive restart. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2917–2921. IEEE (2019)
Vu, T., Raich, R.: On local convergence of iterative hard thresholding for matrix completion. arXiv preprint arXiv:2112.14733 (2021)
Vu, T., Raich, R.: On asymptotic linear convergence of projected gradient descent for constrained least squares. IEEE Trans. Sign. Process. 70, 4061–4076 (2022)
Wang, D., He, Y., De Sterck, H.: On the asymptotic linear convergence speed of Anderson acceleration applied to ADMM. J. Sci. Comput. 88(2), 38 (2021)
Wang, H., Cai, J.F., Wang, T., Wei, K.: Fast Cadzow’s algorithm and a gradient variant. J. Sci. Comput. 88(2), 41 (2021)
Wang, R., Zhang, C., Wang, L., Shao, Y.: A stochastic Nesterov’s smoothing accelerated method for general nonsmooth constrained stochastic composite convex optimization. J. Sci. Comput. 93(2), 52 (2022)
Wei, K., Cai, J.F., Chan, T.F., Leung, S.: Guarantees of Riemannian optimization for low rank matrix recovery. SIAM J. Matrix Anal. Appl. 37(3), 1198–1222 (2016)
Wei, K., Cai, J.F., Chan, T.F., Leung, S.: Guarantees of Riemannian optimization for low rank matrix completion. Inverse Probl. Imag. 14(2), 233–265 (2020)
Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016)
Zhang, H., Sra, S.: Towards Riemannian accelerated gradient methods. arXiv preprint arXiv:1806.02812 (2018)
Zhang, T., Yang, Y.: Robust PCA by manifold optimization. J. Mach. Learn. Res. 19(1), 3101–3139 (2018)
Acknowledgements
The authors would like to thank the anonymous reviewers for their review and helpful comments.
Funding
This work was supported by the National Natural Science Foundation of China (Grant no. 61771001).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendices
A Auxiliary Lemmas
A.1 Relationship of Matrix Eigenvalues
Lemma 4
Let \(\varvec{\varTheta }\) be a symmetric positive semi-definite matrix, and \({\varvec{P}}\in {{\mathbb {R}}}^{n\times n}\) be an orthogonal projection matrix. Denote \({\varvec{P}}^{\perp }=\varvec{I}-{\varvec{P}}\), then there exists an eigenvalue \(\lambda \ne 0\) of \(({\varvec{I}}-\mu \varvec{\varTheta } ) {\varvec{P}}^\perp \), such that \(1-\lambda \) is an eigenvalue of \(\mu \varvec{\varTheta } {\varvec{P}}^\perp \).
Proof
Assume \(\text {rank}({\varvec{P}})=r\). Since \({\varvec{P}}\) is idempotent, the eigenvalues of \({\varvec{P}}\) and \({\varvec{P}}^{\perp }\) take the following form.
Here, we use \({\varvec{u}}_i\) and \({\varvec{v}}_j\) to denote the eigenvectors of \({\varvec{P}}\) corresponding to the eigenvalues \(1\) and \(0\), respectively. From the orthogonal relation between \({\varvec{P}}\) and \({\varvec{P}}^{\perp }\), we have \(\varvec{P}{\varvec{u}}_i={\varvec{u}}_i,{\varvec{P}}\varvec{v}_j={\varvec{0}},{\varvec{P}}^\perp {\varvec{u}}_i=\varvec{0},{\varvec{P}}^\perp {\varvec{v}}_j={\varvec{v}}_j\). Further, we have
It follows that \({\varvec{u}}_i\) is an eigenvector of \(\mu \varvec{\varTheta } {\varvec{P}}^\perp +{\varvec{P}}\) with eigenvalue \(1\), and an eigenvector of \(\mu \varvec{\varTheta } {\varvec{P}}^\perp \) with eigenvalue \(0\). Besides, if \({\varvec{v}}_j\) happens to be an eigenvector of \(\mu \varvec{\varTheta }\), then \(\varvec{v}_j\) is also an eigenvector of \(\mu \varvec{\varTheta } \varvec{P}^\perp +{\varvec{P}}\) and of \(\mu \varvec{\varTheta } \varvec{P}^\perp \). This observation suggests that the eigendecompositions of the three matrices are related. To this end, assume that there exists a non-zero vector \({\varvec{x}}\) such that
then
For \(\lambda \ne 0\), we will discuss \({\varvec{P}}\varvec{x}=(1-\lambda ){\varvec{x}}-\mu \varvec{\varTheta } \varvec{P}^\perp {\varvec{x}}\) case by case:
Case 1: when \({\varvec{P}}{\varvec{x}}={\varvec{0}}\), i.e., \({\varvec{x}}\) is a linear combination of the \({\varvec{v}}_j\), then \(\mu \varvec{\varTheta } {\varvec{P}}^\perp {\varvec{x}}=(1-\lambda ){\varvec{x}}\) holds, so \(1-\lambda \) is an eigenvalue of the matrix \(\mu \varvec{\varTheta } {\varvec{P}}^\perp \).
Case 2: when \({\varvec{P}}{\varvec{x}}\ne {\varvec{0}}\), the vector \({\varvec{x}}\) can always be represented as a linear combination of the orthonormal bases, as follows
and \({\varvec{P}}{\varvec{x}}\ne {\varvec{0}}\) implies that \(\alpha _i\ne 0\) for some \(i\); otherwise \({\varvec{P}}\varvec{x}=\sum _{j=1}^{n-r}\beta _j {\varvec{P}}{\varvec{v}}_j=\varvec{0}\). Expanding the formula, we get
where the left-hand side is a linear combination of the basis vectors \(\{{\varvec{v}}_j\}\), while the right-hand side is a linear combination of the mutually orthogonal basis vectors \(\{{\varvec{u}}_i\}\) and \(\{{\varvec{v}}_j\}\). Hence, when \(\lambda \ne 0\), \(\alpha _i=0\) must hold for all \(i\), which contradicts \({\varvec{P}}{\varvec{x}}\ne {\varvec{0}}\). \(\square \)
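The correspondence established in Case 1 can be checked numerically. The following sketch (with a random PSD \(\varvec{\varTheta }\), a random rank-\(r\) projector, and an arbitrarily chosen stepsize \(\mu \), all our own illustrative choices) verifies that every non-zero eigenvalue \(\lambda \) of \(({\varvec{I}}-\mu \varvec{\varTheta }){\varvec{P}}^\perp \) is matched by an eigenvalue \(1-\lambda \) of \(\mu \varvec{\varTheta }{\varvec{P}}^\perp \):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, mu = 8, 3, 0.1

# symmetric positive semi-definite Theta
A = rng.standard_normal((n, n))
Theta = A @ A.T

# orthogonal projector P of rank r and its complement
Q, _ = np.linalg.qr(rng.standard_normal((n, r)))
P = Q @ Q.T
P_perp = np.eye(n) - P

eig_M = np.linalg.eigvals((np.eye(n) - mu * Theta) @ P_perp)
eig_N = np.linalg.eigvals(mu * Theta @ P_perp)

# for each non-zero eigenvalue lambda of (I - mu*Theta)P_perp,
# 1 - lambda should appear in the spectrum of mu*Theta*P_perp
max_err = max(
    (np.min(np.abs(eig_N - (1 - lam))) for lam in eig_M if abs(lam) > 1e-8),
    default=0.0,
)
print(f"largest mismatch: {max_err:.2e}")
```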
A.2 Perturbation Analysis of Subspaces
Lemma 5
(Wedin’s \(\sin \varTheta \) Theorem [8]) Let \({\varvec{X}}_t={\varvec{U}}_t \varvec{\varSigma }_t {\varvec{V}}_t^\top \) and \({\varvec{X}}_\star ={\varvec{U}}_\star \varvec{\varSigma }_\star {\varvec{V}}_\star ^\top \) be the SVD of \({\varvec{X}}_t, {\varvec{X}}_\star \in {{\mathbb {M}}}_r\), respectively. If \(\Vert {\varvec{X}}_t-{\varvec{X}}_\star \Vert <\sigma _r({\varvec{X}}_\star )\), the perturbation of the singular subspaces admits the following upper bound
Lemma 6
(Perturbation of subspace projection [40]) Let \({\varvec{X}}_t=\varvec{U}_t \varvec{\varSigma }_t {\varvec{V}}_t^\top \) and \(\varvec{X}_\star ={\varvec{U}}_\star \varvec{\varSigma }_\star \varvec{V}_\star ^\top \) be the SVD of \({\varvec{X}}_t, \varvec{X}_\star \in {{\mathbb {M}}}_r\), respectively. If \(\Vert \varvec{X}_t-{\varvec{X}}_\star \Vert <\sigma _r({\varvec{X}}_\star )\), then the following inequality is satisfied
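Since the displayed bounds are not reproduced above, the following sketch only sanity-checks a deliberately loose spectral-norm consequence of Wedin's theorem: under the hypothesis \(\Vert {\varvec{X}}_t-{\varvec{X}}_\star \Vert <\sigma _r({\varvec{X}}_\star )\), the projector distance between the left singular subspaces stays below \(2\Vert {\varvec{X}}_t-{\varvec{X}}_\star \Vert /(\sigma _r({\varvec{X}}_\star )-\Vert {\varvec{X}}_t-{\varvec{X}}_\star \Vert )\). The factor \(2\) is our own safety margin, not the constant of Lemmas 5–6:

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, r = 20, 15, 3

# rank-r ground truth with prescribed singular values
U0, _ = np.linalg.qr(rng.standard_normal((n1, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n2, r)))
sigma = np.array([5.0, 3.0, 2.0])
X_star = U0 @ np.diag(sigma) @ V0.T

# small perturbation, then rank-r truncated SVD gives X_t
E = 0.02 * rng.standard_normal((n1, n2))
U, s, Vt = np.linalg.svd(X_star + E)
X_t = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

D = np.linalg.norm(X_t - X_star, 2)
assert D < sigma[-1]  # hypothesis of Lemma 5

# spectral-norm distance between column-space projectors = largest sin(theta)
sin_U = np.linalg.norm(U[:, :r] @ U[:, :r].T - U0 @ U0.T, 2)
bound = 2 * D / (sigma[-1] - D)
print(f"sin(theta) = {sin_U:.4f}, loose bound = {bound:.4f}")
```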
B Proof of Theorem 1
Proof
Let the residual matrix \({\varvec{E}}_t=\varvec{X}_t-{\varvec{X}}_\star \). According to the iteration, we have
where \((a)\) is the first-order expansion (7) of the truncated SVD. Since \(\Vert {\mathcal {I}}-\mu _t{\mathcal {A}}^*{\mathcal {A}}\Vert \le 1\), we have \(\Vert {\varvec{E}}_{t}-\mu \nabla f({\varvec{X}}_t)\Vert _F=\Vert ({\mathcal {I}}-\mu _t{\mathcal {A}}^*{\mathcal {A}})({\varvec{E}}_t)\Vert _F\le \Vert {\varvec{E}}_t\Vert _F\le \sigma _{{r}}({\varvec{X}}_\star )/2\), which verifies that the condition of Lemma 1 holds. After vectorizing with \({\varvec{e}}_{t}=\text {vec}({\varvec{E}}_t)\), we get
where \((a)\) uses vectorization of Kronecker product, i.e., \(\text {vec}({\varvec{A}}{\varvec{B}}{\varvec{C}})=(\varvec{C}^\top \otimes {\varvec{A}})\text {vec}({\varvec{B}})\). \((b)\) is based on (3) and \(\Vert \varvec{E}_t\Vert _F=\Vert {\varvec{e}}_{t}\Vert _2\). The convergence rate with the constant stepsize \(\mu _t\equiv \mu \) is determined by the spectral radius of the matrix \({\varvec{H}}={\varvec{H}}(\mu )\).
Thus, the maximum and minimum eigenvalues of \({\varvec{H}}\) should be compared. Taking MS as an example, we compute the largest eigenvalue.
where \((a)\) is based on Lemma 4. Similarly, the minimum eigenvalue results are as follows:
Obviously, the optimal spectral radius occurs when \(\lambda _{\max }(\varvec{H}_{{\textsf{M}}{\textsf{S}}})=-\lambda _{\min }({\varvec{H}}_{{\textsf{M}}{\textsf{S}}})\), i.e., \(1-\mu \lambda _{\min }=\mu \lambda _{\max }-1\). The corresponding stepsize is \(\mu _\dagger =\frac{2}{\lambda _{\min }+\lambda _{\max }}\). Since \({\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp \) is an orthogonal projector, we have \(\Vert {\varvec{I}}-P_{{\varvec{V}}_\star }^\perp \otimes P_{{\varvec{U}}_\star }^\perp \Vert =1\). It is easy to check
so we can get (9). In particular, for MC, as shown in (6), we obtain a simplified result similar to [34].
where \((a)\), \((c)\) and \((f)\) are based on the fact that \({\varvec{A}}{\varvec{B}}\) and \({\varvec{B}}{\varvec{A}}\) have the same non-zero eigenvalues, \((b)\) and \((e)\) correspond to the properties of the sampling matrix in (6), and \((d)\) uses Lemma 4. Similarly, the minimum eigenvalue results are as follows
We can estimate the convergence rate of Algorithm 1. \(\square \)
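The closed-form rate has a familiar Euclidean analogue: for a quadratic whose Hessian spectrum lies in \([\lambda _{\min },\lambda _{\max }]\), the stepsize \(\mu _\dagger =2/(\lambda _{\min }+\lambda _{\max })\) yields the contraction factor \((\lambda _{\max }-\lambda _{\min })/(\lambda _{\max }+\lambda _{\min })\). A minimal sketch on a toy quadratic of our own choosing (not the MS/MC operator itself) confirms this:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10
lam = np.linspace(1.0, 10.0, n)            # prescribed spectrum of Theta
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
Theta = Q @ np.diag(lam) @ Q.T

mu = 2.0 / (lam[0] + lam[-1])              # mu_dagger
rho_theory = (lam[-1] - lam[0]) / (lam[-1] + lam[0])

# iterate e_{t+1} = (I - mu*Theta) e_t and measure the empirical decay rate
H = np.eye(n) - mu * Theta
e = rng.standard_normal(n)
norms = [np.linalg.norm(e)]
for _ in range(200):
    e = H @ e
    norms.append(np.linalg.norm(e))
rho_emp = (norms[-1] / norms[-51]) ** (1 / 50)
print(f"theory {rho_theory:.4f}, empirical {rho_emp:.4f}")
```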
C Proof of Proposition 1
Proof
Vectorizing (12) yields \(\nabla _{\mathcal {R}} f({\varvec{x}}_t)={\varvec{P}}\nabla f({\varvec{x}}_t)\), where \({\varvec{P}}=({\varvec{I}}-P_{{\varvec{U}}}^\perp \otimes P_{{\varvec{V}}}^\perp )\) is an orthogonal projection matrix. Substituting (13) into the loss function gives
where (a) follows from \(f({\varvec{x}}_t)=\frac{1}{2}({\varvec{x}}_t-{\varvec{x}}_\star )^\top \varvec{\varTheta } ({\varvec{x}}_t-{\varvec{x}}_\star )=\frac{1}{2}\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top ({\varvec{P}}\varvec{\varTheta } {\varvec{P}})^{+} \nabla _{\mathcal {R}} f({\varvec{x}}_t)\). Furthermore, since \(\frac{|\nabla _{\mathcal {R}} f({\varvec{x}}_t)^\top \nabla f({\varvec{x}}_t)|}{\Vert \nabla _{\mathcal {R}} f({\varvec{x}}_t)\Vert _2 \Vert \nabla f({\varvec{x}}_t)\Vert _2}\ge 0\), we apply the generalized Kantorovich type inequality [14] in Lemma 7 to get (b). To prove (c), we only need to show
Here, \(\lambda _{\min }(\cdot )\) denotes the smallest non-zero eigenvalue. \(\square \)
Lemma 7
(Kantorovich inequality [14]) Let \({\varvec{A}}\) be a symmetric (semi-) positive definite matrix, and let \(\lambda _{\max }\) and \(\lambda _{\min }\) correspond to the largest and smallest non-zero eigenvalues, respectively. If \(\varvec{x},{\varvec{y}}\in {{\mathbb {R}}}^n\) satisfy \(\frac{|{\varvec{x}}^\top {\varvec{y}}|}{\Vert {\varvec{x}}\Vert _2 \Vert \varvec{y}\Vert _2}\ge \cos \theta \) with \(0\le \theta \le \frac{\pi }{2}\), then
where \(\kappa =\frac{\lambda _{\max }}{\lambda _{\min }}\frac{1+\sin \theta }{1-\sin \theta }\) and \((\cdot )^{+}\) is the Moore-Penrose inverse. When \({\varvec{A}}\) is positive definite and \({\varvec{x}}={\varvec{y}}\), i.e., \({\varvec{A}}^{+}={\varvec{A}}^{-1}\) and \(\theta =0\), the above inequality reduces to the classical form.
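The classical case \({\varvec{x}}={\varvec{y}}\), \(\theta =0\) reads \(({\varvec{x}}^\top {\varvec{A}}{\varvec{x}})({\varvec{x}}^\top {\varvec{A}}^{-1}{\varvec{x}})\le \frac{(\lambda _{\min }+\lambda _{\max })^2}{4\lambda _{\min }\lambda _{\max }}\Vert {\varvec{x}}\Vert _2^4\). A quick randomized check of this classical form (not of the generalized version in [14]; the matrix and directions are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 12
lam = rng.uniform(0.5, 8.0, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(lam) @ Q.T                 # symmetric positive definite
A_inv = np.linalg.inv(A)

lmin, lmax = lam.min(), lam.max()
bound = (lmin + lmax) ** 2 / (4 * lmin * lmax)

# Kantorovich ratio for many random directions
worst = 0.0
for _ in range(2000):
    x = rng.standard_normal(n)
    ratio = (x @ A @ x) * (x @ A_inv @ x) / (x @ x) ** 2
    worst = max(worst, ratio)
print(f"worst ratio {worst:.4f} <= bound {bound:.4f}")
```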
D Proof of Theorem 2
Proof
According to Algorithm 3, we calculate the error as follows.
After vectorizing, we have
Stacking the errors of two adjacent iterations, we get the recursive form
The convergence rate depends on the spectral radius \(\rho ({\varvec{T}})\) of \({\varvec{T}}\in {\mathbb {R}}^{2n_1n_2\times 2n_1n_2}\). According to the eigendecomposition in [34], \({\varvec{T}}\) is similar to a block diagonal matrix composed of \(2\times 2\) blocks, i.e., \({\varvec{T}}\sim \text {bldiag}({\varvec{T}}_1,{\varvec{T}}_2,\ldots ,{\varvec{T}}_{n_1n_2})\), where each block \({\varvec{T}}_j\in {\mathbb {R}}^{2\times 2}\) has the form
where \(\lambda _j\) is the eigenvalue of matrix \((\varvec{I}-P_{{\varvec{V}}_\star }^\perp \otimes P_{\varvec{U}_\star }^\perp )\varvec{\varTheta }\). Next, we aim to find the eigenvalues of the matrix \({\varvec{T}}_j\) using the characteristic polynomial.
According to the quadratic formula, set the discriminant \(\varDelta (\lambda _j,\mu _t,\eta _t)=(1+\eta _t)^2(1-\mu _t\lambda _j)^2-4\eta _t(1-\mu _t\lambda _j)\), then the solution to (31) is:
where the superscript \((\cdot )^{\pm }\) denotes addition or subtraction in the numerator. For a given \({\varvec{T}}\) with fixed \((\mu _t,\eta _t)\), \(\rho ({\varvec{T}})=\max _{\lambda _j} |r^{\pm }(\lambda _j,\mu _t,\eta _t)|\) is continuous and quasi-convex w.r.t. the eigenvalue \(\lambda _j\) [18, 21, 37]. Thus, the extremal value is attained on the boundary, i.e.
As a whole, \(\rho ({\varvec{T}})\) is determined by the maximum modulus of the roots of (33). We denote that surfaces \(\varPi _1\) and \(\varPi _2\) correspond to \(\lambda _{\min }\) and \(\lambda _{\max }\), respectively.
Below we show how to determine the minimum spectral radius and the corresponding parameters. Returning to (32), \(|r^{\pm }(\lambda _j,\mu _t,\eta _t)|\ge |(1+\eta _t)(1-\mu _t\lambda _j)|/2\), with equality if and only if \(\varDelta (\lambda _j,\mu _t,\eta _t)=0\). In this case, we obtain a relationship between the parameters \((\mu _t,\eta _t)\)
Obviously, \(0<\eta _t^-<1<\eta _t^+\). Given \(\mu _t\), there are three cases for \(\eta _t\).
-
(32) with \(\eta _t\in (0,\eta _{t}^-)\cup (\eta _{t}^+,\infty )\) has two different solutions.
-
(32) with \(\eta _t=\eta _{t}^\pm \) has a single solution.
-
(32) with \(\eta _t\in (\eta _{t}^-,\eta _{t}^+)\) has conjugate complex solutions.
If \(\eta _t\in [\eta _{t}^+,\infty )\), then \(r^{\pm }(\lambda _j,\mu _t,\eta _t^+)\ge |(1+\eta _t^+)(1-\mu _t\lambda _j)|/2=1+\sqrt{\mu _t\lambda _j}>1\), and \(\rho ({\varvec{T}})>1\) is obtained from (33). Conversely, when \(\eta _t=\eta _{t}^-\), \(r^{\pm }(\lambda _j,\mu _t,\eta _t^-)=|(1+\eta _t^-)(1-\mu _t\lambda _j)|/2=1-\sqrt{\mu _t\lambda _j}<1\). This is why the parameter is selected as \(0<\eta \le 1\) in practice. When \(\eta _t\in (\eta _{t}^-,\eta _{t}^+)\), \(\rho ({\varvec{T}})=\max _{\lambda _j} \sqrt{\eta _t(1-\mu _t\lambda _j)}\) increases monotonically w.r.t. \(\eta _t\) and decreases monotonically w.r.t. \(\mu _t\). These observations describe the geometry of \(\rho ({\varvec{T}})\) w.r.t. \((\mu _t,\eta _t)\), and the condition \(\varDelta (\lambda _j,\mu _t,\eta _t^-)=0\) helps to find the theoretical lower bound of \(\rho ({\varvec{T}})\). The optimal parameter pair \((\mu _\flat ,\eta _\flat )\) is the intersection of \(r^{\pm }(\lambda _{\min },\mu _\flat ,\eta _\flat )\) on the curve \(\eta _t^-=\frac{1-\sqrt{\mu _t\lambda _j}}{1+\sqrt{\mu _t\lambda _j}}\) with the surface \(\varPi _2\), i.e., \(|r^{-}(\lambda _{\max },\mu _t,\eta _t)|\). So it satisfies the following equation
Substituting \(\eta _\flat =\frac{1-\sqrt{\mu _\flat \lambda _{\min }}}{1+\sqrt{\mu _\flat \lambda _{\min }}}\), we obtain the optimal convergence result \(\mu _{\flat }=\frac{4}{\lambda _{\min }+3\lambda _{\max }}\) and \(\rho _{\textsf{opt}}(\varvec{T})=1-\sqrt{\frac{4\lambda _{\min }}{\lambda _{\min }+3\lambda _{\max }}}\) in (18). Also, for \(\eta _t<1\), the intersection of \(\varPi _1\) and \(\varPi _2\) can be calculated according to monotonicity
If \(\eta _t=0\), it simplifies to \(\mu _t=2/(\lambda _{\min }+\lambda _{\max })=\mu _\dagger \) in (11). Due to momentum, the optimal stepsizes satisfy \(\mu _\flat <\mu _\dagger \). In fact, substituting \(\eta _t=0\) gives \({\varvec{e}}_t={\varvec{H}}{\varvec{e}}_{t-1}+{\mathcal {O}}(\Vert {\varvec{e}}_{t-1}\Vert _2^2)\), which is consistent with the non-accelerated iteration. Conversely, if \(\mu _t\ge \mu _\dagger \), then \(\eta _t=0\) is a good parameter choice, which means that NAG degenerates to Grad. When \(\eta _t\ne 0\), we have
Despite the complicated form, we can use symbolic computation tools to solve the case \(\mu _t\in (\mu _\flat ,\mu _\dagger )\)
In summary, we have analyzed the spectral radius of \({\varvec{T}}\) in (33) w.r.t. the pair \((\mu _t,\eta _{t})\) case by case. \(\square \)
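The optimum in (18) can be verified numerically. The sketch below builds \(2\times 2\) companion blocks whose characteristic polynomial is \(r^2-(1+\eta _t)(1-\mu _t\lambda _j)r+\eta _t(1-\mu _t\lambda _j)\), matching the discriminant \(\varDelta \) above (the explicit block entries are our own reconstruction), sweeps \(\lambda _j\in [\lambda _{\min },\lambda _{\max }]\), and compares \(\rho ({\varvec{T}})\) with \(\rho _{\textsf{opt}}\) and with the non-accelerated rate:

```python
import numpy as np

def block_radius(lam, mu, eta):
    """Spectral radius of a 2x2 block whose characteristic polynomial is
    r^2 - (1+eta)(1-mu*lam) r + eta(1-mu*lam)."""
    Tj = np.array([[(1 + eta) * (1 - mu * lam), -eta * (1 - mu * lam)],
                   [1.0, 0.0]])
    return np.max(np.abs(np.linalg.eigvals(Tj)))

lam_min, lam_max = 1.0, 4.0
mu_opt = 4.0 / (lam_min + 3.0 * lam_max)
eta_opt = (1 - np.sqrt(mu_opt * lam_min)) / (1 + np.sqrt(mu_opt * lam_min))
rho_opt = 1 - np.sqrt(4.0 * lam_min / (lam_min + 3.0 * lam_max))

# by quasi-convexity, the max over the spectrum sits at the boundary
rho_T = max(block_radius(l, mu_opt, eta_opt)
            for l in np.linspace(lam_min, lam_max, 101))
rho_grad = (lam_max - lam_min) / (lam_max + lam_min)   # non-accelerated rate
print(f"rho(T) {rho_T:.6f}, predicted {rho_opt:.6f}, Grad {rho_grad:.6f}")
```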
E Proofs in Sect. 4
E.1 Proof of Lemma 3
Proof
The first one obviously holds according to Lemma 1. From (25), we have
where (a) follows from the perturbation analysis of the matrix inverse. As long as \(\Vert {\varvec{A}}^{-1}{\varvec{B}}\Vert <1\) or \(\Vert \varvec{B}{\varvec{A}}^{-1}\Vert <1\) holds, the inverse of the matrix sum admits the following Taylor (Neumann series) expansion
Using the norm inequality \(\Vert {\varvec{A}}\varvec{B}\Vert \le \Vert {\varvec{A}}\Vert \Vert {\varvec{B}}\Vert \), combined with the condition \(\Vert {\varvec{N}}\Vert \le \Vert {\varvec{N}}\Vert _F< \sigma _{{r}}({\varvec{X}})/2\), the invertibility condition above is seen to hold.
(b) merges the product of multiple \({\varvec{N}}\) into higher-order terms. (c) uses the SVD of \({\varvec{X}}\) to get
\(\square \)
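The first-order truncation of the expansion used in the proof, \(({\varvec{A}}+{\varvec{N}})^{-1}\approx {\varvec{A}}^{-1}-{\varvec{A}}^{-1}{\varvec{N}}{\varvec{A}}^{-1}\), can be illustrated numerically with toy matrices of our own choosing (the error should be second order in \(\Vert {\varvec{N}}\Vert \)):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
A = np.diag(rng.uniform(1.0, 2.0, n)) + 0.1 * rng.standard_normal((n, n))
N = 1e-3 * rng.standard_normal((n, n))     # small perturbation

A_inv = np.linalg.inv(A)
assert np.linalg.norm(A_inv @ N, 2) < 1    # Neumann series condition

# first-order truncation; the remainder is O(||N||^2)
approx = A_inv - A_inv @ N @ A_inv
err = np.linalg.norm(np.linalg.inv(A + N) - approx, 2)
print(f"error {err:.2e} vs ||N||^2 {np.linalg.norm(N, 2)**2:.2e}")
```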
E.2 Convergence for Algorithm 4
Proof
According to Algorithm 4, we have
where \((a)\) uses Lemma 3, \((b)\) is based on the tangent space projection in (22), and \((c)\) uses the subspace perturbation in Lemma 5 to replace the subspace \({\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_t}{{\mathbb {M}}}_r}\) with \({\mathcal {P}}_{{{\mathbb {T}}}_{{\varvec{X}}_\star }{{\mathbb {M}}}_r}\).
The remainder of the proof is identical to the proof of Theorem 1 in Appendix B. \(\square \)
E.3 Convergence for Algorithm 5
Proof
The proof is divided into three steps, analyzing \({\varvec{X}}_{t-1}\), \({\varvec{Y}}_t\), and \({\varvec{X}}_{t+1}\), respectively.
Step 1 Calculate the orthographic retraction of \(\varvec{X}_{t-1}\) and the inverse matrix.
It gives an approximation of \({\varvec{X}}_{t-1}\) on the tangent space \({{\mathbb {T}}}_{{\varvec{X}}_t}{{\mathbb {M}}}_r\).
Step 2 Similar to Appendix B, we calculate the residual of \({\varvec{Y}}_{t}\)
It also satisfies the linear extrapolation in Euclidean space.
Step 3 Compute \({\varvec{X}}_{t+1}-{\varvec{X}}_\star \) to get the recursive form
The remainder of the proof is identical to the proof of Theorem 2 in Appendix D. \(\square \)
E.4 Proof of Restart Condition Equivalence in (28)
Proof
When condition (8) holds, we have
where (a) uses the orthogonality of the Riemannian gradient and the tangent space. Based on the first-order expansion, we omit the higher-order terms in (a) to obtain the approximate relationship (b), which does not change the sign. As mentioned in Step 2 of Appendix E.3, \(\nabla f(\varvec{Y}_{t-1})=\varvec{\varTheta } \text {vec}(\varvec{Y}_{t-1}-{\varvec{X}}_\star )\) and \(\text {grad}~f(\varvec{Y}_{t-1})\) are both first order w.r.t. the residual. According to Lemma 3, \(\varvec{X}_t-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{\varvec{Y}_{t-1}}({\varvec{X}}_{t})\) and \({\varvec{X}}_{t-1}-\textsf{inv} {\mathcal {R}}^{\textsf{orth}}_{{\varvec{Y}}_{t-1}}(\varvec{X}_{t-1})\) are second order. So the last two terms of (a) are third order, while the remaining inner product is second order. \(\square \)
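To illustrate why restarting helps when the momentum is set too high, the following Euclidean sketch applies the gradient-based restart heuristic of [36] to a toy quadratic. This is an illustrative stand-in, not Algorithm 5 itself; the quadratic, stepsize, and momentum are hypothetical choices of ours:

```python
import numpy as np

def nag(Theta, x0, mu, eta, iters, restart=False):
    """NAG with fixed momentum eta on f(x) = 0.5 x^T Theta x.
    With restart=True, the momentum history is cleared whenever the
    gradient at the extrapolated point correlates positively with the
    step just taken (gradient-based restart of O'Donoghue and Candes)."""
    x_prev, x = x0.copy(), x0.copy()
    fvals = []
    for _ in range(iters):
        y = x + eta * (x - x_prev)
        g = Theta @ y
        x_prev, x = x, y - mu * g
        if restart and g @ (x - x_prev) > 0:
            x_prev = x                     # forget the momentum direction
        fvals.append(0.5 * x @ Theta @ x)
    return np.mean(fvals[-50:])            # average out the oscillation

Theta = np.diag(np.linspace(1.0, 100.0, 20))   # condition number 100
x0 = np.ones(20)
mu, eta = 0.01, 0.99                           # momentum deliberately too high

f_plain = nag(Theta, x0, mu, eta, 300)
f_restart = nag(Theta, x0, mu, eta, 300, restart=True)
print(f"plain {f_plain:.2e}, restarted {f_restart:.2e}")
```

With an over-large momentum, the plain iteration oscillates at a slow linear rate, while the restarted run suppresses the oscillation and converges much faster, consistent with the discussion above.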
About this article
Cite this article
Li, H., Peng, Z., Pan, C. et al. Fast Gradient Method for Low-Rank Matrix Estimation. J Sci Comput 96, 41 (2023). https://doi.org/10.1007/s10915-023-02266-7