
Towards explicit superlinear convergence rate for SR1

  • Full Length Paper
  • Series A
  • Published in: Mathematical Programming

Abstract

We study the convergence rate of the famous Symmetric Rank-1 (SR1) algorithm, which has wide applications in different scenarios. Although it has been extensively investigated, SR1 still lacks a non-asymptotic superlinear rate, in contrast to other quasi-Newton methods such as DFP and BFGS. In this paper, we address this issue and obtain the first explicit non-asymptotic rates of superlinear convergence for the vanilla SR1 method equipped with a correction strategy that is used to achieve numerical stability. Specifically, the vanilla SR1 with the correction strategy achieves a rate of the form \(\left( \frac{2n\ln (4\varkappa )}{k}\right) ^{k/2}\) for general smooth strongly convex functions, where k is the iteration counter, \(\varkappa \) is the condition number of the objective function, and n is the dimensionality of the problem. Furthermore, the vanilla SR1 algorithm enjoys a slightly faster convergence rate and finds the optimum of a quadratic objective function in at most n steps.
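As background (and not part of the paper's contribution), the classical SR1 secant update that the paper analyzes can be sketched in a few lines of Python. The function name `sr1_secant_update`, the tolerance, and the standard skipping safeguard below are illustrative choices; they do not reproduce the correction strategy studied in the paper.

```python
import numpy as np

def sr1_secant_update(B, s, y, tol=1e-8):
    """Classical SR1 update of a Hessian approximation B from a step
    s = x_{k+1} - x_k and gradient difference y = grad f(x_{k+1}) - grad f(x_k)."""
    r = y - B @ s
    denom = r @ s
    # Standard safeguard: skip the update when the denominator is tiny,
    # since the raw SR1 formula is numerically unstable in that case.
    if abs(denom) <= tol * np.linalg.norm(r) * np.linalg.norm(s):
        return B
    return B + np.outer(r, r) / denom  # symmetric rank-1 correction
```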


Notes

  1. Indeed, the proof of [22,  Corollary 4.4] gives \(K_0^{\mathrm {greedy\_SR1}} = 2\varkappa \ln (2n+1)+n\ln (4n\varkappa )\).

References

  1. Berahas, A.S., Jahani, M., Richtárik, P., Takáč, M.: Quasi-Newton methods for deep learning: Forget the past, just sample. arXiv preprint arXiv:1901.09997 (2019)

  2. Boyd, S., Boyd, S.P., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  3. Broyden, C.G.: The convergence of a class of double-rank minimization algorithms: 1. General considerations. IMA J. Appl. Math. 6(1), 76–90 (1970)

  4. Broyden, C.G.: The convergence of a class of double-rank minimization algorithms: 2. The new algorithm. IMA J. Appl. Math. 6(3), 222–231 (1970)

  5. Broyden, C.G., Dennis, J.E., Jr., Moré, J.J.: On the local and superlinear convergence of quasi-Newton methods. IMA J. Appl. Math. 12(3), 223–245 (1973)

  6. Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)

  7. Byrd, R.H., Khalfan, H.F., Schnabel, R.B.: Analysis of a symmetric rank-one trust region method. SIAM J. Optim. 6(4), 1025–1039 (1996)

  8. Byrd, R.H., Liu, D.C., Nocedal, J.: On the behavior of Broyden’s class of quasi-Newton methods. SIAM J. Optim. 2(4), 533–557 (1992)

  9. Byrd, R.H., Nocedal, J.: A tool for the analysis of quasi-Newton methods with application to unconstrained minimization. SIAM J. Numer. Anal. 26(3), 727–739 (1989)

  10. Byrd, R.H., Nocedal, J., Yuan, Y.X.: Global convergence of a class of quasi-Newton methods on convex problems. SIAM J. Numer. Anal. 24(5), 1171–1190 (1987)

  11. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)

  12. Conn, A.R., Gould, N.I., Toint, P.L.: Convergence of quasi-Newton matrices generated by the symmetric rank one update. Math. Program. 50(1), 177–195 (1991)

  13. Dixon, L.: Quasi-Newton algorithms generate identical points. Math. Program. 2(1), 383–387 (1972)

  14. Dixon, L.: Quasi Newton techniques generate identical points II: The proofs of four new theorems. Math. Program. 3(1), 345–358 (1972)

  15. Fletcher, R.: A new approach to variable metric algorithms. Comput. J. 13(3), 317–322 (1970)

  16. Goldfarb, D.: A family of variable-metric methods derived by variational means. Math. Comput. 24(109), 23–26 (1970)

  17. Gower, R., Goldfarb, D., Richtárik, P.: Stochastic block BFGS: Squeezing more curvature out of data. In: International Conference on Machine Learning, pp. 1869–1878. PMLR (2016)

  18. Gower, R.M., Richtárik, P.: Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM J. Matrix Anal. Appl. 38(4), 1380–1409 (2017)

  19. Jin, Q., Mokhtari, A.: Non-asymptotic superlinear convergence of standard quasi-Newton methods. arXiv preprint arXiv:2003.13607 (2020)

  20. Kao, C., Chen, S.P.: A stochastic quasi-Newton method for simulation response optimization. Eur. J. Oper. Res. 173(1), 30–46 (2006)

  21. Kovalev, D., Gower, R.M., Richtárik, P., Rogozin, A.: Fast linear convergence of randomized BFGS. arXiv preprint arXiv:2002.11337 (2020)

  22. Lin, D., Ye, H., Zhang, Z.: Greedy and random quasi-Newton methods with faster explicit superlinear convergence. Adv. Neural Inf. Process. Syst. 34, 6646–6657 (2021)

  23. Moritz, P., Nishihara, R., Jordan, M.: A linearly-convergent stochastic L-BFGS algorithm. In: Artificial Intelligence and Statistics, pp. 249–258. PMLR (2016)

  24. Nocedal, J., Wright, S.: Numerical Optimization. Springer Science & Business Media, Berlin (2006)

  25. Powell, M.: On the convergence of the variable metric algorithm. IMA J. Appl. Math. 7(1), 21–36 (1971)

  26. Qu, S., Goh, M., Chan, F.T.: Quasi-Newton methods for solving multiobjective optimization. Oper. Res. Lett. 39(5), 397–399 (2011)

  27. Rodomanov, A., Nesterov, Y.: Greedy quasi-Newton methods with explicit superlinear convergence. SIAM J. Optim. 31(1), 785–811 (2021)

  28. Rodomanov, A., Nesterov, Y.: New results on superlinear convergence of classical Quasi-Newton methods. J. Optim. Theory Appl. 188(3), 744–769 (2021)

  29. Rodomanov, A., Nesterov, Y.: Rates of superlinear convergence for classical quasi-Newton methods. Math. Program. 194(1), 159–190 (2022)

  30. Shanno, D.F.: Conditioning of quasi-Newton methods for function minimization. Math. Comput. 24(111), 647–656 (1970)

  31. Wei, Z., Yu, G., Yuan, G., Lian, Z.: The superlinear convergence of a modified BFGS-type method for unconstrained optimization. Comput. Optim. Appl. 29(3), 315–332 (2004)

Acknowledgements

We would like to thank the two anonymous reviewers for their careful work and constructive comments, which greatly helped us improve the quality of the paper. Ye was supported in part by the National Natural Science Foundation of China under Grant 12101491. Chang was supported in part by the National Natural Science Foundation for Outstanding Young Scholars of China under Grant 72122018 and in part by the Natural Science Foundation of Shaanxi Province under Grant 2021JC-01. Haishan Ye and Dachao Lin contributed equally to this paper.

Author information

Corresponding author

Correspondence to Xiangyu Chang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Useful lemmas

Lemma 10

Let approximate Hessians be updated recursively as \(G_{i+1} \triangleq \mathrm {SR1}(A, G_i, u_i)\) (by Eqn. (5)) with an arbitrary \(u_i \in {\mathbb {R}}^n, \forall i \ge 0\). For each \(k \ge 0\), if \(u_i^\top (G_i - A) u_i > 0 \) holds for all \(0 \le i \le k\), then vectors \(u_0,\dots ,u_k\) are linearly independent. Furthermore, it holds that

$$\begin{aligned} Au_i = G_k u_i, \text{ for } \text{ all } 0\le i\le k-1. \end{aligned}$$
(65)

Proof

Denote \(R_i = G_i - A, \forall i \ge 0\). By the update of SR1, we can obtain that for all \(i \ge 0\),

$$\begin{aligned} R_{i+1} = {\left\{ \begin{array}{ll} R_{i} - \frac{R_{i}u_{i}u_{i}^\top R_{i}}{u_{i}^\top R_{i} u_{i}} &{} \text { if } R_{i}u_{i} \ne 0, \\ R_{i} &{} \text { otherwise}. \end{array}\right. } \end{aligned}$$

Thus, we get

$$\begin{aligned} \mathrm {Ker}(R_{i}) \subseteq \mathrm {Ker}(R_{i+1}), \text { and } u_{i} \in \mathrm {Ker}(R_{i+1}), \forall i \ge 0, \end{aligned}$$
(66)

where \(\mathrm {Ker}(R_i)\) is the null space of \(R_i\).

Next, we prove by induction that \(u_0,\dots , u_k\) are linearly independent provided that \(u_{i}^\top (G_{i} - A) u_{i} > 0\) for all \(i\le k\). First, for \(k = 0\), the result holds trivially. Then, we assume that \(u_0,\dots , u_{k}\) are linearly independent for some \(k \ge 0\), and show that \(u_0,\dots , u_{k+1}\) are linearly independent provided that \(u_{k+1}^\top (G_{k+1} - A) u_{k+1} > 0\). We argue by contradiction and assume that \(u_{k+1}\) can be represented as

$$\begin{aligned} u_{k+1} = \alpha _0 u_0 + \alpha _1 u_1 +\dots +\alpha _{k} u_{k}, \end{aligned}$$

where scalars \(\alpha _0, \dots , \alpha _{k}\) are not all zero. Applying Eq. (66), we have \(R_{k+1} u_i=0, \forall 0 \le i < k+1\). Then we further obtain that \(R_{k+1} u_{k+1} = R_{k+1} (\alpha _0 u_0 + \dots + \alpha _{k} u_{k}) = 0 + \dots + 0 = 0\). This contradicts the assumption that \(u_{k+1}^\top R_{k+1} u_{k+1} = u_{k+1}^\top (G_{k+1} - A) u_{k+1} > 0\). Thus, \(u_0,\dots , u_{k+1}\) are linearly independent, and we finish the induction.

Finally, Eq. (65) can be immediately obtained by observing that \(u_i \in \mathrm {Ker}(R_{i+1}) \subseteq \mathrm {Ker}(R_k) \) for all \(i = 0, \dots , k-1\) based on Eq. (66). \(\square \)
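Under the assumptions of the lemma, Eq. (65) is easy to verify numerically. The sketch below uses illustrative names and reconstructs the update \(\mathrm{SR1}(A, G_i, u_i)\) from the residual form of \(R_{i+1}\) displayed in the proof; it applies the update along a few random directions and checks that \(G_k u_i = A u_i\) for every previous direction.

```python
import numpy as np

def sr1_matrix_update(A, G, u, tol=1e-12):
    """SR1(A, G, u): rank-1 update of G towards A along u, written via the
    residual form R_{i+1} = R_i - R_i u u^T R_i / (u^T R_i u) from the proof."""
    r = (G - A) @ u
    denom = u @ r
    if abs(denom) <= tol:          # R_i u = 0: the update leaves G unchanged
        return G
    return G - np.outer(r, r) / denom

rng = np.random.default_rng(0)
n, k = 6, 4
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)        # target matrix
G = A + Q.T @ Q + np.eye(n)        # G_0 with G_0 - A positive definite
us = [rng.standard_normal(n) for _ in range(k)]
for u in us:                       # build G_k from k updates
    G = sr1_matrix_update(A, G, u)
for u in us:                       # Eq. (65): G_k u_i = A u_i for all i < k
    print(np.allclose(G @ u, A @ u))
```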

Upper bound of M for logistic regression

The logistic regression objective is defined as follows:

$$\begin{aligned} f(x) = \frac{1}{m}\sum _{i=1}^{m} \log [1+\exp (-b_i a_i^\top x)] + \frac{\gamma }{2}\Vert x\Vert ^2, \end{aligned}$$

where \(a_i \in {\mathbb {R}}^{n}\) is the i-th input vector, \(b_i\in \{-1,1\}\) is the corresponding label, and \(\gamma \ge 0\) is the regularization parameter. Accordingly, for all \(x,h \in {\mathbb {R}}^n\), we have that

$$\begin{aligned} h^\top \nabla ^2 f(x) h= & {} \frac{1}{m}\sum _{i=1}^{m} p_i(x)(1-p_i(x)) (a_i^\top h)^2 + \gamma \left\| h\right\| ^2, \\ \; p_i(x)\triangleq & {} \frac{1}{1+\exp (-b_i a_i^\top x)} \in (0, 1), \end{aligned}$$

and

$$\begin{aligned} \begin{aligned} D^3 f(x)[h,h,h]&= \frac{1}{m}\sum _{i=1}^{m} (1-2p_i(x)) \left( [p_i'(x)]^\top h\right) (b_i a_i^\top h)^2 \\&{\mathop {=}\limits ^{(i)}} \frac{1}{m}\sum _{i=1}^{m} (1-2p_i(x)) p_i(x)(1-p_i(x)) (b_i a_i^\top h)^3. \end{aligned} \end{aligned}$$
(67)

Here, \(D^3 f(x)[h,h,h] = \left. \frac{d^3}{d t^3} f(x+t h)\right| _{t=0}\) is the third derivative of f along the direction h, and \((i)\) uses the derivation

$$\begin{aligned} p_i'(x) = \frac{\exp (-b_i \cdot a_i^\top x)}{\left( 1+\exp (-b_i \cdot a_i^\top x)\right) ^2} \cdot b_i a_i = p_i(x)(1-p_i(x))b_i a_i. \end{aligned}$$

Since \(|t(1-t)(1-2t)| \le \frac{\sqrt{3}}{18}\) for all \(t\in (0,1)\), with equality when \( t = \frac{1}{2} - \frac{1}{2\sqrt{3}}\), we further obtain

$$\begin{aligned} D^3 f(x)[h,h,h] {\mathop {\le }\limits ^{(67)}} \frac{\sqrt{3}}{18m}\sum _{i=1}^{m} \left\| b_i a_i\right\| ^3 \cdot \left\| h\right\| ^3 = \frac{\sqrt{3}}{18m}\sum _{i=1}^{m} \left\| a_i\right\| ^3 \cdot \left\| h\right\| ^3. \end{aligned}$$
(68)

By Eq. (68), f(x) has \(L'\)-Lipschitz Hessians with \(L'\triangleq \frac{\sqrt{3}}{18m}\sum _{i=1}^{m} \left\| a_i\right\| ^3\).

Finally, combining this with the \(\gamma \)-strong convexity of f, we obtain that for all \(x,y, z, w\in {\mathbb {R}}^n\),

$$\begin{aligned} \nabla ^2 f(x) -\nabla ^2 f(y)\preceq & {} L' \cdot \left\| x-y\right\| \cdot I \preceq L' \cdot \frac{\left\| x-y\right\| _{z}}{\sqrt{\gamma }} \cdot \frac{\nabla ^2 f(w)}{\gamma } \\= & {} \frac{L'}{\gamma ^{3/2}} \cdot \left\| x-y\right\| _{z} \cdot \nabla ^2 f(w), \end{aligned}$$

where the second inequality follows from the \(\gamma \)-strong convexity of f. Hence, f is M-strongly self-concordant with

$$\begin{aligned} M \le \frac{L'}{\gamma ^{3/2}} = \frac{\sqrt{3}}{18} \cdot \frac{\sum _{i=1}^{m} \left\| a_i\right\| ^3}{ m\gamma ^{3/2}}. \end{aligned}$$
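For concreteness, this bound is straightforward to evaluate from the data. The short sketch below (with assumed variable names `a` for the m-by-n data matrix and `gamma` for the regularization parameter) computes \(L'\) and the resulting upper bound on M.

```python
import numpy as np

def strong_self_concordance_bound(a, gamma):
    """Upper bound on M for l2-regularized logistic regression:
    M <= (sqrt(3) / 18) * sum_i ||a_i||^3 / (m * gamma^{3/2})."""
    m = a.shape[0]
    l_prime = np.sqrt(3) / (18 * m) * np.sum(np.linalg.norm(a, axis=1) ** 3)
    return l_prime / gamma ** 1.5

# Example with random data: m = 100 samples in dimension n = 10, gamma = 0.1.
rng = np.random.default_rng(0)
print(strong_self_concordance_bound(rng.standard_normal((100, 10)), gamma=0.1))
```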

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ye, H., Lin, D., Chang, X. et al. Towards explicit superlinear convergence rate for SR1. Math. Program. 199, 1273–1303 (2023). https://doi.org/10.1007/s10107-022-01865-w

