
Towards explicit superlinear convergence rate for SR1

  • Full Length Paper
  • Series A
  • Published in: Mathematical Programming

Abstract

We study the convergence rate of the famous Symmetric Rank-1 (SR1) algorithm, which has wide applications in different scenarios. Although it has been extensively investigated, SR1 still lacks a non-asymptotic superlinear rate, in contrast to other quasi-Newton methods such as DFP and BFGS. In this paper, we address this issue and obtain the first explicit non-asymptotic rates of superlinear convergence for the vanilla SR1 method equipped with a correction strategy that is used to achieve numerical stability. Specifically, the vanilla SR1 with the correction strategy achieves a rate of the form \(\left( \frac{2n\ln (4\varkappa )}{k}\right) ^{k/2}\) for general smooth strongly convex functions, where k is the iteration counter, \(\varkappa \) is the condition number of the objective function, and n is the dimensionality of the problem. Furthermore, the vanilla SR1 algorithm enjoys a slightly faster convergence rate and finds the optimum of a quadratic objective function in at most n steps.
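As background (and not part of the paper's contribution), the classical SR1 secant update that the paper analyzes can be sketched in a few lines of Python. The function name `sr1_secant_update`, the tolerance, and the standard skipping safeguard below are illustrative choices; they do not reproduce the correction strategy studied in the paper.

```python
import numpy as np

def sr1_secant_update(B, s, y, tol=1e-8):
    """Classical SR1 update of a Hessian approximation B from a step
    s = x_{k+1} - x_k and gradient difference y = grad f(x_{k+1}) - grad f(x_k)."""
    r = y - B @ s
    denom = r @ s
    # Standard safeguard: skip the update when the denominator is tiny,
    # since the raw SR1 formula is numerically unstable in that case.
    if abs(denom) <= tol * np.linalg.norm(r) * np.linalg.norm(s):
        return B
    return B + np.outer(r, r) / denom  # symmetric rank-1 correction
```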


Notes

  1. Indeed, the proof of [22,  Corollary 4.4] gives \(K_0^{\mathrm {greedy\_SR1}} = 2\varkappa \ln (2n+1)+n\ln (4n\varkappa )\).

References

  1. Berahas, A.S., Jahani, M., Richtárik, P., Takáč, M.: Quasi-Newton methods for deep learning: Forget the past, just sample. arXiv preprint arXiv:1901.09997 (2019)

  2. Boyd, S., Boyd, S.P., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  3. Broyden, C.G.: The convergence of a class of double-rank minimization algorithms: 1. General considerations. IMA J. Appl. Math. 6(1), 76–90 (1970)

  4. Broyden, C.G.: The convergence of a class of double-rank minimization algorithms: 2. The new algorithm. IMA J. Appl. Math. 6(3), 222–231 (1970)

  5. Broyden, C.G., Dennis, J.E., Jr., Moré, J.J.: On the local and superlinear convergence of quasi-Newton methods. IMA J. Appl. Math. 12(3), 223–245 (1973)

  6. Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)

  7. Byrd, R.H., Khalfan, H.F., Schnabel, R.B.: Analysis of a symmetric rank-one trust region method. SIAM J. Optim. 6(4), 1025–1039 (1996)

  8. Byrd, R.H., Liu, D.C., Nocedal, J.: On the behavior of Broyden’s class of quasi-Newton methods. SIAM J. Optim. 2(4), 533–557 (1992)

  9. Byrd, R.H., Nocedal, J.: A tool for the analysis of quasi-Newton methods with application to unconstrained minimization. SIAM J. Numer. Anal. 26(3), 727–739 (1989)

  10. Byrd, R.H., Nocedal, J., Yuan, Y.X.: Global convergence of a class of quasi-Newton methods on convex problems. SIAM J. Numer. Anal. 24(5), 1171–1190 (1987)

  11. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)

  12. Conn, A.R., Gould, N.I., Toint, P.L.: Convergence of quasi-Newton matrices generated by the symmetric rank one update. Math. Program. 50(1), 177–195 (1991)

  13. Dixon, L.: Quasi-Newton algorithms generate identical points. Math. Program. 2(1), 383–387 (1972)

  14. Dixon, L.: Quasi Newton techniques generate identical points II: The proofs of four new theorems. Math. Program. 3(1), 345–358 (1972)

  15. Fletcher, R.: A new approach to variable metric algorithms. Comput. J. 13(3), 317–322 (1970)

  16. Goldfarb, D.: A family of variable-metric methods derived by variational means. Math. Comput. 24(109), 23–26 (1970)

  17. Gower, R., Goldfarb, D., Richtárik, P.: Stochastic block BFGS: Squeezing more curvature out of data. In: International Conference on Machine Learning, pp. 1869–1878. PMLR (2016)

  18. Gower, R.M., Richtárik, P.: Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM J. Matrix Anal. Appl. 38(4), 1380–1409 (2017)

  19. Jin, Q., Mokhtari, A.: Non-asymptotic superlinear convergence of standard quasi-Newton methods. arXiv preprint arXiv:2003.13607 (2020)

  20. Kao, C., Chen, S.P.: A stochastic quasi-Newton method for simulation response optimization. Eur. J. Oper. Res. 173(1), 30–46 (2006)

  21. Kovalev, D., Gower, R.M., Richtárik, P., Rogozin, A.: Fast linear convergence of randomized BFGS. arXiv preprint arXiv:2002.11337 (2020)

  22. Lin, D., Ye, H., Zhang, Z.: Greedy and random quasi-Newton methods with faster explicit superlinear convergence. Adv. Neural Inf. Process. Syst. 34, 6646–6657 (2021)

  23. Moritz, P., Nishihara, R., Jordan, M.: A linearly-convergent stochastic L-BFGS algorithm. In: Artificial Intelligence and Statistics, pp. 249–258. PMLR (2016)

  24. Nocedal, J., Wright, S.: Numerical Optimization. Springer Science & Business Media, Berlin (2006)

  25. Powell, M.: On the convergence of the variable metric algorithm. IMA J. Appl. Math. 7(1), 21–36 (1971)

  26. Qu, S., Goh, M., Chan, F.T.: Quasi-Newton methods for solving multiobjective optimization. Oper. Res. Lett. 39(5), 397–399 (2011)

  27. Rodomanov, A., Nesterov, Y.: Greedy quasi-Newton methods with explicit superlinear convergence. SIAM J. Optim. 31(1), 785–811 (2021)

  28. Rodomanov, A., Nesterov, Y.: New results on superlinear convergence of classical Quasi-Newton methods. J. Optim. Theory Appl. 188(3), 744–769 (2021)

  29. Rodomanov, A., Nesterov, Y.: Rates of superlinear convergence for classical quasi-Newton methods. Math. Program. 194(1), 159–190 (2022)

  30. Shanno, D.F.: Conditioning of quasi-Newton methods for function minimization. Math. Comput. 24(111), 647–656 (1970)

  31. Wei, Z., Yu, G., Yuan, G., Lian, Z.: The superlinear convergence of a modified BFGS-type method for unconstrained optimization. Comput. Optim. Appl. 29(3), 315–332 (2004)

Acknowledgements

We would like to thank the two anonymous reviewers for their careful work and constructive comments, which greatly helped us improve the quality of the paper. Ye was supported in part by the National Natural Science Foundation of China under Grant 12101491. Chang was supported in part by the National Natural Science Foundation for Outstanding Young Scholars of China under Grant 72122018 and in part by the Natural Science Foundation of Shaanxi Province under Grant 2021JC-01. Haishan Ye and Dachao Lin contributed equally to this paper.

Author information

Corresponding author

Correspondence to Xiangyu Chang.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Useful lemmas

Lemma 10

Let approximate Hessians be updated recursively as \(G_{i+1} \triangleq \mathrm {SR1}(A, G_i, u_i)\) (by Eqn. (5)) with an arbitrary \(u_i \in {\mathbb {R}}^n, \forall i \ge 0\). For each \(k \ge 0\), if \(u_i^\top (G_i - A) u_i > 0 \) holds for all \(0 \le i \le k\), then vectors \(u_0,\dots ,u_k\) are linearly independent. Furthermore, it holds that

$$\begin{aligned} Au_i = G_k u_i, \text{ for } \text{ all } 0\le i\le k-1. \end{aligned}$$
(65)

Proof

Denote \(R_i = G_i - A, \forall i \ge 0\). By the update of SR1, we can obtain that for all \(i \ge 0\),

$$\begin{aligned} R_{i+1} = {\left\{ \begin{array}{ll} R_{i} - \frac{R_{i}u_{i}u_{i}^\top R_{i}}{u_{i}^\top R_{i} u_{i}} &{} \text { if } R_{i}u_{i} \ne 0, \\ R_{i} &{} \text { otherwise}. \end{array}\right. } \end{aligned}$$

Thus, we get

$$\begin{aligned} \mathrm {Ker}(R_{i}) \subseteq \mathrm {Ker}(R_{i+1}), \text { and } u_{i} \in \mathrm {Ker}(R_{i+1}), \forall i \ge 0, \end{aligned}$$
(66)

where \(\mathrm {Ker}(R_i)\) is the null space of \(R_i\).

Next, we prove by induction that \(u_0,\dots , u_k\) are linearly independent provided that \(u_{i}^\top (G_{i} - A) u_{i} > 0\) for all \(i\le k\). First, for \(k = 0\), the result holds trivially. Then, we assume that \(u_0,\dots , u_{k}\) are linearly independent for some \(k \ge 0\), and show that \(u_0,\dots , u_{k+1}\) are linearly independent provided that \(u_{k+1}^\top (G_{k+1} - A) u_{k+1} > 0\). We argue by contradiction and assume that \(u_{k+1}\) can be represented as

$$\begin{aligned} u_{k+1} = \alpha _0 u_0 + \alpha _1 u_1 +\dots +\alpha _{k} u_{k}, \end{aligned}$$

where scalars \(\alpha _0, \dots , \alpha _{k}\) are not all zero. Applying Eq. (66), we have \(R_{k+1} u_i=0, \forall 0 \le i < k+1\). Then we further obtain that \(R_{k+1} u_{k+1} = R_{k+1} (\alpha _0 u_0 + \dots + \alpha _{k} u_{k}) = 0 + \dots + 0 = 0\). This contradicts the assumption that \(u_{k+1}^\top R_{k+1} u_{k+1} = u_{k+1}^\top (G_{k+1} - A) u_{k+1} > 0\). Thus, \(u_0,\dots , u_{k+1}\) are linearly independent, and we finish the induction.

Finally, Eq. (65) can be immediately obtained by observing that \(u_i \in \mathrm {Ker}(R_{i+1}) \subseteq \mathrm {Ker}(R_k) \) for all \(i = 0, \dots , k-1\) based on Eq. (66). \(\square \)
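Under the assumptions of the lemma, Eq. (65) is easy to verify numerically. The sketch below uses illustrative names and reconstructs the update \(\mathrm{SR1}(A, G_i, u_i)\) from the residual form of \(R_{i+1}\) displayed in the proof; it applies the update along a few random directions and checks that \(G_k u_i = A u_i\) for every previous direction.

```python
import numpy as np

def sr1_matrix_update(A, G, u, tol=1e-12):
    """SR1(A, G, u): rank-1 update of G towards A along u, written via the
    residual form R_{i+1} = R_i - R_i u u^T R_i / (u^T R_i u) from the proof."""
    r = (G - A) @ u
    denom = u @ r
    if abs(denom) <= tol:          # R_i u = 0: the update leaves G unchanged
        return G
    return G - np.outer(r, r) / denom

rng = np.random.default_rng(0)
n, k = 6, 4
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)        # target matrix
G = A + Q.T @ Q + np.eye(n)        # G_0 with G_0 - A positive definite
us = [rng.standard_normal(n) for _ in range(k)]
for u in us:                       # build G_k from k updates
    G = sr1_matrix_update(A, G, u)
for u in us:                       # Eq. (65): G_k u_i = A u_i for all i < k
    print(np.allclose(G @ u, A @ u))
```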

Upper bound of M for logistic regression

The logistic regression objective is defined as follows:

$$\begin{aligned} f(x) = \frac{1}{m}\sum _{i=1}^{m} \log [1+\exp (-b_i a_i^\top x)] + \frac{\gamma }{2}\Vert x\Vert ^2, \end{aligned}$$

where \(a_i \in {\mathbb {R}}^{n}\) is the i-th input vector, \(b_i\in \{-1,1\}\) is the corresponding label, and \(\gamma \ge 0\) is the regularization parameter. Accordingly, for all \(x,h \in {\mathbb {R}}^n\), we have that

$$\begin{aligned} h^\top \nabla ^2 f(x) h= & {} \frac{1}{m}\sum _{i=1}^{m} p_i(x)(1-p_i(x)) (a_i^\top h)^2 + \gamma \left\| h\right\| ^2, \\ \; p_i(x)\triangleq & {} \frac{1}{1+\exp (-b_i a_i^\top x)} \in (0, 1), \end{aligned}$$

and

$$\begin{aligned} \begin{aligned} D^3 f(x)[h,h,h]&= \frac{1}{m}\sum _{i=1}^{m} (1-2p_i(x)) \left( [p_i'(x)]^\top h\right) (b_i a_i^\top h)^2 \\&{\mathop {=}\limits ^{(i)}} \frac{1}{m}\sum _{i=1}^{m} (1-2p_i(x)) p_i(x)(1-p_i(x)) (b_i a_i^\top h)^3. \end{aligned} \end{aligned}$$
(67)

Here, \(D^3 f(x)[h,h,h] = \left. \frac{d^3}{d t^3} f(x+t h)\right| _{t=0}\) is the third derivative of f along the direction h, and \((i)\) uses the derivation

$$\begin{aligned} p_i'(x) = \frac{\exp (-b_i \cdot a_i^\top x)}{\left( 1+\exp (-b_i \cdot a_i^\top x)\right) ^2} \cdot b_i a_i = p_i(x)(1-p_i(x))b_i a_i. \end{aligned}$$

Since \(|t(1-t)(1-2t)| \le \frac{\sqrt{3}}{18}\) for all \(t\in (0,1)\), with equality when \( t = \frac{1}{2} - \frac{1}{2\sqrt{3}}\), we further obtain

$$\begin{aligned} D^3 f(x)[h,h,h] {\mathop {\le }\limits ^{(67)}} \frac{\sqrt{3}}{18m}\sum _{i=1}^{m} \left\| b_i a_i\right\| ^3 \cdot \left\| h\right\| ^3 = \frac{\sqrt{3}}{18m}\sum _{i=1}^{m} \left\| a_i\right\| ^3 \cdot \left\| h\right\| ^3. \end{aligned}$$
(68)

By Eq. (68), f(x) has \(L'\)-Lipschitz Hessians with \(L'\triangleq \frac{\sqrt{3}}{18m}\sum _{i=1}^{m} \left\| a_i\right\| ^3\).

Finally, combining this with the \(\gamma \)-strong convexity of f, we obtain that for all \(x,y, z, w\in {\mathbb {R}}^n\),

$$\begin{aligned} \nabla ^2 f(x) -\nabla ^2 f(y)\preceq & {} L' \cdot \left\| x-y\right\| \cdot I \preceq L' \cdot \frac{\left\| x-y\right\| _{z}}{\sqrt{\gamma }} \cdot \frac{\nabla ^2 f(w)}{\gamma } \\= & {} \frac{L'}{\gamma ^{3/2}} \cdot \left\| x-y\right\| _{z} \cdot \nabla ^2 f(w), \end{aligned}$$

where the second inequality follows from the \(\gamma \)-strong convexity of f. Hence, f is M-strongly self-concordant with

$$\begin{aligned} M \le \frac{L'}{\gamma ^{3/2}} = \frac{\sqrt{3}}{18} \cdot \frac{\sum _{i=1}^{m} \left\| a_i\right\| ^3}{ m\gamma ^{3/2}}. \end{aligned}$$
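For concreteness, this bound is straightforward to evaluate from the data. The short sketch below (with assumed variable names `a` for the m-by-n data matrix and `gamma` for the regularization parameter) computes \(L'\) and the resulting upper bound on M.

```python
import numpy as np

def strong_self_concordance_bound(a, gamma):
    """Upper bound on M for l2-regularized logistic regression:
    M <= (sqrt(3) / 18) * sum_i ||a_i||^3 / (m * gamma^{3/2})."""
    m = a.shape[0]
    l_prime = np.sqrt(3) / (18 * m) * np.sum(np.linalg.norm(a, axis=1) ** 3)
    return l_prime / gamma ** 1.5

# Example with random data: m = 100 samples in dimension n = 10, gamma = 0.1.
rng = np.random.default_rng(0)
print(strong_self_concordance_bound(rng.standard_normal((100, 10)), gamma=0.1))
```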

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ye, H., Lin, D., Chang, X. et al. Towards explicit superlinear convergence rate for SR1. Math. Program. 199, 1273–1303 (2023). https://doi.org/10.1007/s10107-022-01865-w

