Regularized nonlinear acceleration

  • Full Length Paper
  • Series A

Mathematical Programming

Abstract

We describe a convergence acceleration technique for unconstrained optimization problems. Our scheme computes estimates of the optimum from a nonlinear average of the iterates produced by any optimization method. The weights in this average are computed via a simple linear system, whose solution can be updated online. This acceleration scheme runs in parallel to the base algorithm, providing improved estimates of the solution on the fly, while the original optimization method is running. Numerical experiments are detailed on classical classification problems.
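
For intuition, the sketch below illustrates the extrapolation step described above on iterates stored as NumPy vectors. It is a minimal, illustrative implementation and not the authors' reference code: the function name rna, the default regularization value and the scaling of the regularizer by the norm of \(R^\top R\) are assumptions made here for readability.

import numpy as np

def rna(xs, lam=1e-8):
    """Minimal sketch of the nonlinear averaging step (illustrative only).

    xs  : list of iterates x_0, ..., x_k produced by the base method,
          each a 1-D NumPy array.
    lam : regularization parameter (an arbitrary default, not the
          paper's recommended choice).
    """
    X = np.column_stack(xs)          # d x (k+1) matrix of iterates
    R = np.diff(X, axis=1)           # successive differences x_{i+1} - x_i
    RtR = R.T @ R
    k = RtR.shape[0]
    # Solve the small regularized linear system, then normalize the
    # solution so that the weights sum to one.
    z = np.linalg.solve(RtR + lam * np.linalg.norm(RtR, 2) * np.eye(k),
                        np.ones(k))
    c = z / z.sum()
    # Weighted (nonlinear) average of the iterates (here the first k of
    # them; conventions for which iterates to combine may differ).
    return X[:, :k] @ c

Re-running this routine whenever a new iterate becomes available yields the online estimates mentioned above, at the cost of solving one small linear system per call.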

References

  1. Aitken, A. C.: XXV.—On Bernoulli’s Numerical Solution of Algebraic Equations. In: Proceedings of the Royal Society of Edinburgh, vol. 46, pp. 289–305 (1927)

  2. Anderson, D.G.: Iterative procedures for nonlinear integral equations. J. ACM (JACM) 12(4), 547–560 (1965)

  3. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  4. Ben-Tal, A., Nemirovski, A.: Lectures on modern convex optimization: analysis, algorithms, and engineering applications. SIAM (2001)

  5. Brezinski, C.: Accélération de la convergence en analyse numérique, vol. 584. Springer, Berlin (2006)

  6. Cabay, S., Jackson, L.: A polynomial extrapolation method for finding limits and antilimits of vector sequences. SIAM J. Numer. Anal. 13(5), 734–752 (1976)

  7. Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014)

  8. Durbin, J.: The fitting of time-series models. Revue de l’Institut International de Statistique, pp. 233–244 (1960)

  9. Eddy, R.: Extrapolating to the limit of a vector sequence. In: Information Linkage Between Applied Mathematics and Industry, pp. 387–396 (1979)

  10. Golub, G.H., Varga, R.S.: Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods. Numerische Mathematik 3(1), 147–156 (1961)

  11. Hardt, M.: The zen of gradient descent (2013)

  12. Hazan, E.: Personal communication (2014)

  13. Heinig, G., Rost, K.: Fast algorithms for Toeplitz and Hankel matrices. Linear Algebra Appl. 435(1), 1–59 (2011)

  14. Lasserre, J.B.: Global optimization with polynomials and the problem of moments. SIAM J. Optim. 11(3), 796–817 (2001)

  15. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)

  16. Levinson, N.: The Wiener RMS error criterion in filter design and prediction. Appendix B of: Wiener, N.: Extrapolation, Interpolation, and Smoothing of Stationary Time Series (1949)

  17. Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. In: Advances in Neural Information Processing Systems, pp. 3384–3392 (2015)

  18. Mešina, M.: Convergence acceleration for the iterative solution of the equations X = AX + f. Comput. Methods Appl. Mech. Eng. 10(2), 165–173 (1977)

  19. Nemirovskii, A., Nesterov, Y.E.: Optimal methods of smooth convex minimization. USSR Comput. Math. Math. Phys. 25(2), 21–30 (1985)

  20. Nemirovskiy, A.S., Polyak, B.T.: Iterative methods for solving linear ill-posed problems under precise information. Eng. Cyber. 4, 50–56 (1984)

  21. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). In: Soviet Mathematics Doklady, vol. 27, pp. 372–376 (1983)

  22. Nesterov, Y.: Squared functional systems and optimization problems. In: High performance optimization, pp. 405–440. Springer, Berlin (2000)

  23. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2013)

  24. Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152(1–2), 381–404 (2015)

  25. Parrilo, P.A.: Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization, Ph.D. thesis, California Institute of Technology (2000)

  26. Shanks, D.: Non-linear transformations of divergent and slowly convergent sequences. Stud. Appl. Math. 34(1–4), 1–42 (1955)

  27. Sidi, A., Ford, W.F., Smith, D.A.: Acceleration of convergence of vector sequences. SIAM J. Numer. Anal. 23(1), 178–196 (1986)

  28. Smith, D.A., Ford, W.F., Sidi, A.: Extrapolation methods for vector sequences. SIAM Rev. 29(2), 199–233 (1987)

  29. Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. In: Advances in Neural Information Processing Systems, pp. 2510–2518 (2014)

  30. Tyrtyshnikov, E.E.: How bad are Hankel matrices? Numerische Mathematik 67(2), 261–269 (1994)

  31. Wibisono, A., Wilson, A.C.: On accelerated methods in optimization. arXiv preprint (2015). arXiv:1509.03616

  32. Wynn, P.: On a device for computing the \(e_m(S_n)\) transformation. In: Mathematical Tables and Other Aids to Computation, pp. 91–96 (1956)

Acknowledgements

AA is at the département d’informatique de l’ENS, École normale supérieure, UMR CNRS 8548, PSL Research University, 75005 Paris, France, and INRIA Sierra project-team. The authors would like to acknowledge support from a starting grant from the European Research Council (ERC project SIPA), from the ITN MacSeNet (project number 642685), as well as support from the chaire Économie des nouvelles données with the data science joint research initiative with the fonds AXA pour la recherche, and from a Google focused award.

Author information

Corresponding author

Correspondence to Alexandre d’Aspremont.

Additional information

A subset of these results appeared at the 2016 NIPS conference under the same title.

Appendix A: Missing propositions and proofs

1.1 A.1: Missing propositions

Proposition A.1

Consider the function

$$\begin{aligned} f(x) = \kappa \sqrt{a - \lambda x^2} + bx \end{aligned}$$

defined for \(x \in [0,\sqrt{a/\lambda }]\). Its maximal value is attained at

$$\begin{aligned} x_{ \text {opt} } = \frac{b \sqrt{a}}{\sqrt{\lambda ^2\kappa ^2 + \lambda b^2}} \end{aligned}$$

and, provided \(x_{ \text {opt} } \in [0,\sqrt{a/\lambda }]\), its maximal value is

$$\begin{aligned} f_{\max } = \sqrt{a} \sqrt{\kappa ^2 + \frac{b^2}{\lambda }} . \end{aligned}$$
(41)

Proof

The (positive) root of the derivative of f satisfies

$$\begin{aligned} b\sqrt{a-\lambda x^2} - \kappa \lambda x = 0 \qquad \Leftrightarrow \qquad x = \frac{b \sqrt{a}}{\sqrt{\lambda ^2\kappa ^2 + \lambda b^2}} . \end{aligned}$$

Substituting this solution into f, we obtain its maximal value,

$$\begin{aligned} \kappa \sqrt{a - \lambda \left( \frac{b \sqrt{a}}{\sqrt{\lambda ^2\kappa ^2 + \lambda b^2}}\right) ^2} + b \frac{b \sqrt{a}}{\sqrt{\lambda ^2\kappa ^2 + \lambda b^2}}&= \kappa \sqrt{a - \lambda \frac{b^2 a}{\lambda ^2\kappa ^2 + \lambda b^2}} + \frac{b^2 \sqrt{a}}{\sqrt{\lambda ^2\kappa ^2 + \lambda b^2}} \\&= \kappa \sqrt{\frac{a \lambda ^2 \kappa ^2}{\lambda ^2\kappa ^2 + \lambda b^2}} + \frac{b^2 \sqrt{a}}{\sqrt{\lambda ^2\kappa ^2 + \lambda b^2}} \\&= \sqrt{a}\,\frac{ \kappa ^2 \lambda + b^2 }{\sqrt{\lambda ^2\kappa ^2 + \lambda b^2}} \\&= \frac{\sqrt{a}}{\lambda } \sqrt{\lambda ^2\kappa ^2 + \lambda b^2}. \end{aligned}$$

Simplifying the factor \(\lambda \) in the last expression yields (41), which concludes the proof. \(\square \)
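
As a quick numerical sanity check of Proposition A.1 (with arbitrarily chosen positive parameter values, not values used elsewhere in the paper), the closed-form maximizer and the maximal value (41) can be compared against a brute-force grid search:

import numpy as np

# Arbitrary positive parameters for the check.
a, lam, kappa, b = 2.0, 0.5, 1.3, 0.7

def f(x):
    # f(x) = kappa * sqrt(a - lam * x^2) + b * x on [0, sqrt(a/lam)]
    return kappa * np.sqrt(a - lam * x**2) + b * x

x_opt = b * np.sqrt(a) / np.sqrt(lam**2 * kappa**2 + lam * b**2)
f_max = np.sqrt(a) * np.sqrt(kappa**2 + b**2 / lam)   # formula (41)

grid = np.linspace(0.0, np.sqrt(a / lam), 1_000_001)
assert np.isclose(f(x_opt), f_max)
assert abs(f(grid).max() - f_max) < 1e-6
print("closed-form maximum matches the grid search")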

1.2 A.2: Proof of Proposition 3.8

First, we show that the choice \(\sigma = 1-\frac{\mu }{L}\) satisfies \(\Vert G\Vert = \Vert g'(x^*)\Vert \le \sigma \). Our fixed-point function g reads

$$\begin{aligned} g(x) = x-\frac{1}{L}f'(x) . \end{aligned}$$

Since \(g'(x) = I-\frac{1}{L}f''(x)\), we have \(g'(x^*) = I-\frac{1}{L}f''(x^*)\). Because f is \(\mu \)-strongly convex and L-smooth, \(\mu I \preceq f''(x) \preceq L I\), in particular at \(x=x^*\). In conclusion,

$$\begin{aligned} \Vert g'(x^*)\Vert = \Vert I-\frac{1}{L}f''(x^*)\Vert \le 1-\frac{\mu }{L}. \end{aligned}$$

Now, consider the matrix \({\tilde{R}}\). Since the \(i\)th column \({\tilde{R}}_i\) is equal to \({\tilde{x}}_{i+1}-{\tilde{x}}_{i}\),

$$\begin{aligned} \Vert {\tilde{R}}_i\Vert = \Vert {\tilde{x}}_{i+1}-{\tilde{x}}_{i}\Vert = \frac{1}{L}\Vert f'({\tilde{x}}_i)\Vert \le \Vert {\tilde{x}}_i-x^*\Vert . \end{aligned}$$

In the last inequality, we used the fact that \(f'\) is L-Lipschitz continuous and \(f'(x^*)=0\), so \(\Vert f'(x)\Vert = \Vert f'(x)-f'(x^*)\Vert \le L \Vert x-x^*\Vert \). It is also possible to show [23] that the gradient method converges at the rate

$$\begin{aligned} \Vert {\tilde{x}}_{i+1}-x^*\Vert \le \sigma \Vert {\tilde{x}}_i-x^*\Vert . \end{aligned}$$

It remains to link this quantity to \(\Vert {\tilde{R}}\Vert \),

$$\begin{aligned} \Vert {\tilde{R}}\Vert \le \sum _{i=0}^k \Vert {\tilde{R}}_i\Vert \le \sum _{i=0}^k \sigma ^i\Vert x_0-x^*\Vert = \frac{1-\sigma ^{k+1}}{1-\sigma }\Vert x_0-x^*\Vert . \end{aligned}$$

We continue with \(\Vert \mathcal {E}\Vert \). We bound \(\Vert \mathcal {E}_i\Vert = \Vert {\tilde{x}}_{i+1}-x_{i+1}\Vert _2\) through a recursion on \(\Vert {\tilde{x}}_i-x_i\Vert _2\), the two sequences starting from the same point so that \(\Vert {\tilde{x}}_0-x_0\Vert _2 = 0\). Indeed,

$$\begin{aligned} {\tilde{x}}_{i+1} - x_{i+1}&= {\tilde{x}}_i - \frac{1}{L}\nabla f({\tilde{x}}_i) - x_i + \frac{1}{L} \nabla ^2 f(x^*)(x_i-x^*) \\&= {\tilde{x}}_{i} - x_{i} - \frac{1}{L}\left( \nabla f({\tilde{x}}_i) - \nabla ^2 f(x^*)(x_i-x^*)\right) \\&= \left( I-\frac{\nabla ^2 f(x^*)}{L}\right) ({\tilde{x}}_{i} - x_{i}) - \frac{1}{L}\left( \nabla f({\tilde{x}}_i) - \nabla ^2 f(x^*)({\tilde{x}}_i-x^*)\right) . \end{aligned}$$

Since f has a Lipschitz continuous Hessian (with constant M), we have ([23], Lemma 1.2.4)

$$\begin{aligned} \left\| \nabla f(y) - \nabla f(x) - \nabla ^2 f(x)(y-x)\right\| _2 \le \frac{M}{2}\Vert y-x\Vert ^2. \end{aligned}$$
(42)

We can thus bound the norm of the error at the \(i{\text {th}}\) iteration,

$$\begin{aligned} \Vert x_{i+1}-{\tilde{x}}_{i+1}\Vert _2&\le \left\| I-\frac{\nabla ^2 f(x^*)}{L}\right\| _2 \Vert x_{i}-{\tilde{x}}_{i} \Vert _2 + \frac{1}{L}\left\| \nabla f({\tilde{x}}_i) - \nabla ^2 f(x^*)({\tilde{x}}_i-x^*)\right\| _2 \\&= \Vert g'(x^*)\Vert _2\Vert x_{i}-{\tilde{x}}_{i} \Vert _2 + \frac{1}{L}\left\| \nabla f({\tilde{x}}_i) - \nabla f(x^*) - \nabla ^2 f(x^*)({\tilde{x}}_i-x^*)\right\| _2 . \end{aligned}$$

By equation (42), and because \(\Vert g'(x^*)\Vert \le \sigma \), we have

$$\begin{aligned} \Vert x_{i+1}-{\tilde{x}}_{i+1}\Vert _2&\le \sigma \Vert x_{i}-{\tilde{x}}_{i} \Vert _2 + \frac{M}{2L}\left\| {\tilde{x}}_i - x^*\right\| _2^2 \\&\le \sigma \Vert x_{i}-{\tilde{x}}_{i}\Vert _2 + \frac{M}{2L}\sigma ^{2i} \Vert x_0-x^*\Vert _2^2 \\&\le \Vert x_{i}-{\tilde{x}}_{i}\Vert _2 + \frac{M}{2L} \Vert x_0-x^*\Vert _2^2. \end{aligned}$$

The cruder bound in the last line (which uses \(\sigma \le 1\)) greatly simplifies the remaining computations. Unrolling the recursion, and using \(x_0 = {\tilde{x}}_0\), we obtain the bound

$$\begin{aligned} \Vert x_{i+1}-{\tilde{x}}_{i+1}\Vert _2 \le (i+1) \frac{M}{2L}\Vert x_0-x^*\Vert ^2. \end{aligned}$$

Finally,

$$\begin{aligned} \Vert \mathcal {E}\Vert \le \sum _{i=0}^k \Vert x_{i+1}-{\tilde{x}}_{i+1}\Vert _2 \le \sum _{i=0}^k (i+1) \frac{M}{2L}\Vert x_0-x^*\Vert ^2 = \frac{(k+1)(k+2)}{2}\,\frac{M}{2L}\Vert x_0-x^*\Vert ^2 \le (k+2)^2 \frac{M}{4L}\Vert x_0-x^*\Vert ^2. \end{aligned}$$

Despite the simplification made earlier, this bound is close to the one obtained without it. \(\square \)
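
As a rough numerical illustration of this perturbation bound (a sketch on an arbitrarily chosen test function, not one of the paper's experiments), one can run the gradient method and its linearization around \(x^*\) side by side and check that \(\Vert x_i - {\tilde{x}}_i\Vert \) stays below \(i\,\frac{M}{2L}\Vert x_0-x^*\Vert ^2\):

import numpy as np

# Test function (illustrative choice): f(x) = (mu/2) * ||x||^2 + log(cosh(x_1)).
# It is mu-strongly convex and L-smooth with L = mu + 1, its minimizer is
# x* = 0, and its Hessian is M-Lipschitz with M = 4 / (3 * sqrt(3)).
mu = 0.1
L = mu + 1.0
M = 4.0 / (3.0 * np.sqrt(3.0))
x_star = np.zeros(2)
H_star = np.diag([mu + 1.0, mu])            # Hessian at x*

def grad(x):
    return np.array([mu * x[0] + np.tanh(x[0]), mu * x[1]])

x0 = np.array([1.0, 1.0])
x_tilde, x_lin = x0.copy(), x0.copy()       # gradient iterates vs. linearized iterates
for i in range(1, 21):
    x_tilde = x_tilde - grad(x_tilde) / L               # \tilde{x}_i
    x_lin = x_lin - H_star @ (x_lin - x_star) / L       # x_i
    bound = i * M / (2 * L) * np.linalg.norm(x0 - x_star)**2
    assert np.linalg.norm(x_lin - x_tilde) <= bound
print("perturbation bound holds for the first 20 iterations")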

About this article

Cite this article

Scieur, D., d’Aspremont, A. & Bach, F. Regularized nonlinear acceleration. Math. Program. 179, 47–83 (2020). https://doi.org/10.1007/s10107-018-1319-8
