Abstract
We describe a convergence acceleration technique for unconstrained optimization problems. Our scheme computes estimates of the optimum from a nonlinear average of the iterates produced by any optimization method. The weights in this average are computed via a simple linear system, whose solution can be updated online. This acceleration scheme runs in parallel to the base algorithm, providing improved estimates of the solution on the fly, while the original optimization method is running. We detail numerical experiments on classical classification problems.
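For concreteness, here is a minimal sketch of the extrapolation step described above, using gradient descent as the base method. The function and variable names, the normalization of the regularizer, and the indexing of the averaged iterates are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def rna_extrapolate(xs, lam=1e-8):
    """Return a nonlinear average of the iterates xs[0], ..., xs[k].

    The weights solve a small regularized linear system built from the
    residuals xs[i+1] - xs[i] and are normalized to sum to one.
    """
    X = np.stack(xs, axis=1)          # d x (k+1) matrix of iterates
    R = X[:, 1:] - X[:, :-1]          # d x k matrix of residuals
    RtR = R.T @ R
    RtR /= np.linalg.norm(RtR, 2)     # makes the regularization scale-free
    k = RtR.shape[0]
    z = np.linalg.solve(RtR + lam * np.eye(k), np.ones(k))
    c = z / z.sum()                   # weights summing to one
    return X[:, :-1] @ c              # weighted average of the iterates

# Usage: accelerate gradient descent on a strongly convex least-squares problem.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
grad = lambda x: A.T @ (A @ x - b) + x   # gradient of (1/2)||Ax-b||^2 + (1/2)||x||^2
L = np.linalg.norm(A, 2) ** 2 + 1.0      # smoothness constant
x, xs = np.zeros(20), [np.zeros(20)]
for _ in range(10):
    x = x - grad(x) / L                  # base method keeps running unchanged
    xs.append(x.copy())
x_acc = rna_extrapolate(xs)              # improved estimate, computed on the fly
```

Since only \(R^\top R\) changes when a new iterate arrives (one extra row and column), the weights can be updated online while the base algorithm runs, as the abstract indicates.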
References
Aitken, A.C.: XXV.—On Bernoulli's numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh 46, 289–305 (1927)
Anderson, D.G.: Iterative procedures for nonlinear integral equations. J. ACM (JACM) 12(4), 547–560 (1965)
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
Ben-Tal, A., Nemirovski, A.: Lectures on modern convex optimization: analysis, algorithms, and engineering applications. SIAM (2001)
Brezinski, C.: Accélération de la convergence en analyse numérique, vol. 584. Springer, Berlin (2006)
Cabay, S., Jackson, L.: A polynomial extrapolation method for finding limits and antilimits of vector sequences. SIAM J. Numer. Anal. 13(5), 734–752 (1976)
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014)
Durbin, J.: The fitting of time-series models. Revue de l’Institut International de Statistique, pp. 233–244 (1960)
Eddy, R.: Extrapolating to the limit of a vector sequence. In: Information Linkage Between Applied Mathematics and Industry, pp. 387–396 (1979)
Golub, G.H., Varga, R.S.: Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods. Numerische Mathematik 3(1), 147–156 (1961)
Hardt, M.: The zen of gradient descent. Blog post (2013)
Hazan, E.: Personal communication (2014)
Heinig, G., Rost, K.: Fast algorithms for Toeplitz and Hankel matrices. Linear Algebra Appl. 435(1), 1–59 (2011)
Lasserre, J.B.: Global optimization with polynomials and the problem of moments. SIAM J. Optim. 11(3), 796–817 (2001)
Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)
Levinson, N.: The Wiener RMS error criterion in filter design and prediction. Appendix B of Wiener, N.: Extrapolation, Interpolation, and Smoothing of Stationary Time Series (1949)
Lin, H., Mairal, J., Harchaoui, Z.: A universal catalyst for first-order optimization. In: Advances in Neural Information Processing Systems, pp. 3384–3392 (2015)
Mešina, M.: Convergence acceleration for the iterative solution of the equations \(X = AX + f\). Comput. Methods Appl. Mech. Eng. 10(2), 165–173 (1977)
Nemirovskii, A., Nesterov, Y.E.: Optimal methods of smooth convex minimization. USSR Comput. Math. Math. Phys. 25(2), 21–30 (1985)
Nemirovskiy, A.S., Polyak, B.T.: Iterative methods for solving linear ill-posed problems under precise information. Eng. Cyber. 4, 50–56 (1984)
Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). In: Soviet Mathematics Doklady, vol. 27, pp. 372–376 (1983)
Nesterov, Y.: Squared functional systems and optimization problems. In: High performance optimization, pp. 405–440. Springer, Berlin (2000)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2013)
Nesterov, Y.: Universal gradient methods for convex optimization problems. Math. Program. 152(1–2), 381–404 (2015)
Parrilo, P.A.: Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization, Ph.D. thesis, California Institute of Technology (2000)
Shanks, D.: Non-linear transformations of divergent and slowly convergent sequences. Stud. Appl. Math. 34(1–4), 1–42 (1955)
Sidi, A., Ford, W.F., Smith, D.A.: Acceleration of convergence of vector sequences. SIAM J. Numer. Anal. 23(1), 178–196 (1986)
Smith, D.A., Ford, W.F., Sidi, A.: Extrapolation methods for vector sequences. SIAM Rev. 29(2), 199–233 (1987)
Su, W., Boyd, S., Candes, E.: A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. In: Advances in Neural Information Processing Systems, pp. 2510–2518 (2014)
Tyrtyshnikov, E.E.: How bad are Hankel matrices? Numerische Mathematik 67(2), 261–269 (1994)
Wibisono, A., Wilson, A.C.: On accelerated methods in optimization. arXiv preprint (2015). arXiv:1509.03616
Wynn, P.: On a device for computing the \(e_m(S_n)\) transformation. Mathematical Tables and Other Aids to Computation 10, 91–96 (1956)
Acknowledgements
AA is at the département d'informatique de l'ENS, École normale supérieure, UMR CNRS 8548, PSL Research University, 75005 Paris, France, and the INRIA Sierra project-team. The authors would like to acknowledge support from a starting grant from the European Research Council (ERC project SIPA), from the ITN MacSeNet (project number 642685), from the chaire Économie des nouvelles données (a data science joint research initiative with the fonds AXA pour la recherche), and from a Google focused award.
Additional information
A subset of these results appeared at the 2016 NIPS conference under the same title.
Appendix A: Missing propositions and proofs
A.1: Missing propositions
Proposition A.1
Consider the function
defined for \(x \in [0,\sqrt{a/\lambda }]\). Its maximum is attained at
and, provided \(x_{ \text {opt} } \in [0,\sqrt{a/\lambda }]\), the maximal value is
Proof
The (positive) root of the derivative of f is given by
Substituting this root into the function, we obtain its maximal value,
The simplification with \(\lambda \) in the last equality concludes the proof. \(\square \)
A.2: Proof of Proposition 3.8
First, we show that the choice \(\sigma = 1-\frac{\mu }{L}\) satisfies \(\Vert G\Vert = \Vert g'(x^*)\Vert \le \sigma \). Our fixed-point function g reads \(g(x) = x-\frac{1}{L}\nabla f(x)\).
Since \(g'(x) = I-\frac{1}{L}f''(x)\), we have \(g'(x^*) = I-\frac{1}{L}f''(x^*)\). Because f is \(\mu \)-strongly convex, \(f''(x)\succeq \mu I\), in particular at \(x=x^*\). In conclusion,
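since \(L\)-smoothness likewise gives \(f''(x^*)\preceq L I\), the eigenvalues of \(g'(x^*) = I-\frac{1}{L}f''(x^*)\) all lie in \([0,1-\frac{\mu }{L}]\), and a standard spectral bound yields
\[
\Vert g'(x^*)\Vert \;=\; \Bigl \Vert I-\tfrac{1}{L}f''(x^*)\Bigr \Vert \;\le \; 1-\frac{\mu }{L} \;=\; \sigma .
\]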
Now, consider the matrix \({\tilde{R}}\). Since the \(i{\text {th}}\) column \({\tilde{R}}_i\) is equal to \({\tilde{x}}_{i}-{\tilde{x}}_{i-1}\),
In the last inequality, we used the fact that the gradient of f is L-Lipschitz, so \(\Vert \nabla f(x)-\nabla f(x^*)\Vert \le L \Vert x-x^*\Vert \). It is also possible to prove [23] that the gradient method converges at the rate
It remains to link this quantity to \(\Vert {\tilde{R}}\Vert \),
We continue with \(\Vert \mathcal {E}\Vert \). We express \(\Vert \mathcal {E}_i\Vert = \Vert {\tilde{x}}_{i+1}-x_{i+1}\Vert _2\) as a function of \(\Vert {\tilde{x}}_0-x_0\Vert _2\) via a recursion on \(\Vert {\tilde{x}}_i-x_i\Vert _2\),
Since our function has a Lipschitz-continuous Hessian, a classical bound applies ([23], Lemma 1.2.4).
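Writing \(M\) for the Lipschitz constant of the Hessian (our notation here), that bound reads
\[
\Vert \nabla f(y)-\nabla f(x)-f''(x)(y-x)\Vert \;\le \; \frac{M}{2}\,\Vert y-x\Vert ^2 \qquad \text{for all } x,y.
\]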
We can thus bound the norm of the error at the \(i{\text {th}}\) iteration,
By equation (42), and because \(\Vert g'(x^*)\Vert \le \sigma \), we have
The simplification in the last line eases the computations that follow. We thus have the bound
Finally,
Despite the simplification made earlier, this bound is close to the one obtained without it.
Cite this article
Scieur, D., d’Aspremont, A. & Bach, F. Regularized nonlinear acceleration. Math. Program. 179, 47–83 (2020). https://doi.org/10.1007/s10107-018-1319-8