
On the Proximal Gradient Algorithm with Alternated Inertia

Journal of Optimization Theory and Applications

Abstract

In this paper, we investigate attractive properties of the proximal gradient algorithm with inertia. Notably, we show that using alternated inertia yields monotonically decreasing functional values, which contrasts with usual accelerated proximal gradient methods. We also provide convergence rates for the algorithm with alternated inertia, based on local geometric properties of the objective function. The results are put into perspective by discussions on several extensions (strongly convex case, non-convex case, and alternated extrapolation) and illustrations on common regularized optimization problems.
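As an illustration of the kind of scheme studied in the paper (not the authors' reference implementation), the following is a minimal Python sketch of a proximal gradient method in which inertial extrapolation is applied only on alternate iterations. The names `grad_f`, `prox_g`, the constant inertial parameter `alpha`, and the step size `gamma` are illustrative placeholders; the precise update rule and parameter conditions analysed in the paper are those given in its main text.

```python
import numpy as np

def alternated_inertia_prox_grad(x0, grad_f, prox_g, gamma, alpha=0.5, n_iter=200):
    """Proximal gradient iterations with inertia applied every other iteration.

    grad_f(x): gradient of the smooth part f (assumed L-smooth, gamma <= 1/L).
    prox_g(z, gamma): proximal operator of the non-smooth part g.
    alpha: illustrative constant inertial parameter (a placeholder choice).
    """
    x_prev = np.asarray(x0, dtype=float)
    x = prox_g(x_prev - gamma * grad_f(x_prev), gamma)   # plain first step
    for k in range(1, n_iter):
        # extrapolate on every other iteration only ("alternated inertia")
        y = x + alpha * (x - x_prev) if k % 2 == 0 else x
        x_prev, x = x, prox_g(y - gamma * grad_f(y), gamma)
    return x

# Usage on an illustrative l1-regularized least-squares problem
rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((20, 50)), rng.standard_normal(20), 0.1
L = np.linalg.norm(A, 2) ** 2                            # Lipschitz constant of grad f
soft = lambda z, g: np.sign(z) * np.maximum(np.abs(z) - lam * g, 0.0)
x_hat = alternated_inertia_prox_grad(np.zeros(50), lambda x: A.T @ (A @ x - b),
                                     soft, gamma=1.0 / L)
```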


Notes

  1. For a non-smooth (possibly non-convex) function \(\varPhi :\mathbb {R}^n\rightarrow \mathbb {R}\), we denote by \(\partial \varPhi (x)\) the limiting (Fréchet) subdifferential at x [26]. If \(\varPhi \) is convex, this subdifferential coincides with the standard convex subdifferential.

  2. \(a_k = \varOmega (b_k) \) if there exist \(a>0\) and \(K\) such that \(\forall k\ge K\) we have \(a_k\ge a\, b_k\).

  3. In the case of the proximal gradient, the proof of Lemma 2.2 recalled in “Appendix” requires (i) convexity of f in order to take \(x\ne y\) [see Eq. (16)]; (ii) convexity of g to get the term in \(\Vert \mathsf {T}_\gamma (x) - y\Vert \) by strong convexity of the proximal surrogate [see Eq. (14)].

  4. https://archive.ics.uci.edu/ml/datasets/ionosphere.

References

  1. Chambolle, A., De Vore, R.A., Lee, N.Y., Lucier, B.J.: Nonlinear wavelet image processing: variational problems, compression, and noise removal through wavelet shrinkage. IEEE Trans. Image Process. 7(3), 319–335 (1998)

  2. Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Commun. Pure Appl. Math. 57(11), 1413–1457 (2004)

  3. Hale, E.T., Yin, W., Zhang, Y.: Fixed-point continuation for \(\ell _{1}\)-minimization: methodology and convergence. SIAM J. Optim. 19(3), 1107–1130 (2008)

  4. Alvarez, F.: Weak convergence of a relaxed and inertial hybrid projection-proximal point algorithm for maximal monotone operators in Hilbert space. SIAM J. Optim. 14(3), 773–782 (2004)

  5. Lorenz, D., Pock, T.: An inertial forward–backward algorithm for monotone inclusions. J. Math. Imag. Vis. 51(2), 311–325 (2014)

  6. Chambolle, A., Dossal, C.: On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. J. Optim. Theory Appl. 166(3), 968–982 (2015)

  7. Aujol, J.F., Dossal, C.: Stability of over-relaxations for the forward-backward algorithm, application to FISTA. SIAM J. Optim. 25(4), 2408–2433 (2015)

  8. Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov’s accelerated forward-backward method is actually faster than \(1/k^2\). SIAM J. Optim. 26(3), 1824–1834 (2016)

  9. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 27(2), 372–376 (1983)

  10. Güler, O.: New proximal point algorithms for convex minimization. SIAM J. Optim. 2(4), 649–664 (1992)

  11. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci. 2(1), 183–202 (2009)

  12. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)

  13. Maingé, P.E.: Convergence theorems for inertial KM-type algorithms. J. Comput. Appl. Math. 219(1), 223–236 (2008)

  14. Beck, A., Teboulle, M.: Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Process. 18(11), 2419–2434 (2009)

  15. Malitsky, Y., Pock, T.: A first-order primal-dual algorithm with linesearch. arXiv preprint arXiv:1608.08883 (2016)

  16. Correa, R., Lemaréchal, C.: Convergence of some algorithms for convex minimization. Math. Program. 62(2), 261–275 (1993)

  17. Bioucas-Dias, J.M., Figueiredo, M.A.: A new TwIST: two-step iterative shrinkage/thresholding algorithms for image restoration. IEEE Trans. Image Process. 16(12), 2992–3004 (2007)

  18. Fuentes, M., Malick, J., Lemaréchal, C.: Descentwise inexact proximal algorithms for smooth optimization. Comput. Optim. Appl. 53(3), 755–769 (2012)

  19. Li, H., Lin, Z.: Accelerated proximal gradient methods for nonconvex programming. In: Advances in Neural Information Processing Systems, pp. 379–387 (2015)

  20. Mu, Z., Peng, Y.: A note on the inertial proximal point method. Stat. Optim. Inf. Comput. 3(3), 241–248 (2015)

  21. Iutzeler, F., Hendrickx, J.M.: A generic linear rate acceleration of optimization algorithms via relaxation and inertia. arXiv preprint arXiv:1603.05398 (2016)

  22. Hiriart-Urruty, J.B., Lemaréchal, C.: Convex Analysis and Minimization Algorithms, vol. 2. Springer, Heidelberg (1993)

  23. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, Berlin (2011)

  24. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165(2), 471–507 (2017)

  25. Nguyen, T.P.: Kurdyka–Łojasiewicz and convexity: algorithms and applications. Ph.D. thesis, Toulouse University (2017)

  26. Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer, Berlin (1998)

  27. Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)

  28. Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116(1), 5–16 (2009)

  29. Frankel, P., Garrigos, G., Peypouquet, J.: Splitting methods with variable metric for Kurdyka–Łojasiewicz functions and general convergence rates. J. Optim. Theory Appl. 165(3), 874–900 (2015)

  30. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811 (2016)

  31. Ochs, P., Chen, Y., Brox, T., Pock, T.: iPiano: inertial proximal algorithm for nonconvex optimization. SIAM J. Imag. Sci. 7(2), 1388–1419 (2014)

  32. Liang, J., Fadili, J., Peyré, G.: A multi-step inertial forward-backward splitting method for non-convex optimization. In: Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R. (eds.) Advances in Neural Information Processing Systems, pp. 4035–4043. Curran Associates, Inc. (2016)

  33. Chartrand, R., Yin, W.: Nonconvex sparse regularization and splitting algorithms. In: Glowinski, R., Osher, S.J., Yin, W. (eds.) Splitting Methods in Communication, Imaging, Science, and Engineering, pp. 237–249. Springer (2016)

Acknowledgements

The work of the authors is partly supported by the PGMO Grant Advanced Non-smooth Optimization Methods for Stochastic Programming.

Author information

Corresponding author

Correspondence to Franck Iutzeler.

Additional information

Communicated by Marc Teboulle.

Appendix

For the sake of completeness, we provide short and direct proofs of known lemmas recalled in Sect. 2.2.

Proof of Lemma 2.2

Let \(x\in \mathbb {R}^n\); by definition, we have

$$\begin{aligned} \mathsf {T}_\gamma (x)&= \mathop {\hbox {argmin}}\limits _w \left( \gamma g(w) + \frac{1}{2} \left\| w- \left( x - \gamma \nabla f(x) \right) \right\| ^2 \right) \\&= \mathop {\hbox {argmin}}\limits _w \left( \underbrace{ f(x) + g(w) + \langle w-x ; \nabla f(x) \rangle + \frac{1}{2\gamma } \left\| w- x \right\| ^2 }_{s_x(w)} \right) \end{aligned}$$

and, since \(\mathsf {T}_\gamma (x)\) is the minimizer of the \(\frac{1}{\gamma }\)-strongly convex surrogate function \(s_x\), we have for any \(y\in \mathbb {R}^n\) that \(s_x ( \mathsf {T}_\gamma (x)) + \frac{1}{2\gamma } \Vert \mathsf {T}_\gamma (x) - y\Vert ^2 \le s_x (y)\), so

$$\begin{aligned}&f(x) + g(\mathsf {T}_\gamma (x) ) + \langle \mathsf {T}_\gamma (x) -x ; \nabla f(x) \rangle + \frac{\left\| \mathsf {T}_\gamma (x) - x \right\| ^2 }{2\gamma } + \frac{\left\| \mathsf {T}_\gamma (x) - y \right\| ^2 }{2\gamma } \nonumber \\&\quad \le f(x) + g(y) + \langle y-x ; \nabla f(x) \rangle + \frac{\left\| y- x \right\| ^2 }{2\gamma }. \end{aligned}$$
(14)

Now we use (i) the descent lemma for the \(L\)-smooth function \(f\) (see [23, Th. 18.15]) to show that

$$\begin{aligned} f( \mathsf {T}_\gamma (x) ) \le f(x) + \langle \mathsf {T}_\gamma (x) -x ; \nabla f(x) \rangle + \frac{L}{2} \left\| \mathsf {T}_\gamma (x) - x \right\| ^2 \end{aligned}$$
(15)

and (ii) the convexity of \(f\) to obtain

$$\begin{aligned} f(x) + \langle y-x ; \nabla f(x) \rangle \le f(y) . \end{aligned}$$
(16)

Using Eq. (15) on the left-hand side of (14) and Eq. (16) on the right-hand side, we get

$$\begin{aligned}&f(\mathsf {T}_\gamma (x)) + g(\mathsf {T}_\gamma (x) ) + \frac{ (1-\gamma L) \left\| \mathsf {T}_\gamma (x) - x \right\| ^2 }{2\gamma } + \frac{\left\| \mathsf {T}_\gamma (x) - y \right\| ^2 }{2\gamma } \\&\quad \le f(y) + g(y) + \frac{\left\| y- x \right\| ^2 }{2\gamma }. \end{aligned}$$

\(\square \)
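As a sanity check, this last inequality can be verified numerically on an \(\ell _1\)-regularized least-squares instance; the data, the regularization weight `lam`, and the step size below are illustrative placeholders, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, lam = 50, 20, 0.1
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)

f      = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad_f = lambda x: A.T @ (A @ x - b)
F      = lambda x: f(x) + lam * np.sum(np.abs(x))        # F = f + g, g = lam*||.||_1
L      = np.linalg.norm(A, 2) ** 2                       # Lipschitz constant of grad f
gamma  = 0.5 / L

def T(x):
    """Proximal gradient operator: prox_{gamma*g}(x - gamma*grad f(x))."""
    z = x - gamma * grad_f(x)
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

x, y = rng.standard_normal(n), rng.standard_normal(n)
Tx = T(x)
lhs = F(Tx) + (1 - gamma * L) / (2 * gamma) * np.sum((Tx - x) ** 2) \
            + np.sum((Tx - y) ** 2) / (2 * gamma)
rhs = F(y) + np.sum((y - x) ** 2) / (2 * gamma)
assert lhs <= rhs + 1e-9                                 # inequality of Lemma 2.2
```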

Proof of Lemma 2.3

Let \(x\in \mathbb {R}^n\), and let \(y = \mathsf {T}_\gamma (x) \in {{\mathrm{argmin}}}_w \left( \gamma g(w) + \frac{1}{2} \left\| w- \left( x - \gamma \nabla f(x) \right) \right\| ^2 \right) \), then

$$\begin{aligned}&0 \in \gamma \partial g(y) + y - x + \gamma \nabla f(x) ~~~~ \Leftrightarrow ~~~~ 0 \in \nabla f(y) + \partial g(y) + \nabla f(x) \\&\quad - \nabla f(y) + \frac{1}{\gamma }(y-x) \end{aligned}$$

so \( \nabla f(y) - \nabla f(x) + \frac{1}{\gamma }(x-y) \in \partial F(y)\); thus, by the triangle inequality and the \(L\)-Lipschitz continuity of \(\nabla f\), we have \( {{\mathrm{dist}}}(0,\partial F(y)) \le \Vert \nabla f(y) - \nabla f(x) + \frac{1}{\gamma }(x-y) \Vert \le \left( L + \frac{1}{\gamma }\right) \Vert x-y\Vert \). \(\square \)
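The particular subgradient exhibited in this proof, and the resulting bound, can also be checked numerically for the \(\ell _1\) case (soft-thresholding prox); the data and parameters below are again illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, lam = 50, 20, 0.1
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
grad_f = lambda x: A.T @ (A @ x - b)
L = np.linalg.norm(A, 2) ** 2
gamma = 0.5 / L

x = rng.standard_normal(n)
z = x - gamma * grad_f(x)
y = np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)    # y = T_gamma(x)

# Element of the subdifferential of F at y exhibited in the proof:
v = grad_f(y) - grad_f(x) + (x - y) / gamma
# Its g-part u must lie, coordinate-wise, in lam * (subdifferential of ||.||_1 at y)
u = (x - y) / gamma - grad_f(x)
assert np.all(np.abs(u[y == 0]) <= lam + 1e-10)
assert np.allclose(u[y != 0], lam * np.sign(y[y != 0]))
# Bound of Lemma 2.3: dist(0, dF(y)) <= (L + 1/gamma) * ||x - y||
assert np.linalg.norm(v) <= (L + 1 / gamma) * np.linalg.norm(x - y) + 1e-9
```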

Lemma

(Non-convex version of Lemma 2.2) Let Assumption 1 hold but with g possibly non-convex, and take \(\gamma >0\). Then, for any \(x,y\in \mathbb {R}^n\),

$$\begin{aligned} F( \mathsf {T}_\gamma (x) ) + \frac{ (1-\gamma L)}{2\gamma } \left\| \mathsf {T}_\gamma (x) - x \right\| ^2 \le F(y) + \frac{1}{2\gamma } \left\| x - y \right\| ^2. \end{aligned}$$

Proof

Let \(x\in \mathbb {R}^n\); by definition, we have, as in the proof of Lemma 2.2,

$$\begin{aligned} \mathsf {T}_\gamma (x)&= \mathop {\hbox {argmin}}\limits _w \left( \gamma g(w) + \frac{1}{2} \left\| w- \left( x - \gamma \nabla f(x) \right) \right\| ^2 \right) \\&= \mathop {\hbox {argmin}}\limits _w \left( \underbrace{ f(x) + g(w) + \langle w-x ; \nabla f(x) \rangle + \frac{1}{2\gamma } \left\| w- x \right\| ^2 }_{s_x(w)} \right) \end{aligned}$$

and, since \(\mathsf {T}_\gamma (x)\) is a minimizer of the (not necessarily convex) surrogate function \(s_x\), we have for any \(y\in \mathbb {R}^n\) that \(s_x ( \mathsf {T}_\gamma (x)) \le s_x (y)\) (in contrast with the convex case of Lemma 2.2), thus

$$\begin{aligned}&f(x) + g(\mathsf {T}_\gamma (x) ) + \langle \mathsf {T}_\gamma (x) -x ; \nabla f(x) \rangle + \frac{\left\| \mathsf {T}_\gamma (x) - x \right\| ^2 }{2\gamma } \nonumber \\&\quad \le f(x) + g(y) + \langle y-x ; \nabla f(x) \rangle + \frac{\left\| y- x \right\| ^2 }{2\gamma }. \end{aligned}$$
(17)

The proof then follows the same lines as that of Lemma 2.2.\(\square \)
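This non-convex variant can be illustrated in the same way by replacing the \(\ell _1\) penalty with a non-convex \(\ell _0\) penalty, whose proximal operator is hard thresholding; the data and parameters below are once more illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, lam = 50, 20, 0.1
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
f      = lambda x: 0.5 * np.sum((A @ x - b) ** 2)        # convex, L-smooth
g      = lambda x: lam * np.count_nonzero(x)             # non-convex l0 penalty
F      = lambda x: f(x) + g(x)
grad_f = lambda x: A.T @ (A @ x - b)
L      = np.linalg.norm(A, 2) ** 2
gamma  = 0.5 / L

def T(x):
    """prox of gamma*lam*||.||_0 is hard thresholding at sqrt(2*gamma*lam)."""
    z = x - gamma * grad_f(x)
    return np.where(np.abs(z) > np.sqrt(2 * gamma * lam), z, 0.0)

x, y = rng.standard_normal(n), rng.standard_normal(n)
Tx = T(x)
lhs = F(Tx) + (1 - gamma * L) / (2 * gamma) * np.sum((Tx - x) ** 2)
rhs = F(y) + np.sum((x - y) ** 2) / (2 * gamma)
assert lhs <= rhs + 1e-9                                 # non-convex version of Lemma 2.2
```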


About this article

Cite this article

Iutzeler, F., Malick, J. On the Proximal Gradient Algorithm with Alternated Inertia. J Optim Theory Appl 176, 688–710 (2018). https://doi.org/10.1007/s10957-018-1226-4

