# Proximal Gradient Methods for Machine Learning and Imaging

Part of the Applied and Numerical Harmonic Analysis book series (ANHA)

## Abstract

Convex optimization plays a key role in data science. The objective of this work is to provide the basic tools and methods at the core of modern nonlinear convex optimization. Starting from the gradient descent method, we focus on a comprehensive convergence analysis of the proximal gradient algorithm and its state-of-the-art variants, including accelerated, stochastic, and block-wise implementations, which are nowadays very popular techniques for solving machine learning and inverse problems.
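The proximal gradient iteration alternates a gradient step on the smooth part of the objective with a proximity-operator step on the nonsmooth part. The following is a minimal illustrative sketch, not code from the chapter: ISTA applied to the lasso problem $$\min_x \tfrac12\|Ax-b\|^2 + \lambda \|x\|_1$$, where the proximity operator of $$\lambda\|\cdot\|_1$$ is component-wise soft-thresholding. The function names and the fixed step size $$1/L$$ are illustrative choices.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximity operator of tau * ||.||_1 (component-wise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient_lasso(A, b, lam, n_iter=500):
    """Proximal gradient (ISTA) for min_x 0.5*||Ax - b||^2 + lam*||x||_1.

    Uses the constant step size 1/L, where L = ||A^T A||_2 is the Lipschitz
    constant of the gradient of the smooth term.
    """
    L = np.linalg.norm(A.T @ A, 2)
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                    # forward (gradient) step
        x = soft_threshold(x - grad / L, lam / L)   # backward (proximal) step
    return x
```

Accelerated (FISTA-type), stochastic, and block-coordinate variants discussed in the chapter modify the forward step (momentum, sampled gradients, or per-block updates) while keeping the same proximal structure.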

## Notes

1. Note that if $$\inf \Phi = -\infty$$, it follows from (18) that $$\inf \Phi = \sup (-\Psi) = -\inf \Psi = -\infty$$. In this case, $$\Psi \equiv +\infty$$ and the sum $$\inf \Phi + \inf \Psi = -\infty + \infty$$ is not defined. However, since there is no gap between $$\Phi$$ and $$-\Psi$$, by convention we set $$\inf \Phi + \inf \Psi = 0$$. The same situation occurs if $$\inf \Psi = -\infty$$.

## References

1. Alvarez, F., Attouch, H.: An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping. Set-Valued Anal. 9, 3–11 (2001)

2. Atchadé, Y.F., Fort, G., Moulines, E.: On perturbed proximal gradient algorithms. J. Mach. Learn. Res. 18, 1–33 (2017)

3. Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Progr. 116, 5–16 (2009)

4. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality. Math. Oper. Res. 35, 438–457 (2010)

5. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods. Math. Progr. 137, 91–129 (2013)

6. Attouch, H., Chbani, Z., Peypouquet, J., Redont, P.: Fast convergence of inertial dynamics and algorithms with asymptotic vanishing viscosity. Math. Prog. Ser. B 168, 123–175 (2018)

7. Aujol, J.-F., Dossal, C., Rondepierre, A.: Optimal convergence rates for Nesterov Acceleration. SIAM J. Optim. 29, 3131–3153 (2019)

8. Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with Sparsity-Inducing Penalties. Optim. Mach. Learn. 5, 19–53 (2011)

9. Baillon, J.B., Bruck, R.E., Reich, S.: On the asymptotic behavior of nonexpansive mappings and semigroups in Banach spaces. Houston J. Math. 4, 1–9 (1978)

10. Barbu, V., Precupanu, T.: Convexity and Optimization in Banach Spaces. Springer, Dordrecht (2012)

11. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer, New York (2017)

12. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2, 183–202 (2009)

13. Beck, A., Teboulle, M.: Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Image Process. 18, 2419–2434 (2009)

14. Bolte, J., Daniilidis, A., Lewis, A.: The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17, 1205–1223 (2006)

15. Beck, A., Teboulle, M.: A fast dual proximal gradient algorithm for convex minimization and applications. Oper. Res. Lett. 42, 1–6 (2014)

16. Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18, 556–572 (2007)

17. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165, 471–507 (2017)

18. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Prog. 146, 459–494 (2013)

19. Borwein, J.M., Vanderwerff, J.D.: Convex Functions: Constructions, Characterizations and Counterexamples. Cambridge University Press, Cambridge (2010)

20. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory—COLT ’92, p. 144 (1992)

21. Bottou, L., Bousquet, O.: The tradeoffs of large-scale learning. In: Optimization for Machine Learning, pp. 351–368, The MIT Press, Cambridge (2012)

22. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60, 223–311 (2018)

23. Bourbaki, N.: General Topology, 2nd edn. Springer, New York (1989)

24. Bredies, K.: A forward-backward splitting algorithm for the minimization of non-smooth convex functionals in Banach space. Inv. Prob. 25, Art. 015005 (2009)

25. Browder, F.E., Petryshyn, W.V.: The solution by iteration of nonlinear functional equations in Banach spaces. Bull. Am. Math. Soc. 72, 571–575 (1966)

26. Browder, F.E., Petryshyn, W.V.: Construction of fixed points of nonlinear mappings in Hilbert space. J. Math. Anal. Appl. 20, 197–228 (1967)

27. Burke, J.V., Ferris, M.C.: Weak sharp minima in mathematical programming. SIAM J. Control Optim. 31, 1340–1359 (1993)

28. Chambolle, A.: An algorithm for total variation minimization and applications. J. Math. Imaging Vis. 20, 89–97 (2004)

29. Chambolle, A., Dossal, C.: On the convergence of the iterates of the “Fast Iterative Shrinkage/Thresholding Algorithm”. J. Optim. Theory Appl. 166, 968–982 (2015)

30. Chambolle, A., Lions, P.-L.: Image restoration by constrained total variation minimization and variants. In: Investigative and Trial Image Processing, San Diego, CA (SPIE), vol. 2567, pp. 50–59 (1995)

31. Chambolle, A., Lions, P.-L.: Image recovery via total variation minimization and related problems. Numer. Math. 76, 167–188 (1997)

32. Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta Numerica 25, 161–319 (2016)

33. Chan, T.F., Golub, G.H., Mulet, P.: A nonlinear primal-dual method for total variation-based image restoration. SIAM J. Sci. Comput. 20, 1964–1977 (1999)

34. Combettes, P.L., Pesquet, J.-C.: Proximal splitting methods in signal processing, In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 185–212. Springer, New York, NY (2011)

35. Combettes, P.L., Pesquet, J.-C.: Stochastic quasi-Fejér block-coordinate fixed point iterations with random sweeping. SIAM J. Optim. 25, 1221–1248 (2015)

36. Combettes, P.L., Pesquet, J.-C.: Proximal thresholding algorithms for minimization over orthonormal bases. SIAM J. Optim. 18, 1351–1376 (2007)

37. Combettes, P.L., Vũ, B.C.: Dualization of signal recovery problems. Set-Valued Anal. 18, 373–404 (2010)

38. Combettes, P.L., Yamada, I.: Compositions and convex combinations of averaged nonexpansive operators. J. Math. Anal. Appl. 425, 55–70 (2015)

39. Combettes, P.L., Wajs, V.: Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul. 4, 1168–1200 (2005)

40. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20, 273–297 (1995)

41. Daubechies, I., Defrise, M., De Mol, C.: An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math. 57, 1413–1457 (2004)

42. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, vol. 27 (2014)

43. Dotson, W.G.: On the Mann iterative process. Trans. Am. Math. Soc. 149, 65–73 (1970)

44. Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2899–2934 (2009)

45. Dünner, C., Forte, S., Takac, M., Jaggi, M.: Primal-dual rates and certificates. In: Proceedings of The 33rd International Conference on Machine Learning, PMLR, vol. 48, pp. 783–792 (2016)

46. Ekeland, I., Témam, R.: Convex Analysis and Variational Problems. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA (1999)

47. Ermoliev, Yu.M.: On the method of generalized stochastic gradients and quasi-Fejér sequences. Cybernetics 5, 208–220 (1969)

48. Fenchel, W.: Convex Cones, Sets, and Functions. Princeton University (1953)

49. Foucart, S., Rauhut, H.: A Mathematical Introduction to Compressive Sensing. Birkhäuser, New York (2013)

50. Frankel, P., Garrigos, G., Peypouquet, J.: Splitting methods with variable metric for Kurdyka-Łojasiewicz functions and general convergence rates. J. Optim. Theory Appl. 165, 874–900 (2015)

51. Gabay, D.: Applications of the method of multipliers to variational inequalities. In: Fortin, M., Glowinski, R. (eds.) Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems, North-Holland, Amsterdam, vol. 15, pp. 299–331 (1983)

52. Garrigos, G., Rosasco, L., Villa, S.: Convergence of the Forward-Backward Algorithm: Beyond the Worst Case with the Help of Geometry (2017). https://arxiv.org/abs/1703.09477

53. Goldstein, A.A.: Convex programming in Hilbert space. Bull. Am. Math. Soc. 70, 709–710 (1964)

54. Groetsch, C.W.: A note on segmenting Mann iterates. J. Math. Anal. Appl. 40, 369–372 (1972)

55. Güler, O.: New proximal point algorithms for convex minimization. SIAM J. Optim. 2, 649–664 (1992)

56. Blatt, D., Hero, A., Gauchman, H.: A convergent incremental gradient method with a constant step size. SIAM J. Optim. 18, 29–51 (2007)

57. Hiriart-Urruty, J.-B., Lemaréchal, C.: Fundamentals of Convex Analysis. Springer, Berlin (2001)

58. Jensen, J.L.W.V.: Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Math. 30, 175–193 (1906)

59. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 26, 315–323 (2013)

60. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In: Frasconi, P., Landwehr, N., Manco, G., Vreeken, J. (eds.) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science, vol. 9851. Springer, Cham (2016)

61. Kiefer, J., Wolfowitz, J.: Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 23, 462–466 (1952)

62. Kingma, D.P., Ba, L.J.: Adam: a method for stochastic optimization. In: Proceedings of Conference on Learning Representations (ICLR), San Diego (2015)

63. Krasnosel'skii, M.A.: Two remarks on the method of successive approximations. Uspekhi Mat. Nauk 10, 123–127 (1955)

64. Levitin, E.S., Polyak, B.T.: Constrained minimization methods. U.S.S.R. Comput. Math. Math. Phys. 6, 1–50 (1966)

65. Li, W.: Error bounds for piecewise convex quadratic programs and applications. SIAM J. Control Optim 33, 1510–1529 (1995)

66. Li, G.: Global error bounds for piecewise convex polynomials. Math. Prog. Ser. A 137, 37–64 (2013)

67. Lions, P.-L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16, 964–979 (1979)

68. Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46, 157–178 (1993)

69. Luque, F.: Asymptotic convergence analysis of the proximal point algorithm. SIAM J. Control Optim. 22, 277–293 (1984)

70. Mann, W.R.: Mean value methods in iteration. Proc. Am. Math. Soc. 4, 506–510 (1953)

71. Martinet, B.: Régularisation d'inéquations variationnelles par approximations successives. Rev. Française Informat. Recherche Opérationnelle 4, Sér. R-3, 154–158 (1970)

72. Mercier, B.: Inéquations Variationnelles de la Mécanique. No. 80.01 in Publications Mathématiques d’Orsay. Université de Paris-XI, Orsay, France (1980)

73. Minkowski, H.: Theorie der konvexen Körper, insbesondere Begründung ihres Oberflächenbegriffs. In: Hilbert, D. (ed.) Gesammelte abhandlungen von Hermann Minkowski [Collected Papers of Hermann Minkowski], vol. 2, pp. 131–229. B.G. Teubner, Leipzig (1911)

74. Moreau, J.J.: Fonctions convexes duales et points proximaux dans un espace hilbertien, C. R. Acad. Sci. Paris Ser. A Math. 255, 2897–2899 (1962)

75. Moreau, J.J.: Propriétés des applications “prox”, C. R. Acad. Sci. Paris Ser. A Math. 256, 1069–1071 (1963)

76. Moreau, J.J.: Proximité et dualité dans un espace Hilbertien. Bull. de la Société Mathématique de France 93, 273–299 (1965)

77. Mosci, S., Rosasco, L., Santoro, M., Verri, A., Villa, S.: Solving structured sparsity regularization with proximal methods. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 418–433. Springer, Berlin, Heidelberg (2010)

78. Necoara, I., Clipici, D.: Parallel random coordinate descent method for composite minimization: convergence analysis and error bounds. SIAM J. Optim. 26, 197–226 (2016)

79. Necoara, I., Nesterov, Y., Glineur, F.: Random block coordinate descent methods for linearly constrained optimization over networks. J. Optim. Theory Appl. 173, 227–254 (2017)

80. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009)

81. Nemirovski, A.S., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience, New York (1983)

82. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, London (2004)

83. Nesterov, Y.: A method for solving the convex programming problem with convergence rate $$O(1/k^2)$$. Dokl. Akad. Nauk SSSR 269, 543–547 (1983)

84. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22, 341–362 (2012)

85. Opial, Z.: Weak convergence of the sequence of successive approximations for nonexpansive mappings. Bull. Am. Math. Soc. 73, 591–597 (1967)

86. Osher, S., Burger, M., Goldfarb, D., Xu, J., Yin, W.: An iterative regularization method for total variation-based image restoration. Multiscale Model. Sim. 4, 460–489 (2005)

87. Passty, G.B.: Ergodic convergence of a zero of the sum of monotone operators in Hilbert space. J. Math. Anal. Appl. 72, 383–390 (1979)

88. Peypouquet, J.: Convex Optimization in Normed Spaces. Springer, Cham (2015)

89. Phelps, R.R.: Convex Functions, Monotone Operators and Differentiability. Springer, Berlin (1993)

90. Polyak, B.T.: A general method for solving extremal problems. Dokl. Akad. Nauk SSSR 174, 33–36 (1967)

91. Polyak, B.T.: Gradient methods for minimizing functionals. Zh. Vychisl. Mat. Mat. Fiz. 3, 643–653 (1963)

92. Polyak, B.T.: Subgradient methods: a survey of Soviet research. In: Lemaréchal, C.L., Mifflin, R. (eds.) Proceedings of a IIASA Workshop, Nonsmooth Optimization, pp. 5–28. Pergamon Press, New York (1977)

93. Polyak, B.T.: Introduction to Optimization. Optimization Software, Inc. (1987)

94. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Program. Ser. A 156, 433–484 (2016)

95. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

96. Robbins, H., Siegmund, D.: A convergence theorem for non negative almost supermartingales and some applications. In: Optimizing Methods in Statistics, pp. 233–257. Academic Press (1971)

97. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14, 877–898 (1976)

98. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)

99. Rockafellar, R.T.: Conjugate Duality and Optimization. Society for Industrial and Applied Mathematics, Philadelphia (1974)

100. Rosasco, L., Villa, S., Vũ, B.C.: Convergence of stochastic proximal gradient method. Appl. Math. Optim. 82, 891–917 (2020)

101. Rudin, L.I., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Physica D 60, 259–268 (1992)

102. Salzo, S.: The variable metric forward-backward splitting algorithm under mild differentiability assumptions. SIAM J. Optim. 27(4), 2153–2181 (2017)

103. Salzo, S., Villa, S.: Parallel random block-coordinate forward-backward algorithm: a unified convergence analysis. Math. Program. Ser. A. https://doi.org/10.1007/s10107-020-01602-1

104. Schaefer, H.: Über die Methode sukzessiver Approximationen. Jber. Deutsch. Math.-Verein. 59, 131–140 (1957)

105. Shamir, O., Zhang, T.: Stochastic gradient descent for non-smooth optimization: convergence results and optimal averaging schemes. In: Proceedings of the 30th International Conference on Machine Learning, pp. 71–79 (2013)

106. Shalev-Shwartz, S., Singer, Y., Srebro, N., Cotter, A.: Pegasos: primal estimated sub-gradient solver for SVM. Math. Program. 127, 3–30 (2011)

107. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)

108. Shor, N.: Minimization Methods for Non-differentiable Functions. Springer, New York (1985)

109. Sibony, M.: Méthodes itératives pour les équations et inéquations aux dérivées partielles non linéaires de type monotone. Calcolo 7, 65–183 (1970)

110. Steinwart, I., Christmann, A.: Support Vector Machines. Springer, New York (2008)

111. Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17, 1–43 (2016)

112. Tseng, P.: Applications of a splitting algorithm to decomposition in convex programming and variational inequalities. SIAM J. Control Optim. 29, 119–138 (1991)

113. Wolfe, P.: A method of conjugate subgradients for minimizing nondifferentiable functions. Nondifferentiable optimization. Math. Program. Stud. 3, 145–173 (1975)

114. Wright, S.: Coordinate descent algorithms. Math. Program. 151, 3–34 (2015)

115. Zălinescu, C.: Convex Analysis in General Vector Spaces. World Scientific Publishing Co. Inc, River Edge, NJ (2002)

116. Zhang, X., Burger, M., Bresson, X., Osher, S.: Bregmanized nonlocal regularization for deconvolution and sparse reconstruction. SIAM J. Imaging Sci. 3, 253–276 (2010)

## Acknowledgements

The work of S. Villa has been supported by the ITN-ETN project TraDE-OPT funded by the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska–Curie grant agreement No 861137 and by the project “Processi evolutivi con memoria descrivibili tramite equazioni integro-differenziali” funded by Gruppo Nazionale per l’ Analisi Matematica, la Probabilità e le loro Applicazioni (GNAMPA) of the Istituto Nazionale di Alta Matematica (INdAM).

## Author information

### Corresponding author

Correspondence to Saverio Salzo.
