Abstract
In this paper, we consider stochastic composite convex optimization problems in which the objective function satisfies a stochastic bounded gradient condition, with or without a quadratic functional growth property. These models cover the best-known classes of objective functions analyzed in the literature: nonsmooth Lipschitz functions and compositions of a (potentially) nonsmooth function with a smooth function, with or without strong convexity. Based on the flexibility offered by our optimization model, we consider several variants of stochastic first-order methods, such as the stochastic proximal gradient and the stochastic proximal point algorithms. Usually, the convergence theory for these methods has been derived for simple stochastic optimization models satisfying restrictive assumptions, and the rates are in general sublinear and hold only for specific decreasing stepsizes. Hence, we analyze the convergence rates of stochastic first-order methods with constant or variable stepsize under general assumptions covering a large class of objective functions. For constant stepsize, we show that these methods achieve a linear convergence rate up to a constant proportional to the stepsize and, under a strong stochastic bounded gradient condition, even pure linear convergence. Moreover, when a variable stepsize is chosen, we derive sublinear convergence rates for these stochastic first-order methods. Finally, the stochastic gradient mapping and the Moreau smoothing mapping introduced in the present paper lead to simple and intuitive proofs.
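As a toy illustration of one of the method classes analyzed (a stochastic proximal gradient iteration with constant stepsize), the following sketch applies the iteration x ← prox_{αg}(x − α∇f(x,ξ)) to a least-squares loss with an ℓ1 regularizer. The data model, stepsize, and problem sizes here are our own illustrative choices, not the paper's setup.

```python
import random

# Stochastic proximal gradient with constant stepsize (illustrative sketch):
# at each step draw one sample i, take a gradient step on f(., i), then
# apply the proximal operator of the regularizer g = lam * |.|.

def soft_threshold(v, t):
    # proximal operator of t * |.|
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

random.seed(0)
n = 200
a = [random.gauss(0, 1) for _ in range(n)]
b = [2.0 * ai + random.gauss(0, 0.1) for ai in a]  # underlying slope is 2
lam = 0.01

x, alpha = 0.0, 0.05  # constant stepsize
for _ in range(5000):
    i = random.randrange(n)
    grad = a[i] * (a[i] * x - b[i])  # gradient of 0.5 * (a_i x - b_i)^2
    x = soft_threshold(x - alpha * grad, alpha * lam)

print(x)  # should settle near the slope 2, up to a stepsize-dependent error
```

With constant stepsize, the iterate oscillates in a neighborhood of the regularized solution whose radius is proportional to the stepsize, which is the qualitative behavior the abstract describes.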
References
Atchade, Y.F., Fort, G., Moulines, E.: On stochastic proximal gradient algorithms. arXiv:1402.2365 (2014)
Blatt, D., Hero, A.O.: Energy based sensor network source localization via projection onto convex sets. IEEE Trans. Signal Process. 54(9), 3614–3619 (2006)
Burke, J.V., Ferris, M.C.: Weak sharp minima in mathematical programming. SIAM J. Control Optim. 31(6), 1340–1359 (1993)
Bauschke, H., Combettes, P.: Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York (2011)
Bottou, L., Curtis, F., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60, 223–311 (2018)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Devolder, O., Glineur, F., Nesterov, Yu.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146, 37–75 (2014)
Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2899–2934 (2009)
Lacoste-Julien, S., Schmidt, M., Bach, F.: A simpler approach to obtaining an \({\cal{O}}(1/t)\) convergence rate for projected stochastic subgradient descent. arXiv:1212.2002 (2012)
Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133(1), 365–397 (2012)
Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24. Curran Associates, Inc. (2011)
Necoara, I., Nesterov, Yu., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 175(1), 69–107 (2019)
Necoara, I., Richtarik, P., Patrascu, A.: Randomized projection methods for convex feasibility problems: conditioning and convergence rates. SIAM J. Optim. 29(4), 2814–2852 (2019)
Necoara, I., Nedelcu, V., Dumitrache, I.: Parallel and distributed optimization methods for estimation and control in networks. J. Process Control 21(5), 756–766 (2011)
Nedelcu, V., Necoara, I., Tran Dinh, Q.: Computational complexity of inexact gradient augmented lagrangian methods: application to constrained MPC. SIAM J. Control Optim. 52(5), 3109–3134 (2014)
Nedich, A., Necoara, I.: Random minibatch projection algorithms for convex problems with functional constraints. Appl. Math. Optim. 80(3), 801–833 (2019)
Nedich, A., Bertsekas, D.: Convergence rate of incremental subgradient algorithms. In: Uryasev, S., Pardalos, P. (eds.) Stochastic optimization: algorithms and applications, pp. 263–304. Springer, Berlin (2000)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston (2004)
Patrascu, A., Necoara, I.: Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization. J. Mach. Learn. Res. 18(198), 1–42 (2018)
Polyak, B.: Minimization of unsmooth functionals. Comput. Math. Math. Phys. 9(3), 14–29 (1969)
Polyak, B.: Introduction to Optimization. Optimization Software, Inc., New York (1987)
Rosasco, L., Villa, S., Vu, B.C.: Convergence of stochastic proximal gradient algorithm. Appl. Math. Optim. https://doi.org/10.1007/s00245-019-09617-7 (2019)
Ryu, E., Boyd, S.: Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. http://web.stanford.edu/~eryu/ (2016)
Schmidt, M., Le Roux, N.: Fast convergence of stochastic gradient descent under a strong growth condition. arXiv:1308.6370 (2013)
Schmidt, M., Le Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24. Curran Associates, Inc. (2011)
Toulis, P., Tran, D., Airoldi, E.M.: Towards stability and optimality in stochastic gradient descent. In: International conference on artificial intelligence and statistics (2016)
Tibshirani, R.: The solution path of the generalized Lasso. Ph.d. Thesis, Stanford University (2011)
Yang, T., Lin, Q.: RSG: Beating subgradient method without smoothness and strong convexity. J. Mach. Learn. Res. 19(6), 1–33 (2018)
Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)
Acknowledgements
The research leading to these results has received funding from the NO Grants 2014–2021, under project ELO-Hyp, Contract No. 24/2020.
Communicated by Negash G. Medhin.
Appendix
In this appendix, we present several objective functions satisfying Assumptions 2.1 and 2.2. First, we provide two examples of objective functions satisfying the stochastic bounded gradient condition (2).
Example 1
(Nonsmooth (Lipschitz) functions satisfy Assumption 2.1) Assume that the functions \(f(\cdot ,\xi )\) and \(g(\cdot ,\xi )\) have bounded (sub)gradients:
Then, obviously Assumption 2.1 holds with \( L=0 \quad \text {and} \quad B^2 = 2 B_f^2 + 2 B_g^2.\)
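The two displayed conditions in this example appear to have been dropped in extraction; a reconstruction consistent with the constants stated above is:

```latex
% bounded stochastic (sub)gradients:
\mathbf{E}\!\left[ \|\nabla f(x,\xi)\|^2 \right] \le B_f^2,
\qquad
\mathbf{E}\!\left[ \|\nabla g(x,\xi)\|^2 \right] \le B_g^2
\quad \forall x \in \mathrm{dom}\, F,
% so that, by \|u+v\|^2 \le 2\|u\|^2 + 2\|v\|^2,
\mathbf{E}\!\left[ \|\nabla f(x,\xi) + \nabla g(x,\xi)\|^2 \right]
  \le 2 B_f^2 + 2 B_g^2,
% i.e., condition (2) holds with L = 0 and B^2 = 2 B_f^2 + 2 B_g^2.
```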
Example 2
(Smooth (Lipschitz gradient) functions satisfy Assumption 2.1) Condition (2) also covers the class of functions formed as a sum of two terms: one having Lipschitz continuous gradient and the other having bounded subgradients. Indeed, let us assume that \(f(\cdot ,\xi )\) has Lipschitz continuous gradient, i.e., there exists \(L(\xi )>0\) such that:
Then, using standard arguments we have [19]:
Assuming that \(L(\xi ) \le L_f\) for all \(\xi \in \varOmega \) and that \(g(\cdot ,\xi )\) is convex, adding \(g(x,\xi ) - g(\bar{x},\xi ) \ge \langle \nabla g(\bar{x},\xi ), x -\bar{x} \rangle \), where \(\nabla g(\bar{x},\xi ) \in \partial g(\bar{x},\xi )\), to the previous inequality and then taking expectation w.r.t. \(\xi \), we get:
where we used that \(\nabla f(\bar{x},\xi )\) and \(\nabla g(\bar{x},\xi )\) are unbiased stochastic estimates of the (sub)gradients of f and g, and thus \(\nabla F(\bar{x}) = \mathbf {E}\left[ \nabla f(\bar{x},\xi ) + \nabla g(\bar{x},\xi )\right] \in \partial F(\bar{x})\). Using the optimality conditions for (1) at \(\bar{x}\), i.e., \(0 \in \partial F(\bar{x})\), we get:
Therefore, for any \(\nabla g(x,\xi ) \in \partial g(x,\xi )\) we have:
Assuming now that the regularization function \(g(x,\xi )\) has bounded subgradients, i.e., \(\Vert \nabla g(x,\xi )\Vert \le B_g\), then we get that the stochastic bounded gradient condition (2) holds with:
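The chain of displayed inequalities in this example was also lost in extraction; one self-consistent reconstruction is sketched below. The specific constants (\(L = 4L_f\) and the resulting \(B^2\)) are our own derivation under the stated assumptions, not necessarily the constants of the original text.

```latex
% descent-lemma consequence of the Lipschitz gradient of f(., \xi):
\|\nabla f(x,\xi) - \nabla f(\bar{x},\xi)\|^2
  \le 2 L(\xi) \left( f(x,\xi) - f(\bar{x},\xi)
      - \langle \nabla f(\bar{x},\xi), x - \bar{x} \rangle \right);
% adding the convexity inequality for g(., \xi), taking expectation,
% and using 0 \in \partial F(\bar{x}) together with L(\xi) \le L_f:
\mathbf{E}\!\left[ \|\nabla f(x,\xi) - \nabla f(\bar{x},\xi)\|^2 \right]
  \le 2 L_f \left( F(x) - F^* \right);
% hence, applying \|u+v\|^2 \le 2\|u\|^2 + 2\|v\|^2 twice and
% using \|\nabla g(x,\xi)\| \le B_g:
\mathbf{E}\!\left[ \|\nabla f(x,\xi) + \nabla g(x,\xi)\|^2 \right]
  \le 4 L_f \left( F(x) - F^* \right)
    + 4\, \mathbf{E}\!\left[ \|\nabla f(\bar{x},\xi)\|^2 \right] + 4 B_g^2,
% i.e., condition (2) with L = 4 L_f and
% B^2 = 4 E[ ||\nabla f(\bar{x},\xi)||^2 ] + 4 B_g^2.
```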
Further, many practical problems satisfy the quadratic functional growth condition (3); the most relevant one is given next.
Example 3
(Composition of a strongly convex function with a linear map satisfies Assumption 2.2): Assume \(F(x)= {{\hat{f}}}(A^Tx) + g(x)\), where \({{\hat{f}}}\) is a strongly convex function with constant \(\sigma _f>0\), A is a matrix of appropriate dimension, and g is a polyhedral function. Since g has a polyhedral epigraph, the optimization problem (1) can be equivalently written as:
for some appropriate matrix C and vectors c and d of appropriate dimensions. In conclusion, this reformulation leads to the following extended problem:
where \({\hat{A}}=[A \; 0], \; {\hat{c}} =[0 \; 1]^T, \;{\hat{C}} = [C \; c]\) and \({\hat{d}} =d\). It is easy to see that \({\hat{x}}^* = [(x^*)^T \; \zeta ^*]^T\) is an optimal point of this extended optimization problem if \(x^*\) is optimal for the original problem and \(g(x^*) = \zeta ^*\). Moreover, we have \({\hat{F}}^* = F^*\). Following a standard argument, as, e.g., in [12], there exist \(b^*\) and \(s^*\) such that the optimal set of the extended optimization problem is given by \({\hat{X}}^* =\{{\hat{x}}: \; {\hat{A}} {\hat{x}} = b^*, \; {\hat{c}}^T {\hat{x}}= s^*, \; {\hat{C}} {\hat{x}} \le {\hat{d}} \}\). Further, since \({{\hat{f}}}\) is a strongly convex function with constant \(\sigma _f>0\), it follows from [12, Theorem 10] that, for any \(M>0\), on the corresponding sublevel set the function \({\hat{F}}\) satisfies a quadratic functional growth condition of the form:
where \(\mu (M) = \frac{\sigma _f}{\theta ^2 (1 + M \sigma _f + 2 \Vert \nabla {{\hat{f}}}({\hat{A}} {\hat{x}}^*)\Vert ^2)}\), with \({\hat{x}}^* \in {\hat{X}}^*\), and \(\theta \) is the Hoffman bound for the optimal polyhedral set \({\hat{X}}^*\). Now, setting \(\zeta = g(x)\) in the previous inequality, we get \(F(x) - F({\overline{x}}) \ge \frac{\mu (M)}{2} \left( \Vert x - {\overline{x}}\Vert ^2 + (\zeta - \zeta ^*)^2\right) \ge \frac{\mu (M)}{2} \Vert x - {\overline{x}}\Vert ^2\) for all \(x\) with \(F(x) - F({\overline{x}}) \le M\). In conclusion, the objective function F satisfies the quadratic functional growth condition (3) on any sublevel set, that is, for any \(M>0\) there exists \(\mu (M)>0\), defined above, such that:
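For readability, the displayed formulas of this example, which appear to have been dropped in extraction, can plausibly be reconstructed from the surrounding definitions as follows (a reconstruction, not the original typeset equations):

```latex
% epigraph reformulation of (1) with polyhedral g:
\min_{x,\,\zeta} \; {\hat{f}}(A^T x) + \zeta
  \quad \text{s.t.} \quad C x + \zeta\, c \le d;
% compact (extended) form, with \hat{x} = [x^T \; \zeta]^T,
% \hat{A} = [A \; 0], \hat{c} = [0 \; 1]^T, \hat{C} = [C \; c], \hat{d} = d:
\min_{\hat{x}} \; {\hat{F}}(\hat{x})
  := {\hat{f}}({\hat{A}}^T \hat{x}) + {\hat{c}}^T \hat{x}
  \quad \text{s.t.} \quad {\hat{C}} \hat{x} \le {\hat{d}};
% quadratic functional growth on the sublevel set
% \{\hat{x} : {\hat{C}}\hat{x} \le {\hat{d}},\;
%   {\hat{F}}(\hat{x}) - {\hat{F}}^* \le M\}:
{\hat{F}}(\hat{x}) - {\hat{F}}^*
  \ge \frac{\mu(M)}{2}\, \Vert \hat{x} - \bar{\hat{x}} \Vert^2,
% with \bar{\hat{x}} the projection of \hat{x} onto {\hat{X}}^*.
```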
The quadratic functional growth condition (3) is a relaxation of the strong convexity notion; see [12] for a more detailed discussion. Clearly, any strongly convex function F satisfies (3) (see, e.g., [19]).
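As a quick numeric sanity check of this growth property, the following scalar toy example (our own choice: \({\hat{f}}(t) = \frac{1}{2}(t-b)^2\), \(g = \lambda |\cdot |\), \(A = a\)) estimates the growth constant by sampling:

```python
# Toy scalar check of quadratic functional growth for
# F(x) = fhat(a*x) + g(x), with fhat strongly convex and g polyhedral:
# here fhat(t) = 0.5*(t - b)**2 and g(x) = lam*|x|.
a, b, lam = 2.0, 3.0, 1.0

def F(x):
    return 0.5 * (a * x - b) ** 2 + lam * abs(x)

# unique minimizer, obtained by soft-thresholding the unregularized solution
x_star = (a * b - lam) / a ** 2  # = 1.25 for these numbers
F_star = F(x_star)

# estimate mu such that F(x) - F* >= (mu/2) * (x - x*)^2 on a grid
ratios = []
for k in range(-200, 401):
    x = k / 100.0
    if abs(x - x_star) > 1e-6:
        ratios.append(2.0 * (F(x) - F_star) / (x - x_star) ** 2)
mu = min(ratios)
print(mu)  # approximately a**2 = 4 here
```

The sampled growth constant is positive (here it matches the strong convexity constant \(\sigma_f\) pushed through the map \(A\)), illustrating condition (3) on the sampled sublevel set.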
Cite this article
Necoara, I. General Convergence Analysis of Stochastic First-Order Methods for Composite Optimization. J Optim Theory Appl 189, 66–95 (2021). https://doi.org/10.1007/s10957-021-01821-2
Keywords
- Stochastic composite convex optimization
- Stochastic bounded gradient
- Quadratic functional growth
- Stochastic first-order algorithms
- Convergence rates