
General Convergence Analysis of Stochastic First-Order Methods for Composite Optimization

Journal of Optimization Theory and Applications

Abstract

In this paper, we consider stochastic composite convex optimization problems in which the objective function satisfies a stochastic bounded gradient condition, with or without a quadratic functional growth property. These models include the best-known classes of objective functions analyzed in the literature: nonsmooth Lipschitz functions and compositions of a (potentially) nonsmooth function and a smooth function, with or without strong convexity. Based on the flexibility offered by our optimization model, we consider several variants of stochastic first-order methods, such as the stochastic proximal gradient and the stochastic proximal point algorithms. Usually, the convergence theory for these methods has been derived for simple stochastic optimization models satisfying restrictive assumptions, the rates are in general sublinear, and they hold only for specific decreasing stepsizes. Hence, we analyze the convergence rates of stochastic first-order methods with constant or variable stepsize under general assumptions covering a large class of objective functions. For a constant stepsize, we show that these methods can achieve linear convergence up to a constant proportional to the stepsize and, under a strong stochastic bounded gradient condition, even pure linear convergence. Moreover, when a variable stepsize is chosen, we derive sublinear convergence rates for these stochastic first-order methods. Finally, the stochastic gradient mapping and the Moreau smoothing mapping introduced in the present paper lead to simple and intuitive proofs.
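For a concrete picture of the kind of scheme analyzed, the stochastic proximal gradient iteration mentioned above can be sketched as follows; the least-squares loss, the \(\ell_1\) regularizer, the stepsize and all constants are assumptions for illustration, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, lam, eta = 5, 200, 0.1, 0.05          # dimension, samples, regularization, constant stepsize
A = rng.normal(size=(m, n))
x_true = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
b = A @ x_true + 0.01 * rng.normal(size=m)

# prox of t*||.||_1 is componentwise soft-thresholding
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x = np.zeros(n)
for _ in range(5000):
    i = rng.integers(m)                      # draw a sample xi_k
    g = (A[i] @ x - b[i]) * A[i]             # stochastic gradient of the smooth part f(., xi_k)
    x = soft(x - eta * g, eta * lam)         # proximal step on the nonsmooth regularizer

obj = lambda z: 0.5 * np.mean((A @ z - b) ** 2) + lam * np.linalg.norm(z, 1)
print(obj(x) < obj(np.zeros(n)))             # the iterate improves on the starting point
```

With a constant stepsize the iterates settle in a neighborhood of the minimizer, consistent with the "linear convergence up to a constant proportional to the stepsize" behavior described above.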


References

  1. Atchade, Y.F., Fort, G., Moulines, E.: On stochastic proximal gradient algorithms. arXiv:1402.2365 (2014)

  2. Blatt, D., Hero, A.O.: Energy based sensor network source localization via projection onto convex sets. IEEE Trans. Signal Process. 54(9), 3614–3619 (2006)

  3. Burke, J.V., Ferris, M.C.: Weak sharp minima in mathematical programming. SIAM J. Control Optim. 31(6), 1340–1359 (1993)

  4. Bauschke, H., Combettes, P.: Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York (2011)

  5. Bottou, L., Curtis, F., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60, 223–311 (2018)

  6. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)

  7. Devolder, O., Glineur, F., Nesterov, Yu.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146, 37–75 (2014)

  8. Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2899–2934 (2009)

  9. Lacoste-Julien, S., Schmidt, M., Bach, F.: A simpler approach to obtaining an \({\cal{O}}(1/t)\) convergence rate for projected stochastic subgradient descent. arXiv:1212.2002 (2012)

  10. Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133(1), 365–397 (2012)

  11. Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24. Curran Associates, Inc. (2011)

  12. Necoara, I., Nesterov, Yu., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 175(1), 69–107 (2019)

  13. Necoara, I., Richtarik, P., Patrascu, A.: Randomized projection methods for convex feasibility problems: conditioning and convergence rates. SIAM J. Optim. 29(4), 2814–2852 (2019)

  14. Necoara, I., Nedelcu, V., Dumitrache, I.: Parallel and distributed optimization methods for estimation and control in networks. J. Process Control 21(5), 756–766 (2011)

  15. Nedelcu, V., Necoara, I., Tran Dinh, Q.: Computational complexity of inexact gradient augmented lagrangian methods: application to constrained MPC. SIAM J. Control Optim. 52(5), 3109–3134 (2014)

  16. Nedich, A., Necoara, I.: Random minibatch projection algorithms for convex problems with functional constraints. Appl. Math. Optim. 80(3), 801–833 (2019)

  17. Nedich, A., Bertsekas, D.: Convergence rate of incremental subgradient algorithms. In: Uryasev, S., Pardalos, P. (eds.) Stochastic optimization: algorithms and applications, pp. 263–304. Springer, Berlin (2000)

  18. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  19. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston (2004)

  20. Patrascu, A., Necoara, I.: Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization. J. Mach. Learn. Res. 18(198), 1–42 (2018)

  21. Polyak, B.: Minimization of unsmooth functionals. Comput. Math. Math. Phys. 9(3), 14–29 (1969)

  22. Polyak, B.: Introduction to Optimization. Optimization Software, Inc., New York (1987)

  23. Rosasco, L., Villa, S., Vu, B.C.: Convergence of stochastic proximal gradient algorithm. Appl. Math. Optim. https://doi.org/10.1007/s00245-019-09617-7 (2019)

  24. Ryu, E., Boyd, S.: Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. http://web.stanford.edu/~eryu/ (2016)

  25. Schmidt, M., Le Roux, N.: Fast convergence of stochastic gradient descent under a strong growth condition. arXiv:1308.6370 (2013)

  26. Schmidt, M., Le Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24. Curran Associates, Inc. (2011)

  27. Toulis, P., Tran, D., Airoldi, E.M.: Towards stability and optimality in stochastic gradient descent. In: International conference on artificial intelligence and statistics (2016)

  28. Tibshirani, R.: The solution path of the generalized Lasso. Ph.D. thesis, Stanford University (2011)

  29. Yang, T., Lin, Q.: RSG: Beating subgradient method without smoothness and strong convexity. J. Mach. Learn. Res. 19(6), 1–33 (2018)

  30. Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2015)

Acknowledgements

The research leading to these results has received funding from the NO Grants 2014–2021, under project ELO-Hyp, Contract No. 24/2020.

Author information

Correspondence to Ion Necoara.

Additional information

Communicated by Negash G. Medhin.


Appendix

In this appendix, we present several objective functions satisfying Assumptions 2.1 and 2.2. First, we provide two examples of objective functions satisfying the stochastic bounded gradient condition (2).

Example 1

(Nonsmooth (Lipschitz) functions satisfy Assumption 2.1) Assume that the functions \(f(\cdot ,\xi )\) and \(g(\cdot ,\xi )\) have bounded (sub)gradients:

$$\begin{aligned} \Vert \nabla f(x,\xi )\Vert \le B_f \quad \text {and} \quad \Vert \nabla g(x,\xi ) \Vert \le B_g \quad \forall x \in \text {dom} \ F. \end{aligned}$$

Then, Assumption 2.1 obviously holds with \( L=0 \quad \text {and} \quad B^2 = 2 B_f^2 + 2 B_g^2.\)
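As a quick numerical sanity check of Example 1, the sketch below uses an assumed absolute loss \(f(x,\xi_i)=|\langle a_i,x\rangle - b_i|\) with normalized rows (so \(B_f=1\)) and \(g\equiv 0\); this instance is an assumption for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 100
A = rng.normal(size=(m, n))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # normalize rows: ||a_i|| = 1, so B_f = 1
b = rng.normal(size=m)

# f(x, xi_i) = |<a_i, x> - b_i| has subgradients sign(<a_i, x> - b_i) * a_i
def subgrad_f(x, i):
    return np.sign(A[i] @ x - b[i]) * A[i]

x = rng.normal(size=n)
norms2 = np.array([np.linalg.norm(subgrad_f(x, i)) ** 2 for i in range(m)])
# with g = 0, Assumption 2.1 reads E||grad F(x, xi)||^2 <= B^2 = 2 B_f^2 = 2, with L = 0
print(norms2.mean() <= 2.0)
```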

Example 2

(Smooth (Lipschitz gradient) functions satisfy Assumption 2.1) Condition (2) contains the class of functions formed as a sum of two terms: one having Lipschitz continuous gradient and the other having bounded subgradients. Indeed, let us assume that \(f(\cdot ,\xi )\) has Lipschitz continuous gradient, i.e., there exists \(L(\xi )>0\) such that:

$$\begin{aligned} \Vert \nabla f(x, \xi ) - \nabla f(\bar{x},\xi )\Vert \le L(\xi ) \Vert x - \bar{x}\Vert \quad \forall x \in \text {dom} \ F. \end{aligned}$$

Then, using standard arguments we have [19]:

$$\begin{aligned} f(x, \xi ) - f(\bar{x},\xi ) \ge \langle \nabla f(\bar{x}, \xi ), x -\bar{x} \rangle + \frac{1}{2L(\xi )} \Vert \nabla f(x, \xi ) - \nabla f(\bar{x}, \xi ) \Vert ^2. \end{aligned}$$

Assuming that \(L(\xi ) \le L_f\) for all \(\xi \in \varOmega \) and that \(g(\cdot ,\xi )\) is convex, adding the subgradient inequality \(g(x,\xi ) - g(\bar{x},\xi ) \ge \langle \nabla g(\bar{x},\xi ), x -\bar{x} \rangle \), where \(\nabla g(\bar{x},\xi ) \in \partial g(\bar{x},\xi )\), to the previous inequality and then taking expectation w.r.t. \(\xi \), we get:

$$\begin{aligned} F(x) - F(\bar{x})&\ge \langle \nabla F(\bar{x}), x -\bar{x} \rangle + \frac{1}{2L_f} \mathbf {E}\left[ \Vert \nabla f(x, \xi ) - \nabla f(\bar{x}, \xi ) \Vert ^2\right] , \end{aligned}$$

where we used that \(\nabla f(\bar{x},\xi )\) and \(\nabla g(\bar{x},\xi )\) are unbiased stochastic estimates of the (sub)gradients of f and g and thus \(\nabla F(\bar{x}) = \mathbf {E}\left[ \nabla f(\bar{x},\xi ) + \nabla g(\bar{x},\xi )\right] \in \partial F(\bar{x})\). Using the optimality conditions for (1) at \(\bar{x}\), i.e., \(0 \in \partial F(\bar{x})\), we get:

$$\begin{aligned} F(x) - F^*&\ge \frac{1}{2L_f} \mathbf {E}\left[ \Vert \nabla f(x, \xi ) - \nabla f(\bar{x}, \xi ) \Vert ^2\right] . \end{aligned}$$

Therefore, for any \(\nabla g(x,\xi ) \in \partial g(x,\xi )\) we have:

$$\begin{aligned} \mathbf {E}\left[ \Vert \nabla F(x, \xi ) \Vert ^2\right]&= \mathbf {E}\left[ \Vert \nabla f(x, \xi ) - \nabla f(\bar{x}, \xi ) +\nabla g(x,\xi ) + \nabla f(\bar{x}, \xi ) \Vert ^2\right] \nonumber \\&\le 2 \mathbf {E}[\Vert \nabla f(x, \xi ) - \nabla f(\bar{x}, \xi )\Vert ^2 ] +2 \mathbf {E}[\Vert \nabla g(x,\xi ) + \nabla f(\bar{x}, \xi )\Vert ^2 ]\\&\le 4L_f (F(x) - F^*) +2 \mathbf {E}[\Vert \nabla g(x,\xi ) + \nabla f(\bar{x}, \xi )\Vert ^2 ]. \end{aligned}$$

Assuming now that the regularization function \(g(x,\xi )\) has bounded subgradients, i.e., \(\Vert \nabla g(x,\xi )\Vert \le B_g\), then we get that the stochastic bounded gradient condition (2) holds with:

$$\begin{aligned} L = 4 L_f \quad \text {and} \quad B^2 = 4 (B_g^2 +\min _{\bar{x} \in X^*} \mathbf {E}[\Vert \nabla f(\bar{x}, \xi )\Vert ^2]). \end{aligned}$$
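The constants in Example 2 can be checked numerically. The sketch below uses an assumed least-squares \(f(x,\xi_i)=\tfrac{1}{2}(\langle a_i,x\rangle - b_i)^2\) and \(g\equiv 0\) (so \(B_g=0\)), which is not the paper's setting but satisfies its assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 200
A = rng.normal(size=(m, n))
b = rng.normal(size=m)
L_f = (np.linalg.norm(A, axis=1) ** 2).max()     # per-sample Lipschitz constants L(xi) <= L_f

F = lambda x: 0.5 * np.mean((A @ x - b) ** 2)    # F(x) = E[f(x, xi)]
grads = lambda x: (A @ x - b)[:, None] * A       # row i is the gradient of f(., xi_i)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # minimizer of F
F_star = F(x_star)
B2 = 4 * np.mean(np.linalg.norm(grads(x_star), axis=1) ** 2)  # B^2 with B_g = 0

# check E||grad f(x, xi)||^2 <= 4 L_f (F(x) - F*) + B^2 at random points
ok = True
for _ in range(50):
    x = x_star + rng.normal(size=n)
    lhs = np.mean(np.linalg.norm(grads(x), axis=1) ** 2)
    ok &= lhs <= 4 * L_f * (F(x) - F_star) + B2 + 1e-9
print(ok)
```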

Further, many practical problems satisfy the quadratic functional growth condition (3); the most relevant one is given next.

Example 3

(Composition of a strongly convex function with a linear map satisfies Assumption 2.2): Assume \(F(x)= {{\hat{f}}}(A^Tx) + g(x)\), where \({{\hat{f}}}\) is a strongly convex function with constant \(\sigma _f>0\), A is a matrix of appropriate dimension and g is a polyhedral function. Since g has a polyhedral epigraph, the optimization problem (1) can be equivalently written as:

$$\begin{aligned} \min _{x,\zeta } {{\hat{f}}}(A^Tx) + \zeta \qquad \text {s.t.}: \quad Cx + c \zeta \le d, \end{aligned}$$

for some matrix C and vectors c and d of appropriate dimensions. This reformulation leads to the following extended problem:

$$\begin{aligned} \hat{F^*} = \min _{{\hat{x}}=[x^T \; \zeta ]^T} {{\hat{F}}}({\hat{x}}) \quad \left( = {{\hat{f}}}({\hat{A}} {\hat{x}}) + {\hat{c}}^T {\hat{x}} \right) \qquad \text {s.t.}: \quad {\hat{C}} {\hat{x}} \le {\hat{d}}, \end{aligned}$$

where \({\hat{A}}=[A \; 0], \; {\hat{c}} =[0 \; 1]^T, \;{\hat{C}} = [C \; c]\) and \({\hat{d}} =d\). It can be easily seen that \({\hat{x}}^* = [(x^*)^T \; \zeta ^*]^T\) is an optimal point of this extended optimization problem if \(x^*\) is optimal for the original problem and \(g(x^*) = \zeta ^*\). Moreover, we have \(\hat{F^*}= F^*\). Following a standard argument, as, e.g., in [12], there exist \(b^*\) and \(s^*\) such that the optimal set of the extended optimization problem is given by \({\hat{X}}^* =\{{\hat{x}}: \; {\hat{A}} {\hat{x}} = b^*, \; {\hat{c}}^T {\hat{x}}= s^*, \; {\hat{C}} {\hat{x}} \le {\hat{d}} \}\). Further, since \({{\hat{f}}}\) is a strongly convex function with constant \(\sigma _f>0\), it follows from [12, Theorem 10] that, for any \(M>0\), the function \({\hat{F}}\) satisfies on the corresponding sublevel set a quadratic functional growth condition of the form:

$$\begin{aligned} {\hat{F}}({\hat{x}}) - \hat{F^*} \ge \frac{\mu (M)}{2} \Vert {\hat{x}} - {\hat{x}}^* \Vert ^2 \quad \forall {\hat{x}}: \; {\hat{F}}({\hat{x}}) - \hat{F^*} \le M, \end{aligned}$$

where \(\mu (M) = \frac{\sigma _f}{\theta ^2 (1 + M \sigma _f + 2 \Vert \nabla {{\hat{f}}}({\hat{A}} {\hat{x}}^*)\Vert ^2)}\), with \({\hat{x}}^* \in {\hat{X}}^*\) and \(\theta \) the Hoffman bound for the optimal polyhedral set \({\hat{X}}^*\). Now, setting \(\zeta = g(x)\) in the previous inequality, we get \(F(x) - F({\overline{x}}) \!\ge \! \frac{\mu (M)}{2} ( \Vert x - {\overline{x}}\Vert ^2 + (\zeta - \zeta ^*)^2) \!\ge \! \frac{\mu (M)}{2} \Vert x - {\overline{x}}\Vert ^2\) for all \(x: F(x) - F({\overline{x}}) \!\le \! M\). In conclusion, the objective function F satisfies the quadratic functional growth condition (3) on any sublevel set, that is, for any \(M>0\) there exists \(\mu (M)>0\) defined above such that:

$$\begin{aligned} F(x) - F({\overline{x}}) \ge \frac{\mu (M)}{2} \Vert x - {\overline{x}}\Vert ^2 \quad \forall x: \; F(x) - F({\overline{x}}) \le M. \end{aligned}$$

The quadratic functional growth condition (3) is a relaxation of the strong convexity notion; see [12] for a more detailed discussion. Clearly, any strongly convex function F satisfies (3) (see, e.g., [19]).
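The growth condition (3) can also be illustrated numerically for the simplest instance of Example 3: an assumed \({\hat{f}}(y)=\tfrac{1}{2}\Vert y-b\Vert^2\) composed with a wide \(A^T\) and \(g\equiv 0\). In this case \(\mu\) can be taken as the smallest nonzero singular value of \(A^T\) squared (a Hoffman-type constant for this instance; an assumption of the sketch, not the paper's general formula):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 3                        # A^T is p x n with p < n: F is convex but NOT strongly convex
At = rng.normal(size=(p, n))
b = rng.normal(size=p)

F = lambda x: 0.5 * np.linalg.norm(At @ x - b) ** 2
x_bar = np.linalg.lstsq(At, b, rcond=None)[0]      # one minimizer
F_star, c_star = F(x_bar), At @ x_bar

# optimal set X* = {x : A^T x = c*}; distance to this affine set via the pseudoinverse
pinv = np.linalg.pinv(At)
dist = lambda x: np.linalg.norm(pinv @ (At @ x - c_star))

mu = np.linalg.svd(At, compute_uv=False)[-1] ** 2  # smallest nonzero singular value, squared

# check F(x) - F* >= (mu / 2) * dist(x, X*)^2 at random points
ok = all(F(x) - F_star >= 0.5 * mu * dist(x) ** 2 - 1e-9
         for x in (x_bar + rng.normal(size=n) for _ in range(50)))
print(ok)
```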

Cite this article

Necoara, I. General Convergence Analysis of Stochastic First-Order Methods for Composite Optimization. J Optim Theory Appl 189, 66–95 (2021). https://doi.org/10.1007/s10957-021-01821-2