Abstract
In this paper, we consider stochastic composite convex optimization problems in which the objective function satisfies a stochastic bounded gradient condition, with or without a quadratic functional growth property. These models cover the best-known classes of objective functions analyzed in the literature: nonsmooth Lipschitz functions and compositions of a (potentially) nonsmooth function with a smooth function, with or without strong convexity. Based on the flexibility offered by our optimization model, we consider several variants of stochastic first-order methods, such as the stochastic proximal gradient and the stochastic proximal point algorithms. Usually, the convergence theory for these methods has been derived for simple stochastic optimization models satisfying restrictive assumptions, and the rates are in general sublinear and hold only for specific decreasing stepsizes. Hence, we analyze the convergence rates of stochastic first-order methods with constant or variable stepsize under general assumptions covering a large class of objective functions. For constant stepsize, we show that these methods achieve a linear convergence rate up to a constant proportional to the stepsize and, under a strong stochastic bounded gradient condition, even pure linear convergence. Moreover, when a variable stepsize is chosen, we derive sublinear convergence rates for these stochastic first-order methods. Finally, the stochastic gradient mapping and the Moreau smoothing mapping introduced in the present paper lead to simple and intuitive proofs.
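As a toy illustration of one of the method classes analyzed (a stochastic proximal gradient iteration with constant stepsize), the following sketch applies the iteration x ← prox_{αg}(x − α∇f(x,ξ)) to a least-squares loss with an ℓ1 regularizer. The data model, stepsize, and problem sizes here are our own illustrative choices, not the paper's setup.

```python
import random

# Stochastic proximal gradient with constant stepsize (illustrative sketch):
# at each step draw one sample i, take a gradient step on f(., i), then
# apply the proximal operator of the regularizer g = lam * |.|.

def soft_threshold(v, t):
    # proximal operator of t * |.|
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

random.seed(0)
n = 200
a = [random.gauss(0, 1) for _ in range(n)]
b = [2.0 * ai + random.gauss(0, 0.1) for ai in a]  # underlying slope is 2
lam = 0.01

x, alpha = 0.0, 0.05  # constant stepsize
for _ in range(5000):
    i = random.randrange(n)
    grad = a[i] * (a[i] * x - b[i])  # gradient of 0.5 * (a_i x - b_i)^2
    x = soft_threshold(x - alpha * grad, alpha * lam)

print(x)  # should settle near the slope 2, up to a stepsize-dependent error
```

With constant stepsize, the iterate oscillates in a neighborhood of the regularized solution whose radius is proportional to the stepsize, which is the qualitative behavior the abstract describes.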
References
Atchade, Y.F., Fort, G., Moulines, E.: On stochastic proximal gradient algorithms. arXiv:1402.2365 (2014)
Blatt, D., Hero, A.O.: Energy based sensor network source localization via projection onto convex sets. IEEE Trans. Signal Process. 54(9), 3614–3619 (2006)
Burke, J.V., Ferris, M.C.: Weak sharp minima in mathematical programming. SIAM J. Control Optim. 31(6), 1340–1359 (1993)
Bauschke, H., Combettes, P.: Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York (2011)
Bottou, L., Curtis, F., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60, 223–311 (2018)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Devolder, O., Glineur, F., Nesterov, Yu.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146, 37–75 (2014)
Duchi, J., Singer, Y.: Efficient online and batch learning using forward backward splitting. J. Mach. Learn. Res. 10, 2899–2934 (2009)
Lacoste-Julien, S., Schmidt, M., Bach, F.: A simpler approach to obtaining an \({\cal{O}}(1/t)\) convergence rate for projected stochastic subgradient descent. arXiv:1212.2002 (2012)
Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133(1), 365–397 (2012)
Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24. Curran Associates, Inc. (2011)
Necoara, I., Nesterov, Yu., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. 175(1), 69–107 (2019)
Necoara, I., Richtarik, P., Patrascu, A.: Randomized projection methods for convex feasibility problems: conditioning and convergence rates. SIAM J. Optim. 29(4), 2814–2852 (2019)
Necoara, I., Nedelcu, V., Dumitrache, I.: Parallel and distributed optimization methods for estimation and control in networks. J. Process Control 21(5), 756–766 (2011)
Nedelcu, V., Necoara, I., Tran Dinh, Q.: Computational complexity of inexact gradient augmented lagrangian methods: application to constrained MPC. SIAM J. Control Optim. 52(5), 3109–3134 (2014)
Nedich, A., Necoara, I.: Random minibatch projection algorithms for convex problems with functional constraints. Appl. Math. Optim. 80(3), 801–833 (2019)
Nedich, A., Bertsekas, D.: Convergence rate of incremental subgradient algorithms. In: Uryasev, S., Pardalos, P. (eds.) Stochastic optimization: algorithms and applications, pp. 263–304. Springer, Berlin (2000)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Boston (2004)
Patrascu, A., Necoara, I.: Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization. J. Mach. Learn. Res. 18(198), 1–42 (2018)
Polyak, B.: Minimization of unsmooth functionals. Comput. Math. Math. Phys. 9(3), 14–29 (1969)
Polyak, B.: Introduction to Optimization. Optimization Software, Inc., New York (1987)
Rosasco, L., Villa, S., Vu, B.C.: Convergence of stochastic proximal gradient algorithm. Appl. Math. Optim. https://doi.org/10.1007/s00245-019-09617-7 (2019)
Ryu, E., Boyd, S.: Stochastic proximal iteration: a non-asymptotic improvement upon stochastic gradient descent. http://web.stanford.edu/~eryu/ (2016)
Schmidt, M., Le Roux, N.: Fast convergence of stochastic gradient descent under a strong growth condition. arXiv:1308.6370 (2013)
Schmidt, M., Le Roux, N., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24. Curran Associates, Inc. (2011)
Toulis, P., Tran, D., Airoldi, E.M.: Towards stability and optimality in stochastic gradient descent. In: International conference on artificial intelligence and statistics (2016)
Tibshirani, R.: The solution path of the generalized Lasso. Ph.d. Thesis, Stanford University (2011)
Yang, T., Lin, Q.: RSG: Beating subgradient method without smoothness and strong convexity. J. Mach. Learn. Res. 19(6), 1–33 (2018)
Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)
Acknowledgements
The research leading to these results has received funding from the NO Grants 2014–2021, under project ELO-Hyp, Contract No. 24/2020.
Communicated by Negash G. Medhin.
Appendix
In this appendix, we present several objective functions satisfying Assumptions 2.1 and 2.2. First, we provide two examples of objective functions satisfying the stochastic bounded gradient condition (2).
Example 1
(Nonsmooth (Lipschitz) functions satisfy Assumption 2.1) Assume that the functions \(f(\cdot ,\xi )\) and \(g(\cdot ,\xi )\) have bounded (sub)gradients:
Then, obviously Assumption 2.1 holds with \( L=0 \quad \text {and} \quad B^2 = 2 B_f^2 + 2 B_g^2.\)
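The two displayed conditions in this example appear to have been dropped in extraction; a reconstruction consistent with the constants stated above is:

```latex
% bounded stochastic (sub)gradients:
\mathbf{E}\!\left[ \|\nabla f(x,\xi)\|^2 \right] \le B_f^2,
\qquad
\mathbf{E}\!\left[ \|\nabla g(x,\xi)\|^2 \right] \le B_g^2
\quad \forall x \in \mathrm{dom}\, F,
% so that, by \|u+v\|^2 \le 2\|u\|^2 + 2\|v\|^2,
\mathbf{E}\!\left[ \|\nabla f(x,\xi) + \nabla g(x,\xi)\|^2 \right]
  \le 2 B_f^2 + 2 B_g^2,
% i.e., condition (2) holds with L = 0 and B^2 = 2 B_f^2 + 2 B_g^2.
```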
Example 2
(Smooth (Lipschitz gradient) functions satisfy Assumption 2.1) Condition (2) also covers the class of functions formed as a sum of two terms: one having Lipschitz continuous gradient and the other having bounded subgradients. Indeed, let us assume that \(f(\cdot ,\xi )\) has Lipschitz continuous gradient, i.e., there exists \(L(\xi )>0\) such that:
Then, using standard arguments we have [19]:
Assuming that \(L(\xi ) \le L_f\) for all \(\xi \in \varOmega \) and that \(g(\cdot ,\xi )\) is convex, adding \(g(x,\xi ) - g(\bar{x},\xi ) \ge \langle \nabla g(\bar{x},\xi ), x -\bar{x} \rangle \), where \(\nabla g(\bar{x},\xi ) \in \partial g(\bar{x},\xi )\), to the previous inequality and then taking expectation w.r.t. \(\xi \), we get:
where we used that \(\nabla f(\bar{x},\xi )\) and \(\nabla g(\bar{x},\xi )\) are unbiased stochastic estimates of the (sub)gradients of f and g, and thus \(\nabla F(\bar{x}) = \mathbf {E}\left[ \nabla f(\bar{x},\xi ) + \nabla g(\bar{x},\xi )\right] \in \partial F(\bar{x})\). Using the optimality conditions for (1) at \(\bar{x}\), i.e., \(0 \in \partial F(\bar{x})\), we get:
Therefore, for any \(\nabla g(x,\xi ) \in \partial g(x,\xi )\) we have:
Assuming now that the regularization function \(g(x,\xi )\) has bounded subgradients, i.e., \(\Vert \nabla g(x,\xi )\Vert \le B_g\), then we get that the stochastic bounded gradient condition (2) holds with:
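The chain of displayed inequalities in this example was also lost in extraction; one self-consistent reconstruction is sketched below. The specific constants (\(L = 4L_f\) and the resulting \(B^2\)) are our own derivation under the stated assumptions, not necessarily the constants of the original text.

```latex
% descent-lemma consequence of the Lipschitz gradient of f(., \xi):
\|\nabla f(x,\xi) - \nabla f(\bar{x},\xi)\|^2
  \le 2 L(\xi) \left( f(x,\xi) - f(\bar{x},\xi)
      - \langle \nabla f(\bar{x},\xi), x - \bar{x} \rangle \right);
% adding the convexity inequality for g(., \xi), taking expectation,
% and using 0 \in \partial F(\bar{x}) together with L(\xi) \le L_f:
\mathbf{E}\!\left[ \|\nabla f(x,\xi) - \nabla f(\bar{x},\xi)\|^2 \right]
  \le 2 L_f \left( F(x) - F^* \right);
% hence, applying \|u+v\|^2 \le 2\|u\|^2 + 2\|v\|^2 twice and
% using \|\nabla g(x,\xi)\| \le B_g:
\mathbf{E}\!\left[ \|\nabla f(x,\xi) + \nabla g(x,\xi)\|^2 \right]
  \le 4 L_f \left( F(x) - F^* \right)
    + 4\, \mathbf{E}\!\left[ \|\nabla f(\bar{x},\xi)\|^2 \right] + 4 B_g^2,
% i.e., condition (2) with L = 4 L_f and
% B^2 = 4 E[ ||\nabla f(\bar{x},\xi)||^2 ] + 4 B_g^2.
```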
Further, many practical problems satisfy the quadratic functional growth condition (3); the most relevant one is given next.
Example 3
(Composition of a strongly convex function with a linear map satisfies Assumption 2.2): Assume \(F(x)= {{\hat{f}}}(A^Tx) + g(x)\), where \({{\hat{f}}}\) is a strongly convex function with constant \(\sigma _f>0\), A is a matrix of appropriate dimension, and g is a polyhedral function. Since g has a polyhedral epigraph, the optimization problem (1) can be equivalently written as:
for some appropriate matrix C and vectors c and d of appropriate dimensions. In conclusion, this reformulation leads to the following extended problem:
where \({\hat{A}}=[A \; 0], \; {\hat{c}} =[0 \; 1]^T, \;{\hat{C}} = [C \; c]\) and \({\hat{d}} =d\). It is easy to see that \({\hat{x}}^* = [(x^*)^T \; \zeta ^*]^T\) is an optimal point of this extended optimization problem if \(x^*\) is optimal for the original problem and \(g(x^*) = \zeta ^*\). Moreover, we have \({\hat{F}}^* = F^*\). Following a standard argument, as, e.g., in [12], there exist \(b^*\) and \(s^*\) such that the optimal set of the extended optimization problem is given by \({\hat{X}}^* =\{{\hat{x}}: \; {\hat{A}} {\hat{x}} = b^*, \; {\hat{c}}^T {\hat{x}}= s^*, \; {\hat{C}} {\hat{x}} \le {\hat{d}} \}\). Further, since \({{\hat{f}}}\) is a strongly convex function with constant \(\sigma _f>0\), it follows from [12, Theorem 10] that, for any \(M>0\), on the corresponding sublevel set the function \({\hat{F}}\) satisfies a quadratic functional growth condition of the form:
where \(\mu (M) = \frac{\sigma _f}{\theta ^2 (1 + M \sigma _f + 2 \Vert \nabla {{\hat{f}}}({\hat{A}} {\hat{x}}^*)\Vert ^2)}\), with \({\hat{x}}^* \in {\hat{X}}^*\), and \(\theta \) is the Hoffman bound for the optimal polyhedral set \({\hat{X}}^*\). Now, setting \(\zeta = g(x)\) in the previous inequality, we get \(F(x) - F({\overline{x}}) \ge \frac{\mu (M)}{2} \left( \Vert x - {\overline{x}}\Vert ^2 + (\zeta - \zeta ^*)^2\right) \ge \frac{\mu (M)}{2} \Vert x - {\overline{x}}\Vert ^2\) for all \(x\) with \(F(x) - F({\overline{x}}) \le M\). In conclusion, the objective function F satisfies the quadratic functional growth condition (3) on any sublevel set, that is, for any \(M>0\) there exists \(\mu (M)>0\), defined above, such that:
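For readability, the displayed formulas of this example, which appear to have been dropped in extraction, can plausibly be reconstructed from the surrounding definitions as follows (a reconstruction, not the original typeset equations):

```latex
% epigraph reformulation of (1) with polyhedral g:
\min_{x,\,\zeta} \; {\hat{f}}(A^T x) + \zeta
  \quad \text{s.t.} \quad C x + \zeta\, c \le d;
% compact (extended) form, with \hat{x} = [x^T \; \zeta]^T,
% \hat{A} = [A \; 0], \hat{c} = [0 \; 1]^T, \hat{C} = [C \; c], \hat{d} = d:
\min_{\hat{x}} \; {\hat{F}}(\hat{x})
  := {\hat{f}}({\hat{A}}^T \hat{x}) + {\hat{c}}^T \hat{x}
  \quad \text{s.t.} \quad {\hat{C}} \hat{x} \le {\hat{d}};
% quadratic functional growth on the sublevel set
% \{\hat{x} : {\hat{C}}\hat{x} \le {\hat{d}},\;
%   {\hat{F}}(\hat{x}) - {\hat{F}}^* \le M\}:
{\hat{F}}(\hat{x}) - {\hat{F}}^*
  \ge \frac{\mu(M)}{2}\, \Vert \hat{x} - \bar{\hat{x}} \Vert^2,
% with \bar{\hat{x}} the projection of \hat{x} onto {\hat{X}}^*.
```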
The quadratic functional growth condition (3) is a relaxation of the strong convexity notion; see [12] for a more detailed discussion. Clearly, any strongly convex function F satisfies (3) (see, e.g., [19]).
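As a quick numeric sanity check of this growth property, the following scalar toy example (our own choice: \({\hat{f}}(t) = \frac{1}{2}(t-b)^2\), \(g = \lambda |\cdot |\), \(A = a\)) estimates the growth constant by sampling:

```python
# Toy scalar check of quadratic functional growth for
# F(x) = fhat(a*x) + g(x), with fhat strongly convex and g polyhedral:
# here fhat(t) = 0.5*(t - b)**2 and g(x) = lam*|x|.
a, b, lam = 2.0, 3.0, 1.0

def F(x):
    return 0.5 * (a * x - b) ** 2 + lam * abs(x)

# unique minimizer, obtained by soft-thresholding the unregularized solution
x_star = (a * b - lam) / a ** 2  # = 1.25 for these numbers
F_star = F(x_star)

# estimate mu such that F(x) - F* >= (mu/2) * (x - x*)^2 on a grid
ratios = []
for k in range(-200, 401):
    x = k / 100.0
    if abs(x - x_star) > 1e-6:
        ratios.append(2.0 * (F(x) - F_star) / (x - x_star) ** 2)
mu = min(ratios)
print(mu)  # approximately a**2 = 4 here
```

The sampled growth constant is positive (here it matches the strong convexity constant \(\sigma_f\) pushed through the map \(A\)), illustrating condition (3) on the sampled sublevel set.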
Cite this article
Necoara, I. General Convergence Analysis of Stochastic First-Order Methods for Composite Optimization. J Optim Theory Appl 189, 66–95 (2021). https://doi.org/10.1007/s10957-021-01821-2
Keywords
- Stochastic composite convex optimization
- Stochastic bounded gradient
- Quadratic functional growth
- Stochastic first-order algorithms
- Convergence rates