
Stochastic proximal splitting algorithm for composite minimization

  • Original Paper
  • Published in Optimization Letters

Abstract

Supported by recent contributions in multiple domains, first-order splitting methods have become the algorithms of choice for structured nonsmooth optimization. Large-scale, noisy settings make stochastic information on the objective function available, and thus the extension of proximal gradient schemes to stochastic oracles relies heavily on the tractability of the proximal operator of the nonsmooth component, a property extensively exploited in the literature. However, open questions remain about the complexity of composite models with proximally intractable terms. In this paper we tackle composite optimization problems, assuming only access to stochastic information on both the smooth and the nonsmooth components, through a stochastic proximal first-order scheme with stochastic proximal updates. We provide sublinear \(\mathcal {O}\left( \frac{1}{k} \right) \) convergence rates (in expectation of the squared distance to the optimal set) under a strong convexity assumption on the objective function. Moreover, linear convergence is achieved for convex feasibility problems. The empirical behavior is illustrated by numerical tests on parametric sparse representation models.
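To make the scheme described above concrete, the following is a minimal Python sketch of one common form of such an iteration: a stochastic gradient step on a sampled smooth component followed by a proximal step on a sampled nonsmooth component, with a vanishing step size \(\mu _k = \mu _0 / k^{\gamma }\). The problem instance, the component splitting and all names (stochastic_proximal_splitting, soft_threshold, lam, mu0, gamma) are illustrative assumptions, not the authors' SSPG implementation.

```python
# Minimal sketch (illustrative assumptions, not the authors' SSPG code):
# minimize f(x) + h(x) with f(x) = E_i[0.5*(a_i^T x - b_i)^2] and
# h(x) = E_j[lam * n * |x_j|] (so h(x) = lam * ||x||_1), using a stochastic
# gradient step on a sampled smooth component followed by a stochastic
# proximal step on a sampled nonsmooth component.
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t*|.|, applied elementwise."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def stochastic_proximal_splitting(A, b, lam=0.1, mu0=1.0, gamma=0.6,
                                  iters=5000, seed=0):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    for k in range(1, iters + 1):
        mu = mu0 / k**gamma                        # vanishing step size mu_k
        i = rng.integers(m)                        # sample a smooth component
        g = (A[i] @ x - b[i]) * A[i]               # its stochastic gradient
        y = x - mu * g                             # forward (gradient) step
        j = rng.integers(n)                        # sample a nonsmooth component
        y[j] = soft_threshold(y[j], mu * lam * n)  # stochastic proximal step
        x = y
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((200, 50))
    x_true = np.zeros(50)
    x_true[:5] = 1.0
    b = A @ x_true + 0.01 * rng.standard_normal(200)
    print(stochastic_proximal_splitting(A, b)[:8])
```

The point of the sketch is that each iteration only resolves proximally a sampled piece of the nonsmooth term, so the full proximal operator of the nonsmooth component is never required.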


Notes

  1. Data generating code available at https://github.com/pirofti/SSPG.


Acknowledgements

The research of A. Patrascu was supported by a grant of the Romanian Ministry of Education and Research, CNCS - UEFISCDI, project number PN-III-P1-1.1-PD-2019-1123, within PNCDI III. The research of P. Irofti was supported by a grant of the Romanian Ministry of Education and Research, CNCS - UEFISCDI, project number PN-III-P1-1.1-PD-2019-0825, within PNCDI III.

Author information

Corresponding author

Correspondence to Andrei Patrascu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix


Proof (of Corollary 3)

For simplicity, denote \(\theta _k = (1 - \mu _k\sigma _{f})\); then Theorem 2 implies that:

$$\begin{aligned} \mathbb {E}\left[ \Vert x^{k+1}-x^*\Vert ^2 \right]&\le \left( \prod _{i=0}^k \theta _i\right) \Vert x^0-x^*\Vert ^2 + \varSigma \sum \limits _{i=0}^k \left( \prod \limits _{j=i+1}^{k} \theta _j\right) \mu _i^2. \end{aligned}$$

Using the Bernoulli inequality \( 1- tx \le \frac{1}{1 + tx} \le (1 + x)^{-t}\) for \(t \in [0,1]\) and \(x \ge 0\), we have:

$$\begin{aligned} \prod \limits _{i=l}^u \theta _i&= \prod \limits _{i=l}^u \left( 1 - \frac{\mu _0}{i^{\gamma }} \sigma _{f}\right) \le \prod \limits _{i=l}^u (1 + \mu _0 \sigma _f)^{-1/i^{\gamma }} = (1 + \mu _0 \sigma _{f})^{- \sum \limits _{i=l}^u \frac{1}{i^{\gamma }}}. \end{aligned}$$
(18)
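As an added worked check of this chain (not part of the original argument), take \(t = \frac{1}{2}\) and \(x = 1\):

$$\begin{aligned} 1 - tx = \frac{1}{2} \le \frac{1}{1 + tx} = \frac{2}{3} \le (1 + x)^{-t} = 2^{-1/2} \approx 0.707. \end{aligned}$$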

On the other hand, we use the lower bound

$$\begin{aligned} \sum \limits _{i=l}^u \frac{1}{i^{\gamma }} \ge \int \limits _{l}^{u + 1} \frac{1}{\tau ^{\gamma }} d\tau = \varphi _{1-\gamma }(u+1) - \varphi _{1-\gamma }(l). \end{aligned}$$
(19)
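Here and in what follows, \(\varphi _{\alpha }\) is used with the convention \(\varphi _{\alpha }(t) = \frac{t^{\alpha } - 1}{\alpha }\) (the additive constant cancels in the difference above), consistent with its appearance in (20) below. As an added numerical illustration of (19), not part of the original argument, take \(l = 1\), \(u = 4\) and \(\gamma = \frac{1}{2}\):

$$\begin{aligned} \sum \limits _{i=1}^{4} \frac{1}{\sqrt{i}} \approx 2.785 \ge \int \limits _{1}^{5} \frac{d\tau }{\sqrt{\tau }} = 2(\sqrt{5} - 1) \approx 2.472 = \varphi _{1/2}(5) - \varphi _{1/2}(1). \end{aligned}$$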

Combining (18) and (19), we can finally derive:

$$\begin{aligned}&\sum \limits _{i=0}^k \left( \prod \limits _{j=i+1}^{k} \theta _j\right) \mu _i^2 = \sum \limits _{i=0}^m \left( \prod \limits _{j=i+1}^{k} \theta _j\right) \mu _i^2 + \sum \limits _{i=m+1}^k \left( \prod \limits _{j=i+1}^{k} \theta _j\right) \mu _i^2\\&\quad \overset{(18) + (19)}{\le } \sum \limits _{i=0}^m (1 + \mu _0 \sigma _f)^{ \varphi _{1-\gamma }(i+1) - \varphi _{1-\gamma }(k) } \mu _i^2 + \mu _{m+1} \sum \limits _{i=m+1}^k \left[ \prod \limits _{j=i+1}^{k} (1 - \mu _j\sigma _f) \right] \mu _i \\&\quad \le (1 + \mu _0 \sigma _f)^{ \varphi _{1-\gamma }(m) - \varphi _{1-\gamma }(k) } \sum \limits _{i=0}^m \mu _i^2\\&\qquad + \frac{\mu _{m+1}}{\sigma _f} \sum \limits _{i=m+1}^k \left[ \prod \limits _{j=i+1}^{k} (1 - \mu _j\sigma _f) \right] (1 - (1- \sigma _f\mu _i)) \\&\quad = (1 + \mu _0 \sigma _f)^{ \varphi _{1-\gamma }(m) - \varphi _{1-\gamma }(k) } \mu _0^2 \sum \limits _{i=0}^m \frac{1}{i^{2\gamma }} \\&\qquad +\frac{\mu _{m+1}}{\sigma _f} \sum \limits _{i=m+1}^k \left[ \prod \limits _{j=i+1}^{k} (1 - \mu _j\sigma _f) - \prod \limits _{j=i}^{k} (1 - \mu _j\sigma _f) \right] \\&\quad \le (1 + \mu _0 \sigma _f)^{ \varphi _{1-\gamma }(m) - \varphi _{1-\gamma }(k) } \frac{m^{1- 2\gamma } - 1}{1 - 2\gamma } + \frac{\mu _{m+1}}{\sigma _f} \left[ 1 - \prod \limits _{j=m+1}^{k} (1 - \mu _j\sigma _f) \right] \\&\quad \le (1 + \mu _0 \sigma _f)^{ \varphi _{1-\gamma }(m) - \varphi _{1-\gamma }(k) } \varphi _{1 - 2\gamma }(m) + \frac{\mu _{m+1}}{\sigma _f}. \end{aligned}$$

Denoting \(\tilde{\theta }_0 = \frac{1}{1+\mu _0 \sigma _f}\), the last relation implies the following bound:

$$\begin{aligned} \mathbb {E}\left[ \Vert x^{k+1}-x^*\Vert ^2\right] \le \tilde{\theta }_0^{\varphi _{1-\gamma }(k)} \Vert x^{0}-x^*\Vert ^2 + \tilde{\theta }_0^{ \varphi _{1-\gamma }(k) - \varphi _{1-\gamma }(m) } \varphi _{1 - 2\gamma }(m)\varSigma + \frac{\mu _{m+1}}{\sigma _f} \varSigma . \end{aligned}$$

Denote \(r_k^2 = \mathbb {E}[\Vert x^k-x^*\Vert ^2]\). To derive an explicit order of the convergence rate, we analyze upper bounds involving the function \(\varphi \).

(i) First assume that \(\gamma \in (0, \frac{1}{2})\). This implies that \(1 - 2\gamma > 0\) and that:

$$\begin{aligned} \varphi _{1-2\gamma }\left( \left\lfloor \frac{k}{2} \right\rfloor \right) \le \varphi _{1-2\gamma }\left( \frac{k}{2}\right) = \frac{\left( \frac{k}{2} \right) ^{1-2\gamma } - 1}{1-2\gamma }\le \frac{\left( \frac{k}{2} \right) ^{1-2\gamma }}{1-2\gamma }. \end{aligned}$$
(20)

On the other hand, by using the inequality \(e^{-x} \le \frac{1}{1 + x}\) for all \(x \ge 0\), we obtain:

$$\begin{aligned}&\tilde{\theta }_0^{\varphi _{1-\gamma }(k) - \varphi _{1-\gamma }(\frac{k-2}{2})} \varphi _{1-2\gamma }\left( \frac{k}{2}\right) = e^{(\varphi _{1-\gamma }(k) - \varphi _{1-\gamma }(\frac{k-2}{2}))\ln {\tilde{\theta }_0}} \varphi _{1-2\gamma }\left( \frac{k}{2} \right) \\&\quad \le \frac{\varphi _{1-2\gamma }\left( \frac{k}{2} \right) }{1 + [\varphi _{1-\gamma }(k) - \varphi _{1-\gamma }(\frac{k}{2}-1)]\ln {\frac{1}{\tilde{\theta }_0}}} \overset{(20)}{\le } \frac{\frac{k^{1-2\gamma }}{2^{1-2\gamma } (1-2\gamma )} }{\frac{1}{1-\gamma }[k^{1-\gamma } - (\frac{k}{2}-1)^{1-\gamma }]\ln {\frac{1}{\tilde{\theta }_0}}} \\&\quad = \frac{\frac{k^{1-2\gamma }}{2^{1-2\gamma } (1-2\gamma )}}{\frac{k^{1-\gamma }}{1-\gamma }[1 - (\frac{1}{6})^{1-\gamma }]\ln {\frac{1}{\tilde{\theta }_0}}} = \frac{1-\gamma }{1-2\gamma }\frac{2^{\gamma }k^{-\gamma }}{2^{1-2\gamma }[1 - (\frac{1}{6})^{1-\gamma }]\ln {\frac{1}{\tilde{\theta }_0}}} = \mathcal {O}\left( \frac{1}{k^{\gamma }}\right) . \end{aligned}$$

Therefore, in this case, the overall rate will be given by:

$$\begin{aligned} r_{k+1}^2 \le \tilde{\theta }_0^{\mathcal {O}(k^{1-\gamma })}r_0^2 + \mathcal {O}\left( \frac{1}{k^{\gamma }}\right) \approx \mathcal {O}\left( \frac{1}{k^{\gamma }}\right) . \end{aligned}$$

If \(\gamma = \frac{1}{2}\), then the definition of \(\varphi _{1-2\gamma }(\frac{k}{2})\) gives:

$$\begin{aligned} r_{k+1}^2 \le \tilde{\theta }_0^{\mathcal {O}(\sqrt{k})}r_0^2 + \tilde{\theta }_0^{\mathcal {O}(\sqrt{k})}\mathcal {O}(\ln {k}) + \mathcal {O}\left( \frac{1}{\sqrt{k}}\right) \approx \mathcal {O}\left( \frac{1}{\sqrt{k}}\right) . \end{aligned}$$

When \(\gamma \in (\frac{1}{2}, 1)\), it is clear that \(\varphi _{1-2\gamma }\left( \frac{k}{2}\right) \le \frac{1}{2\gamma - 1}\), and therefore the order of the convergence rate becomes:

$$\begin{aligned} r_{k+1}^2 \le \tilde{\theta }_0^{\mathcal {O}(k^{1-\gamma })}[r_0^2 + \mathcal {O}(1)] + \mathcal {O}\left( \frac{1}{k^{\gamma }}\right) \approx \mathcal {O}\left( \frac{1}{k^{\gamma }}\right) . \end{aligned}$$

(ii) Lastly, if \(\gamma = 1\), then by using \(\tilde{\theta }_0^{\ln (k+1)} \le \left( \frac{1}{k}\right) ^{\ln {\frac{1}{\tilde{\theta }_0}}}\) we obtain the second part of our result. \(\square \)
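As a hedged numerical sanity check of the \(\mathcal {O}\left( \frac{1}{k^{\gamma }}\right) \) order derived above, one can iterate the recursion behind Theorem 2 directly; the constants sigma_f, Sigma, mu0 and gamma in the sketch below are arbitrary illustrative choices, not values taken from the paper.

```python
# Hedged illustration: iterate the recursion behind Theorem 2,
#   r_{k+1}^2 <= (1 - mu_k * sigma_f) * r_k^2 + Sigma * mu_k^2,
# with mu_k = mu0 / k^gamma, and monitor r_k^2 * k^gamma.
# All constants are arbitrary illustrative choices.
sigma_f, Sigma, mu0, gamma = 1.0, 1.0, 0.5, 0.6
r2 = 10.0  # r_0^2
for k in range(1, 100001):
    mu = mu0 / k**gamma
    r2 = (1.0 - mu * sigma_f) * r2 + Sigma * mu**2
    if k in (10, 100, 1000, 10000, 100000):
        print(f"k = {k:6d}   r_k^2 = {r2:.3e}   r_k^2 * k^gamma = {r2 * k**gamma:.3f}")
```

For \(\gamma \in (\frac{1}{2}, 1)\), the printed product \(r_k^2 \, k^{\gamma }\) should roughly stabilize, in line with the \(\mathcal {O}\left( \frac{1}{k^{\gamma }}\right) \) rate.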

About this article

Cite this article

Patrascu, A., Irofti, P. Stochastic proximal splitting algorithm for composite minimization. Optim Lett 15, 2255–2273 (2021). https://doi.org/10.1007/s11590-021-01702-7
