Stochastic proximal splitting algorithm for composite minimization

Abstract

Supported by recent contributions in multiple domains, first-order splitting methods have become the algorithms of choice for structured nonsmooth optimization. Large-scale noisy settings typically provide only stochastic information on the objective function, so the extension of proximal gradient schemes to stochastic oracles relies heavily on the tractability of the proximal operator of the nonsmooth component, a property widely exploited in the literature. However, open questions remained about the complexity of composite models with proximally intractable terms. In this paper we tackle composite optimization problems, assuming access only to stochastic information on both the smooth and the nonsmooth components, via a stochastic first-order scheme with stochastic proximal updates. We provide sublinear \(\mathcal {O}\left( \frac{1}{k} \right) \) convergence rates (in expectation of the squared distance to the optimal set) under a strong convexity assumption on the objective function. Moreover, linear convergence is achieved for convex feasibility problems. The empirical behavior is illustrated by numerical tests on parametric sparse representation models.
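As a concrete illustration of the building block behind such schemes, the sketch below runs a one-sample stochastic proximal-gradient loop on a lasso-type model (gradient step on a randomly sampled smooth term, closed-form soft-thresholding for the \(\ell _1\) term). This is only the classical update, not the paper's scheme, whose update also treats the nonsmooth part through a stochastic proximal step; all data and constants are hypothetical.

```python
import random

def soft_threshold(v, tau):
    """Closed-form proximal operator of tau*||.||_1, applied coordinatewise."""
    return [(1 if vi >= 0 else -1) * max(abs(vi) - tau, 0.0) for vi in v]

def stochastic_prox_grad(A, b, lam, steps=2000, mu0=0.5, seed=0):
    """One-sample stochastic proximal-gradient iterations for
    min_x (1/2n)*||Ax - b||^2 + lam*||x||_1 (illustrative constants)."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d
    for k in range(1, steps + 1):
        mu = mu0 / k                       # vanishing stepsize, as in the analysis
        i = rng.randrange(n)               # sample one smooth term
        resid = sum(A[i][j] * x[j] for j in range(d)) - b[i]
        grad_step = [x[j] - mu * resid * A[i][j] for j in range(d)]
        x = soft_threshold(grad_step, mu * lam)
    return x
```

On a small consistent system the iterates drive the composite objective well below its value at the origin, even though each step sees a single sampled row.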


Fig. 1
Fig. 2

Notes

  1. Data generating code available at https://github.com/pirofti/SSPG.


Acknowledgements

The research of A. Patrascu was supported by a grant of the Romanian Ministry of Education and Research, CNCS-UEFISCDI, project number PN-III-P1-1.1-PD-2019-1123, within PNCDI III. The research of P. Irofti was supported by a grant of the Romanian Ministry of Education and Research, CNCS-UEFISCDI, project number PN-III-P1-1.1-PD-2019-0825, within PNCDI III.

Author information

Corresponding author

Correspondence to Andrei Patrascu.


Appendix

Proof (of Corollary 3)

For simplicity, denote \(\theta _k = 1 - \mu _k\sigma _{f}\); then Theorem 2 implies that:

$$\begin{aligned} \mathbb {E}\left[ \Vert x^{k+1}-x^*\Vert ^2 \right]&\le \left( \prod _{i=0}^k \theta _i\right) \Vert x^0-x^*\Vert ^2 + \varSigma \sum \limits _{i=0}^k \left( \prod \limits _{j=i+1}^{k} \theta _j\right) \mu _i^2. \end{aligned}$$

By using the Bernoulli inequality \( 1- tx \le \frac{1}{1 + tx} \le (1 + x)^{-t}\) for \(t \in [0,1]\), \(x \ge 0\), we have:

$$\begin{aligned} \prod \limits _{i=l}^u \theta _i&= \prod \limits _{i=l}^u \left( 1 - \frac{\mu _0}{i^{\gamma }} \sigma _{f}\right) \le \prod \limits _{i=l}^u (1 + \mu _0 \sigma _f)^{-1/i^{\gamma }} = (1 + \mu _0 \sigma _{f})^{- \sum \limits _{i=l}^u \frac{1}{i^{\gamma }}}. \end{aligned}$$
(18)
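The inequality chain and the resulting product bound (18) are easy to sanity-check numerically; the snippet below verifies both for a few sampled parameters (the constants are arbitrary illustrative choices).

```python
def check_bernoulli_chain(t, x):
    """Verify 1 - t*x <= 1/(1 + t*x) <= (1 + x)**(-t) for t in [0, 1], x >= 0."""
    lhs, mid, rhs = 1 - t * x, 1 / (1 + t * x), (1 + x) ** (-t)
    return lhs <= mid + 1e-12 and mid <= rhs + 1e-12

def check_product_bound(mu0, sigma_f, gamma, l, u):
    """Verify (18): prod_{i=l}^{u} (1 - mu0*sigma_f/i**gamma)
    <= (1 + mu0*sigma_f)**(-sum_{i=l}^{u} i**(-gamma))."""
    prod, s = 1.0, 0.0
    for i in range(l, u + 1):
        prod *= 1 - mu0 * sigma_f / i ** gamma   # theta_i with mu_i = mu0/i**gamma
        s += i ** (-gamma)
    return prod <= (1 + mu0 * sigma_f) ** (-s) + 1e-12
```

Each factor instantiates the chain with \(t = 1/i^{\gamma } \in (0,1]\) and \(x = \mu _0\sigma _f\), which is exactly how (18) is obtained.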

On the other hand, we use the lower bound

$$\begin{aligned} \sum \limits _{i=l}^u \frac{1}{i^{\gamma }} \ge \int \limits _{l}^{u + 1} \frac{1}{\tau ^{\gamma }} d\tau = \varphi _{1-\gamma }(u+1) - \varphi _{1-\gamma }(l). \end{aligned}$$
(19)

Combining (18) and (19), we can finally derive:

$$\begin{aligned}&\sum \limits _{i=0}^k \left( \prod \limits _{j=i+1}^{k} \theta _j\right) \mu _i^2 = \sum \limits _{i=0}^m \left( \prod \limits _{j=i+1}^{k} \theta _j\right) \mu _i^2 + \sum \limits _{i=m+1}^k \left( \prod \limits _{j=i+1}^{k} \theta _j\right) \mu _i^2\\&\quad \overset{(18) + (19)}{\le } \sum \limits _{i=0}^m (1 + \mu _0 \sigma _f)^{ \varphi _{1-\gamma }(i+1) - \varphi _{1-\gamma }(k) } \mu _i^2 + \mu _{m+1} \sum \limits _{i=m+1}^k \left[ \prod \limits _{j=i+1}^{k} (1 - \mu _j\sigma _f) \right] \mu _i \\&\quad \le (1 + \mu _0 \sigma _f)^{ \varphi _{1-\gamma }(m) - \varphi _{1-\gamma }(k) } \sum \limits _{i=0}^m \mu _i^2\\&\qquad + \frac{\mu _{m+1}}{\sigma _f} \sum \limits _{i=m+1}^k \left[ \prod \limits _{j=i+1}^{k} (1 - \mu _j\sigma _f) \right] (1 - (1- \sigma _f\mu _i)) \\&\quad = (1 + \mu _0 \sigma _f)^{ \varphi _{1-\gamma }(m) - \varphi _{1-\gamma }(k) } \mu _0^2 \sum \limits _{i=0}^m \frac{1}{i^{2\gamma }} \\&\qquad +\frac{\mu _{m+1}}{\sigma _f} \sum \limits _{i=m+1}^k \left[ \prod \limits _{j=i+1}^{k} (1 - \mu _j\sigma _f) - \prod \limits _{j=i}^{k} (1 - \mu _j\sigma _f) \right] \\&\quad \le (1 + \mu _0 \sigma _f)^{ \varphi _{1-\gamma }(m) - \varphi _{1-\gamma }(k) } \frac{m^{1- 2\gamma } - 1}{1 - 2\gamma } + \frac{\mu _{m+1}}{\sigma _f} \left[ 1 - \prod \limits _{j=m+1}^{k} (1 - \mu _j\sigma _f) \right] \\&\quad \le (1 + \mu _0 \sigma _f)^{ \varphi _{1-\gamma }(m) - \varphi _{1-\gamma }(k) } \varphi _{1 - 2\gamma }(m) + \frac{\mu _{m+1}}{\sigma _f}. \end{aligned}$$
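Since \(\tau \mapsto \tau ^{-\gamma }\) is decreasing, (19) is the standard left-endpoint integral comparison; a quick numerical check, assuming the closed form \(\varphi _a(t) = (t^a - 1)/a\) (consistent with (20) below):

```python
def varphi(a, t):
    """Assumed closed form varphi_a(t) = (t**a - 1)/a, matching (20);
    any additive constant cancels in the difference used in (19)."""
    return (t ** a - 1) / a

def check_integral_bound(gamma, l, u):
    """Verify (19): sum_{i=l}^{u} i**(-gamma)
    >= varphi_{1-gamma}(u+1) - varphi_{1-gamma}(l)."""
    s = sum(i ** (-gamma) for i in range(l, u + 1))
    return s >= varphi(1 - gamma, u + 1) - varphi(1 - gamma, l)
```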

By denoting the second constant \(\tilde{\theta }_0 = \frac{1}{1+\mu _0 \sigma _f}\), then the last relation implies the following bound:

$$\begin{aligned} \mathbb {E}\left[ \Vert x^{k+1}-x^*\Vert ^2\right] \le \tilde{\theta }_0^{\varphi _{1-\gamma }(k)} \Vert x^{0}-x^*\Vert ^2 + \tilde{\theta }_0^{ \varphi _{1-\gamma }(k) - \varphi _{1-\gamma }(m) } \varphi _{1 - 2\gamma }(m)\varSigma + \frac{\mu _{m+1}}{\sigma _f} \varSigma . \end{aligned}$$

Denote \(r_k^2 = \mathbb {E}[\Vert x^k-x^*\Vert ^2]\). To derive an explicit convergence rate order we analyze upper bounds on the function \(\varphi \).

(i) First assume that \(\gamma \in (0, \frac{1}{2})\). This implies that \(1 - 2\gamma > 0\) and that:

$$\begin{aligned} \varphi _{1-2\gamma }\left( \left\lfloor \frac{k}{2} \right\rfloor \right) \le \varphi _{1-2\gamma }\left( \frac{k}{2}\right) = \frac{\left( \frac{k}{2} \right) ^{1-2\gamma } - 1}{1-2\gamma }\le \frac{\left( \frac{k}{2} \right) ^{1-2\gamma }}{1-2\gamma }. \end{aligned}$$
(20)

On the other hand, by using the inequality \(e^{-x} \le \frac{1}{1 + x}\) for all \(x \ge 0\), we obtain:

$$\begin{aligned}&\tilde{\theta }_0^{\varphi _{1-\gamma }(k) - \varphi _{1-\gamma }(\frac{k-2}{2})} \varphi _{1-2\gamma }\left( \frac{k}{2}\right) = e^{(\varphi _{1-\gamma }(k) - \varphi _{1-\gamma }(\frac{k-2}{2}))\ln {\tilde{\theta }_0}} \varphi _{1-2\gamma }\left( \frac{k}{2} \right) \\&\quad \le \frac{\varphi _{1-2\gamma }\left( \frac{k}{2} \right) }{1 + [\varphi _{1-\gamma }(k) - \varphi _{1-\gamma }(\frac{k}{2}-1)]\ln {\frac{1}{\tilde{\theta }_0}}} \overset{(20)}{\le } \frac{\frac{k^{1-2\gamma }}{2^{1-2\gamma } (1-2\gamma )} }{\frac{1}{1-\gamma }[k^{1-\gamma } - (\frac{k}{2}-1)^{1-\gamma }]\ln {\frac{1}{\tilde{\theta }_0}}} \\&\quad = \frac{\frac{k^{1-2\gamma }}{2^{1-2\gamma } (1-2\gamma )}}{\frac{k^{1-\gamma }}{1-\gamma }[1 - (\frac{1}{6})^{1-\gamma }]\ln {\frac{1}{\tilde{\theta }_0}}} = \frac{1-\gamma }{1-2\gamma }\frac{2^{\gamma }k^{-\gamma }}{2^{1-2\gamma }[1 - (\frac{1}{6})^{1-\gamma }]\ln {\frac{1}{\tilde{\theta }_0}}} = \mathcal {O}\left( \frac{1}{k^{\gamma }}\right) . \end{aligned}$$

Therefore, in this case, the overall rate will be given by:

$$\begin{aligned} r_{k+1}^2 \le \tilde{\theta }_0^{\mathcal {O}(k^{1-\gamma })}r_0^2 + \mathcal {O}\left( \frac{1}{k^{\gamma }}\right) \approx \mathcal {O}\left( \frac{1}{k^{\gamma }}\right) . \end{aligned}$$

If \(\gamma = \frac{1}{2}\), then the definition of \(\varphi _{1-2\gamma }(\frac{k}{2})\) yields:

$$\begin{aligned} r_{k+1}^2 \le \tilde{\theta }_0^{\mathcal {O}(\sqrt{k})}r_0^2 + \tilde{\theta }_0^{\mathcal {O}(\sqrt{k})}\mathcal {O}(\ln {k}) + \mathcal {O}\left( \frac{1}{\sqrt{k}}\right) \approx \mathcal {O}\left( \frac{1}{\sqrt{k}}\right) . \end{aligned}$$

When \(\gamma \in (\frac{1}{2}, 1)\), it is obvious that \(\varphi _{1-2\gamma }\left( \frac{k}{2}\right) \le \frac{1}{2\gamma - 1}\) and therefore the order of the convergence rate changes into:

$$\begin{aligned} r_{k+1}^2 \le \tilde{\theta }_0^{\mathcal {O}(k^{1-\gamma })}[r_0^2 + \mathcal {O}(1)] + \mathcal {O}\left( \frac{1}{k^{\gamma }}\right) \approx \mathcal {O}\left( \frac{1}{k^{\gamma }}\right) . \end{aligned}$$
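The predicted \(\mathcal {O}(1/k^{\gamma })\) decay of \(r_k^2\) can be observed by iterating the underlying recursion directly. The sketch below uses hypothetical constants, chosen only so that \(1 - \mu _k\sigma _f \in [0,1)\) for all \(k\).

```python
def iterate_recursion(gamma, K, mu0=1.0, sigma_f=1.0, Sigma=1.0, r0_sq=10.0):
    """Iterate r_{k+1}^2 = (1 - mu_k*sigma_f) * r_k^2 + Sigma * mu_k^2
    with mu_k = mu0 / k**gamma, the recursion behind the bound of Theorem 2."""
    r_sq = r0_sq
    for k in range(1, K + 1):
        mu = mu0 / k ** gamma
        r_sq = (1 - mu * sigma_f) * r_sq + Sigma * mu * mu
    return r_sq
```

For \(\gamma = \frac{1}{2}\), the product \(r_K^2 \sqrt{K}\) stays bounded as \(K\) grows, matching the \(\mathcal {O}(1/\sqrt{K})\) rate above.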

(ii) Lastly, if \(\gamma = 1\), by using \(\tilde{\theta }_0^{\ln (k+1)} \le \left( \frac{1}{k}\right) ^{\ln {\frac{1}{\tilde{\theta }_0}}}\) we obtain the second part of our result. \(\square \)


About this article


Cite this article

Patrascu, A., Irofti, P. Stochastic proximal splitting algorithm for composite minimization. Optim Lett (2021). https://doi.org/10.1007/s11590-021-01702-7


Keywords

  • Stochastic proximal gradient algorithm
  • Sublinear convergence rate
  • Parametric sparse representation
  • Linear convergence rate
  • Proximal point
  • Moreau envelope