Dualize, Split, Randomize: Toward Fast Nonsmooth Optimization Algorithms

Journal of Optimization Theory and Applications

Abstract

We consider minimizing the sum of three convex functions, where the first one, F, is smooth, the second one is nonsmooth and proximable, and the third one is the composition of a nonsmooth proximable function with a linear operator L. This template problem has many applications, for instance, in image processing and machine learning. First, we propose a new primal–dual algorithm, which we call PDDY, for this problem. It is constructed by applying Davis–Yin splitting to a monotone inclusion in a primal–dual product space, where the operators are monotone under a specific metric depending on L. We show that three existing algorithms (the two forms of the Condat–Vũ algorithm and the PD3O algorithm) have the same structure, so that PDDY is the fourth missing link in this self-consistent class of primal–dual algorithms. This representation eases the convergence analysis: it allows us to derive sublinear convergence rates in general, and linear convergence results in the presence of strong convexity. Moreover, within our broad and flexible analysis framework, we propose new stochastic generalizations of the algorithms, in which a variance-reduced random estimate of the gradient of F is used instead of the true gradient. Furthermore, we obtain, as a special case of PDDY, a linearly converging algorithm for the minimization of a strongly convex function F under a linear constraint; we discuss its important application to decentralized optimization.
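In the notation used in the appendices below, where \(F\) is the smooth term, \(R\) the nonsmooth proximable term and \(H\) the nonsmooth proximable term composed with the linear operator \(L\), this template problem (Problem (1) in the main text) reads

$$\begin{aligned} \min _{x \in {\mathcal {X}}} \; F(x) + R(x) + H(Lx). \end{aligned}$$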


References

  1. Alghunaim, S.A., Ryu, E.K., Yuan, K., Sayed, A.H.: Decentralized proximal gradient algorithms with linear convergence rates. IEEE Trans. Autom. Control 66(6), 2787–2794 (2021)

  2. Alotaibi, A., Combettes, P.L., Shahzad, N.: Solving coupled composite monotone inclusions by successive Fejér approximations of their Kuhn-Tucker set. SIAM J. Optim. 24(4), 2076–2095 (2014)

  3. Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn. 4(1), 1–106 (2012)

  4. Basu, D., Data, D., Karakus, C., Diggavi, S.N.: Qsparse-Local-SGD: distributed SGD with quantization, sparsification, and local computations. IEEE J. Select. Areas Inform. Theor. 1(1), 217–226 (2020)

  5. Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces, 2nd edn. Springer, New York (2017)

  6. Beck, A.: First-Order Methods in Optimization. MOS-SIAM Series on Optimization. SIAM (2017)

  7. Boţ, R.I., Csetnek, E.R., Hendrich, C.: Recent developments on primal-dual splitting methods with applications to convex minimization. In: Pardalos, P.M., Rassias, T.M. (eds.) Mathematics Without Boundaries: Surveys in Interdisciplinary Research, pp. 57–99. Springer, New York (2014)

  8. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)

  9. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2004)

  10. Bredies, K., Kunisch, K., Pock, T.: Total generalized variation. SIAM J. Imaging Sci. 3(3), 492–526 (2010)

  11. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)

  12. Chambolle, A., Pock, T.: An introduction to continuous optimization for imaging. Acta Numer. 25, 161–319 (2016)

  13. Chambolle, A., Pock, T.: On the ergodic convergence rates of a first-order primal-dual algorithm. Math. Program. 159(1–2), 253–287 (2016)

  14. Chang, C.C., Lin, C.J.: LibSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)

  15. Chen, P., Huang, J., Zhang, X.: A primal-dual fixed point algorithm for convex separable minimization with applications to image restoration. Inverse Probl. 29(2), 025011 (2013)

  16. Combettes, P.L., Condat, L., Pesquet, J.C., Vũ, B.C.: A forward–backward view of some primal–dual optimization methods in image recovery. In: Proc. of IEEE ICIP. Paris, France (2014)

  17. Combettes, P.L., Eckstein, J.: Asynchronous block-iterative primal-dual decomposition methods for monotone inclusions. Math. Program. 168(1–2), 645–672 (2018)

  18. Combettes, P.L., Glaudin, L.E.: Proximal activation of smooth functions in splitting algorithms for convex image recovery. SIAM J. Imaging Sci. 12(4), 1905–1935 (2019)

  19. Combettes, P.L., Pesquet, J.C.: Proximal splitting methods in signal processing. In: Bauschke, H.H., Burachik, R., Combettes, P.L., Elser, V., Luke, D.R., Wolkowicz, H. (eds.) Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer-Verlag, New York (2010)

  20. Combettes, P.L., Pesquet, J.C.: Primal-dual splitting algorithm for solving inclusions with mixtures of composite, Lipschitzian, and parallel-sum type monotone operators. Set-Val. Var. Anal. 20(2), 307–330 (2012)

  21. Combettes, P.L., Pesquet, J.C.: Fixed point strategies in data science. IEEE Trans. Signal Process. 69, 3878–3905 (2021)

  22. Condat, L.: A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J. Optim. Theory Appl. 158(2), 460–479 (2013)

  23. Condat, L.: A generic proximal algorithm for convex optimization–application to total variation minimization. IEEE Signal Process. Lett. 21(8), 1054–1057 (2014)

  24. Condat, L.: Discrete total variation: new definition and minimization. SIAM J. Imaging Sci. 10(3), 1258–1290 (2017)

  25. Condat, L., Kitahara, D., Contreras, A., Hirabayashi, A.: Proximal splitting algorithms for convex optimization: a tour of recent advances, with new twists. SIAM Rev., to appear (2022)

  26. Condat, L., Malinovsky, G., Richtárik, P.: Distributed proximal splitting algorithms with rates and acceleration. Front. Signal Process. (2022). https://doi.org/10.3389/frsip.2021.776825

  27. Couprie, C., Grady, L., Najman, L., Pesquet, J.C., Talbot, H.: Dual constrained TV-based regularization on graphs. SIAM J. Imaging Sci. 6(3), 1246–1273 (2013)

  28. Davis, D., Yin, W.: A three-operator splitting scheme and its optimization applications. Set-Val. Var. Anal. 25, 829–858 (2017)

  29. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, K. Weinberger (eds.) Advances in Neural Information Processing Systems, vol. 27. Curran Associates, Inc. (2014)

  30. Drori, Y., Sabach, S., Teboulle, M.: A simple algorithm for a class of nonsmooth convex concave saddle-point problems. Oper. Res. Lett. 43(2), 209–214 (2015)

  31. Duran, J., Moeller, M., Sbert, C., Cremers, D.: Collaborative total variation: A general framework for vectorial TV models. SIAM J. Imaging Sci. 9(1), 116–151 (2016)

  32. Eckstein, J., Bertsekas, D.P.: On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55, 293–318 (1992)

  33. Eckstein, J., Svaiter, B.F.: A family of projective splitting methods for the sum of two maximal monotone operators. Math. Program. 111(1), 173–199 (2008)

  34. Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 2(1), 17–40 (1976)

  35. Glowinski, R., Marrocco, A.: Sur l’approximation par éléments finis d’ordre un, et la résolution par pénalisation-dualité d’une classe de problèmes de Dirichlet non linéaires. Revue Française d’Automatique, Informatique et Recherche Opérationnelle 9, 41–76 (1975)

  36. Gorbunov, E., Hanzely, F., Richtárik, P.: A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. In: S. Chiappa, R. Calandra (eds.) Proc. of Int. Conf. Artif. Intell. Stat. (AISTATS), vol. PMLR 108, pp. 680–690 (2020)

  37. Gower, R.M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., Richtárik, P.: SGD: General analysis and improved rates. In: K. Chaudhuri, R. Salakhutdinov (eds.) Proc. of 36th Int. Conf. Machine Learning (ICML), vol. PMLR 97, pp. 5200–5209 (2019)

  38. Gower, R.M., Schmidt, M., Bach, F., Richtárik, P.: Variance-reduced methods for machine learning. Proc. IEEE 108(11), 1968–1983 (2020)

  39. Hofmann, T., Lucchi, A., Lacoste-Julien, S., McWilliams, B.: Variance reduced stochastic gradient descent with neighbors. In: C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R. Garnett (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 2305–2313. Curran Associates, Inc. (2015)

  40. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Weinberger (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 315–323. Curran Associates, Inc. (2013)

  41. Johnstone, P.R., Eckstein, J.: Convergence rates for projective splitting. SIAM J. Optim. 29(3), 1931–1957 (2019)

  42. Johnstone, P.R., Eckstein, J.: Single-forward-step projective splitting: exploiting cocoercivity. Comput. Optim. Appl. 78(1), 125–166 (2021)

  43. Johnstone, P.R., Eckstein, J.: Projective splitting with forward steps. Math. Program. 191, 631–670 (2022)

  44. Komodakis, N., Pesquet, J.C.: Playing with duality: an overview of recent primal-dual approaches for solving large-scale optimization problems. IEEE Signal Process. Mag. 32(6), 31–54 (2015)

  45. Kovalev, D., Horváth, S., Richtárik, P.: Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In: A. Kontorovich, G. Neu (eds.) Proc. of Int. Conf. Algo. Learn. Theory (ALT), vol. PMLR 117, pp. 451–467 (2020)

  46. Kovalev, D., Salim, A., Richtárik, P.: Optimal and practical algorithms for smooth and strongly convex decentralized optimization. In: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (eds.) Advances in Neural Information Processing Systems, vol. 33, pp. 18342–18352. Curran Associates, Inc. (2020)

  47. Lan, G.: First-Order and Stochastic Optimization Methods for Machine Learning. Springer, Cham (2020)

  48. LeCun, Y., Cortes, C.: MNIST handwritten digit database (2010). http://yann.lecun.com/exdb/mnist/

  49. Li, H., Lin, Z.: Revisiting EXTRA for smooth distributed optimization. SIAM J. Optim. 30(3), 1795–1821 (2020)

  50. Li, T., Sahu, A.K., Talwalkar, A., Smith, V.: Federated learning: challenges, methods, and future directions. IEEE Signal Process. Mag. 37(3), 50–60 (2020)

  51. Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979)

  52. Loris, I., Verhoeven, C.: On a generalization of the iterative soft-thresholding algorithm for the case of non-separable penalty. Inverse Probl. 27(12) (2011)

  53. Mokhtari, A., Ribeiro, A.: DSA: Decentralized double stochastic averaging gradient algorithm. J. Mach. Learn. Res. 17(1), 2165–2199 (2016)

  54. Nesterov, Y.: Lectures on Convex Optimization, vol. 137. Springer (2018)

  55. O’Connor, D., Vandenberghe, L.: On the equivalence of the primal-dual hybrid gradient method and Douglas-Rachford splitting. Math. Program. 179, 85–108 (2020)

  56. Palomar, D.P., Eldar, Y.C. (eds.): Convex Optimization in Signal Processing and Communications. Cambridge University Press (2009)

  57. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 3(1), 127–239 (2014)

  58. Pedregosa, F., Fatras, K., Casotto, M.: Proximal splitting meets variance reduction. In: K. Chaudhuri, M. Sugiyama (eds.) Proc. of Int. Conf. Artif. Intell. Stat. (AISTATS), vol. PMLR 89, pp. 1–10 (2019)

  59. Polson, N.G., Scott, J.G., Willard, B.T.: Proximal algorithms in statistics and machine learning. Statist. Sci. 30(4), 559–581 (2015)

  60. Pustelnik, N., Condat, L.: Proximity operator of a sum of functions; application to depth map estimation. IEEE Signal Process. Lett. 24(12), 1827–1831 (2017)

  61. Rudin, L., Osher, S., Fatemi, E.: Nonlinear total variation based noise removal algorithms. Phys. D 60(1–4), 259–268 (1992)

  62. Ryu, E.K.: Uniqueness of DRS as the 2 operator resolvent-splitting and impossibility of 3 operator resolvent-splitting. Math. Program. 182, 233–273 (2020)

  63. Salim, A., Bianchi, P., Hachem, W.: Snake: a stochastic proximal gradient algorithm for regularized problems over large graphs. IEEE Trans. Automat. Contr. 64(5), 1832–1847 (2019)

  64. Salim, A., Condat, L., Kovalev, D., Richtárik, P.: An optimal algorithm for strongly convex minimization under affine constraints. In: G. Camps-Valls, F.J.R. Ruiz, I. Valera (eds.) Proc. of Int. Conf. Artif. Intell. Stat. (AISTATS), vol. PMLR 151, pp. 4482–4498 (2022)

  65. Sattler, F., Wiedemann, S., Müller, K.-R., Samek, W.: Robust and communication-efficient federated learning from non-i.i.d. data. IEEE Trans. Neural Netw. Learn. Syst. 31(9), 3400–3413 (2020)

  66. Scaman, K., Bach, F., Bubeck, S., Lee, Y.T., Massoulié, L.: Optimal algorithms for smooth and strongly convex distributed optimization in networks. In: D. Precup, Y.W. Teh (eds.) Proc. of 34th Int. Conf. Machine Learning (ICML), vol. PMLR 70, pp. 3027–3036 (2017)

  67. Shi, W., Ling, Q., Wu, G., Yin, W.: EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM J. Optim. 25(2), 944–966 (2015)

  68. Starck, J.L., Murtagh, F., Fadili, J.: Sparse Image and Signal Processing: Wavelets, Curvelets, Morphological Diversity. Cambridge University Press (2010)

  69. Stathopoulos, G., Shukla, H., Szucs, A., Pu, Y., Jones, C.N.: Operator splitting methods in control. Found. Trends Syst. Control 3(3), 249–362 (2016)

  70. Svaiter, B.F.: On weak convergence of the Douglas-Rachford method. SIAM J. Control. Optim. 49(1), 280–287 (2011)

  71. Tay, J.K., Friedman, J., Tibshirani, R.: Principal component-guided sparse regression. Can. J. Stat. 49, 1222–1257 (2021)

  72. Vũ, B.C.: A splitting algorithm for dual monotone inclusions involving cocoercive operators. Adv. Comput. Math. 38(3), 667–681 (2013)

  73. Wright, S.J.: Coordinate descent algorithms. Math. Program. 151, 3–34 (2015)

  74. Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)

  75. Xu, H., Ho, C.Y., Abdelmoniem, A.M., Dutta, A., Bergou, E.H., Karatsenidis, K., Canini, M., Kalnis, P.: GRACE: A compressed communication framework for distributed machine learning. In: Proc. of 41st IEEE Int. Conf. Distributed Computing Systems (ICDCS), pp. 561–572 (2021)

  76. Xu, J., Tian, Y., Sun, Y., Scutari, G.: Distributed algorithms for composite optimization: unified framework and convergence analysis. IEEE Trans. Signal Process. 69, 3555–3570 (2021)

  77. Yan, M.: A new Primal-Dual algorithm for minimizing the sum of three functions with a linear operator. J. Sci. Comput. 76(3), 1698–1717 (2018)

  78. Yurtsever, A., Vu, B.C., Cevher, V.: Stochastic three-composite convex minimization. In: D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, R. Garnett (eds.) Advances in Neural Information Processing Systems, vol. 29, pp. 4329–4337. Curran Associates, Inc. (2016)

  79. Zhang, L., Mahdavi, M., Jin, R.: Linear convergence with condition number independent access of full gradients. In: C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K. Weinberger (eds.) Advances in Neural Information Processing Systems, vol. 26, pp. 980–988. Curran Associates, Inc. (2013)

  80. Zhao, R., Cevher, V.: Stochastic three-composite convex minimization with a linear operator. In: A. Storkey, F. Perez-Cruz (eds.) Proc. of Int. Conf. Artif. Intell. Stat. (AISTATS), vol. PMLR 84, pp. 765–774 (2018)

Author information

Correspondence to Laurent Condat.

Additional information

Communicated by Shoham Sabach.

Appendices

A Lemmas A.1 and A.2

We state Lemma A.1 and Lemma A.2, which are used in the proofs of Theorem 5.1 and Theorem 5.2, respectively.

To simplify the notation, we use the following convention: when a set appears in an equation where a single element is expected, e.g., \(\partial R (x^k)\), the equation is understood to hold for some element of this nonempty set.
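For instance, the inner product \(\langle \partial R(x^{k}) - \partial R(x^\star ),x^{k} - x^\star \rangle \) appearing below should be read as

$$\begin{aligned} \langle r^{k} - r^\star ,x^{k} - x^\star \rangle \quad \text {for some } r^{k}\in \partial R(x^{k}) \text { and } r^\star \in \partial R(x^\star ). \end{aligned}$$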

Lemma A.1

Assume that F is \(\mu _F\)-strongly convex, for some \(\mu _F \ge 0\), and that \((g^k)_{k\in {\mathbb {N}}}\) satisfies Assumption 1. Then, the iterates of the Stochastic PD3O Algorithm satisfy

$$\begin{aligned}&{\mathbb {E}}_k \Vert v^{k+1} - v^\star \Vert _P^2 + \kappa \gamma ^2{\mathbb {E}}_k\sigma _{k+1}^2 \le \Vert v^k - v^\star \Vert _P^2 + \kappa \gamma ^2\left( 1-\rho +\frac{\beta }{\kappa }\right) \sigma _k^2 \nonumber \\&\quad -2\gamma (1-\gamma (\alpha +\kappa \delta )) D_F(x^{k},x^\star )-\gamma \mu _F\Vert x^{k}-x^\star \Vert ^2 \nonumber \\&\quad -2\gamma \langle \partial R(x^{k}) - \partial R(x^\star ),x^{k} - x^\star \rangle \nonumber \\&\quad -2\gamma {\mathbb {E}}_k\langle \partial H^*(d^{k+1}) - \partial H^*(d^\star ),d^{k+1} - d^\star \rangle \nonumber \\&\quad -\gamma ^2{\mathbb {E}}_k\big \Vert P^{-1}A(u^{k+1})+P^{-1}B(z^{k})- \left( P^{-1}A(u^{\star })+P^{-1}B(z^{\star })\right) \big \Vert _P^2. \end{aligned}$$
(22)

Proof

Applying Lemma 3.2 for \(\text {DYS}(P^{-1}A,P^{-1}B,P^{-1}C)\) using the norm induced by P, we have

$$\begin{aligned} \Vert v^{k+1} - v^\star \Vert _P^2&={} \Vert v^k - v^\star \Vert _P^2 -2\gamma \langle P^{-1}B(z^{k}) - P^{-1}B(z^\star ),z^{k} - z^\star \rangle _P\\&\quad -2\gamma \langle P^{-1}C(z^{k}) - P^{-1}C(z^\star ),z^{k} - z^\star \rangle _P\\&\quad +\gamma ^2\Vert P^{-1}C(z^{k}) - P^{-1}C(z^\star )\Vert _P^2\\&\quad -2\gamma \langle P^{-1}A(u^{k+1}) - P^{-1}A(u^\star ),u^{k+1} - u^\star \rangle _P\\&\quad -\gamma ^2\Vert P^{-1}A(u^{k+1})+P^{-1}B(z^{k}) - \left( P^{-1}A(u^{\star })+P^{-1}B(z^{\star })\right) \Vert _P^2\\&= \Vert v^k - v^\star \Vert _P^2 -2\gamma \langle B(z^{k}) - B(z^\star ),z^{k} - z^\star \rangle \\&\quad +\gamma ^2\Vert P^{-1}C(z^{k}) - P^{-1}C(z^\star )\Vert _P^2\\&\quad -2\gamma \langle C(z^{k}) - C(z^\star ),z^{k} - z^\star \rangle -2\gamma \langle A(u^{k+1}) - A(u^\star ),u^{k+1} - u^\star \rangle \\&\quad -\gamma ^2\Vert P^{-1}A(u^{k+1})+P^{-1}B(z^{k}) - \left( P^{-1}A(u^{\star })+P^{-1}B(z^{\star })\right) \Vert _P^2. \end{aligned}$$

Using \( A(u^{k+1}) = \big ( L^* d^{k+1} , -L s^{k+1}+\partial H^*(d^{k+1})\big ), B(z^k) = \big (\partial R(x^{k}) ,0 \big ), C(z^k) = \big ( g^{k+1},0\big ) \) and \( A(u^{\star }) = \big ( L^* d^\star , -L s^\star +\partial H^*(d^\star )\big ), B(z^\star ) = \big (\partial R(x^\star ) ,0 \big ), C(z^\star ) = \big ( \nabla F(x^\star ),0\big ) \), we have

$$\begin{aligned} \Vert v^{k+1} - v^\star \Vert _P^2&={} \Vert v^k - v^\star \Vert _P^2 -2\gamma \langle \partial R(x^{k}) - \partial R(x^\star ),x^{k} - x^\star \rangle \\&\quad +\gamma ^2\Vert g^{k+1} - \nabla F(x^\star )\Vert ^2\\&\quad -2\gamma \langle g^{k+1} - \nabla F(x^\star ),x^{k} - x^\star \rangle -2\gamma \langle \partial H^*(d^{k+1})\\&\quad - \partial H^*(d^\star ),d^{k+1} - d^\star \rangle \\&\quad -\gamma ^2\Vert P^{-1}A(u^{k+1})+P^{-1}B(z^{k})- \left( P^{-1}A(u^{\star })+P^{-1}B(z^{\star })\right) \Vert _P^2. \end{aligned}$$

Taking conditional expectation w.r.t. \({\mathscr {F}}_k\) and using Assumption 1,

$$\begin{aligned} {\mathbb {E}}_k \Vert v^{k+1} - v^\star \Vert _P^2&\le {} \Vert v^k - v^\star \Vert _P^2 -2\gamma \langle \partial R(x^{k}) - \partial R(x^\star ),x^{k} - x^\star \rangle \\&\quad -2\gamma \langle \nabla F(x^{k}) - \nabla F(x^\star ),x^{k} - x^\star \rangle \\&\quad -2\gamma {\mathbb {E}}_k\langle \partial H^*(d^{k+1}) - \partial H^*(d^\star ),d^{k+1} - d^\star \rangle \\&\quad +\gamma ^2 \left( 2\alpha D_F(x^{k},x^\star ) + \beta \sigma _k^2\right) \\&\quad -\gamma ^2{\mathbb {E}}_k\Vert P^{-1}A(u^{k+1})+P^{-1}B(z^{k}) \\&\quad - \left( P^{-1}A(u^{\star })+P^{-1}B(z^{\star })\right) \Vert _P^2. \end{aligned}$$

Using strong convexity of F,

$$\begin{aligned} {\mathbb {E}}_k \Vert v^{k+1} - v^\star \Vert _P^2 \le {}&\Vert v^k - v^\star \Vert _P^2 -\gamma \mu _F\Vert x^{k}-x^\star \Vert ^2-2\gamma D_F(x^{k},x^\star )\\&\quad +\gamma ^2 \left( 2\alpha D_F(x^{k},x^\star ) + \beta \sigma _k^2\right) \\&\quad -2\gamma \langle \partial R(x^{k}) - \partial R(x^\star ),x^{k} - x^\star \rangle \\&\quad -2\gamma {\mathbb {E}}_k\langle \partial H^*(d^{k+1}) - \partial H^*(d^\star ),d^{k+1} - d^\star \rangle \\&\quad -\gamma ^2{\mathbb {E}}_k\Vert P^{-1}A(u^{k+1})+P^{-1}B(z^{k})\\&\quad - \left( P^{-1}A(u^{\star })+P^{-1}B(z^{\star })\right) \Vert _P^2. \end{aligned}$$

Using Assumption 1,

$$\begin{aligned}&{\mathbb {E}}_k \Vert v^{k+1} - v^\star \Vert _P^2 + \kappa \gamma ^2{\mathbb {E}}_k\sigma _{k+1}^2 \le \Vert v^k - v^\star \Vert _P^2 + \kappa \gamma ^2\left( 1-\rho +\frac{\beta }{\kappa }\right) \sigma _k^2\\&\quad -\gamma \mu _F\Vert x^{k}-x^\star \Vert ^2\\&\quad -2\gamma (1-\gamma (\alpha +\kappa \delta )) D_F(x^{k},x^\star )-2\gamma \langle \partial R(x^{k}) \\&\quad - \partial R(x^\star ),x^{k} - x^\star \rangle \\&\quad -2\gamma {\mathbb {E}}_k\langle \partial H^*(d^{k+1}) - \partial H^*(d^\star ),d^{k+1} - d^\star \rangle \\&\quad -\gamma ^2{\mathbb {E}}_k\big \Vert P^{-1}A(u^{k+1})+P^{-1}B(z^{k}) \\&\quad - \left( P^{-1}A(u^{\star })+P^{-1}B(z^{\star })\right) \big \Vert _P^2. \end{aligned}$$

\(\square \)

Lemma A.2

Suppose that \((g^k)_{k\in {\mathbb {N}}}\) satisfies Assumption 1. Then, the iterates of the Stochastic PDDY Algorithm satisfy

$$\begin{aligned}&{\mathbb {E}}_k \Vert v^{k+1} - v^\star \Vert _P^2 + \kappa \gamma ^2{\mathbb {E}}_k\sigma _{k+1}^2 \le \Vert v^k - v^\star \Vert _P^2 + \kappa \gamma ^2\left( 1-\rho +\frac{\beta }{\kappa }\right) \sigma _k^2\\&\quad -2\gamma (1-\gamma (\alpha +\kappa \delta )) D_F(x^{k},x^\star )-2\gamma \langle \partial H^*(y^{k}) \\&\quad - \partial H^*(y^\star ),y^{k} - y^\star \rangle \\&\quad -2\gamma {\mathbb {E}}_k\langle \partial R(s^{k+1}) - \partial R(s^\star ),s^{k+1} - s^\star \rangle . \end{aligned}$$

Proof

Applying Lemma 3.2 for \(\text {DYS}(P^{-1}B,P^{-1}A,P^{-1}C)\) using the norm induced by P, we have

$$\begin{aligned} \Vert v^{k+1} - v^\star \Vert _P^2&={}\Vert v^k - v^\star \Vert _P^2 -2\gamma \langle P^{-1}A(z^{k}) - P^{-1}A(z^\star ),z^{k} - z^\star \rangle _P\\&\quad -2\gamma \langle P^{-1}C(z^{k}) - P^{-1}C(z^\star ),z^{k} - z^\star \rangle _P\\&\quad +\gamma ^2\Vert P^{-1}C(z^{k}) - P^{-1}C(z^\star )\Vert _P^2\\&\quad -2\gamma \langle P^{-1}B(u^{k+1}) - P^{-1}B(u^\star ),u^{k+1} - u^\star \rangle _P\\&\quad -\gamma ^2\Vert P^{-1}B(u^{k+1})+P^{-1}A(z^{k}) - \left( P^{-1}B(u^{\star })+P^{-1}A(z^{\star })\right) \Vert _P^2\\&={} \Vert v^k - v^\star \Vert _P^2-2\gamma \langle A(z^{k}) - A(z^\star ),z^{k} - z^\star \rangle -2\gamma \langle C(z^{k}) \\&\quad - C(z^\star ),z^{k} - z^\star \rangle \\&\quad -2\gamma \langle B(u^{k+1}) - B(u^\star ),u^{k+1} - u^\star \rangle \\&\quad +\gamma ^2\Vert P^{-1}C(z^{k}) - P^{-1}C(z^\star )\Vert _P^2\\&\quad -\gamma ^2\Vert P^{-1}B(u^{k+1})+P^{-1}A(z^{k}) - \left( P^{-1}B(u^{\star })+P^{-1}A(z^{\star })\right) \Vert _P^2. \end{aligned}$$

Using \(A(z^{k}) = \big ( L^* y^k ,-L x^k+\partial H^*(y^k)\big ), B(u^{k+1}) = \big (\partial R(s^{k+1}) ,0 \big ), C(z^k) = \big ( g^{k+1},0\big ) \) and \(A(z^{\star }) = \big ( L^* y^\star , -L x^\star +\partial H^*(y^\star )\big ), B(u^\star ) = \big (\partial R(s^\star ) ,0 \big ), C(z^\star ) = \big ( \nabla F(x^\star ),0\big )\), we have,

$$\begin{aligned} \Vert v^{k+1} - v^\star \Vert _P^2&\le \Vert v^k - v^\star \Vert _P^2-2\gamma \langle \partial H^*(y^{k}) - \partial H^*(y^\star ),y^{k} - y^\star \rangle \\&\quad +\gamma ^2\Vert g^{k+1} - \nabla F(x^\star )\Vert ^2\\&\quad -2\gamma \langle g^{k+1} - \nabla F(x^\star ),x^{k} - x^\star \rangle -2\gamma \langle \partial R(s^{k+1}) \\&\quad - \partial R(s^\star ),s^{k+1} - s^\star \rangle . \end{aligned}$$

Applying the conditional expectation w.r.t. \({\mathscr {F}}_k\) and using Assumption 1,

$$\begin{aligned} {\mathbb {E}}_k \Vert v^{k+1} - v^\star \Vert _P^2&\le \Vert v^k - v^\star \Vert _P^2-2\gamma \langle \partial H^*(y^{k}) - \partial H^*(y^\star ),y^{k} - y^\star \rangle \\&\quad -2\gamma \langle \nabla F(x^{k}) - \nabla F(x^\star ),x^{k} - x^\star \rangle \\&\quad +\gamma ^2 \left( 2\alpha D_F(x^{k},x^\star ) + \beta \sigma _k^2\right) \\&\quad -2\gamma {\mathbb {E}}_k\langle \partial R(s^{k+1}) - \partial R(s^\star ),s^{k+1} - s^\star \rangle . \end{aligned}$$

Using the convexity of F,

$$\begin{aligned} {\mathbb {E}}_k \Vert v^{k+1} - v^\star \Vert _P^2&\le \Vert v^k - v^\star \Vert _P^2-2\gamma \langle \partial H^*(y^{k}) - \partial H^*(y^\star ),y^{k} - y^\star \rangle \\&\quad -2\gamma D_F(x^{k},x^\star )\\&\quad -2\gamma {\mathbb {E}}_k\langle \partial R(s^{k+1}) - \partial R(s^\star ),s^{k+1} - s^\star \rangle \\&\quad +\gamma ^2 \left( 2\alpha D_F(x^{k},x^\star ) + \beta \sigma _k^2\right) . \end{aligned}$$

Using Assumption 1,

$$\begin{aligned}&{\mathbb {E}}_k \Vert v^{k+1} - v^\star \Vert _P^2 + \kappa \gamma ^2{\mathbb {E}}_k\sigma _{k+1}^2 \le \Vert v^k - v^\star \Vert _P^2 + \kappa \gamma ^2\left( 1-\rho +\frac{\beta }{\kappa }\right) \sigma _k^2\\&\quad -2\gamma (1-\gamma (\alpha +\kappa \delta )) D_F(x^{k},x^\star )\\&\quad -2\gamma \langle \partial H^*(y^{k}) - \partial H^*(y^\star ),y^{k} - y^\star \rangle \\&\quad -2\gamma {\mathbb {E}}_k\langle \partial R(s^{k+1}) - \partial R(s^\star ),s^{k+1} - s^\star \rangle . \end{aligned}$$

\(\square \)

B Linear Convergence Results

In this section, we provide linear convergence results for the stochastic PD3O and the stochastic PDDY algorithms, in addition to Theorem 6.2. For an operator splitting method like DYS\(({\tilde{A}},{\tilde{B}},{\tilde{C}})\) to converge linearly, it is necessary that \({\tilde{A}}+{\tilde{B}}+{\tilde{C}}\) is strongly monotone. But this is not sufficient: in general, linear convergence of DYS\(({\tilde{A}},{\tilde{B}},{\tilde{C}})\) requires the stronger assumption that \({\tilde{A}}\), \({\tilde{B}}\) or \({\tilde{C}}\) is strongly monotone, and in addition that \({\tilde{A}}\) or \({\tilde{B}}\) is cocoercive [28]. The PDDY algorithm is equivalent to DYS\((P^{-1}B,P^{-1}A,P^{-1}C)\) and the PD3O algorithm is equivalent to DYS\((P^{-1}A,P^{-1}B,P^{-1}C)\), see Sect. 4. However, \(P^{-1}A\), \(P^{-1}B\) and \(P^{-1}C\) are not strongly monotone. In spite of this, we will prove linear convergence of the (stochastic) PDDY and PD3O algorithms.

Thus, for both algorithms, we will make the assumption that \(P^{-1}A+P^{-1}B+P^{-1}C\) is strongly monotone. This is equivalent to assuming that \(M = A+B+C\) is strongly monotone; that is, that \(F+R\) is strongly convex and H is smooth. For instance, the Chambolle–Pock algorithm [11, 13], which is a particular case of the PD3O and the PDDY algorithms, requires R strongly convex and H smooth to converge linearly, in general. In fact, for primal–dual algorithms to converge linearly on Problem (1), for any L, it seems unavoidable that \(F+R\) is strongly convex and that the dual term \(H^*\) is strongly convex too, because the algorithm needs to be contractive in both the primal and the dual spaces. This means that H must be smooth. Note that if H is smooth, it is tempting to use its gradient instead of its proximity operator. We can then use the proximal gradient algorithm to solve Problem (1) with \(\nabla (F+H\circ L)(x)=\nabla F(x) + L^*\nabla H (Lx)\). However, in practice, it is often faster to use the proximity operator instead of the gradient; see [18] for a recent analysis of this topic.
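To make this last remark concrete, here is a minimal sketch (not taken from the paper, and only illustrative) of one proximal gradient iteration on \(R + (F + H\circ L)\) when \(H\) is smooth, using the gradient formula \(\nabla (F+H\circ L)(x)=\nabla F(x) + L^*\nabla H (Lx)\) recalled above; the names grad_F, grad_H, prox_R and the toy instance are our own placeholders.

```python
import numpy as np

def prox_gradient_step(x, gamma, grad_F, grad_H, prox_R, L):
    """One proximal gradient iteration on R + (F + H o L), assuming H is smooth:
    forward (gradient) step on the smooth part F + H o L, backward (proximal) step on R."""
    g = grad_F(x) + L.T @ grad_H(L @ x)  # gradient of F + H o L at x
    return prox_R(x - gamma * g)

# Toy instance (our own choice): F(x) = 0.5*||x||^2, H(y) = 0.5*||y||^2 (smooth),
# R(x) = lam*||x||_1, whose proximity operator is soft-thresholding.
rng = np.random.default_rng(0)
L = rng.standard_normal((5, 3))
lam = 0.1
gamma = 1.0 / (1.0 + np.linalg.norm(L, 2) ** 2)  # 1 / Lipschitz constant of the gradient
grad_F = lambda x: x
grad_H = lambda y: y
prox_R = lambda x: np.sign(x) * np.maximum(np.abs(x) - gamma * lam, 0.0)

x = rng.standard_normal(3)
for _ in range(200):
    x = prox_gradient_step(x, gamma, grad_F, grad_H, prox_R, L)
print(x)  # approximate minimizer of F + R + H o L
```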

For the PD3O algorithm, we will add a cocoercivity assumption, as suggested by the general linear convergence theory of DYS. More precisely, we will assume that R is smooth, so that \(P^{-1}B\) is cocoercive. Our result on the PD3O algorithm is therefore an extension of Theorem 3 of [77] to the stochastic setting. For the PDDY algorithm, this assumption is not needed to prove linear convergence, which is an advantage over the PD3O algorithm.

We denote by \(\Vert \cdot \Vert _{\gamma ,\tau }\) the norm induced by \(\frac{\gamma }{\tau }I - \gamma ^2 L L^*\) on \({\mathcal {Y}}\).
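Explicitly, for every \(q\in {\mathcal {Y}}\),

$$\begin{aligned} \Vert q\Vert _{\gamma ,\tau }^2 = \Big \langle q,\Big (\tfrac{\gamma }{\tau }I - \gamma ^2 L L^*\Big ) q\Big \rangle = \tfrac{\gamma }{\tau }\Vert q\Vert ^2 - \gamma ^2\Vert L^* q\Vert ^2, \end{aligned}$$

which is a positive definite quadratic form under the stepsize condition \(\gamma \tau \Vert L\Vert ^2 < 1\) assumed in the theorems below.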

Theorem B.1

(Linear convergence of the Stochastic PD3O Algorithm) Suppose that Assumption 1 holds. Suppose that H is \(1/\mu _{H^*}\)-smooth, for some \(\mu _{H^*} >0\), F is \(\mu _F\)-strongly convex, for some \(\mu _F\ge 0\), and R is \(\mu _R\)-strongly convex, for some \(\mu _R\ge 0\), with \(\mu :=\mu _F + 2\mu _R >0\). Also, suppose that R is \(\lambda \)-smooth, for some \(\lambda >0\). Suppose that the parameters \(\gamma >0\) and \(\tau >0\) satisfy \(\gamma \le 1/(\alpha +\kappa \delta )\), for some \(\kappa > \beta /\rho \), and \(\gamma \tau \Vert L\Vert ^2 < 1\). Define, for every \(k\in {\mathbb {N}}\),

$$\begin{aligned} V^k :=\Vert p^{k} - p^\star \Vert ^2+ \left( 1+2\tau \mu _{H^*}\right) \Vert y^{k} - y^\star \Vert _{\gamma ,\tau }^2 + \kappa \gamma ^2 \sigma _{k}^2, \end{aligned}$$
(23)

and

$$\begin{aligned} r :=\max \left( 1-\frac{\gamma \mu }{(1+\gamma \lambda )^2},\left( 1-\rho +\frac{\beta }{\kappa }\right) ,\frac{1}{1+2\tau \mu _{H^*}}\right) . \end{aligned}$$
(24)

Then, for every \(k\in {\mathbb {N}}\), \( {\mathbb {E}}V^{k} \le r^k V^0 \).

Proof

We first use Lemma A.1 along with the strong convexity of \(R\) and \(H^*\). Note that \(y^{k} = q^k\) and therefore \(q^{k+1} = q^k + d^{k+1} - q^{k} = d^{k+1}\). We have

$$\begin{aligned}&{\mathbb {E}}_k \Vert p^{k+1} - p^\star \Vert ^2 + {\mathbb {E}}_k \Vert q^{k+1} - q^\star \Vert _{\gamma ,\tau }^2 + 2\gamma \mu _{H^*}{\mathbb {E}}_k\Vert q^{k+1} - q^\star \Vert ^2 + \kappa \gamma ^2{\mathbb {E}}_k\sigma _{k+1}^2 \\&\le \Vert p^{k} - p^\star \Vert ^2 + \Vert q^{k} - q^\star \Vert _{\gamma ,\tau }^2 -\gamma \mu \Vert x^{k}-x^\star \Vert ^2 + \kappa \gamma ^2\left( 1-\rho +\frac{\beta }{\kappa }\right) \sigma _k^2\\&\quad -2\gamma (1-\gamma (\alpha +\kappa \delta )) D_F(x^{k},x^\star ). \end{aligned}$$

Noting that for every \(q \in {\mathcal {Y}}\), \(\Vert q\Vert _{\gamma ,\tau }^2 = \frac{\gamma }{\tau }\Vert q\Vert ^2 - \gamma ^2\Vert L^* q\Vert ^2 \le \frac{\gamma }{\tau }\Vert q\Vert ^2\), and taking \(\gamma \le 1/(\alpha +\kappa \delta )\), we have

$$\begin{aligned}&{\mathbb {E}}_k \Vert p^{k+1} - p^\star \Vert ^2+ \left( 1+2\tau \mu _{H^*}\right) {\mathbb {E}}_k\Vert q^{k+1} - q^\star \Vert _{\gamma ,\tau }^2 + \kappa \gamma ^2{\mathbb {E}}_k\sigma _{k+1}^2\\&\le \Vert p^{k} - p^\star \Vert ^2 + \Vert q^{k} - q^\star \Vert _{\gamma ,\tau }^2 -\gamma \mu \Vert x^{k}-x^\star \Vert ^2 + \kappa \gamma ^2\left( 1-\rho +\frac{\beta }{\kappa }\right) \sigma _k^2. \end{aligned}$$

Finally, since R is \(\lambda \)-smooth, \(\Vert p^k - p^\star \Vert ^2 \le (1+2 \gamma \lambda + \gamma ^2 \lambda ^2)\Vert x^{k}-x^\star \Vert ^2\). Indeed, in this case, applying Lemma 3.2 with \({\tilde{A}} =0\), \({\tilde{C}} = 0\) and \({\tilde{B}} = \nabla R\), we obtain that if \(x^k = {{\,\mathrm{prox}\,}}_{\gamma R}(p^k)\) and \(x^\star = {{\,\mathrm{prox}\,}}_{\gamma R}(p^\star )\), then

$$\begin{aligned} \Vert x^k - x^\star \Vert ^2&= \Vert p^k - p^\star \Vert ^2-2\gamma \langle \nabla R(x^k) - \nabla R(x^\star ),x^k - x^\star \rangle \\&\quad - \gamma ^2\Vert \nabla R(x^k) - \nabla R(x^\star )\Vert ^2\\&\ge \Vert p^k - p^\star \Vert ^2-2\gamma \lambda \Vert x^k - x^\star \Vert ^2 - \gamma ^2\lambda ^2\Vert x^{k} - x^\star \Vert ^2. \end{aligned}$$

Hence,

$$\begin{aligned}&{\mathbb {E}}_k \Vert p^{k+1} - p^\star \Vert ^2+ \left( 1+2\tau \mu _{H^*}\right) {\mathbb {E}}_k\Vert q^{k+1} - q^\star \Vert _{\gamma ,\tau }^2 + \kappa \gamma ^2{\mathbb {E}}_k\sigma _{k+1}^2 \\&\quad \le \Vert p^{k} - p^\star \Vert ^2 + \Vert q^{k} - q^\star \Vert _{\gamma ,\tau }^2 -\frac{\gamma \mu }{(1+\gamma \lambda )^2}\Vert p^{k}-p^\star \Vert ^2\\&\quad + \kappa \gamma ^2\left( 1-\rho +\frac{\beta }{\kappa }\right) \sigma _k^2. \end{aligned}$$

Thus, by setting \(V^k\) as in (23) and r as in (24), we have \({\mathbb {E}}_k V^{k+1} \le r V^k \). \(\square \)

Thus, under smoothness and strong convexity assumptions, Theorem B.1 implies linear convergence of the dual variable \(y^k\) to \(y^\star \), with convergence rate given by r. Since \(\Vert x^k - x^\star \Vert \le \Vert p^k - p^\star \Vert \), it also implies linear convergence of the variable \(x^k\) to \(x^\star \), with the same rate.
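The inequality \(\Vert x^k - x^\star \Vert \le \Vert p^k - p^\star \Vert \) used here is a consequence of \(x^k = {{\,\mathrm{prox}\,}}_{\gamma R}(p^k)\) and \(x^\star = {{\,\mathrm{prox}\,}}_{\gamma R}(p^\star )\), as in the proof above, together with the nonexpansiveness of the proximity operator:

$$\begin{aligned} \Vert x^k - x^\star \Vert = \Vert {{\,\mathrm{prox}\,}}_{\gamma R}(p^k) - {{\,\mathrm{prox}\,}}_{\gamma R}(p^\star )\Vert \le \Vert p^k - p^\star \Vert . \end{aligned}$$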

If \(g^{k+1} = \nabla F(x^k)\), the Stochastic PD3O Algorithm reverts to the PD3O Algorithm and Theorem B.1 provides a convergence rate similar to Theorem 3 in [77]. In this case, by taking \(\kappa = 1\), we obtain

$$\begin{aligned} r = \max \left( 1-\gamma \frac{\mu _F + 2\mu _R}{(1+\gamma \lambda )^2},\frac{1}{1+2\tau \mu _{H^*}}\right) , \end{aligned}$$

whereas Theorem 3 in [77] provides the rate

$$\begin{aligned} \max \left( 1-\gamma \frac{2(\mu _F+\mu _R) - \gamma \alpha \mu _F}{(1+\gamma \lambda )^2},\frac{1}{1+2\tau \mu _{H^*}}\right) \end{aligned}$$

(the reader might not recognize the rate given in Theorem 3 of [77] because of some typos in Eqn. 39 of [77]).

Theorem B.2

(Linear convergence of the Stochastic PDDY Algorithm) Suppose that Assumption 1 holds. Also, suppose that H is \(1/\mu _{H^*}\)-smooth and R is \(\mu _R\)-strongly convex, for some \(\mu _R >0\) and \(\mu _{H^*} >0\). Suppose that the parameters \(\gamma >0\) and \(\tau >0\) satisfy \(\gamma \le 1/(\alpha +\kappa \delta )\), for some \(\kappa > \beta /\rho \), \(\gamma \tau \Vert L\Vert ^2 < 1\), and \(\gamma ^2 \le \frac{\mu _{H^*}}{\Vert L\Vert ^2 \mu _R}\). Define \(\eta :=2\left( \mu _{H^*} -\gamma ^2\Vert L\Vert ^2\mu _R\right) \ge 0\) and, for every \(k\in {\mathbb {N}}\),

$$\begin{aligned} V^k :=(1+\gamma \mu _R)\Vert p^{k} - p^\star \Vert ^2 + (1+\tau \eta )\Vert y^{k} - y^\star \Vert _{\gamma ,\tau }^2 + \kappa \gamma ^2\sigma _{k}^2, \end{aligned}$$
(25)

and

$$\begin{aligned} r :=\max \left( \frac{1}{1+\gamma \mu _R},1-\rho +\frac{\beta }{\kappa },\frac{1}{1+\tau \eta }\right) . \end{aligned}$$
(26)

Then, for every \(k\in {\mathbb {N}}\), \({\mathbb {E}}V^{k} \le r^k V^0\).

Proof

We first use Lemma A.2 along with the strong convexity of R and \(H^*\). Note that \(y^{k} = q^{k+1}\). We have

$$\begin{aligned} {\mathbb {E}}_k \Vert v^{k+1} - v^\star \Vert _P^2 + \kappa \gamma ^2{\mathbb {E}}_k\sigma _{k+1}^2&\le \Vert v^k - v^\star \Vert _P^2 + \kappa \gamma ^2\left( 1-\rho +\frac{\beta }{\kappa }\right) \sigma _k^2\\&\quad -2\gamma \mu _{H^*}{\mathbb {E}}_k \Vert q^{k+1} - q^\star \Vert ^2\\&\quad -2\gamma \mu _{R}{\mathbb {E}}_k\Vert s^{k+1} - s^\star \Vert ^2. \end{aligned}$$

Note that \(s^{k+1} = p^{k+1} - \gamma L^* y^k\). Therefore, \(s^{k+1} - s^\star = (p^{k+1} - p^\star ) - \gamma L^* (y^k - y^\star )\). Using the Young-type inequality \(-\Vert a+b\Vert ^2 \le -\frac{1}{2}\Vert a\Vert ^2 + \Vert b\Vert ^2\) (which follows from \(\Vert a\Vert ^2 \le 2\Vert a+b\Vert ^2 + 2\Vert b\Vert ^2\)), we have \( -{\mathbb {E}}_k\Vert s^{k+1} - s^\star \Vert ^2 \le -\frac{1}{2}{\mathbb {E}}_k\Vert p^{k+1} - p^\star \Vert ^2 + \gamma ^2\Vert L\Vert ^2 {\mathbb {E}}_k\Vert q^{k+1} - q^\star \Vert ^2 \). Hence, using \(\tau \Vert q\Vert _{\gamma ,\tau }^2 \le \gamma \Vert q\Vert ^2\),

$$\begin{aligned} {\mathbb {E}}_k \Vert v^{k+1} - v^\star \Vert _P^2 + \kappa \gamma ^2{\mathbb {E}}_k\sigma _{k+1}^2 \le {}&\Vert v^k - v^\star \Vert _P^2 + \kappa \gamma ^2\left( 1-\rho +\frac{\beta }{\kappa }\right) \sigma _k^2\\&-2\gamma \left( \mu _{H^*} -\gamma ^2\Vert L\Vert ^2\mu _R\right) {\mathbb {E}}_k \Vert q^{k+1}- q^\star \Vert ^2 \\&- \gamma \mu _{R}{\mathbb {E}}_k\Vert p^{k+1} - p^\star \Vert ^2\\ \le {}&\Vert v^k - v^\star \Vert _P^2 + \kappa \gamma ^2\left( 1-\rho +\frac{\beta }{\kappa }\right) \sigma _k^2\\&-2\tau {\mathbb {E}}_k \Vert q^{k+1}- q^\star \Vert _{\gamma ,\tau }^2\left( \mu _{H^*} -\gamma ^2\Vert L\Vert ^2\mu _R\right) \\&- \gamma \mu _{R}{\mathbb {E}}_k\Vert p^{k+1} - p^\star \Vert ^2. \end{aligned}$$

Set \(\eta :=2\left( \mu _{H^*} -\gamma ^2\Vert L\Vert ^2\mu _R\right) \ge 0\). Then,

$$\begin{aligned}&(1+\gamma \mu _R){\mathbb {E}}_k \Vert p^{k+1} - p^\star \Vert ^2 + (1+\tau \eta ){\mathbb {E}}_k \Vert q^{k+1} - q^\star \Vert _{\gamma ,\tau }^2 + \kappa \gamma ^2{\mathbb {E}}_k\sigma _{k+1}^2\\&\quad \le \Vert v^k - v^\star \Vert _P^2 + \kappa \gamma ^2\left( 1-\rho +\frac{\beta }{\kappa }\right) \sigma _k^2. \end{aligned}$$

Thus, by setting \(V^k\) as in (25) and r as in (26), we have \( {\mathbb {E}}_k V^{k+1} \le r V^k.\) \(\square \)

C PriLiCoSGD and Application to Decentralized Optimization

[Algorithm: PriLiCoSGD — pseudocode figure in the published article]

In decentralized optimization, a network of computing agents aims at jointly minimizing an objective function by performing local computations and exchanging information along the edges [1, 46, 66, 67]. It is a particular case of linearly constrained optimization, as detailed below.

First, let us set \(W:=L^*L\) and \(c :=L^*b\). Replacing the variable \(y^k\) by the variable \(a^k :=L^*y^k\) in LiCoSGD, we can write the algorithm using W and c instead of L, \(L^*\) and b, with primal variables in \({\mathcal {X}}\) only. This yields the new algorithm PriLiCoSGD, shown above, which minimizes F(x) subject to \(Wx=c\). The convergence results for LiCoSGD apply to PriLiCoSGD, with \((a^k)_{k\in {\mathbb {N}}}\) converging to \(a^\star =-\nabla F(x^\star )\).

We can apply PriLiCoSGD to decentralized optimization as follows. Consider a connected undirected graph \(G = (V,E)\), where \(V = \{1,\ldots ,N\}\) is the set of nodes and E the set of edges. Consider a family \((f_i)_{i \in V}\) of \(\mu \)-strongly convex and \(\nu \)-smooth functions \(f_i\), for some \(\mu \ge 0\) and \(\nu >0\). The problem is:

$$\begin{aligned} \min _{x \in {\mathcal {X}}} \,\sum _{i \in V} f_i(x). \end{aligned}$$
(27)

Consider a gossip matrix of the graph G; that is, an \(N \times N\) symmetric positive semidefinite matrix \({\widehat{W}} = ({\widehat{W}}_{i,j})_{i,j \in V}\), such that \(\ker ({\widehat{W}}) = \mathop {\mathrm {span}}\nolimits ([1\ \cdots \ 1]^\mathrm {T})\) and \({\widehat{W}}_{i,j} \ne 0\) if and only if \(i=j\) or \(\{i,j\} \in E\). For instance, \({\widehat{W}}\) can be the Laplacian matrix of G. Set \(W :={\widehat{W}} \otimes I\), where \(\otimes \) denotes the Kronecker product; then decentralized communication in the network G is modeled by an application of the positive self-adjoint linear operator W on \({\mathcal {X}}^V\). Moreover, \(W(x_1,\ldots ,x_N) = 0\) if and only if \(x_1 = \ldots = x_N\). Therefore, Problem (27) is equivalent to the lifted problem

$$\begin{aligned} \min _{{\tilde{x}} \in {\mathcal {X}}^V} F({\tilde{x}}) \quad \text {such that} \quad W {\tilde{x}} = 0, \end{aligned}$$
(28)

where for every \({\tilde{x}}=(x_1,\ldots ,x_N) \in {\mathcal {X}}^V\), \(F({\tilde{x}}) = \sum _{i=1}^N f_i(x_i)\). Let us apply PriLiCoSGD to Problem (28); we obtain the Decentralized Stochastic Optimization Algorithm (DESTROY). It generates the sequence \(({\tilde{x}}^k)_{k\in {\mathbb {N}}}\), where \({\tilde{x}}^k = (x_1^k,\ldots ,x_N^k) \in {\mathcal {X}}^V\). The update of each \(x_i^k\) consists of evaluating \(g_i^{k+1}\), an estimate of \(\nabla f_i(x_i^k)\) satisfying Assumption 1, and of communication steps involving \(x_j^k\), for every neighbor j of i. For instance, the variance-reduced estimator \(g_i^k\) can be the loopless SVRG estimator seen in Proposition 5.1, when \(f_i\) is itself a sum of functions, or a compressed version of \(\nabla f_i\) [4, 50, 65, 75].
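As an illustration, here is a minimal sketch of such a variance-reduced estimator for one node, in the loopless SVRG style of [45], when \(f_i = \frac{1}{n}\sum _{j=1}^{n} f_{i,j}\); the exact estimator of Proposition 5.1 may differ in details, and the class and function names below are our own placeholders.

```python
import numpy as np

class LooplessSVRG:
    """Loopless SVRG-style gradient estimator for f(x) = (1/n) * sum_j f_j(x):
    returns an unbiased estimate of grad f(x) and, with probability p,
    refreshes the reference point w and the stored full gradient at w."""
    def __init__(self, grads, x0, p, rng=None):
        self.grads = grads                    # component gradients grad f_j
        self.n = len(grads)
        self.w = np.array(x0, dtype=float)    # reference point
        self.full = sum(g(self.w) for g in grads) / self.n
        self.p = p
        self.rng = rng or np.random.default_rng()

    def __call__(self, x):
        j = self.rng.integers(self.n)         # sample one component uniformly
        g = self.grads[j](x) - self.grads[j](self.w) + self.full
        if self.rng.random() < self.p:        # occasional full-gradient refresh
            self.w = np.array(x, dtype=float)
            self.full = sum(gr(self.w) for gr in self.grads) / self.n
        return g

# Example with quadratic components f_j(x) = 0.5*(a_j^T x - b_j)^2:
rng = np.random.default_rng(0)
A, b = rng.standard_normal((10, 3)), rng.standard_normal(10)
grads = [lambda x, a=A[j], bj=b[j]: a * (a @ x - bj) for j in range(10)]
estimator = LooplessSVRG(grads, x0=np.zeros(3), p=0.1, rng=rng)
g1 = estimator(np.zeros(3))  # unbiased estimate of the full gradient at 0
```

In DESTROY, each node \(i\) would maintain such an estimator locally and plug its output \(g_i^{k+1}\) into its update.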

As an application of the convergence results for LiCoSGD, we obtain the following results for DESTROY. Theorem 4.1 becomes:

Theorem C.1

(Convergence of DESTROY, deterministic case \(g_i^{k+1}=\nabla f_i(x_i^k)\)) Suppose that \(\gamma \in (0,2/\nu )\) and that \(\tau \gamma \Vert {\widehat{W}}\Vert < 1\). Then, in DESTROY, each \((x_i^k)_{k\in {\mathbb {N}}}\) converges to the same solution \(x^\star \) of Problem (27), and each \((a_i^k)_{k\in {\mathbb {N}}}\) converges to \(a_i^\star =-\nabla f_i(x^\star )\).

Theorem 6.1 can be applied to the stochastic case, stating \({\mathcal O}(1/k)\) convergence of the Lagrangian gap, by setting \({\mathcal {Y}}={\mathcal {X}}\) and \(L=L^* = W^{1/2}\). Similarly, Theorem 6.2 yields linear convergence of DESTROY in the strongly convex case \(\mu >0\), with \(L^*L\) replaced by W and \(\Vert L\Vert ^2\) replaced by \(\Vert W\Vert =\Vert {\widehat{W}}\Vert \). In particular, in the deterministic case, with \(\gamma =1/\nu \) and \(\tau \gamma =\aleph /\Vert W\Vert \) for some fixed \(\aleph \in (0,1)\), \(\varepsilon \)-accuracy is reached after \({\mathcal O}\Big (\max \big (\frac{\nu }{\mu },\frac{ \Vert W\Vert }{\omega (W)}\big )\log \big (\frac{1}{\varepsilon }\big )\Big )\) iterations. This rate is better than or equivalent to that of recently proposed decentralized algorithms, such as EXTRA, DIGing, NIDS, NEXT, Harness and Exact Diffusion; see Table 1 of [76], Theorem 1 of [49], and [1]. With a stochastic gradient, the rate of our algorithm is also better than the one in Equation (99) of [53].

In follow-up papers, the authors used Nesterov acceleration to propose accelerated versions of DESTROY [46] and PriLiCoSGD [64].

Cite this article

Salim, A., Condat, L., Mishchenko, K. et al. Dualize, Split, Randomize: Toward Fast Nonsmooth Optimization Algorithms. J Optim Theory Appl 195, 102–130 (2022). https://doi.org/10.1007/s10957-022-02061-8
