
Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization

Published in: Journal of Optimization Theory and Applications

Abstract

We present a unified theorem for the convergence analysis of stochastic gradient algorithms for minimizing a smooth and convex loss plus a convex regularizer. We do this by extending the unified analysis of Gorbunov et al. (in: AISTATS, 2020) and dropping the requirement that the loss function be strongly convex. Instead, we rely only on convexity of the loss function. Our unified analysis applies to a host of existing algorithms such as proximal SGD, variance reduced methods, quantization and some coordinate descent-type methods. For the variance reduced methods, we recover the best known convergence rates as special cases. For proximal SGD, the quantization and coordinate-type methods, we uncover new state-of-the-art convergence rates. Our analysis also includes any form of sampling or minibatching. As such, we are able to determine the minibatch size that optimizes the total complexity of variance reduced methods. We showcase this by obtaining a simple formula for the optimal minibatch size of two variance reduced methods (L-SVRG and SAGA). This optimal minibatch size not only improves the theoretical total complexity of the methods but also improves their convergence in practice, as we show in several experiments.

References

  1. Alistarh, D., Grubic, D., Li, J., Tomioka, R., Vojnovic, M.: QSGD: communication-efficient SGD via gradient quantization and encoding. Adv. Neural Inf. Process. Syst. 30, 1709–1720 (2017)

  2. Alistarh, D., Hoefler, T., Johansson, M., Konstantinov, N., Khirirat, S., Renggli, C.: The convergence of sparsified gradient methods. Adv. Neural Inf. Process. Syst. 31, 5977–5987 (2018)

  3. Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC, pp. 1200–1205 (2017)

  4. Allen-Zhu, Z., Hazan, E.: Variance reduction for faster non-convex optimization. In: Proceedings of the 33rd International Conference on Machine Learning (2016)

  5. Atchadé, Y.F., Fort, G., Moulines, E.: On perturbed proximal gradient algorithms. J. Mach. Learn. Res. 18(1), 310–342 (2017)

  6. Beck, A.: First-Order Methods in Optimization. MOS-SIAM Series on Optimization, Society for Industrial and Applied Mathematics, Philadelphia (2017)

  7. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 1–27 (2011)

  8. Defazio, A., Bach, F.R., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. Adv. Neural Inf. Process. Syst. 27, 1646–1654 (2014)

  9. Gazagnadou, N., Gower, R.M., Salmon, J.: Optimal mini-batch and step sizes for SAGA. In: Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp. 2142–2150 (2019)

  10. Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)

  11. Gorbunov, E., Hanzely, F., Richtárik, P.: A unified theory of SGD: variance reduction, sampling, quantization and coordinate descent. In: AISTATS (2020)

  12. Gower, R.M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., Richtárik, P.: SGD: general analysis and improved rates. In: Proceedings of the 36th International Conference on Machine Learning, vol. 97, pp. 5200–5209 (2019)

  13. Gower, R.M., Richtárik, P., Bach, F.: Stochastic quasi-gradient methods: variance reduction via Jacobian sketching. Math. Program. 188, 135–192 (2020)

  14. Grimmer, B.: Convergence rates for deterministic and stochastic subgradient methods without Lipschitz continuity. SIAM J. Optim. 29(2), 1350–1365 (2019)

  15. Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P.: Deep learning with limited numerical precision. In: Proceedings of the 32nd International Conference on Machine Learning, vol. 37, pp. 1737–1746 (2015)

  16. Hanzely, F., Mishchenko, K., Richtárik, P.: SEGA: variance reduction via gradient sketching. Adv. Neural Inf. Process. Syst. 31, 2086–2097 (2018)

  17. Hofmann, T., Lucchi, A., Lacoste-Julien, S., McWilliams, B.: Variance reduced stochastic gradient descent with neighbors. Adv. Neural Inf. Process. Syst. 28, 2305–2313 (2015)

  18. Horváth, S., Kovalev, D., Mishchenko, K., Stich, S., Richtárik, P.: Stochastic distributed learning with gradient quantization and variance reduction. arXiv:1904.05115 (2019)

  19. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Process. Syst. 26, 315–323 (2013)

  20. Khaled, A., Richtárik, P.: Better theory for SGD in the nonconvex world. arXiv:2002.03329 (2020)

  21. Konečný, J., Liu, J., Richtárik, P., Takáč, M.: Mini-batch semi-stochastic gradient descent in the proximal setting. J. Sel. Top. Signal Process. 10(2), 242–255 (2016)

  22. Konečný, J., Richtárik, P.: Randomized distributed mean estimation: accuracy vs communication. arXiv:1611.07555 (2016)

  23. Kovalev, D., Horváth, S., Richtárik, P.: Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In: Proceedings of the 31st International Conference on Algorithmic Learning Theory, vol. 117, pp. 451–467 (2020)

  24. Lei, Y., Ting, H., Li, G., Tang, K.: Stochastic gradient descent for nonconvex learning without bounded gradient assumptions. IEEE Trans. Neural Netw. Learn. Syst. 31(10), 4394–4400 (2020)

  25. Loizou, N., Vaswani, S., Laradji, I.H., Lacoste-Julien, S.: Stochastic Polyak step-size for SGD: an adaptive learning rate for fast convergence. In: Banerjee, A., Fukumizu, K. (eds.) Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp. 1306–1314. PMLR (2021)

  26. Mishchenko, K., Gorbunov, E., Takáč, M., Richtárik, P.: Distributed learning with compressed gradient differences. arXiv:1901.09269 (2019)

  27. Mishchenko, K., Hanzely, F., Richtárik, P.: 99% of worker-master communication in distributed optimization is not needed. In: Adams, R.P., Gogate, V. (eds.) Proceedings of the Thirty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI 2020, virtual online, August 3–6, 2020, volume 124 of Proceedings of Machine Learning Research, pp. 979–988. AUAI Press (2020)

  28. Needell, D., Srebro, N., Ward, R.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Math. Program. Ser. A 155(1), 549–573 (2016)

  29. Nemirovski, A., Juditsky, A.B., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  30. Nesterov, Y.E.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  31. Nguyen, L., Nguyen, P.H., van Dijk, M., Richtárik, P., Scheinberg, K., Takáč, M.: SGD and Hogwild! Convergence without the bounded gradients assumption. In: Proceedings of the 35th International Conference on Machine Learning, vol. 80, pp. 3750–3758 (2018)

  32. Nguyen, L.M., Liu, J., Scheinberg, K., Takáč, M.: SARAH: a novel method for machine learning problems using stochastic recursive gradient. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 2613–2621 (2017)

  33. Ravikumar, P., Wainwright, M.J., Lafferty, J.D.: High-dimensional Ising model selection using \(\ell _1\)-regularized logistic regression. Ann. Stat. 38(3), 1287–1319 (2010)

  34. Reddi, S.J., Hefny, A., Sra, S., Póczos, B., Smola, A.J.: Stochastic variance reduction for nonconvex optimization. In: Proceedings of the 33rd International Conference on Machine Learning, vol. 48, pp. 314–323 (2016)

  35. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  36. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)

  37. Sebbouh, O., Gazagnadou, N., Jelassi, S., Bach, F., Gower, R.M.: Towards closing the gap between the theory and practice of SVRG. Adv. Neural Inf. Process. Syst. 32, 646–656 (2019)

  38. Seide, F., Fu, H., Droppo, J., Li, G., Yu, D.: 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In: INTERSPEECH, 15th Annual Conference of the International Speech Communication Association, pp. 1058–1062 (2014)

  39. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)

  40. Stich, S.U.: Unified optimal analysis of the (stochastic) gradient method. arXiv:1907.04232 (2019)

  41. Tibshirani, R.J.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)

  42. Vaswani, S., Bach, F., Schmidt, M.: Fast and faster convergence of SGD for over-parameterized models and an accelerated perceptron. In: The 22nd International Conference on Artificial Intelligence and Statistics, vol. 89, pp. 1195–1204 (2019)

  43. Wangni, J., Wang, J., Liu, J., Zhang, T.: Gradient sparsification for communication-efficient distributed optimization. Adv. Neural Inf. Process. Syst. 31, 1306–1316 (2018)

  44. Wright, S.J.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)

  45. Zhang, H., Li, J., Kara, K., Alistarh, D., Liu, J., Zhang, C.: ZipML: training linear models with end-to-end low precision, and a little bit of deep learning. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 4035–4043 (2017)

  46. Zhao, P., Zhang, T.: Stochastic optimization with importance sampling for regularized loss minimization. In: Proceedings of the 32nd International Conference on Machine Learning, vol. 37, pp. 1–9 (2015)

  47. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67(2), 301–320 (2005)

Corresponding author

Correspondence to Ahmed Khaled.

Additional information

Communicated by Alexander Vladimirovich Gasnikov.

Appendices

Outline of the Appendix

The appendix is organized as follows:

  • Section A: we present the arbitrary sampling framework for stochastic gradient methods introduced by [13], which will be used for the analysis of SGD and L-SVRG.

  • Section B: we present specializations of Theorem 3.1 to the algorithms we discuss: SGD, DIANA, L-SVRG, SAGA and SEGA.

  • Section C: we present the proof of our main Theorem 3.1.

  • Section D: we present the proof of Corollary 4.2.

  • Section E: we present the proof of Proposition 5.1 and the detailed analysis of the optimal minibatch results for b-SAGA and b-L-SVRG, in addition to an analysis of the optimal miniblock size for b-SEGA.

  • Section F: we present some technical lemmas which we use in our analysis.

Appendix A: Arbitrary Sampling

In this section, we recall the arbitrary sampling framework [12], which allows us to analyze our algorithms under minibatching, importance sampling, and virtually any other form of sampling.

1.1 Appendix A.1: Stochastic reformulation

To view importance sampling and minibatch variants of stochastic gradient methods through the same lens, we introduce a sampling vector, which we will use to rewrite (1).

Definition A.1

We say that a random vector \(v \in \mathbb {R}^n_+\) with nonnegative entries, drawn from some distribution \(\mathcal {D}\), is a sampling vector if its expectation is the vector of all ones:

$$\begin{aligned} \mathbb {E}_{\mathcal {D}} \left[ v_i\right] = 1, \text{ for } \text{ all } i \in [n]. \end{aligned}$$
(22)

For a given distribution \(\mathcal {D}\), we introduce a stochastic reformulation of (1) as follows

$$\begin{aligned} \min _{x \in \mathbb {R}^d} \ \left\{ \mathbb {E}_{\mathcal {D}} \left[ f_v (x) \overset{\text {def}}{=}\frac{1}{n} \sum _{i=1}^{n} v_i f_i (x)\right] + R(x) \right\} . \end{aligned}$$
(23)

By definition of the sampling vector, \(f_v (x)\) and \(\nabla f_v (x)\) are unbiased estimators of f(x) and \(\nabla f(x)\), respectively, and hence problem (23) is indeed equivalent to (i.e., a reformulation of) the original problem (1). In the case of the gradient, for instance, we get

$$\begin{aligned} \mathbb {E}_{\mathcal {D}} \left[ \nabla f_v (x)\right] \overset{(23)}{=} \frac{1}{n} \sum _{i=1}^{n} \mathbb {E}_{\mathcal {D}} \left[ v_i\right] \nabla f_i (x) \overset{(22)}{=} \nabla f(x). \end{aligned}$$

Reformulation (23) can be solved using proximal stochastic gradient descent via

$$\begin{aligned} x_{k+1} = \textrm{prox}_{\gamma _k R} \left( x_k - \gamma _k \nabla f_{v_k} (x_k) \right) , \end{aligned}$$
(24)

where \(v_k \sim \mathcal {D}\) is sampled i.i.d. at each iteration and \(\gamma _k > 0\) is a stepsize. By substituting specific choices of \(\mathcal {D}\), we obtain specific variants of SGD for solving (1). We further show that (24) is a special case of (2) with a sequence of vectors \(g_k = \nabla f_{v_k} (x_k)\) and use the unified analysis in Theorem 3.1 to obtain convergence rates for (24).
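For concreteness, the update (24) can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the authors' code: `grads(i, x)` returning \(\nabla f_i(x)\), `sample_v` drawing \(v \sim \mathcal {D}\), and the soft-thresholding prox (for \(R = \left\Vert \cdot \right\Vert _1\)) are all our own stand-ins.

```python
import numpy as np

def prox_l1(z, step):
    """Proximal operator of step * ||.||_1 (soft-thresholding), one possible R."""
    return np.sign(z) * np.maximum(np.abs(z) - step, 0.0)

def prox_sgd(grads, n, sample_v, prox, gammas, x0):
    """Run update (24): x_{k+1} = prox_{gamma_k R}(x_k - gamma_k grad f_{v_k}(x_k))."""
    x = x0.copy()
    for gamma in gammas:
        v = sample_v()  # sampling vector with E[v_i] = 1 for all i
        # grad f_v(x) = (1/n) sum_i v_i grad f_i(x); zero entries of v are skipped
        g = sum(v[i] * grads(i, x) for i in range(n) if v[i] != 0.0) / n
        x = prox(x - gamma * g, gamma)
    return x
```

Taking \(v = n e_i\) for a uniformly random index i recovers proximal SGD with single-element uniform sampling.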

1.2 Appendix A.2: Expected Smoothness and Gradient Noise

In order to analyze (24), we will make use of the following result, which characterizes the smoothness of the subsampled functions \(f_v\).

Lemma A.1

(Expected Smoothness) If, for all \(i \in [n]\), \(f_i\) is convex and \(L_i\)-smooth, then there exists a constant \(\mathcal {L}\ge 0\) such that

$$\begin{aligned} \mathbb {E}_{\mathcal {D}} \left[ \left\Vert \nabla f_v (x) - \nabla f_v (x_*)\right\Vert ^2\right] \le 2 \mathcal {L}\, D_f(x,x_*), \end{aligned}$$
(25)

for all \(x \in \mathbb {R}^d\), where \(x_*\) is any minimizer of (1).

The proof of this result follows closely that of Lemma 1 in [9].

Proof

Since for all \(i \in [n]\), \(f_i\) is \(L_i\)-smooth and convex, we have that each realization \(f_v\) (defined in (23)) is \(L_v\)-smooth and convex. Thus, from Lemma F.1, we have that for all \(x \in \mathbb {R}^d\),

$$\begin{aligned} {\left\Vert \nabla f_v(x) - \nabla f_v(x_*)\right\Vert }^2\le & {} 2L_v \left( f_v(x) - f_v(x_*) - \left\langle \nabla f_v(x_*), x - x_* \right\rangle \right) \\= & {} \frac{2}{n}\sum _{i=1}^n L_vv_i \left( f_i(x) - f_i(x_*) - \left\langle \nabla f_i(x_*), x - x_* \right\rangle \right) . \end{aligned}$$

Taking expectation over the samplings,

$$\begin{aligned} \mathbb {E}_{\mathcal {D}} \left[ {\left\Vert \nabla f_v(x) - \nabla f_v(x_*)\right\Vert }^2\right]\le & {} \frac{2}{n}\sum _{i=1}^n \mathbb {E}_{\mathcal {D}} \left[ v_iL_v\right] \left( f_i(x) - f_i(x_*) - \left\langle \nabla f_i(x_*), x - x_* \right\rangle \right) \\\le & {} 2\max _{j=1,\dots ,n}\mathbb {E}_{\mathcal {D}} \left[ L_vv_j\right] \left( f(x) - f(x_*) - \left\langle \nabla f(x_*), x - x_* \right\rangle \right) \\= & {} 2 \max _{j=1,\dots ,n}\mathbb {E}_{\mathcal {D}} \left[ L_vv_j\right] D_f(x, x_*). \end{aligned}$$

\(\square \)

Next, we define the gradient noise.

Definition A.2

(Gradient Noise) The gradient noise \(\sigma ^2 = \sigma ^2(f, \mathcal {D})\) is defined by

$$\begin{aligned} \sigma ^2 \overset{\text {def}}{=}\mathbb {E}_{\mathcal {D}} \left[ \left\Vert \nabla f_v (x_*)-\nabla f(x_*) \right\Vert ^2\right] . \end{aligned}$$
(26)

1.3 Appendix A.3: Minibatching Elements Without Replacement

Since analyzing minibatching for variance reduced methods is one of the main focuses of our work, we present minibatching without replacement as an example of the use of arbitrary sampling.

First, we define samplings.

Definition A.3

(Sampling) A sampling \(S\subseteq [n]\) is any random set-valued map, uniquely determined by the probabilities \(p_{B} \;\overset{\text {def}}{=}\; \mathbb {P}(S=B)\) for all \(B \subseteq [n]\), which satisfy \(\sum _{B \subseteq [n]} p_B =1\). A sampling S is called proper if for every \(i \in [n]\), we have that \(p_i \overset{\text {def}}{=}\mathbb {P}(i \in S) = \underset{C:i\in C}{\sum }p_C > 0\).

We can build a sampling vector using a sampling as follows.

Lemma A.2

(Sampling vector, Lemma 3.3 in [12]) Let S be a proper sampling. Let \(p_i \overset{\text {def}}{=}\mathbb {P}(i \in S)\) and \(\textbf{P} \overset{\text {def}}{=}\textrm{diag }\left( p_1,\dots ,p_n\right) \). Let \(v = v(S)\) be a random vector defined by

$$\begin{aligned} v(S) \;= \; \textbf{P}^{-1}\sum _{i \in S}e_i \;\overset{\text {def}}{=}\; \textbf{P}^{-1}e_S. \end{aligned}$$
(27)

It follows that v is a sampling vector.

Proof

The ith coordinate of v(S) is \(v_i(S) = \mathbb {1}(i \in S) / p_i\) and thus

$$\begin{aligned} \mathbb {E}_{} \left[ v_i(S)\right] \; =\; \frac{\mathbb {E}_{} \left[ \mathbb {1}(i \in S)\right] }{ p_i} \;=\; \frac{\mathbb {P}(i \in S)}{p_i} \;= \;1. \end{aligned}$$

\(\square \)

Next, we define b-nice sampling, also known as minibatching without replacement.

Definition A.4

(b-nice sampling) S is a b-nice sampling if it is a sampling such that

$$\begin{aligned} \mathbb {P}(S = B) = \frac{1}{\left( {\begin{array}{c}n\\ b\end{array}}\right) }, \quad \forall B \subseteq [n],\; \text { with } \;|B| = b. \end{aligned}$$

To construct such a sampling vector based on the b-nice sampling, note that \(p_i = \tfrac{b}{n}\) for all \(i \in [n]\), and thus we have that \(v(S) = \tfrac{n}{b}\sum _{i\in S}e_i\) according to Lemma A.2. The resulting subsampled function is then \(f_v(x) = \tfrac{1}{|S|}\sum _{i\in S}f_i(x)\), which is simply the minibatch average over S.
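The construction of v(S) for b-nice sampling can be sketched as follows (our own illustrative code, not the paper's); the accompanying check verifies \(\mathbb {E}\left[ v_i\right] = 1\) exactly by enumerating all \(\binom{n}{b}\) subsets.

```python
import numpy as np
from itertools import combinations

def b_nice_sampling_vector(n, b, rng):
    """Draw v(S) = P^{-1} e_S for a b-nice sampling S; here p_i = b/n for all i."""
    S = rng.choice(n, size=b, replace=False)  # b-nice sampling: uniform b-subset
    v = np.zeros(n)
    v[S] = n / b                              # v_i = 1(i in S) / p_i
    return v
```

Each draw has exactly b nonzero entries, all equal to n/b, so the subsampled function \(f_v\) is the minibatch average over S.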

A remarkable result for b-nice sampling is that when all the functions \(f_i, i\in [n]\) are \(L_i\)-smooth and convex, then the expected smoothness constant (25) nicely interpolates between L, the smoothness constant of f, and \(L_{\max } = \underset{i\in [n]}{\max }\,L_i\).

Lemma A.3

(\(\mathcal {L}\) for b-nice sampling, Proposition 3.8 in [12]) Let v be the sampling vector based on the b-nice sampling defined in Definition A.4. If, for all \(i \in [n]\), \(f_i\) is convex and \(L_i\)-smooth, then (25) holds with

$$\begin{aligned} \mathcal {L}(b) = \frac{1}{b}\frac{n-b}{n-1}L_{\max } + \frac{n}{b}\frac{b-1}{n-1}L, \end{aligned}$$

where L is the smoothness constant of f and \(L_{\max } = \underset{i\in [n]}{\max }\ \, L_i\).
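The interpolation property is easy to check numerically. The sketch below (our own notation) evaluates \(\mathcal {L}(b)\) from Lemma A.3 and confirms the endpoints b = 1 and b = n.

```python
def expected_smoothness(b, n, L, L_max):
    """Expected smoothness constant L(b) for b-nice sampling (Lemma A.3)."""
    return (n - b) / (b * (n - 1)) * L_max + (n * (b - 1)) / (b * (n - 1)) * L
```

Since \(L \le L_{\max }\), the constant decreases monotonically from \(L_{\max }\) at b = 1 down to L at b = n, which is what drives the optimal-minibatch trade-off studied in the paper.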

Appendix B: Notable Corollaries of Theorem 3.1

In this section, we present corollaries of Theorem 3.1 for five algorithms:

  • SGD with arbitrary sampling (Algorithm 3).

  • DIANA (Algorithm 4).

  • L-SVRG with arbitrary sampling (Algorithm 5), and minibatch L-SVRG as a special case (Algorithm 2).

  • Minibatch SAGA (Algorithm 1).

  • Miniblock SEGA (Algorithm 6).

This means that for each method, we will present the constants which satisfy Assumption 2 and specialize Theorem 3.1 using these constants.

1.1 Appendix B.1: SGD with Arbitrary Sampling

Algorithm 3 (SGD-AS)

Lemma B.1

The iterates of Algorithm 3 satisfy Assumption 2 with

$$\begin{aligned} \sigma _k^2 = 0 \end{aligned}$$

and constants:

$$\begin{aligned} A = 2\mathcal {L}, \; B = 0, \; \rho = 1, \; C = 0, \; D_1 = 2\sigma ^2, \; D_2 = 0, \end{aligned}$$

where \(\mathcal {L}\) is defined in (25) and \(\sigma ^2\) in (26).

Proof

See Lemma A.2 in [11]. \(\square \)

Using the constants given in the above lemma, we have the following immediate corollary of Theorem 3.1.

Corollary B.1

Assume that f has a finite sum structure (4) and that Assumption 1 holds. Let \((\gamma _k)_{k\ge 0}\) be a decreasing, strictly positive sequence of step sizes chosen such that

$$\begin{aligned} 0< \gamma _0 < \min \left\{ \frac{1}{4\mathcal {L}}, \frac{1}{L} \right\} . \end{aligned}$$

Then, from Theorem 3.1 and Lemma B.1, we have that the iterates given by Algorithm 3 verify

$$\begin{aligned} \mathbb {E}_{} \left[ F(\bar{x}_t) - F(x_*)\right] \le \frac{{\left\Vert x_0 - x_*\right\Vert }^2 + 2\gamma _0\left( F(x_0) - F(x_*) \right) + 4\sigma ^2\sum _{k=0}^{t-1}\gamma _k^2}{2\sum _{i=0}^{t-1}\left( 1 - 4\gamma _i\mathcal {L} \right) \gamma _i}, \end{aligned}$$

where \(\bar{x}_t \overset{\text {def}}{=}\sum \nolimits _{k=0}^{t-1} \frac{\left( 1 - 4\gamma _k\mathcal {L} \right) \gamma _k}{\sum _{i=0}^{t-1}\left( 1 - 4\gamma _i\mathcal {L} \right) \gamma _i}x_k\).
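The averaged iterate \(\bar{x}_t\) in Corollary B.1 is a weighted mean of the iterates. A small helper (our own, with `smooth_L` standing in for \(\mathcal {L}\)) makes the weighting explicit.

```python
import numpy as np

def averaged_iterate(xs, gammas, smooth_L):
    """Compute x_bar_t with weights w_k = (1 - 4 * gamma_k * L) * gamma_k."""
    w = np.array([(1.0 - 4.0 * g * smooth_L) * g for g in gammas])
    w /= w.sum()  # normalize so the weights sum to one
    return sum(wk * xk for wk, xk in zip(w, xs))
```

With a constant stepsize the weights are uniform and \(\bar{x}_t\) reduces to the plain average of the iterates.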

1.2 Appendix B.2: DIANA

A complete description of the DIANA algorithm can be found in [26].

To analyze the DIANA algorithm (Algorithm 4), we introduce quantization operators.

Definition B.1

(w-quantization operator, Definition 4 in [26]) Let \(w > 0\). A random operator \(Q: \mathbb {R}^d \rightarrow \mathbb {R}^d\) with the properties:

$$\begin{aligned} \mathbb {E}_{} \left[ Q(x)\right] = x, \quad \mathbb {E}_{} \left[ {\left\Vert Q(x)\right\Vert }^2\right] \le (1+w){\left\Vert x\right\Vert }^2, \end{aligned}$$
(28)

for all \(x \in \mathbb {R}^d\) is called a w-quantization operator.

Several examples of quantization operators can be found in [26].

Algorithm 4 (DIANA)

For convenience, we repeat the statement of Lemma 4.2 below.

Lemma B.2

Assume that f has a finite sum structure and that Assumption 1 holds. The iterates of DIANA (Algorithm 4) satisfy Assumption 2 with constants:

$$\begin{aligned} A = \left( 1+\frac{2w}{n} \right) L_{\max }, \; B = \frac{2w}{n}, \; \rho = \alpha , \; C = L_{\max }\alpha , \; D_1 = \frac{(1+w)\sigma ^2}{n}, \; D_2 = \alpha \sigma ^2, \end{aligned}$$

where \(w > 0\) and \(\alpha \le \frac{1}{1+w}\) are parameters of Algorithm 4 and \(\sigma ^2\) is such that

$$\forall k \in \mathbb {N}, \quad \frac{1}{n}\sum _{i=1}^n\mathbb {E}_{} \left[ {\left\Vert g_i^k - \nabla f(x_k)\right\Vert }^2\right] \le \sigma ^2.$$

Proof

See Lemma A.12 in [11]. \(\square \)

Now using the constants given in the above lemma in Theorem 3.1 gives the following corollary.

Corollary B.2

Assume that f has a finite sum structure (4) and that Assumption 1 holds. Let \((\gamma _k)_{k\ge 0}\) be a decreasing, strictly positive sequence of step sizes chosen such that

$$\begin{aligned} 0< \gamma _0 < \frac{1}{2 (1 + \frac{4w}{n})L_{\max }}. \end{aligned}$$

Then, from Theorem 3.1 and Lemma B.2, we have that the iterates given by Algorithm 4 verify

$$\begin{aligned}&\mathbb {E}_{} \left[ F(\bar{x}_t) - F(x_*)\right] \\&\le \quad \frac{{\left\Vert x_0 - x_*\right\Vert }^2 + 2\gamma _0\left( F(x_0) - F(x_*)+ \frac{2w\gamma _0}{\alpha n}\sigma _0^2 \right) + \frac{2\left( 1+5w \right) \sigma ^2}{n} \sum _{k=0}^{t-1}\gamma _k^2}{2\sum _{i=0}^{t-1}\left( 1 - \gamma _i\eta \right) \gamma _i}, \end{aligned}$$

where \(\eta \overset{\text {def}}{=}2(1+\frac{4w}{n})L_{\max }\) and \(\bar{x}_t \overset{\text {def}}{=}\sum \nolimits _{k=0}^{t-1} \frac{\left( 1 - \gamma _k\eta \right) \gamma _k}{\sum _{i=0}^{t-1}\left( 1 - \gamma _i\eta \right) \gamma _i}x_k\).

1.3 Appendix B.3: L-SVRG with Arbitrary Sampling

Algorithm 5 (L-SVRG-AS)

Lemma B.3

If Assumption 1 holds then the iterates of Algorithm 5 satisfy

$$\begin{aligned} \mathbb {E}_{k} \left[ {\left\Vert g_k - \nabla f(x_*)\right\Vert }^2\right]\le & {} 4\mathcal {L}D_f(x_k, x_*) + 2\sigma _k^2 \\ \mathbb {E}_{k} \left[ \sigma _{k+1}^2\right]\le & {} (1-p)\sigma _k^2 + 2p\mathcal {L}D_f(x_k, x_*), \end{aligned}$$

where

$$\begin{aligned} \sigma _k^2 = \mathbb {E}_{\mathcal {D}} \left[ {\left\Vert \nabla f_{v_k}(w_k) - \nabla f_{v_{k}}(x_*) - \left( \nabla f(w_k) - \nabla f(x_*) \right) \right\Vert }^2\right] \end{aligned}$$

and \(\mathcal {L}\) is defined in (25).

Proof

By Lemma A.1 we have that (25) holds with \(\mathcal {L}>0.\) Furthermore

$$\begin{aligned} \mathbb {E}_{k} \left[ {\left\Vert g_k - \nabla f(x_*)\right\Vert }^2\right]= & {} \mathbb {E}_{k} \left[ {\left\Vert \nabla f_{v_k}(x_k) - \nabla f_{v_k}(w_k) + \nabla f(w_k) - \nabla f(x_*)\right\Vert }^2\right] \\\le & {} 2 \mathbb {E}_{k} \left[ {\left\Vert \nabla f_{v_k}(x_k) - \nabla f_{v_k}(x_*)\right\Vert }^2\right] \\{} & {} +\, 2 \mathbb {E}_{k} \left[ {\left\Vert \nabla f_{v_k}(w_k) - \nabla f_{v_{k}}(x_*) - \left( \nabla f(w_k) - \nabla f(x_*) \right) \right\Vert }^2\right] , \end{aligned}$$

where we used in the inequality that for all \(a, b \in \mathbb {R}^d, {\left\Vert a + b\right\Vert }^2 \le 2{\left\Vert a\right\Vert }^2 + 2{\left\Vert b\right\Vert }^2\). Thus,

$$\begin{aligned} \mathbb {E}_{k} \left[ {\left\Vert g_k - \nabla f(x_*)\right\Vert }^2\right] \overset{(25)}{\le } 4\mathcal {L}D_f\left( x_k, x_* \right) + 2 \sigma _k^2. \end{aligned}$$

Moreover,

$$\begin{aligned} \mathbb {E}_{k} \left[ \sigma _{k+1}^2\right]= & {} (1-p)\sigma _k^2 + p\mathbb {E}_{k} \left[ {\left\Vert \nabla f_{v_k}(x_k) - \nabla f_{v_{k}}(x_*) - \left( \nabla f(x_k) - \nabla f(x_*) \right) \right\Vert }^2\right] \\\overset{(25)}{\le } & {} (1-p)\sigma _k^2 + 2p\mathcal {L}D_f\left( x_k, x_* \right) , \end{aligned}$$

where we also used in the last inequality that \(\mathbb {E}_{} \left[ \left\Vert X - \mathbb {E}_{} \left[ X\right] \right\Vert ^2\right] = \mathbb {E}_{} \left[ \left\Vert X\right\Vert ^2\right] - \left\Vert \mathbb {E}_{} \left[ X\right] \right\Vert ^2 \le \mathbb {E}_{} \left[ \left\Vert X\right\Vert ^2\right] \). \(\square \)

We have the following immediate consequence of the previous lemma.

Lemma B.4

If Assumption 1 holds then the iterates of Algorithm 5 satisfy Assumption 2 with

$$\begin{aligned} \sigma _k^2 = \mathbb {E}_{\mathcal {D}} \left[ {\left\Vert \nabla f_v(w_k) - \nabla f_v(x_*) - \left( \nabla f(w_k) - \nabla f(x_*) \right) \right\Vert }^2\right] \end{aligned}$$

and constants

$$\begin{aligned} A = 2\mathcal {L}, \; B = 2, \; \rho = p, \; C = p\mathcal {L}, \; D_1 = D_2 = 0, \end{aligned}$$

where \(\mathcal {L}\) is defined in (25).

Using the constant derived in Lemma B.4 in Theorem 3.1 gives the following corollary.

Corollary B.3

Assume that f has a finite sum structure (4) and that Assumption 1 holds. Let \(\gamma _k = \gamma \) for all \(k \in \mathbb {N}\), where

$$\begin{aligned} 0< \gamma < \min \left\{ \frac{1}{8\mathcal {L}}, \frac{1}{L} \right\} . \end{aligned}$$

Then, from Theorem 3.1 and Lemma B.4, we have that the iterates given by Algorithm 5 verify

$$\begin{aligned} \mathbb {E}_{} \left[ F(\bar{x}_t) - F(x_*)\right] \le \frac{{\left\Vert x_0 - x_*\right\Vert }^2 + 2\gamma \left( F(x_0) - F(x_*) + \frac{2\gamma }{p}\sigma _0^2 \right) }{2\gamma \left( 1 - 8\gamma \mathcal {L} \right) t}, \end{aligned}$$

where \(\bar{x}_t \overset{\text {def}}{=}\frac{1}{t}\sum \nolimits _{k=0}^{t-1} x_k\) and where \(\mathcal {L}\) is defined in (25).
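For intuition, Algorithm 5 with uniform single-element sampling and R = 0 can be sketched as follows. This is our own minimal sketch, not the paper's code; `grads(i, x)` returning \(\nabla f_i(x)\) is an assumed helper.

```python
import numpy as np

def l_svrg(grads, n, gamma, p, x0, iters, rng):
    """Loopless SVRG: the reference point w is refreshed with probability p."""
    x, w = x0.copy(), x0.copy()
    full = sum(grads(j, w) for j in range(n)) / n      # grad f(w)
    for _ in range(iters):
        i = rng.integers(n)
        g = grads(i, x) - grads(i, w) + full           # variance-reduced estimate
        x = x - gamma * g
        if rng.random() < p:                           # coin flip replaces the outer loop
            w = x.copy()
            full = sum(grads(j, w) for j in range(n)) / n
    return x
```

The coin flip with probability p replaces SVRG's outer loop, which is exactly the "loopless" trick of [23]; in expectation the reference point is refreshed every 1/p iterations.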

1.3.1 Appendix B.3.1: b-L-SVRG

As we demonstrated in Section A.3, we can specialize the results derived for arbitrary sampling to minibatching without replacement by using the b-nice sampling defined in Definition A.4 and the corresponding sampling vector (27).

Indeed, using Algorithm 5 with b-nice sampling is equivalent to using Algorithm 2. Thus, we have the following corollary.

Corollary B.4

From Lemma B.4, we have that the iterates of Algorithm 2 satisfy Assumption 2 with constants:

$$\begin{aligned} A = 2\mathcal {L}(b), \; B = 2, \; \rho = p, \; C = p\mathcal {L}(b), \; D_1 = D_2 = 0, \end{aligned}$$

where \(\mathcal {L}(b)\) is defined in (17).

A convergence result for Algorithm 2 can be easily concluded from Corollary B.3, with \(\mathcal {L}(b)\) in place of \(\mathcal {L}\).

1.4 Appendix B.4: b-SAGA

Lemma 5.1 in the main text is a consequence of the following lemma.

Lemma B.5

Consider the iterates of Algorithm 1. We have:

$$\begin{aligned} \mathbb {E}_{k} \left[ {\left\Vert g_k\right\Vert }^2\right]\le & {} 4\mathcal {L}(b)\left( f(x_k) - f(x_*) \right) + 2\sigma _k^2 \end{aligned}$$
(29)
$$\begin{aligned} \mathbb {E}_{k} \left[ \sigma _{k+1}^2\right]\le & {} \left( 1-\frac{b}{n} \right) \sigma _k^2 + 2\frac{b\zeta (b)}{n} \left( f(x_k) - f(x_*) \right) , \end{aligned}$$
(30)

where:

$$\begin{aligned} \sigma _k^2 = \frac{1}{nb} \frac{n-b}{n-1}\left\Vert J_k - \nabla H(x_*)\right\Vert _{\textrm{Tr}}^2 \quad \text{ and } \quad \zeta (b) \overset{\text {def}}{=}\frac{1}{b}\frac{n-b}{n-1}L_{\max }, \end{aligned}$$

with \(\left\Vert Z\right\Vert _{\textrm{Tr}}^2 = \textrm{tr}(Z^\top Z)\) for any \(Z \in \mathbb {R}^{d \times n}\).

Proof

The inequality (29) corresponds to Lemma 3.10 and (30) to Lemma 3.9 in [13]. \(\square \)

The previous lemma gives us the constants for Assumption 2 for Algorithm 1.

Lemma B.6

The iterates of Algorithm 1 satisfy Assumption 2 with

$$\begin{aligned} \sigma _k^2 = \frac{1}{nb} \frac{n-b}{n-1}\left\Vert J_k - \nabla H(x_*)\right\Vert _{\textrm{Tr}}^2 \end{aligned}$$

and constants

$$\begin{aligned} A = 2\mathcal {L}(b), \; B = 2, \; \rho = \frac{b}{n}, \; C = \frac{b\zeta (b)}{n}, \; D_1 = D_2 = 0. \end{aligned}$$

Using the constant derived in Lemma B.6 in Theorem 3.1 gives the following corollary.

Corollary B.5

Assume that f has a finite sum structure (4) and that Assumption 1 holds. Choose \(\gamma _k = \gamma \) for all \(k \in \mathbb {N}\), where

$$\begin{aligned} 0< \gamma < \frac{1}{2 (2\mathcal {L}(b) + 2\zeta (b))}. \end{aligned}$$

Then, from Theorem 3.1 and Lemma B.6, we have that the iterates given by Algorithm 1 verify

$$\begin{aligned} \mathbb {E}_{} \left[ F(\bar{x}_t) - F(x_*)\right] \le \frac{{\left\Vert x_0 - x_*\right\Vert }^2 + 2\gamma \left( F(x_0) - F(x_*) + \frac{2n\gamma }{b} \sigma _0^2 \right) }{2\gamma \left( 1 - 2\gamma \left( 2\mathcal {L}(b)+2\zeta (b) \right) \right) t}, \end{aligned}$$

where \(\bar{x}_t \overset{\text {def}}{=}\frac{1}{t}\sum \limits _{k=0}^{t-1} x_k\).
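A minimal sketch of minibatch SAGA (Algorithm 1) with b-nice sampling and R = 0, in our own notation: the table J keeps the last gradient seen for each \(f_i\), and its running average is corrected on the sampled minibatch.

```python
import numpy as np

def b_saga(grads, n, b, gamma, x0, iters, rng):
    """Minibatch SAGA with b-nice sampling (no regularizer for simplicity)."""
    x = x0.copy()
    J = np.array([grads(i, x) for i in range(n)])   # gradient table, one row per f_i
    avg = J.mean(axis=0)
    for _ in range(iters):
        S = rng.choice(n, size=b, replace=False)    # b-nice sampling
        fresh = np.array([grads(i, x) for i in S])
        g = (fresh - J[S]).mean(axis=0) + avg       # unbiased SAGA estimate
        x = x - gamma * g
        avg += (fresh - J[S]).sum(axis=0) / n       # keep the table average in sync
        J[S] = fresh
    return x
```

The incremental update of `avg` avoids re-averaging the full table, so each step touches only b rows, matching the per-iteration cost discussed in the complexity analysis.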

1.5 Appendix B.5: b-SEGA

Lemma B.7

Consider the iterates of Algorithm 6. We have:

$$\begin{aligned} \mathbb {E}_{k} \left[ {\left\Vert g_k\right\Vert }^2\right]\le & {} \frac{4dL}{b}D_f\left( x_k, x_* \right) + 2\left( \frac{d}{b} - 1 \right) \sigma _k^2 \\ \mathbb {E}_{k} \left[ \sigma _{k+1}^2\right]\le & {} \left( 1-\frac{b}{d} \right) \sigma _k^2 + \frac{2bL}{d}D_f\left( x_k, x_* \right) , \end{aligned}$$

where:

$$\begin{aligned} \sigma _k^2 = {\left\Vert h_k - \nabla f(x_*)\right\Vert }^2. \end{aligned}$$

Proof

Let S be a random miniblock such that \(\mathbb {P}(S = B) = \frac{1}{\left( {\begin{array}{c}d\\ b\end{array}}\right) }\) for any \(B \subseteq [d]\) with \(|B| = b\). Then, for any vector \(a = [a_1,\dots ,a_d] \in \mathbb {R}^d\), we have:

$$\begin{aligned} \mathbb {E}_{} \left[ {\left\Vert I_Sa\right\Vert }^2\right] = \frac{b}{d}{\left\Vert a\right\Vert }^2 \quad \text{ and } \quad \mathbb {E}_{} \left[ {\left\Vert (I-\frac{d}{b}I_S)a\right\Vert }^2\right] = \left( \frac{d}{b} - 1 \right) {\left\Vert a\right\Vert }^2. \end{aligned}$$
(31)

Indeed,

$$\begin{aligned} \mathbb {E}_{} \left[ {\left\Vert I_S a\right\Vert }^2\right]= & {} \mathbb {E}_{} \left[ \sum _{i \in S} a_i^2\right] = \sum _{B \subseteq [d], |B| = b}\mathbb {P}(S = B)\sum _{i \in B} a_i^2 = \frac{1}{\left( {\begin{array}{c}d\\ b\end{array}}\right) }\sum _{B \subseteq [d], |B| = b}\sum _{i = 1}^d a_i^2 \mathbb {1}_B(i)\\= & {} \frac{1}{\left( {\begin{array}{c}d\\ b\end{array}}\right) }\sum _{i = 1}^d a_i^2 \sum _{B \subseteq [d], |B| = b} \mathbb {1}_B(i) = \frac{\left( {\begin{array}{c}d-1\\ b-1\end{array}}\right) }{\left( {\begin{array}{c}d\\ b\end{array}}\right) }\sum _{i = 1}^d a_i^2 = \frac{b}{d}{\left\Vert a\right\Vert }^2, \end{aligned}$$

where we used that \(|\{B \subseteq [d]: |B|=b \wedge i \in B\}| = \left( {\begin{array}{c}d-1\\ b-1\end{array}}\right) \). Moreover,

$$\begin{aligned} {\left\Vert (I-\frac{d}{b}I_S)a\right\Vert }^2= & {} \sum _{i\in S}\left( 1-\frac{d}{b} \right) ^2 a_i^2 + \sum _{i \notin S} a_i^2 = \frac{d^2 - 2bd}{b^2}\sum _{i \in S}a_i^2 + {\left\Vert a\right\Vert }^2 \\= & {} \frac{d^2 - 2bd}{b^2} {\left\Vert I_{S}a\right\Vert }^2 + {\left\Vert a\right\Vert }^2. \end{aligned}$$

Thus,

$$\begin{aligned} \mathbb {E}_{} \left[ \left\Vert (I-\frac{d}{b}I_S)a\right\Vert ^2\right] = \left( \frac{d^2 - 2bd}{b^2}\frac{b}{d} + 1 \right) {\left\Vert a\right\Vert }^2 = \left( \frac{d}{b} - 1 \right) {\left\Vert a\right\Vert }^2. \end{aligned}$$
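Both identities in (31) are easy to check numerically by enumerating every block of size b. The following sketch (ours, for illustration only) does exactly that:

```python
import itertools

import numpy as np

d, b = 5, 2
a = np.array([1.0, -2.0, 0.5, 3.0, -1.0])

# Average ||I_S a||^2 and ||(I - (d/b) I_S) a||^2 over all b-subsets S of [d].
norms_IS, norms_compl = [], []
for S in itertools.combinations(range(d), b):
    mask = np.zeros(d)
    mask[list(S)] = 1.0  # diagonal of I_S
    norms_IS.append(np.sum((mask * a) ** 2))
    norms_compl.append(np.sum(((1.0 - (d / b) * mask) * a) ** 2))

lhs1, lhs2 = np.mean(norms_IS), np.mean(norms_compl)
rhs1 = (b / d) * np.sum(a ** 2)        # E||I_S a||^2          = (b/d) ||a||^2
rhs2 = (d / b - 1) * np.sum(a ** 2)    # E||(I-(d/b)I_S)a||^2  = (d/b - 1) ||a||^2
```

Each coordinate belongs to exactly \(\left( {\begin{array}{c}d-1\\ b-1\end{array}}\right) \) of the \(\left( {\begin{array}{c}d\\ b\end{array}}\right) \) blocks, which is why the averages match the closed forms exactly.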

We have

$$\begin{aligned}{} & {} \mathbb {E}_{k} \left[ {\left\Vert g_k - \nabla f(x_*)\right\Vert }^2\right] \\{} & {} \quad = \mathbb {E}_{k} \left[ {\left\Vert \frac{d}{b}I_{B_k}(\nabla f(x_k)- \nabla f(x_*)) + \left( I - \frac{d}{b}I_{B_k} \right) (h_k- \nabla f(x_*))\right\Vert }^2\right] \\{} & {} \quad \le \frac{2d^2}{b^2} \mathbb {E}_{k} \left[ {\left\Vert I_{B_k}(\nabla f(x_k)- \nabla f(x_*))\right\Vert }^2\right] \\{} & {} \qquad + 2 \ \mathbb {E}_{k} \left[ {\left\Vert \left( I - \frac{d}{b}I_{B_k} \right) (h_k- \nabla f(x_*))\right\Vert }^2\right] \\{} & {} \quad \overset{(31)}{=} \frac{2d}{b}{\left\Vert \nabla f(x_k) - \nabla f(x_*)\right\Vert }^2 + 2\left( \frac{d}{b} - 1 \right) {\left\Vert h_k - \nabla f(x_*)\right\Vert }^2. \end{aligned}$$

where we used in the first inequality that for all \(a, b \in \mathbb {R}^d, {\left\Vert a + b\right\Vert }^2 \le 2{\left\Vert a\right\Vert }^2 + 2{\left\Vert b\right\Vert }^2\). Thus, using the fact that f is L-smooth, we have

$$\begin{aligned} \mathbb {E}_{k} \left[ {\left\Vert g_k - \nabla f(x_*)\right\Vert }^2\right] \le \frac{4dL}{b}D_f\left( x_k, x_* \right) + 2\left( \frac{d}{b} - 1 \right) \sigma _k^2. \end{aligned}$$

Moreover,

$$\begin{aligned} \mathbb {E}_{k} \left[ \sigma _{k+1}^2\right]= & {} \mathbb {E}_{k} \left[ {\left\Vert h_{k+1}- \nabla f(x_*)\right\Vert }^2\right] \\= & {} \mathbb {E}_{k} \left[ {\left\Vert I_{B_k^c}(h_k- \nabla f(x_*)) + I_{B_k}(\nabla f(x_k)- \nabla f(x_*))\right\Vert }^2\right] \\\overset{(31)}{=} & {} \left( 1 - \frac{b}{d} \right) {\left\Vert h_k- \nabla f(x_*)\right\Vert }^2 + \frac{b}{d}{\left\Vert \nabla f(x_k)- \nabla f(x_*)\right\Vert }^2 \\{} & {} + 2\left\langle I_{B_k^c}(h_k- \nabla f(x_*)), I_{B_k}(\nabla f(x_k)- \nabla f(x_*)) \right\rangle \\= & {} \left( 1 - \frac{b}{d} \right) {\left\Vert h_k- \nabla f(x_*)\right\Vert }^2 + \frac{b}{d}{\left\Vert \nabla f(x_k)- \nabla f(x_*)\right\Vert }^2 \\{} & {} + 2\left\langle \underbrace{I_{B_k}I_{B_k^c}}_{=0}(h_k- \nabla f(x_*)), \nabla f(x_k)- \nabla f(x_*) \right\rangle \\\le & {} \left( 1 - \frac{b}{d} \right) {\left\Vert h_k - \nabla f(x_*)\right\Vert }^2 + \frac{2bL}{d}D_f\left( x_k, x_* \right) , \end{aligned}$$

where we used in the last inequality the \(L-\)smoothness of f. \(\square \)
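Note that the third step in the display above is an exact equality: since \(I_{B_k}I_{B_k^c} = 0\), the cross term vanishes, so before smoothness is applied \(\mathbb {E}_{k}[\sigma _{k+1}^2] = \left( 1-\frac{b}{d}\right) \sigma _k^2 + \frac{b}{d}{\Vert \nabla f(x_k)-\nabla f(x_*)\Vert }^2\). A small numerical sanity check of this identity, with arbitrary vectors standing in for \(h_k\), \(\nabla f(x_k)\) and \(\nabla f(x_*)\):

```python
import itertools

import numpy as np

rng = np.random.default_rng(0)
d, b = 6, 2
h, gk, gstar = rng.normal(size=(3, d))  # stand-ins for h_k, grad f(x_k), grad f(x_*)

vals = []
for S in itertools.combinations(range(d), b):
    mask = np.zeros(d)
    mask[list(S)] = 1.0
    # h_{k+1} - grad f(x_*) = I_{S^c}(h_k - grad f(x_*)) + I_S(grad f(x_k) - grad f(x_*))
    vals.append(np.sum(((1.0 - mask) * (h - gstar) + mask * (gk - gstar)) ** 2))

lhs = np.mean(vals)  # E_k[sigma_{k+1}^2] under uniform b-block sampling
rhs = (1 - b / d) * np.sum((h - gstar) ** 2) + (b / d) * np.sum((gk - gstar) ** 2)
```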

Lemma B.8

From Lemma B.7, we have that the iterates of Algorithm 6 satisfy Assumption 2 and Eq. (14) with

$$\begin{aligned} \sigma _k^2 = {\left\Vert h_k - \nabla f(x_*)\right\Vert }^2 \end{aligned}$$

and constants:

$$\begin{aligned} A = \frac{2dL}{b}, \; B = 2\left( \frac{d}{b}-1 \right) , \; \rho = \frac{b}{d}, \; C = \frac{bL}{d}, \; D_1 = D_2 = 0, \; G=0. \end{aligned}$$

Using the constants derived in Lemma B.8 in Theorem 3.1 gives the following corollary.

Corollary B.6

Assume that f satisfies Assumption 1. Choose for all \(k \in \mathbb {N}\), \(\gamma _k = \gamma \), where

$$\begin{aligned} 0< \gamma < \frac{1}{4(\frac{2d}{b} - 1)L}. \end{aligned}$$

Then, from Theorem 3.1 and Lemma B.8, we have that the iterates given by Algorithm 6 verify

$$\begin{aligned} \mathbb {E}_{} \left[ F(\bar{x}_t) - F(x_*)\right] \le \frac{{\left\Vert x_0 - x_*\right\Vert }^2 + 2\gamma \left( F(x_0) - F(x_*) + \frac{2d}{b}\left( \frac{d}{b} - 1 \right) \gamma \sigma _0^2 \right) }{2\gamma \left( 1 - 4\gamma \left( \frac{2d}{b}-1 \right) L \right) t}, \end{aligned}$$

where \(\bar{x}_t \overset{\text {def}}{=}\frac{1}{t}\sum \nolimits _{k=0}^{t-1} x_k\).

Appendix C: Proofs for Sect. 3

Appendix C.1: Proof of Theorem 3.1

Before proving Theorem 3.1, we present several useful lemmas.

Lemma C.1

(Bounding the gradient variance) Assuming that the \(g_k\) are unbiased and that Assumption 2 holds, we have

$$\begin{aligned} \mathbb {E}_{} \left[ \left\Vert g_k - \nabla f(x_k)\right\Vert ^2\right] \le 2 A D_{f} (x_k, x_*) + B \sigma _k^2 + D_1. \end{aligned}$$
(32)

Proof

Starting from the left-hand side of (32), we have

$$\begin{aligned} \mathbb {E}_{} \left[ \left\Vert g_k - \nabla f(x_k)\right\Vert ^2\right]&= \mathbb {E}_{} \left[ \left\Vert g_k - \nabla f(x_*) - \left( \nabla f(x_k) - \nabla f(x_*) \right) \right\Vert ^2\right] \\&= \mathbb {E}_{} \left[ \left\Vert g_k - \nabla f(x_*) - \mathbb {E}_{} \left[ g_k - \nabla f(x_*) \right] \right\Vert ^2\right] \\&\le \mathbb {E}_{} \left[ \left\Vert g_k - \nabla f(x_*)\right\Vert ^2\right] \le 2 A D_{f} (x_k, x_*) + B \sigma _k^2 + D_1, \end{aligned}$$

where we used that \(\mathbb {E}_{} \left[ \left\Vert X - \mathbb {E}_{} \left[ X\right] \right\Vert ^2\right] = \mathbb {E}_{} \left[ \left\Vert X\right\Vert ^2\right] - \left\Vert \mathbb {E}_{} \left[ X\right] \right\Vert ^2 \le \mathbb {E}_{} \left[ \left\Vert X\right\Vert ^2\right] \) for any random variable X. \(\square \)

Lemma C.2

(Lemma 8 in [5]) Suppose that Assumption 1 holds and let \(\gamma \in \left( 0, \frac{1}{L} \right] \), then for all \(x, y \in \mathbb {R}^d\) and \(p = \textrm{prox}_{\gamma g}(y)\) we have,

$$\begin{aligned} - 2 \gamma \left( F(p) - F(x_*) \right) \ge {\left\Vert p - x_* \right\Vert }^2 + 2 \left\langle p - x_*, x - \gamma \nabla f(x) - y \right\rangle - {\left\Vert x_*- x\right\Vert }^2.\nonumber \\ \end{aligned}$$
(33)

Proof

We leave the proof to Appendix F.3. \(\square \)

Lemma C.3

For any \(x \in \mathbb {R}^d\) and minimizer \(x_*\) of F, we have,

$$\begin{aligned} D_{f} (x, x_*) \le F(x) - F(x_*). \end{aligned}$$
(34)

Proof

Because \(x_*\) is a minimizer of F, we have that \(- \nabla f(x_*) \in \partial R(x_*)\). By the definition of subgradients, we have

$$\begin{aligned} R(x_*) + \left\langle - \nabla f(x_*), x - x_* \right\rangle \le R(x). \end{aligned}$$

Rearranging gives

$$\begin{aligned} -\left\langle \nabla f(x_*), x - x_* \right\rangle \le R(x) - R(x_*). \end{aligned}$$

Adding \(f(x) - f(x_*)\) to both sides we have,

$$\begin{aligned} f(x) - f(x_*) - \left\langle \nabla f(x_*), x - x_* \right\rangle \le f(x) + R(x) - \left( f(x_*) + R(x_*) \right) = F(x) - F(x_*). \end{aligned}$$

Now note that on the left-hand side we have the Bregman divergence \(D_{f} (x, x_*)\).

\(\square \)

Definition C.1

Given a stepsize \(\gamma > 0\), the prox-grad mapping is defined as:

$$\begin{aligned} T_{\gamma } (x) \overset{\text {def}}{=}\textrm{prox}_{\gamma R} \left( x - \gamma \nabla f(x) \right) . \end{aligned}$$
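As a concrete illustration (our own example, not from the paper), take \(R(x) = \lambda \Vert x\Vert _1\) and \(f(x) = \frac{1}{2}\Vert x - a\Vert ^2\): the proximal operator becomes coordinate-wise soft-thresholding, and every minimizer of F is a fixed point of \(T_\gamma \):

```python
import numpy as np

def soft_threshold(v, tau):
    """prox of tau * ||.||_1: coordinate-wise shrinkage."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_mapping(x, grad_f, gamma, lam):
    """T_gamma(x) = prox_{gamma * lam * ||.||_1}(x - gamma * grad_f(x))."""
    return soft_threshold(x - gamma * grad_f(x), gamma * lam)

# Illustrative problem: f(x) = 0.5 ||x - a||^2 (1-smooth), R(x) = lam * ||x||_1.
# Its minimizer has the closed form x_* = soft_threshold(a, lam).
a = np.array([2.0, -0.3, 0.8])
lam, gamma = 0.5, 0.7
x_star = soft_threshold(a, lam)
t_of_x_star = prox_grad_mapping(x_star, lambda x: x - a, gamma, lam)
```

For any \(\gamma \in (0, 1/L]\), `t_of_x_star` coincides with `x_star`, which is the fixed-point property used implicitly throughout the analysis.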

For the ease of exposition, we restate Theorem 3.1.

Theorem C.1

Suppose that Assumptions 1 and 2 hold. Let \(M \overset{\text {def}}{=}B/\rho \) and let \((\gamma _k)_{k\ge 0}\) be a decreasing, strictly positive sequence of step sizes chosen such that

$$\begin{aligned} 0< \gamma _0 < \frac{1}{2 (A + MC)}. \end{aligned}$$

The iterates given by (2) converge according to

$$\begin{aligned} \mathbb {E}_{} \left[ F(\bar{x}_t) - F(x_*)\right] \le \frac{V_0 + 2\gamma _0 \delta _0 + 2\left( D_1 + M D_2 \right) \sum _{k=0}^{t-1}\gamma _k^2}{2\sum _{i=0}^{t-1}\left( 1 - 2\gamma _i\left( A+MC \right) \right) \gamma _i}, \end{aligned}$$

where \(\eta \overset{\text {def}}{=}2\left( A + MC \right) \), \(\bar{x}_t \overset{\text {def}}{=}\sum \nolimits _{k=0}^{t-1} \frac{\left( 1 - \gamma _k \eta \right) \gamma _k}{\sum _{i=0}^{t-1}\left( 1 - \gamma _i\eta \right) \gamma _i}x_k\), \(V_0 \overset{\text {def}}{=}{\left\Vert x_0 - x_*\right\Vert }^2 + 2 \gamma _0^2\,M \sigma _0^2\) and \({\delta _0 \overset{\text {def}}{=}F(x_0) - F(x_*)}\).

Proof

Let \(x_*\) be a minimizer of F. Using (33) from Lemma C.2 with \(y = x_k - \gamma _k g_k\), \(x = x_k\) and \(\gamma = \gamma _k\) gives

$$\begin{aligned} -2 \gamma _k \left( F(x_{k+1}) - F(x_*) \right)&\ge \left\Vert x_{k+1} - x_*\right\Vert ^2 - \left\Vert x_k - x_*\right\Vert ^2\\&\quad + 2 \gamma _k \left\langle x_{k+1} - x_*, g_k - \nabla f(x_k) \right\rangle . \end{aligned}$$

Multiplying both sides by \(-1\) results in

$$\begin{aligned} \begin{aligned} 2 \gamma _k \left( F(x_{k+1}) - F(x_*) \right)&\le {\left\Vert x_k - x_*\right\Vert }^2 - {\left\Vert x_{k+1} - x_*\right\Vert }^2 \\&\qquad + 2 \gamma _k \left\langle x_{k+1} - x_*, \nabla f(x_k) - g_k \right\rangle . \end{aligned} \end{aligned}$$
(35)

Focusing now on the last term above, consider the straightforward decomposition

$$\begin{aligned} \begin{aligned} \left\langle x_{k+1} - x_*, \nabla f(x_k) - g_k \right\rangle&= \left\langle x_{k+1} - T_{\gamma _k} (x_k), \nabla f(x_k) - g_k \right\rangle \\&\qquad + \left\langle T_{\gamma _k} (x_k) - x_*, \nabla f(x_k) - g_k \right\rangle . \end{aligned} \end{aligned}$$
(36)

By the Cauchy-Schwarz inequality, we have that

$$\begin{aligned} \left\langle x_{k+1} - T_{\gamma _k} (x_k), \nabla f(x_k) - g_k \right\rangle \le \left\Vert x_{k+1} - T_{\gamma _k} (x_k) \right\Vert \left\Vert g_k - \nabla f(x_k)\right\Vert . \end{aligned}$$
(37)

Now using the nonexpansivity of the proximal operator

$$\begin{aligned} \left\Vert x_{k+1} - T_{\gamma _k} (x_k)\right\Vert&= \left\Vert \textrm{prox}_{\gamma _k R} \left( x_k - \gamma _k g_k \right) - \textrm{prox}_{\gamma _k R} \left( x_k - \gamma _k \nabla f(x_k) \right) \right\Vert \\&\le \left\Vert \left( x_k - \gamma _k g_k \right) - \left( x_k - \gamma _k \nabla f(x_k) \right) \right\Vert = \gamma _k \left\Vert g_k - \nabla f(x_k)\right\Vert . \end{aligned}$$

Using this in (37), we have

$$\begin{aligned} \left\langle x_{k+1} - T_{\gamma _k} (x_k), \nabla f(x_k) - g_k \right\rangle \le \gamma _k \left\Vert g_k - \nabla f(x_k)\right\Vert ^2. \end{aligned}$$
(38)

Using (38) in (36) and taking expectation conditioned on \(x_k\), and using \(\mathbb {E}_{k} \left[ \cdot \right] \overset{\text {def}}{=}\mathbb {E}_{} \left[ \cdot \; | \; x_k\right] \) for shorthand, we have

$$\begin{aligned} \begin{aligned} \mathbb {E}_{k} \left[ \left\langle x_{k+1} - x_*, \nabla f(x_k) - g_k \right\rangle \right]&\le \gamma _k \cdot \mathbb {E}_{k} \left[ \left\Vert g_k - \nabla f(x_k)\right\Vert ^2\right] \\&\qquad + \left\langle T_{\gamma _k} (x_k) - x_*, \underbrace{\mathbb {E}_{k} \left[ \nabla f(x_k) - g_k\right] }_{= 0} \right\rangle \\&= \gamma _k \cdot \mathbb {E}_{k} \left[ \left\Vert g_k - \nabla f(x_k)\right\Vert ^2\right] . \end{aligned} \end{aligned}$$
(39)

Let \(r_k \overset{\text {def}}{=}x_k-x_*.\) Taking expectation conditioned on \(x_k\) in (35) and using (39), we have

$$\begin{aligned} 2 \gamma _k \mathbb {E}_{k} \left[ F(x_{k+1}) - F(x_*) \right]&\le {\left\Vert r_k\right\Vert }^2 - \mathbb {E}_{k} \left[ \left\Vert r_{k+1}\right\Vert ^2\right] + 2 \gamma _k^2 \mathbb {E}_{k} \left[ \left\Vert g_k - \nabla f(x_k)\right\Vert ^2\right] . \end{aligned}$$

Using (8) from Assumption 2, we have

$$\begin{aligned}&2 \gamma _k \mathbb {E}_{k} \left[ F(x_{k+1}) - F(x_*) \right] \\&\quad \le {\left\Vert r_k\right\Vert }^2 - \mathbb {E}_{k} \left[ \left\Vert r_{k+1}\right\Vert ^2\right] + 2 \gamma _k^2 \left( 2 A D_{f} (x_k, x_*) + B \sigma _k^2 + D_1 \right) . \end{aligned}$$

Let \(V_k \overset{\text {def}}{=}{\left\Vert r_k\right\Vert }^2 + 2\,M \gamma _k^2 \sigma _{k}^2\) where \(M = \frac{B}{\rho }\), then

$$\begin{aligned} \begin{aligned} 2 \gamma _k \mathbb {E}_{k} \left[ F(x_{k+1}) - F(x_*) \right] \le V_k&- \mathbb {E}_{k} \left[ V_{k+1}\right] + 4 \gamma _k^2 A D_{f} (x_k, x_*) + 2 \gamma _k^2 D_1 \\&+ \gamma _k^2 \left( 2B - 2 M \right) \sigma _{k}^2 + 2 M \gamma _{k+1}^2 \mathbb {E}_{} \left[ \sigma _{k+1}^2\right] . \end{aligned} \end{aligned}$$
(40)

Since \(\gamma _{k+1} \le \gamma _k\), we have that

$$\begin{aligned} \begin{aligned} 2 \gamma _{k+1} \mathbb {E}_{k} \left[ F(x_{k+1}) - F(x_*) \right] \le V_k&- \mathbb {E}_{k} \left[ V_{k+1}\right] + 4 \gamma _k^2 A D_{f} (x_k, x_*) + 2 \gamma _k^2 D_1 \\&+ \gamma _k^2 \left( 2B - 2 M \right) \sigma _{k}^2 + 2 M \gamma _k^2 \mathbb {E}_{} \left[ \sigma _{k+1}^2\right] . \end{aligned} \end{aligned}$$

Using (9) from Assumption 2, we have

$$\begin{aligned} 2 \gamma _k^2 \left( B - M \right) \sigma _k^2 + 2 M \gamma _k^2 \mathbb {E}_{k} \left[ \sigma _{k+1}^2\right]&\le 2 \gamma _k^2 \left( B - M + M (1 - \rho ) \right) \sigma _k^2 + 4 M \gamma _k^2 C D_{f} (x_k, x_*)\nonumber \\&\quad + 2 M \gamma _k^2 D_2 \nonumber \\&= 2 \gamma _k^2 \underbrace{\left( B - \rho M \right) }_{= 0} \sigma _k^2 + 4 M \gamma _k^2 C D_{f} (x_k, x_*) +2 M \gamma _k^2 D_2 \nonumber \\&\le 4 M \gamma _k^2 C D_{f} (x_k, x_*) +2 M \gamma _k^2 D_2. \end{aligned}$$
(41)

Using (41) in (40) gives

$$\begin{aligned} \begin{aligned} 2 \gamma _{k+1} \mathbb {E}_{k} \left[ F(x_{k+1}) - F(x_*) \right] \le V_k - \mathbb {E}_{k} \left[ V_{k+1}\right]&+ 2 \gamma _k^2 \left( 2 A + 2 M C \right) D_{f} (x_k, x_*) \\&+ 2 \gamma _k^2 \left( D_1 + M D_2 \right) . \end{aligned} \end{aligned}$$
(42)

Let \(\eta \overset{\text {def}}{=}2A + 2 M C\). Using (34) in (42) we have,

$$\begin{aligned}&2 \gamma _{k+1} \mathbb {E}_{k} \left[ F(x_{k+1}) - F(x_*) \right] \\&\quad \le V_k - \mathbb {E}_{k} \left[ V_{k+1}\right] + 2 \gamma _k^2 \eta \left( F(x_k) - F(x_*) \right) + 2 \gamma _k^2 \left( D_1 + M D_2 \right) . \end{aligned}$$

Using the abbreviation \(\delta _k = F(x_k) - F(x_*)\) gives

$$\begin{aligned} 2 \gamma _{k+1} \mathbb {E}_{k} \left[ \delta _{k+1}\right]&\le V_{k} - \mathbb {E}_{k} \left[ V_{k+1}\right] + 2 \gamma _k^2 \eta \delta _k + 2 \gamma _k^2 \left( D_1 + M D_2 \right) . \end{aligned}$$

Taking expectation,

$$\begin{aligned} 2 \gamma _{k+1} \mathbb {E}_{} \left[ \delta _{k+1}\right]&\le \mathbb {E}_{} \left[ V_{k}\right] - \mathbb {E}_{} \left[ V_{k+1}\right] + 2 \gamma _k^2 \eta \mathbb {E}_{} \left[ \delta _k\right] + 2 \gamma _k^2 \left( D_1 + M D_2 \right) , \end{aligned}$$

summing over \(k =0,\ldots , t-1\) and using telescopic cancellation gives

$$\begin{aligned} 2 \sum _{k=1}^{t}\gamma _k\mathbb {E}_{} \left[ \delta _{k}\right]&\le V_{0} - \mathbb {E}_{} \left[ V_{t}\right] + 2 \eta \sum _{k=0}^{t-1}\gamma _k^2 \mathbb {E}_{} \left[ \delta _k\right] + 2 \left( D_1 + M D_2 \right) \sum _{k=0}^{t-1}\gamma _k^2. \end{aligned}$$

Adding \(2\gamma _0\delta _0\) to both sides of the above inequality and rearranging,

$$\begin{aligned} 2 \sum _{k=0}^{t-1}\gamma _k(1-\eta \gamma _k)\mathbb {E}_{} \left[ \delta _{k}\right]&\le V_{0} - \mathbb {E}_{} \left[ V_{t}\right] + 2 \gamma _0 \delta _0 + 2 \left( D_1 + M D_2 \right) \sum _{k=0}^{t-1}\gamma _k^2 \end{aligned}$$

where we also used that \(V_t \ge 0\) and \(\delta _t \ge 0.\) By the choice of \(\gamma _0\) we have \(1 - \gamma _0 \eta > 0\), and since \((\gamma _i)_i\) is a decreasing sequence, we have \(1 - \gamma _i \eta > 0\) for all i. Hence, dividing both sides by \(2\sum \nolimits _{i=0}^{t-1}\left( 1 - \gamma _i\eta \right) \gamma _i\), we have

$$\begin{aligned} \sum _{k=0}^{t-1} w_k \mathbb {E}_{} \left[ \delta _k\right] \le \frac{V_0 + 2 \gamma _0 \delta _0}{2\sum _{i=0}^{t-1}\left( 1 - \gamma _i\eta \right) \gamma _i} + \left( D_1 + M D_2 \right) \frac{\sum _{k=0}^{t-1}\gamma _k^2}{\sum _{i=0}^{t-1}\left( 1-\gamma _i\eta \right) \gamma _i}, \end{aligned}$$

where \(w_k \overset{\text {def}}{=}\frac{\left( 1 - \gamma _k \eta \right) \gamma _k}{\sum _{i=0}^{t-1}\left( 1 - \gamma _i\eta \right) \gamma _i}\) for all \(k \in \left\{ 0,\dots ,t-1\right\} \). Note that \(\sum _{k=0}^{t-1} w_k = 1\) and \(w_k \ge 0\) for all \(k \in \left\{ 0,\dots ,t-1\right\} \). Hence, since F is convex, we can use Jensen’s inequality to conclude

$$\begin{aligned} \mathbb {E}_{} \left[ F(\bar{x}_t) - F(x_*)\right]&= \mathbb {E}_{} \left[ F\left( \sum \limits _{k=0}^{t-1} w_k x_k \right) - F(x_*)\right] \\&\le \sum _{k=0}^{t-1}w_k \mathbb {E}_{} \left[ \delta _k\right] \le \frac{V_0 + 2\gamma _0 \delta _0}{2\sum _{i=0}^{t-1}\left( 1 - \gamma _i\eta \right) \gamma _i} + \frac{\left( D_1 + M D_2 \right) \sum _{k=0}^{t-1}\gamma _k^2}{\sum _{i=0}^{t-1}\left( 1-\gamma _i\eta \right) \gamma _i}. \end{aligned}$$

Writing out the definition of \(\delta _0\) yields the theorem’s statement. \(\square \)
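To illustrate the theorem on a toy instance (this experiment is ours, not from the paper; for simplicity we use plain \(\gamma _k\)-weights in the average rather than the \((1-\gamma _k\eta )\gamma _k\)-weights of the theorem), the sketch below runs proximal SGD with decreasing stepsizes on \(f(x) = \frac{1}{2}\Vert x-a\Vert ^2\) plus an \(\ell _1\) regularizer and checks that the weighted average iterate nearly minimizes F:

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 5, 0.1
a = rng.normal(size=d)

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def F(x):
    # F = f + R with f(x) = 0.5 ||x - a||^2 and R(x) = lam * ||x||_1
    return 0.5 * np.sum((x - a) ** 2) + lam * np.sum(np.abs(x))

x_star = soft_threshold(a, lam)  # closed-form minimizer of F

x = np.zeros(d)
weights, iterates = [], []
for k in range(5000):
    gk = (x - a) + 0.5 * rng.normal(size=d)  # unbiased stochastic gradient of f
    gamma_k = 0.5 / np.sqrt(k + 1)           # decreasing stepsizes
    iterates.append(x.copy())
    weights.append(gamma_k)
    x = soft_threshold(x - gamma_k * gk, gamma_k * lam)  # proximal SGD step

x_bar = np.average(iterates, axis=0, weights=weights)    # gamma_k-weighted average
gap = F(x_bar) - F(x_star)
```

With these (illustrative) parameters, `gap` is small and nonnegative, matching the qualitative behavior the theorem predicts for averaged iterates.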

Appendix D: Proofs for Sect. 4

Appendix D.2: Proof of Corollary 4.2

Proof

Note that, using the integral bound, we have:

$$\begin{aligned} \sum _{k=0}^{t-1}\gamma _k^2\le & {} \gamma ^2\left( \log (t) + 1 \right) \\ \sum _{k=0}^{t-1}\gamma _k\ge & {} 2\gamma \left( \sqrt{t} - 1 \right) . \end{aligned}$$

Moreover, note that since \(\gamma _k \le \frac{1}{4\left( A+MC \right) }\), we have \(1 - 2\gamma _k(A+MC) \ge \frac{1}{2}\) for all \(k \in \mathbb {N}\). Thus

$$\begin{aligned} \frac{1}{2\sum _{k=0}^{t-1} \gamma _k\left( 1 - \eta \gamma _k \right) } \le \frac{1}{2\gamma \left( \sqrt{t} - 1 \right) }, \end{aligned}$$

where \(\eta \overset{\text {def}}{=}2(A+MC)\). Corollary 4.2 follows from using these bounds in Equation (10). \(\square \)
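Assuming the stepsize sequence is \(\gamma _k = \gamma /\sqrt{k+1}\) (the sequence consistent with the two integral bounds above; Corollary 4.2 itself is not reproduced in this excerpt), both bounds can be verified numerically:

```python
import math

def stepsize_sums(gamma, t):
    """Sums of gamma_k^2 and gamma_k for the sequence gamma_k = gamma / sqrt(k+1)."""
    steps = [gamma / math.sqrt(k + 1) for k in range(t)]
    return sum(s * s for s in steps), sum(steps)

gamma, t = 0.1, 1000
sum_sq, sum_lin = stepsize_sums(gamma, t)
upper = gamma ** 2 * (math.log(t) + 1)  # integral bound on the sum of gamma_k^2
lower = 2 * gamma * (math.sqrt(t) - 1)  # integral bound on the sum of gamma_k
```

The first bound compares the harmonic sum with \(\int _1^t \mathrm{d}x/x\); the second compares \(\sum 1/\sqrt{k+1}\) with \(\int _1^{t+1} \mathrm{d}x/\sqrt{x}\).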

Appendix E: Proofs for Sect. 5

Appendix E.1: Proof of Proposition 5.1

Proof

We start by expanding the square:

$$\begin{aligned} {\left\Vert x_{k+1} - x_*\right\Vert }^2= & {} {\left\Vert x_k - x_*\right\Vert }^2 - 2\gamma \left\langle g_k, x_k - x_* \right\rangle + \gamma ^2{\left\Vert g_k\right\Vert }^2. \end{aligned}$$

Thus, taking expectation conditioned on \(x_k\), and using \(\mathbb {E}_{k} \left[ \cdot \right] \overset{\text {def}}{=}\mathbb {E}_{} \left[ \cdot \; | \; x_k\right] \) for shorthand, we have

$$\begin{aligned} \mathbb {E}_{k} \left[ {\left\Vert x_{k+1} - x_*\right\Vert }^2\right]= & {} {\left\Vert x_k - x_*\right\Vert }^2 - 2\gamma \left\langle \nabla f(x_k), x_k - x_* \right\rangle + \gamma ^2\mathbb {E}_{k} \left[ {\left\Vert g_k\right\Vert }^2\right] \\\overset{(6)+(7)+(8)}{\le } & {} {\left\Vert x_k - x_*\right\Vert }^2 - 2\gamma (1-2\gamma A)\left( f(x_k) - f(x_*) \right) + \gamma ^2 B\sigma _k^2. \end{aligned}$$

Thus, using (9),

$$\begin{aligned}&\mathbb {E}_{k} \left[ {\left\Vert x_{k+1} - x_*\right\Vert }^2\right] + 2M\gamma ^2\mathbb {E}_{k} \left[ \sigma _{k+1}^2\right] \\&\quad \le {\left\Vert x_k - x_*\right\Vert }^2 - 2\gamma (1-2\gamma (A+MC))\left( f(x_k) - f(x_*) \right) + 2M\gamma ^2\sigma _k^2. \end{aligned}$$

Thus, rearranging and taking the expectation, we have:

$$\begin{aligned} 2\gamma (1-2\gamma (A+MC))\mathbb {E}_{} \left[ f(x_k) - f(x_*)\right]\le & {} \mathbb {E}_{} \left[ \left\Vert x_k - x_*\right\Vert ^2\right] - \mathbb {E}_{} \left[ \left\Vert x_{k+1} - x_*\right\Vert ^2\right] \\{} & {} +\, 2M\gamma ^2\left( \mathbb {E}_{} \left[ \sigma _k^2\right] - \mathbb {E}_{} \left[ \sigma _{k+1}^2\right] \right) . \end{aligned}$$

Summing over \(k =0,\ldots , t-1\) and using telescopic cancellation gives

$$\begin{aligned} 2\gamma (1-2\gamma (A+MC))\sum _{k=0}^{t-1}\mathbb {E}_{} \left[ f(x_k) - f(x_*)\right]\le & {} {\left\Vert x_0 - x_*\right\Vert }^2 - \mathbb {E}_{} \left[ \left\Vert x_{t} - x_*\right\Vert ^2\right] \\{} & {} + 2M\gamma ^2\left( \mathbb {E}_{} \left[ \sigma _0^2\right] - \mathbb {E}_{} \left[ \sigma _{t}^2\right] \right) . \end{aligned}$$

Ignoring the negative terms in the upper bound, and using Jensen’s inequality, we have

$$\begin{aligned} \mathbb {E}_{} \left[ f(\bar{x}_t) - f(x_*)\right]\le & {} \frac{{\left\Vert x_0 - x_*\right\Vert }^2 + 2M\gamma ^2\sigma _0^2}{2\gamma (1-2\gamma (A+MC))t}. \end{aligned}$$

Moreover, notice that if \(\gamma \le \frac{1}{4(A+MC)}\), then \(2(1-2\gamma (A+MC)) \ge 1\), which gives (13). \(\square \)

Appendix E.2: Optimal Minibatch Size for b-SAGA (Algorithm 1)

In this section, we present the proofs for Sect. 5.1.

Appendix E.2.1: Proof of Lemma 5.1

Proof

For the constants \(A, B, \rho , C, D_1, D_2\), see Lemma B.5. Moreover,

$$\begin{aligned} \sigma _0^2= & {} \frac{1}{nb} \frac{n-b}{n-1}\left\Vert \nabla H(x_0) - \nabla H(x_*)\right\Vert _{\textrm{Tr}}^2 = \frac{1}{nb} \frac{n-b}{n-1} \sum _{i=1}^n {\left\Vert \nabla f_i(x_0) - \nabla f_i(x_*)\right\Vert }^2\\= & {} \frac{1}{b} \frac{n-b}{n-1} \frac{1}{n}\sum _{i=1}^n {\left\Vert \nabla f_i(x_0) - \nabla f_i(x_*)\right\Vert }^2 \\\overset{(52)}{\le } & {} \frac{2}{b} \frac{n-b}{n-1}L_{\max } \left( f(x_0) - f(x_*) \right) \\\overset{(2) + (18)}{\le } & {} \zeta (b) L {\left\Vert x_0 - x_*\right\Vert }^2. \end{aligned}$$

Thus, (14) holds with \(G = \zeta (b)\). \(\square \)

Appendix E.2.2: Proof of Proposition 5.2

Proof

First, since \(\frac{{\left\Vert x_0 - x_*\right\Vert }^2}{\epsilon }\) does not depend on b, the variations of K(b) are the same as those of

$$\begin{aligned} Q(b) = \frac{4\left( 3(n-b)L_{\max } + 2n(b-1)L \right) }{b(n-1)} + \frac{n(n-b)L_{\max }L}{2b\left( 3(n-b)L_{\max } + 2n(b-1)L \right) }. \end{aligned}$$

Let’s determine the sign of \(Q^{'}(b)\). We have:

$$\begin{aligned} Q^{'}(b) = \frac{W_1 b^2 + W_2 b + W_3}{4(n-1)\left( \left( 2nL - 3L_{\max } \right) b + \left( \frac{3L_{\max }}{2} - L \right) n \right) ^2}, \end{aligned}$$

where

$$\begin{aligned} W_1= & {} 4\left( 2nL - 3L_{\max } \right) ^3, \\ W_2= & {} 8n\left( 3L_{\max } - 2L \right) \left( 2nL - 3L_{\max } \right) ^2, \\ W_3= & {} n^2\left( -108L_{\max }^3 + 72\left( n+2 \right) L_{\max }^2L - \left( n^2 + 94n + 49 \right) L^2L_{\max } + 32nL^3 \right) . \end{aligned}$$

And we have:

$$\begin{aligned} W_2^2 - 4W_1W_3 = 16n^2(n-1)^2L^2L_{\max }\left( 2nL - 3L_{\max } \right) ^3. \end{aligned}$$

Case 1 \(L_{\max } > \frac{2nL}{3}\). We have \(2nL - 3L_{\max } < 0\). Hence, \(W_2^2 - 4W_1W_3 < 0\).

Moreover, since \(W_1 < 0\), we have

$$\begin{aligned} L_{\max } > \frac{2nL}{3} \implies K'(b) < 0. \end{aligned}$$

Thus,

$$\begin{aligned} L_{\max } > \frac{2nL}{3} \implies b^* = n. \end{aligned}$$

Case 2 \(L_{\max } \le \frac{2nL}{3}\). Then, \(W_2^2 - 4W_1W_3 \ge 0\) and \(K'(b) = 0\) has at least one solution. We now examine whether or not K(b) is convex. We have:

$$\begin{aligned} Q^{''}(b) = \frac{2n^2(n-1)L_{\max }L^2\left( 2nL - 3L_{\max } \right) }{\left( \left( 2nL - 3L_{\max } \right) b + \left( 3L_{\max } - 2L \right) n \right) ^3} \ge 0. \end{aligned}$$

Thus, K(b) is convex. \(K^{'}(b) = 0\) has two solutions:

$$\begin{aligned} b_1= & {} \frac{n\left( (n-1)L\sqrt{L_{\max }} - 2\sqrt{2nL - 3L_{\max }}(3L_{\max } - 2L) \right) }{2(2nL - 3L_{\max })^{\frac{3}{2}}}, \\ b_2= & {} \frac{-n\left( (n-1)L\sqrt{L_{\max }} + 2\sqrt{2nL - 3L_{\max }}(3L_{\max } - 2L) \right) }{2(2nL - 3L_{\max })^{\frac{3}{2}}}. \end{aligned}$$

But since \(b_2 \le 0\), we have that:

$$\begin{aligned} {L_{\max } \le \frac{2nL}{3} \implies b^* = \left\{ \begin{array}{ll} 1 &{} \text{ if } b_1< 2 \\ \left\lfloor b_1 \right\rfloor &{} \text{ if } 2 \le b_1 < n \\ n &{} \text{ if } b_1 \ge n \end{array} \right. }. \end{aligned}$$

\(\square \)
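Since the case analysis above is intricate, the minimizer can also be cross-checked by brute force: K(b) differs from Q(b) only by the positive factor \(\frac{{\Vert x_0-x_*\Vert }^2}{\epsilon }\), so it suffices to scan Q(b) over \(b \in \{1,\dots ,n\}\). A sketch with illustrative values of n, L and \(L_{\max }\):

```python
import numpy as np

def Q(b, n, L, Lmax):
    """Q(b) from the proof of Proposition 5.2 (same variations as K(b))."""
    D = 3 * (n - b) * Lmax + 2 * n * (b - 1) * L
    return 4 * D / (b * (n - 1)) + n * (n - b) * Lmax * L / (2 * b * D)

n, L, Lmax = 50, 1.0, 10.0  # illustrative values satisfying L <= Lmax <= n * L
bs = np.arange(1, n + 1)
vals = np.array([Q(b, n, L, Lmax) for b in bs])
b_star = int(bs[np.argmin(vals)])  # brute-force minimizer of the total complexity
```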

Appendix E.3: Optimal Minibatch Size for b-L-SVRG (Algorithm 2)

In this section, we present a detailed analysis of the optimal minibatch size derived in Sect. 5.2.

Lemma E.1

We have that the iterates of Algorithm 2 satisfy Assumption 2 and Eq. (14) with

$$\begin{aligned} \sigma _k^2 = \mathbb {E}_{} \left[ {\left\Vert \nabla f_{B}(w_k) - \nabla f_{B}(x_*) - (\nabla f(w_k) - \nabla f(x_*))\right\Vert }^2\right] , \end{aligned}$$

and constants

$$\begin{aligned} A = 2\mathcal {L}(b), \; B = 2, \; \rho = p, \; C = p\mathcal {L}(b), \; D_1 = D_2 = 0, \; G = \mathcal {L}(b)L, \end{aligned}$$
(43)

where \(\mathcal {L}(b)\) is defined in (17).

Proof

For the constants \(A, B, \rho , C, D_1, D_2\), see Lemma B.4 and Corollary B.4.

Moreover,

$$\begin{aligned} \mathbb {E}_{} \left[ {\left\Vert \nabla f_{v_0}(x_0) - \nabla f_{v_0}(x_*) - \left( \nabla f(x_0) - \nabla f(x_*) \right) \right\Vert }^2\right]\le & {} \mathbb {E}_{} \left[ {\left\Vert \nabla f_{v_0}(x_0) - \nabla f_{v_0}(x_*)\right\Vert }^2\right] \\\overset{(25)}{\le } & {} 2\mathcal {L}(b)D_f\left( x_0, x_* \right) \\\overset{(5)}{\le } & {} \mathcal {L}(b) L{\left\Vert x_0 - x_*\right\Vert }^2. \end{aligned}$$

where we used in the first inequality that \(\mathbb {E}_{} \left[ \left\Vert X - \mathbb {E}_{} \left[ X\right] \right\Vert ^2\right] = \mathbb {E}_{} \left[ \left\Vert X\right\Vert ^2\right] - \left\Vert \mathbb {E}_{} \left[ X\right] \right\Vert ^2 \le \mathbb {E}_{} \left[ \left\Vert X\right\Vert ^2\right] \). Thus, (14) holds with \(G = \mathcal {L}(b)L\). \(\square \)

In the next corollary, we will give the iteration complexity for Algorithm 2 in the case where \(p = 1/n\), which is the usual choice for p in practice. A justification for this choice can be found in [23, 37].

Corollary E.1

(Iteration complexity of L-SVRG) Consider the iterates of Algorithm 2. Let \(p=1/n\) and \(\gamma = \frac{1}{12\mathcal {L}(b)}\). Given the constants obtained for Algorithm 2 in (43), we have, using Corollary 5.1, that if

$$\begin{aligned} k \ge \left( 12\mathcal {L}(b) + \frac{nL}{6} \right) \frac{{\left\Vert x_0 - x_*\right\Vert }^2}{\epsilon }, \end{aligned}$$

then, \(\mathbb {E}_{} \left[ f(\bar{x}_k) - f(x_*)\right] \le \epsilon \).

The usual definition of the total complexity is the expected number of gradients computed per iteration, times the iteration complexity required to reach an \(\epsilon \)-approximate solution in expectation. Since L-SVRG computes the full gradient every n iterations in expectation, we can say that L-SVRG computes roughly \(2b + 1\) gradients per iteration: after n iterations, it will have computed 2bn minibatch gradients plus n gradients for one full gradient computation. Thus, the total complexity for L-SVRG is:

$$\begin{aligned} K(b){} & {} \overset{\text {def}}{=}\left( 1+2b \right) \left( 12\mathcal {L}(b) + \frac{nL}{6} \right) \frac{{\left\Vert x_0 - x_*\right\Vert }^2}{\epsilon } \end{aligned}$$
(44)
$$\begin{aligned}= & {} \left( 1+2b \right) \left( \frac{12\left( (n-b)L_{\max } + n(b-1)L \right) }{b(n-1)}+\frac{nL}{6} \right) \frac{{\left\Vert x_0 - x_*\right\Vert }^2}{\epsilon }. \end{aligned}$$
(45)

Appendix E.3.1: Proof of Proposition 5.3

Proof

Since the factor \(\frac{{\left\Vert x_0 - x_*\right\Vert }^2}{\epsilon }\) which appears in (44) does not depend on the minibatch size, minimizing the total complexity in the minibatch size corresponds to minimizing the following quantity:

$$\begin{aligned} Q(b) = (1+2b)\left( 12\mathcal {L}(b) + \frac{nL}{6} \right) . \end{aligned}$$

We have

$$\begin{aligned} (n-1)Q(b)= & {} 12(n-1)\mathcal {L}(b) + 24(n-1)b\mathcal {L}(b) + \frac{n(n-1)L b}{3} + \frac{n(n-1)L}{6}\\= & {} \frac{12n(L_{\max } - L)}{b} + \left( 24(nL - L_{\max }) + \frac{n(n-1)L}{3} \right) b + \xi , \end{aligned}$$

where \(\xi \) is a constant independent of b. Differentiating, we have:

$$\begin{aligned} (n-1)Q'(b) = -\frac{12n(L_{\max } - L)}{b^2} + 24(nL - L_{\max }) + \frac{n(n-1)L}{3}. \end{aligned}$$

Since \(L_{\max } \ge L\) and \(nL \ge L_{\max }\) (see for example Lemma A.6 in [37]), Q(b) is a convex function of b. Thus, Q(b) is minimized when \(Q'(b) = 0\). Hence:

$$\begin{aligned} b^* = 6\sqrt{\frac{n\left( L_{\max } - L \right) }{72\left( nL-L_{\max } \right) + n(n-1)L}}. \end{aligned}$$

Since \(L_{\max }\) can take any value in the interval [LnL], we have \(b^* \in [0, 6]\). \(\square \)
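The closed form for \(b^*\) can be checked numerically: \((n-1)Q(b)\) has the form \(\alpha /b + \beta b + \xi \) and is therefore convex for \(b > 0\), so it suffices to verify that \(b^*\) beats nearby points. (The values of n, L and \(L_{\max }\) below are illustrative.)

```python
import math

n, L, Lmax = 100, 1.0, 30.0  # illustrative values satisfying L <= Lmax <= n * L

def Lcal(b):
    """Expected smoothness constant L(b) for b-nice sampling, as in (45)."""
    return ((n - b) * Lmax + n * (b - 1) * L) / (b * (n - 1))

def Q(b):
    return (1 + 2 * b) * (12 * Lcal(b) + n * L / 6)

b_star = 6 * math.sqrt(n * (Lmax - L) / (72 * (n * L - Lmax) + n * (n - 1) * L))
```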

Appendix E.4: Optimal Miniblock Size for b-SEGA (Algorithm 6)

In this section, we define for any \(j \in [d]\) the matrix \(I_j \in \mathbb {R}^{d\times d}\) such that

$$\begin{aligned} (I_j)_{pq} \overset{\text {def}}{=}\left\{ \begin{array}{ll} 1 &{} \text{ if } p=q=j \\ 0 &{} \text{ otherwise } \end{array} \right. , \end{aligned}$$

and we consequently define for any subset \(B \subseteq [d]\),

$$\begin{aligned} I_B \overset{\text {def}}{=}\sum _{j \in B} I_j. \end{aligned}$$
Algorithm 6: b-SEGA (pseudocode rendered as an image in the original; not reproduced here)

Corollary E.2

From Lemma B.8, we have that the iterates of Algorithm 6 satisfy Assumption 2 and Eq. (14) with

$$\begin{aligned} \sigma _k^2 = {\left\Vert h_k - \nabla f(x_*)\right\Vert }^2 \end{aligned}$$

and constants:

$$\begin{aligned} A = \frac{2dL}{b}, \; B = 2\left( \frac{d}{b}-1 \right) , \; \rho = \frac{b}{d}, \; C = \frac{bL}{d}, \; D_1 = D_2 = 0, \; G=0. \end{aligned}$$
(46)

Proof

For the constants \(A, B, \rho , C, D_1, D_2\), see Lemma B.8. Moreover, in Algorithm 6, \(h_0 = 0\). Thus, \(\sigma _0^2 = {\left\Vert h_0\right\Vert }^2 = 0\). Thus, (14) holds with \(G = 0\). \(\square \)

In the next corollary, we will give the iteration complexity for Algorithm 6.

Corollary E.3

(Iteration complexity of b-SEGA) Consider the iterates of Algorithm 6. Let \(\gamma = \frac{b}{4\left( 3d - b \right) L}\). Given the constants obtained for Algorithm 6 in (46), we have, using Corollary 5.1, that if

$$\begin{aligned} k \ge \frac{4(3d-b)L}{b}\frac{{\left\Vert x_0 - x_*\right\Vert }^2}{\epsilon }, \end{aligned}$$

then, \(\mathbb {E}_{} \left[ F(\bar{x}_k) - F(x_*)\right] \le \epsilon \).

Here, we define the total complexity as the number of coordinates of the gradient that we sample at each iteration times the iteration complexity. Since at each iteration, we sample b coordinates of the gradient, the total complexity for Algorithm 6 to reach an \(\epsilon \)-approximate solution is

$$\begin{aligned} K(b)&\overset{\text {def}}{=}&4\left( 3d-b \right) L\frac{{\left\Vert x_0 - x_*\right\Vert }^2}{\epsilon }. \end{aligned}$$
(47)

Thus, we immediately have the following proposition.

Proposition E.1

Let \(b^* = \underset{b \in [d]}{{{\,\mathrm{\arg \!\min }\,}}}\, K(b)\), where K(b) is defined in (47). Then,

$$\begin{aligned} b^* = d. \end{aligned}$$

The consequence of this proposition is that when using Algorithm 6, one should always use as big a miniblock as possible if the cost of a single iteration is proportional to the miniblock size.

Appendix F: Auxiliary Lemmas

Appendix F.1: Smoothness and Convexity Lemma

We now develop an immediate consequence of each \(f_i\) being convex and smooth based on the following lemma.

Lemma F.1

Let \(g: \mathbb {R}^d \mapsto \mathbb {R}\) be a convex function

$$\begin{aligned} g(z) -g(x)&\le \left\langle \nabla g(z), z-x \right\rangle , \quad \forall x,z\in \mathbb {R}^d, \end{aligned}$$
(48)

and \(L_g\)-smooth

$$\begin{aligned} g(z) -g(x)&\le \left\langle \nabla g(x), z-x \right\rangle +\frac{L_g}{2}\left\Vert z-x\right\Vert _2^2, \quad \forall x,z\in \mathbb {R}^d. \end{aligned}$$
(49)

It follows that

$$\begin{aligned} \left\Vert \nabla g(x) - \nabla g(z)\right\Vert ^2 \le 2 L_g (g(x) -g(z) - \left\langle \nabla g(z), x-z \right\rangle ), \quad \forall x, z\in \mathbb {R}^d. \end{aligned}$$
(50)

Proof

We prove (50) for \(z = x^*\); the argument for an arbitrary \(z\) is identical. Let

$$\begin{aligned} w = x - \frac{1}{L_g}(\nabla g(x) -\nabla g(x^*)). \end{aligned}$$

To prove (50), it follows that

$$\begin{aligned} g(x^*) -g(x)= & {} g(x^*) -g(w)+g(w) - g(x)\nonumber \\\overset{(48)+(49) }{\le } & {} \left\langle \nabla g(x^*), x^*-w \right\rangle + \left\langle \nabla g(x), w-x \right\rangle +\frac{L_g}{2}\left\Vert w-x\right\Vert _2^2.\nonumber \\ \end{aligned}$$
(51)

Substituting the definition of w into (51) gives

$$\begin{aligned} g(x^*) -g(x)\le & {} \left\langle \nabla g(x^*), x^*-x + \frac{1}{L_g}(\nabla g(x) -\nabla g(x^*)) \right\rangle \nonumber \\{} & {} - \frac{1}{L_g}\left\langle \nabla g(x), \nabla g(x) -\nabla g(x^*) \right\rangle +\frac{1}{2L_g}\left\Vert \nabla g(x) -\nabla g(x^*)\right\Vert _2^2 \nonumber \\= & {} \left\langle \nabla g(x^*), x^*-x \right\rangle \\{} & {} - \frac{1}{L_g}\left\Vert \nabla g(x)-\nabla g(x^*)\right\Vert _2^2 +\frac{1}{2L_g}\left\Vert \nabla g(x) -\nabla g(x^*)\right\Vert _2^2 \nonumber \\= & {} \left\langle \nabla g(x^*), x^*-x \right\rangle - \frac{1}{2L_g}\left\Vert \nabla g(x)-\nabla g(x^*)\right\Vert _2^2.\nonumber \end{aligned}$$

Rearranging the last inequality gives (50) with \(z = x^*\).

\(\square \)
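As a numerical sanity check (not part of the proof), the bound \(\left\Vert \nabla g(x) - \nabla g(z)\right\Vert ^2 \le 2L_g (g(x) -g(z) - \left\langle \nabla g(z), x-z \right\rangle )\) can be verified on a convex quadratic \(g(x) = \tfrac{1}{2}x^\top A x + b^\top x\) with \(A \succeq 0\); the instance below is an assumed example, not taken from the paper:

```python
import numpy as np

# Sanity check of the cocoercivity-type bound of Lemma F.1 on a convex
# quadratic g(x) = 0.5 x^T A x + b^T x, whose gradient A x + b is
# L_g-Lipschitz with L_g = lambda_max(A).
rng = np.random.default_rng(0)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T                        # symmetric positive semidefinite => g convex
b = rng.standard_normal(d)
L_g = np.linalg.eigvalsh(A).max()  # smoothness constant of g

g = lambda x: 0.5 * x @ A @ x + b @ x
grad = lambda x: A @ x + b

violations = 0
for _ in range(1000):
    x, z = rng.standard_normal(d), rng.standard_normal(d)
    lhs = np.linalg.norm(grad(x) - grad(z)) ** 2
    rhs = 2 * L_g * (g(x) - g(z) - grad(z) @ (x - z))
    if lhs > rhs + 1e-9:           # small tolerance for floating point
        violations += 1
print(violations)  # 0
```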

Lemma F.2

Suppose that for all \(i \in [n]\), \(f_i\) is convex and \(L_i\)-smooth, and let \(L_{\max } = \max _{i \in [n]} L_i\). Then

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n {\left\Vert \nabla f_i(x) - \nabla f_i(x_*)\right\Vert }^2 \le 2L_{\max }\left( f(x) - f(x_*) \right) . \end{aligned}$$
(52)

Proof

From (50), we have for all \(i \in [n]\),

$$\begin{aligned} {\left\Vert \nabla f_i(x) - \nabla f_i(x_*)\right\Vert }^2\le & {} 2L_i \left( f_i (x) - f_i (x_*) - \left\langle \nabla f_i(x_*), x - x_* \right\rangle \right) . \end{aligned}$$

Thus, averaging over \(i\) and using that \(\nabla f(x_*) = 0\),

$$\begin{aligned} \frac{1}{n}\sum _{i=1}^n{\left\Vert \nabla f_i(x) - \nabla f_i(x_*)\right\Vert }^2\le & {} 2L_{\max } \left( f(x) - f(x_*) - \left\langle \nabla f(x_*), x - x_* \right\rangle \right) \\= & {} 2L_{\max } \left( f(x) - f(x_*) \right) . \end{aligned}$$

\(\square \)
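Inequality (52) can likewise be checked numerically on an assumed finite-sum instance: convex quadratics \(f_i(x) = \tfrac{1}{2}x^\top A_i x + b_i^\top x\), with \(x_*\) the unconstrained minimizer of \(f\) so that \(\nabla f(x_*) = 0\):

```python
import numpy as np

# Numerical illustration of Lemma F.2 on a finite sum f = (1/n) sum_i f_i
# of convex quadratics f_i(x) = 0.5 x^T A_i x + b_i^T x, L_i = lambda_max(A_i).
rng = np.random.default_rng(1)
n, d = 10, 4
As, bs = [], []
for _ in range(n):
    M = rng.standard_normal((d, d))
    As.append(M @ M.T + np.eye(d))   # positive definite => f_i convex
    bs.append(rng.standard_normal(d))
L_max = max(np.linalg.eigvalsh(A).max() for A in As)

A_bar, b_bar = sum(As) / n, sum(bs) / n
x_star = np.linalg.solve(A_bar, -b_bar)  # grad f(x_*) = A_bar x_* + b_bar = 0

f = lambda x: sum(0.5 * x @ A @ x + b @ x for A, b in zip(As, bs)) / n
grad_i = lambda i, x: As[i] @ x + bs[i]

violations = 0
for _ in range(500):
    x = rng.standard_normal(d)
    lhs = np.mean([np.linalg.norm(grad_i(i, x) - grad_i(i, x_star)) ** 2
                   for i in range(n)])
    rhs = 2 * L_max * (f(x) - f(x_star))
    if lhs > rhs + 1e-8:
        violations += 1
print(violations)  # 0
```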

1.2 Appendix F.2: Proximal Lemma

Lemma F.3

Let \(g: \mathbb {R}^d \mapsto \mathbb {R}\) be a convex, lower semicontinuous function, let \(z,y \in \mathbb {R}^d\), and let \(\gamma >0\). With \(p = \textrm{prox}_{\gamma g} (y)\), we have

$$\begin{aligned} g(p) - g(z) \le -\frac{1}{\gamma } \left\langle p - y, p -z \right\rangle . \end{aligned}$$
(53)

Proof

This is a classical result; see, for example, the “Second Prox Theorem” in Section 6.5 of [6]. \(\square \)
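Inequality (53) can be checked numerically for \(g = \Vert \cdot \Vert _1\), whose proximal operator is the well-known soft-thresholding map \(\textrm{prox}_{\gamma g}(y) = \textrm{sign}(y)\max (|y|-\gamma , 0)\); the instance below is an assumed example:

```python
import numpy as np

# Numerical check of the prox inequality (53) for g = ||.||_1, whose
# proximal operator is soft-thresholding: sign(y) * max(|y| - gamma, 0).
rng = np.random.default_rng(2)
gamma = 0.3
g = lambda x: np.abs(x).sum()
prox = lambda y: np.sign(y) * np.maximum(np.abs(y) - gamma, 0.0)

violations = 0
for _ in range(1000):
    y, z = rng.standard_normal(5), rng.standard_normal(5)
    p = prox(y)
    lhs = g(p) - g(z)
    rhs = -(1.0 / gamma) * (p - y) @ (p - z)
    if lhs > rhs + 1e-9:   # tolerance for floating point
        violations += 1
print(violations)  # 0
```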

1.3 Appendix F.3: Proof of Lemma C.2

This proof follows the proof of Lemma 8 in [5], and we reproduce it for completeness. Indeed, using the convexity of f

$$\begin{aligned} f(x) -f(x_*) \le - \langle \nabla f(x), x_*- x \rangle \end{aligned}$$

in combination with (53) where \(z = x_*\) gives

$$\begin{aligned} f(x)+g(p)- F(x_*)&\le -\frac{1}{\gamma } \left\langle p - y, p -x_* \right\rangle - \left\langle \nabla f(x), x_*-x \right\rangle . \end{aligned}$$

Now using smoothness together with the step size condition \(\gamma \le 1/L\),

$$\begin{aligned} f(p) -f(x) \le \left\langle \nabla f(x),p -x \right\rangle + \frac{1}{2\gamma } \left\Vert p-x\right\Vert ^2, \end{aligned}$$

gives

$$\begin{aligned} F(p)- F(x_*)&\le -\frac{1}{\gamma } \left\langle p - y, p -x_* \right\rangle - \left\langle \nabla f(x), x_*-x \right\rangle +\left\langle \nabla f(x),p -x \right\rangle \nonumber \\&\quad + \frac{1}{2\gamma } \left\Vert p-x\right\Vert ^2 \nonumber \\&= -\frac{1}{\gamma } \left\langle p - y, p -x_* \right\rangle +\left\langle \nabla f(x),p -x_* \right\rangle + \frac{1}{2\gamma } \left\Vert p-x\right\Vert ^2 \nonumber \\&= -\frac{1}{\gamma } \left\langle p -\gamma \nabla f(x)- y, p -x_* \right\rangle + \frac{1}{2\gamma } \left\Vert p-x\right\Vert ^2 \nonumber \\&= -\frac{1}{\gamma } \left\langle p-x+x -\gamma \nabla f(x)- y, p -x_* \right\rangle + \frac{1}{2\gamma } \left\Vert p-x\right\Vert ^2 \nonumber \\&= -\frac{1}{\gamma } \left\langle p-x, p -x_* \right\rangle -\frac{1}{\gamma } \left\langle x -\gamma \nabla f(x)- y, p -x_* \right\rangle + \frac{1}{2\gamma } \left\Vert p-x\right\Vert ^2 . \end{aligned}$$
(54)

Using that

$$\begin{aligned} -2 \left\langle p-x, p -x_* \right\rangle + \left\Vert p-x\right\Vert ^2 = -\left\Vert p-x_*\right\Vert ^2 + \left\Vert x_*-x\right\Vert ^2, \end{aligned}$$

in combination with (54) gives

$$\begin{aligned} F(p) - F(x_*) \le -\frac{1}{2\gamma }{\left\Vert p - x_*\right\Vert }^2 - \frac{1}{\gamma } \left\langle x - \gamma \nabla f(x) - y, p-x_* \right\rangle + \frac{1}{2\gamma }{\left\Vert x_*- x\right\Vert }^2. \end{aligned}$$

Now it remains to multiply both sides by \(-2\gamma \), which reverses the inequality, to arrive at (33).
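As a sanity check of the inequality just derived, consider the proximal gradient special case \(y = x - \gamma \nabla f(x)\): the inner-product term vanishes and the bound reduces to \(F(p) - F(x_*) \le \tfrac{1}{2\gamma }(\left\Vert x-x_*\right\Vert ^2 - \left\Vert p-x_*\right\Vert ^2)\). The lasso instance below, with \(f(x) = \tfrac{1}{2}\Vert Bx-c\Vert ^2\) and \(g = \lambda \Vert \cdot \Vert _1\), is an assumed example:

```python
import numpy as np

# Check F(p) - F(x_*) <= (||x - x_*||^2 - ||p - x_*||^2) / (2 gamma)
# for p = prox_{gamma g}(x - gamma grad f(x)) on a lasso instance.
rng = np.random.default_rng(3)
m, d, lam = 20, 5, 0.1
B = rng.standard_normal((m, d))
c = rng.standard_normal(m)
L = np.linalg.eigvalsh(B.T @ B).max()   # smoothness constant of f
gamma = 1.0 / L                          # step size satisfying gamma <= 1/L

f = lambda x: 0.5 * np.linalg.norm(B @ x - c) ** 2
grad = lambda x: B.T @ (B @ x - c)
g = lambda x: lam * np.abs(x).sum()
F = lambda x: f(x) + g(x)
prox = lambda y: np.sign(y) * np.maximum(np.abs(y) - gamma * lam, 0.0)

# Approximate the minimizer x_* of F by running proximal gradient (ISTA).
x_star = np.zeros(d)
for _ in range(20000):
    x_star = prox(x_star - gamma * grad(x_star))

violations = 0
for _ in range(200):
    x = rng.standard_normal(d)
    p = prox(x - gamma * grad(x))      # one proximal gradient step from x
    lhs = F(p) - F(x_star)
    rhs = (np.linalg.norm(x - x_star) ** 2
           - np.linalg.norm(p - x_star) ** 2) / (2 * gamma)
    if lhs > rhs + 1e-6:
        violations += 1
print(violations)  # 0
```

Note that the derivation above never used optimality of \(x_*\), only convexity and (53), so the displayed bound in fact holds with \(x_*\) replaced by any reference point.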


Cite this article

Khaled, A., Sebbouh, O., Loizou, N. et al. Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization. J Optim Theory Appl 199, 499–540 (2023). https://doi.org/10.1007/s10957-023-02297-y
