
SAAGs: Biased stochastic variance reduction methods for large-scale learning

Published in: Applied Intelligence

Abstract

Stochastic approximation is one of the effective approaches for dealing with large-scale machine learning problems, and recent research has focused on reducing the variance caused by the noisy approximations of the gradients. In this paper, we propose novel variants of SAAG-I and II (Stochastic Average Adjusted Gradient) (Chauhan et al. 2017), called SAAG-III and IV, respectively. Unlike SAAG-I, in SAAG-III the starting point is set to the average of the previous epoch; unlike SAAG-II, in SAAG-IV the snap point and the starting point are set to the average and the last iterate of the previous epoch, respectively. To determine the step size, we use the Stochastic Backtracking-Armijo line Search (SBAS), which performs the line search only on a selected mini-batch of data points. Since backtracking line search is not suitable for large-scale problems, and the constants used to find the step size, such as the Lipschitz constant, are not always available, SBAS can be very effective in such cases. We extend SAAGs (I, II, III and IV) to solve non-smooth problems and design two update rules, for the smooth and the non-smooth case. Moreover, our theoretical results prove linear convergence of SAAG-IV, in expectation, for all four combinations of smoothness and strong convexity. Finally, our experimental studies demonstrate the efficacy of the proposed methods against state-of-the-art techniques.
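
The following Python sketch illustrates the two ingredients above: the epoch-point rules that distinguish SAAG-III and IV, and SBAS. It is an illustrative reading of the rules as stated in this abstract, not the authors' reference implementation (which is linked in the Notes); the function names and the line-search constants eta0, rho and c1 are assumptions.

```python
import numpy as np

def next_epoch_points(variant, prev_iterates, prev_snap):
    """Return (w_start, w_snap) for the next epoch, given the iterates
    w_1, ..., w_m of the previous epoch.  An illustrative reading of the
    rules stated in the abstract, not the authors' reference code."""
    avg = np.mean(prev_iterates, axis=0)   # average of the previous epoch
    last = prev_iterates[-1]               # last iterate of the previous epoch
    if variant == "SAAG-III":
        # SAAG-III: starting point <- average of the previous epoch.
        # (The abstract does not restate SAAG-III's snap-point rule,
        # so the snap point is simply carried over here.)
        return avg, prev_snap
    if variant == "SAAG-IV":
        # SAAG-IV: snap point <- average, starting point <- last iterate.
        return last, avg
    raise ValueError(variant)

def sbas_step_size(f_batch, g_batch, w, d, eta0=1.0, rho=0.5, c1=1e-4):
    """Stochastic Backtracking-Armijo line Search (SBAS): backtrack until
    the Armijo sufficient-decrease condition holds, evaluating the
    objective only on the current mini-batch (f_batch).  The constants
    eta0, rho and c1 are illustrative assumptions."""
    eta, fw, slope = eta0, f_batch(w), g_batch.dot(d)
    while f_batch(w + eta * d) > fw + c1 * eta * slope and eta > 1e-10:
        eta *= rho  # shrink the step size until sufficient decrease holds
    return eta
```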


Notes

  1. The experimental results can be reproduced using the code available at: https://sites.google.com/site/jmdvinodjmd/code

  2. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

  3. For the simplification of the proofs, we take f(w) = F(w), i.e., \(f_{i}(w) := f_{i}(w) + g(w)\ \forall i\), and then g(w) ≡ 0

References

  1. Allen-Zhu Z (2017) Katyusha: The first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research (to appear). Full version available at arXiv:1603.05953

  2. Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Lechevallier Y, Saporta G (eds) Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010). http://leon.bottou.org/papers/bottou-2010. Springer, Paris, pp 177–187

  3. Bottou L, Bousquet O (2007) The tradeoffs of large scale learning. In: Proceedings of the 20th International Conference on Neural Information Processing Systems, Curran Associates Inc., USA, NIPS’07, pp 161–168, http://dl.acm.org/citation.cfm?id=2981562.2981583

  4. Bottou L, Curtis FE, Nocedal J (2016) Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838

  5. Cauchy AL (1847) Méthode générale pour la résolution des systèmes d'équations simultanées. Comptes Rendus des Séances de l'Académie des Sciences XXV, Série A(25):536–538

  6. Chauhan VK, Dahiya K, Sharma A (2017) Mini-batch block-coordinate based stochastic average adjusted gradient methods to solve big data problems. In: Proceedings of the Ninth Asian Conference on Machine Learning, PMLR, vol 77, pp 49–64, http://proceedings.mlr.press/v77/chauhan17a.html

  7. Chauhan VK, Dahiya K, Sharma A (2018a) Problem formulations and solvers in linear SVM: a review. Artificial Intelligence Review. https://doi.org/10.1007/s10462-018-9614-6

  8. Chauhan V K, Sharma A, Dahiya K (2018b) Faster learning by reduction of data access time. Appl Intell 48(12):4715–4729. https://doi.org/10.1007/s10489-018-1235-x


  9. Chauhan VK, Sharma A, Dahiya K (2018c) Stochastic Trust Region Inexact Newton Method for Large-scale Machine Learning. arXiv:1812.10426

  10. Csiba D, Richtárik P (2016) Importance sampling for minibatches. pp 1–19, arXiv:1602.02283v1

  11. Defazio A, Bach F, Lacoste-Julien S (2014) Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS’14, MIT Press, Cambridge, pp 1646–1654

  12. Shang F, Zhou K, Cheng J, Tsang IW, Zhang L, Tao D (2018) VR-SGD: a simple stochastic variance reduction method for machine learning. arXiv:1802.09932

  13. Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in neural information processing systems, vol 26, Curran Associates, Inc., pp 315–323

  14. Kiefer J, Wolfowitz J (1952) Stochastic estimation of the maximum of a regression function. Ann Math Stat 23:462–466


  15. Konečný J, Richtárik P (2013) Semi-stochastic gradient descent methods. pp 1–19, arXiv:1312.1666

  16. Konečný J, Liu J, Richtárik P, Takáč M (2016) Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J Sel Top Signal Process 10(2):242–255


  17. Lan G (2012) An optimal method for stochastic composite optimization. Math Program 133(1):365–397


  18. Le Roux N, Schmidt M, Bach F (2012) A Stochastic Gradient Method with an Exponential Convergence Rate for Strongly-Convex Optimization with Finite Training Sets. Technical Report, INRIA

  19. Lin H, Mairal J, Harchaoui Z (2015) A universal catalyst for first-order optimization. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pp 3384–3392

  20. Parikh N, Boyd S (2014) Proximal algorithms. Found Trends Optim 1(3):127–239


  21. Rakhlin A, Shamir O, Sridharan K (2012) Making gradient descent optimal for strongly convex stochastic optimization. In: Proceedings of the 29th International Conference on International Conference on Machine Learning, ICML’12, pp 1571–1578

  22. Robbins H, Monro S (1951) A Stochastic Approximation Method. Ann Math Statist 22(3):400–407. Retrieved from http://www.jstor.org/stable/2236626

  23. Shalev-Shwartz S, Zhang T (2013) Stochastic dual coordinate ascent methods for regularized loss. J Mach Learn Res 14(1):567–599


  24. Shalev-Shwartz S, Singer Y, Srebro N (2007) Pegasos: primal estimated sub-gradient solver for SVM. In: Proceedings of the 24th International Conference on Machine Learning, ICML '07. ACM, New York, pp 807–814

  25. Wang H, Banerjee A (2014) Randomized block coordinate descent for online and stochastic optimization, pp 1–19. arXiv:1407.0107

  26. Wright S J (2015) Coordinate descent algorithms. Math Program 151(1):3–34


  27. Xiao L, Zhang T (2014) A proximal stochastic gradient method with progressive variance reduction. SIAM J Optim 24(4):2057–2075


  28. Xu Y, Yin W (2015) Block stochastic gradient iteration for convex and nonconvex optimization. SIAM J Optim 25(3):1686–1716


  29. Yang Y, Chen J, Zhu J (2016) Distributing the stochastic gradient sampler for large-scale LDA. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16. ACM, New York, pp 1975–1984

  30. Yang Z, Wang C, Zhang Z, Li J (2018) Random Barzilai–Borwein step size for mini-batch algorithms. Eng Appl Artif Intell 72:124–135


  31. Shen Z, Qian H, Mu T, Zhang C (2017) Accelerated doubly stochastic gradient algorithm for large-scale empirical risk minimization. In: IJCAI

  32. Zhang T (2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-first International Conference on Machine Learning, ICML’04. ACM, New York, pp 116–

  33. Zhang Y, Xiao L (2015) Stochastic primal-dual coordinate method for regularized empirical risk minimization. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15, pp 353–361

  34. Zhao T, Yu M, Wang Y, Arora R, Liu H (2014) Accelerated Mini-batch Randomized Block Coordinate Descent Method. Advances in Neural Information Processing Systems, pp 3329–3337

  35. Zhou L, Pan S, Wang J, Vasilakos AV (2017) Machine learning on big data: Opportunities and challenges. Neurocomputing 237:350–361. https://doi.org/10.1016/j.neucom.2017.01.026



Acknowledgements

The first author is thankful to the Ministry of Human Resource Development, Government of India, for providing a fellowship (University Grants Commission - Senior Research Fellowship) to pursue his PhD.


Corresponding author

Correspondence to Vinod Kumar Chauhan.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: More Experiments

A.1 Results with Support Vector Machine (SVM)

This subsection compares SAAGs against SVRG and VR-SGD on the SVM problem with the mushroom and gisette datasets. All methods use the stochastic backtracking line search to find the step size. Figure 8 presents the results, comparing suboptimality against training time (in seconds). The results are similar to the experiments with logistic regression, but less smooth. SAAGs outperform the other methods on the mushroom dataset (first row) and the gisette dataset (second row) for suboptimality against training time and accuracy against time, although all methods give almost similar results on accuracy versus training time for the mushroom dataset. SAAG-IV outperforms the other methods, and SAAG-III sometimes lags behind VR-SGD. It is also observed that the results with logistic regression are better than those with the SVM problem. The optimization problem for SVM is given below:

$$ \underset{w}{\min} F(w) = \frac{1}{n} \sum\limits_{i = 1}^{n} \max\left( 0, 1 - y_{i} w^{T} x_{i} \right)^{2} + \frac{\lambda}{2} \|w\|^{2}, $$
(13)

where λ is the regularization coefficient (also called the penalty parameter), which balances the trade-off between the margin size and the error [7].
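
For concreteness, a direct NumPy transcription of objective (13); the random data X, y and the value of λ below are illustrative only.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Objective (13): squared hinge loss plus L2 regularization,
    F(w) = (1/n) sum_i max(0, 1 - y_i w^T x_i)^2 + (lam/2) ||w||^2."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))   # element-wise hinge terms
    return np.mean(hinge ** 2) + 0.5 * lam * w.dot(w)

# Illustrative usage on random data: at w = 0 every margin term equals 1,
# so F(0) = 1 regardless of the data.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sign(rng.standard_normal(100))
print(svm_objective(np.zeros(5), X, y, lam=1e-4))  # 1.0
```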

Fig. 8 Results with SVM using mini-batches of 1000 data points on the mushroom (first row) and gisette (second row) datasets

A.2 Comparison of SAAGs (I, II, III and IV) for non-smooth problem

The comparison of SAAGs for the non-smooth problem is depicted in Fig. 9, using the Adult dataset with mini-batches of 32 data points. As is clear from the figure, just as for the smooth problem, the results with SAAG-III and IV are stable and better than, or equal to, those of SAAG-I and II.

Fig. 9 Comparison of SAAG-I, II, III and IV on a non-smooth problem (elastic-net-regularized logistic regression) using the Adult dataset with a mini-batch size of 32 data points. The first row compares accuracy against epochs, gradients/n and time, and the second row compares suboptimality against epochs, gradients/n and time

A.3 Effect of mini-batch size on SAAG-III, IV, SVRG and VR-SGD for non-smooth problem

The effect of mini-batch size on SAAG-III, IV, SVRG and VR-SGD for the non-smooth problem is depicted in Fig. 10, using the rcv1 binary dataset with mini-batches of 32, 64 and 128 data points. As for the smooth problem, the proposed methods outperform SVRG and VR-SGD. SAAG-IV gives the best results in terms of time and epochs, but in terms of gradients/n, SAAG-III gives the best results.

Fig. 10 Study of the effect of mini-batch size on SAAG-III, IV, SVRG and VR-SGD for the non-smooth problem, using the rcv1 dataset with mini-batch sizes of 32, 64 and 128. The first row compares accuracy against epochs, gradients/n and time, and the second row compares suboptimality against epochs, gradients/n and time

A.4 Effect of mini-batch size on SAAGs (I, II, III, IV) for smooth problem

The effect of mini-batch size on SAAGs (I, II, III, IV) for the smooth problem is depicted in Fig. 11, using the Adult dataset with mini-batch sizes of 32, 64 and 128 data points. The results are similar to those for the non-smooth problem.

Fig. 11 Study of the effect of mini-batch size on SAAGs (I, II, III, IV) for the smooth problem, using the Adult dataset with mini-batch sizes of 32, 64 and 128. The first row compares accuracy against epochs, gradients/n and time, and the second row compares suboptimality against epochs, gradients/n and time

A.5 Effect of regularization coefficient for non-smooth problem

Figure 12 depicts the effect of the regularization coefficient on SAAG-III, IV, SVRG and VR-SGD for the non-smooth problem using the rcv1 dataset, with regularization coefficient values of \(10^{-3}\), \(10^{-5}\) and \(10^{-7}\). The results are similar to those for the smooth problem. As is clear from the figure, for the larger value, \(10^{-3}\), none of the methods performs well, but once the coefficient is sufficiently small it does not make much difference, and in all cases our proposed methods outperform SVRG and VR-SGD.

Fig. 12 Study of the effect of the regularization coefficient on SAAG-III, IV, SVRG and VR-SGD for the non-smooth problem using the rcv1 dataset, with regularization coefficient values of \(10^{-3}\), \(10^{-5}\) and \(10^{-7}\). The first row compares accuracy against epochs, gradients/n and time, and the second row compares suboptimality against epochs, gradients/n and time

Appendix B: Proofs

The following assumptions are considered in the paper:

Assumption 1 (Smoothness)

Suppose the function \(f_{i}: \mathbb{R}^{n} \rightarrow \mathbb{R}\) is convex and differentiable, and that the gradient \(\nabla f_{i}\), \(\forall i\), is L-Lipschitz-continuous, where L > 0 is the Lipschitz constant; then we have,

$$ \| \nabla f_{i}(y) - \nabla f_{i}(x)\| \le L \|y-x\|, $$
(14)
$$ \begin{array}{ll} \text{and},\quad f_{i}(y) \le f_{i}(x) +\nabla f_{i}(x)^{T}(y-x)+ \frac{L}{2} \|y-x\|^{2}. \end{array} $$
(15)

Assumption 2 (Strong Convexity)

Suppose the function \(F: \mathbb{R}^{n} \rightarrow \mathbb{R}\) is a μ-strongly convex function for μ > 0 and \(F^{*}\) is the optimal value of F; then we have,

$$ F(y) \ge F(x) +\nabla F(x)^{T}(y-x) + \frac{\mu}{2} \|y-x\|^{2}, $$
(16)
$$ \begin{array}{ll} \text{and},\quad & F(x) - F^{*} \le \frac{1}{2\mu}\|\nabla F(x)\|^{2} \end{array} $$
(17)
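
For completeness, (17) follows from (16) by minimizing both sides of (16) over y: the left-hand side attains \(F^{*}\), while the right-hand side is minimized at \(y = x - \frac{1}{\mu}\nabla F(x)\), giving

$$ F^{*} \ge F(x) - \frac{1}{2\mu}\|\nabla F(x)\|^{2}, $$

which rearranges to (17).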

Assumption 3 (Assumption 3 in [12])

For all s = 1, 2,...,S, the following inequality holds

$$ \mathbb{E}\left[F({w^{s}_{0}}) - F(w^{*}) \right] \le c\mathbb{E}\left[F(\tilde{w}^{s-1}) - F(w^{*}) \right] $$
(18)

where \(0 < c \le m\) is a constant.

We derive our proofs by taking motivation from [12] and [27]. Before providing the proofs, we state certain lemmas, as given below:

Lemma 1 (3-Point Property [17])

Let \(\hat{z}\) be the optimal solution of the following problem: \(\underset{z\in \mathbb{R}^{d}}{\min}\quad \frac{\tau}{2} \|z-z_{0}\|^{2} + r(z)\), where τ ≥ 0 and r(z) is a convex function (but possibly non-differentiable). Then for any \(z\in \mathbb{R}^{d}\), the following inequality holds,

$$ \frac{\tau}{2} \|\hat{z}-z_{0}\|^{2} + r(\hat{z}) \le r(z) + \frac{\tau}{2} \left( \|z-z_{0}\|^{2} - \|z - \hat{z}\|^{2} \right) $$
(19)

Lemma 2 (Theorem 4 in [16])

For non-smooth problems, taking \(\tilde{\nabla}^{\prime}_{s,k} = \frac{1}{b} {\sum}_{i \in B_{k}} \nabla f_{i} ({w^{s}_{k}}) - \frac{1}{b} {\sum}_{i \in B_{k}} \nabla f_{i} (\tilde{w}^{s-1}) + \frac{1}{n} {\sum}_{i = 1}^{n} \nabla f_{i}(\tilde{w}^{s-1})\), we have \(\mathbb{E} \left[\tilde{\nabla}^{\prime}_{s,k}\right] = \nabla f({w^{s}_{k}})\), and the variance satisfies the following inequality,

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[\|\tilde{\nabla}^{\prime}_{s,k} - \nabla f({w^{s}_{k}})\|^{2}\right] &\le& 4L\alpha(b) \left[ F({w^{s}_{k}}) - F(w^{*})\right. \\ &&\left.+ F(\tilde{w}^{s-1}) - F(w^{*}) \right], \end{array} $$
(20)

where α(b) = (n − b)/(b(n − 1)); note that α(n) = 0 (full batch) and α(1) = 1.

Following Lemma 2 for non-smooth problems, one can easily prove the following result for smooth problems:

Lemma 3

For smooth problems, taking \(\tilde{\nabla}^{\prime}_{s,k} = \frac{1}{b} {\sum}_{i \in B_{k}} \nabla f_{i} ({w^{s}_{k}}) - \frac{1}{b} {\sum}_{i \in B_{k}} \nabla f_{i} (\tilde{w}^{s-1}) + \frac{1}{n} {\sum}_{i = 1}^{n} \nabla f_{i}(\tilde{w}^{s-1})\), we have \(\mathbb{E} \left[\tilde{\nabla}^{\prime}_{s,k}\right] = \nabla f({w^{s}_{k}})\), and the variance satisfies the following inequality,

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[\|\tilde{\nabla}^{\prime}_{s,k} - \nabla f({w^{s}_{k}})\|^{2}\right] &\le& 4L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} \right.\\ &&\left.+ f(\tilde{w}^{s-1}) - f^{*}\right], \end{array} $$
(21)

where α(b) = (n − b)/(b(n − 1)).
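
As a numerical sanity check of the unbiasedness claim in Lemmas 2 and 3, the following sketch averages the estimator over all mini-batches of size b for a toy finite sum; the quadratic components \(f_{i}(w) = \frac{1}{2}(a_{i}^{T}w - c_{i})^{2}\) and all data are illustrative assumptions, not taken from the paper.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, b, d = 6, 2, 3
A, c = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_i(w, i):                       # grad f_i(w) = a_i (a_i.w - c_i)
    return A[i] * (A[i] @ w - c[i])

def grad_f(w):                          # grad f(w) = (1/n) sum_i grad f_i(w)
    return np.mean([grad_i(w, i) for i in range(n)], axis=0)

w, w_snap = rng.standard_normal(d), rng.standard_normal(d)
mu = grad_f(w_snap)                     # full gradient at the snap point

# Estimator of Lemmas 2/3 for every size-b mini-batch, sampled uniformly.
ests = [np.mean([grad_i(w, i) for i in B], axis=0)
        - np.mean([grad_i(w_snap, i) for i in B], axis=0) + mu
        for B in itertools.combinations(range(n), b)]
print(np.allclose(np.mean(ests, axis=0), grad_f(w)))   # True: unbiased
```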

Lemma 4 (Extension of Lemma 3.4 in [27] to mini-batches)

Under Assumption 1 for smooth regularizer, we have

$$ \mathbb{E} \left[\|\nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B_{k}} f(w^{*})\|^{2}\right] \le 2L \left[ f({w^{s}_{k}}) - f(w^{*}) \right] $$
(22)

Proof

Given any k = 0, 1,..., (m − 1), consider the function,

$$ \phi_{B_{k}} (w) = f_{B_{k}} (w) - f_{B_{k}}(w^{*}) - \nabla_{B_{k}} f(w^{*})^{T} (w-w^{*}) $$

It is straightforward to check that \(\nabla \phi_{B_{k}} (w^{*}) = 0\), hence \(\min_{w} \phi_{B_{k}} (w) = \phi_{B_{k}} (w^{*}) = 0\). Since the gradient of \(\phi_{B_{k}} (w)\) is L-Lipschitz-continuous, we have,

$$ \begin{array}{ll} \frac{1}{2L}\|\nabla\phi_{B_{k}} (w)\|^{2} \le \phi_{B_{k}} (w) - \min_{w} \phi_{B_{k}} (w) = \phi_{B_{k}} (w) - \phi_{B_{k}} (w^{*}) = \phi_{B_{k}} (w)\\ \implies \| \nabla f_{B_{k}} (w) - \nabla f_{B_{k}}(w^{*})\|^{2} \le 2L \left[ f_{B_{k}} (w) - f_{B_{k}}(w^{*}) - \nabla_{B_{k}} f(w^{*})^{T} (w - w^{*}) \right] \end{array} $$

Taking expectation, we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E}[\| \nabla f_{B_{k}} (w) - \nabla f_{B_{k}}(w^{*})\|^{2} ] &\le& 2L \left[ f (w) - f(w^{*})\right.\\ && \left.- \nabla f(w^{*})^{T} (w-w^{*}) \right]\\ \end{array} $$
(23)

By optimality, \(\nabla f(w^{*}) = 0\), so we have

$$ \begin{array}{ll} \mathbb{E}[\| \nabla f_{B_{k}} (w) - \nabla f_{B_{k}}(w^{*})\|^{2} ] \le 2L \left[ f (w) - f(w^{*}) \right] \end{array} $$

This proves the required lemma. □

Lemma 5 (Extension of Lemma 3.4 in [27] to mini-batches)

Under Assumption 1 for non-smooth regularizer, we have

$$ \mathbb{E} \left[\|\nabla_{B_{k}}f({w^{s}_{k}}) - \nabla_{B_{k}} f(w^{*})\|^{2}\right] \le 2L \left[ F({w^{s}_{k}}) - F(w^{*}) \right] $$
(24)

Proof

From inequality (23), we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E}[\| \nabla f_{B_{k}} (w) - \nabla f_{B_{k}}(w^{*})\|^{2} ] &\le& 2L \left[ f (w) - f(w^{*}) \right.\\ &&\left.- \nabla f(w^{*})^{T} (w-w^{*}) \right]\\ \end{array} $$
(25)

By optimality, there exists \(\xi \in \partial g(w^{*})\), such that \(\nabla f(w^{*}) + \xi = 0\); thus we have

$$ \begin{array}{ll} \mathbb{E}[\| \nabla f_{B_{k}} (w) - \nabla f_{B_{k}}(w^{*})\|^{2} ] &\le 2L \left[ f (w) - f(w^{*}) + \xi^{T} (w-w^{*}) \right]\\ &\le 2L \left[ f (w) - f(w^{*}) + g(w) - g (w^{*}) \right]\\ &\le 2L \left[ F(w) - F(w^{*})\right] \end{array} $$
(26)

second inequality follows from the convexity of g. This proves the required lemma. □

Lemma 6 (Variance Bound for smooth problem)

Under Assumption 1 and taking \(\nabla_{B_{k}} f({w^{s}_{k}}) = \frac{1}{b} {\sum}_{i \in B_{k}} \nabla f_{i} ({w^{s}_{k}})\), \(\nabla_{B^{\prime}_{k}} f(\tilde{w}^{s-1}) = \frac{1}{n} {\sum}_{i \in B_{k}} \nabla f_{i} (\tilde{w}^{s-1})\), \(\tilde{\mu}^{s} = \frac{1}{n} {\sum}_{i = 1}^{n} \nabla f_{i}(\tilde{w}^{s-1})\) and the gradient estimator \(\tilde{\nabla}_{s,k} = \nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B^{\prime}_{k}} f(\tilde{w}^{s-1}) +\tilde{\mu}^{s}\), the variance satisfies the following inequality (see Note 3),

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[\| \tilde{\nabla}_{s,k} -\nabla f({w^{s}_{k}})\|^{2} \right] &\le& 8L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} \right]\\&&+ \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}}\\ &&\times\left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime}\\ \end{array} $$
(27)

where α(b) = (n − b)/(b(n − 1)) and \(R^{\prime}\) is a constant.

Proof

First the expectation of estimator is given by

$$ \begin{array}{ll} \mathbb{E}\left[\tilde{\nabla}_{s,k} \right] &= \mathbb{E}\left[\nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B^{\prime}_{k}} f(\tilde{w}^{s-1}) +\tilde{\mu}^{s} \right]\\ &= \nabla f({w^{s}_{k}}) - \frac{b}{n} \nabla f(\tilde{w}^{s-1}) + \nabla f(\tilde{w}^{s-1})\\ & = \nabla f({w^{s}_{k}}) + \frac{m-1}{m} \nabla f(\tilde{w}^{s-1}), \end{array} $$
(28)

second equality follows as n = mb. Now the variance bound is calculated as follows,

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E} \left[ \| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}}) \|^{2}\right]\\ &=& \mathbb{E}\left[\| \nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B^{\prime}_{k}} f(\tilde{w}^{s-1}) +\nabla f (\tilde{w}^{s-1}) - \nabla f ({w^{s}_{k}})\|^{2}\right]\\ &=& \mathbb{E}\left[\| \nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B_{k}} f(\tilde{w}^{s-1}) +\nabla f (\tilde{w}^{s-1}) - \nabla f ({w^{s}_{k}}) + \frac{m-1}{m}\nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2}\right]\\ &\le& 2 \mathbb{E} \left[ \| \nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B_{k}} f(\tilde{w}^{s-1}) +\nabla f (\tilde{w}^{s-1}) - \nabla f ({w^{s}_{k}})\|^{2}\right] + \frac{2(m-1)^{2}}{m^{2}} \mathbb{E} \left[ \| \nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2} \right]\\ &\le& 8L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} + f(\tilde{w}^{s-1}) - f^{*} \right] + \frac{2(m-1)^{2}}{m^{2}} \mathbb{E} \left[ \| \nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2} \right] \end{array} $$
(29)

the first inequality follows from \(\|a+b\|^{2} \le 2\left(\|a\|^{2} + \|b\|^{2}\right)\) for \(a,b\in \mathbb{R}^{d}\), and the second inequality from applying Lemma 3.

$$ \begin{array}{@{}rcl@{}} \text{Now,} &&\frac{2(m-1)^{2}}{m^{2}} \mathbb{E} \left[ \| \nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2} \right]\\ &\le& \frac{2(m-1)^{2}}{m^{2}} \left[2 \mathbb{E} \|\nabla_{B_{k}} f(\tilde{w}^{s-1}) - \nabla_{B_{k}} f(w^{*}) \|^{2}\right.\\&&\qquad\quad\quad\quad\left. + 2 \mathbb{E} \| \nabla_{B_{k}} f(w^{*}) \|^{2} \right]\\ & \le& \frac{8L(m-1)^{2}}{m^{2}}\left[ f(\tilde{w}^{s-1}) - f(w^{*}) \right] + R^{\prime} \end{array} $$
(30)

the first inequality follows from \(\|a+b\|^{2} \le 2\left(\|a\|^{2} + \|b\|^{2}\right)\) for \(a,b\in \mathbb{R}^{d}\), and the second inequality follows from Lemma 4, assuming \(\mathbb{E}\| \nabla_{B_{k}} f(w^{*}) \|^{2} \le R\ \forall k\) and taking \(R^{\prime} = \frac{4(m-1)^{2}}{m^{2}} R\). Now, substituting the above inequality in (29), we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E} \left[ \| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}}) \|^{2}\right]\\ &\le& 8L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} + f(\tilde{w}^{s-1}) - f^{*} \right]\\ &&+ \frac{8L(m-1)^{2}}{m^{2}}\left[ f(\tilde{w}^{s-1}) - f(w^{*}) \right] + R^{\prime}\\ &=& 8L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} \right]\\ &&+ \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}} \left[ \!f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime}, \end{array} $$
(31)

This proves the required lemma. □
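
The expectation computed in (28) can be checked numerically in the same way: under the same illustrative quadratic setup as before (an assumption for illustration only), averaging the Lemma 6 estimator over all mini-batches of size b reproduces the biased mean \(\nabla f(w) + \frac{m-1}{m}\nabla f(\tilde{w}^{s-1})\).

```python
import itertools
import numpy as np

# Numerical check of (28): the SAAG estimator of Lemma 6 (snap-point
# batch term normalized by 1/n rather than 1/b) is biased, with
# E[grad_est] = grad_f(w) + ((m-1)/m) grad_f(w_snap), where n = m*b.
rng = np.random.default_rng(2)
n, b, d = 6, 2, 3
m = n // b                                          # n = m*b
A, c = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_i(w, i):
    return A[i] * (A[i] @ w - c[i])

def grad_f(w):
    return np.mean([grad_i(w, i) for i in range(n)], axis=0)

w, w_snap = rng.standard_normal(d), rng.standard_normal(d)
mu = grad_f(w_snap)

ests = [np.mean([grad_i(w, i) for i in B], axis=0)
        - sum(grad_i(w_snap, i) for i in B) / n     # note 1/n, not 1/b
        + mu
        for B in itertools.combinations(range(n), b)]
print(np.allclose(np.mean(ests, axis=0),
                  grad_f(w) + (m - 1) / m * mu))    # True: matches (28)
```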

Lemma 7 (Variance Bound for non-smooth problem)

Under Assumption 1 and taking notations as in Lemma 6, the variance bound satisfies the following inequality,

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[\| \tilde{\nabla}_{s,k} -\nabla f({w^{s}_{k}})\|^{2} \right] &\le& 8L\alpha(b) \left[ F({w^{s}_{k}}) - F^{*} \right]\\ &&+ \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}}\\ &&\times\left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime}, \end{array} $$
(32)

where α(b) = (n − b)/(b(n − 1)) and \(R^{\prime}\) is a constant.

Proof

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E} \left[ \| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}}) \|^{2}\right]\\ &=& \mathbb{E}\left[\| \nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B^{\prime}_{k}} f(\tilde{w}^{s-1}) +\nabla f (\tilde{w}^{s-1}) - \nabla f ({w^{s}_{k}})\|^{2}\right]\\ &=& \mathbb{E}\left[\| \nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B_{k}} f(\tilde{w}^{s-1}) +\nabla f (\tilde{w}^{s-1}) - \nabla f ({w^{s}_{k}}) + \frac{m-1}{m}\nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2}\right]\\ &\le& 2 \mathbb{E} \left[ \| \nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B_{k}} f(\tilde{w}^{s-1}) +\nabla f (\tilde{w}^{s-1}) - \nabla f ({w^{s}_{k}})\|^{2}\right] + \frac{2(m-1)^{2}}{m^{2}} \mathbb{E} \left[ \| \nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2} \right]\\ &\le& 8L\alpha(b) \left[ F({w^{s}_{k}}) - F({w^{*}}) + F(\tilde{w}^{s-1}) - F({w^{*}}) \right] + \frac{2(m-1)^{2}}{m^{2}} \mathbb{E} \left[ \| \nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2} \right] \end{array} $$
(33)

the first inequality follows from \(\|a+b\|^{2} \le 2\left(\|a\|^{2} + \|b\|^{2}\right)\) for \(a,b\in \mathbb{R}^{d}\), and the second inequality from applying Lemma 2.

$$ \begin{array}{@{}rcl@{}} \text{Now,} &&\frac{2(m-1)^{2}}{m^{2}} \mathbb{E} \left[ \| \nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2} \right]\\ &\le& \frac{2(m-1)^{2}}{m^{2}} \left[2 \mathbb{E} \|\nabla_{B_{k}} f(\tilde{w}^{s-1}) - \nabla_{B_{k}} f(w^{*}) \|^{2}\right.\\ &&\qquad\quad\quad~\left.+ 2 \mathbb{E} \| \nabla_{B_{k}} f(w^{*}) \|^{2} \right]\\ & \le& \frac{8L(m-1)^{2}}{m^{2}}\left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime} \end{array} $$
(34)

the first inequality follows from \(\|a+b\|^{2} \le 2\left(\|a\|^{2} + \|b\|^{2}\right)\) for \(a,b\in \mathbb{R}^{d}\), and the second inequality follows from Lemma 5, assuming \(\| \nabla_{B_{k}} f(w^{*}) \|^{2} \le R\ \forall k\) and taking \(R^{\prime} = \frac{4(m-1)^{2}}{m^{2}} R\). Now, substituting the above inequality in (33), we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E} \left[ \| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}}) \|^{2}\right]\\ &\le& 8L\alpha(b) \left[ F({w^{s}_{k}}) - F^{*} + F(\tilde{w}^{s-1}) - F^{*} \right]\\ &&+ \frac{8L(m-1)^{2}}{m^{2}}\left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime}\\ &=& 8L\alpha(b) \left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}}\\ &&\times\left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime}, \end{array} $$
(35)

This proves the required lemma. □

Proof of Theorem 1

(Non-strongly convex and smooth problem with SAAG-IV)

Proof

By smoothness, we have,

$$ \begin{array}{@{}rcl@{}} f(w^{s}_{k + 1}) &\le& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}, \end{array} $$
(36)

where β is an appropriately chosen positive value. Now, simplifying the terms separately, we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}\left[ f({w^{s}_{k}}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ &=& f({w^{s}_{k}}) + \mathbb{E}\left[<\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\right] + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}) + \frac{m-1}{m}\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ &\le& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ &&+ \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ & =& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ &&+ \frac{m-1}{m}\left[<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - w^{*}>- <\nabla f(\tilde{w}^{s-1}), {w^{s}_{k}} - w^{*}>\right]\\ &\le& f(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right]\\ &&+ \frac{m-1}{m}\left[\frac{1}{2\delta}\|\nabla f(\tilde{w}^{s-1})\|^{2} + \frac{\delta}{2} \|w^{s}_{k + 1} - w^{*}\|^{2} - \left[ \frac{1}{2\delta}\|\nabla f(\tilde{w}^{s-1})\|^{2} +\frac{\delta}{2} \|{w^{s}_{k}} - w^{*}\|^{2}\right]\right]\\ &=& f(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{\delta(m-1)}{2m}\left[\|w^{s}_{k + 1} - w^{*}\|^{2} - \|{w^{s}_{k}} - w^{*}\|^{2}\right],\\ &=& f(w^{*}) + \left( \frac{L\beta}{2} - \frac{\delta(m-1)}{2m}\right) \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2} \right],\\ &=& f(w^{*}), \end{array} $$
(37)

the second equality follows from \(\mathbb{E}\left[\tilde{\nabla}_{s,k}\right] = \nabla f({w^{s}_{k}}) + \frac{m-1}{m}\nabla f(\tilde{w}^{s-1})\); the first inequality follows from Lemma 1; the second inequality follows from convexity, i.e., \(f(w^{*}) \ge f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}>\), and Young's inequality, i.e., \(x^{T}y \le \frac{1}{2\delta}\|x\|^{2} + \frac{\delta}{2}\|y\|^{2}\) for δ > 0; and the last equality follows by choosing \(\delta = \frac{mL\beta}{(m-1)}\).

$$ \begin{array}{@{}rcl@{}} \text{and}, &&\mathbb{E} \left[ <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ &\le& \mathbb{E} \left[ \frac{1}{2L(\beta-1)} \|\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}\|^{2} + \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ &=& \frac{1}{2L(\beta-1)} \mathbb{E} \left[\| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}})\|^{2} \right]\\ &\le& \frac{1}{2L(\beta-1)} \left[8L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime}\right]\\ &=& \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$
(38)

first inequality follows from Young’s inequality, second inequality follows from Lemma 6 and \(R^{\prime \prime } = R^{\prime }/(2L(\beta -1))\). Now, substituting the values into (36) from inequalities (37) and (38), and taking expectation w.r.t. mini-batches, we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[ f(w^{s}_{k + 1})\right] &\le& f(w^{*}) + \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime}\\ \mathbb{E} \left[ f(w^{s}_{k + 1}) - f(w^{*})\right] &\le& \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$
(39)

Taking sum over k = 0, 1,..., (m − 1) and dividing by m, we have

$$ \begin{array}{@{}rcl@{}} &&\frac{1}{m}{\sum}_{k = 0}^{m-1}\mathbb{E} \left[ f(w^{s}_{k + 1}) - f^{*}\right]\\ &\le& \frac{1}{m}{\sum}_{k = 0}^{m-1} \left[ \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime}\right]\\ &&\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ f({w^{s}_{k}}) - f^{*}\right]\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}{\sum}_{k = 1}^{m} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ f({w^{s}_{0}}) - f^{*} - \lbrace f({w^{s}_{m}}) - f^{*} \rbrace\right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f (\tilde{w}^{s-1}) - f^{*} \right]+ R^{\prime\prime} \end{array} $$
(40)

Subtracting \(\frac {4\alpha (b)}{(\beta -1)} \frac {1}{m}{\sum }_{k = 1}^{m} \left [ f({w^{s}_{k}}) - f^{*} \right ] \) from both sides, we have

$$ \begin{array}{@{}rcl@{}} &&\left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ f({w^{s}_{k}}) - f^{*}\right]\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ f({w^{s}_{0}}) - f^{*} - \lbrace f({w^{s}_{m}}) - f^{*} \rbrace\right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f (\tilde{w}^{s-1}) - f^{*} \right]+ R^{\prime\prime} \end{array} $$
(41)

Since \(f({w^{s}_{m}}) - f^{*} \ge 0\), we can drop this term; using Assumption 3, we have

$$ \begin{array}{@{}rcl@{}} &&\left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ f({w^{s}_{k}}) - f^{*}\right]\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ f({w^{s}_{0}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f (\tilde{w}^{s-1}) - f^{*} \right]+ R^{\prime\prime}\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ c\left( f (\tilde{w}^{s-1}) - f^{*}\right) \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f (\tilde{w}^{s-1}) - f^{*} \right]+ R^{\prime\prime}\\ &=&\left[ \frac{4\alpha(b)}{(\beta-1)}\frac{c}{m} +\frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \right] \left[ \left( f (\tilde{w}^{s-1}) - f^{*}\right) \right]+ R^{\prime\prime}, \end{array} $$
(42)

Dividing both sides by \(\left(1-\frac{4\alpha(b)}{(\beta-1)}\right)\), and since \(\tilde{w}^{s} = \frac{1}{m} {\sum}_{k = 1}^{m}{w^{s}_{k}}\), by convexity \(f(\tilde{w}^{s}) \le \frac{1}{m} {\sum}_{k = 1}^{m} f({w^{s}_{k}})\), we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[ f(\tilde{w}^{s}) - f^{*}\right] &\le& \!\left[ \!\frac{4\alpha(b)}{(\beta - 1 - 4\alpha(b))}\frac{c}{m} + \frac{4\left( \alpha(b)m^{2} + (m-1)^{2}\right)}{m^{2}(\beta - 1 - 4\alpha(b))} \right]\\ &&\times\left[ f (\tilde{w}^{s-1}) - f^{*} \right]+ R^{\prime\prime\prime}, \end{array} $$
(43)

where \(R^{\prime \prime \prime } = R^{\prime \prime } (\beta -1)/\left (\beta -1-4\alpha (b)\right )\). Now, applying this inequality recursively, we have

$$ \begin{array}{ll} \mathbb{E} \left[ f(\tilde{w}^{s}) - f^{*}\right] \le C^{s} \left[ f (\tilde{w}^{0}) - f^{*} \right] + R^{\prime\prime\prime\prime}, \end{array} $$
(44)

the inequality follows for \(R^{\prime\prime\prime\prime}= R^{\prime\prime\prime}/(1-C)\), since \({\sum}_{i = 0}^{k} r^{i} \le {\sum}_{i = 0}^{\infty} r^{i} = \frac{1}{1-r}\) for \(|r|<1\), and \(C = \left[ \frac{4\alpha(b)}{(\beta-1-4\alpha(b))}\frac{c}{m} +\frac{4\left(\alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1-4\alpha(b))} \right]\). For a suitable choice of β, one can easily show that C < 1. This proves linear convergence, up to an additional constant error term. □
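
Numerically, the recursion behind (43)-(44) behaves as follows; the values of C and \(R^{\prime\prime\prime}\) below are illustrative, not derived from the theorem's constants.

```python
# Iterating e_s <= C*e_{s-1} + R''' drives the suboptimality down
# linearly to the floor R'''' = R'''/(1-C), as in (44).
C, R = 0.5, 1e-3
e = 1.0                                  # e_0 = f(w~0) - f*
for s in range(1, 11):
    e = C * e + R
    print(s, round(e, 6))
print("floor R/(1-C):", R / (1 - C))     # 0.002
```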

Proof of Theorem 2

(Strongly convex and smooth problem with SAAG-IV)

Proof

By smoothness, we have,

$$ \begin{array}{@{}rcl@{}} f(w^{s}_{k + 1}) &\le& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \end{array} $$
(45)

where β is an appropriately chosen positive value. Now, simplifying the terms separately, we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}\left[ f({w^{s}_{k}}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ &=& f({w^{s}_{k}}) + \mathbb{E}\left[<\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\right] + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}) + \frac{m-1}{m}\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ &\le& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ &&+ \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ &&+ \frac{m-1}{m}\left[<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - \tilde{w}^{s-1}>- <\nabla f(\tilde{w}^{s-1}), {w^{s}_{k}} - \tilde{w}^{s-1}>\right]\\ &\le& f(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right]\\ &+& \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f(\tilde{w}^{s-1}) - \left( f({w^{s}_{k}}) - f(\tilde{w}^{s-1})\right)\right]\\ &=& f(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right], \end{array} $$
(46)

the second equality follows from \(\mathbb{E}\left[\tilde{\nabla}_{s,k}\right] = \nabla f({w^{s}_{k}}) + \frac{m-1}{m}\nabla f(\tilde{w}^{s-1})\), the first inequality follows from Lemma 1, and the second inequality follows from convexity, i.e., \(f(x) \ge f(y) + <\nabla f(y), x - y>\).

$$ \begin{array}{@{}rcl@{}} \text{and}, &&\mathbb{E} \left[ <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ &\le& \mathbb{E} \left[ \frac{1}{2L(\beta-1)} \|\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}\|^{2} + \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ &\le& \frac{1}{2L(\beta-1)} \left[8L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime}\right]\\ &=& \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$
(47)

first inequality follows from Young’s inequality and second inequality follows from Lemma 6 and \(R^{\prime \prime } = R^{\prime }/(2L(\beta -1))\).

Now, substituting the values into (45) from inequalities (46) and (47), and taking expectation w.r.t. mini-batches, we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E} \left[ f(w^{s}_{k + 1})\right]\\ &\le& f(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right]\\ &&+ \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime},\\ &&\mathbb{E} \left[ f(w^{s}_{k + 1})-f(w^{*})\right]\\ &\le& \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right]\\ &&+ \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$
(48)

Taking sum over k = 0, 1,..., (m − 1) and dividing by m, we have

$$ \begin{array}{@{}rcl@{}} &&\frac{1}{m}{\sum}_{k = 0}^{m-1}\mathbb{E} \left[ f(w^{s}_{k + 1})-f(w^{*})\right]\\ &\le& \frac{1}{m}{\sum}_{k = 0}^{m-1}\left\lbrace \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right]\right\rbrace\\ &&+ \frac{1}{m}{\sum}_{k = 0}^{m-1}\left\lbrace\frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime}\right\rbrace\\ &&\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ f({w^{s}_{k}})-f(w^{*})\right]\\ &\le& \frac{L\beta}{2m} \left[ \| w^{*} - {w^{s}_{0}} \|^{2} - \| w^{*} - {w^{s}_{m}}\|^{2}\right] + \frac{m-1}{m^{2}}\left[ f({w^{s}_{m}}) - f({w^{s}_{0}})\right]\\ &&+ \frac{4\alpha(b)}{(\beta-1)}\frac{1}{m}\left\lbrace {\sum}_{k = 1}^{m}\left[ f({w^{s}_{k}}) - f^{*} \right] + f({w^{s}_{0}}) - f^{*} - \left( f({w^{s}_{m}}) - f^{*} \right)\right\rbrace\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$
(49)

Subtracting \(\frac {4\alpha (b)}{(\beta -1)}\frac {1}{m}{\sum }_{k = 1}^{m}\left [ f({w^{s}_{k}}) - f^{*} \right ]\) from both sides, we have

$$ \begin{array}{@{}rcl@{}} &&\left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ f({w^{s}_{k}}) - f^{*}\right]\\ &\le& \frac{L\beta}{2m} \left[ \| w^{*} - {w^{s}_{0}} \|^{2} - \| w^{*} - {w^{s}_{m}}\|^{2}\right] - \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}\right) \left[ f({w^{s}_{0}}) -f({w^{s}_{m}}) \right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime}\\ &\le& \frac{L\beta}{2m} \| w^{*} - {w^{s}_{0}} \|^{2} - \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}\right) \left[ f({w^{s}_{0}}) -f^{*} - \lbrace f({w^{s}_{m}}) - f^{*}\rbrace \right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime}\\ &\le& \frac{L\beta}{2m} \frac{2}{\mu}\left( f({w^{s}_{0}})- f^{*} \right) - \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}\right) \left[ c\left[ f(\tilde{w}^{s-1}) - f^{*} \right] - c\left[ f(\tilde{w}^{s}) - f^{*} \right] \right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime}\\ &\le& \frac{L\beta}{2m} \frac{2}{\mu}c\left[ f(\tilde{w}^{s-1}) - f^{*} \right] - \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}\right) \left[ c\left[ f(\tilde{w}^{s-1}) - f^{*} \right] - c\left[ f(\tilde{w}^{s}) - f^{*} \right] \right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$
(50)

the second inequality follows by dropping the term \(-\| w^{*} - {w^{s}_{m}}\|^{2} \le 0\), the third inequality follows from strong convexity, i.e., \(\| {w^{s}_{0}} - w^{*}\|^{2} \le \frac{2}{\mu} \left(f({w^{s}_{0}})- f^{*} \right)\), and from applying Assumption 3 twice, and the fourth inequality follows from Assumption 3.

Since \(\tilde{w}^{s} = \frac{1}{m} {\sum}_{k = 1}^{m}{w^{s}_{k}}\), by convexity \(f(\tilde{w}^{s}) \le \frac{1}{m} {\sum}_{k = 1}^{m} f({w^{s}_{k}})\), we have

$$ \begin{array}{@{}rcl@{}} &&\left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\mathbb{E} \left[ f(\tilde{w}^{s}) - f^{*}\right]\\ &\le& \frac{L\beta}{2m} \frac{2}{\mu}c\left[ f(\tilde{w}^{s-1}) - f^{*} \right] - \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}\right)\\ &&\times\left[ c\left[ f(\tilde{w}^{s-1}) - f^{*} \right] - c\left[ f(\tilde{w}^{s}) - f^{*} \right] \right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$
(51)

Subtracting \(c\left(\frac{m-1}{m^{2}} - \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}\right) \mathbb{E} \left[ f(\tilde{w}^{s}) - f^{*}\right]\) from both sides, we have

$$ \begin{array}{@{}rcl@{}} &&\left( 1-\frac{4\alpha(b)}{(\beta-1)} - c\left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}\right)\right) \mathbb{E} \left[ f(\tilde{w}^{s}) - f^{*}\right]\\ &\le& \left[\frac{cL\beta}{m\mu} + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} -\frac{c(m-1)}{m^{2}} + \frac{4\alpha(b)}{(\beta-1)}\right]\\ &&\times\left[ f(\tilde{w}^{s-1}) -f^{*} \right]+ R^{\prime\prime} \end{array} $$
(52)

Dividing both sides by \( \left (1-\frac {4\alpha (b)}{(\beta -1)} - c\left (\frac {m-1}{m^{2}} - \frac {4\alpha (b)}{(\beta -1)} \frac {1}{m}\right )\right )\), we have

$$ \mathbb{E} \left[ f(\tilde{w}^{s}) - f^{*}\right] \le C\left[ f(\tilde{w}^{s-1}) -f^{*} \right]+ R^{\prime\prime\prime} $$
(53)

where \(C = \left [\frac {cL\beta }{m\mu } + \frac {4\left (\alpha (b)m^{2}+(m-1)^{2}\right )}{m^{2}(\beta -1)} -\frac {c(m-1)}{m^{2}} + \frac {4\alpha (b)}{(\beta -1)}\right ]\)\(\left (1-\frac {4\alpha (b)}{(\beta -1)} - c\left (\frac {m-1}{m^{2}} - \frac {4\alpha (b)}{(\beta -1)} \frac {1}{m}\right )\right )^{-1}\) and \(R^{\prime \prime \prime }= R^{\prime \prime }\left (1-\frac {4\alpha (b)}{(\beta -1)} - c\left (\frac {m-1}{m^{2}} - \frac {4\alpha (b)}{(\beta -1)} \frac {1}{m}\right )\right )^{-1}\). Now, recursively applying the inequality, we have

$$ \mathbb{E} \left[ f(\tilde{w}^{s}) - f^{*}\right] \le C^{s}\left[ f(\tilde{w}^{0}) -f^{*} \right]+ R^{\prime\prime\prime\prime}, $$
(54)

the inequality follows for \(R^{\prime\prime\prime\prime}= R^{\prime\prime\prime}/(1-C)\), since \({\sum}_{i = 0}^{k} r^{i} \le {\sum}_{i = 0}^{\infty} r^{i} = \frac{1}{1-r}\) for \(|r|<1\). For a suitable choice of β, one can easily show that C < 1. This proves linear convergence, up to an additional constant error term. □

Proof of Theorem 3

(Non-strongly convex and non-smooth problem with SAAG-IV)

Proof

By smoothness, we have,

$$ \begin{array}{@{}rcl@{}} f(w^{s}_{k + 1}) &\le& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> \\ &&+ \frac{L}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \end{array} $$
(55)
$$ \begin{array}{@{}rcl@{}} \text{Now,} &&F(w^{s}_{k + 1}) = f(w^{s}_{k + 1}) + g(w^{s}_{k + 1})\\ &\le& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \end{array} $$
(56)

where β is an appropriately chosen positive value. Now, simplifying the terms separately, we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}\left[ f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + \mathbb{E}\left[<\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\right] + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}) + \frac{m-1}{m}\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ & \le& f({w^{s}_{k}}) + g(w^{*}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ && + \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ & =& f({w^{s}_{k}}) + g(w^{*}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ && + \frac{m-1}{m}\left[<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - w^{*}>- <\nabla f(\tilde{w}^{s-1}), {w^{s}_{k}} - w^{*}>\right]\\ & \le& f(w^{*}) + g(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right]\\ && + \frac{m-1}{m}\left[\frac{1}{2\delta}\|\nabla f(\tilde{w}^{s-1})\|^{2} + \frac{\delta}{2} \|w^{s}_{k + 1} - w^{*}\|^{2} - \left[ \frac{1}{2\delta}\|\nabla f(\tilde{w}^{s-1})\|^{2} +\frac{\delta}{2} \|{w^{s}_{k}} - w^{*}\|^{2}\right]\right]\\ & =& F(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{\delta(m-1)}{2m}\left[\|w^{s}_{k + 1} - w^{*}\|^{2} - \|{w^{s}_{k}} - w^{*}\|^{2}\right],\\ & =& F(w^{*}) + \left( \frac{L\beta}{2} - \frac{\delta(m-1)}{2m}\right) \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2} \right],\\ & =&F(w^{*}), \end{array} $$
(57)

the second equality follows from \(\mathbb{E}\left[\tilde{\nabla}_{s,k}\right] = \nabla f({w^{s}_{k}}) + \frac{m-1}{m}\nabla f(\tilde{w}^{s-1})\); the first inequality follows from Lemma 1; the second inequality follows from convexity, i.e., \(f(w^{*}) \ge f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}>\), and Young's inequality, i.e., \(x^{T}y \le \frac{1}{2\delta}\|x\|^{2} + \frac{\delta}{2}\|y\|^{2}\) for δ > 0; and the last equality follows by choosing \(\delta = \frac{mL\beta}{(m-1)}\).

$$ \begin{array}{@{}rcl@{}} \text{And}, &&\mathbb{E} \left[ <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ & \le& \mathbb{E} \left[ \frac{1}{2L(\beta-1)} \|\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}\|^{2} + \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ & =& \frac{1}{2L(\beta-1)} \mathbb{E} \left[\| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}})\|^{2} \right]\\ & \le& \frac{1}{2L(\beta-1)}\left[8L\alpha(b) \left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime}\right]\\ & =& \frac{4\alpha(b)} {(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime} \end{array} $$
(58)

first inequality follows from Young’s inequality and second inequality follows from Lemma 7 and \(R^{\prime \prime } = R^{\prime }/(2L(\beta -1))\). Now, substituting the values into (56) from inequalities (57) and (58), and taking expectation w.r.t. mini-batches, we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[ F(w^{s}_{k + 1})\right] &\le& F(w^{*}) + \frac{4\alpha(b)}{(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime}\\ \mathbb{E} \left[ F(w^{s}_{k + 1}) - F(w^{*})\right] &\le& \frac{4\alpha(b)}{(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime} \end{array} $$
(59)

Taking sum over k = 0, 1,..., (m − 1) and dividing by m, we have

$$ \begin{array}{@{}rcl@{}} &&\frac{1}{m}{\sum}_{k = 0}^{m-1}\mathbb{E} \left[ F(w^{s}_{k + 1})-F(w^{*})\right]\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}{\sum}_{k = 0}^{m-1} \left[ F({w^{s}_{k}}) - F(w^{*}) \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime}\\ &&\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ F({w^{s}_{k}}) - F(w^{*})\right]\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left\lbrace {\sum}_{k = 1}^{m} \left[ F({w^{s}_{k}}) - F(w^{*}) \right] + F({w^{s}_{0}}) - F(w^{*}) - \lbrace F({w^{s}_{m}}) - F(w^{*}) \rbrace \right\rbrace\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime} \end{array} $$
(60)

Subtracting \(\frac {4\alpha (b)}{(\beta -1)} \frac {1}{m}{\sum }_{k = 1}^{m} \left [ F({w^{s}_{k}}) - F(w^{*})\right ] \) from both sides, we have

$$ \begin{array}{@{}rcl@{}} \!\!&&\left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ F({w^{s}_{k}}) - F(w^{*})\right]\\ \!\!& \le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ F({w^{s}_{0}}) - F(w^{*}) - \lbrace F({w^{s}_{m}}) - F(w^{*}) \rbrace\right]\\ \!\!&& + \frac{4\left( \alpha(b)m^{2} + (m - 1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime} \end{array} $$
(61)

Since \(F({w^{s}_{m}}) - F(w^{*}) \ge 0\), we can drop this term; using Assumption 3, we have

$$ \begin{array}{@{}rcl@{}} &&\left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ F({w^{s}_{k}}) - F(w^{*})\right]\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ F({w^{s}_{0}}) - F(w^{*}) \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime}\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ c\left( F(\tilde{w}^{s-1}) - F(w^{*})\right) \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime}\\ &=& \left( \frac{4\alpha(b)}{(\beta-1)} \frac{c}{m}+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \right) \left[ \left( F(\tilde{w}^{s-1}) - F(w^{*})\right) \right] + R^{\prime\prime}, \end{array} $$
(62)

Dividing both sides by \(\left(1-\frac{4\alpha(b)}{(\beta-1)}\right)\), and since \(\tilde{w}^{s} = \frac{1}{m} {\sum}_{k = 1}^{m}{w^{s}_{k}}\), by convexity \(F(\tilde{w}^{s}) \le \frac{1}{m} {\sum}_{k = 1}^{m} F({w^{s}_{k}})\), we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[ F(\tilde{w}^{s}) - F^{*}\right]\! &\le&\! \left( \! \frac{4\alpha(b)}{(\beta - 1 - 4\alpha(b))} \frac{c}{m}+ \frac{4\left( \alpha(b)m^{2} + (m-1)^{2}\right)}{m^{2}(\beta - 1 - 4\alpha(b))}\!\right)\\ &&\times\left[ F (\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime}, \end{array} $$
(63)

where \(R^{\prime\prime\prime}= R^{\prime\prime} \left(1-\frac{4\alpha(b)}{(\beta-1)}\right)^{-1}\). Now, applying the above inequality recursively, we have

$$ \begin{array}{ll} \mathbb{E} \left[ F(\tilde{w}^{s}) - F(w^{*})\right] \le C^{s} \left[ F(\tilde{w}^{0}) - F(w^{*}) \right] + R^{\prime\prime\prime\prime}, \end{array} $$
(64)

the inequality follows for \(R^{\prime\prime\prime\prime}= R^{\prime\prime\prime}/(1-C)\), since \({\sum}_{i = 0}^{k} r^{i} \le {\sum}_{i = 0}^{\infty} r^{i} = \frac{1}{1-r}\) for \(|r|<1\), and \(C = \left(\frac{4\alpha(b)}{(\beta-1-4\alpha(b))} \frac{c}{m}+ \frac{4\left(\alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1-4\alpha(b))}\right)\). For a suitable choice of β, one can easily show that C < 1. This proves linear convergence, up to an additional constant error term. □

Proof of Theorem 4

(Strongly convex and non-smooth problem with SAAG-IV)

Proof

By smoothness, we have,

$$ \begin{array}{@{}rcl@{}} f(w^{s}_{k + 1}) & \le& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> \\ &&+ \frac{L}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ \end{array} $$

Now,

$$ \begin{array}{@{}rcl@{}} F(w^{s}_{k + 1}) &=& f(w^{s}_{k + 1}) + g(w^{s}_{k + 1})\\ &\le& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \end{array} $$
(65)

where β is an appropriately chosen positive constant. Now, simplifying the terms separately, we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}\left[ f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + \mathbb{E}\left[<\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\right] + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}) + \frac{m-1}{m}\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ & \le& f({w^{s}_{k}}) + g(w^{*}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ &&+ \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ & =& f({w^{s}_{k}}) + g(w^{*}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ && + \frac{m-1}{m}\left[<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - \tilde{w}^{s-1}>- <\nabla f(\tilde{w}^{s-1}), {w^{s}_{k}} - \tilde{w}^{s-1}>\right]\\ & \le& f(w^{*}) + g(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right]\\ && + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f(\tilde{w}^{s-1}) - \left( f({w^{s}_{k}}) - f(\tilde{w}^{s-1})\right)\right]\\ & =& f(w^{*}) + g(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right],\\ & =& F(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right], \end{array} $$
(66)

where the second equality follows from \(\mathbb {E}\left [\tilde {\nabla }_{s,k}\right ] = \nabla f({w^{s}_{k}}) + \frac {m-1}{m}\nabla f(\tilde {w}^{s-1})\), the first inequality follows from Lemma 1, and the second inequality follows from convexity, i.e., \(f(x) \ge f(y) + <\nabla f(y), x-y>\).
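The building block behind this expectation identity is that a uniformly sampled mini-batch gradient is unbiased, \(\mathbb {E}_{B}\left [\frac {1}{b}{\sum }_{i \in B} \nabla f_{i}(w)\right ] = \nabla f(w)\); the extra \(\frac {m-1}{m}\nabla f(\tilde {w}^{s-1})\) term is precisely what makes the SAAG estimator biased. A hedged sketch checking the unbiasedness exactly, by enumerating every mini-batch of hypothetical least-squares components (the precise definition of \(\tilde {\nabla }_{s,k}\) is given earlier in the paper):

```python
import numpy as np
from itertools import combinations

# Exact check that a uniform mini-batch gradient is unbiased:
# E_B[(1/b) sum_{i in B} grad f_i(w)] = grad f(w), enumerating every batch.
# Components f_i(w) = 0.5*(a_i^T w - y_i)^2 are hypothetical, not the paper's data.
rng = np.random.default_rng(3)
n, d, b = 6, 3, 2
A, y, w = rng.standard_normal((n, d)), rng.standard_normal(n), rng.standard_normal(3)

grad_i = lambda i: (A[i] @ w - y[i]) * A[i]               # gradient of component i
full_grad = sum(grad_i(i) for i in range(n)) / n          # grad f(w)

batches = list(combinations(range(n), b))                 # all size-b mini-batches
batch_mean = sum(sum(grad_i(i) for i in B) / b for B in batches) / len(batches)
assert np.allclose(batch_mean, full_grad)                 # unbiased, exactly
```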

For the remaining term, we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E} \left[ <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ & \le& \mathbb{E} \left[ \frac{1}{2L(\beta-1)} \|\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}\|^{2} + \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ & =& \frac{1}{2L(\beta-1)} \mathbb{E} \left[\| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}})\|^{2} \right]\\ & \le& \frac{1}{2L(\beta-1)}\left[8L\alpha(b) \left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime}\right]\\ & =& \frac{4\alpha(b)} {(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime}, \end{array} $$
(67)

where the first inequality follows from Young's inequality, the second inequality follows from Lemma 7, and \(R^{\prime \prime } = R^{\prime }/(2L(\beta -1))\).
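To see the first step concretely, Young's inequality states \(<a, b> \le \frac{\|a\|^{2}}{2c} + \frac{c}{2}\|b\|^{2}\) for any \(c > 0\); (67) applies it with \(a = \nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}\), \(b = w^{s}_{k + 1} - {w^{s}_{k}}\) and \(c = L(\beta-1)\). A minimal numeric check of the inequality itself (hypothetical vectors and constants):

```python
import numpy as np

# Numeric check of Young's inequality <a, b> <= ||a||^2/(2c) + (c/2)||b||^2, c > 0,
# which (67) applies with c = L*(beta - 1); vectors and c here are hypothetical.
rng = np.random.default_rng(1)
for _ in range(1000):
    a, b = rng.standard_normal(8), rng.standard_normal(8)
    c = rng.uniform(0.1, 10.0)
    assert a @ b <= a @ a / (2 * c) + (c / 2) * (b @ b) + 1e-12
```

Now, substituting inequalities (66) and (67) into (65) and taking expectation with respect to the mini-batches, we have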

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E} \left[ F(w^{s}_{k + 1})\right]\\ & \le& F(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right]\\ &&+ \frac{4\alpha(b)} {(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime}\\ &&\mathbb{E} \left[ F(w^{s}_{k + 1})-F(w^{*})\right]\\ & \le& \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right]\\ &&+ \frac{4\alpha(b)} {(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime} \end{array} $$
(68)

Taking sum over k = 0, 1,..., (m − 1) and dividing by m, we have

$$ \begin{array}{@{}rcl@{}} &&\frac{1}{m}{\sum}_{k = 0}^{m-1}\mathbb{E} \left[ F(w^{s}_{k + 1})-F(w^{*})\right]\\ & \le& \frac{1}{m}{\sum}_{k = 0}^{m-1}\left\lbrace \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right] \right\rbrace\\ &&+ \frac{1}{m}{\sum}_{k = 0}^{m-1}\left\lbrace\frac{4\alpha(b)} {(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime} \right\rbrace\\ &&\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ F({w^{s}_{k}})-F(w^{*})\right]\\ & \le& \frac{L\beta}{2m} \left[ \| w^{*} - {w^{s}_{0}} \|^{2} - \| w^{*} - {w^{s}_{m}}\|^{2}\right] + \frac{m-1}{m^{2}}\left[ f({w^{s}_{m}}) - f({w^{s}_{0}})\right]\\ &&+ \frac{4\alpha(b)}{(\beta-1)}\frac{1}{m}\left\lbrace {\sum}_{k = 1}^{m}\left[ F({w^{s}_{k}}) - F(w^{*}) \right] + F({w^{s}_{0}}) - F(w^{*}) - \left( F({w^{s}_{m}}) - F(w^{*}) \right)\right\rbrace\\ && + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime} \end{array} $$
(69)
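In the first sum of (69), the squared-distance terms telescope, \({\sum }_{k=0}^{m-1}\left (\|w^{*}-{w^{s}_{k}}\|^{2} - \|w^{*}-w^{s}_{k+1}\|^{2}\right ) = \|w^{*}-{w^{s}_{0}}\|^{2} - \|w^{*}-{w^{s}_{m}}\|^{2}\), and the \(f\)-differences collapse the same way. A one-line check of the identity (hypothetical values):

```python
import numpy as np

# Telescoping identity used in (69): consecutive differences collapse to first minus last.
# a_k stands in for ||w* - w^s_k||^2; the values are hypothetical.
a = np.random.default_rng(2).uniform(size=11)            # a_0, ..., a_m with m = 10
assert np.isclose(sum(a[k] - a[k + 1] for k in range(10)), a[0] - a[10])
```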

Subtracting \(\frac {4\alpha (b)}{(\beta -1)}\frac {1}{m}{\sum }_{k = 1}^{m}\left [ F({w^{s}_{k}}) - F(w^{*}) \right ]\) from both sides, we have

$$ \begin{array}{@{}rcl@{}} && \left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ F({w^{s}_{k}}) - F(w^{*}) \right]\\ & \le& \frac{L\beta}{2m} \left[ \| w^{*} - {w^{s}_{0}} \|^{2} - \| w^{*} - {w^{s}_{m}}\|^{2}\right] + \frac{m-1}{m^{2}}\left[ f({w^{s}_{m}}) - f({w^{s}_{0}})\right]\\ &&+ \frac{4\alpha(b)}{(\beta-1)}\frac{1}{m}\left\lbrace F({w^{s}_{0}}) - F(w^{*}) - \left( F({w^{s}_{m}}) - F(w^{*}) \right)\right\rbrace\\ && + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime}\\ & \le& \frac{L\beta}{2m} \left[ \| w^{*} - {w^{s}_{0}} \|^{2} \right] + \frac{m-1}{m^{2}}\left[ F({w^{s}_{m}}) - F({w^{s}_{0}})\right] + \frac{4\alpha(b)}{(\beta-1)}\frac{1}{m}\left\lbrace F({w^{s}_{0}}) - F(w^{*}) - \left( F({w^{s}_{m}}) - F(w^{*}) \right)\right\rbrace\\ && + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime}\\ & \le& \frac{L\beta}{2m} \frac{2}{\mu} \left[ F({w^{s}_{0}}) - F(w^{*}) \right] + \frac{m-1}{m^{2}}\left[ F({w^{s}_{m}}) - F({w^{s}_{0}})\right] + \frac{4\alpha(b)}{(\beta-1)}\frac{1}{m}\left\lbrace F({w^{s}_{0}}) - F(w^{*}) - \left( F({w^{s}_{m}}) - F(w^{*}) \right)\right\rbrace\\ && + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime}\\ & =& \left( \frac{L\beta}{m\mu} + \frac{4\alpha(b)}{m(\beta-1)} - \frac{m-1}{m^{2}}\right) \left[ F({w^{s}_{0}}) - F(w^{*}) \right] + \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{m(\beta-1)} \right) \left[F({w^{s}_{m}}) - F(w^{*})\right]\\ && + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime}\\ & \le& \left( \frac{L\beta}{m\mu} + \frac{4\alpha(b)}{m(\beta-1)} - \frac{m-1}{m^{2}}\right) c \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{m(\beta-1)} \right) c \left[F(\tilde{w}^{s}) - F(w^{*})\right]\\ && + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime}\\ & \le& \left( \frac{Lc\beta}{m\mu} + \frac{4c\alpha(b)}{m(\beta-1)} - \frac{c(m-1)}{m^{2}} + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)}\right) \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right]\\ &&+ \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{m(\beta-1)}\right)c \left[F(\tilde{w}^{s}) - F(w^{*})\right] + R^{\prime\prime\prime} \end{array} $$
(70)

where the second inequality follows from dropping \(\| w^{*} - {w^{s}_{m}}\|^{2} \ge 0\) and from converting \(f({w^{s}_{m}}) - f({w^{s}_{0}})\) to \(F({w^{s}_{m}}) - F({w^{s}_{0}})\) by absorbing the regularizer into a constant, the third inequality follows from strong convexity: optimality of \(w^{*}\) gives \(F({w^{s}_{0}}) \ge F(w^{*}) + \frac{\mu}{2}\| {w^{s}_{0}} - w^{*}\|^{2}\), i.e., \(\| {w^{s}_{0}} - w^{*}\|^{2} \le 2/ \mu \left (F({w^{s}_{0}})- F(w^{*}) \right ) \), the fourth inequality follows from Assumption 3, and \(R^{\prime \prime \prime } = R^{\prime \prime } + (m-1)g({w^{s}_{0}})/m^{2}\). Since \(\tilde {w}^{s} = 1/m {\sum }_{k = 1}^{m}{w^{s}_{k}}\), convexity gives \(F(\tilde {w}^{s}) \le 1/m {\sum }_{k = 1}^{m} F({w^{s}_{k}})\); subtracting \(\left (\frac {m-1}{m^{2}} - \frac {4\alpha (b)}{m(\beta -1)} \right ) c \left [F(\tilde {w}^{s}) - F(w^{*})\right ]\) from both sides, we have

$$ \begin{array}{@{}rcl@{}} \!\!&&\left( \!1 - \frac{4\alpha(b)}{(\beta - 1)} - \frac{c(m - 1)}{m^{2}} + \frac{4c\alpha(b)}{m(\beta - 1)}\!\right)\mathbb{E} \left[ F(\tilde{w}^{s}) - F(w^{*}) \right]\\ \!\!&\le& \!\left( \!\frac{Lc\beta}{m\mu} \!+ \frac{4c\alpha(b)}{m(\beta-1)} - \frac{c(m-1)}{m^{2}} + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)}\right)\\ \!\!&&\times\left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime} \end{array} $$
(71)

Dividing both sides by \( \left (1-\frac {4\alpha (b)}{(\beta -1)} - \frac {c(m-1)}{m^{2}} + \frac {4c\alpha (b)}{m(\beta -1)}\right )\), we have

$$ \begin{array}{ll} &\mathbb{E} \left[ F(\tilde{w}^{s}) - F(w^{*}) \right] \le C \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime\prime} \end{array} $$
(72)

where \(C = \left (\frac {Lc\beta }{m\mu } + \frac {4c\alpha (b)}{m(\beta -1)} - \frac {c(m-1)}{m^{2}} + \frac {4\left (\alpha (b)m^{2}+(m-1)^{2}\right )}{m^{2}(\beta -1)}\right )\)\(\left (1-\frac {4\alpha (b)}{(\beta -1)} - \frac {c(m-1)}{m^{2}} + \frac {4c\alpha (b)}{m(\beta -1)}\right )^{-1} \) and \(R^{\prime \prime \prime \prime } = R^{\prime \prime \prime }\left (1-\frac {4\alpha (b)}{(\beta -1)} - \frac {c(m-1)}{m^{2}} + \frac {4c\alpha (b)}{m(\beta -1)}\right )^{-1} \). Now, applying this inequality recursively, we have

$$ \begin{array}{ll} \mathbb{E} \left[ F(\tilde{w}^{s}) - F(w^{*}) \right] \le C^{s} \left[ F(\tilde{w}^{0}) - F(w^{*}) \right] + R^{\prime\prime\prime\prime\prime}, \end{array} $$
(73)

where the inequality follows with \(R^{\prime \prime \prime \prime \prime }= R^{\prime \prime \prime \prime }/(1-C)\), since \({\sum }_{i = 0}^{k} r^{i} \le {\sum }_{i = 0}^{\infty } r^{i} = \frac {1}{1-r}\) for \(|r|<1\). For a suitable choice of β, one can easily show that C < 1. This proves linear convergence, up to the residual error \(R^{\prime \prime \prime \prime \prime }\). □
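The claim that C < 1 for a suitable β can also be seen numerically: the variance terms in the numerator of C in (72) shrink like \(1/(\beta-1)\), while the \(Lc\beta/(m\mu)\) term grows linearly in β, so β must be chosen in a middle range. A hedged sketch evaluating C with hypothetical constants (L, μ, m, α(b), c below are illustrative, not values derived in the paper):

```python
# Hedged numeric evaluation of the constant C from (72); all constants
# (L, mu, m, alpha_b, c) are hypothetical illustrations, not the paper's values.
L, mu, m, alpha_b, c = 1.0, 1.0, 100, 0.01, 1.0

def C(beta):
    num = (L * c * beta / (m * mu)
           + 4 * c * alpha_b / (m * (beta - 1))
           - c * (m - 1) / m**2
           + 4 * (alpha_b * m**2 + (m - 1)**2) / (m**2 * (beta - 1)))
    den = (1 - 4 * alpha_b / (beta - 1)
           - c * (m - 1) / m**2
           + 4 * c * alpha_b / (m * (beta - 1)))
    return num / den

for beta in (2.0, 5.0, 10.0, 50.0):
    print(f"beta = {beta:5.1f}  ->  C = {C(beta):.3f}")   # dips below 1 in a middle range
```

With these illustrative constants, C falls below 1 around β ≈ 10 and rises again for very large β, matching the trade-off visible in the expression for C.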


Cite this article

Chauhan, V.K., Sharma, A. & Dahiya, K. SAAGs: Biased stochastic variance reduction methods for large-scale learning. Appl Intell 49, 3331–3361 (2019). https://doi.org/10.1007/s10489-019-01450-3
