SAAGs: Biased stochastic variance reduction methods for large-scale learning

Chauhan, Vinod Kumar; Sharma, Anuj; Dahiya, Kalpana

doi:10.1007/s10489-019-01450-3

Published: 05 April 2019

Applied Intelligence, Volume 49, pages 3331–3361 (2019)

Abstract

Stochastic approximation is one of the most effective approaches to large-scale machine learning problems, and recent research has focused on reducing the variance caused by noisy approximations of the gradients. In this paper, we propose novel variants of SAAG-I and II (Stochastic Average Adjusted Gradient) (Chauhan et al. 2017), called SAAG-III and IV, respectively. Unlike SAAG-I, SAAG-III sets the starting point to the average of the previous epoch; unlike SAAG-II, SAAG-IV sets the snap point and the starting point to the average and the last iterate of the previous epoch, respectively. To determine the step size, we use Stochastic Backtracking-Armijo line Search (SBAS), which performs the line search only on the selected mini-batch of data points. Since standard backtracking line search is not suitable for large-scale problems, and the constants used to find the step size, such as the Lipschitz constant, are not always available, SBAS can be very effective in such cases. We also extend SAAGs (I, II, III and IV) to solve non-smooth problems and design two update rules, one for smooth and one for non-smooth problems. Moreover, our theoretical results prove linear convergence of SAAG-IV, in expectation, for all four combinations of smoothness and strong convexity. Finally, our experimental studies demonstrate the efficacy of the proposed methods against state-of-the-art techniques.

Notes

Experimental results can be reproduced using the code available at https://sites.google.com/site/jmdvinodjmd/code
https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
To simplify the proof, we take f(w) = F(w), i.e., f_i(w) = f_i(w) + g(w) ∀i, and then g(w) ≡ 0.

References

Allen-Zhu Z (2017) Katyusha: The first direct acceleration of stochastic gradient methods. Journal of Machine Learning Research (to appear) Full version available at arXiv:1603.05953
Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Lechevallier Y, Saporta G (eds) Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010). http://leon.bottou.org/papers/bottou-2010. Springer, Paris, pp 177–187
Bottou L, Bousquet O (2007) The tradeoffs of large scale learning. In: Proceedings of the 20th International Conference on Neural Information Processing Systems, Curran Associates Inc., USA, NIPS’07, pp 161–168, http://dl.acm.org/citation.cfm?id=2981562.2981583
Bottou L, Curtis FE, Nocedal J (2016) Optimization Methods for Large-Scale Machine Learning. arXiv:1606.04838
Cauchy A L (1847) Méthode générale pour la résolution des systèmes d’équations simultanées. Compte Rendu des Séances de L’Académie des Sciences XXV Série A(25):536–538
Chauhan VK, Dahiya K, Sharma A (2017) Mini-batch block-coordinate based stochastic average adjusted gradient methods to solve big data problems. In: Proceedings of the Ninth Asian Conference on Machine Learning, PMLR, vol 77, pp 49–64, http://proceedings.mlr.press/v77/chauhan17a.html
Chauhan VK, Dahiya K, Sharma A (2018a) Problem formulations and solvers in linear svm: a review. Artificial Intelligence Review. https://doi.org/10.1007/s10462-018-9614-6
Chauhan V K, Sharma A, Dahiya K (2018b) Faster learning by reduction of data access time. Appl Intell 48(12):4715–4729. https://doi.org/10.1007/s10489-018-1235-x
Chauhan VK, Sharma A, Dahiya K (2018c) Stochastic Trust Region Inexact Newton Method for Large-scale Machine Learning. arXiv:1812.10426
Csiba D, Richtárik P (2016) Importance Sampling for Minibatches, pp 1–19. arXiv:1602.02283v1
Defazio A, Bach F, Lacoste-Julien S (2014) Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS’14, MIT Press, Cambridge, pp 1646–1654
Fanhua S, Zhou K, Cheng J, Tsang IW, Zhang L, Tao D (2018) Vr-sgd: A simple stochastic variance reduction method for machine learning. arXiv:1802.09932
Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. In: Burges CJC, Bottou L, Welling M, Ghahramani Z, Weinberger KQ (eds) Advances in neural information processing systems, vol 26, Curran Associates, Inc., pp 315–323
Kiefer J, Wolfowitz J (1952) Stochastic estimation of the maximum of a regression function. Ann Math Stat 23:462–466
Konečný J, Richtárik P (2013) Semi-Stochastic Gradient Descent Methods 1:19. arXiv:1312.1666
Konečný J, Liu J, Richtárik P, Takáč M (2016) Mini-Batch Semi-Stochastic Gradient descent in the proximal setting. IEEE J Sel Top Signal Process 10(2):242–255
Lan G (2012) An optimal method for stochastic composite optimization. Math Program 133(1):365–397
Le Roux N, Schmidt M, Bach F (2012) A Stochastic Gradient Method with an Exponential Convergence Rate for Strongly-Convex Optimization with Finite Training Sets. Technical Report, INRIA
Lin H, Mairal J, Harchaoui Z (2015) A universal catalyst for first-order optimization. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, pp 3384–3392
Parikh N, Boyd S (2014) Proximal algorithms. Found Trends Optim 1(3):127–239
Rakhlin A, Shamir O, Sridharan K (2012) Making gradient descent optimal for strongly convex stochastic optimization. In: Proceedings of the 29th International Conference on International Conference on Machine Learning, ICML’12, pp 1571–1578
Robbins H, Monro S (1951) A Stochastic Approximation Method. Ann Math Statist 22(3):400–407. Retrieved from http://www.jstor.org/stable/2236626
Shalev-Shwartz S, Zhang T (2013) Stochastic dual coordinate ascent methods for regularized loss. J Mach Learn Res 14(1):567–599
Shalev-Shwartz S, Singer Y, Srebro N (2007) Pegasos: Primal estimated sub-gradient solver for svm. In: Proceedings of the 24th International Conference on Machine Learning, ICML ’07. ACM, New York, pp 807–814
Wang H, Banerjee A (2014) Randomized block coordinate descent for online and stochastic optimization, pp 1–19. arXiv:1407.0107
Wright S J (2015) Coordinate descent algorithms. Math Program 151(1):3–34
Xiao L, Zhang T (2014) A proximal stochastic gradient method with progressive variance reduction. SIAM J Optim 24(4):2057–2075
Xu Y, Yin W (2015) Block stochastic gradient iteration for convex and nonconvex optimization. SIAM J Optim 25(3):1686–1716
Yang Y, Chen J, Zhu J (2016) Distributing the stochastic gradient sampler for large-scale lda. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16. ACM, New York, pp 1975–1984
Yang Z, Wang C, Zhang Z, Li J (2018) Random barzilai–borwein step size for mini-batch algorithms. Eng Appl Artif Intell 72:124–135
Shen Z, Qian H, Mu T, Zhang C (2017) Accelerated doubly stochastic gradient algorithm for large-scale empirical risk minimization. In: IJCAI
Zhang T (2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-first International Conference on Machine Learning, ICML’04. ACM, New York, pp 116–
Zhang Y, Xiao L (2015) Stochastic primal-dual coordinate method for regularized empirical risk minimization. In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, pp 353–361
Zhao T, Yu M, Wang Y, Arora R, Liu H (2014) Accelerated Mini-batch Randomized Block Coordinate Descent Method. Advances in Neural Information Processing Systems, pp 3329–3337
Zhou L, Pan S, Wang J, Vasilakos AV (2017) Machine learning on big data: Opportunities and challenges. Neurocomputing 237:350–361. https://doi.org/10.1016/j.neucom.2017.01.026

Acknowledgements

The first author is thankful to the Ministry of Human Resource Development, Government of India, for providing a fellowship (University Grants Commission - Senior Research Fellowship) to pursue his PhD.

Author information

Authors and Affiliations

Computer Science and Applications, Panjab University, Chandigarh, India
Vinod Kumar Chauhan & Anuj Sharma
University Institute of Engineering and Technology, Panjab University, Chandigarh, India
Kalpana Dahiya

Corresponding author

Correspondence to Vinod Kumar Chauhan.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: More Experiments

A.1 Results with Support Vector Machine (SVM)

This subsection compares SAAGs against SVRG and VR-SGD on the SVM problem with the mushroom and gisette datasets. All methods use the stochastic backtracking line search to find the step size. Figure 8 presents the results, plotting suboptimality and accuracy against training time (in seconds). The results are similar to the experiments with logistic regression but are less smooth. SAAGs outperform the other methods on the mushroom dataset (first row) and the gisette dataset (second row) for suboptimality against training time and for accuracy against training time, except that all methods give almost the same accuracy against training time on the mushroom dataset. SAAG-IV outperforms the other methods, while SAAG-III sometimes lags behind VR-SGD. We also observe that the results with logistic regression are better than the results with the SVM problem. The optimization problem for SVM is given below:

$$ \underset{w}{\min} F(w) = \frac{1}{n} \sum\limits_{i = 1}^{n} \max\left( 0, 1 - y_{i} w^{T} x_{i} \right)^{2} + \frac{\lambda}{2} \|w\|^{2}, $$

(13)

where λ is the regularization coefficient (also called the penalty parameter), which balances the trade-off between margin size and error [7].
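As an illustration of how objective (13) and a mini-batch line search fit together, the following is a minimal sketch, not the authors' implementation. The function names, the Armijo constant `c1`, the shrink factor and the initial step are assumptions chosen for the example; the search evaluates the objective only on the sampled mini-batch, in the spirit of the stochastic backtracking line search mentioned above.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Objective of Eq. (13) evaluated on the rows of X (labels y in {-1, +1})."""
    margins = np.maximum(0.0, 1.0 - y * (X @ w))
    return np.mean(margins ** 2) + 0.5 * lam * np.dot(w, w)

def svm_gradient(w, X, y, lam):
    """Gradient of svm_objective with respect to w."""
    margins = np.maximum(0.0, 1.0 - y * (X @ w))
    return -2.0 * (X.T @ (margins * y)) / X.shape[0] + lam * w

def minibatch_armijo_step(w, direction, X_b, y_b, lam,
                          step=1.0, shrink=0.5, c1=1e-4, max_halvings=30):
    """Backtracking-Armijo search that only touches the mini-batch (X_b, y_b).

    The search moves along -direction, where `direction` may be any gradient
    estimate. step, shrink, c1 and max_halvings are assumed defaults.
    """
    f0 = svm_objective(w, X_b, y_b, lam)
    slope = np.dot(svm_gradient(w, X_b, y_b, lam), direction)
    for _ in range(max_halvings):
        trial = svm_objective(w - step * direction, X_b, y_b, lam)
        if trial <= f0 - c1 * step * slope:   # sufficient-decrease condition
            break
        step *= shrink
    return step

# Synthetic data, used only to exercise the functions.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = np.where(rng.normal(size=64) > 0, 1.0, -1.0)
w = np.zeros(5)
batch = rng.choice(64, size=32, replace=False)
g = svm_gradient(w, X[batch], y[batch], lam=1e-3)
eta = minibatch_armijo_step(w, g, X[batch], y[batch], lam=1e-3)
```

Because the search never touches the full dataset, its cost per iteration scales with the mini-batch size rather than with n, and no Lipschitz constant has to be supplied.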

A.2 Comparison of SAAGs (I, II, III and IV) for non-smooth problem

Figure 9 compares SAAGs on a non-smooth problem using the Adult dataset with mini-batches of 32 data points. As the figure shows, just as for the smooth problem, the results with SAAG-III and IV are stable and better than or equal to those of SAAG-I and II.

A.3 Effect of mini-batch size on SAAG-III, IV, SVRG and VR-SGD for non-smooth problem

Figure 10 shows the effect of mini-batch size on SAAG-III, IV, SVRG and VR-SGD for a non-smooth problem, using the rcv1 binary dataset with mini-batches of 32, 64 and 128 data points. As for the smooth problem, the proposed methods outperform SVRG and VR-SGD. SAAG-IV gives the best results in terms of time and epochs, but in terms of gradients/n, SAAG-III gives the best results.

A.4 Effect of mini-batch size on SAAGs (I, II, III, IV) for smooth problem

Figure 11 shows the effect of mini-batch size on SAAGs (I, II, III, IV) for a smooth problem, using the Adult dataset with mini-batch sizes of 32, 64 and 128 data points. The results are similar to those for the non-smooth problem.

A.5 Effect of regularization coefficient for non-smooth problem

Figure 12 depicts the effect of the regularization coefficient on SAAG-III, IV, SVRG and VR-SGD for a non-smooth problem using the rcv1 dataset. It considers regularization coefficient values of $10^{-3}$, $10^{-5}$ and $10^{-7}$. The results are similar to those for the smooth problem. As the figure shows, for the larger value, $10^{-3}$, the methods do not perform well, but once the coefficient is sufficiently small, its value makes little difference, and in all cases our proposed methods outperform SVRG and VR-SGD.

B Proofs

The following assumptions are used in the paper:

Assumption 1 (Smoothness)

Suppose function $f_{i}: \mathbb {R}^{n} \rightarrow \mathbb {R}$ is convex and differentiable, and that the gradient $\nabla f_{i}$, $\forall i$, is L-Lipschitz-continuous, where L > 0 is the Lipschitz constant. Then we have,

$$ \| \nabla f_{i}(y) - \nabla f_{i}(x)\| \le L \|y-x\|, $$

(14)

$$ \begin{array}{ll} \text{and},\quad f_{i}(y) \le f_{i}(x) +\nabla f_{i}(x)^{T}(y-x)+ \frac{L}{2} \|y-x\|^{2}. \end{array} $$

(15)

Assumption 2 (Strong Convexity)

Suppose function $F: \mathbb {R}^{n} \rightarrow \mathbb {R}$ is a μ-strongly convex function for μ > 0 and F^∗ is the optimal value of F. Then we have,

$$ F(y) \ge F(x) +\nabla F(x)^{T}(y-x) + \frac{\mu}{2} \|y-x\|^{2}, $$

(16)

$$ \begin{array}{ll} \text{and},\quad & F(x) - F^{*} \le \frac{1}{2\mu}\|\nabla F(x)\|^{2} \end{array} $$

(17)
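Inequality (17) follows from (16) by a standard argument, sketched here for a differentiable F. For fixed x, minimize both sides of (16) over y. The right-hand side is a quadratic in y that is minimized at $y = x - \frac{1}{\mu}\nabla F(x)$, so

$$ F^{*} = \min_{y} F(y) \ge F(x) - \frac{1}{\mu}\|\nabla F(x)\|^{2} + \frac{1}{2\mu}\|\nabla F(x)\|^{2} = F(x) - \frac{1}{2\mu}\|\nabla F(x)\|^{2}, $$

and rearranging gives (17).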

Assumption 3 (Assumption 3 in [12])

For all s = 1, 2,..., S, the following inequality holds

$$ \mathbb{E}\left[F({w^{s}_{0}}) - F(w^{*}) \right] \le c\mathbb{E}\left[F(\tilde{w}^{s-1}) - F(w^{*}) \right] $$

(18)

where 0 < c ≪ m is a constant.

We derive our proofs by taking motivation from [12] and [27]. Before providing the proofs, we provide certain lemmas, as given below:

Lemma 1 (3-Point Property [17])

Let $\hat {z}$ be the optimal solution of the following problem: $\underset {z\in \mathbb {R}^{d}}{\min }\quad \frac {\tau }{2} \|z-z_{0}\|^{2} + r(z)$, where τ ≥ 0 and r(z) is a convex function (but possibly non-differentiable). Then for any $z\in \mathbb {R}^{d}$, the following inequality holds,

$$ \frac{\tau}{2} \|\hat{z}-z_{0}\|^{2} + r(\hat{z}) \le r(z) + \frac{\tau}{2} \left( \|z-z_{0}\|^{2} - \|z - \hat{z}\|^{2} \right) $$

(19)
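Lemma 1 can be checked numerically on a concrete instance. The sketch below assumes $r(z) = \lambda\|z\|_{1}$, for which the minimizer $\hat{z}$ is given in closed form by soft-thresholding; this choice of r, the values of τ and λ, and the function names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def soft_threshold(v, t):
    """Closed-form minimiser of 0.5*||z - v||^2 + t*||z||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def three_point_gap(z, z0, tau, lam):
    """Right-hand side minus left-hand side of (19) for r(z) = lam*||z||_1."""
    z_hat = soft_threshold(z0, lam / tau)   # argmin of tau/2*||z - z0||^2 + r(z)
    lhs = 0.5 * tau * np.sum((z_hat - z0) ** 2) + lam * np.sum(np.abs(z_hat))
    rhs = lam * np.sum(np.abs(z)) + 0.5 * tau * (
        np.sum((z - z0) ** 2) - np.sum((z - z_hat) ** 2))
    return rhs - lhs

rng = np.random.default_rng(0)
z0 = rng.normal(size=10)
gaps = [three_point_gap(rng.normal(size=10), z0, tau=2.0, lam=0.5)
        for _ in range(1000)]
min_gap = min(gaps)   # Lemma 1 predicts a non-negative gap for every z
```

The gap is exactly zero at $z = \hat{z}$, which is the case in which (19) is tight.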

Lemma 2 (Theorem 4 in [16])

For non-smooth problems, taking $\tilde {\nabla }^{\prime }_{s,k} = \frac {1}{b} {\sum }_{i \in B_{k}} \nabla f_{i} ({w^{s}_{k}}) - \frac {1}{b} {\sum }_{i \in B_{k}} \nabla f_{i} (\tilde {w}^{s-1}) + \frac {1}{n} {\sum }_{i = 1}^{n} \nabla f_{i}(\tilde {w}^{s-1})$, we have $\mathbb {E} \left [\tilde {\nabla }^{\prime }_{s,k}\right ] = \nabla f({w^{s}_{k}})$ and the variance satisfies the following inequality,

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[\|\tilde{\nabla}^{\prime}_{s,k} - \nabla f({w^{s}_{k}})\|^{2}\right] &\le& 4L\alpha(b) \left[ F({w^{s}_{k}}) - F(w^{*})\right. \\ &&\left.+ F(\tilde{w}^{s-1}) - F(w^{*}) \right], \end{array} $$

(20)

where α(b) = (n − b)/(b(n − 1)).

Following Lemma 2 for non-smooth problems, one can easily prove the following result for smooth problems,

Lemma 3

For smooth problems, taking $\tilde {\nabla }^{\prime }_{s,k} = \frac {1}{b} {\sum }_{i \in B_{k}} \nabla f_{i} ({w^{s}_{k}}) - \frac {1}{b} {\sum }_{i \in B_{k}} \nabla f_{i} (\tilde {w}^{s-1}) + \frac {1}{n} {\sum }_{i = 1}^{n} \nabla f_{i}(\tilde {w}^{s-1})$, we have $\mathbb {E} \left [\tilde {\nabla }^{\prime }_{s,k}\right ] = \nabla f({w^{s}_{k}})$ and the variance satisfies the following inequality,

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[\|\tilde{\nabla}^{\prime}_{s,k} - \nabla f({w^{s}_{k}})\|^{2}\right] &\le& 4L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} \right.\\ &&\left.+ f(\tilde{w}^{s-1}) - f^{*}\right], \end{array} $$

(21)

where α(b) = (n − b)/(b(n − 1)).
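The factor α(b) in Lemmas 2 and 3 is the usual finite-population correction: if a mini-batch of size b is drawn uniformly without replacement (an assumption of this sketch, as is the use of random vectors as stand-ins for per-sample gradients), the variance of the mini-batch mean is exactly α(b) times the single-sample variance. The following enumerates every mini-batch of a tiny population to illustrate this.

```python
from itertools import combinations
import numpy as np

def alpha(n, b):
    """Mini-batch factor alpha(b) = (n - b) / (b (n - 1))."""
    return (n - b) / (b * (n - 1))

def exact_minibatch_variance(vectors, b):
    """E||mean over B - full mean||^2, averaged over all size-b subsets B."""
    n = len(vectors)
    full_mean = vectors.mean(axis=0)
    sq_errors = [np.sum((vectors[list(idx)].mean(axis=0) - full_mean) ** 2)
                 for idx in combinations(range(n), b)]
    return float(np.mean(sq_errors))

rng = np.random.default_rng(0)
n = 6
vectors = rng.normal(size=(n, 3))   # stand-ins for per-sample gradients
single_sample_variance = exact_minibatch_variance(vectors, 1)

for b in range(1, n + 1):
    ratio = exact_minibatch_variance(vectors, b) / single_sample_variance
    print(b, round(ratio, 6), round(alpha(n, b), 6))   # the two columns agree
```

In particular α(1) = 1 and α(n) = 0, so the variance bounds in (20) and (21) vanish when the mini-batch is the whole dataset.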

Lemma 4 (Extension of Lemma 3.4 in [27] to mini-batches)

Under Assumption 1 for smooth regularizer, we have

$$ \mathbb{E} \left[\|\nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B_{k}} f(w^{*})\|^{2}\right] \le 2L \left[ f({w^{s}_{k}}) - f(w^{*}) \right] $$

(22)

Proof

Given any k = 0, 1,..., (m − 1), consider the function,

$$ \phi_{B_{k}} (w) = f_{B_{k}} (w) - f_{B_{k}}(w^{*}) - \nabla_{B_{k}} f(w^{*})^{T} (w-w^{*}) $$

It is straightforward to check that $\nabla \phi _{B_{k}} (w^{*}) = 0$; since $\phi _{B_{k}}$ is convex, $\min _{w} \phi _{B_{k}} (w) = \phi _{B_{k}} (w^{*}) = 0$. Since the gradient of $\phi _{B_{k}} (w)$ is L-Lipschitz continuous (Assumption 1), we have,

$$ \begin{array}{@{}rcl@{}} \frac{1}{2L}\|\nabla\phi_{B_{k}} (w)\|^{2} &\le& \phi_{B_{k}} (w) - \min_{w} \phi_{B_{k}} (w)\\ &=& \phi_{B_{k}} (w) - \phi_{B_{k}} (w^{*}) = \phi_{B_{k}} (w)\\ \implies \| \nabla f_{B_{k}} (w) - \nabla f_{B_{k}}(w^{*})\|^{2} &\le& 2L \left[ f_{B_{k}} (w) - f_{B_{k}}(w^{*}) - \nabla_{B_{k}} f(w^{*})^{T} (w - w^{*}) \right] \end{array} $$

Taking expectation, we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E}[\| \nabla f_{B_{k}} (w) - \nabla f_{B_{k}}(w^{*})\|^{2} ] &\le& 2L \left[ f (w) - f(w^{*})\right.\\ && \left.- \nabla f(w^{*})^{T} (w-w^{*}) \right]\\ \end{array} $$

(23)

By optimality, ∇f(w^∗) = 0, we have

$$ \begin{array}{ll} \mathbb{E}[\| \nabla f_{B_{k}} (w) - \nabla f_{B_{k}}(w^{*})\|^{2} ] \le 2L \left[ f (w) - f(w^{*}) \right] \end{array} $$

This proves the required lemma. □
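The first inequality in the proof above, $\frac{1}{2L}\|\nabla\phi_{B_{k}}(w)\|^{2} \le \phi_{B_{k}}(w) - \min_{w}\phi_{B_{k}}(w)$, is the standard bound for a function whose gradient is L-Lipschitz. As a sketch of the omitted step, apply (15) to $\phi_{B_{k}}$ with $x = w$ and $y = w - \frac{1}{L}\nabla\phi_{B_{k}}(w)$:

$$ \min_{u} \phi_{B_{k}}(u) \le \phi_{B_{k}}\left(w - \frac{1}{L}\nabla\phi_{B_{k}}(w)\right) \le \phi_{B_{k}}(w) - \frac{1}{L}\|\nabla\phi_{B_{k}}(w)\|^{2} + \frac{1}{2L}\|\nabla\phi_{B_{k}}(w)\|^{2} = \phi_{B_{k}}(w) - \frac{1}{2L}\|\nabla\phi_{B_{k}}(w)\|^{2}. $$

Here (15) applies to $\phi_{B_{k}}$ because it differs from the mini-batch average of the $f_{i}$ only by a constant and a linear term.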

Lemma 5 (Extension of Lemma 3.4 in [27] to mini-batches)

Under Assumption 1 for non-smooth regularizer, we have

$$ \mathbb{E} \left[\|\nabla_{B_{k}}f({w^{s}_{k}}) - \nabla_{B_{k}} f(w^{*})\|^{2}\right] \le 2L \left[ F({w^{s}_{k}}) - F(w^{*}) \right] $$

(24)

Proof

From inequality (23), we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E}[\| \nabla f_{B_{k}} (w) - \nabla f_{B_{k}}(w^{*})\|^{2} ] &\le& 2L \left[ f (w) - f(w^{*}) \right.\\ &&\left.- \nabla f(w^{*})^{T} (w-w^{*}) \right]\\ \end{array} $$

(25)

By optimality, there exist ξ ∈ ∂g(w^∗), such that, ∇F(w^∗) = ∇f(w^∗) + ξ = 0, we have

$$ \begin{array}{ll} \mathbb{E}[\| \nabla f_{B_{k}} (w) - \nabla f_{B_{k}}(w^{*})\|^{2} ] &\le 2L \left[ f (w) - f(w^{*}) + \xi^{T} (w-w^{*}) \right]\\ &\le 2L \left[ f (w) - f(w^{*}) + g(w) - g (w^{*}) \right]\\ &\le 2L \left[ F(w) - F(w^{*})\right] \end{array} $$

(26)

second inequality follows from the convexity of g. This proves the required lemma. □

Lemma 6 (Variance Bound for smooth problem)

Under Assumption 1 and taking $\nabla _{B_{k}} f({w^{s}_{k}}) = \frac {1}{b} {\sum }_{i \in B_{k}} \nabla f_{i} ({w^{s}_{k}})$, $ \nabla _{B^{\prime }_{k}} f(\tilde {w}^{s-1}) = \frac {1}{n} {\sum }_{i \in B_{k}} \nabla f_{i} (\tilde {w}^{s-1})$, $\tilde {\mu }^{s} = \frac {1}{n} {\sum }_{i = 1}^{n} \nabla f_{i}(\tilde {w}^{s-1})$ and the gradient estimator $\tilde {\nabla }_{s,k} = \nabla _{B_{k}} f({w^{s}_{k}}) - \nabla _{B^{\prime }_{k}} f(\tilde {w}^{s-1}) +\tilde {\mu }^{s} $, the variance satisfies the following inequality (see Note 3),

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[\| \tilde{\nabla}_{s,k} -\nabla f({w^{s}_{k}})\|^{2} \right] &\le& 8L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} \right]\\&&+ \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}}\\ &&\times\left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime}\\ \end{array} $$

(27)

where α(b) = (n − b)/(b(n − 1)) and $R^{\prime }$ is a constant.

Proof

First the expectation of estimator is given by

$$ \begin{array}{ll} \mathbb{E}\left[\tilde{\nabla}_{s,k} \right] &= \mathbb{E}\left[\nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B^{\prime}_{k}} f(\tilde{w}^{s-1}) +\tilde{\mu}^{s} \right]\\ &= \nabla f({w^{s}_{k}}) - \frac{b}{n} \nabla f(\tilde{w}^{s-1}) + \nabla f(\tilde{w}^{s-1})\\ & = \nabla f({w^{s}_{k}}) + \frac{m-1}{m} \nabla f(\tilde{w}^{s-1}), \end{array} $$

(28)

where the last equality follows since n = mb. Now the variance bound is calculated as follows,

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E} \left[ \| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}}) \|^{2}\right]\\ &=& \mathbb{E}\left[\| \nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B^{\prime}_{k}} f(\tilde{w}^{s-1}) +\nabla f (\tilde{w}^{s-1}) - \nabla f ({w^{s}_{k}})\|^{2}\right]\\ &=& \mathbb{E}\left[\| \nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B_{k}} f(\tilde{w}^{s-1}) +\nabla f (\tilde{w}^{s-1}) - \nabla f ({w^{s}_{k}}) + \frac{m-1}{m}\nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2}\right]\\ &\le& 2 \mathbb{E} \left[ \| \nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B_{k}} f(\tilde{w}^{s-1}) +\nabla f (\tilde{w}^{s-1}) - \nabla f ({w^{s}_{k}})\|^{2}\right] + \frac{2(m-1)^{2}}{m^{2}} \mathbb{E} \left[ \| \nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2} \right]\\ &\le& 8L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} + f(\tilde{w}^{s-1}) - f^{*} \right] + \frac{2(m-1)^{2}}{m^{2}} \mathbb{E} \left[ \| \nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2} \right] \end{array} $$

(29)

inequality follows from, $\|a+b\|^{2} \le 2\left (\|a\|^{2} + \|b\|^{2} \right )$ for $a,b\in \mathbb {R}^{d}$ and applying the Lemma 3.

$$ \begin{array}{@{}rcl@{}} \text{Now,} &&\frac{2(m-1)^{2}}{m^{2}} \mathbb{E} \left[ \| \nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2} \right]\\ &\le& \frac{2(m-1)^{2}}{m^{2}} \left[2 \mathbb{E} \|\nabla_{B_{k}} f(\tilde{w}^{s-1}) - \nabla_{B_{k}} f(w^{*}) \|^{2}\right.\\&&\qquad\quad\quad\quad\left. + 2 \mathbb{E} \| \nabla_{B_{k}} f(w^{*}) \|^{2} \right]\\ & \le& \frac{8L(m-1)^{2}}{m^{2}}\left[ f(\tilde{w}^{s-1}) - f(w^{*}) \right] + R^{\prime} \end{array} $$

(30)

first inequality follows from $\|a+b\|^{2} \le 2\left (\|a\|^{2} + \|b\|^{2} \right )$ for $a,b\in \mathbb {R}^{d}$, second inequality follows from Lemma 4 and assuming $\mathbb {E}\| \nabla _{B_{k}} f(w^{*}) \|^{2} \le R$, $\forall k$, and taking $R^{\prime } = \frac {2(m-1)^{2}}{m^{2}} R$. Now, substituting the above inequality in (29), we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E} \left[ \| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}}) \|^{2}\right]\\ &\le& 8L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} + f(\tilde{w}^{s-1}) - f^{*} \right]\\ &&+ \frac{8L(m-1)^{2}}{m^{2}}\left[ f(\tilde{w}^{s-1}) - f(w^{*}) \right] + R^{\prime}\\ &=& 8L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} \right]\\ &&+ \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}} \left[ \!f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime}, \end{array} $$

(31)

This proves the required lemma. □
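The bias expressed by Eq. (28) can be illustrated numerically. The sketch below assumes a least-squares loss for $f_{i}$ and mini-batches drawn uniformly without replacement (both are assumptions made for the example, as are the function names); it enumerates all mini-batches and compares the exact expectation of the estimator of Lemma 6 with $\nabla f(w) + \frac{m-1}{m}\nabla f(\tilde{w})$.

```python
from itertools import combinations
import numpy as np

def grad_i(w, x_i, y_i):
    """Gradient of the least-squares loss f_i(w) = 0.5 * (x_i^T w - y_i)^2."""
    return (x_i @ w - y_i) * x_i

def full_grad(w, X, y):
    """Gradient of f(w) = (1/n) * sum_i f_i(w)."""
    return np.mean([grad_i(w, X[i], y[i]) for i in range(X.shape[0])], axis=0)

def saag_estimator(w, w_snap, X, y, batch):
    """Estimator of Lemma 6: (1/b)-scaled batch gradient at w, minus the
    (1/n)-scaled batch sum at the snap point, plus the full snap gradient."""
    n = X.shape[0]
    g_batch = np.mean([grad_i(w, X[i], y[i]) for i in batch], axis=0)
    g_snap = np.sum([grad_i(w_snap, X[i], y[i]) for i in batch], axis=0) / n
    return g_batch - g_snap + full_grad(w_snap, X, y)

rng = np.random.default_rng(0)
n, b, d = 6, 2, 3
m = n // b                      # number of mini-batches per epoch, n = m * b
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
w, w_snap = rng.normal(size=d), rng.normal(size=d)

# Exact expectation over all equally likely mini-batches of size b.
expected = np.mean([saag_estimator(w, w_snap, X, y, batch)
                    for batch in combinations(range(n), b)], axis=0)
predicted = full_grad(w, X, y) + (m - 1) / m * full_grad(w_snap, X, y)
bias_gap = np.max(np.abs(expected - predicted))   # zero up to rounding error
```

The extra term $\frac{m-1}{m}\nabla f(\tilde{w})$ is what makes the estimator biased; it shrinks as the snap point approaches a minimizer of f.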

Lemma 7 (Variance Bound for non-smooth problem)

Under Assumption 1 and taking notations as in Lemma 6, the variance bound satisfies the following inequality,

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[\| \tilde{\nabla}_{s,k} -\nabla f({w^{s}_{k}})\|^{2} \right] &\le& 8L\alpha(b) \left[ F({w^{s}_{k}}) - F^{*} \right]\\ &&+ \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}}\\ &&\times\left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime}, \end{array} $$

(32)

where α(b) = (n − b)/(b(n − 1)) and $R^{\prime }$ is a constant.

Proof

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E} \left[ \| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}}) \|^{2}\right]\\ &=& \mathbb{E}\left[\| \nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B^{\prime}_{k}} f(\tilde{w}^{s-1}) +\nabla f (\tilde{w}^{s-1}) - \nabla f ({w^{s}_{k}})\|^{2}\right]\\ &=& \mathbb{E}\left[\| \nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B_{k}} f(\tilde{w}^{s-1}) +\nabla f (\tilde{w}^{s-1}) - \nabla f ({w^{s}_{k}}) + \frac{m-1}{m}\nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2}\right]\\ &\le& 2 \mathbb{E} \left[ \| \nabla_{B_{k}} f({w^{s}_{k}}) - \nabla_{B_{k}} f(\tilde{w}^{s-1}) +\nabla f (\tilde{w}^{s-1}) - \nabla f ({w^{s}_{k}})\|^{2}\right] + \frac{2(m-1)^{2}}{m^{2}} \mathbb{E} \left[ \| \nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2} \right]\\ &\le& 8L\alpha(b) \left[ F({w^{s}_{k}}) - F({w^{*}}) + F(\tilde{w}^{s-1}) - F({w^{*}}) \right] + \frac{2(m-1)^{2}}{m^{2}} \mathbb{E} \left[ \| \nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2} \right] \end{array} $$

(33)

inequality follows from, $\|a+b\|^{2} \le 2\left (\|a\|^{2} + \|b\|^{2} \right )$ for $a,b\in \mathbb {R}^{d}$ and applying the Lemma 2.

$$ \begin{array}{@{}rcl@{}} \text{Now,} &&\frac{2(m-1)^{2}}{m^{2}} \mathbb{E} \left[ \| \nabla_{B_{k}} f(\tilde{w}^{s-1}) \|^{2} \right]\\ &\le& \frac{2(m-1)^{2}}{m^{2}} \left[2 \mathbb{E} \|\nabla_{B_{k}} f(\tilde{w}^{s-1}) - \nabla_{B_{k}} f(w^{*}) \|^{2}\right.\\ &&\qquad\quad\quad~\left.+ 2 \mathbb{E} \| \nabla_{B_{k}} f(w^{*}) \|^{2} \right]\\ & \le& \frac{8L(m-1)^{2}}{m^{2}}\left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime} \end{array} $$

(34)

first inequality follows from $\|a+b\|^{2} \le 2\left (\|a\|^{2} + \|b\|^{2} \right )$ for $a,b\in \mathbb {R}^{d}$, second inequality follows from Lemma 5 and assuming $\| \nabla _{B_{k}} f(w^{*}) \|^{2} \le R$, $\forall k$, and taking $R^{\prime } = \frac {2(m-1)^{2}}{m^{2}} R$. Now, substituting the above inequality in (33), we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E} \left[ \| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}}) \|^{2}\right]\\ &\le& 8L\alpha(b) \left[ F({w^{s}_{k}}) - F^{*} + F(\tilde{w}^{s-1}) - F^{*} \right]\\ &&+ \frac{8L(m-1)^{2}}{m^{2}}\left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime}\\ &=& 8L\alpha(b) \left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}}\\ &&\times\left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime}, \end{array} $$

(35)

This proves the required lemma. □

Proof of Theorem 1

(Non-strongly convex and smooth problem with SAAG-IV)

Proof

By smoothness, we have,

$$ \begin{array}{@{}rcl@{}} f(w^{s}_{k + 1}) &\le& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &&- \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ && + <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\\&&- \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}, \end{array} $$

(36)

where β is appropriately chosen positive value. Now, separately simplifying the terms, we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}\left[ f({w^{s}_{k}}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ &=& f({w^{s}_{k}}) + \mathbb{E}\left[<\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\right] + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}) + \frac{m-1}{m}\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ &\le& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ &&+ \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ & =& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ &&+ \frac{m-1}{m}\left[<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - w^{*}>- <\nabla f(\tilde{w}^{s-1}), {w^{s}_{k}} - w^{*}>\right]\\ &\le& f(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right]\\ &&+ \frac{m-1}{m}\left[\frac{1}{2\delta}\|\nabla f(\tilde{w}^{s-1})\|^{2} + \frac{\delta}{2} \|w^{s}_{k + 1} - w^{*}\|^{2} - \left[ \frac{1}{2\delta}\|\nabla f(\tilde{w}^{s-1})\|^{2} +\frac{\delta}{2} \|{w^{s}_{k}} - w^{*}\|^{2}\right]\right]\\ &=& f(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{\delta(m-1)}{2m}\left[\|w^{s}_{k + 1} - w^{*}\|^{2} - \|{w^{s}_{k}} - w^{*}\|^{2}\right],\\ &=& f(w^{*}) + \left( \frac{L\beta}{2} - \frac{\delta(m-1)}{2m}\right) \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2} \right],\\ &=& f(w^{*}), 
\end{array} $$

(37)

second equality follows from, $\mathbb {E}\left [\tilde {\nabla }_{s,k}\right ] = \nabla f({w^{s}_{k}}) + \frac {m-1}{m}\nabla f(\tilde {w}^{s-1})$, first inequality follows from Lemma 1, second inequality follows from the convexity, i.e., $ f(w^{*}) \ge f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}>$ and Young’s inequality, i.e., x^Ty ≤ 1/(2δ)∥x∥² + δ/2∥y∥² for δ > 0, and last equality follows by choosing $\delta = \frac {mL\beta }{(m-1)}$.

$$ \begin{array}{@{}rcl@{}} \text{and}, &&\mathbb{E} \left[ <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ &\le& \mathbb{E} \left[ \frac{1}{2L(\beta-1)} \|\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}\|^{2} + \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ &=& \frac{1}{2L(\beta-1)} \mathbb{E} \left[\| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}})\|^{2} \right]\\ &\le& \frac{1}{2L(\beta-1)} \left[8L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime}\right]\\ &=& \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$

(38)

first inequality follows from Young’s inequality, second inequality follows from Lemma 6 and $R^{\prime \prime } = R^{\prime }/(2L(\beta -1))$. Now, substituting the values into (36) from inequalities (37) and (38), and taking expectation w.r.t. mini-batches, we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[ f(w^{s}_{k + 1})\right] &\le& f(w^{*}) + \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m - 1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime}\\ \mathbb{E} \left[ f(w^{s}_{k + 1}) - f(w^{*})\right] &\le& \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right]\\ &&+ \frac{4\left( \alpha(b)m^{2} + (m - 1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$

(39)

Taking sum over k = 0, 1,..., (m − 1) and dividing by m, we have

$$ \begin{array}{@{}rcl@{}} &&\frac{1}{m}{\sum}_{k = 0}^{m-1}\mathbb{E} \left[ f(w^{s}_{k + 1}) - f^{*}\right]\\ &\le& \frac{1}{m}{\sum}_{k = 0}^{m-1} \left[ \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime}\right]\\ &&\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ f({w^{s}_{k}}) - f^{*}\right]\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}{\sum}_{k = 1}^{m} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ f({w^{s}_{0}}) - f^{*} - \lbrace f({w^{s}_{m}}) - f^{*} \rbrace\right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f (\tilde{w}^{s-1}) - f^{*} \right]+ R^{\prime\prime} \end{array} $$

(40)

Subtracting $\frac {4\alpha (b)}{(\beta -1)} \frac {1}{m}{\sum }_{k = 1}^{m} \left [ f({w^{s}_{k}}) - f^{*} \right ] $ from both sides, we have

$$ \begin{array}{@{}rcl@{}} &&\left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ f({w^{s}_{k}}) - f^{*}\right]\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ f({w^{s}_{0}}) - f^{*} - \lbrace f({w^{s}_{m}}) - f^{*} \rbrace\right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f (\tilde{w}^{s-1}) - f^{*} \right]+ R^{\prime\prime} \end{array} $$

(41)

Since $f({w^{s}_{m}}) - f^{*} \ge 0$, dropping this term and using Assumption 3, we have

$$ \begin{array}{@{}rcl@{}} &&\left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ f({w^{s}_{k}}) - f^{*}\right]\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ f({w^{s}_{0}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f (\tilde{w}^{s-1}) - f^{*} \right]+ R^{\prime\prime}\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ c\left( f (\tilde{w}^{s-1}) - f^{*}\right) \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f (\tilde{w}^{s-1}) - f^{*} \right]+ R^{\prime\prime}\\ &=&\left[ \frac{4\alpha(b)}{(\beta-1)}\frac{c}{m} +\frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \right] \left[ \left( f (\tilde{w}^{s-1}) - f^{*}\right) \right]+ R^{\prime\prime}, \end{array} $$

(42)

Dividing both sides by $\left (1-\frac {4\alpha (b)}{(\beta -1)}\right )$ and noting that, since $\tilde {w}^{s} = 1/m {\sum }_{k = 1}^{m}{w^{s}_{k}}$, convexity gives $f(\tilde {w}^{s}) \le 1/m {\sum }_{k = 1}^{m} f({w^{s}_{k}})$, we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[ f(\tilde{w}^{s}) - f^{*}\right] &\le& \!\left[ \!\frac{4\alpha(b)}{(\beta - 1 - 4\alpha(b))}\frac{c}{m} + \frac{4\left( \alpha(b)m^{2} + (m-1)^{2}\right)}{m^{2}(\beta - 1 - 4\alpha(b))} \right]\\ &&\times\left[ f (\tilde{w}^{s-1}) - f^{*} \right]+ R^{\prime\prime\prime}, \end{array} $$

(43)

where $R^{\prime \prime \prime } = R^{\prime \prime } (\beta -1)/\left (\beta -1-4\alpha (b)\right )$. Now, applying this inequality recursively, we have

$$ \begin{array}{ll} \mathbb{E} \left[ f(\tilde{w}^{s}) - f^{*}\right] \le C^{s} \left[ f (\tilde{w}^{0}) - f^{*} \right] + R^{\prime\prime\prime\prime}, \end{array} $$

(44)

The inequality follows with $R^{\prime \prime \prime \prime }= R^{\prime \prime \prime }/(1-C)$, since ${\sum }_{i = 0}^{k} r^{i} \le {\sum }_{i = 0}^{\infty } r^{i} = \frac {1}{1-r}$ for $0 \le r < 1$, where $C = \left [ \frac {4\alpha (b)}{(\beta -1-4\alpha (b))}\frac {c}{m} +\frac {4\left (\alpha (b)m^{2}+(m-1)^{2}\right )}{m^{2}(\beta -1-4\alpha (b))} \right ]$. For a suitable choice of β, one can show that C < 1. This proves linear convergence with some initial error. □
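As a numerical sanity check on the claim that C < 1 for a suitable β, the following Python sketch evaluates the constant C defined above. The function name `contraction_factor` and the sample values of α(b), c, m and β are illustrative assumptions and are not taken from the paper; α(b) is passed in as a plain number because its definition lies outside this section.

```python
def contraction_factor(alpha, beta, c, m):
    """Evaluate the constant C of the bound above.

    alpha stands for alpha(b); it is supplied directly (an assumption
    of this sketch) rather than computed from the mini-batch size b.
    """
    denom = beta - 1.0 - 4.0 * alpha
    if denom <= 0.0:
        raise ValueError("need beta > 1 + 4*alpha for a positive denominator")
    first = 4.0 * alpha / denom * c / m
    second = 4.0 * (alpha * m**2 + (m - 1) ** 2) / (m**2 * denom)
    return first + second


# Hypothetical values chosen only for illustration.
C = contraction_factor(alpha=0.01, beta=10.0, c=1.0, m=100)
print(f"C = {C:.4f}")
```

For these sample values the factor is below one; smaller values of β can push it above one, which is why the choice of β matters.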

Proof of Theorem 2

(Strongly convex and smooth problem with SAAG-IV)

Proof

By smoothness, we have,

$$ \begin{array}{@{}rcl@{}} f(w^{s}_{k + 1}) &\le& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ &&+ \frac{L}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ && - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ && + <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\\&&- \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \end{array} $$

(45)

where β is an appropriately chosen positive value. Now, simplifying the terms separately, we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}\left[ f({w^{s}_{k}}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ &=& f({w^{s}_{k}}) + \mathbb{E}\left[<\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\right] + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}) + \frac{m-1}{m}\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ &\le& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ &&+ \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ &=& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ &&+ \frac{m-1}{m}\left[<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - \tilde{w}^{s-1}>- <\nabla f(\tilde{w}^{s-1}), {w^{s}_{k}} - \tilde{w}^{s-1}>\right]\\ &\le& f(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right]\\ &+& \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f(\tilde{w}^{s-1}) - \left( f({w^{s}_{k}}) - f(\tilde{w}^{s-1})\right)\right]\\ &=& f(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right], \end{array} $$

(46)

The second equality follows from $\mathbb {E}\left [\tilde {\nabla }_{s,k}\right ] = \nabla f({w^{s}_{k}}) + \frac {m-1}{m}\nabla f(\tilde {w}^{s-1})$, the first inequality follows from Lemma 1, and the second inequality follows from convexity, i.e., $f(x) \ge f(y) + <\nabla f(y), x - y>$.

$$ \begin{array}{@{}rcl@{}} \text{and}, &&\mathbb{E} \left[ <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ &\le& \mathbb{E} \left[ \frac{1}{2L(\beta-1)} \|\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}\|^{2} + \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ &\le& \frac{1}{2L(\beta-1)} \left[8L\alpha(b) \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime}\right]\\ &=& \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$

(47)

The first inequality follows from Young’s inequality and the second inequality follows from Lemma 6, with $R^{\prime \prime } = R^{\prime }/(2L(\beta -1))$.
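The Young's inequality step above uses the form $<a, b> \le \frac{1}{2\delta}\|a\|^{2} + \frac{\delta}{2}\|b\|^{2}$ with $\delta = L(\beta-1)$. The following sketch checks this bound on random vectors; the helper `young_gap` and the values of L and β are hypothetical and serve only as an illustration.

```python
import random


def young_gap(a, b, delta):
    """Return RHS - LHS of <a, b> <= ||a||^2/(2*delta) + delta*||b||^2/2."""
    inner = sum(x * y for x, y in zip(a, b))
    norm_a_sq = sum(x * x for x in a)
    norm_b_sq = sum(y * y for y in b)
    return norm_a_sq / (2.0 * delta) + delta * norm_b_sq / 2.0 - inner


rng = random.Random(0)  # fixed seed so the check is reproducible
L, beta = 2.0, 3.0      # hypothetical smoothness constant and beta
delta = L * (beta - 1.0)
gaps = []
for _ in range(1000):
    a = [rng.gauss(0.0, 1.0) for _ in range(5)]
    b = [rng.gauss(0.0, 1.0) for _ in range(5)]
    gaps.append(young_gap(a, b, delta))
print(min(gaps) >= -1e-12)
```

The gap equals $\frac{1}{2\delta}\|a - \delta b\|^{2}$, so it vanishes exactly when $a = \delta b$.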

Now, substituting the values into (45) from inequalities (46) and (47), and taking expectation w.r.t. mini-batches, we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E} \left[ f(w^{s}_{k + 1})\right]\\ &\le& f(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right]\\ &&+ \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime},\\ &&\mathbb{E} \left[ f(w^{s}_{k + 1})-f(w^{*})\right]\\&\le& \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right]\\ &&+ \frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$

(48)

Taking the sum over k = 0, 1, ..., m − 1 and dividing by m, we have

$$ \begin{array}{@{}rcl@{}} &&\frac{1}{m}{\sum}_{k = 0}^{m-1}\mathbb{E} \left[ f(w^{s}_{k + 1})-f(w^{*})\right]\\ &\le& \frac{1}{m}{\sum}_{k = 0}^{m-1}\left\lbrace \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right]\right\rbrace\\ &&+ \frac{1}{m}{\sum}_{k = 0}^{m-1}\left\lbrace\frac{4\alpha(b)}{(\beta-1)} \left[ f({w^{s}_{k}}) - f^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime}\right\rbrace\\ &&\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ f({w^{s}_{k}})-f(w^{*})\right]\\ &\le& \frac{L\beta}{2m} \left[ \| w^{*} - {w^{s}_{0}} \|^{2} - \| w^{*} - {w^{s}_{m}}\|^{2}\right] + \frac{m-1}{m^{2}}\left[ f({w^{s}_{m}}) - f({w^{s}_{0}})\right]\\ &&+ \frac{4\alpha(b)}{(\beta-1)}\frac{1}{m}\left\lbrace {\sum}_{k = 1}^{m}\left[ f({w^{s}_{k}}) - f^{*} \right] + f({w^{s}_{0}}) - f^{*} - \left( f({w^{s}_{m}}) - f^{*} \right)\right\rbrace\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$

(49)
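The averaging step above relies on two elementary identities: a telescoping sum, $\frac{1}{m}\sum_{k=0}^{m-1}(a_{k} - a_{k+1}) = \frac{1}{m}(a_{0} - a_{m})$, and an index shift, $\sum_{k=0}^{m-1} a_{k} = \sum_{k=1}^{m} a_{k} + a_{0} - a_{m}$. A minimal sketch checking both on arbitrary numbers follows; the sequence `a` is a hypothetical stand-in for any per-iterate quantity.

```python
import random

rng = random.Random(1)
m = 20
# a[k] is a stand-in for any per-iterate quantity, e.g. f(w_k^s) - f^*.
a = [rng.random() for _ in range(m + 1)]

# Telescoping: consecutive differences collapse to the end points.
telescoped = sum(a[k] - a[k + 1] for k in range(m)) / m
closed_form = (a[0] - a[m]) / m

# Index shift: a sum over k = 0..m-1 rewritten as a sum over k = 1..m.
shifted_lhs = sum(a[k] for k in range(m)) / m
shifted_rhs = (sum(a[k] for k in range(1, m + 1)) + a[0] - a[m]) / m

print(abs(telescoped - closed_form) < 1e-12)
print(abs(shifted_lhs - shifted_rhs) < 1e-12)
```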

Subtracting $\frac {4\alpha (b)}{(\beta -1)}\frac {1}{m}{\sum }_{k = 1}^{m}\left [ f({w^{s}_{k}}) - f^{*} \right ]$ from both sides, we have

$$ \begin{array}{@{}rcl@{}} &&\left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ f({w^{s}_{k}}) - f^{*}\right]\\ &\le& \frac{L\beta}{2m} \left[ \| w^{*} - {w^{s}_{0}} \|^{2} - \| w^{*} - {w^{s}_{m}}\|^{2}\right] - \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}\right) \left[ f({w^{s}_{0}}) -f({w^{s}_{m}}) \right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime}\\ &\le& \frac{L\beta}{2m} \| w^{*} - {w^{s}_{0}} \|^{2} - \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}\right) \left[ f({w^{s}_{0}}) -f^{*} - \lbrace f({w^{s}_{m}}) - f^{*}\rbrace \right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime}\\ &\le& \frac{L\beta}{2m} \frac{2}{\mu}\left( f({w^{s}_{0}})- f^{*} \right) - \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}\right) \left[ c\left[ f(\tilde{w}^{s-1}) - f^{*} \right] - c\left[ f(\tilde{w}^{s}) - f^{*} \right] \right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime}\\ &\le& \frac{L\beta}{2m} \frac{2}{\mu}c\left[ f(\tilde{w}^{s-1}) - f^{*} \right] - \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}\right) \left[ c\left[ f(\tilde{w}^{s-1}) - f^{*} \right] - c\left[ f(\tilde{w}^{s}) - f^{*} \right] \right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$

(50)

The second inequality follows by dropping the nonpositive term $-\| w^{*} - {w^{s}_{m}}\|^{2} \le 0$, the third inequality follows from strong convexity, i.e., $\| {w^{s}_{0}} - w^{*}\|^{2} \le 2/ \mu \left (f({w^{s}_{0}})- f^{*} \right ) $, together with two applications of Assumption 3, and the fourth inequality follows from Assumption 3.
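The strong-convexity bound $\| w - w^{*}\|^{2} \le \frac{2}{\mu}\left( f(w) - f^{*}\right)$ can be illustrated on a simple example. The sketch below uses a diagonal quadratic, which is an assumption made purely for illustration and is not the objective studied in the paper.

```python
import random


def f(w, d):
    """Diagonal quadratic f(w) = 0.5 * sum_i d_i * w_i^2, minimised at w* = 0."""
    return 0.5 * sum(di * wi * wi for di, wi in zip(d, w))


rng = random.Random(2)
d = [0.5, 1.0, 4.0]  # hypothetical curvatures; mu is the smallest one
mu = min(d)
ok = True
for _ in range(1000):
    w = [rng.uniform(-10.0, 10.0) for _ in d]
    dist_sq = sum(wi * wi for wi in w)   # ||w - w*||^2 with w* = 0
    bound = 2.0 / mu * (f(w, d) - 0.0)   # (2/mu) * (f(w) - f*), f* = 0
    ok = ok and dist_sq <= bound + 1e-9
print(ok)
```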

Since $\tilde {w}^{s} = 1/m {\sum }_{k = 1}^{m}{w^{s}_{k}}$, convexity gives $f(\tilde {w}^{s}) \le 1/m {\sum }_{k = 1}^{m} f({w^{s}_{k}})$, and we have

$$ \begin{array}{@{}rcl@{}} &&\left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\mathbb{E} \left[ f(\tilde{w}^{s}) - f^{*}\right]\\ &\le& \frac{L\beta}{2m} \frac{2}{\mu}c\left[ f(\tilde{w}^{s-1}) - f^{*} \right] - \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}\right)\\ &&\times\left[ c\left[ f(\tilde{w}^{s-1}) - f^{*} \right] - c\left[ f(\tilde{w}^{s}) - f^{*} \right] \right]\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ f(\tilde{w}^{s-1}) - f^{*} \right] + R^{\prime\prime} \end{array} $$

(51)
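The convexity step that replaces the average of function values by the value at the averaged iterate is an instance of Jensen's inequality. A minimal sketch follows, assuming the squared Euclidean norm as a convex test function and random vectors as hypothetical inner iterates.

```python
import random


def f(w):
    """A simple convex test function: the squared Euclidean norm."""
    return sum(x * x for x in w)


rng = random.Random(3)
m, dim = 10, 4
iterates = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(m)]

# Snap point taken as the average of the m inner iterates.
w_tilde = [sum(w[j] for w in iterates) / m for j in range(dim)]

lhs = f(w_tilde)                         # f at the averaged point
rhs = sum(f(w) for w in iterates) / m    # average of f over the iterates
print(lhs <= rhs)
```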

Subtracting $ c\left (\frac {m-1}{m^{2}} - \frac {4\alpha (b)}{(\beta -1)} \frac {1}{m}\right ) \mathbb {E} \left [ f(\tilde {w}^{s}) - f^{*}\right ]$ from both sides, we have

$$ \begin{array}{@{}rcl@{}} &&\left( 1-\frac{4\alpha(b)}{(\beta-1)} - c\left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}\right)\right) \mathbb{E} \left[ f(\tilde{w}^{s}) - f^{*}\right]\\ &\le& \left[\frac{cL\beta}{m\mu} + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} -\frac{c(m-1)}{m^{2}} + \frac{4c\alpha(b)}{m(\beta-1)}\right]\\ &&\times\left[ f(\tilde{w}^{s-1}) -f^{*} \right]+ R^{\prime\prime} \end{array} $$

(52)

Dividing both sides by $ \left (1-\frac {4\alpha (b)}{(\beta -1)} - c\left (\frac {m-1}{m^{2}} - \frac {4\alpha (b)}{(\beta -1)} \frac {1}{m}\right )\right )$, we have

$$ \mathbb{E} \left[ f(\tilde{w}^{s}) - f^{*}\right] \le C\left[ f(\tilde{w}^{s-1}) -f^{*} \right]+ R^{\prime\prime\prime} $$

(53)

where $C = \left [\frac {cL\beta }{m\mu } + \frac {4\left (\alpha (b)m^{2}+(m-1)^{2}\right )}{m^{2}(\beta -1)} -\frac {c(m-1)}{m^{2}} + \frac {4c\alpha (b)}{m(\beta -1)}\right ]\left (1-\frac {4\alpha (b)}{(\beta -1)} - c\left (\frac {m-1}{m^{2}} - \frac {4\alpha (b)}{(\beta -1)} \frac {1}{m}\right )\right )^{-1}$ and $R^{\prime \prime \prime }= R^{\prime \prime }\left (1-\frac {4\alpha (b)}{(\beta -1)} - c\left (\frac {m-1}{m^{2}} - \frac {4\alpha (b)}{(\beta -1)} \frac {1}{m}\right )\right )^{-1}$. Now, recursively applying the inequality, we have

$$ \mathbb{E} \left[ f(\tilde{w}^{s}) - f^{*}\right] \le C^{s}\left[ f(\tilde{w}^{0}) -f^{*} \right]+ R^{\prime\prime\prime\prime}, $$

(54)

The inequality follows with $R^{\prime \prime \prime \prime }= R^{\prime \prime \prime }/(1-C)$, since ${\sum }_{i = 0}^{k} r^{i} \le {\sum }_{i = 0}^{\infty } r^{i} = \frac {1}{1-r}$ for $0 \le r < 1$. For a suitable choice of β, one can show that C < 1. This proves linear convergence with some initial error. □
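The recursive step can be illustrated numerically: iterating $e_{s} \le C e_{s-1} + R$ gives $e_{s} \le C^{s} e_{0} + R/(1-C)$ through the geometric series. The sketch below simulates the worst case of the recursion; the values of `e0`, `C` and `R` are hypothetical and only stand in for the initial gap, the contraction factor and the residual term.

```python
def unrolled_bound(e0, C, R, s):
    """Closed-form bound C**s * e0 + R / (1 - C), valid for 0 <= C < 1."""
    return C**s * e0 + R / (1.0 - C)


# Hypothetical values: initial gap e0, contraction factor C, residual R.
e0, C, R = 5.0, 0.6, 0.01
e = e0
within_bound = True
for s in range(1, 51):
    e = C * e + R  # worst case of e_s <= C * e_{s-1} + R
    within_bound = within_bound and e <= unrolled_bound(e0, C, R, s) + 1e-12
print(within_bound)
```

The gap decays geometrically towards the floor R/(1 − C) rather than towards zero, which is the sense in which the bound carries a residual error term.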

Proof of Theorem 3

(Non-strongly convex and non-smooth problem with SAAG-IV)

Proof

By smoothness, we have,

$$ \begin{array}{@{}rcl@{}} f(w^{s}_{k + 1}) &\le& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> \\ &&+ \frac{L}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \end{array} $$

(55)

$$ \begin{array}{@{}rcl@{}} \text{Now,} &&F(w^{s}_{k + 1}) = f(w^{s}_{k + 1}) + g(w^{s}_{k + 1})\\ &\le& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ &&+ \frac{L}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ &&+ \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ &=& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\\ &&+ \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\\ &&- \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \end{array} $$

(56)

where β is an appropriately chosen positive value. Now, simplifying the terms separately, we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}\left[ f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + \mathbb{E}\left[<\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\right] + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}) + \frac{m-1}{m}\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ & \le& f({w^{s}_{k}}) + g(w^{*}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ && + \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ & =& f({w^{s}_{k}}) + g(w^{*}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ && + \frac{m-1}{m}\left[<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - w^{*}>- <\nabla f(\tilde{w}^{s-1}), {w^{s}_{k}} - w^{*}>\right]\\ & \le& f(w^{*}) + g(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right]\\ && + \frac{m-1}{m}\left[\frac{1}{2\delta}\|\nabla f(\tilde{w}^{s-1})\|^{2} + \frac{\delta}{2} \|w^{s}_{k + 1} - w^{*}\|^{2} - \left[ \frac{1}{2\delta}\|\nabla f(\tilde{w}^{s-1})\|^{2} +\frac{\delta}{2} \|{w^{s}_{k}} - w^{*}\|^{2}\right]\right]\\ & =& F(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{\delta(m-1)}{2m}\left[\|w^{s}_{k + 1} - w^{*}\|^{2} - \|{w^{s}_{k}} - w^{*}\|^{2}\right],\\ & =& F(w^{*}) + \left( \frac{L\beta}{2} - 
\frac{\delta(m-1)}{2m}\right) \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2} \right],\\ & =&F(w^{*}), \end{array} $$

(57)

The second equality follows from $\mathbb {E}\left [\tilde {\nabla }_{s,k}\right ] = \nabla f({w^{s}_{k}}) + \frac {m-1}{m}\nabla f(\tilde {w}^{s-1})$, the first inequality follows from Lemma 1, the second inequality follows from convexity, i.e., $f(w^{*}) \ge f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}>$, and Young’s inequality, i.e., $x^{T}y \le \frac{1}{2\delta}\|x\|^{2} + \frac{\delta}{2}\|y\|^{2}$ for $\delta > 0$, and the last equality follows by choosing $\delta = \frac {mL\beta }{(m-1)}$.

$$ \begin{array}{@{}rcl@{}} \text{And}, &&\mathbb{E} \left[ <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ & \le& \mathbb{E} \left[ \frac{1}{2L(\beta-1)} \|\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}\|^{2} + \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ & =& \frac{1}{2L(\beta-1)} \mathbb{E} \left[\| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}})\|^{2} \right]\\ & \le& \frac{1}{2L(\beta-1)}\left[8L\alpha(b) \left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime}\right]\\ & =& \frac{4\alpha(b)} {(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime} \end{array} $$

(58)

The first inequality follows from Young’s inequality and the second inequality follows from Lemma 7, with $R^{\prime \prime } = R^{\prime }/(2L(\beta -1))$. Now, substituting the values into (56) from inequalities (57) and (58), and taking expectation w.r.t. mini-batches, we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[ F(w^{s}_{k + 1})\right] &\le& F(w^{*}) + \frac{4\alpha(b)}{(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right]\\ &&+ \frac{4\left( \alpha(b)m^{2} + (m - 1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime}\\ \mathbb{E} \left[ F(w^{s}_{k + 1}) - F(w^{*})\right] &\le& \frac{4\alpha(b)}{(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right]\\ &&+ \frac{4\left( \alpha(b)m^{2} + (m - 1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime} \end{array} $$

(59)

Taking the sum over k = 0, 1, ..., m − 1 and dividing by m, we have

$$ \begin{array}{@{}rcl@{}} &&\frac{1}{m}{\sum}_{k = 0}^{m-1}\mathbb{E} \left[ F(w^{s}_{k + 1})-F(w^{*})\right]\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m}{\sum}_{k = 0}^{m-1} \left[ F({w^{s}_{k}}) - F(w^{*}) \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime}\\ &&\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ F({w^{s}_{k}}) - F(w^{*})\right]\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left\lbrace {\sum}_{k = 1}^{m} \left[ F({w^{s}_{k}}) - F(w^{*}) \right] + F({w^{s}_{0}}) - F(w^{*}) - \lbrace F({w^{s}_{m}}) - F(w^{*}) \rbrace \right\rbrace\\ &&+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime} \end{array} $$

(60)

Subtracting $\frac {4\alpha (b)}{(\beta -1)} \frac {1}{m}{\sum }_{k = 1}^{m} \left [ F({w^{s}_{k}}) - F(w^{*})\right ] $ from both sides, we have

$$ \begin{array}{@{}rcl@{}} \!\!&&\left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ F({w^{s}_{k}}) - F(w^{*})\right]\\ \!\!& \le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ F({w^{s}_{0}}) - F(w^{*}) - \lbrace F({w^{s}_{m}}) - F(w^{*}) \rbrace\right]\\ \!\!&& + \frac{4\left( \alpha(b)m^{2} + (m - 1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime} \end{array} $$

(61)

Since $F({w^{s}_{m}}) - F(w^{*}) \ge 0$, dropping this term and using Assumption 3, we have

$$ \begin{array}{@{}rcl@{}} &&\left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ F({w^{s}_{k}}) - F(w^{*})\right]\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ F({w^{s}_{0}}) - F(w^{*}) \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime}\\ &\le& \frac{4\alpha(b)}{(\beta-1)} \frac{1}{m} \left[ c\left( F(\tilde{w}^{s-1}) - F(w^{*})\right) \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime}\\ &=& \left( \frac{4\alpha(b)}{(\beta-1)} \frac{c}{m}+ \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \right) \left[ \left( F(\tilde{w}^{s-1}) - F(w^{*})\right) \right] + R^{\prime\prime}, \end{array} $$

(62)

Dividing both sides by $\left (1-\frac {4\alpha (b)}{(\beta -1)}\right )$ and noting that, since $\tilde {w}^{s} = 1/m {\sum }_{k = 1}^{m}{w^{s}_{k}}$, convexity gives $F(\tilde {w}^{s}) \le 1/m {\sum }_{k = 1}^{m} F({w^{s}_{k}})$, we have

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[ F(\tilde{w}^{s}) - F^{*}\right]\! &\le&\! \left( \! \frac{4\alpha(b)}{(\beta - 1 - 4\alpha(b))} \frac{c}{m}+ \frac{4\left( \alpha(b)m^{2} + (m-1)^{2}\right)}{m^{2}(\beta - 1 - 4\alpha(b))}\!\right)\\ &&\times\left[ F (\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime}, \end{array} $$

(63)

where $ R^{\prime \prime \prime }= R^{\prime \prime } \left (1-\frac {4\alpha (b)}{(\beta -1)}\right )^{-1}$. Now, applying the above inequality recursively, we have

$$ \begin{array}{ll} \mathbb{E} \left[ F(\tilde{w}^{s}) - F(w^{*})\right] \le C^{s} \left[ F(\tilde{w}^{0}) - F(w^{*}) \right] + R^{\prime\prime\prime\prime}, \end{array} $$

(64)

The inequality follows with $R^{\prime \prime \prime \prime }= R^{\prime \prime \prime }/(1-C)$, since ${\sum }_{i = 0}^{k} r^{i} \le {\sum }_{i = 0}^{\infty } r^{i} = \frac {1}{1-r}$ for $0 \le r < 1$, where $C = \left (\frac {4\alpha (b)}{(\beta -1-4\alpha (b))} \frac {c}{m}+ \frac {4\left (\alpha (b)m^{2}+(m-1)^{2}\right )}{m^{2}(\beta -1-4\alpha (b))}\right ) $. For a suitable choice of β, one can show that C < 1. This proves linear convergence with some initial error. □
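To make the phrase "choice of β" concrete, one can rearrange C < 1 under the condition β − 1 − 4α(b) > 0, which gives β > 1 + 8α(b) + 4α(b)c/m + 4(m − 1)²/m². The sketch below evaluates this threshold and checks it against the constant C defined above; the rearrangement, the function names and the sample values are illustrative assumptions and do not appear in the paper.

```python
def min_beta(alpha, c, m):
    """Threshold on beta obtained by rearranging C < 1.

    Assumes beta - 1 - 4*alpha > 0; alpha stands for alpha(b).
    """
    return 1.0 + 8.0 * alpha + 4.0 * alpha * c / m + 4.0 * (m - 1) ** 2 / m**2


def contraction_factor(alpha, beta, c, m):
    """Constant C of the bound above, for beta > 1 + 4*alpha."""
    denom = beta - 1.0 - 4.0 * alpha
    numer = 4.0 * alpha * c / m + 4.0 * (alpha * m**2 + (m - 1) ** 2) / m**2
    return numer / denom


alpha, c, m = 0.01, 1.0, 100  # hypothetical values
beta_star = min_beta(alpha, c, m)
print(contraction_factor(alpha, beta_star + 0.5, c, m) < 1.0)
print(contraction_factor(alpha, beta_star - 0.5, c, m) > 1.0)
```

Since 4(m − 1)²/m² approaches 4 for large m, the threshold is close to 5 whenever α(b) is small.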

Proof of Theorem 4

(Strongly convex and non-smooth problem with SAAG-IV)

Proof

By smoothness, we have,

$$ \begin{array}{@{}rcl@{}} f(w^{s}_{k + 1}) & \le& f({w^{s}_{k}}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> \\ &&+ \frac{L}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ \end{array} $$

Now,

$$ \begin{array}{@{}rcl@{}} F(w^{s}_{k + 1}) &=& f(w^{s}_{k + 1}) + g(w^{s}_{k + 1})\\ & \le& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ &&+ \frac{L}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ && + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\\ && + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\\ && - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \end{array} $$

(65)

where β is an appropriately chosen positive value. Now, simplifying the terms separately, we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E}\left[ f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + \mathbb{E}\left[<\tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}>\right] + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}) + \frac{m-1}{m}\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2}\\ & =& f({w^{s}_{k}}) + g(w^{s}_{k + 1}) + <\nabla f({w^{s}_{k}}), w^{s}_{k + 1} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} + \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ & \le& f({w^{s}_{k}}) + g(w^{*}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ &&+ \frac{m-1}{m}<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - {w^{s}_{k}}>\\ & =& f({w^{s}_{k}}) + g(w^{*}) + <\nabla f({w^{s}_{k}}), w^{*} - {w^{s}_{k}}> + \frac{L\beta}{2} \| w^{*} - {w^{s}_{k}} \|^{2} - \frac{L\beta}{2} \| w^{*} - w^{s}_{k + 1}\|^{2}\\ && + \frac{m-1}{m}\left[<\nabla f(\tilde{w}^{s-1}), w^{s}_{k + 1} - \tilde{w}^{s-1}>- <\nabla f(\tilde{w}^{s-1}), {w^{s}_{k}} - \tilde{w}^{s-1}>\right]\\ & \le& f(w^{*}) + g(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right]\\ && + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f(\tilde{w}^{s-1}) - \left( f({w^{s}_{k}}) - f(\tilde{w}^{s-1})\right)\right]\\ & =& f(w^{*}) + g(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right],\\ & =& F(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right], 
\end{array} $$

(66)

The second equality follows from $\mathbb {E}\left [\tilde {\nabla }_{s,k}\right ] = \nabla f({w^{s}_{k}}) + \frac {m-1}{m}\nabla f(\tilde {w}^{s-1})$, the first inequality follows from Lemma 1, and the second inequality follows from convexity, i.e., $f(x) \ge f(y) + <\nabla f(y), x - y>$.

$$ \begin{array}{@{}rcl@{}} \text{And}, &&\mathbb{E} \left[ <\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}, w^{s}_{k + 1} - {w^{s}_{k}}> - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ & \le& \mathbb{E} \left[ \frac{1}{2L(\beta-1)} \|\nabla f({w^{s}_{k}}) - \tilde{\nabla}_{s,k}\|^{2} + \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} - \frac{L(\beta-1)}{2} \| w^{s}_{k + 1} - {w^{s}_{k}} \|^{2} \right]\\ & =& \frac{1}{2L(\beta-1)} \mathbb{E} \left[\| \tilde{\nabla}_{s,k} - \nabla f({w^{s}_{k}})\|^{2} \right]\\ & \le& \frac{1}{2L(\beta-1)}\left[8L\alpha(b) \left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{8L\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime}\right]\\ & =& \frac{4\alpha(b)} {(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime}, \end{array} $$

(67)

The first inequality follows from Young’s inequality and the second inequality follows from Lemma 7, with $R^{\prime \prime } = R^{\prime }/(2L(\beta -1))$. Now, substituting the values into (65) from inequalities (66) and (67), and taking expectation w.r.t. mini-batches, we have

$$ \begin{array}{@{}rcl@{}} &&\mathbb{E} \left[ F(w^{s}_{k + 1})\right]\\ & \le& F(w^{*}) + \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right]\\ &&+ \frac{4\alpha(b)} {(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime}\\ &&\mathbb{E} \left[ F(w^{s}_{k + 1})-F(w^{*})\right]\\ & \le& \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right]\\ &&+ \frac{4\alpha(b)} {(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime} \end{array} $$

(68)

Taking the sum over k = 0, 1, ..., m − 1 and dividing by m, we have

$$ \begin{array}{@{}rcl@{}} &&\frac{1}{m}{\sum}_{k = 0}^{m-1}\mathbb{E} \left[ F(w^{s}_{k + 1})-F(w^{*})\right]\\ & \le& \frac{1}{m}{\sum}_{k = 0}^{m-1}\left\lbrace \frac{L\beta}{2} \left[ \| w^{*} - {w^{s}_{k}} \|^{2} - \| w^{*} - w^{s}_{k + 1}\|^{2}\right] + \frac{m-1}{m}\left[ f(w^{s}_{k + 1}) - f({w^{s}_{k}})\right] \right\rbrace\\ &&+ \frac{1}{m}{\sum}_{k = 0}^{m-1}\left\lbrace\frac{4\alpha(b)} {(\beta-1)}\left[ F({w^{s}_{k}}) - F^{*} \right] + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F^{*} \right] + R^{\prime\prime} \right\rbrace\\ &&\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ F({w^{s}_{k}})-F(w^{*})\right]\\ & \le& \frac{L\beta}{2m} \left[ \| w^{*} - {w^{s}_{0}} \|^{2} - \| w^{*} - {w^{s}_{m}}\|^{2}\right] + \frac{m-1}{m^{2}}\left[ f({w^{s}_{m}}) - f({w^{s}_{0}})\right]\\ &&+ \frac{4\alpha(b)}{(\beta-1)}\frac{1}{m}\left\lbrace {\sum}_{k = 1}^{m}\left[ F({w^{s}_{k}}) - F(w^{*}) \right] + F({w^{s}_{0}}) - F(w^{*}) - \left( F({w^{s}_{m}}) - F(w^{*}) \right)\right\rbrace\\ && + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime} \end{array} $$

(69)

Subtracting $\frac {4\alpha (b)}{(\beta -1)}\frac {1}{m}{\sum }_{k = 1}^{m}\left [ F({w^{s}_{k}}) - F(w^{*}) \right ]$ from both sides, we have

$$ \begin{array}{@{}rcl@{}} && \left( 1-\frac{4\alpha(b)}{(\beta-1)}\right)\frac{1}{m}{\sum}_{k = 1}^{m}\mathbb{E} \left[ F({w^{s}_{k}}) - F(w^{*}) \right]\\ & \le& \frac{L\beta}{2m} \left[ \| w^{*} - {w^{s}_{0}} \|^{2} - \| w^{*} - {w^{s}_{m}}\|^{2}\right] + \frac{m-1}{m^{2}}\left[ f({w^{s}_{m}}) - f({w^{s}_{0}})\right]\\ &&+ \frac{4\alpha(b)}{(\beta-1)}\frac{1}{m}\left\lbrace F({w^{s}_{0}}) - F(w^{*}) - \left( F({w^{s}_{m}}) - F(w^{*}) \right)\right\rbrace\\ && + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime}\\ & \le& \frac{L\beta}{2m} \left[ \| w^{*} - {w^{s}_{0}} \|^{2} \right] + \frac{m-1}{m^{2}}\left[ F({w^{s}_{m}}) - F({w^{s}_{0}})\right] + \frac{4\alpha(b)}{(\beta-1)}\frac{1}{m}\left\lbrace F({w^{s}_{0}}) - F(w^{*}) - \left( F({w^{s}_{m}}) - F(w^{*}) \right)\right\rbrace\\ && + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime}\\ & \le& \frac{L\beta}{2m} \frac{2}{\mu} \left[ F({w^{s}_{0}}) - F(w^{*}) \right] + \frac{m-1}{m^{2}}\left[ F({w^{s}_{m}}) - F({w^{s}_{0}})\right] + \frac{4\alpha(b)}{(\beta-1)}\frac{1}{m}\left\lbrace F({w^{s}_{0}}) - F(w^{*}) - \left( F({w^{s}_{m}}) - F(w^{*}) \right)\right\rbrace\\ && + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime}\\ & =& \left( \frac{L\beta}{m\mu} + \frac{4\alpha(b)}{m(\beta-1)} - \frac{m-1}{m^{2}}\right) \left[ F({w^{s}_{0}}) - F(w^{*}) \right] + \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{m(\beta-1)} \right) \left[F({w^{s}_{m}}) - F(w^{*})\right]\\ && + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime}\\ & \le& \left( \frac{L\beta}{m\mu} + \frac{4\alpha(b)}{m(\beta-1)} - \frac{m-1}{m^{2}}\right) c \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + \left( \frac{m-1}{m^{2}} - 
\frac{4\alpha(b)}{m(\beta-1)} \right) c \left[F(\tilde{w}^{s}) - F(w^{*})\right]\\ && + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime}\\ & \le& \left( \frac{Lc\beta}{m\mu} + \frac{4c\alpha(b)}{m(\beta-1)} - \frac{c(m-1)}{m^{2}} + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)}\right) \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right]\\ &&+ \left( \frac{m-1}{m^{2}} - \frac{4\alpha(b)}{m(\beta-1)}\right)c \left[F(\tilde{w}^{s}) - F(w^{*})\right] + R^{\prime\prime\prime} \end{array} $$

(70)

The second inequality follows from dropping the term $\| w^{*} - {w^{s}_{m}}\|^{2} \ge 0$ and converting $f({w^{s}_{m}}) - f({w^{s}_{0}})$ to $F({w^{s}_{m}}) - F({w^{s}_{0}})$, which introduces an additional constant that is absorbed into the residual, $R^{\prime \prime \prime } = R^{\prime \prime } + (m-1)g({w^{s}_{0}})/m^{2}$. The third inequality follows from strong convexity, i.e., $\| {w^{s}_{0}} - w^{*}\|^{2} \le 2/ \mu \left (F({w^{s}_{0}})- F(w^{*}) \right ) $, and the fourth inequality follows from Assumption 3. Since $\tilde {w}^{s} = 1/m {\sum }_{k = 1}^{m}{w^{s}_{k}}$, convexity gives $F(\tilde {w}^{s}) \le 1/m {\sum }_{k = 1}^{m} F({w^{s}_{k}})$. Using this on the left-hand side and subtracting $\left (\frac {m-1}{m^{2}} - \frac {4\alpha (b)}{m(\beta -1)} \right ) c \left [F(\tilde {w}^{s}) - F(w^{*})\right ]$ from both sides, we have

$$ \begin{array}{@{}rcl@{}} \!\!&&\left( \!1 - \frac{4\alpha(b)}{(\beta - 1)} - \frac{c(m - 1)}{m^{2}} + \frac{4c\alpha(b)}{m(\beta - 1)}\!\right)\mathbb{E} \left[ F(\tilde{w}^{s}) - F(w^{*}) \right]\\ \!\!&\le& \!\left( \!\frac{Lc\beta}{m\mu} \!+ \frac{4c\alpha(b)}{m(\beta-1)} - \frac{c(m-1)}{m^{2}} + \frac{4\left( \alpha(b)m^{2}+(m-1)^{2}\right)}{m^{2}(\beta-1)}\right)\\ \!\!&&\times\left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime} \end{array} $$

(71)

Dividing both sides by $ \left (1-\frac {4\alpha (b)}{(\beta -1)} - \frac {c(m-1)}{m^{2}} + \frac {4c\alpha (b)}{m(\beta -1)}\right )$, we have

$$ \begin{array}{ll} &\mathbb{E} \left[ F(\tilde{w}^{s}) - F(w^{*}) \right] \le C \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime\prime} \end{array} $$

(72)

where $C = \left (\frac {Lc\beta }{m\mu } + \frac {4c\alpha (b)}{m(\beta -1)} - \frac {c(m-1)}{m^{2}} + \frac {4\left (\alpha (b)m^{2}+(m-1)^{2}\right )}{m^{2}(\beta -1)}\right )\left (1-\frac {4\alpha (b)}{(\beta -1)} - \frac {c(m-1)}{m^{2}} + \frac {4c\alpha (b)}{m(\beta -1)}\right )^{-1} $ and $R^{\prime \prime \prime \prime } = R^{\prime \prime \prime }\left (1-\frac {4\alpha (b)}{(\beta -1)} - \frac {c(m-1)}{m^{2}} + \frac {4c\alpha (b)}{m(\beta -1)}\right )^{-1} $. Now, applying this inequality recursively, we have

$$ \begin{array}{ll} \mathbb{E} \left[ F(\tilde{w}^{s}) - F(w^{*}) \right] \le C^{s} \left[ F(\tilde{w}^{0}) - F(w^{*}) \right] + R^{\prime\prime\prime\prime\prime}, \end{array} $$

(73)

where the inequality holds with $R^{\prime \prime \prime \prime \prime }= R^{\prime \prime \prime \prime }/(1-C)$, since ${\sum }_{i = 0}^{k} r^{i} \le {\sum }_{i = 0}^{\infty } r^{i} = \frac {1}{1-r}$ for $0 \le r<1$. For a suitable choice of β, one can easily prove that C < 1. This proves linear convergence up to the error term $R^{\prime \prime \prime \prime \prime }$. □
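
As a remark, the recursive step from (72) to (73) can be written out explicitly. The following is only a sketch: it assumes that $0 \le C < 1$, that $R^{\prime\prime\prime\prime} \ge 0$ bounds the residual uniformly over epochs, and that the expectation is taken over all epochs up to $s$. These assumptions are not restated in this part of the proof.

$$ \begin{array}{@{}rcl@{}} \mathbb{E} \left[ F(\tilde{w}^{s}) - F(w^{*}) \right] & \le& C\, \mathbb{E} \left[ F(\tilde{w}^{s-1}) - F(w^{*}) \right] + R^{\prime\prime\prime\prime}\\ & \le& C^{2}\, \mathbb{E} \left[ F(\tilde{w}^{s-2}) - F(w^{*}) \right] + (1 + C) R^{\prime\prime\prime\prime}\\ & \le& \cdots\\ & \le& C^{s} \left[ F(\tilde{w}^{0}) - F(w^{*}) \right] + R^{\prime\prime\prime\prime} {\sum}_{i = 0}^{s-1} C^{i}\\ & \le& C^{s} \left[ F(\tilde{w}^{0}) - F(w^{*}) \right] + \frac{R^{\prime\prime\prime\prime}}{1-C} \end{array} $$

Under these assumptions the first term decays geometrically in the number of epochs $s$, while the second term stays bounded by $R^{\prime\prime\prime\prime}/(1-C)$.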

Cite this article

Chauhan, V.K., Sharma, A. & Dahiya, K. SAAGs: Biased stochastic variance reduction methods for large-scale learning. Appl Intell 49, 3331–3361 (2019). https://doi.org/10.1007/s10489-019-01450-3

Published: 05 April 2019
Issue Date: 15 September 2019
DOI: https://doi.org/10.1007/s10489-019-01450-3
