Why random reshuffling beats stochastic gradient descent

  • Full Length Paper
  • Series A
Mathematical Programming

Abstract

We analyze the convergence rate of the random reshuffling (RR) method, a randomized first-order incremental algorithm for minimizing a finite sum of convex component functions. RR proceeds in cycles, picking a uniformly random order (permutation) and processing the component functions one at a time according to this order; that is, in each cycle the component functions are sampled without replacement from the collection. Although RR has been numerically observed to outperform its with-replacement counterpart, stochastic gradient descent (SGD), characterizing its convergence rate has been a long-standing open question. In this paper, we answer this question by providing various convergence rate results for RR and its variants when the sum function is strongly convex. We first focus on quadratic component functions and show that the expected distance of the iterates generated by RR with stepsize \(\alpha _k=\varTheta (1/k^s)\) for \(s\in (0,1]\) converges to zero at rate \(\mathcal{O}(1/k^s)\) (with \(s=1\) requiring the stepsize to be adjusted to the strong convexity constant). Our main result shows that when the component functions are quadratics or smooth (with a Lipschitz assumption on the Hessian matrices), RR with iterate averaging and a diminishing stepsize \(\alpha _k=\varTheta (1/k^s)\) for \(s\in (1/2,1)\) converges at rate \(\varTheta (1/k^{2s})\) with probability one in the suboptimality of the objective value, thus improving upon the \(\varOmega (1/k)\) rate of SGD. Our analysis draws on the theory of Polyak–Ruppert averaging and relies on decoupling the dependent cycle gradient error into a term that is independent over cycles and another term dominated by \(\alpha _k^2\). This allows us to apply the law of large numbers to an appropriately weighted version of the cycle gradient errors, where the weights depend on the stepsize. We also provide high-probability convergence rate estimates that show the decay rates of the different terms and allow us to propose a modification of RR with convergence rate \(\mathcal{O}(\frac{1}{k^2})\).
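
For readers who want a concrete picture of the method, the following is a minimal sketch of one way to implement RR (illustrative only; the function names, data structures and stepsize rule below are not taken from the paper):

```python
import numpy as np

def random_reshuffling(grads, x0, stepsize, num_cycles, rng=None):
    """Sketch of the RR method: each cycle processes the m component gradients
    in a freshly sampled random order (sampling without replacement).

    grads      -- list of m callables, grads[i](x) = gradient of f_i at x
    x0         -- initial iterate (1-D numpy array)
    stepsize   -- callable k -> alpha_k, e.g. lambda k: R / (k + 1) ** s
    num_cycles -- number of cycles (outer iterations)
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    m = len(grads)
    for k in range(num_cycles):
        alpha_k = stepsize(k)            # stepsize is held constant within a cycle
        order = rng.permutation(m)       # the random permutation sigma_k
        for i in order:                  # each component is used exactly once per cycle
            x = x - alpha_k * grads[i](x)
    return x
```

With-replacement SGD would instead draw `i = rng.integers(m)` independently at every inner step; the averaged variant analyzed in the paper would, roughly speaking, additionally average the cycle iterates \(x_0^k\) in the Polyak–Ruppert sense.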

Notes

  1. Such functions arise naturally in support vector machines and other regularized learning algorithms or regression problems (see e.g. [32, 36, 38]).

  2. There is some literature that analyzes SGD under correlated noise [25, Ch. 6], but the noise needs to have a special structure (such as a mixing property) which does not seem to be applicable to the analysis of RR.

  3. IG shows similar properties to SGD in terms of the robustness of the stepsize rules \(\alpha _k = R/k^s\). The convergence rate (in k) is only robust to the strong convexity constant of the objective for \(s<1\) but not for \(s=1\) [29, Section 2.1].

  4. To see this, note that the RR iterations for this example are given by \( x_0^{k+1} = (1-\frac{3}{2}\alpha _k + 2\alpha _k^2) x_0^k - \alpha _k^2 \mu (\sigma _k)\) which implies, after taking norms of both sides and using the fact that \(\Vert \mu (\sigma _k)\Vert \le 2\), \( \text {dist}_{k+1}\le (1-\frac{3}{2}\alpha _k + 2\alpha _k^2) \text {dist}_k + 2 \alpha _k^2.\) Then, by invoking classical results for the asymptotic behavior of non-negative sequences (see e.g. [6, Appendix A.4.3]), we get \(\text {dist}_{k+1}\rightarrow 0\). Theorem 1 also shows global convergence of RR on this example.

  5. The original result in [21, Theorem 3.1] was stated for \(\sigma =\{1,2,\dots ,m\}\) but here we translate this result into an arbitrary permutation \(\sigma \) of \(\{1,2,\dots ,m\}\) by noting that processing the set of functions \(\{f_1, f_2, \dots , f_m\}\) with order \(\sigma \) is equivalent to processing the permuted functions \(\{f_{\sigma _1}, f_{\sigma _2}, \dots , f_{\sigma _m}\}\) with order \(\{1,2,\dots ,m\}\).

  6. This is due to the fact that the sequence \(\alpha _k^2\) is summable when \(s>1/2\).

  7. Note that if this assumption holds and if \(f_i\) is three-times continuously differentiable on the compact set \(\mathcal {X}\), then the third-order derivatives are bounded and Assumption 2 holds.

  8. The quadratic functions \(f_i(x)\) have the form \(f_i(x) = x^T A_i x + q_i^T x + r_i\). The matrices \(A_i\) are chosen randomly satisfying \(A_i = \frac{1}{n} R_i R_i^T + \lambda I\), where I is the \(n\times n\) identity matrix, \(R_i\) is a random matrix with each entry uniform on the interval \([-50,50]\), and \(\lambda \) is a regularization parameter that makes the problem strongly convex. We set \(\lambda = 5\). The vectors \(q_i\) are random, with each component uniformly distributed on the interval \([-50,50]\), and \(r_i\) is uniform on the interval \([-1,1]\) (a generation sketch in code is given after these notes).

  9. We note that all experiments were performed on a MacBook Pro with a 3.1 GHz Intel Core i7 processor and 16 GB of RAM, using Matlab R2017a on macOS Sierra v10.12.5.
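
As referenced in note 8, the following is a sketch of how such random quadratics could be generated (one interpretation of the footnote; in particular, the footnote does not state the dimensions of \(R_i\), which are taken to be \(n\times n\) here):

```python
import numpy as np

def random_quadratic(n, lam=5.0, rng=None):
    """Generate (A_i, q_i, r_i) for f_i(x) = x^T A_i x + q_i^T x + r_i,
    following the recipe described in footnote 8 (one interpretation)."""
    rng = np.random.default_rng() if rng is None else rng
    R_i = rng.uniform(-50.0, 50.0, size=(n, n))   # entries uniform on [-50, 50]
    A_i = R_i @ R_i.T / n + lam * np.eye(n)       # A_i = (1/n) R_i R_i^T + lambda I
    q_i = rng.uniform(-50.0, 50.0, size=n)        # components uniform on [-50, 50]
    r_i = rng.uniform(-1.0, 1.0)                  # constant term uniform on [-1, 1]
    return A_i, q_i, r_i
```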

References

  1. Agarwal, A., Bartlett, P.L., Ravikumar, P., Wainwright, M.J.: Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory 58(5), 3235–3249 (2012)

  2. Bertsekas, D.: Incremental least squares methods and the extended Kalman filter. SIAM J. Optim. 6(3), 807–822 (1996)

  3. Bertsekas, D.: A hybrid incremental gradient method for least squares. SIAM J. Optim. 7, 913–926 (1997)

  4. Bertsekas, D.: Nonlinear Programming. Athena Scientific, Belmont (1999)

  5. Bertsekas, D.: Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optim. Mach. Learn. 1–38, 2011 (2010)

  6. Bertsekas, D.: Convex Optimization Algorithms. Athena Scientific, Belmont (2015)

  7. Bottou, L.: Curiously fast convergence of some stochastic gradient descent algorithms. In: Proceedings of the Symposium on Learning and Data Science, Paris (2009)

  8. Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Lechevallier, Y., Saporta, G. (eds.) Proceedings of COMPSTAT’2010, pp. 177–186. Physica-Verlag, HD (2010)

  9. Bottou, L.: Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade, pp. 421–436. Springer, Heidelberg (2012)

  10. Bottou, L.: Stochastic gradient descent tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol. 7700, pp. 421–436. Springer, Berlin (2012)

  11. Bottou, L., Le Cun, Y.: On-line learning for very large data sets. Appl. Stoch. Models Bus. Ind. 21(2), 137–151 (2005)

  12. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1), 1–122 (2011)

  13. Chung, K.L.: On a stochastic approximation method. Ann. Math. Stat. 25(3), 463–483 (1954)

  14. Defazio, A., Bach, F., Lacoste-Julien, S.: Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)

  15. Etemadi, N.: Convergence of weighted averages of random variables revisited. Proc. Am. Math. Soc. 134(9), 2739–2744 (2006)

  16. Fabian, V.: Stochastic approximation of minima with improved asymptotic speed. Ann. Math. Stat. 38(1), 191–200 (1967)

  17. Fabian, V.: On asymptotic normality in stochastic approximation. Ann. Math. Stat. 39, 1327–1332 (1968)

  18. Feng, X., Kumar, A., Recht, B., Ré, C.: Towards a unified architecture for in-RDBMS analytics. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 325–336. ACM (2012)

  19. Gould, N.I.M., Leyffer, S.: An introduction to algorithms for nonlinear optimization. In: Blowey, J.F., Craig, A.W., Shardlow, T. (eds.) Frontiers in Numerical Analysis. Universitext, pp. 109–197. Springer, Berlin (2003)

  20. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677 (2017)

  21. Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.: Convergence rate of incremental gradient and Newton methods. arXiv preprint arXiv:1510.08562 (2015)

  22. Harikandeh, R., Ahmed, M.O., Virani, A., Schmidt, M., Konečný, J., Sallinen, S.: Stop wasting my gradients: Practical SVRG. In: Advances in Neural Information Processing Systems, pp. 2251–2259 (2015)

  23. Israel, A., Krahmer, F., Ward, R.: An arithmetic-geometric mean inequality for products of three matrices. Linear Algebra Appl. 488, 1–12 (2016)

  24. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)

  25. Kushner, H.J., Yin, G.: Stochastic Approximation and Recursive Algorithms and Applications, vol. 35. Springer, Berlin (2003)

  26. Moulines, E., Bach, F.R.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing, pp. 451–459 (2011)

  27. Nedić, A., Ozdaglar, A.: On the rate of convergence of distributed subgradient methods for multi-agent optimization. In: Proceedings of the 46th IEEE Conference on Decision and Control (CDC), pp. 4711–4716 (2007)

  28. Nedić, A., Ozdaglar, A.: Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009)

  29. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)

  30. Nemirovskii, A.S., Yudin, D.B., Dawson, E.R.: Problem complexity and method efficiency in optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, Chichester (1983)

  31. Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4), 838–855 (1992)

  32. Rakhlin, A., Shamir, O., Sridharan, K.: Making gradient descent optimal for strongly convex stochastic optimization. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 449–456 (2012)

  33. Ram, S.S., Nedić, A., Veeravalli, V.V.: Stochastic incremental gradient descent for estimation in sensor networks. In: Asilomar Conference on Signals, Systems and Computers (ACSSC 2007), pp. 582–586 (2007)

  34. Recht, B., Ré, C.: Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. JMLR Workshop Conf. Proc. 23, 11.1–11.24 (2012)

  35. Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput. 5(2), 201–226 (2013)

  36. Recht, B., Ré, C., Wright, S., Niu, F.: Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24, pp. 693–701. Curran Associates Inc (2011)

  37. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  38. Roux, N.L., Schmidt, M., Bach, F.R.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 2663–2671. Curran Associates Inc, Red Hook (2012)

  39. Roux, N.L., Schmidt, M., Bach, F.R.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in Neural Information Processing Systems, pp. 2663–2671 (2012)

  40. Shamir, O.: Open Problem: Is Averaging Needed for Strongly Convex Stochastic Gradient Descent? COLT, (2012)

  41. Sohl-Dickstein, J., Poole, B., Ganguli, S.: Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In: Jebara, T., Xing, E.P. (eds.) Proceedings of the 31st International Conference on Machine Learning (ICML), JMLR Workshop and Conference Proceedings, pp. 604–612 (2014)

  42. Sparks, E.R., Talwalkar, A., Smith, V., Kottalam, J., Xinghao, P., Gonzalez, J., Franklin, M.J., Jordan, M.I., Kraska, T.: MLI: An API for distributed machine learning. In: IEEE 13th International Conference on Data Mining (ICDM), pp. 1187–1192 (2013)

  43. Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning (ICML), p. 116. ACM, New York (2004)

  44. Zhang, T.: A note on the non-commutative arithmetic-geometric mean inequality. arXiv preprint arXiv:1411.5058 (2014)

Acknowledgements

Mert Gurbuzbalaban acknowledges support from the grants NSF DMS-1723085 and NSF CCF-1814888.

Author information

Corresponding author

Correspondence to M. Gürbüzbalaban.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Proof of Theorem 2

Proof

By substituting the gradients of the component functions \(\nabla f_i(x) = P_i x - q_i\) into the RR iterations given by (5), we obtain the recursion

$$\begin{aligned} x_0^{k+1}= & {} \prod _{i=1}^m ({I_n}- \alpha _k P_{\sigma _k(i)} ) x_0^k + \alpha _k \sum _{i=1}^m \prod _{j=i+1}^m ({I_n}- \alpha _k P_{\sigma _k(j)}) q_{\sigma _k(i)} \end{aligned}$$
(57)
$$\begin{aligned}= & {} \bigg ({I_n}- \alpha _k P + \mathcal{O}(\alpha _k^3)\bigg )x_0^k + \alpha _k \sum _{i=1}^m q_i - \alpha _k^2 \hat{\mu }_{\sigma _k} + \mathcal{O}(\alpha _k^3), \end{aligned}$$
(58)

where \(P:= \sum _{i=1}^m P_i\) and

$$\begin{aligned} \hat{\mu }_{\sigma _k} := -\underset{1\le i < j\le m}{\sum } P_{\sigma _k(j)} \nabla f_{\sigma _k(i)}(x_0^k). \end{aligned}$$
(59)
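
To spell out how (58) follows from (57): expanding the product of the cycle maps to second order in \(\alpha _k\) (reading the product with the factor of the last processed component leftmost, consistent with (59), and collecting all higher-order terms into \(\mathcal{O}(\alpha _k^3)\)) gives

$$\begin{aligned} \prod _{i=1}^m ({I_n}- \alpha _k P_{\sigma _k(i)} ) = {I_n}- \alpha _k P + \alpha _k^2 \underset{1\le i < j\le m}{\sum } P_{\sigma _k(j)} P_{\sigma _k(i)} + \mathcal{O}(\alpha _k^3), \end{aligned}$$

while expanding the second term of (57) to first order in \(\alpha _k\) contributes \(\alpha _k \sum _{i=1}^m q_i - \alpha _k^2 \underset{1\le i < j\le m}{\sum } P_{\sigma _k(j)} q_{\sigma _k(i)} + \mathcal{O}(\alpha _k^3)\). Since \(\nabla f_{\sigma _k(i)}(x_0^k) = P_{\sigma _k(i)} x_0^k - q_{\sigma _k(i)}\), the two \(\alpha _k^2\) contributions combine into \(-\alpha _k^2 \hat{\mu }_{\sigma _k}\).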

Since the component functions are quadratics, the optimal solution can be computed explicitly and is given by \(x^* = P^{-1}\sum _{i=1}^m q_i\). Then, it follows after a straightforward computation that (58) is equivalent to

$$\begin{aligned} x_0^{k+1} - x^* = \big (I - \alpha _k P + \mathcal{O}(\alpha _k^3)\big ) (x_0^k - x^*) - \alpha _k^2 \hat{\mu }_{\sigma _k} + \mathcal{O}(\alpha _k^3). \end{aligned}$$
(60)
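
The straightforward computation amounts to subtracting \(x^*\) from both sides of (58) and using the optimality relation \(P x^* = \sum _{i=1}^m q_i\):

$$\begin{aligned} \big ({I_n}- \alpha _k P + \mathcal{O}(\alpha _k^3)\big ) x_0^k + \alpha _k \sum _{i=1}^m q_i - x^* = \big ({I_n}- \alpha _k P + \mathcal{O}(\alpha _k^3)\big ) (x_0^k - x^*) + \mathcal{O}(\alpha _k^3), \end{aligned}$$

which turns (58) into (60).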

We also have

$$\begin{aligned} \Vert \mu _{\sigma _k} - \hat{\mu }_{\sigma _k} \Vert\le & {} \underset{1\le i< j\le m}{\sum } \Vert P_{\sigma _k(j)} \Vert \Vert \nabla f_{\sigma _k(i)}(x_0^k) - \nabla f_{\sigma _k(i)}(x^*)\Vert \\\le & {} \underset{1\le i < j\le m}{\sum } L_{\sigma _k(j)} L_{\sigma _k(i)} \text {dist}_k = \mathcal{O}(\text {dist}_k), \end{aligned}$$

where \(\mu _{\sigma _k}\) is defined by (23) with \(\sigma =\sigma _k\). Plugging this into (60),

$$\begin{aligned} x_0^{k+1} - x^*= & {} \big (I - \alpha _k P + \mathcal{O}(\alpha _k^2) + \mathcal{O}(\alpha _k^3)\big ) (x_0^k - x^*) - \alpha _k^2 \mu _{\sigma _k} + \mathcal{O}(\alpha _k^3 + \alpha _k^2 \text {dist}_k). \end{aligned}$$

Taking norm squares of both sides, taking conditional expectations and using the fact that \(\mu _{\sigma _k}\) is bounded (see (26)), we obtain

$$\begin{aligned} \mathbb {E}_{\sigma _k} \left( \text {dist}_{k+1}^2 \bigg | x_0^k\right)= & {} (x_0^k - x^*)^T \big (I - 2\alpha _k P + \mathcal{O}(\alpha _k^2)\big ) (x_0^k - x^*) + 2\alpha _k^2 \langle x_0^k - x^*, -\bar{\mu }\rangle \nonumber \\&+ \mathcal{O}\left( \alpha _k^3\text {dist}_k + \alpha _k^2 \text {dist}_k^2 + \alpha _k^4\right) , \end{aligned}$$
(61)

where \(\mathbb {E}_{\sigma _k}\) denotes the expectation with respect to the random permutation \(\sigma _k\) and

$$\begin{aligned} \bar{\mu } = \mathbb {E}_{\sigma _k}\left( \mu _{\sigma _k}\right) = \mathbb {E}_{\sigma _1}\left( \mu _{\sigma _1}\right) . \end{aligned}$$

It follows from the Cauchy–Schwarz inequality and the elementary bound \(ab \le \frac{a^2+b^2}{2}\) that for any \(\beta >0\),

$$\begin{aligned} \alpha _k^2 \left\| \langle x_0^k - x^*, -\bar{\mu }\rangle \right\|\le & {} \alpha _k^2 \text {dist}_k \Vert \bar{\mu } \Vert = \left( \sqrt{\beta } \alpha _k^{1/2}\text {dist}_k\right) \frac{ \alpha _k^{3/2} \Vert \bar{\mu } \Vert }{\sqrt{\beta }} \\\le & {} \frac{\beta \alpha _k \text {dist}_k^2}{2} + \frac{\alpha _k^3 \Vert \bar{\mu } \Vert ^2 }{2\beta }, \end{aligned}$$

and also

$$\begin{aligned} \alpha _k^3 \text {dist}_k = \alpha _k^2 \left( \alpha _k \text {dist}_k\right) \le \frac{\alpha _k^4}{2} + \frac{\alpha _k^2 \text {dist}_k^2}{2}. \end{aligned}$$

Plugging these bounds back into (61), using the lower bound (6) on the Hessian \(H_* = P\) and invoking the tower property of the expectations:

$$\begin{aligned} \mathbb {E}\left( \text {dist}_{k+1}^2 \right) ~=~&\big (1 - \alpha _k (2c-\beta ) + \mathcal{O}(\alpha _k^2)\big ) \mathbb {E}\left( \text {dist}_k^2\right) + \alpha _k^3 \frac{\Vert \bar{\mu } \Vert ^2}{\beta } + \mathcal{O}(\alpha _k^4). \end{aligned}$$

Plugging in \(\alpha _k = R/k^s\), it follows from Chung’s lemma [16, Lemma 4.2] that,

$$\begin{aligned} \mathbb {E}\left( \text {dist}_{k+1}^2 \right) \le {\left\{ \begin{array}{ll} \frac{R^2 \Vert \bar{\mu } \Vert ^2}{\beta (2c-\beta )} \frac{1}{k^{2s}} + o\left( \frac{1}{k^{2s}}\right) &{} \quad \text{ if } \quad 0<s<1 \text{ and } 2c-\beta> 0,\\ \frac{R^3 \Vert \bar{\mu } \Vert ^2}{\beta (R\left( 2c-\beta )-2\right) } \frac{1}{k^{2}} + o\left( \frac{1}{k^{2}}\right) &{} \quad \text{ if } \quad s=1 \text{ and } R(2c-\beta )-2 > 0. \end{array}\right. } \end{aligned}$$
(62)
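
For completeness, the version of Chung’s lemma invoked here can be paraphrased as follows (an informal summary of [13] and [16, Lemma 4.2]): if a non-negative sequence satisfies \(b_{k+1} \le \big (1 - \frac{a}{k^s} + o(k^{-s})\big ) b_k + \frac{d}{k^{3s}} + o(k^{-3s})\), then

$$\begin{aligned} b_{k} \le \frac{d}{a} \frac{1}{k^{2s}} + o\left( \frac{1}{k^{2s}}\right) \quad \text{ if } \quad 0<s<1 \text{ and } a>0, \qquad b_{k} \le \frac{d}{a-2} \frac{1}{k^{2}} + o\left( \frac{1}{k^{2}}\right) \quad \text{ if } \quad s=1 \text{ and } a>2. \end{aligned}$$

Applying this with \(b_k = \mathbb {E}\left( \text {dist}_{k}^2\right) \), \(a = R(2c-\beta )\) and \(d = R^3 \Vert \bar{\mu }\Vert ^2/\beta \) recovers (62).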

Next we choose \(\beta \) to get the best upper bound above. This is done by choosing \(\beta = c\) for \(0<s<1\) (which maximizes \(\beta (2c-\beta )\)) and \(\beta = (Rc - 1)/R\) for \(s=1\) (which maximizes \(\beta \left( R(2c-\beta )-2\right) \)), which yields

$$\begin{aligned} \mathbb {E}\left( \text {dist}_{k+1}^2 \right) \le {\left\{ \begin{array}{ll} \frac{R^2 \Vert \bar{\mu } \Vert ^2}{c^2} \frac{1}{k^{2s}} + o\left( \frac{1}{k^{2s}}\right) &{}\quad \text{ if } \quad 0<s<1,\\ \frac{R^4 \Vert \bar{\mu } \Vert ^2}{(Rc-1)^2} \frac{1}{k^{2}} + o\left( \frac{1}{k^{2}}\right) &{} \quad \text{ if } \quad s=1 \text{ and } Rc -1 > 0. \end{array}\right. } \end{aligned}$$
(63)

By Jensen’s inequality, we have \(\mathbb {E}(\text {dist}_{k+1}) \le \left( \mathbb {E}\left( \text {dist}_{k+1}^2 \right) \right) ^{1/2}\). Therefore, by taking square roots of both sides in (63) we conclude. \(\square \)

Technical lemmas for the proof of Theorem 3

The first lemma characterizes the worst-case distance of the inner iterates of RR to the optimal solution \(x^*\). The quantity we want to upper bound is a random variable, but the upper bounds we obtain are deterministic and hold for every sample path. This lemma is based on Corollary 1 and uses the fact that the distance between consecutive inner iterates is on the order of the stepsize.

Lemma 1

Under the conditions of Theorem 3 we have \({\max }_{0 \le i < m} \Vert x_i^k - x^*\Vert = \mathcal{O}(\frac{1}{k^s})\) where \(\mathcal{O}(\cdot )\) hides a constant that depends only on \({G_*}, L, m,c\) and R.

Proof

By Corollary 1,

$$\begin{aligned} \Vert x_0^k - x^* \Vert = \mathcal{O}\left( \frac{1}{k^s}\right) , \end{aligned}$$
(64)

where \(\mathcal{O}(\cdot )\) hides a constant that depends only on \({G_*}, L, m,R\) and c. We also have, for any \(0\le i< m\) and \(k\ge 0\),

$$\begin{aligned} \Vert x_{i+1}^k - x^*\Vert \le \Vert x_{i}^k - x^*\Vert + \alpha _k \big \Vert \nabla f_{\sigma _k(i+1)}(x_{i}^k)\big \Vert \le (1+\alpha _k L)\Vert x_{i}^k - x^*\Vert + \alpha _k {G_*}, \end{aligned}$$

where we used the Lipschitzness of the component gradients (with constants at most L, where L is given by (19)) and the definition of \({G_*}\). Using (64) and applying this inequality inductively for \(i=0,1,2,\dots , m-1\) we conclude. \(\square \)

The second lemma characterizes how fast, on average, the outer iterates move (when normalized by the stepsize) over a cycle of the RR algorithm. This is closely related to the magnitude of the gradients seen by the iterates and is fundamental for establishing the convergence rate of the averaged RR iterates in Theorem 3.

Lemma 2

Under the conditions of Theorem 3, consider the sequence

$$\begin{aligned} I_{q,k} = \frac{\sum _{j=(1-q)k}^{k-1} {(x_0^j - x_{0}^{j+1})}{\alpha _j^{-1}}}{qk}, \quad 0 < q \le 1. \end{aligned}$$
(65)

Then,

$$\begin{aligned} I_{q, k} = {\left\{ \begin{array}{ll} \mathcal{O}\big ( \frac{\log k}{k} \big ) &{} \quad \text{ if } \quad q=1, \\ \mathcal{O}\big ( \frac{1}{k}\big ) &{} \quad \text{ if } \quad 0<q<1. \end{array}\right. } \end{aligned}$$

In the former case, \(\mathcal{O}(\cdot )\) hides a constant that depends only on \({G_*}, L,m, c, R, s,q\) and \(\text {dist}_0\). In the latter case, the same dependency on the constants occurs except that the dependency on \(\text {dist}_0\) can be removed.

Proof

It follows from summation by parts (the discrete analogue of integration by parts) that for any \(\ell <k\),

$$\begin{aligned} -\sum _{j=\ell }^{k-1}{(x_0^j - x_{0}^{j+1})}{\alpha _j^{-1}} = \alpha _k^{-1} (x_0^k - x^*) - \alpha _\ell ^{-1} (x_0^\ell - x^*) - \sum _{j=\ell }^{k-1} (x_0^{j+1} - x^*)(\alpha _{j+1}^{-1}-\alpha _j^{-1}). \end{aligned}$$
(66)

Next, we investigate the asymptotic behavior of the terms on the right-hand side. A consequence of Corollary 1 and the inequality (25) is that

$$\begin{aligned} \alpha _k^{-1} \Vert x_0^k -x^*\Vert = \frac{(k+1)^s}{R} \Vert x_0^k -x^*\Vert \le \frac{LmG_*}{c} + o(1) = \mathcal{O}(1), \end{aligned}$$
(67)

and therefore

$$\begin{aligned} | \alpha _{k+1}^{-1} - \alpha _{k}^{-1} | \Vert x_0^k -x^*\Vert= & {} \frac{(k+2)^s - (k+1)^s}{(k+1)^s} \alpha _k^{-1}\Vert x_0^k -x^*\Vert \\= & {} \bigg ( \left( 1 + \frac{1}{k+1} \right) ^s -1 \bigg ) \alpha _k^{-1}\Vert x_0^k -x^*\Vert \\\le & {} \frac{s}{k+1}\alpha _k^{-1}\Vert x_0^k -x^*\Vert \le \frac{sLmG_*}{c} \frac{1}{k+1} \\&+\, o\left( \frac{1}{k+1}\right) = \mathcal{O}\left( \frac{1}{k+1}\right) , \end{aligned}$$

where \(\mathcal{O}(\cdot )\) hides a constant that depends only on \(L,{G_*}, c, m\) and s. Then, setting \(\ell = (1-q)k\) in (66), it follows that

$$\begin{aligned} \left\| \sum _{j=\ell }^{k-1} {(x_0^j - x_0^{j+1})}{\alpha _j^{-1}} \right\|\le & {} \Vert \alpha _k^{-1} (x_0^k - x^*)\Vert + \Vert \alpha _{(1-q)k}^{-1} (x_0^{(1-q)k}- x^*)\Vert \end{aligned}$$
(68)
$$\begin{aligned}&+ \sum _{j=(1-q)k}^{k-1} \Vert x_0^{j+1} - x^*\Vert |\alpha _{j+1}^{-1}-\alpha _j^{-1}|. \nonumber \\= & {} \mathcal{O}(1) + \Vert \alpha _{(1-q)k}^{-1} (x_0^{(1-q)k} - x^*)\Vert + \mathcal{O}\Bigg ( \sum _{j=(1-q)k}^{k-1} \frac{1}{j+1}\Bigg ). \nonumber \\ \end{aligned}$$
(69)

We also have

$$\begin{aligned} \Vert \alpha _{(1-q)k}^{-1} (x_0^{(1-q)k} - x^*)\Vert = {\left\{ \begin{array}{ll} \alpha _0^{-1} \text {dist}_0 &{} \quad \text{ if } \quad q = 1, \\ \mathcal{O}(1) &{} \quad \text{ if } \quad 0<q<1, \end{array}\right. } \end{aligned}$$
(70)

where the second part follows from (67) with similar constants for the \(\mathcal{O}(\cdot )\) term. As the sequence \(\frac{1}{j+1}\) is monotonically decreasing, for any \(k>0\) we have the bounds

$$\begin{aligned} \sum _{j=(1-q)k}^{k-1} \frac{1}{j+1} \le \frac{1}{(1-q)k + 1} + \int _{(1-q)k}^{k-1} \frac{1}{x+1} dx \le {\left\{ \begin{array}{ll} 1 + \log k &{}\quad \text{ if } \quad q=1, \\ 1 + \log \left( \frac{1}{1-q}\right) &{}\quad \text{ if } \quad 0<q <1. \end{array}\right. } \end{aligned}$$
(71)

Note that when \(q=1\) this bound grows with k logarithmically whereas for \(q<1\) it does not grow with k. Then, combining (69), (70) and (71) we obtain

$$\begin{aligned} \Vert {I_{q,k}}\Vert \le \frac{\big \Vert \sum _{j=\ell }^{k-1} {(x_0^j - x_0^{j+1})}{\alpha _j^{-1}} \big \Vert }{qk} = {\left\{ \begin{array}{ll} \mathcal{O}\big ( \frac{\log k}{k}\big ) &{}\quad \text{ if } \quad q=1, \\ \mathcal{O}\big ( \frac{1}{k}\big ) &{}\quad \text{ if } \quad 0<q<1, \end{array}\right. } \end{aligned}$$

as desired, which completes the proof. \(\square \)

Lemma 3

Let \(\sigma \) be a random permutation of \(\{1,2,\dots ,m\}\) sampled uniformly over the set of all permutations \(\varGamma \) defined by (4) and \(\mu (\sigma )\) be the vector defined by (23) that depends on \(\sigma \). Then,

$$\begin{aligned} \bar{\mu } = \mathbb {E}_{\sigma }\big (\mu ({\sigma })\big ) = \frac{1}{2} \sum _{i=1}^m P_i \nabla f_i(x^*), \end{aligned}$$
(72)

where \(\mathbb {E}_{\sigma }\) denotes the expectation with respect to the random permutation \(\sigma \) and \(\bar{\mu }\) is defined by (29).

Proof

For any \(i \ne \ell \), the joint distribution of \((\sigma (i), \sigma (\ell ))\) is uniform over the set of all ordered pairs of distinct elements of \(\{1,2,\dots ,m\}\). Therefore, for any \(i \ne \ell \),

$$\begin{aligned} \mathbb {E}_\sigma \big [ P_{\sigma (i)} \nabla f_{\sigma (\ell ) }(x^*) \big ]= & {} \sum _{a=1}^m \sum _{b=1, b\ne a}^m \frac{P_a\nabla f_b(x^*)}{m(m-1)} \\= & {} \frac{\sum _{a=1}^m P_a \sum _{b=1}^m \nabla f_b(x^*) - \sum _{a=1}^m P_a \nabla f_a(x^*)}{m(m-1)}\\= & {} - \frac{\sum _{a=1}^m P_a \nabla f_a(x^*)}{m(m-1)}, \end{aligned}$$

where we used the fact that \(\nabla f(x^*) = \sum _{j=1}^m \nabla f_j(x^*)=0\) by the first order optimality condition. Then, by taking the expectation of (74), we obtain

$$\begin{aligned} \mathbb {E}_\sigma (\mu (\sigma ))= & {} - \sum _{i=1}^{m} \sum _{\ell =1}^{i-1} \mathbb {E}_\sigma \big [ P_{\sigma (i)} \nabla f_{\sigma (\ell )}(x^*) \big ] = \sum _{i=1}^{m} \sum _{\ell =1}^{i-1} \frac{\sum _{j=1}^m P_j \nabla f_j(x^*)}{m(m-1)}\\= & {} \frac{\sum _{j=1}^m P_j \nabla f_j(x^*)}{2}, \end{aligned}$$

which completes the proof. \(\square \)
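
As a quick numerical sanity check of (72) (not part of the proof; the script below averages \(\mu (\sigma )\) exactly over all permutations, using randomly generated symmetric positive semidefinite matrices as stand-ins for the \(P_i\) and gradients that sum to zero at \(x^*\)):

```python
import itertools
import math
import numpy as np

def mu(sigma, P, g):
    # mu(sigma) = - sum_i P_{sigma(i)} sum_{l < i} grad f_{sigma(l)}(x*), cf. (74) (0-based here)
    total = np.zeros_like(g[0])
    for i in range(len(sigma)):
        for l in range(i):
            total -= P[sigma[i]] @ g[sigma[l]]
    return total

rng = np.random.default_rng(0)
m, n = 5, 3
P = [M @ M.T for M in rng.standard_normal((m, n, n))]    # stand-ins for the matrices P_i
g = list(rng.standard_normal((m, n)))                    # stand-ins for grad f_i(x*)
g[-1] = -sum(g[:-1])                                     # enforce sum_i grad f_i(x*) = 0

avg = sum(mu(s, P, g) for s in itertools.permutations(range(m))) / math.factorial(m)
exact = 0.5 * sum(P[i] @ g[i] for i in range(m))
print(np.allclose(avg, exact))                           # prints True, matching (72)
```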

Lemma 4

Under the conditions of Theorem 3, the following statements are true:

  1. (i)

    We have

    $$\begin{aligned} E_k = \alpha _k \mu (\sigma _k) + \mathcal{O}(\alpha _k^2), \quad k\ge 0, \end{aligned}$$
    (73)

    where \(E_k\) is the gradient error defined by (8), \(\mathcal{O}(\cdot )\) hides a constant that depends only on \({G_*}, L,m, R\) and c and

    $$\begin{aligned} \mu (\sigma _k) = - \sum _{i=1}^{m} P_{{\sigma _k(i)}} \sum _{\ell =1}^{i-1} \nabla f_{{\sigma _k(\ell )}}(x^*) \end{aligned}$$
    (74)

    is a sequence of i.i.d. variables where the function \(\mu (\cdot )\) is defined by (23).

  2. (ii)

    For any \(0 <q\le 1 \), \(\lim _{k \rightarrow \infty } Y_{q,k}= \bar{\mu }\) a.s. where \( Y_{q,k} = \frac{\sum _{j=(1-q)k}^{k-1} E_j}{\sum _{j=(1-q)k}^{k-1} \alpha _j}.\)

  3. (iii)

    It holds that

    $$\begin{aligned} \Vert \mu (\sigma _k) \Vert \le Lm G_*. \end{aligned}$$
    (75)

Proof

(i):

As the component functions are quadratics, (8) becomes

$$\begin{aligned} E_k= & {} \sum _{i=1}^{m} P_{{\sigma _k(i)}}(x_{i-1}^k - x_0^k) = - \sum _{i=1}^{m} P_{{\sigma _k(i)}} \alpha _k \sum _{\ell =1}^{i-1} \nabla f_{{\sigma _k(\ell )}} (x_{\ell -1}^k), \end{aligned}$$

where we can substitute

$$\begin{aligned} \nabla f_{ {\sigma _k(\ell )}} (x_{\ell -1}^k) = \nabla f_{ {\sigma _k(\ell )}} (x^*) + P_{{\sigma _k(\ell )}} (x_{\ell -1}^k - x^*). \end{aligned}$$
(76)

Then an application of Lemma 1 proves directly the desired result.

(ii):

We introduce the normalized gradient error sequence \(Y_j = E_j / \alpha _j\). By part (i), \(Y_j = \mu (\sigma _j) + \mathcal{O}(\alpha _j)\) where \(\mu (\sigma _j)\) is a sequence of i.i.d. variables. By the strong law of large numbers, we have

$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\sum _{j=0}^{k-1} \mu (\sigma _j)}{k} = \mathbb {E}\mu (\sigma _j) = \bar{\mu } \quad \text{ a.s. }, \end{aligned}$$
(77)

where the last equality is by the definition of \(\bar{\mu }\). Therefore,

$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\sum _{j=0}^{k-1} Y_j}{k} = \lim _{k \rightarrow \infty } \bigg (\frac{\sum _{j=0}^{k-1} \mu (\sigma _j)}{k} + \frac{\sum _{j=0}^{k-1} \mathcal{O}(\alpha _j)}{k}\bigg ) = \bar{\mu } \quad \text{ a.s. }, \end{aligned}$$

where we used the fact that the second term is negligible since \(\sum _{j=0}^{k-1} \alpha _j / k = \mathcal{O}(k^{-s}) \rightarrow 0\). As the average of the sequence \(Y_j\) converges almost surely, this implies almost sure convergence of suitably weighted averages of the sequence \(Y_j\) as well, provided the weights satisfy certain conditions as \(k \rightarrow \infty \). In particular, as the sequence \(\{\alpha _j\}\) is monotonically decreasing and non-summable, by [15, Theorem 1],

$$\begin{aligned} \lim _{k \rightarrow \infty } Y_{1,k} = \lim _{k \rightarrow \infty }\frac{\sum _{j=0}^{k-1} \alpha _j Y_j}{\sum _{j=0}^{k-1} \alpha _j} = \lim _{k \rightarrow \infty } \frac{\sum _{j=0}^{k-1} E_j}{\sum _{j=0}^{k-1} \alpha _j} = \bar{\mu } \quad \text{ a.s. } \end{aligned}$$
(78)

This completes the proof for \(q=1\). For \(0<q<1\), by the definition of \(Y_{q,k}\), we can write \(Y_{1,k} = (1-w_k) Y_{q,k} + w_k Y_{1, (1-q)k}\) where the non-negative weights \(w_k\) satisfy

$$\begin{aligned} w_k = \frac{\sum _{j=0}^{(1-q)k - 1 } \alpha _j}{\sum _{j=0}^{k - 1 } \alpha _j } \rightarrow _{k \rightarrow \infty } (1-q)^{1-s} < 1. \end{aligned}$$
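
The stated limit can be checked from the standard estimate \(\sum _{j=0}^{k-1} \alpha _j = R \sum _{j=1}^{k} j^{-s} \sim \frac{R}{1-s}\, k^{1-s}\) for \(s \in (1/2,1)\), which gives

$$\begin{aligned} w_k \sim \frac{\big ((1-q)k\big )^{1-s}}{k^{1-s}} = (1-q)^{1-s}. \end{aligned}$$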

As both \(Y_{1,k}\) and \(Y_{1, (1-q)k}\) go to \(\bar{\mu }\) a.s. by (78), it follows that

$$\begin{aligned} \lim _{k\rightarrow \infty } Y_{q,k} = \lim _{k\rightarrow \infty } \frac{Y_{1,k} - w_k Y_{1,(1-q)k}}{1-w_k} = \bar{\mu }\quad \text{ a.s } \end{aligned}$$

as well for any \(0<q<1\). This completes the proof.

(iii):

This is a direct consequence of the triangle inequality applied to the definition (74) with \(L_i = \Vert P_i\Vert \) and \(L=\sum _{i=1}^m L_i\). \(\square \)

Technical lemmas for the proof of Theorem 4

We first state a result which follows from adapting existing results from the literature to our setting. It extends Corollary 1 from quadratics to smooth functions.

Corollary 2

Under the setting of Theorem 4, we have

$$\begin{aligned} \text {dist}_k \le \frac{R M}{c}\frac{1}{k^s} + o\left( \frac{1}{k^s}\right) , \end{aligned}$$

where the right-hand side is a deterministic sequence, \(M := Lm {G_*}\) and \({G_*}\) is defined by (24).

Proof

The result [21, Theorem 3.2] on the asymptotic convergence of the incremental gradient method implies that all the iterates converge to the optimum, i.e. \(x_i^k \rightarrow x^*\) for every fixed i as k goes to infinity. Let \(\mathcal {X}_\varepsilon \) be the closed \(\varepsilon \)-ball around the optimum, i.e. \(\mathcal {X}_\varepsilon := \{ x \in \mathbb {R}^n ~:~ \Vert x- x^*\Vert \le \varepsilon \}\). Clearly, the iterates will be contained in this ball when k is large enough, i.e. for every \(\varepsilon >0\) there exists \(k_0\) (that may depend on \(\varepsilon \)) such that \(x_i^k \in \mathcal {X}_\varepsilon \) for any \(k\ge k_0\) and for all \(i=1,2,\dots , m\). By [21, Theorem 3.2], we have also

$$\begin{aligned} \limsup _{k\rightarrow \infty } k^s \text {dist}_k \le \frac{R M_\varepsilon }{c}, \end{aligned}$$
(79)

where \( M_\varepsilon : = Lm G_\varepsilon \) and \(G_\varepsilon := \max _{1\le i \le m} \sup _{x\in \mathcal {X}_\varepsilon }\Vert \nabla f_i(x)\Vert \) is the largest norm of the gradients of the component functions on the compact set \(\mathcal {X}_\varepsilon \). If we let \(\varepsilon \) go to zero, we can replace \(G_\varepsilon \) with \(G_*=\max _{1\le i \le m} \Vert \nabla f_i(x^*)\Vert \) and \(M_\varepsilon \) with M in (79). This completes the proof. \(\square \)

Building on this corollary, we obtain the following results.

Lemma 5

Under the conditions of Theorem 4, all the conclusions of Lemma 1 remain valid.

Proof

The proof of Lemma 1 applies identically except that instead of Corollary 1 we use its extension Corollary 2. \(\square \)

Lemma 6

Under the conditions of Theorem 4, all the conclusions of Lemma 2 remain valid.

Proof

The proof of Lemma 2 applies identically with the only difference that the bound on \(\text {dist}_k = \Vert x_0^k - x^*\Vert \) is obtained from Corollary 2 instead of Corollary 1. \(\square \)

Lemma 7

Under the conditions of Theorem 4, the following statements are true:

  1. (i)

    We have

    $$\begin{aligned} E_k = \alpha _k v(\sigma _k) + \mathcal{O}(\alpha _k^2), \quad k\ge 0, \end{aligned}$$
    (80)

    where \(\mathcal{O}(\cdot )\) hides a constant that depends only on \({G_*}, L,m, R, c\) and \(U\) and

    $$\begin{aligned} v(\sigma _k) = - \sum _{i=1}^{m} \nabla ^2 f_{{\sigma _k(i)}} (x^*) \sum _{\ell =1}^{i-1} \nabla f_{{\sigma _k(\ell )}}(x^*). \end{aligned}$$
  2. (ii)

    It holds that

    $$\begin{aligned} \Vert v(\sigma _k)\Vert \le LmG_*, \end{aligned}$$
    (81)

    where

    $$\begin{aligned} \bar{v}:= \mathbb {E}v(\sigma _k) = {\sum _{i=1}^m \nabla ^2 f_i(x^*) \nabla f_i(x^*)}/{2}. \end{aligned}$$
    (82)
  3. (iii)

    For any \(0 <q\le 1 \), \(\lim _{k \rightarrow \infty } Y_{q,k}= \bar{v}\) with probability one where

    $$\begin{aligned} Y_{q,k} = \frac{\sum _{j=(1-q)k}^{k-1} E_j}{\sum _{j=(1-q)k}^{k-1} \alpha _j}. \end{aligned}$$
    (83)

Proof

For part (i), first we express \(E_k\) using the Taylor expansion and the Hessian Lipschitzness as

$$\begin{aligned} E_k= & {} \sum _{i=1}^{m} \bigg ( \nabla ^2 f_{{\sigma _k(i)}}(x_0^k) \bigg ) (x_{i-1}^k - x_0^k) + \mathcal{O}\big ( U\Vert x_{i-1}^k - x_0^k \Vert ^2 \big ) \\= & {} - \alpha _k \sum _{i=1}^{m} \bigg ( \nabla ^2 f_{{\sigma _k(i)}}(x_0^k) \bigg ) \sum _{\ell =1}^{i-1} \nabla f_{ {\sigma _k(\ell )}} (x_{\ell -1}^k) + \mathcal{O}\bigg ( \alpha _k^2 U\bigg \Vert \sum _{\ell =1}^{i-1} \nabla f_{ {\sigma _k(\ell )}} (x_{\ell -1}^k) \bigg \Vert ^2 \bigg ). \end{aligned}$$

By Lemma 5, we have \(\Vert x_\ell ^k - x^*\Vert = \mathcal{O}(\alpha _k)\) with probability one. Then, by the gradient and Hessian Lipschitzness we can substitute above

$$\begin{aligned} \nabla f_{ {\sigma _k(\ell )}} (x_{\ell -1}^k) = \nabla f_{ {\sigma _k(\ell )}} (x^*) + \mathcal{O}(\alpha _k), \quad \nabla ^2 f_{ {\sigma _k(\ell )}} (x_{\ell -1}^k) = \nabla ^2 f_{ {\sigma _k(\ell )}} (x^*) + \mathcal{O}(\alpha _k), \end{aligned}$$

which implies directly Eq. (80). The rest of the proof for parts (ii) and (iii) is similar to the proof of Lemma 4 and is omitted. \(\square \)

About this article

Cite this article

Gürbüzbalaban, M., Ozdaglar, A. & Parrilo, P.A. Why random reshuffling beats stochastic gradient descent. Math. Program. 186, 49–84 (2021). https://doi.org/10.1007/s10107-019-01440-w

