Abstract
We analyze the convergence rate of the random reshuffling (RR) method, which is a randomized first-order incremental algorithm for minimizing a finite sum of convex component functions. RR proceeds in cycles, picking a uniformly random order (permutation) and processing the component functions one at a time according to this order, i.e., at each cycle, each component function is sampled without replacement from the collection. Though RR has been numerically observed to outperform its with-replacement counterpart stochastic gradient descent (SGD), characterizing its convergence rate has been a long-standing open question. In this paper, we answer this question by providing various convergence rate results for RR and its variants when the sum function is strongly convex. We first focus on quadratic component functions and show that the expected distance of the iterates generated by RR with stepsize \(\alpha _k=\varTheta (1/k^s)\) for \(s\in (0,1]\) converges to zero at rate \(\mathcal{O}(1/k^s)\) (with \(s=1\) requiring adjusting the stepsize to the strong convexity constant). Our main result shows that when the component functions are quadratics or smooth (with a Lipschitz assumption on the Hessian matrices), RR with iterate averaging and a diminishing stepsize \(\alpha _k=\varTheta (1/k^s)\) for \(s\in (1/2,1)\) converges at rate \(\varTheta (1/k^{2s})\) with probability one in the suboptimality of the objective value, thus improving upon the \(\varOmega (1/k)\) rate of SGD. Our analysis draws on the theory of Polyak–Ruppert averaging and relies on decoupling the dependent cycle gradient error into a term that is independent over cycles and another term dominated by \(\alpha _k^2\). This allows us to apply the law of large numbers to an appropriately weighted version of the cycle gradient errors, where the weights depend on the stepsize.
We also provide high-probability convergence rate estimates that show the decay rates of the different terms and allow us to propose a modification of RR with convergence rate \(\mathcal{O}(\frac{1}{k^2})\).
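To make the procedure concrete, here is a minimal Python sketch of one RR run on a finite sum of strongly convex quadratics. The problem data, stepsize constants \(R\) and \(s\), and cycle count are hypothetical illustrations and are not taken from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 5  # number of components and dimension (hypothetical choices)

# Strongly convex quadratic components f_i(x) = (1/2) x^T P_i x - q_i^T x
P = [np.diag(rng.uniform(1.0, 2.0, size=n)) for _ in range(m)]
q = [rng.normal(size=n) for _ in range(m)]
x_star = np.linalg.solve(sum(P), sum(q))  # minimizer of the sum f = sum_i f_i

def random_reshuffling(num_cycles, R=0.1, s=0.75):
    """One RR run: each cycle processes all components in a fresh random order."""
    x = np.zeros(n)
    for k in range(1, num_cycles + 1):
        alpha = R / k**s  # diminishing stepsize Theta(1/k^s)
        for i in rng.permutation(m):  # sampling without replacement
            x = x - alpha * (P[i] @ x - q[i])  # gradient step on f_i
    return x

dist = np.linalg.norm(random_reshuffling(2000) - x_star)
print(dist)  # the distance to the optimum shrinks as cycles accumulate
```

Per cycle, every component is visited exactly once, which is the structural difference from with-replacement SGD sampling.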
Notes
There is some literature that analyzes SGD under correlated noise [25, Ch. 6], but the noise needs to have a special structure (such as a mixing property) which does not seem to be applicable to the analysis of RR.
IG shows similar properties to SGD in terms of the robustness of the stepsize rules \(\alpha _k = R/k^s\). The convergence rate (in k) is only robust to the strong convexity constant of the objective for \(s<1\) but not for \(s=1\) [29, Section 2.1].
To see this, note that the RR iterations for this example are given by \( x_0^{k+1} = (1-\frac{3}{2}\alpha _k + 2\alpha _k^2) x_0^k - \alpha _k^2 \mu (\sigma _k)\) which implies, after taking norms of both sides and using the fact that \(\Vert \mu (\sigma _k)\Vert \le 2\), \( \text {dist}_{k+1}\le (1-\frac{3}{2}\alpha _k + 2\alpha _k^2) \text {dist}_k + 2 \alpha _k^2.\) Then, by invoking classical results for the asymptotic behavior of non-negative sequences (see e.g. [6, Appendix A.4.3]), we get \(\text {dist}_{k+1}\rightarrow 0\). Theorem 1 also shows global convergence of RR on this example.
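One can iterate this scalar bound numerically as a sanity check; the constants \(R\), \(s\), and the initial distance below are hypothetical:

```python
def bound_seq(num_cycles, R=0.5, s=0.75, d0=10.0):
    """Iterate dist_{k+1} <= (1 - 1.5*a_k + 2*a_k^2) dist_k + 2*a_k^2, a_k = R/k^s."""
    d = d0
    for k in range(1, num_cycles + 1):
        a = R / k**s
        d = (1 - 1.5 * a + 2 * a * a) * d + 2 * a * a
    return d

print(bound_seq(100), bound_seq(100_000))  # the bound decays toward zero
```

Since \(\sum_k \alpha_k = \infty\) while \(\sum_k \alpha_k^2 < \infty\) for \(s \in (1/2, 1]\), the contraction eventually dominates the \(2\alpha_k^2\) perturbation.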
The original result in [21, Theorem 3.1] was stated for \(\sigma =\{1,2,\dots ,m\}\) but here we translate this result into an arbitrary permutation \(\sigma \) of \(\{1,2,\dots ,m\}\) by noting that processing the set of functions \(\{f_1, f_2, \dots , f_m\}\) with order \(\sigma \) is equivalent to processing the permuted functions \(\{f_{\sigma _1}, f_{\sigma _2}, \dots , f_{\sigma _m}\}\) with order \(\{1,2,\dots ,m\}\).
This is due to the fact that the sequence \(\alpha _k^2\) is summable when \(s>1/2\).
Note that if this assumption holds and if \(f_i\) is three-times continuously differentiable on the compact set \(\mathcal {X}\), then the third-order derivatives are bounded and Assumption 2 holds.
The quadratic functions \(f_i(x)\) have the form \(f_i(x) = x^T A_i x + q_i^T x + r_i\). The matrices \(A_i\) are chosen randomly to satisfy \(A_i = \frac{1}{n} R_i R_i^T + \lambda I\), where \(I\) is the \(n\times n\) identity matrix, \(R_i\) is a random matrix with each entry uniform on the interval \([-50,50]\), and \(\lambda \) is a regularization parameter that makes the problem strongly convex. We set \(\lambda = 5\). The vectors \(q_i\) are random with each component uniformly distributed on the interval \([-50,50]\), and each scalar \(r_i\) is uniform on the interval \([-1,1]\).
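This construction can be sketched as follows; the problem sizes \(m\) and \(n\) below are hypothetical, while \(\lambda = 5\) and the sampling intervals come from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam = 20, 10, 5.0  # m, n hypothetical; lambda = 5 as in the text

A, qs, rs = [], [], []
for _ in range(m):
    Ri = rng.uniform(-50, 50, size=(n, n))
    A.append(Ri @ Ri.T / n + lam * np.eye(n))  # A_i = (1/n) R_i R_i^T + lam I
    qs.append(rng.uniform(-50, 50, size=n))    # components uniform on [-50, 50]
    rs.append(rng.uniform(-1, 1))              # scalar offset uniform on [-1, 1]

# R_i R_i^T is positive semidefinite, so every eigenvalue of A_i is at least lam,
# making each f_i(x) = x^T A_i x + q_i^T x + r_i strongly convex.
min_eig = min(np.linalg.eigvalsh(Ai)[0] for Ai in A)
print(min_eig)  # comfortably at or above lam = 5
```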
We note that all experiments were performed on a MacBook Pro with a 3.1 GHz Intel Core i7 processor and 16 GB of RAM, using Matlab R2017a on macOS Sierra v10.12.5.
References
Agarwal, A., Bartlett, P.L., Ravikumar, P., Wainwright, M.J.: Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory 58(5), 3235–3249 (2012)
Bertsekas, D.: Incremental least squares methods and the extended Kalman filter. SIAM J. Optim. 6(3), 807–822 (1996)
Bertsekas, D.: A hybrid incremental gradient method for least squares. SIAM J. Optim. 7, 913–926 (1997)
Bertsekas, D.: Nonlinear Programming. Athena Scientific, Belmont (1999)
Bertsekas, D.: Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. In: Optimization for Machine Learning. MIT Press, Cambridge (2011)
Bertsekas, D.: Convex Optimization Algorithms. Athena Scientific, Belmont (2015)
Bottou, L.: Curiously fast convergence of some stochastic gradient descent algorithms. In: Proceedings of the Symposium on Learning and Data Science, Paris (2009)
Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Lechevallier, Y., Saporta, G. (eds.) Proceedings of COMPSTAT’2010, pp. 177–186. Physica-Verlag, HD (2010)
Bottou, L.: Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade, pp. 421–436. Springer, Heidelberg (2012)
Bottou, L.: Stochastic gradient descent tricks. In: Montavon, G., Orr, G.B., Müller, K.-R. (eds.) Neural Networks: Tricks of the Trade. Lecture Notes in Computer Science, vol. 7700, pp. 421–436. Springer, Berlin (2012)
Bottou, L., Le Cun, Y.: On-line learning for very large data sets. Appl. Stoch. Models Bus. Ind. 21(2), 137–151 (2005)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Chung, K.L.: On a stochastic approximation method. Ann. Math. Stat. 25(3), 463–483 (1954)
Defazio, A., Bach, F., Lacoste-Julien, S.: Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)
Etemadi, N.: Convergence of weighted averages of random variables revisited. Proc. Am. Math. Soc. 134(9), 2739–2744 (2006)
Fabian, V.: Stochastic approximation of minima with improved asymptotic speed. Ann. Math. Stat. 38(1), 191–200 (1967)
Fabian, V.: On asymptotic normality in stochastic approximation. Ann. Math. Stat. 39, 1327–1332 (1968)
Feng, X., Kumar, A., Recht, B., Ré, C.: Towards a unified architecture for in-RDBMS analytics. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 325–336. ACM (2012)
Gould, N.I.M., Leyffer, S.: An introduction to algorithms for nonlinear optimization. In: Blowey, J.F., Craig, A.W., Shardlow, T. (eds.) Frontiers in Numerical Analysis. Universitext, pp. 109–197. Springer, Berlin (2003)
Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.: Convergence rate of incremental gradient and Newton methods. arXiv preprint arXiv:1510.08562 (2015)
Harikandeh, R., Ahmed, M.O., Virani, A., Schmidt, M., Konečnỳ, J., Sallinen, S.: Stop wasting my gradients: practical SVRG. In: Advances in Neural Information Processing Systems, pp. 2251–2259 (2015)
Israel, A., Krahmer, F., Ward, R.: An arithmetic-geometric mean inequality for products of three matrices. Linear Algebra Appl. 488, 1–12 (2016)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
Kushner, H.J., Yin, G.: Stochastic Approximation and Recursive Algorithms and Applications, vol. 35. Springer, Berlin (2003)
Moulines, E., Bach, F.R.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing, pp. 451–459 (2011)
Nedić, A., Ozdaglar, A.: On the rate of convergence of distributed subgradient methods for multi-agent optimization. In: Proceedings of the 46th IEEE Conference on Decision and Control (CDC), pp. 4711–4716 (2007)
Nedić, A., Ozdaglar, A.: Distributed subgradient methods for multi-agent optimization. IEEE Trans. Autom. Control 54(1), 48–61 (2009)
Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009)
Nemirovskii, A.S., Yudin, D.B., Dawson, E.R.: Problem complexity and method efficiency in optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, Chichester (1983)
Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4), 838–855 (1992)
Rakhlin, A., Shamir, O., Sridharan, K.: Making gradient descent optimal for strongly convex stochastic optimization. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 449–456 (2012)
Ram, S.S., Nedic, A., Veeravalli, V.V.: Stochastic incremental gradient descent for estimation in sensor networks. Signals Syst. Comput. ACSSC 2007, 582–586 (2007)
Recht, B., Ré, C.: Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences. JMLR Workshop Conf. Proc. 23, 11.1–11.24 (2012)
Recht, B., Ré, C.: Parallel stochastic gradient algorithms for large-scale matrix completion. Math. Program. Comput. 5(2), 201–226 (2013)
Recht, B., Ré, C., Wright, S., Niu, F.: Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In: Shawe-Taylor, J., Zemel, R.S., Bartlett, P.L., Pereira, F., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 24, pp. 693–701. Curran Associates Inc (2011)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Roux, N.L., Schmidt, M., Bach, F.R.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems 25, pp. 2663–2671. Curran Associates Inc, Red Hook (2012)
Roux, N.L., Schmidt, M., Bach, F.R.: A stochastic gradient method with an exponential convergence rate for finite training sets. In: Advances in Neural Information Processing Systems, pp. 2663–2671 (2012)
Shamir, O.: Open problem: is averaging needed for strongly convex stochastic gradient descent? In: Proceedings of the 25th Annual Conference on Learning Theory (COLT) (2012)
Sohl-Dickstein, J., Poole, B., Ganguli, S.: Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In: Jebara, T., Xing, E.P. (eds.) JMLR Workshop and Conference Proceedings, ICML, pp. 604–612 (2014)
Sparks, E.R., Talwalkar, A., Smith, V., Kottalam, J., Xinghao, P., Gonzalez, J., Franklin, M.J., Jordan, M.I., Kraska, T.: MLI: An API for distributed machine learning. In: IEEE 13th International Conference on Data Mining (ICDM), pp. 1187–1192 (2013)
Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algorithms. In: Proceedings of the Twenty-First International Conference on Machine Learning (ICML), p. 116. ACM, New York (2004)
Zhang, T.: A note on the non-commutative arithmetic-geometric mean inequality. arXiv preprint arXiv:1411.5058 (2014)
Acknowledgements
Mert Gurbuzbalaban acknowledges support from the grants NSF DMS-1723085 and NSF CCF-1814888.
Appendices
Proof of Theorem 2
Proof
By substituting the gradients of the component functions \(\nabla f_i(x) = P_i x - q_i\) into the RR iterations given by (5), we obtain the recursion
where \(P:= \sum _{i=1}^m P_i\) and
Since the component functions are quadratics, the optimal solution can be computed explicitly and is given by \(x^* = P^{-1}\sum _{i=1}^m q_i\). Then, it follows after a straightforward computation that (58) is equivalent to
We also have
where \(\mu _{\sigma _k}\) is defined by (23) with \(\sigma =\sigma _k\). Plugging this into (60),
Taking squared norms of both sides, taking conditional expectations, and using the fact that \(\mu _{\sigma _k}\) is bounded (see (26)), we obtain
where \(\mathbb {E}_{\sigma _k}\) denotes the expectation with respect to the random permutation \(\sigma _k\) and
It follows from the Cauchy–Schwarz inequality that for any \(\beta >0\),
and also
Plugging these bounds back into (61), using the lower bound (6) on the Hessian \(H_* = P\) and invoking the tower property of the expectations:
Plugging in \(\alpha _k = R/k^s\), it follows from Chung’s lemma [16, Lemma 4.2] that,
Next we choose \(\beta \) to get the best upper bound above. This is done by choosing \(\beta = c\) for \(0<s<1\) and choosing \(\beta = (Rc - 1)/R\) for \(s=1\) which yields
By Jensen’s inequality, we have \(\mathbb {E}(\text {dist}_{k+1}) \le \left( \mathbb {E}\left( \text {dist}_{k+1}^2 \right) \right) ^{1/2}\). Therefore, by taking square roots of both sides in (62) we conclude. \(\square \)
Technical lemmas for the proof of Theorem 3
The first lemma characterizes the worst-case distance of all the inner iterates of RR to the optimal solution \(x^*\). The quantity we want to upper bound is a random variable, but the upper bounds we obtain are deterministic and hold for every sample path. This lemma is based on Corollary 1 and uses the fact that the distances between the inner iterates are on the order of the stepsize.
Lemma 1
Under the conditions of Theorem 3 we have \({\max }_{0 \le i < m} \Vert x_i^k - x^*\Vert = \mathcal{O}(\frac{1}{k^s})\) where \(\mathcal{O}(\cdot )\) hides a constant that depends only on \({G_*}, L, m,c\) and R.
Proof
By Corollary 1,
where \(\mathcal{O}(\cdot )\) hides a constant that depends only on \({G_*}, L, m,R\) and c. We have also for any \(0\le i< m\) and \(k\ge 0\),
where we used the \(L\)-Lipschitzness of the gradient of \(f\), where \(L\) is given by (19). Using (64) and applying this inequality inductively for \(i=0,1,2,\dots , m-1\), we conclude. \(\square \)
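A small numerical illustration of the lemma, on hypothetical quadratic data: the worst inner-iterate distance within a cycle shrinks in proportion to the stepsize \(\alpha_k = R/k^s\). The constants below are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, R, s = 8, 4, 0.1, 0.75  # hypothetical problem and stepsize constants
P = [np.diag(rng.uniform(1.0, 2.0, size=n)) for _ in range(m)]
q = [rng.normal(size=n) for _ in range(m)]
x_star = np.linalg.solve(sum(P), sum(q))  # minimizer of f = sum_i f_i

def worst_inner_distance(num_cycles):
    """Run RR; return max_i ||x_i^k - x*|| over inner iterates of the last cycle."""
    x = np.zeros(n)
    worst = 0.0
    for k in range(1, num_cycles + 1):
        alpha = R / k**s
        worst = 0.0  # keep track of the final cycle only
        for i in rng.permutation(m):
            x = x - alpha * (P[i] @ x - q[i])
            worst = max(worst, np.linalg.norm(x - x_star))
    return worst

early, late = worst_inner_distance(50), worst_inner_distance(5000)
print(early, late)  # the worst-case inner distance decays with the stepsize
```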
The second lemma characterizes how fast, on average, the outer iterates move (when normalized by the stepsize) after a cycle of the RR algorithm. This is closely related to the magnitude of the gradients seen by the iterates and is fundamental for establishing the convergence rate of the averaged RR iterates in Theorem 3.
Lemma 2
Under the conditions of Theorem 3, consider the sequence
Then,
In the former case, \(\mathcal{O}(\cdot )\) hides a constant that depends only on \({G_*}, L,m, c, R, s,q\) and \(\text {dist}_0\). In the latter case, the same dependency on the constants occurs except that the dependency on \(\text {dist}_0\) can be removed.
Proof
It follows from summation by parts that for any \(\ell <k\),
Next, we investigate the asymptotic behavior of the terms on the right-hand side. A consequence of Corollary 1 and inequality (25) is that
and therefore
where \(\mathcal{O}(\cdot )\) hides a constant that depends only on \(L,{G_*}, c, m\) and s. Then, setting \(\ell = (1-q)k\) in (66), it follows that
We also have
where the second part follows from (67) with similar constants for the \(\mathcal{O}(\cdot )\) term. As the sequence \(\frac{1}{j+1}\) is monotonically decreasing, for any \(k>0\) we have the bounds
Note that when \(q=1\) this bound grows with k logarithmically whereas for \(q<1\) it does not grow with k. Then, combining (69), (70) and (71) we obtain
as desired which completes the proof. \(\square \)
Lemma 3
Let \(\sigma \) be a random permutation of \(\{1,2,\dots ,m\}\) sampled uniformly over the set of all permutations \(\varGamma \) defined by (4) and \(\mu (\sigma )\) be the vector defined by (23) that depends on \(\sigma \). Then,
where \(\mathbb {E}_{\sigma }\) denotes the expectation with respect to the random permutation \(\sigma \) and \(\bar{\mu }\) is defined by (29).
Proof
For any \(i \ne \ell \), the joint distribution of \((\sigma (i), \sigma (\ell ))\) is uniform over the set of all (ordered) pairs from \(\{1,2,\dots ,m\}\). Therefore, for any \(i \ne \ell \),
where we used the fact that \(\nabla f(x^*) = \sum _{j=1}^m \nabla f_j(x^*)=0\) by the first order optimality condition. Then, by taking the expectation of (74), we obtain
which completes the proof. \(\square \)
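The identity \(\mathbb{E}_\sigma \mu(\sigma) = \bar\mu\) can be checked exactly for a small instance by enumerating all \(m!\) permutations. The sketch below uses hypothetical quadratic data, writes \(\mu(\sigma)\) as in (74), and compares the exact permutation average against the closed form \(\bar\mu = \frac{1}{2}\sum_i P_i \nabla f_i(x^*)\) (the quadratic specialization of (29)):

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3  # small enough to enumerate all m! = 24 permutations exactly

# Hypothetical quadratic components f_i(x) = (1/2) x^T P_i x - q_i^T x
P = [np.diag(rng.uniform(1.0, 3.0, size=n)) for _ in range(m)]
q = [rng.normal(size=n) for _ in range(m)]
x_star = np.linalg.solve(sum(P), sum(q))
g = [P[i] @ x_star - q[i] for i in range(m)]  # grad f_i(x*); these sum to zero

def mu(sigma):
    """mu(sigma) = - sum_i P_{sigma(i)} sum_{l < i} grad f_{sigma(l)}(x*)."""
    out, partial = np.zeros(n), np.zeros(n)
    for idx in sigma:
        out -= P[idx] @ partial  # partial holds the running sum of earlier gradients
        partial = partial + g[idx]
    return out

avg = sum(mu(s) for s in itertools.permutations(range(m))) / math.factorial(m)
mu_bar = 0.5 * sum(P[i] @ g[i] for i in range(m))
print(np.allclose(avg, mu_bar))  # the exact average matches the closed form
```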
Lemma 4
Under the conditions of Theorem 3, the following statements are true:
-
(i)
We have
$$\begin{aligned} E_k = \alpha _k \mu (\sigma _k) + \mathcal{O}(\alpha _k^2), \quad k\ge 0, \end{aligned}$$(73)where \(E_k\) is the gradient error defined by (8), \(\mathcal{O}(\cdot )\) hides a constant that depends only on \({G_*}, L,m, R\) and c and
$$\begin{aligned} \mu (\sigma _k) = - \sum _{i=1}^{m} P_{{\sigma _k(i)}} \sum _{\ell =1}^{i-1} \nabla f_{{\sigma _k(\ell )}}(x^*) \end{aligned}$$(74)is a sequence of i.i.d. variables where the function \(\mu (\cdot )\) is defined by (23).
-
(ii)
For any \(0 <q\le 1 \), \(\lim _{k \rightarrow \infty } Y_{q,k}= \bar{\mu }\) a.s., where \( Y_{q,k} = \frac{\sum _{j=(1-q)k}^{k-1} E_j}{\sum _{j=(1-q)k}^{k-1} \alpha _j}.\)
-
(iii)
It holds that
$$\begin{aligned} \Vert \mu (\sigma _k) \Vert \le Lm G_*. \end{aligned}$$(75)
Proof
- (i):
-
As the component functions are quadratics, (8) becomes
$$\begin{aligned} E_k= & {} \sum _{i=1}^{m} P_{{\sigma _k(i)}}(x_{i-1}^k - x_0^k) = - \sum _{i=1}^{m} P_{{\sigma _k(i)}} \alpha _k \sum _{\ell =1}^{i-1} \nabla f_{{\sigma _k(\ell )}} (x_{\ell -1}^k), \end{aligned}$$where we can substitute
$$\begin{aligned} \nabla f_{ {\sigma _k(\ell )}} (x_{\ell -1}^k) = \nabla f_{ {\sigma _k(\ell )}} (x^*) + P_{{\sigma _k(\ell )}} (x_{\ell -1}^k - x^*). \end{aligned}$$(76)Then an application of Lemma 1 proves directly the desired result.
- (ii):
-
We introduce the normalized gradient error sequence \(Y_j = E_j / \alpha _j\). By part (i), \(Y_j = \mu (\sigma _j) + \mathcal{O}(\alpha _j)\) where \(\mu (\sigma _j)\) is a sequence of i.i.d. variables. By the strong law of large numbers, we have
$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\sum _{j=0}^{k-1} \mu (\sigma _j)}{k} = \mathbb {E}\mu (\sigma _j) = \bar{\mu } \quad \text{ a.s. }, \end{aligned}$$(77)where the last equality is by the definition of \(\bar{\mu }\). Therefore,
$$\begin{aligned} \lim _{k \rightarrow \infty } \frac{\sum _{j=0}^{k-1} Y_j}{k} = \lim _{k \rightarrow \infty } \bigg (\frac{\sum _{j=0}^{k-1} \mu (\sigma _j)}{k} + \frac{\sum _{j=0}^{k-1} \mathcal{O}(\alpha _j)}{k}\bigg ) = \bar{\mu } \quad \text{ a.s. }, \end{aligned}$$where we used the fact that the second term is negligible as \(\sum _{j=0}^{k-1} \alpha _j / k = \mathcal{O}(k^{-s}) \rightarrow 0\). As the average of the sequence \(Y_j\) converges almost surely, one can show that this implies almost sure convergence of a weighted average of the sequence \(Y_j\) as well, as long as the weights satisfy certain conditions as \(k \rightarrow \infty \). In particular, as the sequence \(\{\alpha _j\}\) is monotonically decreasing and is non-summable, by [15, Theorem 1],
$$\begin{aligned} \lim _{k \rightarrow \infty } Y_{1,k} = \lim _{k \rightarrow \infty }\frac{\sum _{j=0}^{k-1} \alpha _j Y_j}{\sum _{j=0}^{k-1} \alpha _j} = \lim _{k \rightarrow \infty } \frac{\sum _{j=0}^{k-1} E_j}{\sum _{j=0}^{k-1} \alpha _j} = \bar{\mu } \quad \text{ a.s. } \end{aligned}$$(78)This completes the proof for \(q=1\). For \(0<q<1\), by the definition of \(Y_{q,k}\), we can write \(Y_{1,k} = (1-w_k) Y_{q,k} + w_k Y_{1, (1-q)k}\) where the non-negative weights \(w_k\) satisfy
$$\begin{aligned} w_k = \frac{\sum _{j=0}^{(1-q)k - 1 } \alpha _j}{\sum _{j=0}^{k - 1 } \alpha _j } \rightarrow _{k \rightarrow \infty } (1-q)^{1-s} < 1. \end{aligned}$$As both \(Y_{1,k}\) and \(Y_{1, (1-q)k}\) go to \(\bar{\mu }\) a.s. by (78), it follows that
$$\begin{aligned} \lim _{k\rightarrow \infty } Y_{q,k} = \lim _{k\rightarrow \infty } \frac{Y_{1,k} - w_k Y_{1,(1-q)k}}{1-w_k} = \bar{\mu }\quad \text{ a.s. } \end{aligned}$$as well for any \(0<q<1\). This completes the proof.
- (iii):
-
This is a direct consequence of the triangle inequality applied to the definition (74) with \(L_i = \Vert P_i\Vert \) and \(L=\sum _{i=1}^m L_i\). \(\square \)
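Part (ii) above rests on the fact that almost-sure convergence of plain averages transfers to stepsize-weighted averages when the weights are decreasing and non-summable ([15, Theorem 1]). A quick numerical illustration with synthetic i.i.d. samples; the distribution and the constants \(R\), \(s\) are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
K, R, s = 200_000, 1.0, 0.75
Y = rng.normal(loc=3.0, scale=1.0, size=K)  # i.i.d. samples with mean 3
alpha = R / np.arange(1, K + 1) ** s        # decreasing, non-summable weights

plain = Y.mean()                            # ordinary average
weighted = (alpha * Y).sum() / alpha.sum()  # stepsize-weighted average
print(plain, weighted)  # both estimates are close to the common mean 3
```

The weighted average puts more mass on early samples, yet it still concentrates around the mean because \(\sum_j \alpha_j\) diverges while \(\sum_j \alpha_j^2\) converges for \(s \in (1/2, 1)\).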
Technical lemmas for the proof of Theorem 4
We first state a result which follows from adapting existing results from the literature to our setting. It extends Corollary 1 from quadratics to smooth functions.
Corollary 2
Under the setting of Theorem 4, we have
where the right-hand side is a deterministic sequence, \(M := Lm {G_*}\) and \({G_*}\) is defined by (24).
Proof
The result [21, Theorem 3.2] on the asymptotic convergence of incremental gradient implies that all the iterates converge to the optimum, i.e. \(x_i^k \rightarrow x^*\) for every fixed i as k goes to infinity. Let \(\mathcal {X}_\varepsilon \) be the closed \(\varepsilon \)-ball around the optimum, i.e. \(\mathcal {X}_\varepsilon := \{ x \in \mathbb {R}^n ~:~ \Vert x- x^*\Vert \le \varepsilon \}\). Clearly, the iterates will be contained in this ball when k is large enough, i.e. for every \(\varepsilon >0\) there exists \(k_0\) (that may depend on \(\varepsilon \)) such that \(x_i^k \in \mathcal {X}_\varepsilon \) for any \(k\ge k_0\) and for all \(i=1,2,\dots , m\). By [21, Theorem 3.2], we have also
where \( M_\varepsilon : = Lm G_\varepsilon \) and \(G_\varepsilon := \max _{1\le i \le m} \sup _{x\in \mathcal {X}_\varepsilon }\Vert \nabla f_i(x)\Vert \) is the largest norm of the gradients of the component functions on the compact set \(\mathcal {X}_\varepsilon \). If we let \(\varepsilon \) go to zero, we can replace \(G_\varepsilon \) with \(G_*=\max _{1\le i \le m} \Vert \nabla f_i(x^*)\Vert \) and \(M_\varepsilon \) with M in (79). This completes the proof. \(\square \)
Building on this corollary, we obtain the following results.
Lemma 5
Under the conditions of Theorem 4, all the conclusions of Lemma 1 remain valid.
Proof
The proof of Lemma 1 applies identically except that instead of Corollary 1 we use its extension Corollary 2. \(\square \)
Lemma 6
Under the conditions of Theorem 4, all the conclusions of Lemma 2 remain valid.
Proof
The proof of Lemma 2 applies identically with the only difference that the bound on \(\text {dist}_k = \Vert x_0^k - x^*\Vert \) is obtained from Corollary 2 instead of Corollary 1. \(\square \)
Lemma 7
Under the conditions of Theorem 4, the following statements are true:
-
(i)
We have
$$\begin{aligned} E_k = \alpha _k v(\sigma _k) + \mathcal{O}(\alpha _k^2), \quad k\ge 0, \end{aligned}$$(80)where \(\mathcal{O}(\cdot )\) hides a constant that depends only on \({G_*}, L,m, R, c\) and \(U\) and
$$\begin{aligned} v(\sigma _k) = - \sum _{i=1}^{m} \nabla ^2 f_{{\sigma _k(i)}} (x^*) \sum _{\ell =1}^{i-1} \nabla f_{{\sigma _k(\ell )}}(x^*). \end{aligned}$$ -
(ii)
It holds that
$$\begin{aligned} \Vert v(\sigma _k)\Vert \le LmG_*, \end{aligned}$$(81)where
$$\begin{aligned} \bar{v}:= \mathbb {E}v(\sigma _k) = {\sum _{i=1}^m \nabla ^2 f_i(x^*) \nabla f_i(x^*)}/{2}. \end{aligned}$$(82) -
(iii)
For any \(0 <q\le 1 \), \(\lim _{k \rightarrow \infty } Y_{q,k}= \bar{v}\) with probability one where
$$\begin{aligned} Y_{q,k} = \frac{\sum _{j=(1-q)k}^{k-1} E_j}{\sum _{j=(1-q)k}^{k-1} \alpha _j}. \end{aligned}$$(83)
Proof
For part (i), first we express \(E_k\) using the Taylor expansion and the Hessian Lipschitzness as
By Lemma 5, we have \(\Vert x_\ell ^k - x^*\Vert = \mathcal{O}(\alpha _k)\) with probability one. Then, by the gradient and Hessian Lipschitzness, we can substitute into the expression above
which implies directly Eq. (80). The rest of the proof for parts (ii) and (iii) is similar to the proof of Lemma 4 and is omitted. \(\square \)
Cite this article
Gürbüzbalaban, M., Ozdaglar, A. & Parrilo, P.A. Why random reshuffling beats stochastic gradient descent. Math. Program. 186, 49–84 (2021). https://doi.org/10.1007/s10107-019-01440-w