Abstract
Variance reduction (VR) methods for finite-sum minimization typically require knowledge of problem-dependent constants that are often unknown and difficult to estimate. To address this, we use ideas from adaptive gradient methods to propose AdaSVRG, a more robust variant of SVRG, a common VR method. AdaSVRG uses AdaGrad, a common adaptive gradient method, in the inner-loop of SVRG, making it robust to the choice of stepsize. When minimizing a sum of n smooth convex functions, we prove that a variant of AdaSVRG requires \(\tilde{O}(n + 1/\epsilon )\) gradient evaluations to achieve an \(O(\epsilon )\)-suboptimality, matching the typical rate, but without needing to know problem-dependent constants. Next, we show that the dynamics of AdaGrad exhibit a two-phase behaviour: the stepsize remains approximately constant in the first phase, and then decreases at an \(O\left( {1}/{\sqrt{t}}\right)\) rate. This result may be of independent interest, and allows us to propose a heuristic that adaptively determines the length of each inner-loop in AdaSVRG. Via experiments on synthetic and real-world datasets, we validate the robustness and effectiveness of AdaSVRG, demonstrating its superior performance over standard and other “tune-free” VR methods.
1 Introduction
Variance reduction (VR) methods (Schmidt et al., 2017; Konečný & Richtárik, 2013; Mairal, 2013; Shalev-Shwartz & Zhang, 2013; Johnson & Zhang, 2013; Mahdavi & Jin, 2013; Defazio et al., 2014; Nguyen et al., 2017) have proven to be an important class of algorithms for stochastic optimization. These methods take advantage of the finite-sum structure prevalent in machine learning problems, and have improved convergence over stochastic gradient descent (SGD) and its variants (see Gower et al., 2020 for a recent survey). For example, when minimizing a finite sum of n strongly-convex, smooth functions with condition number \(\kappa\), these methods typically require \(O\left( (\kappa + n) \, \log (1/\epsilon ) \right)\) gradient evaluations to obtain an \(\epsilon\)-error. This improves upon the complexity of full-batch gradient descent (GD), which requires \(O\left( \kappa n \, \log (1/\epsilon ) \right)\) gradient evaluations, and SGD, which has an \(O(\kappa /\epsilon )\) complexity. Moreover, there have been numerous VR methods that employ Nesterov acceleration (Allen-Zhu, 2017; Lan et al., 2019; Song et al., 2020) and can achieve even faster rates.
In order to guarantee convergence, VR methods require an easier-to-tune constant stepsize, whereas SGD needs a decreasing stepsize schedule. Consequently, VR methods are commonly used in practice, especially when training convex models such as logistic regression or conditional Markov random fields (Schmidt et al., 2015). However, all the above-mentioned VR methods require knowledge of the smoothness of the underlying function in order to set the stepsize. The smoothness constant is often unknown and difficult to estimate in practice. Although we can obtain global upper bounds on it for simple problems such as least squares regression, these bounds are usually too loose to be practically useful and result in suboptimal performance. Consequently, implementing VR methods requires a computationally expensive search over a range of stepsizes. Furthermore, a constant stepsize does not adapt to the function’s local smoothness and may lead to poor empirical performance.
Consequently, there have been a number of works that try to adapt the stepsize in VR methods. Schmidt et al. (2017) and Mairal (2013) employ stochastic line-search procedures to set the stepsize in VR algorithms. While they show promising empirical results using line-searches, these procedures have no theoretical convergence guarantees. Recent works (Tan et al., 2016; Li et al., 2020) propose to use the Barzilai-Borwein (BB) stepsize (Barzilai & Borwein, 1988) in conjunction with two common VR algorithms—stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013) and the stochastic recursive gradient algorithm (SARAH) (Nguyen et al., 2017). Both Tan et al. (2016) and Li et al. (2020) can automatically set the stepsize without requiring knowledge of problem-dependent constants. However, in order to prove theoretical guarantees for strongly-convex functions, these techniques require knowledge of both the smoothness and strong-convexity parameters. In fact, their guarantees require using a small \(O(1/\kappa ^2)\) stepsize, a highly-suboptimal choice in practice. Consequently, there is a gap between the theory and practice of adaptive VR methods. To address this, we make the following contributions.
1.1 Background and contributions
1.1.1 SVRG meets AdaGrad
In Sect. 3, we use AdaGrad (Duchi et al., 2011; Levy et al., 2018), an adaptive gradient method, with stochastic variance reduction techniques. We focus on SVRG (Johnson & Zhang, 2013) and propose to use AdaGrad within its inner-loop. We analyze the convergence of the resulting AdaSVRG algorithm for minimizing convex functions (without strong-convexity). Using O(n) inner-loops for every outer-loop (a typical setting used in practice; Babanezhad Harikandeh et al., 2015; Sebbouh et al., 2019), and any bounded stepsize, we prove that AdaSVRG achieves an \(\epsilon\)-error (for \(\epsilon = \Omega (1/n)\)) with \(O(n/\epsilon )\) gradient evaluations (Theorem 1). This rate matches that of SVRG with a constant stepsize and O(n) inner-loops (Reddi et al., 2016, Corollary 10). However, unlike Reddi et al. (2016), our result does not require knowledge of the smoothness constant in order to set the stepsize. We note that other previous works (Cutkosky & Orabona, 2019; Liu et al., 2020) consider adaptive methods with variance reduction for non-convex minimization; however, their algorithms still require knowledge of problem-dependent parameters.
1.1.2 Multistage AdaSVRG
We propose a multistage variant of AdaSVRG where each stage involves running AdaSVRG for a fixed number of inner- and outer-loops. In particular, multistage AdaSVRG uses a fixed number of outer-loops and doubles the length of the inner-loop across stages. We prove that it requires \(O((n + 1/\epsilon ) \log (1/\epsilon ))\) gradient evaluations to reach an \(O(\epsilon )\)-error (Theorem 2). This improves upon the complexity of decreasing stepsize SVRG, which requires \(O(n + \sqrt{n}/\epsilon )\) gradient evaluations (Reddi et al., 2016, Corollary 9), and matches the rate of SARAH (Nguyen et al., 2017).
After our work was made publicly available (Dubois-Taine et al., 2021), recent work (Zhou et al., 2021) improved upon our result by applying a similar idea to an accelerated variant of SVRG (Allen-Zhu, 2017). Their algorithm requires \(\tilde{O}(n + \sqrt{n/\epsilon })\) gradient evaluations to obtain an \(\epsilon\)-error without knowledge of problem-dependent constants.
1.1.3 AdaSVRG with adaptive termination
Instead of using a complex multistage procedure, we prove that AdaSVRG can also achieve the improved \(O((n + 1/\epsilon ) \log (1/\epsilon ))\) gradient evaluation complexity by adaptively terminating its inner-loop (Sect. 4). However, the adaptive termination requires knowledge of problem-dependent constants, limiting its practical use.
To address this, we use the favourable properties of AdaGrad to design a practical heuristic for adaptively terminating the inner-loop. Our technique for adaptive termination is related to heuristics (Pflug, 1983; Yaida, 2018; Lang et al., 2019; Pesme et al., 2020) that detect stalling for constant stepsize SGD, and may be of independent interest. First, we show that when minimizing smooth convex losses, AdaGrad has a two-phase behaviour: a first “deterministic” phase where the stepsize remains approximately constant, followed by a second “stochastic” phase where the stepsize decreases at an \(O(1/\sqrt{t})\) rate (Theorem 4). We show that it is empirically possible to efficiently detect this phase transition, and aim to terminate the AdaSVRG inner-loop when AdaGrad enters the stochastic phase.
1.1.4 Practical considerations and experimental evaluation
In Sect. 5, we describe some of the practical considerations for implementing AdaSVRG and the adaptive termination heuristic. We use standard real-world datasets to empirically verify the robustness and effectiveness of AdaSVRG. Across datasets, we demonstrate that AdaSVRG consistently outperforms variants of SVRG, SARAH, and methods based on the BB stepsize (Tan et al., 2016; Li et al., 2020).
1.1.5 Adaptivity to overparameterization
Defazio and Bottou (2019) demonstrated the ineffectiveness of SVRG when training large overparameterized models such as deep neural networks. We argue that this ineffectiveness can be partially explained by the interpolation property satisfied by overparameterized models (Schmidt & Le Roux, 2013; Ma et al., 2018; Vaswani et al., 2019a). In the interpolation setting, SGD obtains an \(O(1/\epsilon )\) gradient complexity when minimizing smooth convex functions (Vaswani et al., 2019a), thus outperforming typical VR methods. However, interpolation is rarely exactly satisfied in practice, and using SGD can result in oscillations around the solution. On the other hand, although VR methods have a slower convergence, they do not oscillate, regardless of interpolation. In Sect. 6, we use AdaGrad to exploit the (approximate) interpolation property, and employ the above heuristic to adaptively switch to AdaSVRG, thus avoiding oscillatory behaviour. We design synthetic problems controlling the extent of interpolation and show that the hybrid AdaGrad-AdaSVRG algorithm can match or outperform both stochastic gradient and VR methods, thus achieving the best of both worlds.
2 Problem setup
We consider the minimization of an objective \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) with a finite-sum structure,

$$\min _{w \in X} f(w) :=\frac{1}{n} \sum _{i=1}^{n} f_i(w),$$
where X is a convex compact set of diameter D, meaning \(\sup _{x, y \in X}\left\Vert x - y\right\Vert \le D\). Problems with this structure are prevalent in machine learning. For example, in supervised learning, n represents the number of training examples, and \(f_i\) is the loss function when classifying or regressing to training example i. Throughout this paper, we assume f and each \(f_i\) are differentiable. We assume that f is convex; together with the compactness of X, this implies the existence of a minimizer \(w^{*}\in X\), and we define \(f^*:= f(w^{*})\). Interestingly, we do not need each \(f_i\) to be convex. We further assume that each function \(f_{i}\) in the finite-sum is \(L_i\)-smooth, implying that f is \(L_{\max }\)-smooth, where \(L_{\max } = \max _{i} L_i\). We include the formal definitions of these properties in Appendix “Definitions”.
The classical method for solving such a problem is stochastic gradient descent (SGD). Starting from the iterate \(x_0\), at each iteration t SGD samples (typically uniformly at random) a loss function \(f_{i_t}\) and takes a step in the negative direction of the stochastic gradient \(\nabla f_{i_t}(x_t)\) using a stepsize \(\eta _t\). This update can be expressed as

$$x_{t+1} = x_t - \eta _t \nabla f_{i_t}(x_t).$$
In order to ensure convergence to the minimizer, the sequence of stepsizes in SGD needs to be decreasing, typically at an \(O(1/\sqrt{t})\) rate (Moulines & Bach, 2011). This has the effect of slowing down the convergence and results in an \(\Theta (1/\sqrt{t})\) convergence to the minimizer for convex functions (compared to the O(1/t) convergence for gradient descent). Variance reduction methods were developed to overcome this slower convergence by exploiting the finitesum structure of the objective.
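The SGD update with a decreasing \(O(1/\sqrt{t})\) stepsize schedule can be sketched as follows; this is a minimal illustration, and the function names, the `grad_i` oracle interface, and the schedule constant are our choices, not from the paper.

```python
import numpy as np

def sgd(grad_i, n, x0, steps, eta0=1.0, seed=0):
    """Plain SGD on f(x) = (1/n) sum_i f_i(x) with an O(1/sqrt(t)) stepsize.

    `grad_i(i, x)` is assumed to return the gradient of f_i at x; the
    decaying schedule eta_t = eta0 / sqrt(t) is the standard choice that
    guarantees convergence (at the slower Theta(1/sqrt(t)) rate).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for t in range(1, steps + 1):
        i = rng.integers(n)                 # sample a loss uniformly at random
        eta = eta0 / np.sqrt(t)             # decreasing stepsize
        x = x - eta * grad_i(i, x)
    return x
```

For instance, on a one-dimensional least-squares problem where every \(f_i\) shares the same minimizer, this sketch converges to that minimizer.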
We focus on the SVRG algorithm (Johnson & Zhang, 2013) since it is more memory efficient than other variance reduction alternatives like SAG (Schmidt et al., 2017) or SAGA (Defazio et al., 2014). SVRG has a nested inner-outer loop structure. In every outer-loop k, it computes the full gradient \(\nabla f(w_{k})\) at a snapshot point \(w_{k}\). An outer-loop k consists of \(m_k\) inner-loops indexed by \(t = 1, 2, \ldots m_k\), and the inner-loop iterate \(x_1\) is initialized to \(w_{k}\). In outer-loop k and inner-loop t, SVRG samples an example \(i_t\) (typically uniformly at random) and takes a step in the direction of the variance-reduced gradient \(g_t\) using a constant stepsize \(\eta\). This update can be expressed as:

$$g_t = \nabla f_{i_t}(x_t) - \nabla f_{i_t}(w_{k}) + \nabla f(w_{k}), \qquad x_{t+1} = \Pi _{X}\left( x_t - \eta \, g_t\right) ,$$
where \(\Pi _{X}\) denotes the Euclidean projection onto the set X. The variance-reduced gradient is unbiased, meaning that \(\mathbb {E}_{i_t}[g_t \vert x_t] = \nabla f(x_t)\). At the end of the inner-loop, the next snapshot point is typically set to either the last or averaged iterate of the inner-loop.
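A minimal sketch of the SVRG update just described, under two simplifying assumptions stated here rather than in the paper: there is no projection (i.e., \(X = \mathbb{R}^d\)), and the next snapshot is the last inner-loop iterate. The signature is illustrative.

```python
import numpy as np

def svrg(grad_i, full_grad, n, w0, eta, m, outer_loops, seed=0):
    """SVRG with constant stepsize `eta` and inner-loop length `m`.

    `grad_i(i, x)` returns the gradient of f_i at x and `full_grad(w)`
    returns the full gradient of f at w (assumed oracle interfaces).
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    for _ in range(outer_loops):
        mu = full_grad(w)                       # full gradient at the snapshot
        x = w.copy()
        for _ in range(m):
            i = rng.integers(n)
            g = grad_i(i, x) - grad_i(i, w) + mu  # variance-reduced gradient
            x = x - eta * g
        w = x                                   # next snapshot = last iterate
    return w
```

On a toy one-dimensional least-squares problem, the variance-reduced steps converge to the minimizer with a constant stepsize, unlike plain SGD which would require a decreasing schedule in general.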
SVRG requires knowledge of both the strong-convexity and smoothness constants in order to set the stepsize and the number of inner-loops. These requirements were relaxed by Hofmann et al. (2015), Kovalev et al. (2020) and Gower et al. (2020), which only require knowledge of the smoothness constant.
In order to set the stepsize for SVRG without requiring knowledge of the smoothness, line-search techniques are an attractive option. Such techniques are a common approach to automatically set the stepsize for (stochastic) gradient descent (Armijo, 1966; Vaswani et al., 2019b). However, we show that an intuitive Armijo-like line-search to set the SVRG stepsize is not guaranteed to converge to the solution. Specifically, we prove the following proposition in Appendix “Counterexample for line-search for SVRG”.
Proposition 1
If in each inner-loop t of SVRG, \(\eta _t\) is set as the largest stepsize satisfying the condition: \(\eta _t \le \eta _{{\max}}\) and

$$f_{i_t}\left( x_t - \eta _t \, g_t\right) \le f_{i_t}(x_t) - c \, \eta _t \Vert g_t\Vert ^2,$$
then for any \(c > 0\), \(\eta _{{\max}} > 0\), there exists a 1-dimensional convex smooth function such that if \(\vert x_t - w^{*}\vert \le \min \{ \frac{1}{c}, 1\}\), then \(\vert x_{t+1} - w^{*}\vert \ge \vert x_t - w^{*}\vert\), implying that the update moves the iterate away from the solution when it is close to it, preventing convergence.
In the next section, we suggest a novel approach using AdaGrad (Duchi et al., 2011) to propose AdaSVRG, a provably-convergent VR method that is more robust to the choice of stepsize. To justify our decision to use AdaGrad, we note that, in general, there are (roughly) three common ways of designing methods that do not require knowledge of problem-dependent constants: (i) the BB stepsize, which still requires knowledge of \(L_{{\max}}\) to guarantee convergence in the VR setting (Tan et al., 2016; Li et al., 2020); (ii) line-search methods, which can fail to converge in the VR setting (Proposition 1); and (iii) adaptive gradient methods such as AdaGrad.
3 Adaptive SVRG
Like SVRG, AdaSVRG has a nested inner-outer loop structure and relies on computing the full gradient in every outer-loop. However, it uses AdaGrad in the inner-loop, using the variance-reduced gradient \(g_t\) to update the preconditioner \(A_t\) in inner-loop t. AdaSVRG computes the stepsize \(\eta _k\) in every outer-loop (see Sect. 5 for details) and uses a preconditioned variance-reduced gradient step to update the inner-loop iterates: \(x_{t+1} = \Pi _{X}\left( x_t - \eta _{k} A_t^{-1} g_t\right)\). AdaSVRG then sets the next snapshot \(w_{k+1}\) to be the average of the inner-loop iterates.
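A single outer-loop of the scalar variant can be sketched as follows, where the scalar preconditioner accumulates the squared norms of the variance-reduced gradients. We omit the projection and add a small constant inside the square root for numerical stability; both are implementation choices of this sketch, not part of Algorithm 1, and the oracle interfaces are illustrative.

```python
import numpy as np

def adasvrg_outer(grad_i, full_grad, n, w, eta, m, seed=0):
    """One outer-loop of AdaSVRG (scalar AdaGrad variant), as a sketch.

    `grad_i(i, x)` and `full_grad(w)` are assumed gradient oracles. The
    returned snapshot is the average of the inner-loop iterates, as in
    Algorithm 1. No projection is applied (X = R^d).
    """
    rng = np.random.default_rng(seed)
    mu = full_grad(w)
    x = w.copy()
    G = 0.0                                  # accumulates ||g_1||^2 + ... + ||g_t||^2
    avg = np.zeros_like(x)
    for _ in range(m):
        i = rng.integers(n)
        g = grad_i(i, x) - grad_i(i, w) + mu   # variance-reduced gradient
        G += float(g @ g)
        x = x - eta * g / np.sqrt(G + 1e-12)   # scalar AdaGrad step, A_t = sqrt(G_t) I
        avg += x
    return avg / m
```

Calling this repeatedly, with each outer-loop initialized at the previous snapshot, gives the full algorithm.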
We now analyze the convergence of AdaSVRG (see Algorithm 1 for the pseudocode). Throughout the main paper, we will only focus on the scalar variant (Ward et al., 2019) of AdaGrad. We defer the general diagonal and matrix variants (see Appendix “Algorithm in general case” for the pseudocode) and their corresponding theory to the Appendix. We start with the analysis of a single outer-loop, and prove the following lemma in Appendix “Proof of Lemma 1”.
Lemma 1
(AdaSVRG with single outer-loop) Assume (i) convexity of f, (ii) \(L_{\max }\)-smoothness of \(f_i\) and (iii) a bounded feasible set with diameter D. Defining \(\rho := \big ( \frac{D^2}{\eta _{k}} + 2\eta _{k}\big ) \sqrt{L_{{\max}}}\), for any outer-loop k of AdaSVRG, with (a) inner-loop length \(m_k\) and (b) stepsize \(\eta _{k}\),

$$\mathbb {E}\left[ f(w_{k+1})\right] - f^* \le \rho \, \sqrt{\frac{f(w_{k}) - f^*}{m_k}}.$$
The proof of the above lemma leverages the theoretical results of AdaGrad (Duchi et al., 2011; Levy et al., 2018). Specifically, the standard AdaGrad analysis bounds the “noise” term by the variance in the stochastic gradients. On the other hand, we use the properties of the variance-reduced gradient in order to upper-bound the noise in terms of the function suboptimality.
Lemma 1 shows that a single outer-loop of AdaSVRG converges to the minimizer as \(O(1/\sqrt{m})\), where m is the number of inner-loop iterations. This implies that in order to obtain an \(\epsilon\)-error, a single outer-loop of AdaSVRG requires \(O(n + 1/\epsilon ^2)\) gradient evaluations. This result holds for any bounded stepsize and requires setting \(m = O(1/\epsilon ^2)\). This “single outer-loop convergence” property of AdaSVRG is unlike SVRG or any of its variants; running only a single loop of SVRG is ineffective, as it stops making progress at some point, resulting in the iterates oscillating in a neighbourhood of the solution. The favourable behaviour of AdaSVRG is similar to SARAH, but unlike SARAH, the above result does not require computing a recursive gradient or knowing the smoothness constant.
Next, we consider the convergence of AdaSVRG with a fixed-size inner-loop and multiple outer-loops. In the following theorems, we assume a bounded range of stepsizes, implying that for all k, \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\). For brevity, similar to Lemma 1, we define \(\rho :=\big ( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\big ) \sqrt{L_{{\max}}}\).
Theorem 1
(AdaSVRG with fixed-size inner-loop) Under the same assumptions as Lemma 1, AdaSVRG with (a) stepsizes \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\), (b) inner-loop size \(m_k = n\) for all k, results in the following convergence rate after \(K \le n\) outer-loops:

$$\mathbb {E}\left[ f(\bar{w}_K)\right] - f^* = O\left( \frac{\rho ^2 + \rho \sqrt{f(w_{0}) - f^*}}{K}\right) ,$$
where \(\bar{w}_K = \frac{1}{K} \sum _{k = 1}^{K} w_{k}\).
The proof (refer to Appendix “Proof of Theorem 1”) recursively applies the result of Lemma 1 over the K outer-loops.
The above result requires a fixed inner-loop size \(m_k = n\), a setting typically used in practice (Babanezhad Harikandeh et al., 2015; Gower et al., 2020). Notice that the result holds only when \(K \le n\). Since K is the number of outer-loops, it is typically much smaller than n, the number of functions in the finite sum, justifying the theorem’s \(K \le n\) requirement. Moreover, in terms of generalization error, it is not necessary to optimize below an O(1/n) accuracy (Boucheron et al., 2005; Sridharan et al., 2008).
Theorem 1 implies that AdaSVRG can reach an \(\epsilon\)-error (for \(\epsilon = \Omega (1/n)\)) using \(\mathcal O(n/\epsilon )\) gradient evaluations. This result matches the complexity of constant stepsize SVRG (with \(m_k = n\)) of Reddi et al. (2016, Corollary 10), but without requiring knowledge of the smoothness constant. However, unlike SVRG and SARAH, the convergence rate depends on the diameter D rather than \(\Vert w_0 - w^{*}\Vert\), the initial distance to the solution. This dependence arises from the use of AdaGrad in the inner-loop, and is necessary for adaptive gradient methods. Specifically, Cutkosky and Boahen (2017) prove that any method adaptive to problem-dependent constants necessarily incurs such a dependence on the diameter. Hence, the diameter dependence can be considered the “cost” of not knowing the problem-dependent constants.
Since the above result only holds for \(\epsilon = \Omega (1/n)\), we propose a multistage variant (Algorithm 2) of AdaSVRG that requires \(O((n+1/\epsilon )\log (1/\epsilon ))\) gradient evaluations to attain an \(O(\epsilon )\)-error for any \(\epsilon\). To reach a target suboptimality of \(\epsilon\), we consider \(I = \log (1/\epsilon )\) stages. For each stage i, Algorithm 2 uses a fixed number of outer-loops K and \(m^i\) inner-loops, with stage i initialized to the output of the \((i-1)\)-th stage. In Appendix “Proof of Theorem 2”, we prove the following rate for multistage AdaSVRG.
Theorem 2
(Multistage AdaSVRG) Under the same assumptions as Theorem 1, multistage AdaSVRG with \(I = \log (1/\epsilon )\) stages, \(K \ge 3\) outer-loops and \(m^i = 2^{i+1}\) inner-loops at stage i, requires \(O(n\log {1}/{\epsilon } + {1}/{\epsilon })\) gradient evaluations to reach a \(\left( \rho ^2 (1+\sqrt{5}) \right) \epsilon\)-suboptimality.
We see that multistage AdaSVRG matches the convergence rate of SARAH (up to constants), but does so without requiring knowledge of the smoothness constant to set the stepsize. Observe that the number of inner-loops increases with the stage, i.e., \(m^i = 2^{i+1}\). The intuition behind this is that the convergence of AdaGrad (used in the k-th inner-loop of AdaSVRG) is slowed down by a “noise” term proportional to \(f(w_k) - f^*\) (see Lemma 1). When this “noise” term is large in the earlier stages of multistage AdaSVRG, the inner-loops have to be short in order to maintain the overall \(O(1/{\epsilon })\) convergence. However, as the stages progress and the suboptimality decreases, the “noise” term becomes smaller, and the algorithm can use longer inner-loops, which reduces the number of full gradient computations, resulting in the desired convergence rate.
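To make the stage schedule concrete, the following sketch computes the inner-loop lengths \(m^i = 2^{i+1}\) and the resulting gradient-evaluation count. The accounting (n evaluations per full gradient, two stochastic gradient evaluations per inner step) and the base-2 logarithm for the number of stages are our assumptions, chosen to be consistent with the setup of Theorem 2.

```python
import math

def multistage_schedule(n, eps, K=3):
    """Stage schedule for multistage AdaSVRG (a sketch of Algorithm 2's loop sizes).

    Stage i (for i = 1, ..., I with I = ceil(log2(1/eps))) uses K outer-loops,
    each with m_i = 2^(i+1) inner-loops. Each outer-loop costs n gradient
    evaluations for the full gradient plus 2 * m_i for the inner steps.
    Returns (per-stage inner-loop lengths, total gradient evaluations).
    """
    stages = math.ceil(math.log2(1.0 / eps))
    inner = [2 ** (i + 1) for i in range(1, stages + 1)]
    total = sum(K * (n + 2 * m) for m in inner)
    return inner, total
```

Since \(\sum_i 2 m^i = O(1/\epsilon)\) and the full-gradient cost is \(K n I = O(n \log (1/\epsilon))\), the total matches the \(O(n\log {1}/{\epsilon } + {1}/{\epsilon })\) count in Theorem 2.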
Thus far, we have focused on using AdaSVRG with fixed-size inner-loops. Next, we consider variants that can adaptively determine the inner-loop size.
4 Adaptive termination of the inner-loop
Recall that the convergence of a single outer-loop k of AdaSVRG (Lemma 1) is slowed down by the \(\sqrt{ {\left( f(w_k) - f^*\right)}/{m_k}}\) term. Similar to the multistage variant, the suboptimality \(f(w_k) - f^*\) decreases as AdaSVRG progresses. This allows the use of longer inner-loops as k increases, resulting in fewer full-gradient evaluations. We instantiate this idea by setting \(m_k = O\left( {1}/{\left( f(w_k) - f^*\right) }\right)\). Since this choice requires the knowledge of \(f(w_k) - f^*\), we alternatively consider using \(m_k = O(1/\epsilon )\), where \(\epsilon\) is the desired suboptimality. We prove the following theorem in Appendix “Proof of Theorem 3”.
Theorem 3
(AdaSVRG with adaptive-sized inner-loops) Under the same assumptions as Lemma 1, AdaSVRG with (a) stepsizes \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\), (b1) inner-loop size \(m_k = \frac{4\rho ^2}{\epsilon }\) for all k or (b2) inner-loop size \(m_k = \frac{4\rho ^2}{f(w_k) - f^*}\) for outer-loop k, results in the following convergence rate:

$$\mathbb {E}\left[ f(w_{K})\right] - f^* \le \max \left\{ \frac{f(w_{0}) - f^*}{2^{K}}, \, \epsilon \right\} .$$
The above result implies linear convergence in the number of outer-loops, but each outer-loop requires \(O(1/\epsilon )\) inner-loop iterations. Hence, Theorem 3 implies that AdaSVRG with adaptive-sized inner-loops requires \(O\left( (n + 1/\epsilon ) \log (1/\epsilon ) \right)\) gradient evaluations to reach an \(\epsilon\)-error. This improves upon the rate of SVRG and matches the convergence rate of SARAH, which also requires inner-loops of length \(O(1/\epsilon )\). Compared to Theorem 1, which guarantees convergence of the average iterate \(\bar{w}_K\), Theorem 3 gives the desired convergence for the last outer-loop iterate \(w_{K}\), and also holds for any bounded sequence of stepsizes. However, unlike Theorem 1, this result (with either setting of \(m_k\)) requires knowledge of the problem-dependent constants in \(\rho\).
To address this issue, we design a heuristic for adaptive termination in the next sections. We start by describing the two-phase behaviour of AdaGrad and subsequently utilize it for adaptive termination in AdaSVRG.
4.1 Two-phase behaviour of AdaGrad
Diagnostic tests (Pflug, 1983; Yaida, 2018; Lang et al., 2019; Pesme et al., 2020) study the behaviour of the SGD dynamics to automatically control its stepsize. Similarly, designing the adaptive termination test requires characterizing the behaviour of AdaGrad used in the inner-loop of AdaSVRG.
We first investigate the dynamics of constant stepsize AdaGrad in the stochastic setting. Specifically, we monitor the evolution of \(\sqrt{G_t}\) across iterations, where \(G_t\) denotes the accumulated sum of squared stochastic gradient norms that defines the (scalar) AdaGrad preconditioner. We define \(\sigma ^2 :=\sup _{x \in X} \mathbb E_{i} \Vert \nabla f_i(x) - \nabla f(x) \Vert ^2\) as a uniform upper bound on the variance of the stochastic gradients over all iterates. We prove the following theorem showing that there exists an iteration \(T_0\) at which the evolution of \(\sqrt{G_t}\) undergoes a phase transition. Note again that such a phase transition happens for the scalar, diagonal and matrix variants of AdaGrad. We only present the scalar result in the main paper and defer the other variants to the Appendix.
Theorem 4
(Phase Transition in AdaGrad Dynamics) Under the same assumptions as Lemma 1 and (iv) \(\sigma ^2\)-bounded stochastic gradient variance, and defining \(T_0 = \frac{\rho ^2 L_{{\max}}}{\sigma ^2}\), for constant stepsize AdaGrad we have \(\mathbb {E}[\sqrt{G_t}] = O(1) \text { for } t \le T_0\), and \(\mathbb {E}[\sqrt{G_t}] = O (\sqrt{t - T_0}) \text { for } t \ge T_0\).
Theorem 4 (proved in Appendix “Proof of Theorem 4”) indicates that \(G_t\) is bounded by a constant for all \(t \le T_0\), implying that its rate of growth is slower than \(\log (t)\). This implies that the stepsize of AdaGrad is approximately constant (similar to gradient descent in the full-batch setting) in this first phase until iteration \(T_0\). Indeed, if \(\sigma = 0\), then \(T_0 = \infty\) and AdaGrad is always in this deterministic phase. This result generalizes Qian and Qian (2019, Theorem 3.1), which analyzes the diagonal variant of AdaGrad in the deterministic setting. After iteration \(T_0\), the noise \(\sigma ^2\) starts to dominate, and AdaGrad transitions into the stochastic phase where \(G_t\) grows as O(t). In this phase, the stepsize decreases as \(O(1/\sqrt{t})\), resulting in slower convergence to the minimizer. AdaGrad thus results in an overall \(O \left( {1}/{T} + {\sigma ^2}/{\sqrt{T}} \right)\) rate (Levy et al., 2018), where the first term corresponds to the deterministic phase and the second to the stochastic phase.
Since the exact detection of this phase transition is not possible, we design a heuristic to detect it without requiring the knowledge of problemdependent constants.
4.2 Heuristic for adaptive termination
Similar to tests used to detect stalling for SGD (Pflug, 1983; Pesme et al., 2020), the proposed diagnostic test has a burn-in phase of \({n}/{2}\) inner-loop iterations that allows the initial AdaGrad dynamics to stabilize. After this burn-in phase, for every even iteration t, we compute the ratio \(R = \frac{G_{t} - G_{t/2}}{G_{t/2}}\). Given a threshold hyperparameter \(\theta\), the test terminates the inner-loop when \(R \ge \theta\). In the first, deterministic phase, since the growth of \(G_{t}\) is slow, \(G_{t} \approx G_{t/2}\) and \(R \approx 0\). In the stochastic phase, \(G_{t} = O(t)\) and \(R \approx 1\), justifying that the test can distinguish between the two phases. AdaSVRG with this test is fully specified in Algorithm 3. Experimentally, we use \(\theta = 0.5\) to give an early indication of the phase transition.
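The ratio test itself is simple to implement; a minimal sketch (with an illustrative function name) is:

```python
def phase_transition_detected(G_history, t, theta=0.5):
    """Ratio test R = (G_t - G_{t/2}) / G_{t/2} for detecting the stochastic phase.

    `G_history[s]` is assumed to hold the accumulated sum of squared gradient
    norms after s inner iterations; the test is meant to be called at even t
    after the burn-in phase. Returns True when R >= theta, i.e. when G_t has
    grown enough relative to G_{t/2} to suggest the O(t) stochastic regime.
    """
    half = t // 2
    if G_history[half] == 0.0:
        return False                       # no signal yet; keep iterating
    R = (G_history[t] - G_history[half]) / G_history[half]
    return R >= theta
```

In the deterministic phase \(G_t \approx G_{t/2}\) so the test returns False; once \(G_t\) grows linearly, \(R \approx 1 \ge \theta\) and the inner-loop is terminated.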
5 Experimental evaluation of AdaSVRG
We first describe the practical considerations for implementing AdaSVRG and then evaluate its performance on real and synthetic datasets. The code to reproduce the experiments can be found at the following link: https://github.com/bpauld/AdaSVRG. We do not use projections in our experiments since these problems have an unconstrained \(w^*\) with finite norm (we thus assume D is large enough to include it), and we empirically observed that the iterates always stayed bounded, so no projection was required.
5.1 Implementing AdaSVRG
Though our theoretical results hold for any bounded sequence of stepsizes, this choice affects the practical performance of AdaGrad (Vaswani et al., 2020) (and hence AdaSVRG). Theoretically, the optimal stepsize minimizing the bound in Lemma 1 is \(\eta ^* = \frac{D}{\sqrt{2}}\). Since we do not have access to D, we use the following heuristic to set the stepsize for each outer-loop of AdaSVRG. In outer-loop k, we approximate D by \(\Vert w_{k}- w^{*}\Vert\), which can be lower-bounded using the co-coercivity of smooth convex functions as \(\Vert w_{k}- w^{*}\Vert \ge {1}/{L_{{\max}}} \Vert \nabla f(w_k) \Vert\) (Nesterov, 2004, Thm. 2.1.5 (2.1.8)). We have access to \(\nabla f(w_k)\) for the current outer-loop, and store the value of \(\nabla f(w_{k-1})\) in order to approximate the smoothness constant. Specifically, by co-coercivity, \(L_{{\max}} \ge L_k :=\frac{\Vert \nabla f(w_k) - \nabla f(w_{k-1})\Vert }{\Vert w_k - w_{k-1}\Vert }\). Putting these together, \(\eta _{k}= \frac{\Vert \nabla f(w_k)\Vert }{\sqrt{2} \, \max _{i=0, \dots , k} L_i}\). Although a similar heuristic could be used to estimate \(L_{{\max}}\) for SVRG or SARAH, the resulting stepsize is larger than \(1/L_{{\max}}\), implying that it would not have any theoretical guarantee, while our results hold for any bounded sequence of stepsizes. Although Algorithm 1 requires setting \(w_{k+1}\) to be the average of the inner-loop iterates, we use the last iterate and set \(w_{k}= x_{m_k}\), as this is a more common choice (Johnson & Zhang, 2013; Tan et al., 2016) and results in better empirical performance. We compare two variants of AdaSVRG, with (i) a fixed-size inner-loop (Algorithm 1) and (ii) adaptive termination (Algorithm 3). We handle a general batch-size b, and set \(m = {n}/{b}\) for Algorithm 1. This is a common practical choice (Babanezhad Harikandeh et al., 2015; Gower et al., 2020; Kovalev et al., 2020). For Algorithm 3, the burn-in phase consists of \({n}/{2b}\) iterations and \(M = {10n}/{b}\).
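The stepsize heuristic above can be sketched as follows; the function signature and the running-max bookkeeping for the \(L_i\) estimates are illustrative choices of this sketch.

```python
import numpy as np

def stepsize_heuristic(grad_k, grad_prev, w_k, w_prev, L_running_max):
    """Outer-loop stepsize eta_k = ||grad f(w_k)|| / (sqrt(2) * max_i L_i).

    A sketch of the heuristic above: L_k estimates the smoothness constant
    from consecutive snapshots via
        L_k = ||grad f(w_k) - grad f(w_{k-1})|| / ||w_k - w_{k-1}||,
    and the running maximum of these estimates stands in for L_max.
    Returns (eta_k, updated running maximum).
    """
    L_k = np.linalg.norm(grad_k - grad_prev) / np.linalg.norm(w_k - w_prev)
    L_max_est = max(L_running_max, L_k)
    eta_k = np.linalg.norm(grad_k) / (np.sqrt(2.0) * L_max_est)
    return eta_k, L_max_est
```

For a quadratic \(f(w) = \frac{L}{2}\Vert w\Vert ^2\), the estimate \(L_k\) recovers L exactly, so the heuristic returns \(\eta _k = \Vert \nabla f(w_k)\Vert /(\sqrt{2} L)\).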
5.2 Evaluating AdaSVRG
In order to assess the effectiveness of AdaSVRG, we experiment with binary classification on standard LIBSVM datasets (Chang & Lin, 2011). In particular, we consider \(\ell _2\)-regularized problems (with the regularization parameter set to \({1}/{n}\)) with three losses: the logistic, squared, and Huber losses. For each experiment, we plot the median and standard deviation across 5 independent runs. In the main paper, we show the results for four of the datasets and relegate the results for the three others to Appendix “Additional experiments”. Similarly, we consider batch-sizes in [1, 8, 64, 128], but only show the results for \(b = 64\) in the main paper.
We compare the AdaSVRG variants against SVRG (Johnson & Zhang, 2013), loopless SVRG (Kovalev et al., 2020), SARAH (Nguyen et al., 2017), and SVRG-BB (Tan et al., 2016), the only other tune-free VR method. Since each of these methods requires a stepsize, we search over the grid \([10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 100]\), and select the best stepsize for each algorithm and each experiment. As is common, we set \(m = {n}/{b}\) for each of these methods. We note that though the theoretical results of SVRG-BB require a small \(O(1/\kappa ^2)\) stepsize and \(O(\kappa ^2)\) inner-loops, Tan et al. (2016) recommend setting \(m = O(n)\) in practice. Since AdaGrad results in the slower \(O(1/\epsilon ^2)\) rate (Levy et al., 2018; Vaswani et al., 2020) compared to the \(O(n + \frac{1}{\epsilon })\) rate of VR methods, we do not include it in the main paper. We demonstrate the poor performance of AdaGrad on two example datasets in Fig. 4 in Appendix “Additional experiments”.
We plot the gradient norm of the training objective (for the best stepsize) against the number of gradient evaluations normalized by the number of examples. We show the results for the logistic loss (Fig. 1a), Huber loss (Fig. 1b), and squared loss (Fig. 2). Our results show that (i) both variants of AdaSVRG (without any stepsize tuning) are competitive with the other besttuned VR methods, often outperforming them or matching their performance; (ii) SVRGBB often has an oscillatory behavior, even for the best stepsize; and (iii) the performance of AdaSVRG with adaptive termination (that has superior theoretical complexity) is competitive with that of the practically useful O(n) fixed innerloop setting.
In order to evaluate the effect of the stepsize on a method’s performance, we plot the gradient norm after 50 outerloops vs stepsize for each of the competing methods. For the AdaSVRG variants, we set the stepsize according to the heuristic described earlier. For the logistic loss (Fig. 1a), Huber loss (Fig. 1b) and squared loss (Fig. 2), we observe that (i) the performance of typical VR methods heavily depends on the choice of the stepsize; (ii) the stepsize corresponding to the minimum loss is different for each method, loss and dataset; and (iii) AdaSVRG with the stepsize heuristic results in competitive performance. Additional results plotted in Figs. 5a, b, 6a, b, 7a, b, 8a–c, 9a–c, 10a–c, 11a–c, 12a–c, 13a–c in Appendix “Additional experiments” confirm that the good performance of AdaSVRG is consistent across losses, batchsizes and datasets.
Finally, in Fig. 14a in Appendix “Evaluating the diagonal variant of AdaSVRG”, we give preliminary results benchmarking the performance of the diagonal variant of AdaSVRG and comparing it to the scalar variant. We do not compare to the full matrix variant since inverting a \(d \times d\) matrix in each iteration makes it impractical for most machine learning tasks. Our results demonstrate that with the current heuristic for setting \(\eta\), the performance of the diagonal variant does not significantly improve over the scalar variant. Moreover, since each iteration of the diagonal variant incurs an additional O(d) cost compared to the scalar variant, we did not conduct further experiments with the diagonal variant. In the future, we aim to develop robust heuristics for setting the stepsize for this variant of AdaSVRG.
6 Heuristic for adaptivity to overparameterization
In this section, we reason that the poor empirical performance of SVRG when training overparameterized models (Defazio & Bottou, 2019) can be partially explained by the interpolation property (Schmidt & Le Roux, 2013; Ma et al., 2018; Vaswani et al., 2019a) satisfied by these models (Zhang et al., 2017). In particular, we focus on smooth convex losses, but assume that the model is capable of completely fitting the training data, and that \(w^*\) lies in the interior of X. For example, these properties are simultaneously satisfied when minimizing the squared hinge-loss for linear classification on separable data, or in unregularized kernel regression (Belkin et al., 2019; Liang et al., 2020) with \(\Vert w^*\Vert \le 1\).
Formally, the interpolation condition means that the gradient of each \(f_i\) in the finite-sum converges to zero at an optimum: if the overall objective f is minimized at \(w^{*}\) with \(\nabla f(w^{*}) = 0\), then \(\nabla f_{i}(w^{*}) = 0\) for all \(f_{i}\). Additionally, we assume that each function \(f_i\) has a finite minimum \(f_i^*\). Since the interpolation property is rarely satisfied exactly in practice, we allow for a weaker version that uses \(\zeta ^2 :=\mathbb {E}_i [f^* - f_i^*] \in [0,\infty )\) (Loizou et al., 2020; Vaswani et al., 2020) to measure the extent of its violation. If \(\zeta ^2 = 0\), interpolation is exactly satisfied.
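To make \(\zeta^2\) concrete, the following sketch (illustrative, not from the paper's code) estimates it for a least-squares finite sum; there each \(f_i^* = 0\) whenever \(x_i \ne 0\), so \(\zeta^2\) reduces to the optimal value \(f^*\) of the average loss.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)

def zeta_squared(y):
    # f_i(w) = 0.5 * (x_i^T w - y_i)^2 has f_i^* = 0 for x_i != 0, so
    # zeta^2 = E_i[f^* - f_i^*] reduces to f(w^*), the optimal average loss.
    w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 0.5 * np.mean((X @ w_star - y) ** 2)

z_interp = zeta_squared(X @ w_true)                          # interpolation holds
z_noisy = zeta_squared(X @ w_true + rng.standard_normal(n))  # interpolation violated
```

Here `z_interp` is numerically zero while `z_noisy` is strictly positive, matching the statement that \(\zeta^2 = 0\) exactly when interpolation holds.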
When \(\zeta ^2 = 0\), both constant stepsize SGD and AdaGrad have a gradient complexity of \(O(1/\epsilon )\) in the smooth convex setting (Schmidt & Le Roux, 2013; Vaswani et al., 2019a, 2020). In contrast, typical VR methods have an \(\tilde{O}(n + \frac{1}{\epsilon })\) complexity. For example, both SVRG and AdaSVRG require computing the full gradient in every outer-loop, and will thus unavoidably suffer an \(\Omega (n)\) cost. For large n, typical VR methods will thus necessarily be slower than SGD when training models that can exactly interpolate the data. This provides a partial explanation for the ineffectiveness of VR methods when training overparameterized models. When \(\zeta ^2 > 0\), AdaGrad has an \(O(1/\epsilon + \zeta /\epsilon ^2)\) rate (Vaswani et al., 2020). Here \(\zeta\), the violation of interpolation, plays the role of noise and slows down the convergence to an \(O(1/\epsilon ^2)\) rate. On the other hand, AdaSVRG results in an \(\tilde{O}(n + 1/\epsilon )\) rate, regardless of \(\zeta\).
Following the reasoning in Sect. 4, if an algorithm can detect the slower convergence of AdaGrad and switch from AdaGrad to AdaSVRG, it can attain a faster convergence rate. It is straightforward to show that AdaGrad has a similar phase transition as Theorem 4 when interpolation is only approximately satisfied. This enables the use of the test in Sect. 4 to terminate AdaGrad and switch to AdaSVRG, resulting in the hybrid algorithm described in Algorithm 4. If the diagnostic test can detect the phase transition accurately, Algorithm 4 will attain an \(O(1/\epsilon )\) convergence when interpolation is exactly satisfied (no switching in this case). When interpolation is only approximately satisfied, it will result in an \(O(1/\epsilon )\) convergence for \(\epsilon \ge \zeta\) (corresponding to the AdaGrad rate in the deterministic phase) and will attain an \(O(1/\zeta ^2 + (n + 1/\epsilon ) \log (\zeta /\epsilon ))\) convergence thereafter (corresponding to the AdaSVRG rate). This implies that Algorithm 4 can indeed obtain the best of both worlds between AdaGrad and AdaSVRG.
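Since Algorithm 4 is not reproduced in this excerpt, the following is only a rough sketch of such a hybrid scheme on a noisy least-squares problem. The switching test (watching the AdaGrad accumulator for the growth predicted by the phase transition of Theorem 4), the stepsize \(\eta\), and the threshold are illustrative choices, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)  # interpolation violated

def grad_i(w, i):                        # gradient of f_i(w) = 0.5 * (x_i^T w - y_i)^2
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    return X.T @ (X @ w - y) / n

def hybrid(w, eta=1.0, T=2000, window=200, thresh=1.2, outer=20):
    # Phase 1: scalar AdaGrad; stop once the accumulator G_t starts growing,
    # which signals the noise-dominated (stochastic) phase.
    G, prev = 1e-8, None
    for t in range(1, T + 1):
        g = grad_i(w, rng.integers(n))
        G += g @ g
        w -= eta / np.sqrt(G) * g
        if t % window == 0:
            if prev is not None and G / prev > thresh:
                break                    # accumulator growing: switch to AdaSVRG
            prev = G
    # Phase 2: AdaSVRG sketch: SVRG-style variance reduction with AdaGrad
    # stepsizes, restarting the accumulator and averaging each inner loop.
    for _ in range(outer):
        w_ref, mu, G = w.copy(), full_grad(w), 1e-8
        u, avg = w.copy(), np.zeros(d)
        for t in range(1, n + 1):
            i = rng.integers(n)
            g = grad_i(u, i) - grad_i(w_ref, i) + mu   # variance-reduced gradient
            G += g @ g
            u -= eta / np.sqrt(G) * g
            avg += (u - avg) / t
        w = avg                          # inner-loop average is the next iterate
    return w

w_out = hybrid(np.zeros(d))
```

On this problem, phase 1 makes fast initial progress until the noise floor is reached, after which the variance-reduced phase continues driving the full gradient norm down.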
6.1 Evaluating Algorithm 4
We use synthetic experiments to demonstrate the effect of interpolation on the convergence of stochastic and VR methods. Following the protocol in Meng et al. (2020), we generate a linearly separable dataset with \(n=10^4\) data points of dimension \(d=200\) and train a linear model with a convex loss. This setup ensures that interpolation is satisfied, while eliminating other confounding factors such as non-convexity and implementation details. In order to smoothly violate interpolation, we mislabel a fraction of the points, chosen from the grid [0, 0.1, 0.2].
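The data-generation step can be sketched as follows. This is a simplified version in the spirit of that protocol; the margin parameter and the rejection of near-boundary points are illustrative choices.

```python
import numpy as np

def make_separable(n=10_000, d=200, mislabel_frac=0.0, margin=0.1, seed=0):
    """Linearly separable binary data; flipping a fraction of the labels
    smoothly violates interpolation (margin value is illustrative)."""
    rng = np.random.default_rng(seed)
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)
    X = rng.standard_normal((n, d))
    scores = X @ w_star
    mask = np.abs(scores) >= margin      # enforce a hard margin
    X, scores = X[mask], scores[mask]
    y = np.sign(scores)                  # separable by construction
    flip = rng.random(y.shape[0]) < mislabel_frac
    y[flip] = -y[flip]                   # mislabel a fraction of the points
    return X, y, w_star

X, y, w_star = make_separable(n=1000, d=20, mislabel_frac=0.1)
```

With `mislabel_frac=0` every point satisfies \(y_i \, x_i^\top w^* \ge \text{margin}\), so a linear model can interpolate; increasing the fraction controls \(\zeta^2\).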
We use AdaGrad as a representative (fully) stochastic method and, to eliminate possible confounding from its stepsize, set the stepsize using the stochastic line-search procedure (Vaswani et al., 2020). We compare the performance of AdaGrad, SVRG, AdaSVRG and the hybrid AdaGrad-AdaSVRG (Algorithm 4), each with a budget of 50 epochs (passes over the data). For SVRG, as before, we choose the best stepsize via a grid-search. For AdaSVRG, we use the fixed-size inner-loop variant and the stepsize heuristic described earlier. In order to evaluate the quality of the “switching” criterion in Algorithm 4, we compare against a hybrid method referred to as “Optimal Manual Switching” in the plots. This method runs a grid-search over switching points (after each epoch in \(\{1, 2, \ldots , 50\}\)) and chooses the point that results in the minimum loss after 50 epochs.
In Fig. 3, we plot the results for the logistic loss using a batch-size of 64 (refer to Figs. 15a–c, 16a–d in Appendix “Additional experiments on adaptivity to overparameterization” for other losses and batch-sizes). We observe that (i) when interpolation is exactly satisfied (no mislabeling), AdaGrad results in superior performance over SVRG and AdaSVRG, confirming the theory in Sect. 6; in this case, both the optimal manual switching and Algorithm 4 do not switch; (ii) when interpolation is not exactly satisfied (with \(10\%, 20\%\) mislabeling), AdaGrad’s progress slows down and stalls in a neighbourhood of the solution, whereas both SVRG and AdaSVRG converge to the solution; (iii) in both cases, Algorithm 4 detects the slowdown of AdaGrad and switches to AdaSVRG, resulting in performance competitive with the optimal manual switching. For all three datasets, Algorithm 4 matches or outperforms the better of AdaGrad and AdaSVRG, showing that it can achieve the best-of-both-worlds.
7 Discussion
Although there have been numerous papers on VR methods in the past ten years, all of the provably convergent methods require knowledge of problem-dependent constants such as L. On the other hand, there has been substantial progress in designing adaptive gradient methods, which have effectively replaced SGD for training ML models. Unfortunately, this progress has not been leveraged for developing better VR methods. Our work is the first to marry these lines of literature by designing AdaSVRG, which achieves a gradient complexity comparable to typical VR methods, but without needing to know the objective’s smoothness constant. Our results illustrate that it is possible to design principled techniques that can “painlessly” reduce the variance, achieving good theoretical and practical performance. We believe that our paper will help open up an exciting research direction. In the future, we aim to extend our theory to the strongly-convex setting.
Availability of data and materials
Experiments were done on the publicly available LIBSVM datasets (Chang & Lin, 2011).
Code availability
Full code to replicate the experiments can be found at https://github.com/bpauld/AdaSVRG.
Notes
For \(k = 0\), we compute the full gradient at a random point \(w_{1}\) and approximate \(L_0\) in the same way.
We do not compare against SAG (Schmidt et al., 2017) because of its large memory footprint.
References
Ahn, K., Yun, C., & Sra, S. (2020). SGD with shuffling: Optimal rates without component convexity and large epoch requirements. In Neural information processing systems 2020, NeurIPS 2020.
AllenZhu, Z. (2017). Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th annual ACM SIGACT symposium on theory of computing, STOC.
Armijo, L. (1966). Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1), 1–3.
Babanezhad Harikandeh, R., Ahmed, M. O., Virani, A., Schmidt, M., Konečnỳ, J., & Sallinen, S. (2015). Stop wasting my gradients: Practical SVRG. Advances in Neural Information Processing Systems, 28, 2251–2259.
Barzilai, J., & Borwein, J. M. (1988). Twopoint step size gradient methods. IMA Journal of Numerical Analysis, 8(1), 141–148.
Belkin, M., Rakhlin, A., & Tsybakov, A. B. (2019). Does data interpolation contradict statistical optimality? In The 22nd international conference on artificial intelligence and statistics (pp. 1611–1619). PMLR.
Bollapragada, R., Byrd, R. H., & Nocedal, J. (2019). Exact and inexact subsampled newton methods for optimization. IMA Journal of Numerical Analysis, 39(2), 545–578.
Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9, 323–375.
Chang, C.C., & Lin, C.J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 1–27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
Cutkosky, A., & Boahen, K. (2017). Online convex optimization with unconstrained domains and losses. arXiv preprint arXiv:1703.02622.
Cutkosky, A., & Orabona, F. (2019). Momentumbased variance reduction in nonconvex SGD. arXiv preprint arXiv:1905.10018.
Defazio, A., & Bottou, L. (2019). On the ineffectiveness of variance reduced optimization for deep learning. In Advances in neural information processing systems, NeurIPS.
Defazio, A., Bach, F., & LacosteJulien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in neural information processing systems, NeurIPS.
DuboisTaine, B., Vaswani, S., Babanezhad, R., Schmidt, M., & LacosteJulien, S. (2021). Svrg meets adagrad: Painless variance reduction. arXiv preprint arXiv:2102.09645.
Duchi, J. C., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12, 2121–2159.
Gower, R. M., Schmidt, M., Bach, F., & Richtarik, P. (2020). Variancereduced methods for machine learning. Proceedings of the IEEE, 108(11), 1968–1983.
Hofmann, T., Lucchi, A., LacosteJulien, S., & McWilliams, B. (2015). Variance reduced stochastic gradient descent with neighbors. Advances in Neural Information Processing Systems, 28, 2305–2313.
Johnson, R., & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, NeurIPS.
Konečnỳ, J., & Richtárik, P. (2013). Semistochastic gradient descent methods. arXiv preprint arXiv:1312.1666.
Kovalev, D., Horváth, S., & Richtárik, P. (2020). Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. In Algorithmic learning theory (pp. 451–467). PMLR.
Lan, G., Li, Z., & Zhou, Y. (2019). A unified variancereduced accelerated gradient method for convex optimization. In Advances in neural information processing systems (pp. 10462–10472).
Lang, H., Xiao, L., & Zhang, P. (2019). Using statistics to automate stochastic optimization. In Advances in neural information processing systems (pp. 9540–9550).
Levy, K. Y., Yurtsever, A., & Cevher, V. (2018). Online adaptive methods, universality and acceleration. In Advances in neural information processing systems, NeurIPS.
Li, B., Wang, L., & Giannakis, G. B. (2020). Almost tunefree variance reduction. In International conference on machine learning (pp. 5969–5978). PMLR.
Liang, T., Rakhlin, A., et al. (2020). Just interpolate: Kernel “ridgeless’’ regression can generalize. Annals of Statistics, 48(3), 1329–1347.
Liu, M., Zhang, W., Orabona, F., & Yang, T. (2020). Adam+: A stochastic method with adaptive variance reduction. arXiv preprint arXiv:2011.11985.
Loizou, N., Vaswani, S., Laradji, I., & LacosteJulien, S. (2020). Stochastic Polyak stepsize for SGD: An adaptive learning rate for fast convergence. arXiv preprint: arXiv:2002.10542.
Ma, S., Bassily, R., & Belkin, M. (2018). The power of interpolation: Understanding the effectiveness of SGD in modern overparametrized learning. In Proceedings of the 35th international conference on machine learning, ICML.
Mahdavi, M., & Jin, R. (2013). MixedGrad: An O(1/T) convergence rate algorithm for stochastic smooth optimization. arXiv preprint arXiv:1307.7192.
Mairal, J. (2013). Optimization with firstorder surrogate functions. In International conference on machine learning (pp. 783–791).
Meng, S. Y., Vaswani, S., Laradji, I., Schmidt, M., & LacosteJulien, S. (2020). Fast and furious convergence: Stochastic second order methods under interpolation. In The 23nd international conference on artificial intelligence and statistics, AISTATS.
Moulines, E., & Bach, F. R. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in neural information processing systems, NeurIPS.
Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course. Berlin: Springer.
Nguyen, L. M., Liu, J., Scheinberg, K., & Takáč, M. (2017). SARAH: a novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th international conference on machine learning (Vol. 70, pp. 2613–2621).
Pesme, S., Dieuleveut, A., & Flammarion, N. (2020). On convergencediagnostic based step sizes for stochastic gradient descent. arXiv preprint arXiv:2007.00534.
Pflug, G. C. (1983). On the determination of the step size in stochastic quasigradient methods. collaborative paper cp83025. International Institute for Applied Systems Analysis (IIASA), Laxenburg, Austria.
Qian, Q., & Qian, X. (2019). The implicit bias of adagrad on separable data. arXiv preprint arXiv:1906.03559.
Reddi, S. J., Hefny, A., Sra, S., Poczos, B., & Smola, A. (2016). Stochastic variance reduction for nonconvex optimization. In International conference on machine learning (pp. 314–323).
Reddi, S. J., Kale, S., & Kumar, S. (2018). On the convergence of Adam and Beyond. In International conference on learning representations.
Schmidt, M., & Le Roux, N. (2013). Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint: arXiv:1308.6370.
Schmidt, M., Babanezhad, R., Ahmed, M., Defazio, A., Clifton, A., & Sarkar, A. (2015). Nonuniform stochastic average gradient method for training conditional random fields. In Proceedings of the eighteenth international conference on artificial intelligence and statistics, AISTATS.
Schmidt, M., Le Roux, N., & Bach, F. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1–2), 83–112.
Sebbouh, O., Gazagnadou, N., Jelassi, S., Bach, F., & Gower, R. (2019). Towards closing the gap between the theory and practice of SVRG. In Advances in neural information processing systems (pp. 648–658).
ShalevShwartz, S., & Zhang, T. (2013). Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb), 567–599.
Song, C., Jiang, Y., & Ma, Y. (2020). Variance reduction via accelerated dual averaging for finitesum optimization. Advances in Neural Information Processing Systems, 33.
Sridharan, K., ShalevShwartz, S., & Srebro, N. (2008). Fast rates for regularized objectives. Advances in Neural Information Processing Systems, 21, 1545–1552.
Tan, C., Ma, S., Dai, Y.H., & Qian, Y. (2016). BarzilaiBorwein step size for stochastic gradient descent. arXiv preprint arXiv:1605.04131.
Vaswani, S., Bach, F., & Schmidt, M. (2019a). Fast and faster convergence of SGD for overparameterized models and an accelerated perceptron. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS.
Vaswani, S., Kunstner, F., Laradji, I., Meng, S. Y., Schmidt, M., & LacosteJulien, S. (2020). Adaptive gradient methods converge faster with overparameterization (and you can do a linesearch). arXiv preprint arXiv:2006.06835.
Vaswani, S., Mishkin, A., Laradji, I., Schmidt, M., Gidel, G., & LacosteJulien, S. (2019). Painless stochastic gradient: Interpolation, line-search, and convergence rates. In Advances in neural information processing systems, NeurIPS.
Ward, R., Wu, X., & Bottou, L. (2019). AdaGrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. In Proceedings of the 36th international conference on machine learning, ICML.
Yaida, S. (2018). Fluctuationdissipation relations for stochastic gradient descent. arXiv preprint arXiv:1810.00004.
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In 5th international conference on learning representations, ICLR.
Zhou, K., So, A. M.C., & Cheng, J. (2021). Accelerating perturbed stochastic iterates in asynchronous lockfree optimization. arXiv preprint arXiv:2109.15292.
Funding
Benjamin DuboisTaine would like to acknowledge funding by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR19P3IA0001 (PRAIRIE 3IA Institute). Mark Schmidt and Simon LacosteJulien would like to acknowledge support from the Canada CIFAR AI Chair Program.
Author information
Contributions
SV and RB conceived of the idea of combining SVRG and AdaGrad. SV, BDT and RB proved the theoretical results. BDT performed the experiments with support from SV. SV and BDT wrote down the manuscript. SLJ and MS supervised the project.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethics approval
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Editor: Krzysztof Dembczynski and Emilie Devijver.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendices
Organization of the Appendix
A. Definitions
B. Algorithm in general case
C. Proof of Lemma 1
D. Main proposition
E. Proof of Theorem 1
F. Proof of Theorem 2
G. Proof of Theorem 3
H. Proof of Theorem 4
I. Helper lemmas
J. Counterexample for line-search for SVRG
K. Additional experiments
Definitions
Our main assumptions are that each individual function \(f_i\) is differentiable and \(L_i\)-smooth, meaning that for all v and w, \(f_i(v) \le f_i(w) + \langle \nabla f_i(w), v - w \rangle + \frac{L_i}{2} \Vert v - w\Vert ^2\),
which also implies that f is \(L_{{\max}}\)-smooth, where \(L_{{\max}}\) is the maximum smoothness constant of the individual functions. We also assume that f is convex, meaning that for all v and w, \(f(v) \ge f(w) + \langle \nabla f(w), v - w \rangle\).
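Both definitions can be checked numerically; the sketch below verifies them for the logistic loss of a single example, using the standard smoothness constant \(L_i = \Vert x_i\Vert ^2/4\) for the logistic loss (a known bound, stated here as an assumption rather than derived).

```python
import numpy as np

rng = np.random.default_rng(0)
x, label = rng.standard_normal(5), 1.0
L = x @ x / 4.0                      # smoothness constant of the logistic loss

def f(w):                            # f(w) = log(1 + exp(-label * x^T w))
    return np.log1p(np.exp(-label * (x @ w)))

def grad(w):
    return -label * x / (1.0 + np.exp(label * (x @ w)))

for _ in range(100):
    v, w = rng.standard_normal(5), rng.standard_normal(5)
    model = f(w) + grad(w) @ (v - w)                             # linear lower model
    assert f(v) >= model - 1e-9                                  # convexity
    assert f(v) <= model + L / 2 * np.sum((v - w) ** 2) + 1e-9   # L-smoothness
```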
Algorithm in general case
We restate Algorithm 1 to handle the full matrix and diagonal variants. The first difference is in the initialization and update of \(G_t\). Moreover, we use a more general notion of projection, i.e. \(\Pi _{X, A_t}\left( \cdot \right)\) is the projection onto set X with respect to the norm induced by a symmetric positive definite matrix \(A_t\) (such projections are common to adaptive gradient methods Duchi et al., 2011; Levy et al., 2018; Reddi et al., 2018). Note that in the scalar variant, \(A_t\) is a scalar and thus \(\Pi _{X, A_t}\left( \cdot \right) = \Pi _{X}\left( \cdot \right)\) and we recover Algorithm 1.
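As a concrete sketch of one step of the diagonal variant (the feasible set, its radius, and the stepsize below are illustrative choices, not the paper's): for a diagonal \(A_t\) and a box \(X = [-R, R]^d\), the projection \(\Pi _{X, A_t}\) decomposes coordinate-wise, so it reduces to clipping.

```python
import numpy as np

def adagrad_diag_step(w, g, G_diag, eta=1.0, delta=1e-8, R=10.0):
    """One step of diagonal AdaGrad: G_t = G_{t-1} + diag(g g^T),
    A_t = (G_t + delta I)^{1/2}, then the A_t-norm projection onto the
    box X = [-R, R]^d, which is coordinate-wise clipping."""
    G_diag = G_diag + g ** 2
    A_diag = np.sqrt(G_diag + delta)
    z = w - eta * g / A_diag            # preconditioned gradient step
    return np.clip(z, -R, R), G_diag    # Pi_{X, A_t}(z)

# Usage: minimize f(w) = 0.5 * ||w - c||^2; the iterates converge to c.
c = np.array([1.0, -2.0, 0.5])
w, G = np.zeros(3), np.zeros(3)
for _ in range(2000):
    w, G = adagrad_diag_step(w, w - c, G)   # gradient of f is w - c
```

The box is chosen so that the projection has a closed form; for a general set X and a non-scalar \(A_t\), the projection is itself an optimization problem.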
Proof of Lemma 1
We restate Lemma 1 to handle the three variants of AdaSVRG.
Lemma 2
(AdaSVRG with single outerloop) Assuming (i) convexity of f, (ii) \(L_{\max }\)smoothness of \(f_i\) and (iii) bounded feasible set with diameter D. For the scalar variant, defining \(\rho := \big ( \frac{D^2}{\eta _{k}} + 2\eta _{k}\big ) \sqrt{L_{{\max}}}\), for any outer loop k of AdaSVRG, with (a) innerloop length \(m_k\) and (b) stepsize \(\eta _{k}\),
For the full matrix and diagonal variants, setting \(\rho ':= \left( \frac{D^2}{\eta _k} + 2\eta _k \right) \sqrt{dL_{{\max}}}\),
Proof
For any of the three variants, we have, for any outer loop iteration k and any inner loop iteration t,
where the inequality follows from Reddi et al. (2018, Lemma 4). Dividing by \(\eta _k\), rearranging and summing over all inner loop iterations at stage k gives
By Lemma 4, we have that \(\textsf {Tr}\mathclose {\left(A_{m_k}\right)} \le \sqrt{\sum _{t=1}^{m_k} \left\Vert g_t\right\Vert ^2}\) in the scalar case, and \(\textsf {Tr}\mathclose {\left(A_{m_k}\right)} \le \sqrt{d}\sqrt{\sum _{t=1}^{m_k} \left\Vert g_t\right\Vert ^2 + d\delta }\) in the full matrix and diagonal variants. Therefore we set
and
Going back to the above inequality and taking expectation we get
Using convexity of f yields
where the second inequality comes from Jensen’s inequality applied to the (concave) square root function. Now, from Johnson and Zhang (2013, Proof of Theorem 1),
Going back to the previous equation, squaring and setting \(\tau = a' \sqrt{4 L_{{\max}}}\) we get
Using Lemma 5,
Finally, using Jensen’s inequality we get
which concludes the proof by noticing that by definition \(\tau = \rho\) in the scalar case and \(\tau = \rho '\) in the full matrix and diagonal cases. \(\square\)
Main proposition
We first state the main proposition for the three variants of AdaSVRG, which we later use for proving theorems.
Proposition 2
Assuming (i) convexity of f, (ii) \(L_{{\max}}\)-smoothness of f, (iii) bounded feasible set, (iv) \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\), and (v) \(m_k = m\) for all k, then for the scalar variant,
and for the full matrix and diagonal variants,
where
Proof
As in the previous proof we define
and
Using the result from Lemma 1 and letting \(\Delta _k :=\mathbb {E}[f(w_k) - f^*]\), we have,
Squaring gives
which we can rewrite as
Since \(\Delta _{k+1}^2 - 2\frac{\tau ^2}{m} \Delta _{k+1} = (\Delta _{k+1} - \frac{\tau ^2}{m})^2 - \frac{\tau ^4}{m^2}\), we get
Summing this gives
Using Jensen’s inequality on the (concave) square root function gives
going back to the previous inequality this gives
which we can rewrite
Setting \(\bar{w}_K = \frac{1}{K} \sum _{k=0}^{K-1} w_k\) and using Jensen’s inequality on the convex function f, we get
which concludes the proof. \(\square\)
Proof of Theorem 1
For the remainder of the appendix we define \(\rho ' :=\big ( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\big ) \sqrt{dL_{{\max}}}\). We restate and prove Theorem 1 for all three variants of AdaSVRG.
Theorem 5
(AdaSVRG with fixedsize innerloop) Under the same assumptions as Lemma 1, AdaSVRG with (a) stepsizes \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\), (b) innerloop size \(m_k = n\) for all k, results in the following convergence rate after \(K \le n\) iterations. For the scalar variant,
and for the full matrix and diagonal variants,
where \(\bar{w}_K = \frac{1}{K} \sum _{k = 1}^{K} w_{k}\).
Proof
We have \(\frac{1}{m} = \frac{1}{n} \le \frac{1}{K}\) by the assumption that \(K \le n\). Using the result of Proposition 2 for the scalar variant, we have
Using the result of Proposition 2 for the full matrix and diagonal variants we have
\(\square\)
Corollary 1
Under the assumptions of Theorem 1, the computational complexity of AdaSVRG to reach \(\epsilon\)accuracy is \(O\left( \frac{n}{\epsilon }\right)\) when
for the scalar variant and when
for the full matrix and diagonal variants.
Proof
We deal with the scalar variant first. Let \(c = \rho ^2( 1 +\sqrt{5}) + \rho \sqrt{2\left( f(w_0) - f^*\right) }\). By the previous theorem, to reach \(\epsilon\)-accuracy we require
We thus require \(O\left( \frac{1}{\epsilon }\right)\) outer loops to reach \(\epsilon\)-accuracy. For \(m_k = n\), 3n gradients are computed in each outer loop, so the computational complexity is indeed \(O\left( \frac{n}{\epsilon }\right)\).
The condition \(\epsilon \ge \frac{c}{n}\) follows from the assumption that \(K \le n\).
The proof for the full matrix and diagonal variants is similar by taking \(c = (\rho ')^2( 1 +\sqrt{5}) + 2\rho ' \sqrt{ {d\delta }/{L_{{\max}}}} + \rho '\sqrt{2\left( f(w_0) - f^*\right) }\). \(\square\)
Proof of Theorem 2
Theorem 2 (Multi-stage AdaSVRG) Under the same assumptions as Theorem 1, multi-stage AdaSVRG with \(I = \log (1/\epsilon )\) stages, \(K \ge 3\) outer-loops and \(m^i = 2^{i+1}\) inner-loops at stage i, requires \(O(n\log {1}/{\epsilon } + {1}/{\epsilon })\) gradient evaluations to reach a \(\left( \rho ^2 (1+\sqrt{5}) \right) \epsilon\)-suboptimality.
Proof
We deal with the scalar variant first. Let \(\Delta _i:= \mathbb {E}[f(\bar{w}^i) - f^*]\) and \(c:= \rho ^2(1 + \sqrt{5})\). Suppose that \(K \ge 3\) and \(m_i = 2^{i+1}\), as in the theorem statement. We claim that for all i, we have \(\Delta _i \le \frac{c}{2^i}\). We prove this by induction. For \(i=0\), we have
The quantity \(\frac{D^2}{\eta } + 2\eta\) reaches a minimum for \(\eta ^* = \frac{D}{\sqrt{2}}\). Therefore we can write
Now suppose that \(\Delta _{i-1} \le \frac{c}{2^{i-1}}\) for some \(i\ge 1\). Using the upper-bound analysis of AdaSVRG in Proposition 2 we get
Since \(K \ge 3\), one can check that \(\frac{8}{(1+ \sqrt{5}) K} \le 1\) and thus
which concludes the induction step.
At time step \(I = \log \frac{1}{\epsilon }\), we thus have \(\Delta _I \le \frac{c}{2^{I}} = c\epsilon\). All that is left is to compute the gradient complexity. If we assume that \(K = \gamma\) for some constant \(\gamma \ge 3\), the gradient complexity is given by
which concludes the proof for the scalar variant.
Now we look at the full matrix and diagonal variants. Let’s take \(c = (\rho ')^2(1 + \sqrt{5}) + 2\rho '\sqrt{ {d \delta }/{L_{{\max}}}}\). Suppose that \(K \ge 3\) and \(m_i = 2^{i+1}\), as in the theorem statement. Again, we claim that for all i, we have \(\Delta _i \le \frac{c}{2^i}\). We prove this by induction. For \(i=0\), we have
The quantity \(\frac{D^2}{\eta } + 2\eta\) reaches a minimum for \(\eta ^* = \frac{D}{\sqrt{2}}\). Therefore we can write
The induction step is exactly the same as in the scalar case. \(\square\)
Proof of Theorem 3
We restate and prove Theorem 3 for the three variants of AdaSVRG.
Theorem 6
(AdaSVRG with adaptive-sized inner-loops) Under the same assumptions as Lemma 1, AdaSVRG with (a) stepsizes \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\), (b1) inner-loop size \(m_k = \frac{\nu }{\epsilon }\) for all k, or (b2) inner-loop size \(m_k = \frac{\nu }{f(w_k) - f^*}\) for outer-loop k, results in the following convergence rate,
where \(\nu = 4\rho ^2\) for the scalar variant and \(\nu = \left( \frac{2\rho ' + \sqrt{16 (\rho ')^2 + 12\rho ' \sqrt{ {d\delta }/{4L_{{\max}}}}}}{3}\right) ^2\) for the full matrix and diagonal variants.
Proof
Let us define
and
Similar to the proof of Lemma 1, for an inner-loop k with \(m_k\) iterations and \(\alpha := \frac{1}{2}\big ( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\big )\), we can show
Using Lemma 5 we get
If we set \(m_k = \frac{C}{\epsilon _k}\),
Define \(a :=\sqrt{4L_{{\max}}\alpha ^2}\) and \(\gamma :=\sqrt{b/4L_{{\max}}}\). Dividing both sides of (59) by \(m_k\) and using the definition of \(w_{k+1}\),
Setting \(C=\left( \frac{2a+\sqrt{4a^2+12(a^2+a\gamma )}}{3}\right) ^2\), we get \(\epsilon _{k+1} \le 3/4 \epsilon _k\). However, the above proof requires knowing \(\epsilon _{k}\). Instead, let us assume a target error of \(\epsilon\), implying that we want to have \(\mathbb E [f(w_{K}) - f^* ] \le \epsilon\). Going back to (58) and setting \(m_k = C/\epsilon\), we obtain,
With \(C=\left( \frac{2a+\sqrt{4a^2+12(a^2+a\gamma )}}{3}\right) ^2\), we get linear convergence to \(\epsilon\)-suboptimality. Based on the above, we require \(K = \mathcal O(\log (1/\epsilon ))\) outer-loops. However, each outer-loop needs \(\mathcal O(n+\frac{1}{\epsilon })\) gradient evaluations, so the total computational complexity is \(\mathcal {O} ((n+\frac{1}{\epsilon }) \log (\frac{1}{\epsilon }))\).
The proof is done by noticing that in the scalar variant, \(\gamma = 0\) and \(a = \rho\) so that
and in the full matrix and diagonal variants, \(a = \rho '\) and \(\gamma = \sqrt{ {d\delta }/{4L_{{\max}}}}\) so that
\(\square\)
Proof of Theorem 4
We restate and prove Theorem 4 for the three variants of AdaGrad. To handle all three variants, we restate the theorem in terms of \(\Vert G_t\Vert _* = \sqrt{\textsf {Tr}\mathclose {\left(G_t\right)}}\). Note that this does not change the claim for the scalar variant, as in that case \(G_t = \textsf {Tr}\mathclose {\left(G_t\right)}\).
Theorem 7
(Phase Transition in AdaGrad Dynamics) Under the same assumptions as Lemma 1 and (iv) \(\sigma ^2\)-bounded stochastic gradient variance, and defining \(T_0 = \frac{\rho ^2 L_{{\max}}}{\sigma ^2}\), for constant stepsize AdaGrad we have \(\mathbb {E}\Vert G_t\Vert _{*} = O(1) \text { for } t \le T_0\), and \(\mathbb {E}\Vert G_t\Vert _{*} = O (\sqrt{t-T_0}) \text { for } t \ge T_0\).
The same result holds for the full matrix and diagonal variants of constant stepsize AdaGrad for \(T_0 = \frac{\left( \rho ' \sqrt{L_{{\max}}} + \sqrt{2\rho ' \sqrt{d\delta L_{{\max}}} + d\delta }\right) ^2}{\sigma ^2}\)
Proof
We start with the scalar variant. Consider the general AdaGrad update
As in the proof of Theorem 1, we can bound the suboptimality as
By rearranging, dividing by \(\eta\) and summing over T iterations we have
Define
and
We then have \(\textsf {Tr}\mathclose {\left(A_T\right)} \le \alpha \sqrt{\sum _{t=1}^T\Vert \nabla f_{i_t}(x_{t})\Vert ^2 + b }\) by Lemmas 3 and 4. Going back to Eq. (66), taking expectations and using this upper-bound, we get
Using convexity of f and Jensen’s inequality on the (concave) square root function, we have
Now,
Taking expectations, and since \(\mathbb {E}[\nabla f_{i_t}(x_t) - \nabla f(x_t)] = 0\), we have
Going back to the previous inequality we get
where we used smoothness in the last inequality. Now, if \(\sigma = 0\), namely \(\nabla f(x_t) = \nabla f_{i_t}(x_t)\), we have
Using Lemma 5 we get
so that
Now,
Thus we have
where we used smoothness for the inequality. This shows that \(\Vert G_T\Vert _*\) is a bounded sequence in the deterministic case. Now, if \(\sigma \ne 0\), going back to Eq. (72) we have
Using Lemma 5 we get
We then have
from which we get
This implies that for \(T \le \frac{\left( 2 \alpha L_{{\max}} + \sqrt{2L_{{\max}} \alpha \sqrt{b} + b}\right) ^2}{\sigma ^2}\), we have
and for \(T\ge \frac{\left( 2 \alpha L_{{\max}} + \sqrt{2L_{{\max}} \alpha \sqrt{b} + b}\right) ^2}{\sigma ^2}\),
The proof is done by noticing that
\(\square\)
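The two-phase behavior of Theorem 7 can be observed numerically. In the sketch below (problem sizes, stepsize, and noise level are illustrative choices), the scalar accumulator \(\textsf{Tr}(A_t) = \sqrt{\sum _{s\le t} \Vert g_s\Vert ^2}\) plateaus under interpolation (\(\sigma = 0\)) and eventually grows like \(\sqrt{t}\) otherwise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 50, 5, 5000
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)

def adagrad_trace(noise, eta=0.5):
    """Constant-stepsize scalar AdaGrad on least squares;
    returns the trajectory of Tr(A_t) = sqrt(sum_{s<=t} ||g_s||^2)."""
    y = X @ w_star + noise * rng.standard_normal(n)
    w, G, trace = np.zeros(d), 1e-12, []
    for _ in range(T):
        i = rng.integers(n)
        g = (X[i] @ w - y[i]) * X[i]
        G += g @ g
        w -= eta / np.sqrt(G) * g
        trace.append(np.sqrt(G))
    return np.array(trace)

A_det = adagrad_trace(noise=0.0)   # interpolation: plateau (first phase only)
A_sto = adagrad_trace(noise=1.0)   # sigma > 0: sqrt(t) growth after T_0
```

Doubling the horizon leaves `A_det` nearly unchanged, while `A_sto` grows by roughly \(\sqrt{2}\) over its second half, which is the signature the adaptive-termination test looks for.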
Corollary 2
Under the same assumptions as Lemma 1, for outer-loop k of AdaSVRG with constant stepsize \(\eta _k\), there exists \(T_0 = \frac{C}{f(w_k) - f^*}\) such that,
Proof
Using Theorem 7 with \(\sigma ^2 = f(w_{k}) - f^*\) gives us the result. \(\square\)
Helper lemmas
We make use of the following helper lemmas from Vaswani et al. (2020), proved here for completeness.
Lemma 3
For any of the full matrix, diagonal and scalar versions, we have
Proof
For any of the three versions, we have by construction that \(A_t\) is non-decreasing, i.e. \(A_t - A_{t-1} \succeq 0\) (for the scalar version, we consider \(A_t\) as a matrix of dimension 1 for simplicity). We can then use the bounded feasible set assumption to get
We then upperbound \(\lambda _{\max }\) by the trace and use the linearity of the trace to telescope the sum,
\(\square\)
Lemma 4
For any of the full matrix, diagonal and scalar versions, we have
Moreover, for the scalar version we have
and for the full matrix and diagonal version we have
Proof
We prove this by induction. Start with \(m=1\).
For the full matrix version, \(A_1=(\delta I+g_1g_1^\top )^{1/2}\) and we have
For the diagonal version \(A_1=(\delta I+{{\,\mathrm{diag}\,}}(g_1g_1^\top ))^{1/2}\) we have
Since \(A_1^{-1}\) is diagonal, the diagonal elements of \(A_1^{-1} g_1 g_1^\top\) are the same as the diagonal elements of \(A_1^{-1} {{\,\mathrm{diag}\,}}(g_1 g_1^\top )\). Thus we get
For the scalar version \(A_1 = \left( g_1^\top g_1\right) ^{1/2}\) and we have
Induction step: Suppose now that it holds for \(m-1\), i.e. \(\sum _{t=1}^{m-1} \left\Vert g_t\right\Vert ^2_{A_t^{-1}}\le 2 \textsf{Tr}\left(A_{m-1}\right)\). We will show that it also holds for m.
For the full matrix version we have
We then use the fact that for any \(X \succeq Y \succeq 0\), we have (Duchi et al., 2011, Lemma 8)
As \(X = A_m^2\succeq Y = g_m g_m^\top \succeq 0\), we can use the above inequality and the induction holds for m.
For the diagonal version we have
As before, since \(A_m^{-1}\) is diagonal, we have that the diagonal elements of \(A_m^{-1} g_m g_m^\top\) are the same as the diagonal elements of \(A_m^{-1} {{\,\mathrm{diag}\,}}(g_m g_m^\top )\). Thus we get
We can then again apply the result from Duchi et al. (2011, Lemma 8) with \(X = A_m^2 \succeq Y = {{\,\mathrm{diag}\,}}(g_m g_m^\top ) \succeq 0\), and we obtain the desired result.
For the scalar version, since \(A_m^{-1}\) is a scalar, we have by the induction hypothesis,
where the equality follows from the AdaGrad update. We can then again apply the result from Duchi et al. (2011, Lemma 8) with \(X = A_m^2 \ge Y = g_m^\top g_m \ge 0\), and we obtain the desired result.
Bound on the trace: recall that \(A_m= G_m^{1/2}\). For the scalar version we have
For the diagonal and full matrix variants, we use Jensen’s inequality to get
where \(\lambda _j(G_m)\) denotes the jth eigenvalue of \(G_m\).
For the full matrix version, we have
For the diagonal version, we have
which concludes the proof. \(\square\)
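For the scalar version, the bound \(\sum _{t=1}^{m} \left\Vert g_t\right\Vert ^2_{A_t^{-1}} \le 2\,\textsf{Tr}(A_m)\) reduces to \(\sum _t g_t^2/\sqrt{G_t} \le 2\sqrt{G_m}\) with \(G_t = \sum _{s\le t} g_s^2\) (taking \(A_t = G_t^{1/2}\) as in the proof, with \(\delta = 0\)), which is easy to check on a random gradient sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
g = rng.normal(size=1000)   # arbitrary scalar "gradient" sequence

G = np.cumsum(g ** 2)       # G_t = sum_{s<=t} g_s^2
A = np.sqrt(G)              # A_t = G_t^{1/2}, the scalar AdaGrad preconditioner

lhs = np.sum(g ** 2 / A)    # sum_t ||g_t||^2_{A_t^{-1}}
rhs = 2 * A[-1]             # 2 Tr(A_m)
assert lhs <= rhs
```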
Lemma 5
If \(x^2 \le a(x+b)\) for \(a\ge 0\) and \(b \ge 0\), then \(x \le a + \sqrt{ab}\).
Proof
The starting point is the quadratic inequality \(x^2 - ax - ab \le 0\). Letting \(r_1 \le r_2\) be the roots of the quadratic, the inequality holds if \(x \in [r_1, r_2]\). The upper bound is then given by using \(\sqrt{a+b} \le \sqrt{a} +\sqrt{b}\)
\(\square\)
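Applying the subadditivity of the square root to the largest root \(r_2 = \big (a + \sqrt{a^2 + 4ab}\big )/2\) gives the bound \(x \le a + \sqrt{ab}\), which a quick numerical check confirms on random \((a, b)\):

```python
import math
import random

random.seed(0)
for _ in range(1000):
    a = random.uniform(0.0, 10.0)
    b = random.uniform(0.0, 10.0)
    # Largest root r_2 of x^2 - a*x - a*b = 0; any x satisfying the premise
    # x^2 <= a(x + b) obeys x <= r_2.
    r2 = (a + math.sqrt(a * a + 4 * a * b)) / 2
    assert r2 * r2 <= a * (r2 + b) + 1e-9    # premise holds at r_2
    assert r2 <= a + math.sqrt(a * b) + 1e-9  # Lemma 5's upper bound
```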
Counterexample for linesearch for SVRG
Proposition 3
For any \(c> 0, \eta _{\max } > 0\), there exists a 1-dimensional function f whose minimizer is \(x^*=0\), and for which the following holds: if at any point of Algorithm 6, we have \(\left| x_t^k\right| \in \big (0, \min \{ \frac{1}{c}, 1\}\big )\), then \(\left| x_{t+1}^k\right| \ge \left| x_t^k\right|\).
Proof
Define the following function
where \(a>0\) is a constant that will be determined later. We then have the following
The minimizer of f is 0, while the minimizers of \(f_1\) and \(f_2\) are 1 and \(-1\), respectively. This symmetry will make the algorithm fail.
Now, as stated in the assumption, let \(\left| x_t^k\right| \in \big (0, \min \{ \frac{1}{c}, 1\}\big )\). WLOG assume \(x_t^k > 0\); the other case is symmetric.
Case 1: \(i_t = 1\). Then we have
Observe that \(g_t > 0\). Since \(x_t^k < 1\) and the function \(f_1\) is strictly decreasing on the interval \((-\infty , 1]\), moving in the direction \(-g_t\) from \(x_t^k\) can only increase the function value. Thus the Armijo line-search will fail and yield \(\eta _t = 0\), so in that case \(x_{t+1}^k = x_t^k\).
Case 2: \(i_t = 2\). Then we have
The Armijo line search then reads
which we can rewrite as
Simplifying this gives
which simplifies even further to
Therefore, the Armijo linesearch will return a stepsize such that
Now, recall that by assumption we have \(x_t^k < 1/c\). Then \(1/x_t^k - c > 0\), which implies that
It remains to choose a. Indeed, if a is such that \(1/a \le \eta _{\max }\), we then have by Eq. (89) that
We then have
where the inequality comes from \(\eta _t \ge 1/a\) and the fact that \(x_t^k \ge 0\). Thus we indeed have \(\left| x_{t+1}^k\right| \ge \left| x_t^k\right|\). \(\square\)
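The argument can be simulated numerically. The construction below is an assumed instantiation (the paper's exact f is defined in an equation omitted from this excerpt): \(f_1(x) = \frac{a}{2}(x-1)^2\) and \(f_2(x) = \frac{a}{2}(x+1)^2\), so that \(f = (f_1+f_2)/2\) has minimizer 0 while \(f_1, f_2\) are minimized at \(\pm 1\). With the SVRG search direction and a backtracking Armijo line-search on \(f_{i_t}\) (my reading of Algorithm 6), an iterate in \(\big (0, \min \{1/c, 1\}\big )\) never moves closer to the minimizer:

```python
# Sketch of Proposition 3 under an assumed instantiation of f_1, f_2.
a, c, eta_max = 10.0, 0.5, 1.0

f = [lambda x: a / 2 * (x - 1) ** 2,   # f_1, minimized at  1
     lambda x: a / 2 * (x + 1) ** 2]   # f_2, minimized at -1
grad = [lambda x: a * (x - 1),
        lambda x: a * (x + 1)]

def svrg_dir(i, x, w):
    """Variance-reduced gradient: grad f_i(x) - grad f_i(w) + grad f(w)."""
    full = (grad[0](w) + grad[1](w)) / 2
    return grad[i](x) - grad[i](w) + full

def armijo(i, x, g, eta=eta_max, backtracks=50):
    """Backtracking Armijo line-search on f_i along -g."""
    for _ in range(backtracks):
        if f[i](x - eta * g) <= f[i](x) - c * eta * g ** 2:
            return eta
        eta /= 2
    return 0.0  # no admissible step-size: the iterate stays put

x = w = 0.1  # |x| in (0, min(1/c, 1)); w is the snapshot point
for i in (0, 1):
    g = svrg_dir(i, x, w)               # equals a*x > 0 for both components here
    x_new = x - armijo(i, x, g) * g
    # i = 0: line-search fails (eta = 0); i = 1: the step overshoots past 0.
    assert abs(x_new) >= abs(x)
```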
Additional experiments
1.1 Poor performance of AdaGrad compared to variance reduction methods
See Fig. 4.
1.2 Additional experiments with batch-size = 64
1.3 Studying the effect of the batch-size on the performance of AdaSVRG
See Figs. 8, 9, 10, 11, 12 and 13.
1.4 Evaluating the diagonal variant of AdaSVRG
1.5 Additional experiments on adaptivity to overparameterization
See Fig. 16.
Dubois-Taine, B., Vaswani, S., Babanezhad, R. et al. SVRG meets AdaGrad: painless variance reduction. Mach Learn 111, 4359–4409 (2022). https://doi.org/10.1007/s10994-022-06265-x