1 Introduction

Variance reduction (VR) methods (Schmidt et al., 2017; Konečný & Richtárik, 2013; Mairal, 2013; Shalev-Shwartz & Zhang, 2013; Johnson & Zhang, 2013; Mahdavi & Jin, 2013; Defazio et al., 2014; Nguyen et al., 2017) have proven to be an important class of algorithms for stochastic optimization. These methods take advantage of the finite-sum structure prevalent in machine learning problems, and have improved convergence over stochastic gradient descent (SGD) and its variants (see Gower et al., 2020 for a recent survey). For example, when minimizing a finite sum of n strongly-convex, smooth functions with condition number \(\kappa\), these methods typically require \(O\left( (\kappa + n) \, \log (1/\epsilon ) \right)\) gradient evaluations to obtain an \(\epsilon\)-error. This improves upon the complexity of full-batch gradient descent (GD) that requires \(O\left( \kappa n \, \log (1/\epsilon ) \right)\) gradient evaluations, and SGD that has an \(O(\kappa /\epsilon )\) complexity. Moreover, there have been numerous VR methods that employ Nesterov acceleration (Allen-Zhu, 2017; Lan et al., 2019; Song et al., 2020) and can achieve even faster rates.

In order to guarantee convergence, VR methods require an easier-to-tune constant step-size, whereas SGD needs a decreasing step-size schedule. Consequently, VR methods are commonly used in practice, especially when training convex models such as logistic regression or conditional Markov random fields (Schmidt et al., 2015). However, all the above-mentioned VR methods require knowledge of the smoothness of the underlying function in order to set the step-size. The smoothness constant is often unknown and difficult to estimate in practice. Although we can obtain global upper-bounds on it for simple problems such as least squares regression, these bounds are usually too loose to be practically useful and result in sub-optimal performance. Consequently, implementing VR methods requires a computationally expensive search over a range of step-sizes. Furthermore, a constant step-size does not adapt to the function’s local smoothness and may lead to poor empirical performance.

Consequently, there have been a number of works that try to adapt the step-size in VR methods. Schmidt et al. (2017) and Mairal (2013) employ stochastic line-search procedures to set the step-size in VR algorithms. While they show promising empirical results using line-searches, these procedures have no theoretical convergence guarantees. Recent works (Tan et al., 2016; Li et al., 2020) propose to use the Barzilai-Borwein (BB) step-size (Barzilai & Borwein, 1988) in conjunction with two common VR algorithms—stochastic variance reduced gradient (SVRG) (Johnson & Zhang, 2013) and the stochastic recursive gradient algorithm (SARAH) (Nguyen et al., 2017). Both Tan et al. (2016) and Li et al. (2020) can automatically set the step-size without requiring the knowledge of problem-dependent constants. However, in order to prove theoretical guarantees for strongly-convex functions, these techniques require the knowledge of both the smoothness and strong-convexity parameters. In fact, their guarantees require using a small \(O(1/\kappa ^2)\) step-size, a highly-suboptimal choice in practice. Consequently, there is a gap in the theory and practice of adaptive VR methods. To address this, we make the following contributions.

1.1 Background and contributions

1.1.1 SVRG meets AdaGrad

In Sect. 3 we use AdaGrad (Duchi et al., 2011; Levy et al., 2018), an adaptive gradient method, with stochastic variance reduction techniques. We focus on SVRG (Johnson & Zhang, 2013) and propose to use AdaGrad within its inner-loop. We analyze the convergence of the resulting AdaSVRG algorithm for minimizing convex functions (without strong-convexity). Using O(n) inner-loops for every outer-loop (a typical setting used in practice; Babanezhad Harikandeh et al., 2015; Sebbouh et al., 2019), and any bounded step-size, we prove that AdaSVRG achieves an \(\epsilon\)-error (for \(\epsilon = \Omega (1/n)\)) with \(O(n/\epsilon )\) gradient evaluations (Theorem 1). This rate matches that of SVRG with a constant step-size and O(n) inner-loops (Reddi et al., 2016, Corollary 10). However, unlike Reddi et al. (2016), our result does not require knowledge of the smoothness constant in order to set the step-size. We note that other previous works (Cutkosky & Orabona, 2019; Liu et al., 2020) consider adaptive methods with variance reduction for non-convex minimization; however, their algorithms still require knowledge of problem-dependent parameters.

1.1.2 Multi-stage AdaSVRG

We propose a multi-stage variant of AdaSVRG where each stage involves running AdaSVRG for a fixed number of inner and outer-loops. In particular, multi-stage AdaSVRG maintains a fixed-size outer-loop and doubles the length of the inner-loop across stages. We prove that it requires \(O((n + 1/\epsilon ) \log (1/\epsilon ))\) gradient evaluations to reach an \(O(\epsilon )\) error (Theorem 2). This improves upon the complexity of decreasing step-size SVRG that requires \(O(n + \sqrt{n}/\epsilon )\) gradient evaluations (Reddi et al., 2016, Corollary 9); and matches the rate of SARAH (Nguyen et al., 2017).

After our work was made publicly available (Dubois-Taine et al., 2021), recent work (Zhou et al., 2021) improved upon our result by applying a similar idea to an accelerated variant of SVRG (Allen-Zhu, 2017). Their algorithm requires \(\tilde{O}(n + \sqrt{n/\epsilon })\) gradient evaluations to obtain an \(\epsilon\)-error without the knowledge of problem-dependent constants.

1.1.3 AdaSVRG with adaptive termination

Instead of using a complex multi-stage procedure, we prove that AdaSVRG can also achieve the improved \(O((n + 1/\epsilon ) \log (1/\epsilon ))\) gradient evaluation complexity by adaptively terminating its inner-loop (Sect. 4). However, the adaptive termination requires the knowledge of problem-dependent constants, limiting its practical use.

To address this, we use the favourable properties of AdaGrad to design a practical heuristic for adaptively terminating the inner-loop. Our technique for adaptive termination is related to heuristics (Pflug, 1983; Yaida, 2018; Lang et al., 2019; Pesme et al., 2020) that detect stalling for constant step-size SGD, and may be of independent interest. First, we show that when minimizing smooth convex losses, AdaGrad has a two-phase behaviour—a first “deterministic phase” where the step-size remains approximately constant followed by a second “stochastic” phase where the step-size decreases at an \(O(1/\sqrt{t})\) rate (Theorem 4). We show that it is empirically possible to efficiently detect this phase transition and aim to terminate the AdaSVRG inner-loop when AdaGrad enters the stochastic phase.

1.1.4 Practical considerations and experimental evaluation

In Sect. 5, we describe some of the practical considerations for implementing AdaSVRG and the adaptive termination heuristic. We use standard real-world datasets to empirically verify the robustness and effectiveness of AdaSVRG. Across datasets, we demonstrate that AdaSVRG consistently outperforms variants of SVRG, SARAH and methods based on the BB step-size (Tan et al., 2016; Li et al., 2020).

1.1.5 Adaptivity to over-parameterization

Defazio and Bottou (2019) demonstrated the ineffectiveness of SVRG when training large over-parameterized models such as deep neural networks. We argue that this ineffectiveness can be partially explained by the interpolation property satisfied by over-parameterized models (Schmidt & Le Roux, 2013; Ma et al., 2018; Vaswani et al., 2019a). In the interpolation setting, SGD obtains an \(O(1/\epsilon )\) gradient complexity when minimizing smooth convex functions (Vaswani et al., 2019a), thus out-performing typical VR methods. However, interpolation is rarely exactly satisfied in practice, and using SGD can result in oscillations around the solution. On the other hand, although VR methods have a slower convergence, they do not oscillate, regardless of interpolation. In Sect. 6, we use AdaGrad to exploit the (approximate) interpolation property, and employ the above heuristic to adaptively switch to AdaSVRG, thus avoiding oscillatory behaviour. We design synthetic problems controlling the extent of interpolation and show that the hybrid AdaGrad-AdaSVRG algorithm can match or outperform both stochastic gradient and VR methods, thus achieving the best of both worlds.

2 Problem setup

We consider the minimization of an objective \(f:\mathbb {R}^d\rightarrow \mathbb {R}\) with a finite-sum structure,

$$\min _{w \in X} f(w)=\frac{1}{n}\sum _{i=1}^n f_i(w),$$

where X is a convex compact set of diameter D, meaning \(\sup _{x, y \in X}\left\| x - y\right\| \le D\). Problems with this structure are prevalent in machine learning. For example, in supervised learning, n represents the number of training examples, and \(f_i\) is the loss function when classifying or regressing to training example i. Throughout this paper, we assume f and each \(f_i\) are differentiable. We assume that f is convex. Since f is continuous and X is compact, there exists a solution \(w^{*}\in X\) that minimizes it, and we define \(f^*:= f(w^{*})\). Interestingly, we do not need each \(f_i\) to be convex. We further assume that each function \(f_{i}\) in the finite-sum is \(L_i\)-smooth, implying that f is \(L_{\max }\)-smooth, where \(L_{\max } = \max _{i} L_i\). We include the formal definitions of these properties in Appendix “Definitions”.

The classical method for solving such a problem is stochastic gradient descent (SGD). Starting from the iterate \(x_0\), at each iteration t SGD samples (typically uniformly at random) a loss function \(f_{i_t}\) and takes a step in the negative direction of the stochastic gradient \(\nabla f_{i_t}(x_t)\) using a step-size \(\eta _t\). This update can be expressed as

$$\begin{aligned} x_{t+1}&= x_t - \eta _t \nabla f_{i_t}(x_t) \end{aligned}$$

In order to ensure convergence to the minimizer, the sequence of step-sizes in SGD needs to be decreasing, typically at an \(O(1/\sqrt{t})\) rate (Moulines & Bach, 2011). This has the effect of slowing down the convergence and results in a \(\Theta (1/\sqrt{t})\) convergence to the minimizer for convex functions (compared to the O(1/t) convergence for gradient descent). Variance reduction methods were developed to overcome this slower convergence by exploiting the finite-sum structure of the objective.

We focus on the SVRG algorithm (Johnson & Zhang, 2013) since it is more memory efficient than other variance reduction alternatives like SAG (Schmidt et al., 2017) or SAGA (Defazio et al., 2014). SVRG has a nested inner-outer loop structure. In every outer-loop k, it computes the full gradient \(\nabla f(w_{k})\) at a snapshot point \(w_{k}\). An outer-loop k consists of \(m_k\) inner-loops indexed by \(t = 1, 2, \ldots , m_k\) and the inner-loop iterate \(x_1\) is initialized to \(w_{k}\). In outer-loop k and inner-loop t, SVRG samples an example \(i_t\) (typically uniformly at random) and takes a step in the negative direction of the variance-reduced gradient \(g_t\) using a constant step-size \(\eta\). This update can be expressed as:

$$\begin{aligned} g_t&= \nabla f_{i_t}(x_t) - \nabla f_{i_t}(w_{k}) + \nabla f(w_{k}) \nonumber \\ x_{t+1}&= \Pi _{X} [x_{t} - \eta \, g_t], \end{aligned}$$

where \(\Pi _{X}\) denotes the Euclidean projection onto the set X. The variance-reduced gradient is unbiased, meaning that \(\mathbb {E}_{i_t}[g_t \vert x_t] = \nabla f(x_t)\). At the end of the inner-loop, the next snapshot point is typically set to either the last or averaged iterate in the inner-loop.
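The SVRG update above can be illustrated with a minimal NumPy sketch. The least-squares problem, the helper names (`svrg_gradient`, `svrg_outer_loop`, `grad_i`, `full_grad`), the step-size 0.01 and the choice of the last iterate as the next snapshot are all assumptions made for illustration; the projection defaults to the identity (unconstrained case).

```python
import numpy as np


def svrg_gradient(grad_i, full_grad_w, x, w, i):
    """Variance-reduced gradient g_t = grad f_i(x) - grad f_i(w) + grad f(w)."""
    return grad_i(x, i) - grad_i(w, i) + full_grad_w


def svrg_outer_loop(grad_i, n, w, eta, m, rng, project=lambda x: x):
    """Run one SVRG outer-loop from snapshot w and return the last inner iterate."""
    # One pass over the data gives the full gradient at the snapshot.
    full_grad_w = np.mean([grad_i(w, i) for i in range(n)], axis=0)
    x = w.copy()
    for _ in range(m):
        i = rng.integers(n)  # sample an example uniformly at random
        g = svrg_gradient(grad_i, full_grad_w, x, w, i)
        x = project(x - eta * g)  # (projected) step with a constant step-size
    return x


# Hypothetical least-squares problem: f_i(w) = 0.5 * (a_i^T w - b_i)^2.
rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)


def grad_i(w, i):
    return (A[i] @ w - b[i]) * A[i]


def full_grad(w):
    return A.T @ (A @ w - b) / n


w = np.zeros(d)
for _ in range(30):  # 30 outer-loops with m = n inner-loops each
    w = svrg_outer_loop(grad_i, n, w, eta=0.01, m=n, rng=rng)
print("final gradient norm:", np.linalg.norm(full_grad(w)))
```

Averaging `svrg_gradient` over all indices i recovers the full gradient at x, which is the unbiasedness property stated above.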

SVRG requires the knowledge of both the strong-convexity and smoothness constants in order to set the step-size and the number of inner-loops. These requirements were relaxed by Hofmann et al. (2015), Kovalev et al. (2020) and Gower et al. (2020), whose methods only require knowledge of the smoothness.

In order to set the step-size for SVRG without requiring knowledge of the smoothness, line-search techniques are an attractive option. Such techniques are a common approach to automatically set the step-size for (stochastic) gradient descent (Armijo, 1966; Vaswani et al., 2019b). However, we show that an intuitive Armijo-like line-search to set the SVRG step-size is not guaranteed to converge to the solution. Specifically, we prove the following proposition in Appendix “Counter-example for line-search for SVRG”.

Proposition 1

If in each inner-loop t of SVRG, \(\eta _t\) is set as the largest step-size satisfying the condition: \(\eta _t \le \eta _{{\max}}\) and

$$\begin{aligned} f_{{i_t}}(x_{t} - \eta _t g_t) \le f_{{i_t}}(x_t) - c \eta _t \left\| g_t\right\| ^2 \quad {\text {where}} \quad (c >0), \end{aligned}$$

then for any \(c > 0\), \(\eta _{{\max}} > 0\), there exists a 1-dimensional convex smooth function such that if \(\vert x_t - w^{*}\vert \le \min \{ \frac{1}{c}, 1\}\), then \(\vert x_{t+1} - w^{*}\vert \ge \vert x_t - w^{*}\vert\), implying that the update moves the iterate away from the solution when it is close to it, preventing convergence.

In the next section, we suggest a novel approach using AdaGrad (Duchi et al., 2011) to propose AdaSVRG, a provably-convergent VR method that is more robust to the choice of step-size. To justify our decision to use AdaGrad, we note that in general, there are (roughly) three common ways of designing methods that do not require knowledge of problem-dependent constants: (i) BB step-size, but it still requires knowledge of \(L_{{\max}}\) to guarantee convergence in the VR setting (Tan et al., 2016; Li et al., 2020), (ii) Line-search methods that can fail to converge in the VR setting (Proposition 1), (iii) Adaptive gradient methods such as AdaGrad.


3 Adaptive SVRG

Like SVRG, AdaSVRG has a nested inner-outer loop structure and relies on computing the full gradient in every outer-loop. However, it uses AdaGrad in the inner-loop, using the variance reduced gradient \(g_t\) to update the preconditioner \(A_t\) in the inner-loop t. AdaSVRG computes the step-size \(\eta _k\) in every outer-loop (see Sect. 5 for details) and uses a preconditioned variance-reduced gradient step to update the inner-loop iterates: \(x_{t+1} = \Pi _{X}\left( x_t - \eta _{k} A_t^{-1} g_t\right)\). AdaSVRG then sets the next snapshot \(w_{k+1}\) to be the average of the inner-loop iterates.
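A minimal sketch of one AdaSVRG outer-loop in the scalar case is given below. It assumes the standard scalar AdaGrad accumulator, \(G_t = G_{t-1} + \Vert g_t\Vert ^2\) with \(G_0 = 0\) and \(A_t = \sqrt{G_t}\); this form, the function name `adasvrg_outer_loop`, and the exact indexing of the averaged iterates are assumptions for illustration rather than a transcription of Algorithm 1.

```python
import numpy as np


def adasvrg_outer_loop(grad_i, n, w, eta, m, rng, project=lambda x: x):
    """One outer-loop of scalar AdaSVRG (sketch); returns the averaged inner iterate."""
    # Full gradient at the snapshot point w.
    full_grad_w = np.mean([grad_i(w, i) for i in range(n)], axis=0)
    x = w.copy()
    G = 0.0  # running sum of squared norms of variance-reduced gradients (assumed form)
    iterates = []
    for _ in range(m):
        i = rng.integers(n)
        g = grad_i(x, i) - grad_i(w, i) + full_grad_w  # variance-reduced gradient
        G += float(g @ g)
        if G > 0.0:
            # Scalar preconditioner A_t = sqrt(G_t); no step is taken while G_t = 0.
            x = project(x - (eta / np.sqrt(G)) * g)
        iterates.append(x.copy())
    # Next snapshot: average of the inner-loop iterates.
    return np.mean(iterates, axis=0)
```

The accumulator is reset at the start of every outer-loop in this sketch, so the effective step-size \(\eta _k / \sqrt{G_t}\) starts large and shrinks only as gradient norms accumulate.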

We now analyze the convergence of AdaSVRG (see Algorithm 1 for the pseudo-code). Throughout the main paper, we will only focus on the scalar variant (Ward et al., 2019) of AdaGrad. We defer the general diagonal and matrix variants (see Appendix “Algorithm in general case” for the pseudo-code) and their corresponding theory to the Appendix. We start with the analysis of a single outer-loop, and prove the following lemma in Appendix “Proof of Lemma 1”.

Lemma 1

(AdaSVRG with single outer-loop) Assume (i) convexity of f, (ii) \(L_{\max }\)-smoothness of \(f_i\) and (iii) bounded feasible set with diameter D. Defining \(\rho := \big ( \frac{D^2}{\eta _{k}} + 2\eta _{k}\big ) \sqrt{L_{{\max}}}\), for any outer loop k of AdaSVRG, with (a) inner-loop length \(m_k\) and (b) step-size \(\eta _{k}\),

$$\begin{aligned} \mathbb {E}[f(w_{k+1}) - f^*] \le \frac{\rho ^2}{m_k} + \frac{\rho \sqrt{\mathbb {E}[f(w_k) - f^*]}}{\sqrt{m_k}} \end{aligned}$$

The proof of the above lemma leverages the theoretical results of AdaGrad (Duchi et al., 2011; Levy et al., 2018). Specifically, the standard AdaGrad analysis bounds the “noise” term by the variance in the stochastic gradients. On the other hand, we use the properties of the variance reduced gradient in order to upper-bound the noise in terms of the function suboptimality.

Lemma 1 shows that a single outer-loop of AdaSVRG converges to the minimizer as \(O(1/\sqrt{m})\), where m is the number of inner-loops. This implies that in order to obtain an \(\epsilon\)-error, a single outer-loop of AdaSVRG requires \(O(n + 1/\epsilon ^2)\) gradient evaluations. This result holds for any bounded step-size and requires setting \(m = O(1/\epsilon ^2)\). This “single outer-loop convergence” property of AdaSVRG is unlike SVRG or any of its variants; running only a single outer-loop of SVRG is ineffective, as it stops making progress at some point, resulting in the iterates oscillating in a neighbourhood of the solution. The favourable behaviour of AdaSVRG is similar to SARAH, but unlike SARAH, the above result does not require computing a recursive gradient or knowing the smoothness constant.

Next, we consider the convergence of AdaSVRG with a fixed-size inner-loop and multiple outer-loops. In the following theorems, we assume that we have a bounded range of step-sizes implying that for all k, \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\). For brevity, similar to Lemma 1, we define \(\rho :=\big ( \frac{D^2}{\eta _{{\min}}} + 2\eta _{{\max}}\big ) \sqrt{L_{{\max}}}\).

Theorem 1

(AdaSVRG with fixed-size inner-loop) Under the same assumptions as Lemma 1, AdaSVRG with (a) step-sizes \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\), (b) inner-loop size \(m_k = n\) for all k, results in the following convergence rate after \(K \le n\) outer-loops:

$$\begin{aligned} \mathbb {E}[f(\bar{w}_K) - f^*] \le \frac{\rho ^2 ( 1+ \sqrt{5}) + \rho \sqrt{2\left( f(w_0) - f^*\right) }}{K} \end{aligned}$$

where \(\bar{w}_K = \frac{1}{K} \sum _{k = 1}^{K} w_{k}\).

The proof (refer to Appendix “Proof of Theorem 1”) recursively uses the result of Lemma 1 for K outer-loops.

The above result requires a fixed inner-loop size \(m_k = n\), a setting typically used in practice (Babanezhad Harikandeh et al., 2015; Gower et al., 2020). Notice that the above result holds only when \(K \le n\). Since K is the number of outer-loops, it is typically much smaller than n, the number of functions in the finite sum, justifying the theorem’s \(K \le n\) requirement. Moreover, in the sense of generalization error, it is not necessary to optimize below an O(1/n) accuracy (Boucheron et al., 2005; Sridharan et al., 2008).

Theorem 1 implies that AdaSVRG can reach an \(\epsilon\)-error (for \(\epsilon = \Omega (1/n)\)) using \(\mathcal O(n/\epsilon )\) gradient evaluations. This result matches the complexity of constant step-size SVRG (with \(m_k = n\)) of Reddi et al. (2016, Corollary 10) but without requiring the knowledge of the smoothness constant. However, unlike SVRG and SARAH, the convergence rate depends on the diameter D rather than \(\left\| w_0 - w^{*}\right\|\), the initial distance to the solution. This dependence arises due to the use of AdaGrad in the inner-loop, and is necessary for adaptive gradient methods. Specifically, Cutkosky and Boahen (2017) prove that any adaptive (to problem-dependent constants) method will necessarily incur such a dependence on the diameter. Hence, such a diameter dependence can be considered to be the “cost” of the lack of knowledge of problem-dependent constants.

Since the above result only holds for \(\epsilon = \Omega (1/n)\), we propose a multi-stage variant (Algorithm 2) of AdaSVRG that requires \(O((n+1/\epsilon )\log (1/\epsilon ))\) gradient evaluations to attain an \(O(\epsilon )\)-error for any \(\epsilon\). To reach a target suboptimality of \(\epsilon\), we consider \(I = \log (1/\epsilon )\) stages. For each stage i, Algorithm 2 uses a fixed number of outer-loops K and inner-loops \(m^i\), with stage i initialized to the output of the \((i-1)\)-th stage. In Appendix “Proof of Theorem 2”, we prove the following rate for multi-stage AdaSVRG.

Theorem 2

(Multi-stage AdaSVRG) Under the same assumptions as Theorem 1, multi-stage AdaSVRG with \(I = \log (1/\epsilon )\) stages, \(K \ge 3\) outer-loops and \(m^i = 2^{i+1}\) inner-loops at stage i, requires \(O(n\log (1/\epsilon ) + 1/\epsilon )\) gradient evaluations to reach a \(\left( \rho ^2 (1+\sqrt{5}) \right) \epsilon\)-sub-optimality.

We see that multi-stage AdaSVRG matches the convergence rate of SARAH (up to constants), but does so without requiring the knowledge of the smoothness constant to set the step-size. Observe that the number of inner-loops increases with the stage, i.e., \(m^i = 2^{i+1}\). The intuition behind this is that the convergence of AdaGrad (used in the inner-loop of the k-th outer-loop of AdaSVRG) is slowed down by a “noise” term proportional to \(f(w_k) - f^*\) (see Lemma 1). When this “noise” term is large in the earlier stages of multi-stage AdaSVRG, the inner-loops have to be short in order to maintain the overall \(O(1/{\epsilon })\) convergence. However, as the stages progress and the suboptimality decreases, the “noise” term becomes smaller, and the algorithm can use longer inner-loops, which reduces the number of full gradient computations, resulting in the desired convergence rate.
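To see where the \(n\log (1/\epsilon ) + 1/\epsilon\) count comes from, the schedule of Theorem 2 can be tallied under a simple cost model. The sketch below assumes a base-2 logarithm for the number of stages, stages indexed from 1, and a cost of n evaluations per full gradient plus one unit per inner-loop step; the function name `multistage_gradient_evals` is hypothetical.

```python
import math


def multistage_gradient_evals(n, eps, K=3):
    """Count gradient evaluations of multi-stage AdaSVRG under a simple cost model."""
    num_stages = math.ceil(math.log2(1.0 / eps))  # I = log(1/eps), base 2 assumed
    total = 0
    for i in range(1, num_stages + 1):
        m_i = 2 ** (i + 1)      # inner-loop length doubles from one stage to the next
        total += K * (n + m_i)  # each outer-loop: one full gradient plus m_i inner steps
    return total
```

The full-gradient cost contributes \(K n I\) evaluations, while the inner-loop lengths form a geometric series dominated by the last stage, which has length of order \(1/\epsilon\).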

Thus far, we have focused on using AdaSVRG with fixed-size inner-loops. Next, we consider variants that can adaptively determine the inner-loop size.

4 Adaptive termination of inner-loop

Recall that the convergence of a single outer-loop k of AdaSVRG (Lemma 1) is slowed down by the \(\sqrt{ {\left( f(w_k) - f^*\right)}/{m_k}}\) term. Similar to the multi-stage variant, the suboptimality \(f(w_k) - f^*\) decreases as AdaSVRG progresses. This allows the use of longer inner-loops as k increases, resulting in fewer full-gradient evaluations. We instantiate this idea by setting \(m_k = O\left( {1}/{\left( f(w_k) - f^*\right) }\right)\). Since this choice requires the knowledge of \(f(w_k) - f^*\), we alternatively consider using \(m_k = O(1/\epsilon )\), where \(\epsilon\) is the desired sub-optimality. We prove the following theorem in Appendix “Proof of Theorem 3”.

Theorem 3

(AdaSVRG with adaptive-sized inner-loops) Under the same assumptions as Lemma 1, AdaSVRG with (a) step-sizes \(\eta _{k}\in [\eta _{{\min}}, \eta _{{\max}}]\), (b1) inner-loop size \(m_k = \frac{4\rho ^2}{\epsilon }\) for all k or (b2) inner-loop size \(m_k = \frac{4\rho ^2}{f(w_k) - f^*}\) for outer-loop k, results in the following convergence rate,

$$\begin{aligned} \mathbb {E}[f(w_K) - f^*] \le (3/4)^K [f(w_0) - f^*]. \end{aligned}$$

The above result implies a linear convergence in the number of outer-loops, but each outer-loop requires \(O(1/\epsilon )\) inner-loops. Hence, Theorem 3 implies that AdaSVRG with adaptive-sized inner-loops requires \(O\left( (n + 1/\epsilon ) \log (1/\epsilon ) \right)\) gradient evaluations to reach an \(\epsilon\)-error. This improves upon the rate of SVRG and matches the convergence rate of SARAH that also requires inner-loops of length \(O(1/\epsilon )\). Compared to Theorem 1 that has an average iterate convergence (for \(\bar{w}_K\)), Theorem 3 has the desired convergence for the last outer-loop iterate \(w_{K}\) and also holds for any bounded sequence of step-sizes. However, unlike Theorem 1, this result (with either setting of \(m_k\)) requires the knowledge of problem-dependent constants in \(\rho\).
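As a sanity check on the \(3/4\) factor, one can substitute choice (b2) into the bound of Lemma 1. The following is only a sketch: it conditions on \(w_k\) (so that \(m_k\) is fixed), ignores rounding \(m_k\) to an integer, uses that the \(\rho\) of Theorem 3 upper-bounds the \(\rho\) of Lemma 1 for any \(\eta _k \in [\eta _{{\min}}, \eta _{{\max}}]\), and introduces the shorthand \(\Delta _k := f(w_k) - f^*\), which is not notation from the paper. With \(m_k = 4\rho ^2/\Delta _k\),

$$\begin{aligned} \mathbb {E}[f(w_{k+1}) - f^* \mid w_k] \le \frac{\rho ^2}{m_k} + \frac{\rho \sqrt{\Delta _k}}{\sqrt{m_k}} = \frac{\Delta _k}{4} + \frac{\rho \sqrt{\Delta _k} \, \sqrt{\Delta _k}}{2\rho } = \frac{3}{4} \Delta _k . \end{aligned}$$

Taking expectations and unrolling over K outer-loops gives the \((3/4)^K\) contraction; the complete argument, including choice (b1), is the one in Appendix “Proof of Theorem 3”.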

To address this issue, we design a heuristic for adaptive termination in the next sections. We start by describing the two-phase behaviour of AdaGrad and subsequently utilize it for adaptive termination in AdaSVRG.

4.1 Two phase behaviour of AdaGrad

Diagnostic tests (Pflug, 1983; Yaida, 2018; Lang et al., 2019; Pesme et al., 2020) study the behaviour of the SGD dynamics to automatically control its step-size. Similarly, designing the adaptive termination test requires characterizing the behaviour of AdaGrad used in the inner loop of AdaSVRG.

We first investigate the dynamics of constant step-size AdaGrad in the stochastic setting. Specifically, we monitor the evolution of \(\sqrt{G_t}\) across iterations. We define \(\sigma ^2 :=\sup _{x \in X} \mathbb E_{i} \Vert \nabla f_i(x) - \nabla f(x) \Vert ^2\) as a uniform upper-bound on the variance in the stochastic gradients for all iterates. We prove the following theorem showing that there exists an iteration \(T_0\) when the evolution of \(\sqrt{G_t}\) undergoes a phase transition. Note again that such a phase transition happens for the scalar, diagonal and matrix variants of AdaGrad. We only present the scalar result in the main paper and defer the other variants to the Appendix.

Theorem 4

(Phase Transition in AdaGrad Dynamics) Under the same assumptions as Lemma 1 and (iv) \(\sigma ^2\)-bounded stochastic gradient variance and defining \(T_0 = \frac{\rho ^2 L_{{\max}}}{\sigma ^2}\), for constant step-size AdaGrad we have \(\mathbb {E}[\sqrt{G_t}] = O(1) \text { for } t \le T_0\), and \(\mathbb {E}[\sqrt{G_t}] = O (\sqrt{t-T_0}) \text { for } t \ge T_0\).

Theorem 4 (proved in Appendix “Proof of Theorem 4”) indicates that \(G_t\) is bounded by a constant for all \(t \le T_0\), implying that its rate of growth is slower than \(\log (t)\). This implies that the step-size of AdaGrad is approximately constant (similar to gradient descent in the full-batch setting) in this first phase until iteration \(T_0\). Indeed, if \(\sigma = 0\), \(T_0 = \infty\) and AdaGrad is always in this deterministic phase. This result generalizes Qian and Qian (2019, Theorem 3.1), which analyzes the diagonal variant of AdaGrad in the deterministic setting. After iteration \(T_0\), the noise \(\sigma ^2\) starts to dominate, and AdaGrad transitions into the stochastic phase where \(G_t\) grows as O(t). In this phase, the step-size decreases as \(O(1/\sqrt{t})\), resulting in slower convergence to the minimizer. AdaGrad thus results in an overall \(O \left( {1}/{T} + {\sigma ^2}/{\sqrt{T}} \right)\) rate (Levy et al., 2018), where the first term corresponds to the deterministic phase and the second to the stochastic phase.

Since the exact detection of this phase transition is not possible, we design a heuristic to detect it without requiring the knowledge of problem-dependent constants.


4.2 Heuristic for adaptive termination

Similar to tests used to detect stalling for SGD (Pflug, 1983; Pesme et al., 2020), the proposed diagnostic test has a burn-in phase of \({n}/{2}\) inner-loop iterations that allows the initial AdaGrad dynamics to stabilize. After this burn-in phase, for every even iteration t, we compute the ratio \(R = \frac{G_{t} - G_{t/2}}{G_{t/2}}\). Given a threshold hyper-parameter \(\theta\), the test terminates the inner-loop when \(R \ge \theta\). In the first deterministic phase, since the growth of \(G_{t}\) is slow, \(G_{t} \approx G_{t/2}\) and \(R \approx 0\). In the stochastic phase, \(G_{t}\) grows linearly in t, so \(G_{t} \approx 2 G_{t/2}\) and \(R \approx 1\), justifying that the test can distinguish between the two phases. AdaSVRG with this test is fully specified in Algorithm 3. Experimentally, we use \(\theta = 0.5\) to give an early indication of the phase transition.
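The test described above can be sketched as a small predicate. The function name `should_terminate`, the 1-indexed iteration convention, the storage of past values in a list `G_history`, and the strict reading of "after the burn-in phase" (testing only for t larger than the burn-in length) are assumptions for illustration.

```python
def should_terminate(G_history, t, burn_in, theta=0.5):
    """Return True if the inner-loop should stop at (1-indexed) iteration t.

    G_history[s - 1] holds G_s, the accumulated squared gradient norms after s steps.
    """
    if t <= burn_in or t % 2 != 0:
        return False  # only test even iterations after the burn-in phase
    G_half = G_history[t // 2 - 1]
    G_now = G_history[t - 1]
    if G_half == 0:
        return False  # ratio undefined; keep iterating
    R = (G_now - G_half) / G_half
    return R >= theta
```

With a nearly constant \(G_t\) the ratio stays close to 0 and the predicate returns False, whereas a linearly growing \(G_t\) gives a ratio of 1 and triggers termination for any \(\theta \le 1\).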

5 Experimental evaluation of AdaSVRG

We first describe the practical considerations for implementing AdaSVRG and then evaluate its performance on real and synthetic datasets. The code to reproduce the experiments can be found at the following link: https://github.com/bpauld/AdaSVRG. We do not use projections in our experiments as these problems have an unconstrained \(w^*\) with finite norm (we thus assume D is big enough to include it), and we empirically observed that our iterates always stayed bounded, thus not requiring any projection.

5.1 Implementing AdaSVRG

Though our theoretical results hold for any bounded sequence of step-sizes, the choice of step-size affects the practical performance of AdaGrad (Vaswani et al., 2020) (and hence AdaSVRG). Theoretically, the optimal step-size minimizing the bound in Lemma 1 is given by \(\eta ^* = \frac{D}{\sqrt{2}}\). Since we do not have access to D, we use the following heuristic to set the step-size for each outer-loop of AdaSVRG. In outer-loop k, we approximate D by \(\Vert w_{k}- w^{*}\Vert\), which can be lower-bounded using the co-coercivity of smooth convex functions as \(\Vert w_{k}- w^{*}\Vert \ge {1}/{L_{{\max}}} \Vert \nabla f(w_k) \Vert\) (Nesterov, 2004, Thm. 2.1.5 (2.1.8)). We have access to \(\nabla f(w_k)\) for the current outer-loop, and store the value of \(\nabla f(w_{k-1})\) in order to approximate the smoothness constant. Specifically, by co-coercivity, \(L_{{\max}} \ge L_k :=\frac{\Vert \nabla f(w_k) - \nabla f(w_{k-1})\Vert }{\Vert w_k - w_{k-1}\Vert }\). Putting these together, \(\eta _{k}= \frac{\Vert \nabla f(w_k)\Vert }{\sqrt{2} \, \max _{i=0, \dots , k} L_i}\). Although a similar heuristic could be used to estimate \(L_{{\max}}\) for SVRG or SARAH, the resulting step-size is larger than \(1/L_{{\max}}\), implying that it would not have any theoretical guarantee, while our results hold for any bounded sequence of step-sizes. Although Algorithm 1 requires setting \(w_{k+1}\) to be the average of the inner-loop iterates, we use the last-iterate and set \(w_{k+1}= x_{m_k}\), as this is a more common choice (Johnson & Zhang, 2013; Tan et al., 2016) and results in better empirical performance. We compare two variants of AdaSVRG, with (i) fixed-size inner-loop (Algorithm 1) and (ii) adaptive termination (Algorithm 3). We handle a general batch-size b, and set \(m = {n}/{b}\) for Algorithm 1. This is a common practical choice (Babanezhad Harikandeh et al., 2015; Gower et al., 2020; Kovalev et al., 2020). For Algorithm 3, the burn-in phase consists of \({n}/{2b}\) iterations and \(M = {10n}/{b}\).
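The step-size heuristic can be sketched as follows. The function name `estimate_step_size`, the running-maximum argument, and the handling of the first outer-loop (where no previous snapshot exists, so the sketch raises an error) are assumptions for illustration; the text above does not specify how \(L_0\) is initialized.

```python
import numpy as np


def estimate_step_size(grad_k, grad_prev, w_k, w_prev, L_running_max=0.0):
    """Heuristic eta_k = ||grad f(w_k)|| / (sqrt(2) * max_i L_i).

    Returns the step-size and the updated running maximum of the L_i estimates.
    """
    step = np.linalg.norm(w_k - w_prev)
    if step > 0:
        # Local smoothness estimate L_k from two consecutive snapshots.
        L_k = np.linalg.norm(grad_k - grad_prev) / step
        L_running_max = max(L_running_max, L_k)
    if L_running_max == 0:
        raise ValueError("no smoothness estimate available yet")
    eta_k = np.linalg.norm(grad_k) / (np.sqrt(2.0) * L_running_max)
    return eta_k, L_running_max
```

For a quadratic \(f(w) = \frac{L}{2}\Vert w\Vert ^2\), the estimate \(L_k\) equals L exactly and the heuristic returns \(\eta _k = \Vert w_k\Vert /\sqrt{2}\), which coincides with \(D/\sqrt{2}\) when D is replaced by the distance to the minimizer.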

5.2 Evaluating AdaSVRG

In order to assess the effectiveness of AdaSVRG, we experiment with binary classification on standard LIBSVM datasets (Chang & Lin, 2011). In particular, we consider \(\ell _2\)-regularized problems (with regularization set to \({1}/{n}\)) with three losses: the logistic loss, the squared loss and the Huber loss. For each experiment we plot the median and standard deviation across 5 independent runs. In the main paper, we show the results for four of the datasets and relegate the results for the three others to Appendix “Additional experiments”. Similarly, we consider batch-sizes in the set \(\{1, 8, 64, 128\}\), but only show the results for \(b = 64\) in the main paper.

We compare the AdaSVRG variants against SVRG (Johnson & Zhang, 2013), loopless-SVRG (Kovalev et al., 2020), SARAH (Nguyen et al., 2017), and SVRG-BB (Tan et al., 2016), the only other tune-free VR method. Since each of these methods requires a step-size, we search over the grid \([10^{-3}, 10^{-2}, 10^{-1}, 1, 10, 100]\), and select the best step-size for each algorithm and each experiment. As is common, we set \(m = {n}/{b}\) for each of these methods. We note that though the theoretical results of SVRG-BB require a small \(O(1/\kappa ^2)\) step-size and \(O(\kappa ^2)\) inner-loops, Tan et al. (2016) recommend setting \(m = O(n)\) in practice. Since AdaGrad results in the slower \(O(1/\epsilon ^2)\) rate (Levy et al., 2018; Vaswani et al., 2020) compared to the \(O(n + \frac{1}{\epsilon })\) rate of VR methods, we do not include it in the main paper. We demonstrate the poor performance of AdaGrad on two example datasets in Fig. 4 in Appendix “Additional experiments”.

We plot the gradient norm of the training objective (for the best step-size) against the number of gradient evaluations normalized by the number of examples. We show the results for the logistic loss (Fig. 1a), Huber loss (Fig. 1b), and squared loss (Fig. 2). Our results show that (i) both variants of AdaSVRG (without any step-size tuning) are competitive with the other best-tuned VR methods, often out-performing them or matching their performance; (ii) SVRG-BB often has an oscillatory behavior, even for the best step-size; and (iii) the performance of AdaSVRG with adaptive termination (that has superior theoretical complexity) is competitive with that of the practically useful O(n) fixed inner-loop setting.

Fig. 1 Comparison of AdaSVRG against SVRG variants, SVRG-BB and SARAH with batch-size = 64 for logistic loss (top 2 rows) and Huber loss (bottom 2 rows). For both losses, we compare AdaSVRG against the best-tuned variants, and show the sensitivity to step-size (we limit the gradient norm to a maximum value of 10)

Fig. 2: Comparison of AdaSVRG against SVRG variants, SVRG-BB and SARAH with batch-size = 64 for squared loss. We compare AdaSVRG against the best-tuned variants, and show the sensitivity to step-size (we limit the gradient norm to a maximum value of 10). In cases where SVRG-BB diverged, we remove these curves.

In order to evaluate the effect of the step-size on a method’s performance, we plot the gradient norm after 50 outer-loops vs step-size for each of the competing methods. For the AdaSVRG variants, we set the step-size according to the heuristic described earlier. For the logistic loss (Fig. 1a), Huber loss (Fig. 1b) and squared loss (Fig. 2), we observe that (i) the performance of typical VR methods heavily depends on the choice of the step-size; (ii) the step-size corresponding to the minimum loss is different for each method, loss and dataset; and (iii) AdaSVRG with the step-size heuristic results in competitive performance. Additional results plotted in Figs. 5a, b, 6a, b, 7a, b, 8a–c, 9a–c, 10a–c, 11a–c, 12a–c, 13a–c in Appendix “Additional experiments” confirm that the good performance of AdaSVRG is consistent across losses, batch-sizes and datasets.

Finally, in Fig. 14a in Appendix “Evaluating the diagonal variant of AdaSVRG”, we give preliminary results benchmarking the performance of the diagonal variant of AdaSVRG and comparing it to the scalar variant. We do not compare to the full matrix variant since inverting a \(d \times d\) matrix in each iteration makes it impractical for most machine learning tasks. Our results demonstrate that with the current heuristic for setting \(\eta\), the performance of the diagonal variant does not significantly improve over the scalar variant. Moreover, since each iteration of the diagonal variant incurs an additional O(d) cost compared to the scalar variant, we did not conduct further experiments with the diagonal variant. In the future, we aim to develop robust heuristics for setting the step-size for this variant of AdaSVRG.

6 Heuristic for adaptivity to over-parameterization

In this section, we reason that the poor empirical performance of SVRG when training over-parameterized models (Defazio & Bottou, 2019) can be partially explained by the interpolation property (Schmidt & Le Roux, 2013; Ma et al., 2018; Vaswani et al., 2019a) satisfied by these models (Zhang et al., 2017). In particular, we focus on smooth convex losses, but assume that the model is capable of completely fitting the training data, and that \(w^*\) lies in the interior of X. For example, these properties are simultaneously satisfied when minimizing the squared hinge-loss for linear classification on separable data or unregularized kernel regression (Belkin et al., 2019; Liang et al., 2020) with \(\Vert w^*\Vert \le 1\).

Formally, the interpolation condition means that the gradient of each \(f_i\) in the finite-sum is zero at an optimum: if the overall objective f is minimized at \(w^{*}\), so that \(\nabla f(w^{*}) = 0\), then \(\nabla f_{i}(w^{*}) = 0\) for all \(f_{i}\). Additionally, we assume that each function \(f_i\) has a finite minimum \(f_i^*\). Since the interpolation property is rarely exactly satisfied in practice, we allow for a weaker version that uses \(\zeta ^2 :=\mathbb {E}_i [f^* - f_i^*] \in [0,\infty )\) (Loizou et al., 2020; Vaswani et al., 2020) to measure the extent of the violation of interpolation. If \(\zeta ^2 = 0\), interpolation is exactly satisfied.
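A small worked example may help fix the definition of \(\zeta ^2\). The one-dimensional quadratics \(f_i(w) = \frac{1}{2}(w - c_i)^2\) and the function names below are assumptions chosen purely for illustration. Each \(f_i\) has minimum \(f_i^* = 0\), and the average f is minimized at the mean of the \(c_i\), so \(\zeta ^2\) reduces to half the variance of the \(c_i\).

```python
def zeta_squared(centers):
    """zeta^2 = E_i[f^* - f_i^*] for the toy losses f_i(w) = 0.5 * (w - c_i)^2."""
    n = len(centers)
    w_star = sum(centers) / n                      # minimizer of the average f
    f_star = sum(0.5 * (w_star - c) ** 2 for c in centers) / n
    f_i_star = 0.0                                 # each f_i attains 0 at w = c_i
    return f_star - f_i_star

def grads_at_optimum(centers):
    """Per-function gradients f_i'(w^*) = w^* - c_i at the minimizer of f."""
    w_star = sum(centers) / len(centers)
    return [w_star - c for c in centers]

# Shared minimizer: every per-function gradient vanishes at w^* and zeta^2 = 0.
print(zeta_squared([1.0, 1.0, 1.0]), grads_at_optimum([1.0, 1.0, 1.0]))

# Disagreeing minimizers: gradients at w^* are non-zero and zeta^2 > 0.
print(zeta_squared([0.0, 2.0]), grads_at_optimum([0.0, 2.0]))
```

In the second case \(\nabla f(w^*) = 0\) still holds because the individual gradients cancel on average, which is exactly the situation in which interpolation is violated.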

When \(\zeta ^2 = 0\), both constant step-size SGD and AdaGrad have a gradient complexity of \(O(1/\epsilon )\) in the smooth convex setting (Schmidt & Le Roux, 2013; Vaswani et al., 2019a, 2020). In contrast, typical VR methods have an \(\tilde{O}(n + \frac{1}{\epsilon })\) complexity. For example, both SVRG and AdaSVRG require computing the full gradient in every outer-loop, and will thus unavoidably suffer an \(\Omega (n)\) cost. For large n, typical VR methods will thus necessarily be slower than SGD when training models that can exactly interpolate the data. This provides a partial explanation for the ineffectiveness of VR methods when training over-parameterized models. When \(\zeta ^2 > 0\), AdaGrad has an \(O(1/\epsilon + \zeta /\epsilon ^2)\) rate (Vaswani et al., 2020). Here, \(\zeta\), the violation of interpolation, plays the role of noise and slows down the convergence to an \(O(1/\epsilon ^2)\) rate. On the other hand, AdaSVRG results in an \(\tilde{O}(n + 1/\epsilon )\) rate, regardless of \(\zeta\).

Following the reasoning in Sect. 4, if an algorithm can detect the slower convergence of AdaGrad and switch from AdaGrad to AdaSVRG, it can attain a faster convergence rate. It is straightforward to show that AdaGrad has a phase transition similar to that in Theorem 4 when interpolation is only approximately satisfied. This enables the use of the test in Sect. 4 to terminate AdaGrad and switch to AdaSVRG, resulting in the hybrid algorithm described in Algorithm 4. If the diagnostic test can detect the phase transition accurately, Algorithm 4 will attain an \(O(1/\epsilon )\) convergence when interpolation is exactly satisfied (no switching in this case). When interpolation is only approximately satisfied, it will result in an \(O(1/\epsilon )\) convergence for \(\epsilon \ge \zeta\) (corresponding to the AdaGrad rate in the deterministic phase) and will attain an \(O(1/\zeta ^2 + (n + 1/\epsilon ) \log (\zeta /\epsilon ))\) convergence thereafter (corresponding to the AdaSVRG rate). This implies that Algorithm 4 can indeed obtain the best of both worlds between AdaGrad and AdaSVRG.

6.1 Evaluating Algorithm 4

We use synthetic experiments to demonstrate the effect of interpolation on the convergence of stochastic and VR methods. Following the protocol in Meng et al. (2020), we generate a linearly separable dataset with \(n=10^4\) data points of dimension \(d=200\) and train a linear model with a convex loss. This setup ensures that interpolation is satisfied, while eliminating confounding factors such as non-convexity and other implementation details. In order to smoothly violate interpolation, we mislabel a fraction of the points and show results for mislabel fractions in the grid [0, 0.1, 0.2].
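The data-generation step can be sketched as below. This is an assumed, simplified construction rather than the exact protocol of Meng et al. (2020), which may differ, for instance by enforcing a margin. The function name `make_dataset` and the use of Gaussian features with a random separating direction are illustrative choices, and the defaults are smaller than the \(n=10^4\), \(d=200\) used in the experiments so the example runs quickly.

```python
import random

def make_dataset(n=1000, d=20, mislabel_fraction=0.0, seed=0):
    """Linearly separable binary data with a chosen fraction of flipped labels.

    Returns features X, (possibly corrupted) labels y, and the clean labels.
    """
    rng = random.Random(seed)
    w_true = [rng.gauss(0.0, 1.0) for _ in range(d)]       # separating direction
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    clean = [1 if sum(a * b for a, b in zip(x, w_true)) >= 0 else -1 for x in X]
    k = round(mislabel_fraction * n)                       # number of labels to flip
    flipped = set(rng.sample(range(n), k))
    y = [-c if i in flipped else c for i, c in enumerate(clean)]
    return X, y, clean

# One dataset per mislabel fraction in the grid from the text.
datasets = {frac: make_dataset(mislabel_fraction=frac) for frac in [0.0, 0.1, 0.2]}
```

With `mislabel_fraction=0.0` the labels are separable by `w_true`, so a linear model can fit them exactly; flipping labels breaks separability and hence interpolation.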

We use AdaGrad as a representative (fully) stochastic method, and to eliminate possible confounding because of its step-size, we set it using the stochastic line-search procedure (Vaswani et al., 2020). We compare the performance of AdaGrad, SVRG, AdaSVRG and the hybrid AdaGrad-AdaSVRG (Algorithm 4), each with a budget of 50 epochs (passes over the data). For SVRG, as before, we choose the best step-size via a grid-search. For AdaSVRG, we use the fixed-size inner-loop variant and the step-size heuristic described earlier. In order to evaluate the quality of the “switching” metric in Algorithm 4, we compare against a hybrid method referred to as “Optimal Manual Switching” in the plots. This method runs a grid-search over switching points (switching after epoch \(\{1, 2, \ldots , 50\}\)) and chooses the point that results in the minimum loss after 50 epochs.
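The “Optimal Manual Switching” baseline amounts to a brute-force search over the switching epoch. The sketch below is a hedged illustration: `optimal_manual_switch`, `run_first` and `run_second` are hypothetical names, and the two toy "optimizers" (one that makes fast progress and then stalls, one that makes steady progress) merely stand in for one epoch of AdaGrad and AdaSVRG.

```python
def optimal_manual_switch(run_first, run_second, loss, w0, budget):
    """Try every switching epoch in {1, ..., budget} and keep the best one.

    run_first / run_second map an iterate to the iterate after one epoch.
    Switching after epoch `budget` means the second method is never run.
    """
    best_epoch, best_loss = None, float("inf")
    for switch_epoch in range(1, budget + 1):
        w = w0
        for _ in range(switch_epoch):             # first method until the switch
            w = run_first(w)
        for _ in range(budget - switch_epoch):    # second method for the remainder
            w = run_second(w)
        final = loss(w)
        if final < best_loss:
            best_epoch, best_loss = switch_epoch, final
    return best_epoch, best_loss

# Toy stand-ins on f(w) = w^2 (assumptions, not the actual algorithms).
def stalling_epoch(w):
    """Fast progress that stalls once |w| <= 0.1."""
    return 0.5 * w if abs(w) > 0.1 else w

def steady_epoch(w):
    """Slower but steady progress."""
    return 0.8 * w

best_epoch, best_loss = optimal_manual_switch(
    stalling_epoch, steady_epoch, lambda w: w * w, w0=1.0, budget=10
)
```

In this toy run the best choice is to switch right after the first method stalls. The search re-runs the full budget for every candidate, so it serves as an oracle baseline rather than a practical method.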

In Fig. 3, we plot the results for the logistic loss using a batch-size of 64 (refer to Figs. 15a–c, 16a–d in Appendix “Additional experiments on adaptivity to over-parameterization” for other losses and batch-sizes). We observe that (i) when interpolation is exactly satisfied (no mislabeling), AdaGrad results in superior performance over SVRG and AdaSVRG, confirming the theory in Sect. 6. In this case, both the optimal manual switching and Algorithm 4 do not switch; (ii) when interpolation is not exactly satisfied (with \(10\%, 20\%\) mislabeling), the AdaGrad progress slows down to a stall in a neighbourhood of the solution, whereas both SVRG and AdaSVRG converge to the solution; (iii) in both cases, Algorithm 4 detects the slowdown in AdaGrad and switches to AdaSVRG, resulting in competitive performance with the optimal manual switching. For all three datasets, Algorithm 4 matches or out-performs the better of AdaGrad and AdaSVRG, showing that it can achieve the best-of-both-worlds.

Fig. 3: Comparison of AdaGrad, AdaSVRG and Algorithm 4 (denoted “Hybrid” in the plots) with logistic loss and batch-size 64 on datasets with different fractions of mislabeled data-points. Interpolation is exactly satisfied for the left-most plot.

7 Discussion

Although there have been numerous papers on VR methods in the past ten years, all of the provably convergent methods require knowledge of problem-dependent constants such as L. On the other hand, there has been substantial progress in designing adaptive gradient methods that have effectively replaced SGD for training ML models. Unfortunately, this progress has not been leveraged for developing better VR methods. Our work is the first to marry these lines of literature by designing AdaSVRG, which achieves a gradient complexity comparable to typical VR methods, but without needing to know the objective’s smoothness constant. Our results illustrate that it is possible to design principled techniques that can “painlessly” reduce the variance, achieving good theoretical and practical performance. We believe that our paper will help open up an exciting research direction. In the future, we aim to extend our theory to the strongly-convex setting.