Appendix A: Details About Deep Learning Experiments
In addition to the description given in Sect. 5.1, we provide in Table 1 a summary of each problem considered.
Table 1 Setting of the four different deep learning experiments
In the DL experiments of Sect. 5, we display the training error and the test accuracy of each algorithm as a function of the number of stochastic gradient estimates computed. Due to their adaptive procedures, ADAM, RMSprop and Step-Tuned SGD have additional sub-routines compared to SGD. Thus, in Table 2 we additionally provide the wall-clock time per epoch of these methods relative to SGD. Unlike the number of back-propagations performed, wall-clock time depends on many factors: the networks and datasets considered, the computer used, and, most importantly, the implementation. Regarding implementation, we emphasize that we used the versions of SGD, ADAM and RMSprop provided in PyTorch, which are fully optimized (and in particular parallelized). Table 2 indicates that Step-Tuned SGD is slower than the other adaptive methods for large networks, but this is due to our non-parallel implementation. Indeed, on small networks (where the benefit of parallel computing is small), running Step-Tuned SGD for one epoch is actually faster than running SGD. In conclusion, the number of back-propagations is a more suitable metric for comparing the algorithms, and all methods considered require a single back-propagation per iteration.
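To make the comparison concrete, the snippet below is a minimal sketch of how one can measure wall-clock time per epoch for the PyTorch optimizers mentioned above. It is not the benchmarking code used for Table 2: the model, data, batch size and learning rates are placeholders, and Step-Tuned SGD is not included since its implementation is not reproduced here.

import time
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model; the actual networks and datasets are those of Table 1.
X = torch.randn(10_000, 32)
y = torch.randint(0, 10, (10_000,))
loader = DataLoader(TensorDataset(X, y), batch_size=128)

def epoch_time(optimizer_factory):
    # Build a small placeholder model and time one full pass over the data.
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
    optimizer = optimizer_factory(model.parameters())
    loss_fn = nn.CrossEntropyLoss()
    start = time.perf_counter()
    for xb, yb in loader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()  # one back-propagation per iteration
        optimizer.step()
    return time.perf_counter() - start

t_sgd = epoch_time(lambda p: torch.optim.SGD(p, lr=0.01))
for name, factory in [("ADAM", lambda p: torch.optim.Adam(p, lr=1e-3)),
                      ("RMSprop", lambda p: torch.optim.RMSprop(p, lr=1e-3))]:
    print(name, "wall-clock time per epoch relative to SGD:", epoch_time(factory) / t_sgd)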
Table 2 Relative wall-clock time per epoch compared to SGD
Appendix B: Proof of the Theoretical Results
We state a lemma that we will use to prove Theorem 1.
Preliminary Lemma
The result is the following.
Lemma 1
( [1, Proposition 2]) Let \((u_k)_{k\in {\mathbb {N}}}\) and \((v_k)_{k\in {\mathbb {N}}}\) be two non-negative real sequences. Assume that \(\sum _{k=0}^{+\infty } u_k v_k <+\infty \) and \(\sum _{k=0}^{+\infty } v_k =+\infty \). If there exists a constant \(C>0\) such that \(\vert u_{k+1} - u_k \vert \le C v_k\) for all \(k\in {\mathbb {N}}\), then \(u_k\xrightarrow [k\rightarrow +\infty ]{}0\).
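The third condition is not superfluous; the following standard counterexample (not taken from [1]) shows that the first two conditions alone do not suffice:
$$\begin{aligned} v_k = \frac{1}{k+1} \quad \text {for all } k\in {\mathbb {N}}, \qquad u_k = 1 \text { if } k \text { is a power of } 2, \qquad u_k = 0 \text { otherwise.} \end{aligned}$$
Then \(\sum _{k=0}^{+\infty } u_k v_k \le \sum _{j=0}^{+\infty } 2^{-j} <+\infty \) and \(\sum _{k=0}^{+\infty } v_k =+\infty \), yet \(u_k\) does not converge to 0; accordingly, \(\vert u_{k+1}-u_k\vert \le C v_k\) fails for every constant \(C>0\).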
Proof of the main theorem
We can now prove Theorem 1.
Proof of Theorem 1
We first clarify the random process induced by the draw of the mini-batches. Algorithm 2 takes a sequence of mini-batches as input. This sequence is represented by the random variables \(({\mathsf {B}}_k)_{k\in {\mathbb {N}}}\) as described in Sect. 3.2. Each of these random variables is independent of the others. In particular, for \(k\in {\mathbb {N}}_{>0}\), \({\mathsf {B}}_k\) is independent of the previous mini-batches \({\mathsf {B}}_0,\ldots , {\mathsf {B}}_{k-1}\). For convenience, we will denote by \(\underline{{\mathsf {B}}}_k = \left\{ {\mathsf {B}}_0,\ldots ,{\mathsf {B}}_k\right\} \) the mini-batches up to iteration k. Due to the randomness of the mini-batches, the algorithm is a random process as well. As such, \(\theta _{k}\) is a random variable with a deterministic dependence on \(\underline{{\mathsf {B}}}_{k-1}\), and it is independent of \({\mathsf {B}}_k\). However, \(\theta _{k+\frac{1}{2}}\) and \({\mathsf {B}}_{k}\) are not independent. Similarly, we constructed \(\gamma _k\) so that it is a random variable with a deterministic dependence on \(\underline{{\mathsf {B}}}_{k-1}\), hence independent of \({\mathsf {B}}_k\). This dependency structure will be crucial for deriving and bounding conditional expectations. Finally, we highlight the following important identity: for any \(k\in {\mathbb {N}}_{>0}\),
$$\begin{aligned} {\mathbb {E}}\left[ \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] = \nabla {\mathcal {J}}(\theta _k). \end{aligned}$$
(25)
Indeed, the iterate \(\theta _{k}\) is a deterministic function of \(\underline{{\mathsf {B}}}_{k-1}\), so taking the expectation over \({\mathsf {B}}_k\), which is independent of \(\underline{{\mathsf {B}}}_{k-1}\), we recover the full gradient of \({\mathcal {J}}\), since the distribution of \({\mathsf {B}}_k\) is the same as that of \({\mathsf {S}}\) in Sect. 3.2. Notice in addition that a similar identity does not hold for \(\theta _{k+\frac{1}{2}}\) (since it depends on \({\mathsf {B}}_k\)).
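For instance, in the common case where \({\mathsf {S}}\) is drawn uniformly among the mini-batches of a fixed size \(b\) (one possible instance of the sampling scheme of Sect. 3.2, assumed here only for illustration), writing \(\nabla {\mathcal {J}}_{\mathsf {S}}\) as the average of the \(\nabla {\mathcal {J}}_n\) over \(n\in {\mathsf {S}}\), the identity reduces, for any fixed \(\theta \in {\mathbb {R}}^P\), to
$$\begin{aligned} {\mathbb {E}}\left[ \nabla {\mathcal {J}}_{{\mathsf {S}}}(\theta )\right] = {\mathbb {E}}\left[ \frac{1}{b}\sum _{n\in {\mathsf {S}}} \nabla {\mathcal {J}}_n(\theta )\right] = \frac{1}{b}\sum _{n=1}^{N} {\mathbb {P}}\left( n\in {\mathsf {S}}\right) \nabla {\mathcal {J}}_n(\theta ) = \frac{1}{b}\sum _{n=1}^{N} \frac{b}{N}\, \nabla {\mathcal {J}}_n(\theta ) = \nabla {\mathcal {J}}(\theta ). \end{aligned}$$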
We now provide estimates that will be used extensively in the rest of the proof. The gradient of the loss function \(\nabla {\mathcal {J}}\) is locally Lipschitz continuous as \({\mathcal {J}}\) is twice continuously differentiable. By assumption, there exists a compact convex set \({\mathsf {C}}\subset {\mathbb {R}}^P\), such that with probability 1, the sequence of iterates \((\theta _{k})_{k\in \frac{1}{2}{\mathbb {N}}}\) belongs to \({\mathsf {C}}\). Therefore, by local Lipschitz continuity, the restriction of \(\nabla {\mathcal {J}}\) to \({\mathsf {C}}\) is Lipschitz continuous on \({\mathsf {C}}\). Similarly, each \(\nabla {\mathcal {J}}_n\) is also Lipschitz continuous on \({\mathsf {C}}\). We denote by \(L>0\) a Lipschitz constant common to each \(\nabla {\mathcal {J}}_n\), \(n=1,\ldots , N\). Notice that the Lipschitz continuity is preserved by averaging, in other words,
$$\begin{aligned} \forall {\mathsf {B}}\subseteq \left\{ 1,\ldots ,N\right\} ,\forall \psi _1,\psi _2\in {\mathsf {C}}, \quad \Vert \nabla {\mathcal {J}}_{\mathsf {B}}(\psi _1) -\nabla {\mathcal {J}}_{\mathsf {B}}(\psi _2) \Vert \le L\Vert \psi _1-\psi _2\Vert . \end{aligned}$$
(26)
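Indeed, \(\nabla {\mathcal {J}}_{\mathsf {B}}\) is an average of the \(\nabla {\mathcal {J}}_n\) for \(n\in {\mathsf {B}}\), so the triangle inequality gives
$$\begin{aligned} \Vert \nabla {\mathcal {J}}_{\mathsf {B}}(\psi _1) -\nabla {\mathcal {J}}_{\mathsf {B}}(\psi _2) \Vert = \bigg \Vert \frac{1}{\vert {\mathsf {B}}\vert }\sum _{n\in {\mathsf {B}}}\big ( \nabla {\mathcal {J}}_n(\psi _1)-\nabla {\mathcal {J}}_n(\psi _2)\big ) \bigg \Vert \le \frac{1}{\vert {\mathsf {B}}\vert }\sum _{n\in {\mathsf {B}}} L\Vert \psi _1-\psi _2\Vert = L\Vert \psi _1-\psi _2\Vert . \end{aligned}$$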
In addition, using the continuity of the \(\nabla {\mathcal {J}}_n\)’s and the compactness of \({\mathsf {C}}\), there exists a constant \(C_2>0\) such that,
$$\begin{aligned} \forall {\mathsf {B}}\subseteq \left\{ 1,\ldots ,N\right\} ,\forall \psi \in {\mathsf {C}}, \quad \Vert \nabla {\mathcal {J}}_{\mathsf {B}}(\psi )\Vert \le C_2. \end{aligned}$$
(27)
Finally, for a function \(g:{\mathbb {R}}^P\rightarrow {\mathbb {R}}\) with L-Lipschitz continuous gradient, we recall the following inequality, called the descent lemma (see for example [7, Proposition A.24]): for any \(\theta \in {\mathbb {R}}^P\) and any \(d\in {\mathbb {R}}^P\),
$$\begin{aligned} g(\theta +d) \le g(\theta ) + \langle \nabla g(\theta ), d\rangle + \frac{L}{2}\Vert d \Vert ^2. \end{aligned}$$
(28)
In our case, since we only have the L-Lipschitz continuity of \(\nabla {\mathcal {J}}\) on the convex set \({\mathsf {C}}\), we have a similar bound on \({\mathsf {C}}\): for any \(\theta \in {\mathsf {C}}\) and any \(d\in {\mathbb {R}}^P\) such that \(\theta +d\in {\mathsf {C}}\),
$$\begin{aligned} {\mathcal {J}}(\theta +d) \le {\mathcal {J}}(\theta ) + \langle \nabla {\mathcal {J}}(\theta ), d\rangle + \frac{L}{2}\Vert d \Vert ^2. \end{aligned}$$
(29)
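For completeness, (29) can be obtained from the fundamental theorem of calculus: by convexity, the segment between \(\theta \) and \(\theta +d\) stays in \({\mathsf {C}}\), so the L-Lipschitz continuity of \(\nabla {\mathcal {J}}\) on \({\mathsf {C}}\) applies along it, and
$$\begin{aligned} \begin{aligned} {\mathcal {J}}(\theta +d) - {\mathcal {J}}(\theta )&= \int _0^1 \langle \nabla {\mathcal {J}}(\theta +t d), d\rangle \, {\mathrm {d}}t = \langle \nabla {\mathcal {J}}(\theta ), d\rangle + \int _0^1 \langle \nabla {\mathcal {J}}(\theta +t d)-\nabla {\mathcal {J}}(\theta ), d\rangle \, {\mathrm {d}}t \\&\le \langle \nabla {\mathcal {J}}(\theta ), d\rangle + \int _0^1 L t \Vert d\Vert ^2 \, {\mathrm {d}}t = \langle \nabla {\mathcal {J}}(\theta ), d\rangle + \frac{L}{2}\Vert d \Vert ^2. \end{aligned} \end{aligned}$$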
Let \(\theta _0\in {\mathbb {R}}^P\) and let \((\theta _{k})_{k\in \frac{1}{2}{\mathbb {N}}}\) be a sequence generated by Algorithm 2 initialized at \(\theta _0\). By assumption this sequence belongs to \({\mathsf {C}}\) almost surely. To simplify, for \(k\in {\mathbb {N}}\), we denote \(\eta _k = \alpha \gamma _k (k+1)^{-(1/2+\delta )}\). Fix an iteration \(k\in {\mathbb {N}}\); we can use (29) with \(\theta = \theta _k\) and \(d = -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)\) to obtain, almost surely (with respect to the boundedness assumption),
$$\begin{aligned} {\mathcal {J}}(\theta _{k+\frac{1}{2}}) \le {\mathcal {J}}(\theta _k) - \eta _k \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k) \rangle + \frac{\eta _k^2}{2}L \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)\Vert ^2. \end{aligned}$$
(30)
Similarly with \(\theta = \theta _{k+\frac{1}{2}}\) and \(d = -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\), almost surely,
$$\begin{aligned} {\mathcal {J}}(\theta _{k+1}) \le {\mathcal {J}}(\theta _{k+\frac{1}{2}}) - \eta _k \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle + \frac{\eta _k^2}{2}L \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\Vert ^2. \end{aligned}$$
(31)
We combine (30) and (31), almost surely,
$$\begin{aligned} \begin{aligned}&{\mathcal {J}}(\theta _{k+1}) \le {\mathcal {J}}(\theta _{k}) - \eta _k \left( \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\rangle + \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \right) \\&\quad + \frac{\eta _k^2}{2}L \left( \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)\Vert ^2+ \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\Vert ^2\right) . \end{aligned} \end{aligned}$$
(32)
Using the boundedness assumption and (27) (enlarging \(C_2\) if necessary so that it also bounds the squared norms), we have, almost surely,
$$\begin{aligned} \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)\Vert ^2 \le C_2 \quad \text {and}\quad \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\Vert ^2 \le C_2. \end{aligned}$$
(33)
So almost surely,
$$\begin{aligned} \begin{aligned} {\mathcal {J}}(\theta _{k+1}) \le {\mathcal {J}}(\theta _{k})&- \eta _k \left( \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\rangle + \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \right) \\ {}&+ \eta _k^2L C_2. \end{aligned} \end{aligned}$$
(34)
Then, taking in (34) the conditional expectation with respect to \({\mathsf {B}}_k\), conditionally on \(\underline{{\mathsf {B}}}_{k-1}\) (the mini-batches used up to iteration \(k-1\)), we have,
$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \le \;&{\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] - {\mathbb {E}}\left[ \eta _k \left( \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\rangle + \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \right) \,\Big |\, \underline{{\mathsf {B}}}_{k-1}\right] \\&+ {\mathbb {E}}\left[ \eta _k^2 \,|\, \underline{{\mathsf {B}}}_{k-1}\right] L C_2. \end{aligned} \end{aligned}$$
(35)
As explained at the beginning of the proof, \(\theta _{k}\) is a deterministic function of \(\underline{{\mathsf {B}}}_{k-1}\), thus \({\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] = {\mathcal {J}}(\theta _{k})\). Similarly, by construction \(\eta _k\) is independent of the current mini-batch \({\mathsf {B}}_k\): it is a deterministic function of \(\underline{{\mathsf {B}}}_{k-1}\). Hence, (35) reads,
$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \le \;&{\mathcal {J}}(\theta _{k}) - \eta _k \left( {\mathbb {E}}\left[ \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\rangle \,|\, \underline{{\mathsf {B}}}_{k-1}\right] + {\mathbb {E}}\left[ \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \right) \\&+ \eta _k^2 L C_2. \end{aligned} \end{aligned}$$
(36)
Then, we use the fact that, by (25), \({\mathbb {E}}\left[ \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\rangle \,|\, \underline{{\mathsf {B}}}_{k-1}\right] = \langle \nabla {\mathcal {J}}(\theta _k), {\mathbb {E}}\left[ \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \rangle = \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2\). Overall, we obtain,
$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \le {\mathcal {J}}(\theta _{k}) - \eta _k \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2 + \eta _k^2 L C_2 - \eta _k {\mathbb {E}}\left[ \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \,|\, \underline{{\mathsf {B}}}_{k-1}\right] . \end{aligned}$$
(37)
We will now bound the last term of (37). First we write,
$$\begin{aligned} \begin{aligned}&-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \\&\quad =-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}})- \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k})\rangle - \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle . \end{aligned} \end{aligned}$$
(38)
Using the Cauchy-Schwarz inequality, as well as (26) and (27), almost surely,
$$\begin{aligned} \begin{aligned} |\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}})- \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k})\rangle |&\le \Vert \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}) \Vert \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}})- \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k})\Vert \\&\le \Vert \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}) \Vert L\Vert \theta _{k+\frac{1}{2}}-\theta _{k}\Vert \\&\le \Vert \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}) \Vert L\Vert -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_k}(\theta _{k})\Vert \\&\le LC_2^2\eta _k. \end{aligned} \end{aligned}$$
(39)
Hence,
$$\begin{aligned} \begin{aligned}&-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \le LC_2^2\eta _k - \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle . \end{aligned} \end{aligned}$$
(40)
We perform similar computations on the last term of (40), almost surely,
$$\begin{aligned} \begin{aligned}&-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle \\&= -\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}})-\nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle \\&\le \Vert \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}})-\nabla {\mathcal {J}}(\theta _{k})\Vert \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \Vert - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle \\&\le LC_2\Vert \theta _{k+\frac{1}{2}}-\theta _{k}\Vert - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle \\&\le LC_2^2\eta _k - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle . \end{aligned} \end{aligned}$$
(41)
Finally, combining (38), (40) and (41), we obtain, almost surely,
$$\begin{aligned} \begin{aligned}&-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \le 2LC_2^2\eta _k - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle . \end{aligned} \end{aligned}$$
(42)
Going back to the last term of (37), taking the conditional expectation of (42) and using (25), we have, almost surely,
$$\begin{aligned} -{\mathbb {E}}\left[ \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \le 2LC_2^2\eta _k - {\mathbb {E}}\left[ \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle \,|\, \underline{{\mathsf {B}}}_{k-1}\right] = 2LC_2^2\eta _k - \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2. \end{aligned}$$
(43)
In the end we obtain, for an arbitrary iteration \(k\in {\mathbb {N}}\), almost surely,
$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \le {\mathcal {J}}(\theta _{k}) - 2\eta _k \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2 + \eta _k^2 L \left( C_2 + 2 C_2^2\right) . \end{aligned}$$
(44)
To simplify we assume that \({{\tilde{M}}}\ge \nu \) (otherwise set \({\tilde{M}} = \max ({{\tilde{M}}},\nu )\)). We use the fact that \(\eta _k\in [\frac{\alpha \tilde{m}}{(k+1)^{1/2+\delta }},\frac{\alpha {{\tilde{M}}}}{(k+1)^{1/2+\delta }}]\) to obtain, almost surely,
$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \le {\mathcal {J}}(\theta _{k}) - 2\frac{\alpha {\tilde{m}}}{(k+1)^{1/2+\delta }} \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2 + \frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }} L\left( C_2+2C_2^2\right) . \end{aligned}$$
(45)
Since, by assumption, the last term of (45) is summable over \(k\), we can now invoke the Robbins-Siegmund convergence theorem [37] to obtain that, almost surely, \(({\mathcal {J}}(\theta _{k}))_{k\in {\mathbb {N}}}\) converges and,
$$\begin{aligned} \sum _{k=0}^{+\infty }\frac{1}{(k+1)^{1/2+\delta }}\Vert \nabla {\mathcal {J}}(\theta _k) \Vert ^2 < + \infty . \end{aligned}$$
(46)
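For the reader's convenience, we recall the special case of the Robbins-Siegmund theorem used here (see [37]): if \((X_k)_{k\in {\mathbb {N}}}\), \((Y_k)_{k\in {\mathbb {N}}}\) and \((b_k)_{k\in {\mathbb {N}}}\) are non-negative random variables such that \(X_k\), \(Y_k\) and \(b_k\) are deterministic functions of \(\underline{{\mathsf {B}}}_{k-1}\), \(\sum _{k=0}^{+\infty } b_k<+\infty \) almost surely, and, for all \(k\in {\mathbb {N}}\),
$$\begin{aligned} {\mathbb {E}}\left[ X_{k+1} \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \le X_k - Y_k + b_k, \end{aligned}$$
then, almost surely, \((X_k)_{k\in {\mathbb {N}}}\) converges and \(\sum _{k=0}^{+\infty } Y_k<+\infty \). It can be applied to (45) with, for instance, \(X_k = {\mathcal {J}}(\theta _k)-\inf _{{\mathsf {C}}}{\mathcal {J}}\ge 0\), \(Y_k = 2\alpha {\tilde{m}}(k+1)^{-(1/2+\delta )}\Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2\) and \(b_k = \alpha ^2{\tilde{M}}^2 L(C_2+2C_2^2)(k+1)^{-(1+2\delta )}\).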
Since \(\sum _{k=0}^{+\infty }\frac{1}{(k+1)^{1/2+\delta }}=+\infty \), this implies at least that almost surely,
$$\begin{aligned} \liminf _{k\rightarrow \infty }\Vert \nabla {\mathcal {J}}(\theta _k) \Vert ^2=0. \end{aligned}$$
(47)
To prove that in addition \(\displaystyle \lim _{k\rightarrow \infty }\Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2 = 0\), we will use Lemma 1 with \(u_k = \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\) and \(v_k = \frac{1}{(k+1)^{1/2+\delta }}\), for all \(k\in {\mathbb {N}}\). So we need to prove that there exists \(C_3>0\) such that \(\vert u_{k+1} - u_k\vert \le C_3 v_k\). To do so, we use the L-Lipschitz continuity of the gradients on \({\mathsf {C}}\), triangle inequalities and (27). It holds, almost surely, for all \(k \in {\mathbb {N}}\)
$$\begin{aligned}&\left| \Vert \nabla {\mathcal {J}}(\theta _{k+1})\Vert ^2-\Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right| \nonumber \\&\quad = \left( \Vert \nabla {\mathcal {J}}(\theta _{k+1})\Vert + \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert \right) \times \left| \Vert \nabla {\mathcal {J}}(\theta _{k+1})\Vert - \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert \right| \nonumber \\&\quad \le 2C_2 \left| \Vert \nabla {\mathcal {J}}(\theta _{k+1})\Vert - \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert \right| \nonumber \\&\quad \le 2C_2 \Vert \nabla {\mathcal {J}}(\theta _{k+1})-\nabla {\mathcal {J}}(\theta _{k})\Vert \nonumber \\&\quad \le 2C_2 L \Vert \theta _{k+1}-\theta _{k}\Vert \\&\quad \le 2C_2 L \left\| -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k) -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\right\| \nonumber \\&\quad \le 2C_2 L\frac{\alpha {\tilde{M}}}{(k+1)^{1/2+\delta }} \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)+\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\Vert \nonumber \\&\quad \le 4C_2^2 L\frac{\alpha {\tilde{M}}}{(k+1)^{1/2+\delta }}.\nonumber \end{aligned}$$
(48)
So taking \(C_3 =4C_2^2 L\alpha {\tilde{M}} \), by Lemma 1, almost surely, \(\lim _{k\rightarrow +\infty } \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2=0\). This concludes the almost sure convergence proof.
As for the rate, consider the expectation of (45) (with respect to the random variables \(({\mathsf {B}}_k)_{k\in {\mathbb {N}}}\)). The tower property of the conditional expectation gives \( {\mathbb {E}}[{\mathbb {E}}[{\mathcal {J}}(\theta _{k+1})|\underline{{\mathsf {B}}}_{k-1}]]={\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\right] \), so we obtain, for all \(k\in {\mathbb {N}}\),
$$\begin{aligned} \begin{aligned} 2\frac{\alpha {\tilde{m}}}{(k+1)^{1/2+\delta }}{\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2\right] \le&{\mathbb {E}}\left[ {\mathcal {J}}(\theta _k)\right] - {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\right] + \frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L(C_2+2C_2^2). \end{aligned} \end{aligned}$$
(49)
Then for \(K\ge 1\), we sum from 0 to \(K-1\),
$$\begin{aligned} \begin{aligned} \sum _{k=0}^{K-1}2\frac{\alpha {\tilde{m}}}{(k+1)^{1/2+\delta }}&{\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2\right] \\&\le \sum _{k=0}^{K-1}{\mathbb {E}}\left[ {\mathcal {J}}(\theta _k)\right] -\sum _{k=0}^{K-1} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\right] + \sum _{k=0}^{K-1} \frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L(C_2+2C_2^2)\\&={\mathcal {J}}(\theta _0) - {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{K})\right] + \sum _{k=0}^{K-1}\frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L(C_2+2C_2^2)\\&\le {\mathcal {J}}(\theta _0) - \inf _{\psi \in {\mathbb {R}}^P}{\mathcal {J}}(\psi ) + \sum _{k=0}^{K-1}\frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L(C_2+2C_2^2). \end{aligned} \end{aligned}$$
(50)
The right-hand side of (50) is finite, so there is a constant \(C_4>0\) such that, for any \(K\in {\mathbb {N}}\), it holds,
$$\begin{aligned} C_4\ge \sum _{k=0}^K \frac{1}{(k+1)^{1/2+\delta }} {\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right]&\ge \min _{k\in \left\{ 1,\ldots ,K\right\} }{\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right] \sum _{k=0}^K \frac{1}{(k+1)^{1/2+\delta }} \nonumber \\&\ge \left( K+1\right) ^{1/2-\delta }\min _{k\in \left\{ 1,\ldots ,K\right\} } {\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right] , \end{aligned}$$
(51)
where the last inequality uses that the sum contains \(K+1\) terms, each at least \((K+1)^{-(1/2+\delta )}\). This yields the announced rate. \(\square \)
Proof of the Corollary
Before proving the corollary we recall the following result.
Lemma 2
Let \(g:{\mathbb {R}}^P\rightarrow {\mathbb {R}}\) be an L-Lipschitz continuous and differentiable function. Then \(\nabla g\) is uniformly bounded on \({\mathbb {R}}^P\).
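For completeness, this can be seen from a one-line computation: for any \(\theta \in {\mathbb {R}}^P\),
$$\begin{aligned} \Vert \nabla g(\theta )\Vert ^2 = \langle \nabla g(\theta ), \nabla g(\theta )\rangle = \lim _{t\rightarrow 0^+}\frac{g\left( \theta + t\nabla g(\theta )\right) - g(\theta )}{t} \le \lim _{t\rightarrow 0^+}\frac{L\, t\, \Vert \nabla g(\theta )\Vert }{t} = L \Vert \nabla g(\theta )\Vert , \end{aligned}$$
so that \(\Vert \nabla g(\theta )\Vert \le L\) for every \(\theta \in {\mathbb {R}}^P\).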
We can now prove the corollary.
Proof of Corollary 1
The proof is very similar to that of Theorem 1. Denote by L the Lipschitz constant of \(\nabla {\mathcal {J}}\). Then, the descent lemma (30) holds surely (and not only almost surely). Furthermore, since each \({\mathcal {J}}_n\), \(n\in \{1,\ldots ,N\}\), is Lipschitz continuous, so is \({\mathcal {J}}\), and by Lemma 2, \(\nabla {\mathcal {J}}\) is uniformly bounded. This is enough to obtain (45). Similarly, at iteration \(k\in {\mathbb {N}}\), \({\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\Vert \right] \) is also uniformly bounded. Overall, these arguments allow us to follow the lines of the proof of Theorem 1, and the same conclusions hold. \(\square \)
Appendix C: Details on the Synthetic Experiments
We detail the non-convex regression problem presented in Figs. 2 and 3. Given a matrix \(A\in {\mathbb {R}}^{N \times P}\) and a vector \(b\in {\mathbb {R}}^N\), denote by \(A_n\) the n-th row of A. The problem consists in minimizing a loss function of the form,
$$\begin{aligned} \theta \in {\mathbb {R}}^P\mapsto {\mathcal {J}}(\theta ) = \frac{1}{N}\sum _{n=1}^{N} \phi (A_n^T\theta -b_n), \end{aligned}$$
(52)
where the non-convexity comes from the function \(t\in {\mathbb {R}}\mapsto \phi (t) = t^2/(1+t^2)\). For more details on the initialization of A and b we refer to [10], where this problem was originally proposed. In the experiments of Fig. 3, the mini-batch approximation was made by selecting a subset of the rows of A, which amounts to computing only a few terms of the full sum in (52). We used \(N=500\), \(P=30\) and mini-batches of size 50.
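To make the setup concrete, the following minimal NumPy sketch implements the objective (52) and a mini-batch gradient estimate. The random initialization of A and b below is only a placeholder (the actual initialization follows [10]), and the plain SGD step at the end, with an arbitrary step-size, is merely illustrative.

import numpy as np

rng = np.random.default_rng(0)
N, P, batch_size = 500, 30, 50

# Placeholder data: the actual initialization of A and b follows [10].
A = rng.standard_normal((N, P))
b = rng.standard_normal(N)

def phi(t):
    # Non-convex scalar loss: phi(t) = t^2 / (1 + t^2).
    return t**2 / (1.0 + t**2)

def loss(theta, idx=None):
    # Full objective (52) if idx is None, mini-batch approximation otherwise.
    rows = slice(None) if idx is None else idx
    r = A[rows] @ theta - b[rows]
    return phi(r).mean()

def minibatch_grad(theta, idx):
    # Gradient of the mini-batch loss, using phi'(t) = 2t / (1 + t^2)^2.
    r = A[idx] @ theta - b[idx]
    return (2.0 * r / (1.0 + r**2) ** 2) @ A[idx] / len(idx)

theta = np.zeros(P)
idx = rng.choice(N, size=batch_size, replace=False)  # one mini-batch of rows of A
theta -= 0.1 * minibatch_grad(theta, idx)            # one plain SGD step
print(loss(theta))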
In the deterministic setting, we ran each algorithm for 250 iterations and selected the hyper-parameters of each algorithm so that it achieved \(\vert {\mathcal {J}}(\theta )-{\mathcal {J}}^\star \vert <10^{-1}\) as fast as possible. In the mini-batch experiments, we ran each algorithm for 250 epochs and selected the hyper-parameters that yielded the smallest value of \({\mathcal {J}}(\theta )\) after 50 epochs.
Appendix D: Description of Auxiliary Algorithms
We now describe the heuristic algorithms used in Fig. 3 and discussed in Sect. 3.3. Note that the step-size in Algorithm 5 is equivalent to that of Expected-GV but is written differently to avoid storing an additional gradient estimate.