Appendix
1.1 A SAGA variance bound
We define the SAGA gradient estimators \(\widetilde{\nabla }_x(u_k,y_k)\) and \(\widetilde{\nabla }_y(x_{k+1},v_k)\) as follows:
$$\begin{aligned} \widetilde{\nabla }_x(u_k,y_k)=&\frac{1}{b}\sum _{i\in I_k^x}\left( \nabla _xH_i(u_k,y_k)- \nabla _xH_i(\varphi _{k}^{i},y_{k}) \right) + \frac{1}{n}\sum _{j=1}^n\nabla _xH_j(\varphi _{k}^{j},y_{k}), \\ \widetilde{\nabla }_y(x_{k+1},v_k)=&\frac{1}{b}\sum _{i\in I_k^y}\left( \nabla _yH_i(x_{k+1},v_k)\!-\! \nabla _yH_i(x_{k+1},\xi _{k}^{i}) \right) \!+\! \frac{1}{n}\sum _{j=1}^n\nabla _yH_j(x_{k+1},\xi _{k}^{j}),\nonumber \end{aligned}$$
(A.1)
where \(I_k^x\) and \(I_k^y\) are mini-batches containing \(b\) indices. The variables \(\varphi _{k}^{i}\) and \(\xi _{k}^{i}\) follow the update rules \(\varphi _{k+1}^{i}=u_k\) if \(i\in I_k^x\) and \(\varphi _{k+1}^{i}=\varphi _{k}^{i}\) otherwise, and \(\xi _{k+1}^{i}=v_k\) if \(i\in I_k^y\) and \(\xi _{k+1}^{i}=\xi _{k}^{i}\) otherwise.
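As a quick sanity check on (A.1), the estimator is conditionally unbiased. The computation below is a sketch under two assumptions that are not restated in this appendix: each index in \(I_k^x\) is drawn uniformly from \(\left\{ 1,\ldots ,n \right\} \), and \(H=\frac{1}{n}\sum _{j=1}^nH_j\).
$$\begin{aligned} \mathbb {E}_{k}\widetilde{\nabla }_x(u_k,y_k)=&\frac{1}{n}\sum _{j=1}^n\left( \nabla _xH_j(u_k,y_k)- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right) + \frac{1}{n}\sum _{j=1}^n\nabla _xH_j(\varphi _{k}^{j},y_{k})\\ =&\frac{1}{n}\sum _{j=1}^n\nabla _xH_j(u_k,y_k)=\nabla _xH(u_k,y_k). \end{aligned}$$
The same computation applies to \(\widetilde{\nabla }_y(x_{k+1},v_k)\), conditionally on \(x_{k+1}\).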
To prove our variance bounds, we require the following lemma.
Lemma A.1
Suppose \(X_1,\cdots ,X_t\) are independent random variables satisfying \(\mathbb {E}_{k}X_i=0\) for \(1\le i\le t\). Then
$$\begin{aligned} \mathbb {E}_{k}\left\| X_1+\cdots +X_t \right\| ^2=\mathbb {E}_{k}\left[ \left\| X_1 \right\| ^2 +\cdots +\left\| X_t \right\| ^2 \right] . \end{aligned}$$
(A.2)
Proof
Our hypotheses on these random variables imply \(\mathbb {E}_{k}\left\langle X_i,X_j \right\rangle =0\) for \(i\ne j\). Therefore,
$$\begin{aligned} \mathbb {E}_{k}\left\| X_1+\cdots +X_t \right\| ^2= \mathbb {E}_{k}\sum _{i,j=1}^{t}\left\langle X_i,X_j \right\rangle =\mathbb {E}_{k}\left[ \left\| X_1 \right\| ^2 +\cdots +\left\| X_t \right\| ^2 \right] . \end{aligned}$$
\(\square \)
We are now prepared to prove that the SAGA gradient estimator is variance-reduced.
Lemma A.2
The SAGA gradient estimator satisfies
$$\begin{aligned}{} & {} \mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2\le \frac{1}{bn}\sum _{j=1}^n\left\| \nabla _xH_j(u_k,y_k)- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2,\nonumber \\{} & {} \mathbb {E}_{k}\left\| \widetilde{\nabla }_y(x_{k+1},v_k)\!-\!\nabla _yH(x_{k+1},v_k) \right\| ^2\!\le \!\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_k,v_k)\!-\! \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\nonumber \\{} & {} +\frac{16N^2\gamma ^2}{b}\left( \mathbb {E}_{k}\left\| z_{k+1}-z_{k}\right\| ^2+\left\| z_{k}-z_{k-1}\right\| ^2+\left\| z_{k-1}-z_{k-2}\right\| ^2\right) , \end{aligned}$$
(A.3)
as well as
$$\begin{aligned}{} & {} \mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| \le \frac{1}{\sqrt{bn}}\sum _{j=1}^n\left\| \nabla _xH_j(u_k,y_k)- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ,\nonumber \\{} & {} \mathbb {E}_{k}\left\| \widetilde{\nabla }_y(x_{k+1},v_k)\!-\!\nabla _yH(x_{k+1},v_k) \right\| \!\le \!\frac{2}{\sqrt{bn}}\sum _{j=1}^n\left\| \nabla _yH_j(x_k,v_k)\!-\! \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| \nonumber \\{} & {} +\frac{4N\gamma }{\sqrt{b}}\left( \mathbb {E}_{k}\left\| z_{k+1}-z_{k}\right\| +\left\| z_{k}-z_{k-1}\right\| +\left\| z_{k-1}-z_{k-2}\right\| \right) , \end{aligned}$$
(A.4)
where \(N=\max \left\{ M,L \right\} \), \(\gamma =\max \left\{ \gamma _1,\gamma _2 \right\} \).
Proof
According to (A.1), we have
$$\begin{aligned}&\mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2\\ =&\mathbb {E}_{k}\left\| \frac{1}{b} \sum _{i\in I_k^x}\left( \nabla _xH_i(u_k,y_k)\!-\! \nabla _xH_i(\varphi _{k}^{i},y_{k}) \right) \!-\!\nabla _xH(u_k,y_k)\!+\! \frac{1}{n}\sum _{j=1}^n\nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2\nonumber \\ \overset{(1)}{\le }&\frac{1}{b^2}\mathbb {E}_{k}\sum _{i\in I_k^x}\left\| \nabla _xH_i(u_k,y_k)- \nabla _xH_i(\varphi _{k}^{i},y_{k}) \right\| ^2\nonumber \\ =&\frac{1}{bn}\sum _{j=1}^n\left\| \nabla _xH_j(u_k,y_k)- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2.\nonumber \end{aligned}$$
(A.5)
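Inequality (1) follows from Lemma A.1 together with the bound \(\mathbb {E}\left\| X-\mathbb {E}X \right\| ^2\le \mathbb {E}\left\| X \right\| ^2\). By Jensen’s inequality and the elementary bound \(\sqrt{\sum _{j}a_j^2}\le \sum _{j}a_j\) for \(a_j\ge 0\), we obtain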
$$\begin{aligned} \mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| \le&\sqrt{\mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2}\\ \le&\frac{1}{\sqrt{bn}}\sqrt{\sum _{j=1}^n\left\| \nabla _xH_j(u_k,y_k)- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2}\nonumber \\ \le&\frac{1}{\sqrt{bn}}\sum _{j=1}^n\left\| \nabla _xH_j(u_k,y_k)- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| .\nonumber \end{aligned}$$
(A.6)
We use an analogous argument for \(\widetilde{\nabla }_y(x_{k+1},v_k)\). Let \(\mathbb {E}_{k,x}\) denote the expectation conditional on the first k iterations and \(I_k^x\). By the same reasoning as in (A.5), applying the Lipschitz continuity of \(\nabla _yH_j\), we obtain that
$$\begin{aligned}&\mathbb {E}_{k,x}\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2\nonumber \\ \le&\frac{1}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k+1},v_k)- \nabla _yH_j(x_{k+1},\xi _{k}^{j}) \right\| ^2\nonumber \\ \le&\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k+1},v_k)\!-\! \nabla _yH_j(x_{k},y_{k}) \right\| ^2\!+\!\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},y_k)\!-\! \nabla _yH_j(x_{k},v_{k}) \right\| ^2\nonumber \\&+\!\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},v_k)\!-\! \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\!+\!\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},\xi _{k}^{j})\!-\! \nabla _yH_j(x_{k+1},\xi _{k}^{j}) \right\| ^2\nonumber \\ \le&\frac{4M^2}{b}\left\| x_{k+1}-x_{k}\right\| ^2+\frac{4M^2}{b}\left\| v_{k}-y_{k}\right\| ^2+\frac{4L^2}{b}\left\| y_{k}-v_{k}\right\| ^2\nonumber \\&+\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},v_k)- \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2+\frac{4M^2}{b}\left\| x_{k+1}-x_{k}\right\| ^2\nonumber \\ \le&\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},v_k)- \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2+\frac{8M^2}{b}\left\| x_{k+1}-x_{k}\right\| ^2\nonumber \\&+\frac{4(M^2+L^2)}{b}\left( 2\gamma _1^2\left\| y_{k}-y_{k-1}\right\| ^2+2\gamma _2^2\left\| y_{k-1}-y_{k-2}\right\| ^2\right) \nonumber \\ \le&\frac{4}{bn}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},v_k)\!-\! \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\!+\!\frac{16N^2\gamma ^2}{b}\left( \left\| z_{k+1}\!-\!z_{k}\right\| ^2\!+\!\left\| z_{k}\!-\!z_{k-1}\right\| ^2\right. \nonumber \\&\left. +\left\| z_{k-1}-z_{k-2}\right\| ^2\right) , \end{aligned}$$
(A.7)
where \(N=\max \left\{ M,L \right\} \), \(\gamma =\max \left\{ \gamma _1,\gamma _2 \right\} \). Also, by the same reasoning as in (A.6),
$$\begin{aligned}&\mathbb {E}_{k,x}\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| \\ \le&\sqrt{\mathbb {E}_{k,x}\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2}\nonumber \\ \le&\frac{2}{\sqrt{bn}}\sum _{j=1}^n\left\| \nabla _yH_j(x_{k},v_k)- \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| +\frac{4N\gamma }{\sqrt{b}}\left( \left\| z_{k+1}-z_{k}\right\| +\left\| z_{k}-z_{k-1}\right\| \right. \nonumber \\&\left. +\left\| z_{k-1}-z_{k-2}\right\| \right) .\nonumber \end{aligned}$$
(A.8)
Applying the operator \(\mathbb {E}_{k}\) to (A.7) and (A.8), we get the desired result. \(\square \)
Now, define
$$\begin{aligned} \Upsilon _{k+1}=&\frac{1}{bn}\sum _{j=1}^n \left( \left\| \nabla _xH_j(u_{k+1},y_{k+1})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k+1}) \right\| ^2 \right. \\&\left. +4\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},\xi _{k+1}^{j}) \right\| ^2 \right) ,\nonumber \\ \Gamma _{k+1}=&\frac{1}{\sqrt{bn}}\sum _{j=1}^n \left( \left\| \nabla _xH_j(u_{k+1},y_{k+1})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k+1}) \right\| \right. \nonumber \\&\left. +2\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},\xi _{k+1}^{j}) \right\| \right) .\nonumber \end{aligned}$$
(A.9)
By Lemma A.2, we have
$$\begin{aligned}&\mathbb {E}_k\left[ \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2+\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2\right] \\ \le&\Upsilon _k+V_1\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| ^{2}+\left\| z_{k}-z_{k-1} \right\| ^{2} +\left\| z_{k-1}-z_{k-2} \right\| ^{2}\right) , \end{aligned}$$
and
$$\begin{aligned}&\mathbb {E}_k\left[ \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| +\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| \right] \\ \le&\Gamma _k+V_2\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| +\left\| z_{k}-z_{k-1} \right\| +\left\| z_{k-1}-z_{k-2} \right\| \right) . \end{aligned}$$
This is exactly the MSE bound, where \(V_{1}=\frac{16N^2\gamma ^2}{b}\) and \(V_{2}=\frac{4N\gamma }{\sqrt{b}}\).
Lemma A.3
(Geometric decay) Let \(\Upsilon _{k}\) be defined as in (A.9). Then the following geometric decay property holds:
$$\begin{aligned} \mathbb {E}_{k}\Upsilon _{k+1}\le \left( 1-\rho \right) \Upsilon _{k}+V_{\Upsilon }\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| ^{2}+\left\| z_{k}-z_{k-1} \right\| ^{2} +\left\| z_{k-1}-z_{k-2} \right\| ^{2}\right) , \end{aligned}$$
(A.10)
where \(\rho =\frac{b}{2n}\), \(V_{\Upsilon }=\frac{408nN^2(1+2\gamma _1^2+\gamma _2^2)}{b^2}\).
Proof
We show that \(\mathbb {E}_{k}\Upsilon _{k+1}\) is decreasing at a geometric rate. By applying the inequality \(\left\| a-c \right\| ^2\le (1+\varepsilon )\left\| a-b \right\| ^2+(1+\varepsilon ^{-1} )\left\| b-c \right\| ^2\) twice, it follows that
$$\begin{aligned}&\frac{1}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _xH_j(u_{k+1},y_{k+1})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k+1}) \right\| ^2\nonumber \\ \le&\frac{1+\varepsilon }{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _xH_j(u_{k},y_{k})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k+1}) \right\| ^2+\frac{1+\varepsilon ^{-1}}{bn}\sum _{j=1}^n\mathbb {E}_{k} \left\| \nabla _xH_j(u_{k+1},y_{k+1})\right. \nonumber \\&\left. - \nabla _xH_j(u_{k},y_{k})\right\| ^2\nonumber \\ \le&\frac{(1+\varepsilon )^2 }{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _xH_j(u_{k},y_{k})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k}) \right\| ^2\nonumber \\&+\frac{(1+\varepsilon )(1+\varepsilon ^{-1} ) }{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _xH_j(\varphi _{k+1}^{j},y_{k})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k+1}) \right\| ^2\nonumber \\&+\frac{1+\varepsilon ^{-1}}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _xH_j(u_{k+1},y_{k+1})- \nabla _xH_j(u_{k},y_{k})\right\| ^2\nonumber \\ \le&\frac{(1\!+\varepsilon )^2 (1-\!b/n)}{bn}\sum _{j=1}^n \left\| \nabla _xH_j(u_{k},y_{k})\!-\! \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2\!+\!\frac{(1+\varepsilon )(1+\varepsilon ^{-1} )M^2 }{b}\mathbb {E}_{k}\left\| y_{k}\!-\! y_{k+1} \right\| ^2\nonumber \\&+\frac{(1+\varepsilon ^{-1})M^2}{b}\mathbb {E}_{k}\left( \left\| u_{k+1}-u_{k}\right\| ^2+\left\| y_{k+1}-y_{k}\right\| ^2\right) \nonumber \\ \le&\frac{(1\!+\varepsilon )^2 (1-b/n)}{bn}\sum _{j=1}^n \left\| \nabla _xH_j(u_{k},y_{k})\!-\! \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2\!+\!\frac{(2+\varepsilon )(1+\varepsilon ^{-1} )M^2 }{b}\mathbb {E}_{k}\left\| y_{k+1}\!- y_{k} \right\| ^2\nonumber \\&+\frac{(1+\varepsilon ^{-1})M^2}{b}\mathbb {E}_{k}\left( 3\left\| u_{k+1}-x_{k+1}\right\| ^2+3\left\| x_{k+1}-x_{k}\right\| ^2+3\left\| x_{k}-u_{k}\right\| ^2\right) \nonumber \\ \le&\frac{(1+\!\varepsilon )^2 (1-b/n)}{bn}\sum _{j=1}^n \left\| \nabla _xH_j(u_{k},y_{k})\!-\! 
\nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2\!+\!\frac{(2+\varepsilon )(1+\varepsilon ^{-1} )M^2 }{b}\mathbb {E}_{k}\left\| y_{k+1}\!-\! y_{k} \right\| ^2\nonumber \\&+\frac{3M^2(1+\varepsilon ^{-1})(1+2\gamma _{1}^2)}{b}\mathbb {E}_{k}\left\| x_{k+1}-x_{k}\right\| ^2+\frac{6M^2(1+\varepsilon ^{-1})(\gamma _{1}^2+\gamma _{2}^2)}{b}\left\| x_{k}-x_{k-1}\right\| ^2\nonumber \\&+\frac{6M^2(1+\varepsilon ^{-1})\gamma _{2}^2}{b}\left\| x_{k-1}-x_{k-2}\right\| ^2. \end{aligned}$$
(A.11)
Similarly,
$$\begin{aligned}&\frac{1}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},\xi _{k+1}^{j}) \right\| ^2\nonumber \\ \le&\frac{1+\varepsilon }{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k})- \nabla _yH_j(x_{k+1},\xi _{k+1}^{j}) \right\| ^2\nonumber \\&+\frac{1+\varepsilon ^{-1}}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},v_{k})\right\| ^2\nonumber \\ \le&\frac{(1+\varepsilon )^2 (1-b/n)}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k},v_{k})- \nabla _yH_j(x_{k+1},\xi _{k}^{j}) \right\| ^2\nonumber \\&+\frac{(1+\varepsilon )(1+\varepsilon ^{-1}) (1-b/n)}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k})- \nabla _yH_j(x_{k},v_{k}) \right\| ^2\nonumber \\&+\frac{1+\varepsilon ^{-1}}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},v_{k})\right\| ^2\nonumber \\ \le&\frac{(1+\varepsilon )^3 (1-b/n)}{bn}\sum _{j=1}^n \left\| \nabla _yH_j(x_{k},v_{k})- \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\nonumber \\&+\frac{(1+\varepsilon )^2(1+\varepsilon ^{-1}) (1-b/n)}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k},\xi _{k}^{j})- \nabla _yH_j(x_{k+1},\xi _{k}^{j}) \right\| ^2\nonumber \\&+\frac{(1+\varepsilon )(1+\varepsilon ^{-1}) (1-b/n)}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k})- \nabla _yH_j(x_{k},v_{k}) \right\| ^2\nonumber \\&+\frac{1+\varepsilon ^{-1}}{bn}\sum _{j=1}^n \mathbb {E}_{k}\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},v_{k})\right\| ^2\nonumber \\ \le&\frac{(1\!+\!\varepsilon )^3 (1-b/n)}{bn}\sum _{j=1}^n \left\| \nabla _yH_j(x_{k},v_{k})\!-\! 
\nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\!+\!\frac{(1\!+\!\varepsilon )^2(1\!+\!\varepsilon ^{-1}) (1\!-\!b/n)M^2}{b}\nonumber \\&\mathbb {E}_{k}\left\| x_{k+1}\!-\!x_{k} \right\| ^2\!+\!\frac{(1\!+\!\varepsilon )(1\!+\!\varepsilon ^{-1}) (1\!-\!b/n)M^2}{b}\mathbb {E}_{k}\left\| x_{k+1}\!-\!x_{k} \right\| ^2\!+\!\frac{(1\!+\!\varepsilon ^{-1})L^2}{b}\mathbb {E}_{k}\left\| v_{k+1}\!-\!v_{k} \right\| ^2\nonumber \\ \le&\frac{(1\!+\!\varepsilon )^3 (1\!-\!b/n)}{bn}\sum _{j=1}^n \left\| \nabla _yH_j(x_{k},v_{k})\!-\! \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\!+\!\frac{(2\!+\!\varepsilon )(1\!+\!\varepsilon )(1\!+\!\varepsilon ^{-1}) (1\!-\!b/n)M^2}{b}\nonumber \\&\mathbb {E}_{k}\left\| x_{k+1}-x_{k} \right\| ^2+\frac{(1+\varepsilon ^{-1})L^2}{b}\mathbb {E}_{k}\left( 3\left\| v_{k+1}-y_{k+1}\right\| ^2+3\left\| y_{k+1}-y_{k}\right\| ^2+3\left\| y_{k}-v_{k}\right\| ^2\right) \nonumber \\ \le&\frac{(1\!+\!\varepsilon )^3 (1\!-\!b/n)}{bn}\sum _{j=1}^n \left\| \nabla _yH_j(x_{k},v_{k})\!-\! \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\!+\!\frac{(2\!+\!\varepsilon )(1\!+\!\varepsilon )(1\!+\!\varepsilon ^{-1}) (1\!-\!b/n)M^2}{b}\nonumber \\&\mathbb {E}_{k}\left\| x_{k+1}-x_{k} \right\| ^2+\frac{3L^2(1+\varepsilon ^{-1})(1+2\gamma _{1}^2)}{b}\mathbb {E}_{k}\left\| y_{k+1}-y_{k}\right\| ^2+\frac{6L^2(1+\varepsilon ^{-1})(\gamma _{1}^2+\gamma _{2}^2)}{b}\nonumber \\&\left\| y_{k}-y_{k-1}\right\| ^2+\frac{6L^2(1+\varepsilon ^{-1})\gamma _{2}^2}{b}\left\| y_{k-1}-y_{k-2}\right\| ^2. \end{aligned}$$
(A.12)
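Both (A.11) and (A.12) rely on the inequality \(\left\| a-c \right\| ^2\le (1+\varepsilon )\left\| a-b \right\| ^2+(1+\varepsilon ^{-1} )\left\| b-c \right\| ^2\), valid for any \(\varepsilon >0\). For completeness, here is a short derivation; the symbols \(p=a-b\) and \(q=b-c\) are introduced only for this remark, and \(b\) denotes a vector here rather than the batch size.
$$\begin{aligned} \left\| a-c \right\| ^2=\left\| p+q \right\| ^2=&\left\| p \right\| ^2+2\left\langle p,q \right\rangle +\left\| q \right\| ^2\\ \le&\left\| p \right\| ^2+\varepsilon \left\| p \right\| ^2+\varepsilon ^{-1}\left\| q \right\| ^2+\left\| q \right\| ^2\\ =&(1+\varepsilon )\left\| p \right\| ^2+(1+\varepsilon ^{-1})\left\| q \right\| ^2, \end{aligned}$$
where the inequality is Young’s inequality \(2\left\langle p,q \right\rangle \le \varepsilon \left\| p \right\| ^2+\varepsilon ^{-1}\left\| q \right\| ^2\).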
With
$$\begin{aligned} \Upsilon _{k+1}=&\frac{1}{bn}\sum _{j=1}^n \left( \left\| \nabla _xH_j(u_{k+1},y_{k+1})- \nabla _xH_j(\varphi _{k+1}^{j},y_{k+1}) \right\| ^2 \right. \\&\left. +4\left\| \nabla _yH_j(x_{k+1},v_{k+1})- \nabla _yH_j(x_{k+1},\xi _{k+1}^{j}) \right\| ^2 \right) , \end{aligned}$$
adding (A.11) to four times (A.12), and using \((1+\varepsilon )^2\le (1+\varepsilon )^3\) for the first term, we obtain
$$\begin{aligned}&\mathbb {E}_{k}\Upsilon _{k+1}\\ \le&\!(1\!+\!\varepsilon )^3 (\!1\!-\!b/n)\Upsilon _{k}\!+\!\frac{(2\!+\!\varepsilon )(1\!+\!\varepsilon ^{-1} )M^2 }{b}\mathbb {E}_{k}\left\| y_{k\!+1}\!-\! y_{k} \right\| ^2\!+\!\frac{3M^2(\!1\!+\!\varepsilon ^{-1})(\!1\!+\!2\gamma _{1}^2)}{b}\\&\mathbb {E}_{k}\left\| x_{k+1}-x_{k}\right\| ^2+\frac{6M^2(1+\varepsilon ^{-1})(\gamma _{1}^2+\gamma _{2}^2)}{b}\left\| x_{k}-x_{k-1}\right\| ^2+\frac{6M^2(1+\varepsilon ^{-1})\gamma _{2}^2}{b}\\&\left\| x_{k-1}-x_{k-2}\right\| ^2+\frac{4(1+\varepsilon )(1+\varepsilon ^{-1}) (1-b/n)M^2(2+\varepsilon )}{b}\mathbb {E}_{k}\left\| x_{k+1}-x_{k} \right\| ^2\\&\!+\!\frac{12L^2(1\!+\!\varepsilon ^{-1})(1\!+\!2\gamma _{1}^2)}{b}\mathbb {E}_{k}\left\| y_{k+1}\!-\!y_{k}\right\| ^2\!+\!\frac{24L^2(1\!+\!\varepsilon ^{-1})(\gamma _{1}^2\!+\!\gamma _{2}^2)}{b}\left\| y_{k}\!-\!y_{k-1}\right\| ^2\\&+\frac{24L^2(1+\varepsilon ^{-1})\gamma _{2}^2}{b}\left\| y_{k-1}-y_{k-2}\right\| ^2\\ \le&(1+\varepsilon )^3 (1-b/n)\Upsilon _{k}+\frac{13N^2(1+\varepsilon )(2+\varepsilon )(1+\varepsilon ^{-1} )(1+2\gamma _1^2)}{b}\mathbb {E}_{k}\left\| z_{k+1}- z_{k} \right\| ^2\\&+\frac{24N^2(1+\varepsilon ^{-1})(\gamma _1^2+\gamma _{2}^2)}{b}\left\| z_{k}-z_{k-1}\right\| ^2+\frac{24N^2\gamma _{2}^2(1+\varepsilon ^{-1})}{b}\left\| z_{k-1}-z_{k-2}\right\| ^2\\ \le&(1\!+\!\varepsilon )^3 (1\!-\!b/n)\Upsilon _{k}\!+\!\frac{24N^2(1\!+\!\varepsilon )(2\!+\!\varepsilon )(1\!+\!\varepsilon ^{-1})(1\!+\!2\gamma _1^2\!+\!\gamma _2^2)}{b}\left( \mathbb {E}_k\left\| z_{k+1}\!-\!z_{k} \right\| ^{2}\right. \\&\left. +\left\| z_{k}-z_{k-1} \right\| ^{2} +\left\| z_{k-1}-z_{k-2} \right\| ^{2}\right) , \end{aligned}$$
where \(N=\max \left\{ M,L \right\} \). Choosing \(\varepsilon =\frac{b}{6n}\), we have \((1+\varepsilon )^3(1-\frac{b}{n} ) \le 1-\frac{b}{2n}\), producing the inequality
$$\begin{aligned} \mathbb {E}_{k}\Upsilon _{k+1}\le&(1-\frac{b}{2n})\Upsilon _{k}+\frac{24N^2(1+\frac{b}{6n})(2+\frac{b}{6n})(1+\frac{6n}{b})(1+2\gamma _1^2+\gamma _2^2)}{b}\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| ^{2}\right. \nonumber \\&\left. +\left\| z_{k}-z_{k-1} \right\| ^{2}+\left\| z_{k-1}-z_{k-2} \right\| ^{2}\right) \nonumber \\ \le&(1\!-\!\frac{b}{2n})\Upsilon _{k}\!+\!\frac{408nN^2(1\!+\!2\gamma _1^2\!+\!\gamma _2^2)}{b^2}\left( \mathbb {E}_k\left\| z_{k+1}\!-\!z_{k} \right\| ^{2}\!+\!\left\| z_{k}\!-\!z_{k-1} \right\| ^{2} \!+\!\left\| z_{k-1}\!-\!z_{k-2} \right\| ^{2}\right) . \end{aligned}$$
(A.13)
This completes the proof. \(\square \)
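The contraction factor claimed in the proof of Lemma A.3 can be verified by direct expansion. Writing \(t=\frac{b}{n}\in (0,1]\) (a shorthand used only here, assuming \(b\le n\)), so that \(\varepsilon =\frac{t}{6}\), we have
$$\begin{aligned} \left( 1+\frac{t}{6}\right) ^3(1-t)=&\left( 1+\frac{t}{2}+\frac{t^2}{12}+\frac{t^3}{216}\right) (1-t)\\ =&1-\frac{t}{2}-\frac{5t^2}{12}-\frac{17t^3}{216}-\frac{t^4}{216}\le 1-\frac{t}{2}. \end{aligned}$$
All discarded terms are nonpositive, so the bound \((1+\varepsilon )^3(1-\frac{b}{n} ) \le 1-\frac{b}{2n}\) holds for every batch size \(1\le b\le n\).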
Lemma A.4
(Convergence of estimator) If \(\left\{ z_k \right\} _{k\in \mathbb {N} }\) satisfies \(\lim _{k \rightarrow \infty } \mathbb {E}\left\| z_{k}\!-\!z_{k-1} \right\| ^{2}\!=0\), then \( \mathbb {E}\Upsilon _k\rightarrow 0\) and \(\mathbb {E}\Gamma _k\rightarrow 0\) as \(k\rightarrow \infty \).
Proof
We first show that \(\sum _{j=1}^n \mathbb {E}\left\| \nabla _xH_j(u_{k},y_{k})- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2\rightarrow 0\) as \(k\rightarrow \infty \). Indeed,
$$\begin{aligned}&\sum _{j=1}^n \mathbb {E}\left\| \nabla _xH_j(u_{k},y_{k})- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2\le L^2\sum _{j=1}^n \mathbb {E}\left\| u_{k}-\varphi _{k}^{j} \right\| ^2\nonumber \\ \le&nL^2(1+\frac{2n}{b})\mathbb {E}\left\| u_{k}-u_{k-1}\right\| ^2+L^2(1+\frac{b}{2n})\sum _{j=1}^n \mathbb {E}\left\| u_{k-1}-\varphi _{k}^{j}\right\| ^2\nonumber \\ \le&nL^2(1+\frac{2n}{b})\mathbb {E}\left\| u_{k}-u_{k-1}\right\| ^2+L^2(1+\frac{b}{2n})(1-\frac{b}{n})\sum _{j=1}^n \mathbb {E}\left\| u_{k-1}-\varphi _{k-1}^{j}\right\| ^2\nonumber \\ \le&nL^2(1+\frac{2n}{b})\mathbb {E}\left\| u_{k}-u_{k-1}\right\| ^2+L^2(1-\frac{b}{2n})\sum _{j=1}^n \mathbb {E}\left\| u_{k-1}-\varphi _{k-1}^{j}\right\| ^2\nonumber \\ \le&nL^2(1+\frac{2n}{b})\sum _{l=1}^k(1-\frac{b}{2n})^{k-l} \mathbb {E}\left\| u_{l}-u_{l-1}\right\| ^2. \end{aligned}$$
(A.14)
Since \(\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^{2}\rightarrow 0\), we also have \(\mathbb {E}\left\| u_{k}-u_{k-1} \right\| ^{2}\rightarrow 0\). Because the weights \((1-\frac{b}{2n})^{k-l}\) decay geometrically, it follows that \(\sum _{l=1}^k(1-\frac{b}{2n})^{k-l} \mathbb {E}\left\| u_{l}-u_{l-1}\right\| ^2 \rightarrow 0\), and hence \(\sum _{j=1}^n \mathbb {E}\left\| \nabla _xH_j(u_{k},y_{k})- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| ^2 \rightarrow 0\) as \(k\rightarrow \infty \). An analogous argument shows that \(\sum _{j=1}^n \mathbb {E}\left\| \nabla _yH_j(x_{k},v_{k})- \nabla _yH_j(x_{k},\xi _{k}^{j}) \right\| ^2\rightarrow 0\) as \(k\rightarrow \infty \). So \(\mathbb {E}\Upsilon _k\rightarrow 0\) as \(k\rightarrow \infty \). Similarly, we can show that \(\mathbb {E}\Gamma _k\rightarrow 0\) as \(k\rightarrow \infty \). Indeed,
$$\begin{aligned}&\sum _{j=1}^n \mathbb {E}\left\| \nabla _xH_j(u_{k},y_{k})- \nabla _xH_j(\varphi _{k}^{j},y_{k}) \right\| \le L\sum _{j=1}^n \mathbb {E}\left\| u_{k}-\varphi _{k}^{j} \right\| \nonumber \\ \le&nL\mathbb {E}\left\| u_{k}-u_{k-1}\right\| +L\sum _{j=1}^n \mathbb {E}\left\| u_{k-1}-\varphi _{k}^{j}\right\| \nonumber \\ \le&nL\mathbb {E}\left\| u_{k}-u_{k-1}\right\| +L(1-\frac{b}{n})\sum _{j=1}^n \mathbb {E}\left\| u_{k-1}-\varphi _{k-1}^{j}\right\| \nonumber \\ \le&nL\sum _{l=1}^k(1-\frac{b}{n})^{k-l} \mathbb {E}\left\| u_{l}-u_{l-1}\right\| . \end{aligned}$$
(A.15)
Since \(\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^{2}\rightarrow 0\), Jensen’s inequality gives \(\mathbb {E}\left\| z_{k}-z_{k-1} \right\| \le \sqrt{\mathbb {E}\left\| z_{k}-z_{k-1} \right\| ^{2}}\rightarrow 0\). Hence \(\mathbb {E}\left\| u_{k}-u_{k-1} \right\| \rightarrow 0\), so the right-hand side of (A.15) tends to zero as \(k\rightarrow \infty \), and therefore \(\mathbb {E}\Gamma _k\rightarrow 0\).
\(\square \)
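The proof of Lemma A.4 uses the fact that a geometrically weighted sum of a vanishing sequence also vanishes. A sketch of this standard argument, with hypothetical shorthand \(q\in (0,1)\) for the decay factor and \(a_l\ge 0\) for a sequence with \(a_l\rightarrow 0\): fix \(\delta >0\), choose \(K\) such that \(a_l\le \delta \) for all \(l>K\), and let \(A=\sup _l a_l<\infty \). Then, for \(k>K\),
$$\begin{aligned} \sum _{l=1}^k q^{k-l}a_l\le A\sum _{l=1}^{K}q^{k-l}+\delta \sum _{l=K+1}^{k}q^{k-l} \le \frac{A\,q^{k-K}}{1-q}+\frac{\delta }{1-q}. \end{aligned}$$
Letting \(k\rightarrow \infty \) and then \(\delta \rightarrow 0\) shows that the sum tends to zero. In (A.14) this is applied with \(q=1-\frac{b}{2n}\) and \(a_l=\mathbb {E}\left\| u_{l}-u_{l-1}\right\| ^2\); in (A.15) with \(q=1-\frac{b}{n}\) (for \(b<n\)) and \(a_l=\mathbb {E}\left\| u_{l}-u_{l-1}\right\| \).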
1.2 B SARAH variance bound
As in the previous section, we use \(I_k^x\) and \(I_k^y\) to denote the mini-batches used to approximate \(\nabla _xH(u_k,y_k) \) and \(\nabla _yH(x_{k+1},v_k)\), respectively.
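For reference, the recursion used in (B.1) and the factor \(1-\frac{1}{p}\) in Lemma B.1 are consistent with the following form of the SARAH estimator. This is a sketch inferred from those two places, not a restatement of the definition in the main text, which takes precedence; in particular, the probability \(\frac{1}{p}\) of a full-gradient step is an assumption.
$$\begin{aligned} \widetilde{\nabla }_x(u_k,y_k)= {\left\{ \begin{array}{ll} \nabla _xH(u_k,y_k), &{} \text {with probability } \frac{1}{p},\\ \frac{1}{b}\sum _{i\in I_k^x}\left( \nabla _xH_i(u_k,y_k)- \nabla _xH_i(u_{k-1},y_{k-1}) \right) +\widetilde{\nabla }_x(u_{k-1},y_{k-1}), &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
with the analogous form for \(\widetilde{\nabla }_y(x_{k+1},v_k)\).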
Lemma B.1
The SARAH gradient estimator satisfies
$$\begin{aligned}&\mathbb {E}_{k}\left( \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2+\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2\right) \\ \le&\left( \! 1\!-\!\frac{1}{p} \!\right) \left( \! \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})\!-\!\nabla _xH(u_{k-1},y_{k-1})\right\| ^2\!+\! \left\| \widetilde{\nabla }_y(x_{k},v_{k-1})\!-\!\nabla _yH(x_{k},v_{k-1})\right\| ^2\right) \\&+V_{1}\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| ^{2}+\left\| z_{k}-z_{k-1} \right\| ^{2} +\left\| z_{k-1}-z_{k-2} \right\| ^{2}+\left\| z_{k-2}-z_{k-3} \right\| ^{2}\right) , \end{aligned}$$
as well as
$$\begin{aligned}&\mathbb {E}_k\left( \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| +\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| \right) \\ \le&\sqrt{1-\frac{1}{p} } \left( \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1}) \right\| +\left\| \widetilde{\nabla }_y(x_{k},v_{k-1})-\nabla _yH(x_{k},v_{k-1}) \right\| \right) \\&+V_2\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| +\left\| z_{k}-z_{k-1} \right\| +\left\| z_{k-1}-z_{k-2} \right\| +\left\| z_{k-2}-z_{k-3} \right\| \right) , \end{aligned}$$
where \(V_{1}=6\left( 1-\frac{1}{p} \right) M^2(1+2\gamma _{1}^2+\gamma _{2}^2)\) and \(V_{2}=M\sqrt{6(1-\frac{1}{p})(1+2\gamma _{1}^2+\gamma _{2}^2) }\).
Proof
Let \(\mathbb {E}_{k,p}\) denote the expectation conditional on the first k iterations and the event that we do not compute the full gradient at iteration k. The conditional expectation of the SARAH gradient estimator in this case is
$$\begin{aligned} \mathbb {E}_{k,p}\widetilde{\nabla }_x(u_k,y_k)=&\frac{1}{b}\mathbb {E}_{k,p}\left( \! \sum _{i\in I_k^x} \nabla _xH_i(u_k,y_k)\!-\! \nabla _xH_i(u_{k-1},y_{k-1}) \!\right) \!+\!\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \nonumber \\ =&\nabla _xH(u_k,y_k)-\nabla _xH(u_{k-1},y_{k-1})+\widetilde{\nabla }_x(u_{k-1},y_{k-1}), \end{aligned}$$
(B.1)
and further
$$\begin{aligned}&\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k) -\nabla _xH(u_k,y_k) \right\| ^2 \nonumber \\ =&\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1})+\nabla _xH(u_{k-1},y_{k-1}) -\nabla _xH(u_k,y_k)\right. \nonumber \\&\left. +\widetilde{\nabla }_x(u_k,y_k)-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right\| ^2 \nonumber \\ =&\left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1})\right\| ^2+\left\| \nabla _xH(u_{k-1},y_{k-1}) -\nabla _xH(u_k,y_k) \right\| ^2\nonumber \\&+\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k)-\widetilde{\nabla }_x(u_{k-1},y_{k-1})\right\| ^2 \nonumber \\&+2\left\langle \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1}), \nabla _xH(u_{k-1},y_{k-1}) -\nabla _xH(u_k,y_k) \right\rangle \nonumber \\&-2\left\langle \nabla _xH(u_{k-1},y_{k-1})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}), \mathbb {E}_{k,p}\left( \widetilde{\nabla }_x(u_{k},y_{k})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right) \right\rangle \nonumber \\&-2\left\langle \nabla _xH(u_k,y_k)-\nabla _xH(u_{k-1},y_{k-1}), \mathbb {E}_{k,p}\left( \widetilde{\nabla }_x(u_{k},y_{k})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right) \right\rangle . \end{aligned}$$
(B.2)
By (B.1), we see that
$$\mathbb {E}_{k,p}\left( \widetilde{\nabla }_x(u_{k},y_{k})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right) =\nabla _xH(u_{k},y_{k})-\nabla _xH(u_{k-1},y_{k-1}).$$
Thus, the first two inner products in (B.2) sum to zero and the third one is equal to
$$\begin{aligned}&-2\left\langle \nabla _xH(u_k,y_k)-\nabla _xH(u_{k-1},y_{k-1}), \mathbb {E}_{k,p}\left( \widetilde{\nabla }_x(u_{k},y_{k})-\widetilde{\nabla }_x(u_{k-1},y_{k-1}) \right) \right\rangle \\ =&-2\left\langle \nabla _xH(u_k,y_k)-\nabla _xH(u_{k-1},y_{k-1}), \nabla _xH(u_{k},y_{k})-\nabla _xH(u_{k-1},y_{k-1})\right\rangle \\ =&-2\left\| \nabla _xH(u_k,y_k)-\nabla _xH(u_{k-1},y_{k-1}) \right\| ^2. \end{aligned}$$
This yields
$$\begin{aligned}&\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k) -\nabla _xH(u_k,y_k) \right\| ^2 \\ =&\left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1})\right\| ^2-\left\| \nabla _xH(u_{k-1},y_{k-1}) -\nabla _xH(u_k,y_k) \right\| ^2 \\&+\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k)-\widetilde{\nabla }_x(u_{k-1},y_{k-1})\right\| ^2\\ \le&\left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})\!-\!\nabla _xH(u_{k-1},y_{k-1})\right\| ^2\!+\!\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k)\!-\!\widetilde{\nabla }_x(u_{k-1},y_{k-1})\right\| ^2. \end{aligned}$$
We bound the second term by computing the expectation over the mini-batch \(I_k^x\):
$$\begin{aligned}&\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k)-\widetilde{\nabla }_x(u_{k-1},y_{k-1})\right\| ^2\\ =&\mathbb {E}_{k,p}\left\| \frac{1}{b}\left( \sum _{i\in I_k^x} \nabla _xH_i(u_k,y_k)- \nabla _xH_i(u_{k-1},y_{k-1}) \right) \right\| ^2\\ \le&\frac{1}{b}\mathbb {E}_{k,p}\left[ \sum _{i\in I_k^x} \left\| \nabla _xH_i(u_k,y_k)- \nabla _xH_i(u_{k-1},y_{k-1}) \right\| ^2 \right] \\ =&\frac{1}{n} \sum _{j=1}^{n} \left\| \nabla _xH_j(u_k,y_k)- \nabla _xH_j(u_{k-1},y_{k-1}) \right\| ^2. \end{aligned}$$
The inequality is due to the convexity of the function \(x\mapsto \left\| x \right\| ^2\). This results in the recursive inequality
$$\begin{aligned}&\mathbb {E}_{k,p}\left\| \widetilde{\nabla }_x(u_k,y_k) -\nabla _xH(u_k,y_k) \right\| ^2 \\ \le&\left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})\!-\!\nabla _xH(u_{k-1},y_{k-1})\right\| ^2\!+\!\frac{1}{n} \sum _{j=1}^{n} \left\| \nabla _xH_j(u_k,y_k)\!-\! \nabla _xH_j(u_{k-1},y_{k-1}) \right\| ^2. \end{aligned}$$
This bounds the MSE under the condition that the full gradient is not computed. When the full gradient is computed, the MSE is equal to zero. Averaging over the two cases (the full gradient is not computed with probability \(1-\frac{1}{p}\)) and using the \(M\)-Lipschitz continuity of the gradients of the \(H_j\), we get
$$\begin{aligned}&\mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k) -\nabla _xH(u_k,y_k) \right\| ^2 \\ \le&\left( \! 1\!-\!\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})\!-\!\nabla _xH(u_{k-1},y_{k-1})\right\| ^2\!+\!\frac{1}{n} \sum _{j=1}^{n} \left\| \nabla _xH_j(u_k,y_k)\!-\! \nabla _xH_j(u_{k-1},y_{k-1}) \right\| ^2 \right) \\ \le&\left( 1-\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1})\right\| ^2+M^2\left\| (u_k,y_k)- (u_{k-1},y_{k-1}) \right\| ^2 \right) . \end{aligned}$$
Using \((a+b+c) ^2\le 3(a^2+b^2+c^2)\), we can estimate
$$\begin{aligned}&\left\| (u_k,y_k)- (u_{k-1},y_{k-1}) \right\| ^2=\left\| u_k-u_{k-1}\right\| ^2+\left\| y_k-y_{k-1}\right\| ^2\\ \le&3\left\| u_k-x_{k}\right\| ^2+3\left\| x_k-x_{k-1}\right\| ^2+3\left\| x_{k-1}-u_{k-1}\right\| ^2+\left\| y_k-y_{k-1}\right\| ^2\\ \le&3(1\!+\!2\gamma _{1}^2)\left\| x_k\!-\!x_{k-1}\right\| ^2\!+\!6(\gamma _{1}^2\!+\!\gamma _{2}^2)\left\| x_{k-1}\!-\!x_{k-2}\right\| ^2\!+\!6\gamma _{2}^2\left\| x_{k-2}\!-\!x_{k-3}\right\| ^2\!+\!\left\| y_k-y_{k-1}\right\| ^2. \end{aligned}$$
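The last inequality above is consistent with a two-step inertial extrapolation. As a sketch, assume (the rule is not restated in this appendix) that \(u_k=x_k+\gamma _{1k}(x_k-x_{k-1})+\gamma _{2k}(x_{k-1}-x_{k-2})\) with \(|\gamma _{1k}|\le \gamma _1\) and \(|\gamma _{2k}|\le \gamma _2\). Then \(\left\| a+b \right\| ^2\le 2\left\| a \right\| ^2+2\left\| b \right\| ^2\) gives
$$\begin{aligned} \left\| u_k-x_{k}\right\| ^2\le&2\gamma _{1}^2\left\| x_k-x_{k-1}\right\| ^2+2\gamma _{2}^2\left\| x_{k-1}-x_{k-2}\right\| ^2,\\ \left\| x_{k-1}-u_{k-1}\right\| ^2\le&2\gamma _{1}^2\left\| x_{k-1}-x_{k-2}\right\| ^2+2\gamma _{2}^2\left\| x_{k-2}-x_{k-3}\right\| ^2. \end{aligned}$$
Multiplying both bounds by 3 and collecting the coefficients of each difference reproduces the factors \(3(1+2\gamma _{1}^2)\), \(6(\gamma _{1}^2+\gamma _{2}^2)\) and \(6\gamma _{2}^2\) in the bound on \(\left\| (u_k,y_k)- (u_{k-1},y_{k-1}) \right\| ^2\).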
Substituting the above inequality, we can obtain
$$\begin{aligned}&\mathbb {E}_{k}\left\| \widetilde{\nabla }_x(u_k,y_k) -\nabla _xH(u_k,y_k) \right\| ^2 \nonumber \\ \le&\left( 1\!-\!\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})\!-\!\nabla _xH(u_{k-1},y_{k-1})\right\| ^2\!+\!3M^2(1\!+\!2\gamma _{1}^2)\left\| x_k\!-\!x_{k-1}\right\| ^2\right. \nonumber \\&\left. +\!6M^2(\gamma _{1}^2\!+\!\gamma _{2}^2)\left\| x_{k-1}\!-\!x_{k-2}\right\| ^2\!+\!6M^2\gamma _{2}^2\left\| x_{k-2}\!-\!x_{k-3}\right\| ^2\!+\!M^2\left\| y_k\!-\!y_{k-1}\right\| ^2 \right) . \end{aligned}$$
(B.3)
By a symmetric argument, we have
$$\begin{aligned}&\mathbb {E}_{k}\left\| \widetilde{\nabla }_y(x_{k+1},v_k) -\nabla _yH(x_{k+1},v_k) \right\| ^2 \nonumber \\ \le&\left( \! 1\!-\!\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_y(x_{k},v_{k-1})\!-\!\nabla _yH(x_{k},v_{k-1})\right\| ^2\!+\!M^2\mathbb {E}_{k}\left\| (x_{k+1},v_k)\!-\! (x_{k},v_{k-1}) \right\| ^2 \right) \nonumber \\ \le&\left( \! 1\!-\!\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_y(x_{k},v_{k-1})\!-\!\nabla _yH(x_{k},v_{k-1})\right\| ^2\!+\!M^2\mathbb {E}_{k}\left\| x_{k+1}\!-\!x_{k}\right\| ^2\!+\!3M^2(1\!+\!2\mu _{1k}^2)\right. \nonumber \\&\left. \left\| y_k\!-\!y_{k-1}\right\| ^2\!+\!6M^2(\mu _{1,k-1}^2\!+\!\mu _{2k}^2)\left\| y_{k-1}\!-\!y_{k-2}\right\| ^2\!+\!6M^2\mu _{2,k-1}^2\left\| y_{k-2}\!-\!y_{k-3}\right\| ^2 \right) \nonumber \\ \le&\left( \! 1\!-\!\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_y(x_{k},v_{k-1})\!-\!\nabla _yH(x_{k},v_{k-1})\right\| ^2\!+\!M^2\mathbb {E}_{k}\left\| x_{k+1}\!-\!x_{k}\right\| ^2\!+\!3M^2(1\!+\!2\gamma _{1}^2)\right. \nonumber \\&\left. \left\| y_k-y_{k-1}\right\| ^2+6M^2(\gamma _{1}^2+\gamma _{2}^2)\left\| y_{k-1}-y_{k-2}\right\| ^2+6M^2\gamma _{2}^2\left\| y_{k-2}-y_{k-3}\right\| ^2 \right) . \end{aligned}$$
(B.4)
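Combining (B.3) and (B.4), and denoting by \(\Upsilon _{k}\) the sum of the two squared estimator errors from the previous iteration (see (B.5)), we obtain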
$$\begin{aligned}&\mathbb {E}_{k}\left( \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2+\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2\right) \\ \le&\left( \!1\!-\!\frac{1}{p} \right) \left( \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})\!-\!\nabla _xH(u_{k-1},y_{k-1})\right\| ^2\!+\! \left\| \widetilde{\nabla }_y(x_{k},v_{k-1})\!-\!\nabla _yH(x_{k},v_{k-1})\right\| ^2\right. \\&\left. +M^2\mathbb {E}_{k}\left\| x_{k+1}-x_{k}\right\| ^2+M^2\left\| y_{k}-y_{k-1}\right\| ^2+3M^2(1+2\gamma _{1}^2)\left\| z_k-z_{k-1}\right\| ^2\right. \\&\left. +6M^2(\gamma _{1}^2+\gamma _{2}^2)\left\| z_{k-1}-z_{k-2}\right\| ^2+6M^2\gamma _{2}^2\left\| z_{k-2}-z_{k-3}\right\| ^2 \right) \\ \le&\left( 1-\frac{1}{p} \right) \Upsilon _{k}+6\left( 1-\frac{1}{p} \right) M^2(1+2\gamma _{1}^2+\gamma _{2}^2)\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| ^{2}+\left\| z_{k}-z_{k-1} \right\| ^{2}\right. \\&\left. +\left\| z_{k-1}-z_{k-2} \right\| ^{2}+\left\| z_{k-2}-z_{k-3} \right\| ^{2}\right) . \end{aligned}$$
Similar bounds hold for \(\Gamma _k\) due to Jensen’s inequality:
$$\begin{aligned}&\mathbb {E}_k\left( \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| +\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| \right) \\ \le&\sqrt{1-\frac{1}{p} } \left( \left\| \widetilde{\nabla }_x(u_{k-1},y_{k-1})-\nabla _xH(u_{k-1},y_{k-1}) \right\| +\left\| \widetilde{\nabla }_y(x_{k},v_{k-1})-\nabla _yH(x_{k},v_{k-1}) \right\| \right) \\&\!+\!M\sqrt{6(1\!-\!\frac{1}{p})(1\!+\!2\gamma _{1}^2\!+\!\gamma _{2}^2) }\left( \mathbb {E}_k\left\| z_{k+1}\!-\!z_{k} \right\| \!+\!\left\| z_{k}\!-\!z_{k-1} \right\| \!+\!\left\| z_{k-1}\!-\!z_{k-2} \right\| \!+\!\left\| z_{k-2}\!-\!z_{k-3} \right\| \right) . \end{aligned}$$
This completes the proof. \(\square \)
Now, define
$$\begin{aligned} \Upsilon _{k+1}=&\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2+\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2,\nonumber \\ \Gamma _{k+1}=&\left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| +\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| . \end{aligned}$$
(B.5)
By Lemma B.1, we have
$$\begin{aligned}&\mathbb {E}_k\left[ \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| ^2+\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| ^2\right] \\ \le&\Upsilon _k\!+\!V_1\left( \mathbb {E}_k\left\| z_{k+1}\!-\!z_{k} \right\| ^{2}\!+\!\left\| z_{k}\!-\!z_{k-1} \right\| ^{2} \!+\!\left\| z_{k-1}\!-\!z_{k-2} \right\| ^{2}\!+\!\left\| z_{k-2}\!-\!z_{k-3} \right\| ^{2}\right) , \end{aligned}$$
and
$$\begin{aligned}&\mathbb {E}_k\left[ \left\| \widetilde{\nabla }_x(u_k,y_k)-\nabla _xH(u_k,y_k) \right\| +\left\| \widetilde{\nabla }_y(x_{k+1},v_k)-\nabla _yH(x_{k+1},v_k) \right\| \right] \\ \le&\Gamma _k+V_2\left( \mathbb {E}_k\left\| z_{k+1}-z_{k} \right\| +\left\| z_{k}-z_{k-1} \right\| +\left\| z_{k-1}-z_{k-2} \right\| +\left\| z_{k-2}-z_{k-3} \right\| \right) . \end{aligned}$$
This is exactly the MSE bound, where \(V_{1}=6\left( 1-\frac{1}{p} \right) M^2(1+2\gamma _{1}^2+\gamma _{2}^2)\) and
\(V_{2}=M\sqrt{6(1-\frac{1}{p})(1+2\gamma _{1}^2+\gamma _{2}^2) }\).
Lemma B.2
(Geometric decay) Let \(\Upsilon _{k}\) be defined as in (B.5). Then the following geometric decay property holds:
$$\begin{aligned} \mathbb {E}_{k}\Upsilon _{k+1}\!\le \!\left( 1\!-\!\rho \right) \Upsilon _{k}\!+\!V_{\Upsilon }\left( \mathbb {E}_k\left\| z_{k+1}\!-\!z_{k} \right\| ^{2}\!+\!\left\| z_{k}\!-\!z_{k-1} \right\| ^{2} \!+\!\left\| z_{k-1}\!-\!z_{k-2} \right\| ^{2}\!+\!\left\| z_{k-2}\!-\!z_{k-3} \right\| ^{2}\right) , \end{aligned}$$
(B.6)
where \(\rho = \frac{1}{p}\), \(V_{\Upsilon }=6\left( 1-\frac{1}{p} \right) M^2(1+2\gamma _{1}^2+\gamma _{2}^2)\).
Proof
This is a direct result of Lemma B.1. \(\square \)
Lemma B.3
(Convergence of estimator) If \(\left\{ z_k \right\} _{k\in \mathbb {N} }\) satisfies \(\lim _{k \rightarrow \infty } \mathbb {E}\left\| z_{k}\!-\!z_{k-1} \right\| ^{2}\!=0\), then \( \mathbb {E}\Upsilon _k\rightarrow 0\) and \(\mathbb {E}\Gamma _k\rightarrow 0\) as \(k \rightarrow \infty \).
Proof
By (B.6), we have
$$\begin{aligned}&\mathbb {E}\Upsilon _{k}\\ \le&\left( 1\!-\!\rho \right) \mathbb {E}\Upsilon _{k-1}\!+\!V_{\Upsilon }\mathbb {E}\left( \left\| z_{k}\!-\!z_{k-1} \right\| ^{2} \!+\!\left\| z_{k-1}\!-\!z_{k-2} \right\| ^{2}\!+\left\| z_{k-2}\!-z_{k-3} \right\| ^{2}+\left\| z_{k-3}-z_{k-4} \right\| ^{2}\right) \\ \le&V_{\Upsilon }\sum _{l=1}^{k}\left( 1-\rho \right) ^{k-l}\mathbb {E}\left( \left\| z_{l}-z_{l-1} \right\| ^{2} +\left\| z_{l-1}-z_{l-2} \right\| ^{2}+\left\| z_{l-2}-z_{l-3} \right\| ^{2}+\left\| z_{l-3}-z_{l-4} \right\| ^{2}\right) , \end{aligned}$$
which implies \( \mathbb {E}\Upsilon _k\rightarrow 0\) as \(k \rightarrow \infty \). Since \(\Gamma _k^2\le 2\Upsilon _k\), Jensen’s inequality gives \(\mathbb {E}\Gamma _k\le \sqrt{2\mathbb {E}\Upsilon _k}\rightarrow 0\) as \(k \rightarrow \infty \). \(\square \)