SGEM: stochastic gradient with energy and momentum

  • Original Paper
  • Numerical Algorithms

Abstract

In this paper, we propose SGEM, stochastic gradient with energy and momentum, to solve a class of general non-convex stochastic optimization problems, based on the AEGD (adaptive gradient descent with energy) method introduced by Liu and Tian (Numerical Algebra, Control and Optimization, 2023). SGEM incorporates both energy and momentum so as to inherit the advantages of each. We show that SGEM features an unconditional energy stability property and provide a positive lower threshold for the energy variable. We further derive energy-dependent convergence rates in the general non-convex stochastic setting, as well as a regret bound in the online convex setting. Our experimental results show that SGEM converges faster than AEGD and generalizes better than, or at least as well as, SGDM in training some deep neural networks.

Data availability

The data that support the findings of this study are publicly available online at https://www.cs.toronto.edu/~kriz/cifar.html and https://www.image-net.org/.

Notes

  1. Code is available at https://github.com/txping/SGEM

References

  1. Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res. 18(221), 1–51 (2018)

  2. Bottou, L.: Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade, Reloaded ed., Lecture Notes in Computer Science (LNCS), vol. 7700. Springer (2012)

  3. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  4. Chen, X., Liu, S., Sun, R., Hong, M.: On the convergence of a class of Adam-type algorithms for non-convex optimization. International conference on learning representations (2019)

  5. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. Adv. Neural. Inf. Proc. Syst. 27 (2014)

  6. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)

  7. Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. MIT Press (2016)

  8. Hazan, E.: Introduction to online convex optimization. arXiv:1909.05207 (2019)

  9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. (2016)

  10. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. (2017)

  11. Jin, C., Netrapalli, P., Jordan, M.I.: Accelerated gradient descent escapes saddle points faster than gradient descent. In: Proceedings of the 31st Conference on learning theory, vol. 75, pp. 1042–1085. (2018)

  12. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. Adv. Neural Inf. Proc. Syst. 26 (2013)

  13. Karimi, H., Nutini, J., Schmidt, M.: Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In: Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2016), Lecture Notes in Computer Science, vol. 9851, pp. 795–811. Springer (2016)

  14. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv:1412.6980 (2017)

  15. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. University of Toronto (2009)

  16. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–44 (2015)

  17. Lei, L., Ju, C., Chen, J., Jordan, M.I.: Non-convex finite-sum optimization via SCSG methods. Adv. Neural Inf. Proc. Syst. 30 (2017)

  18. Li, X., Orabona, F.: On the convergence of stochastic gradient descent with adaptive stepsizes. In: Proceedings of the 22nd International conference on artificial intelligence and statistics, proceedings of machine learning research, vol. 89, pp. 983–992. (2019)

  19. Liu, H., Tian, X.: AEGD: adaptive gradient descent with energy. Numerical Algebra, Control and Optimization. https://doi.org/10.3934/naco.2023015 (2023)

  20. Liu, H., Tian, X.: Dynamic behavior for a gradient algorithm with energy and momentum. arXiv:2203.12199 (2022)

  21. Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., Han, J.: On the variance of the adaptive learning rate and beyond. International conference on learning representations (2020)

  22. Liu, Y., Gao, Y., Yin, W.: An improved analysis of stochastic gradient descent with momentum. NeurIPS (2020)

  23. Luo, L., Xiong, Y., Liu, Y.: Adaptive gradient methods with dynamic bound of learning rate. International conference on learning representations (2019)

  24. Osher, S., Wang, B., Yin, P., Luo, X., Barekat, F., Pham, M., Lin, A.: Laplacian smoothing gradient descent. arXiv:1806.06317 (2019)

  25. Polyak, B.T.: Some methods of speeding up the convergence of iterative methods. Ž. Vyčisl. Mat i Mat. Fiz. 4, 791–803 (1964)

  26. Qian, N.: On the momentum term in gradient descent learning algorithms. Neural Networks 12(1), 145–151 (1999)

  27. Reddi, S., Kale, S., Kumar, S.: On the convergence of Adam and beyond. International conference on learning representations (2018)

  28. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

  29. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015)

  30. Shapiro, A., Wardi, Y.: Convergence analysis of gradient descent stochastic algorithms. J. Optim. Theory Appl. 91(2), 439–454 (1996)

  31. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 (2015)

  32. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: Proceedings of the 30th International conference on machine learning, vol. 28, pp. 1139–1147 (2013)

  33. Tieleman, T., Hinton, G.: RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw. Mach. Learn. 4(2), 26–31 (2012)

  34. Wilson, A.C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. arXiv:1705.08292 (2018)

  35. Yan, Y., Yang, T., Li, Z., Lin, Q., Yang, Y.: A unified analysis of stochastic momentum methods for deep learning. In: Proceedings of the 27th International joint conference on artificial intelligence, pp. 2955–2961 (2018)

  36. Yang, X.: Linear, first and second-order, unconditionally energy stable numerical schemes for the phase field model of homopolymer blends. J. Comput. Phys. 327, 294–316 (2016)

  37. Yu, H., Jin, R., Yang, S.: On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In: Proceedings of the 36th International conference on machine learning vol. 97, pp. 7184–7193 (2019)

  38. Zaheer, M., Reddi, S., Sachan, D., Kale, S., Kumar, S.: Adaptive methods for nonconvex optimization. Adv. Neural Inf. Process. Syst. 31 (2018)

  39. Zhao, J., Wang, Q., Yang, X.: Numerical approximations for a phase field dendritic crystal growth model based on the invariant energy quadratization approach. Int. J. Numer. Methods Eng. 110, 279–300 (2017)

  40. Zhuang, J., Tang, T., Ding, Y., Tatikonda, S.C., Dvornek, N., Papademetris, X., Duncan, J.: AdaBelief Optimizer: adapting stepsizes by the belief in observed gradients. Adv. Neural Inf. Process. Syst. 33, 18795–18806 (2020)

  41. Zinkevich, M.: Online convex programming and generalized infinitesimal gradient ascent. ICML, pp. 928–935 (2003)

  42. Zou, F., Shen, L., Jie, Z., Zhang, W., Liu, W.: A sufficient condition for convergences of Adam and RMSProp. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11119–11127 (2019)

Funding

This work was supported by the National Science Foundation under Grant DMS1812666.

Author information

Contributions

The authors contributed equally to this work.

Corresponding author

Correspondence to Hailiang Liu.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1. Proof of Theorem 2

For the proofs of Theorems 2 and 3, we introduce the notation

$$\begin{aligned} {\tilde{F}}_t:=\sqrt{f(\theta _t; \xi _t)+c}. \end{aligned}$$
(A.1)

The initial data for \(r_i\) is taken as \(r_{1, i}={\tilde{F}}_1\). We also denote the update rule presented in Algorithm 1 as

$$\begin{aligned} \theta _{t+1}=\theta _{t}-2\eta r_{t+1} v_{t}, \end{aligned}$$
(A.2)

where \(r_{t+1}\) is viewed as an \(n\times n\) diagonal matrix with diagonal entries \(r_{t+1,1},\ldots ,r_{t+1,n}\).
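For concreteness, here is a minimal NumPy sketch of one SGEM step assembled from the relations recalled in this appendix: (A.1), (A.2), the momentum recursion \(m_t=\beta m_{t-1}+(1-\beta )g_t\), the identity \(m_t=2(1-\beta ^t)\tilde{F}_t v_t\), and line 5 of Algorithm 1. The function name, argument names, and default values are ours for illustration only; the authors' released implementation (see the link in the Notes) may differ in details.

```python
import numpy as np

def sgem_step(theta, m, r, grad, loss, t, eta=0.1, beta=0.9, c=1.0):
    """One SGEM update in our notation (a sketch, not the authors' code).

    theta : parameters theta_t (1-D array)
    m     : momentum buffer m_{t-1}
    r     : per-coordinate energy variable r_t (initialized to sqrt(f(theta_1; xi_1) + c))
    grad  : stochastic gradient g_t
    loss  : stochastic loss f(theta_t; xi_t)
    t     : step counter starting at 1
    """
    F_tilde = np.sqrt(loss + c)                    # tilde F_t, cf. (A.1)
    m = beta * m + (1.0 - beta) * grad             # m_t = beta m_{t-1} + (1 - beta) g_t
    v = m / (2.0 * (1.0 - beta**t) * F_tilde)      # from m_t = 2 (1 - beta^t) F~_t v_t
    r = r / (1.0 + 2.0 * eta * v**2)               # Algorithm 1, line 5, rearranged
    theta = theta - 2.0 * eta * r * v              # update rule (A.2)
    return theta, m, r
```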

Lemma 1

Under the assumptions in Theorem 2, we have for all \(t\in [T]\),

  1. (i)

    \(\Vert \nabla f(\theta _t)\Vert _\infty \le G_\infty \).

  2. (ii)

    \(\mathbb {E}[({\tilde{F}}_t)^2]= F^2(\theta _t)=f(\theta _t)+c\).

  3. (iii)

    \(\mathbb {E}[{\tilde{F}}_t]\le F(\theta _t)\). In particular, \(\mathbb {E}[r_{1,i}]= \mathbb {E}[{\tilde{F}}_1]\le F(\theta _1)\) for all \(i\in [n]\).

  4. (iv)

    \(\sigma ^2_g=\mathbb {E}[\Vert g_t-\nabla f(\theta _t)\Vert ^2]\le G^2_\infty \) and \(\sigma ^2_f=\mathbb {E}[|f(\theta _t;\xi _t)- f(\theta _t)|^2]\le B^2.\)

  5. (v)

    \(\mathbb {E}[|F(\theta _t)-{\tilde{F}}_t|]\le \frac{1}{2\sqrt{a}}\sigma _f\).

  6. (vi)

    \(\mathbb {E}[\Vert \nabla F(\theta _t)-\frac{g_t}{2\tilde{F}_t}\Vert ^2]\le \frac{G^2_\infty }{8a^3}\sigma ^2_f+\frac{1}{2a}\sigma ^2_g.\)

Proof

  1. (i)

    By assumption \(\Vert g_t\Vert _\infty \le G_\infty \), we have

    $$\begin{aligned} \Vert \nabla f(\theta _t)\Vert _\infty =\Vert \mathbb {E}[g_t]\Vert _\infty \le \mathbb {E}[\Vert g_t\Vert _\infty ]\le G_\infty . \end{aligned}$$
  2. (ii)

    This follows from the unbiased sampling of

    $$\begin{aligned} f(\theta _t)=\mathbb {E}_{\xi _t}[ f(\theta _t; \xi _t)]. \end{aligned}$$
  3. (iii)

    By Jensen’s inequality, we have

    $$\begin{aligned} \mathbb {E}[\tilde{F}_t] \le \sqrt{\mathbb {E}[\tilde{F}_t^2]}=\sqrt{F(\theta _t)^2}=F(\theta _t). \end{aligned}$$
  4. (iv)

    By assumptions \(\Vert g_t\Vert _\infty \le G_\infty \) and \(f(\theta _t;\xi _t)+c<B\), we have

    $$\begin{aligned} \sigma ^2_g=\mathbb {E}[\Vert g_t-\nabla f(\theta _t)\Vert ^2] = \mathbb {E}[\Vert g_t\Vert ^2] - \Vert \nabla f(\theta _t)\Vert ^2\le G^2_\infty , \end{aligned}$$
    $$\begin{aligned} \sigma ^2_f=\mathbb {E}[|f(\theta _t;\xi _t)-f(\theta _t)|^2] = \mathbb {E}[|f(\theta _t;\xi _t)|^2] - |f(\theta _t)|^2\le B^2. \end{aligned}$$
  5. (v)

    By the assumption \(0<a\le f(\theta _t;\xi _t)+c=\tilde{F}_t^2\), we have

    $$\begin{aligned} \quad \mathbb {E}[|F(\theta _t)-\tilde{F}_t|] \le \mathbb {E}\Bigg [\bigg |\frac{f(\theta _t)-f(\theta _t;\xi _t)}{F(\theta _t)+\tilde{F}_t}\bigg |\Bigg ] \le \frac{1}{2\sqrt{a}}\mathbb {E}[|f(\theta _t)-f(\theta _t;\xi _t)|] \le \frac{1}{2\sqrt{a}}\sigma _f. \end{aligned}$$
  6. (vi)

    By the definition of \(F(\theta )\), we have

    $$\begin{aligned} \Big \Vert \nabla F(\theta _t)-\frac{g_t}{2\tilde{F}_t}\Big \Vert ^2 &= \bigg \Vert \frac{\nabla f(\theta _t)}{2F(\theta _t)}-\frac{g_t}{2\tilde{F}_t}\bigg \Vert ^2\\ &= \frac{1}{4}\bigg \Vert \frac{\nabla f(\theta _t)(\tilde{F}_t-F(\theta _t)) }{F(\theta _t)\tilde{F}_t}+\frac{\nabla f(\theta _t)-g_t}{\tilde{F}_t}\bigg \Vert ^2\\ &\le \frac{1}{2} \bigg \Vert \frac{\nabla f(\theta _t)(\tilde{F}_t-F(\theta _t)) }{F(\theta _t)\tilde{F}_t}\bigg \Vert ^2 + \frac{1}{2}\bigg \Vert \frac{\nabla f(\theta _t)-g_t}{\tilde{F}_t}\bigg \Vert ^2\\ &\le \frac{G^2_\infty }{2a^{2}}|\tilde{F}_t-F(\theta _t)|^2+\frac{1}{2a}\Vert \nabla f(\theta _t)-g_t\Vert ^2, \end{aligned}$$

    where both the gradient bound and the assumption \(0<a\le f(\theta _t;\xi _t)+c=\tilde{F}^2_t\) (which, after taking expectations, also gives \(F^2(\theta _t)\ge a\)) are used. Taking expectations gives

    $$\begin{aligned} \mathbb {E}[\Vert \nabla F(\theta _t)-\frac{g_t}{2\tilde{F}_t}\Vert ^2]\le \frac{G^2_\infty }{2a^{2}}\mathbb {E}[|\tilde{F}_t-F(\theta _t)|^2]+\frac{1}{2a}\mathbb {E}[\Vert \nabla f(\theta _t)-g_t\Vert ^2]. \end{aligned}$$

    Similar to the proof for (iv), we have

    $$\begin{aligned} \mathbb {E}[|\tilde{F}_t-F(\theta _t)|^2]\le \frac{1}{4a}\sigma ^2_f. \end{aligned}$$

    This together with the variance assumption for \(g_t\) gives

    $$\begin{aligned} \mathbb {E}[\Vert \nabla F(\theta _t)-\frac{g_t}{2\tilde{F}_t}\Vert ^2]\le \frac{G^2_\infty }{8a^3}\sigma ^2_f+\frac{1}{2a}\sigma ^2_g. \end{aligned}$$

Lemma 2

For any \(T\ge 1\), we have

  1. (i)

    \(\mathbb {E}\Big [\sum _{t=1}^{T} v_t^\top r_{t+1} v_t\Big ] \le \frac{n F(\theta _1)}{2\eta }\).

  2. (ii)

    \(\mathbb {E}\Big [\sum _{t=1}^{T} m_{t-1}^\top r_{t+1} m_{t-1}\Big ]\le \mathbb {E}\Big [\sum _{t=1}^{T} m_t^\top r_{t+1} m_t\Big ] \le \frac{2Bn F(\theta _1)}{\eta }\).

  3. (iii)

    \(\mathbb {E}\Big [\sum _{t=1}^{T}\Vert r_{t+1}m_t\Vert ^2\Big ] \le \frac{2Bn F^2(\theta _1)}{\eta }\).

  4. (iv)

    \(\mathbb {E}\Big [\sum _{t=1}^{T} g_t^\top r_{t+1} g_t\Big ] \le \frac{8Bn F(\theta _1)}{(1-\beta )^2\eta }\).

  5. (v)

    \(\mathbb {E}\Big [\sum _{t=1}^{T}\Vert r_{t+1}g_t\Vert ^2\Big ] \le \frac{8Bn F^2(\theta _1)}{(1-\beta )^2\eta }\).

Proof

From Algorithm 1 line 5, we have

$$\begin{aligned} r_{t,i}-r_{t+1,i} = 2\eta r_{t+1,i}v^2_{t,i}. \end{aligned}$$

Taking summation over t from 1 to T gives

$$\begin{aligned} r_{1,i} - r_{T+1,i} = 2\eta \sum \limits _{t=1}^{T} r_{t+1,i}v^2_{t,i} \quad \Rightarrow \quad \sum \limits _{t=1}^{T} r_{t+1,i}v^2_{t,i}\le \frac{r_{1,i}}{2\eta }. \end{aligned}$$

From this we get

$$\begin{aligned} \sum _{t=1}^{T}v_t^\top r_{t+1}v_t=\sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}r_{t+1,i}v^2_{t,i}\le \frac{n\tilde{F}_1}{2\eta }. \end{aligned}$$

Taking expectation and using (iii) in Lemma 1 gives (i). Recalling that \(m_t=2(1-\beta ^t)\tilde{F}_tv_t\) and \(\tilde{F}_t\le \sqrt{B}\), we further get

$$\begin{aligned} \sum _{t=1}^{T} m_t^\top r_{t+1} m_t \le 4B \sum \limits _{t=1}^{T} v_t^\top r_{t+1}v_t \le \frac{2Bn\tilde{F}_1}{\eta }. \end{aligned}$$

Using \(r_{t+1,i}\le r_{t,i}\) and \(m_{0,i}=0\), we also have

$$\begin{aligned} \sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}r_{t+1,i}m^2_{t-1,i} \le \sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}r_{t,i}m^2_{t-1,i} = \sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T-1}r_{t+1,i}m^2_{t,i}\le \sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}r_{t+1,i}m^2_{t,i}. \end{aligned}$$
(A.3)

Connecting the above two inequalities and taking expectation gives (ii). Using \(r_{t+1,i}\le r_{1,i}\), the above inequality further implies

$$\begin{aligned} \sum _{t=1}^{T}\Vert r_{t+1}m_t\Vert ^2 &= \sum \limits _{i=1}^n\sum \limits _{t=1}^{T} r^2_{t+1, i}m_{t, i}^2 \le \sum \limits _{i=1}^n\sum \limits _{t=1}^{T} r_{1,i}\, r_{t+1, i}m_{t, i}^2 \\ &= \bigg (\sum \limits _{i=1}^n\sum \limits _{t=1}^{T} r_{t+1, i}m_{t, i}^2\bigg )\tilde{F}_1 \le \frac{2Bn\tilde{F}^2_1}{\eta }. \end{aligned}$$

Taking expectation and using (ii) in Lemma 1 gives (iii). By \(m_t=\beta m_{t-1}+(1-\beta )g_t\), we have

$$\begin{aligned} \sum \limits _{t=1}^{T} g_t^\top r_{t+1} g_t &= \sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}r_{t+1,i}g^2_{t,i} = \sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}r_{t+1,i}\bigg (\frac{1}{1-\beta }m_{t,i}-\frac{\beta }{1-\beta }m_{t-1,i}\bigg )^2\\ &\le \frac{2}{(1-\beta )^2}\sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}r_{t+1,i}m^2_{t,i} + \frac{2\beta ^2}{(1-\beta )^2}\sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}r_{t+1,i}m^2_{t-1,i}\\ &\le \frac{2(1+\beta ^2)}{(1-\beta )^2}\sum \limits _{t=1}^{T} m_t^\top r_{t+1} m_t\le \frac{8Bn\tilde{F}_1}{(1-\beta )^2\eta }. \end{aligned}$$

Here, the third inequality is by \((a+b)^2\le 2a^2+2b^2\); (A.3) and \(0<\beta <1\) are used in the fourth inequality. Taking expectation and using (iii) in Lemma 1 gives (iv). Similarly to the derivation for (iii), we have

$$\begin{aligned} \sum \limits _{t=1}^{T}\Vert r_{t+1}g_t\Vert ^2 \le \bigg (\sum \limits _{i=1}^n\sum _{t=1}^{T} r_{t, i}g_{t, i}^2\bigg )\tilde{F}_1 \le \frac{8Bn\tilde{F}^2_1}{(1-\beta )^2\eta }. \end{aligned}$$

Taking expectation and using (ii) in Lemma 1 gives (v).\(\square \)
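As a quick sanity check of the telescoping argument above, the following short script (ours, not part of the paper) runs the energy recursion \(r_{t+1,i}=r_{t,i}/(1+2\eta v_{t,i}^2)\) on random inputs and verifies numerically that \(\sum _{t=1}^{T} r_{t+1,i}v_{t,i}^2\le r_{1,i}/(2\eta )\), which is the bound used to prove (i).

```python
import numpy as np

rng = np.random.default_rng(0)
eta, T, n = 0.1, 1000, 5

r = np.full(n, 2.0)          # r_{1,i}: any positive initial energy
r1 = r.copy()
acc = np.zeros(n)            # accumulates sum_t r_{t+1,i} v_{t,i}^2

for _ in range(T):
    v = rng.normal(size=n)                # an arbitrary v_t
    r = r / (1.0 + 2.0 * eta * v**2)      # r_{t+1,i} = r_{t,i} / (1 + 2 eta v_{t,i}^2)
    acc += r * v**2

# telescoping: 2 * eta * acc = r_1 - r_{T+1} <= r_1
assert np.all(acc <= r1 / (2.0 * eta) + 1e-12)
```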

We are now ready to prove Theorem 2. First, note that by (iv) in Lemma 1, \(\max \{\sigma _g,\sigma _f\}\le \max \{G_\infty ,B\}\).

Recall that \(F(\theta )=\sqrt{f(\theta )+c}\). Then for any \(x, y\in \{\theta _t\}_{t=0}^T\), we have

$$\begin{aligned} \Vert \nabla F(x)-\nabla F(y)\Vert &= \bigg \Vert \frac{\nabla f(x)}{2F(x)}-\frac{\nabla f(y)}{2F(y)}\bigg \Vert \\ &= \frac{1}{2}\bigg \Vert \frac{\nabla f(x)(F(y)-F(x))}{F(x)F(y)} + \frac{\nabla f(x)-\nabla f(y)}{F(y)}\bigg \Vert \\ &\le \frac{G_\infty }{2(F(\theta ^*))^2}|F(y)-F(x)| + \frac{1}{2F(\theta ^*)}\Vert \nabla f(x)-\nabla f(y)\Vert . \end{aligned}$$

One may check that

$$\begin{aligned} |F(y)-F(x)|\le \frac{G_\infty }{2F(\theta ^*)}\Vert x-y\Vert . \end{aligned}$$
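This can be checked, for instance, via the mean value theorem (our elaboration, using the same gradient bound as in the display above): for some \(z\) on the segment joining \(x\) and \(y\),

$$\begin{aligned} |F(y)-F(x)| = \frac{|f(y)-f(x)|}{F(y)+F(x)} \le \frac{|\nabla f(z)^\top (y-x)|}{2F(\theta ^*)} \le \frac{G_\infty }{2F(\theta ^*)}\Vert x-y\Vert , \end{aligned}$$

since \(F(\theta )\ge F(\theta ^*)\) for all \(\theta \).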

These together with the L-smoothness of f lead to

$$\begin{aligned} \Vert \nabla F(x)-\nabla F(y)\Vert \le L_F \Vert x-y\Vert , \end{aligned}$$

where

$$\begin{aligned} L_F=\frac{1}{2\sqrt{f(\theta ^*)+c}} \left( L+ \frac{G^2_\infty }{2(f(\theta ^*)+c)}\right) . \end{aligned}$$

This confirms the \(L_F\)-smoothness of F, which yields

$$\begin{aligned} F(\theta _{t+1}) - F(\theta _t) &\le \nabla F(\theta _t)^\top (\theta _{t+1}-\theta _t) +\frac{L_F}{2}\Vert \theta _{t+1}-\theta _t\Vert ^2 \\ &= \Big (\nabla F(\theta _t)-\frac{g_t}{2\tilde{F}_t}\Big )^\top (\theta _{t+1}-\theta _t) + \Big (\frac{g_t}{2\tilde{F}_t}-\frac{1-\beta ^t}{1-\beta }v_t\Big )^\top (\theta _{t+1}-\theta _t)\\ &\quad +\Big (\frac{1-\beta ^t}{1-\beta }v_t\Big )^\top (\theta _{t+1}-\theta _t) +\frac{L_F}{2}\Vert \theta _{t+1}-\theta _t\Vert ^2. \end{aligned}$$

Summing the above over \(t\) from 1 to \(T\) and taking the expectation gives

$$\begin{aligned} \mathbb {E}[F(\theta _{T+1})-F(\theta _{1})]\le \sum \limits _{i=1}^{4} S_i, \end{aligned}$$
(A.4)

where

$$\begin{aligned} S_1 &= \mathbb {E}\Bigg [\sum _{t=1}^{T}\frac{1-\beta ^t}{1-\beta }v_{t}^\top (\theta _{t+1}-\theta _t)\Bigg ],\\ S_2 &= \mathbb {E}\Bigg [\sum _{t=1}^{T}\Big (\frac{g_t}{2\tilde{F}_t}-\frac{1-\beta ^t}{1-\beta }v_t\Big )^\top (\theta _{t+1}-\theta _t)\Bigg ],\\ S_3 &= \mathbb {E}\Bigg [\sum _{t=1}^{T}\Big (\nabla F(\theta _t)-\frac{g_t}{2\tilde{F}_t}\Big )^\top (\theta _{t+1}-\theta _t)\Bigg ],\\ S_4 &= \mathbb {E}\Bigg [\sum _{t=1}^{T}\frac{L_F}{2}\Vert \theta _{t+1}-\theta _t\Vert ^2\Bigg ]. \end{aligned}$$

Below, we bound \(S_1, S_2, S_3, S_4\) separately. To bound \(S_1\), we first note that

$$\begin{aligned} r_{t+1,i}-r_{t,i}=-2\eta r_{t+1,i}v^2_{t,i} = v_{t,i}(-2\eta r_{t+1,i}v_{t,i})=v_{t,i}(\theta _{t+1,i}-\theta _{t,i}), \end{aligned}$$

from which we get

$$\begin{aligned} S_1 &= \mathbb {E}\Bigg [\sum _{t=1}^{T}\frac{1-\beta ^{t}}{1-\beta }v_{t}^{\top } (\theta _{t+1}-\theta _{t})\Bigg ]\\ &= \mathbb {E}\Bigg [\sum _{i=1}^{n}\sum _{t=1}^{T} \frac{1-\beta ^t}{1-\beta } (r_{t+1,i}-r_{t,i})\Bigg ]\\ &\le \mathbb {E}\Bigg [\sum _{i=1}^{n}\sum _{t=1}^{T} (r_{t+1,i}-r_{t,i})\Bigg ]\quad (\text {since}~ r_{t+1,i}\le r_{t,i})\\ &= \sum _{i=1}^{n}\mathbb {E}[r_{T+1,i}]-n\mathbb {E}[\tilde{F}_1]. \end{aligned}$$

For \(S_2\), we have

$$\begin{aligned} S_2 &= \mathbb {E}\Bigg [\sum \limits _{t=1}^{T}\Big (\frac{g_t}{2\tilde{F}_t}-\frac{1-\beta ^t}{1-\beta }v_t\Big )^\top (\theta _{t+1}-\theta _t)\Bigg ]\\ &= \mathbb {E}\Bigg [\sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}\Big (-\frac{1}{2\tilde{F}_t}\frac{\beta }{1-\beta }m_{t-1,i}\Big )(-2\eta r_{t+1,i}v_{t,i})\Bigg ]\\ &\le \frac{\beta \eta }{(1-\beta )\sqrt{a}}\mathbb {E}\Bigg [\bigg |\sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}r_{t+1,i}m_{t-1,i}v_{t,i}\bigg |\Bigg ]\\ &\le \frac{\beta \eta }{(1-\beta )\sqrt{a}}\mathbb {E}\Bigg [\bigg (\sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}r_{t+1,i}m^2_{t-1,i}\bigg )^{1/2}\bigg (\sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}r_{t+1,i}v^2_{t,i}\bigg )^{1/2}\Bigg ]\\ &\le \frac{\beta \sqrt{B}nF(\theta _1)}{(1-\beta )\sqrt{a}}, \end{aligned}$$

where the fourth inequality is by the Cauchy-Schwarz inequality, and the last inequality follows from Lemma 2 (i) and (ii).

For \(S_3\), by the Cauchy-Schwarz inequality, we have

$$\begin{aligned} S_3 &= \mathbb {E}\Bigg [\sum \limits _{t=1}^{T}\Big (\nabla F(\theta _t)-\frac{g_t}{2\tilde{F}_t}\Big )^\top (\theta _{t+1}-\theta _t)\Bigg ]\\ &\le \mathbb {E}\Bigg [\sum \limits _{t=1}^{T}\Big \Vert \nabla F(\theta _t)-\frac{g_t}{2\tilde{F}_t}\Big \Vert \,\Vert \theta _{t+1}-\theta _t\Vert \Bigg ]\\ &\le \mathbb {E}\Bigg [\bigg (\sum \limits _{t=1}^{T}\Big \Vert \nabla F(\theta _t)-\frac{g_t}{2\tilde{F}_t}\Big \Vert ^2\bigg )^{1/2}\bigg (\sum \limits _{t=1}^{T}\Vert \theta _{t+1}-\theta _t\Vert ^2\bigg )^{1/2}\Bigg ]\\ &\le \Bigg (\mathbb {E}\Bigg [\sum \limits _{t=1}^{T}\Big \Vert \nabla F(\theta _t)-\frac{g_t}{2\tilde{F}_t}\Big \Vert ^2\Bigg ]\Bigg )^{1/2}\Bigg (\mathbb {E}\Bigg [\sum \limits _{t=1}^{T}\Vert \theta _{t+1}-\theta _t\Vert ^2\Bigg ]\Bigg )^{1/2}\\ &\le F(\theta _1)\sqrt{\eta n T}\sqrt{\frac{G^2_\infty }{8a^3}\sigma ^2_f+\frac{1}{2a}\sigma ^2_g}, \end{aligned}$$

where the last inequality is by (vi) in Lemma 1 and (4.2) in Theorem 1.

For \(S_4\), also by (4.2) in Theorem 1, we have

$$\begin{aligned} S_4 = \frac{L_F}{2}\mathbb {E}\Bigg [\sum _{t=1}^{T}\Vert \theta _{t+1}-\theta _t\Vert ^2\Bigg ] \le \frac{L_F\eta n F^2(\theta _1)}{2} . \end{aligned}$$

With the above bounds on \(S_1, S_2, S_3, S_4\), (A.4) can be rearranged as

$$\begin{aligned} &\quad F(\theta ^*) - \frac{\beta \sqrt{B}nF(\theta _1)}{(1-\beta )\sqrt{a}} -F(\theta _1)\sqrt{\eta n T}\sqrt{\frac{G^2_\infty }{4a^3}\sigma ^2_f+\frac{1}{a}\sigma ^2_g}- \frac{L_F\eta n F^2(\theta _1)}{2}\\ &\le \sum \limits _{i=1}^{n}\mathbb {E}[r_{T+1,i}] - n \mathbb {E}[\tilde{F}_1] + F(\theta _1) \\ &\le \Big (\min \limits _i \mathbb {E}[r_{T+1,i}]+(n-1)\mathbb {E}[\tilde{F}_1]\Big )- (n-1)\mathbb {E}[\tilde{F}_1] + \Big (F(\theta _1)-\mathbb {E}[\tilde{F}_1]\Big )\\ &\le \min \limits _i\mathbb {E}[r_{T+1,i}]+ \mathbb {E}[|F(\theta _1)-\tilde{F}_1|] \\ &\le \min \limits _i\mathbb {E}[r_{T+1,i}]+ \frac{1}{2\sqrt{a}}\sigma _f, \end{aligned}$$

where (v) in Lemma 1 was used in the last inequality. Hence,

$$\begin{aligned} \min \limits _i\mathbb {E}[r_{T,i}]\ge \max \{F(\theta ^*)-\eta D_1-\beta D_2-\sigma D_3,0\}, \end{aligned}$$

where \(\sigma =\max \{\sigma _f,\sigma _g\}\) and

$$\begin{aligned} D_1 &= \frac{L_F n F^2(\theta _1)}{2}, \qquad D_2 =\frac{\sqrt{B}nF(\theta _1)}{(1-\beta )\sqrt{a}},\\ D_3 &= \frac{1}{2\sqrt{a}} + F(\theta _1)\sqrt{\eta n T}\sqrt{\frac{G^2_\infty }{4a^3}+\frac{1}{a}}. \end{aligned}$$

In the case \(\sigma =0\), we obtain the stated estimate in Theorem 2.

Appendix 2. Proof of Theorem 3

The upper bound on \(\sigma _g\) is given by (iv) in Lemma 1. Since f is L-smooth, we have

$$\begin{aligned} f(\theta _{t+1})\le f(\theta _t)+\nabla f(\theta _t)^\top (\theta _{t+1}-\theta _t)+\frac{L}{2}\Vert \theta _{t+1}-\theta _t\Vert ^2. \end{aligned}$$
(B.1)

Denoting \(\eta _t=\eta /\tilde{F}_t\), the second term in the RHS of (B.1) can be expressed as

$$\begin{aligned} &\quad \nabla f(\theta _t)^\top (\theta _{t+1}-\theta _t)\\ &= \nabla f(\theta _t)^\top (-2\eta r_{t+1}v_{t})\\ &= -\frac{1}{1-\beta ^t}\nabla f(\theta _t)^\top \eta _t r_{t+1}m_{t}\quad (\text {since}\;m_t=2(1-\beta ^t)\tilde{F}_t v_t)\\ &= -\frac{1}{1-\beta ^t}\nabla f(\theta _t)^\top \eta _t r_{t+1}(\beta m_{t-1}+(1-\beta )g_t) \\ &= -\frac{1-\beta }{1-\beta ^t}\nabla f(\theta _t)^\top \eta _t r_{t+1}g_{t} - \frac{\beta }{1-\beta ^t}\nabla f(\theta _t)^\top \eta _t r_{t+1}m_{t-1}\\ &= -\frac{1-\beta }{1-\beta ^t}\nabla f(\theta _t)^\top \eta _{t-1} r_{t}g_{t} + \frac{1-\beta }{1-\beta ^t}\nabla f(\theta _t)^\top (\eta _{t-1} r_{t}-\eta _t r_{t+1}) g_{t}\\ &\quad -\frac{\beta }{1-\beta ^t}\nabla f(\theta _t)^\top \eta _t r_{t+1}m_{t-1}. \end{aligned}$$
(B.2)

We further bound the second term and third term in the RHS of (B.2), respectively. For the second term, we note that \(|\frac{1-\beta }{1-\beta ^t}|\le 1\) and

$$\begin{aligned} &\quad |\nabla f(\theta _t)^\top (\eta _{t-1}r_{t}-\eta _tr_{t+1}) g_{t}|\\ &= |\nabla f(\theta _t)^\top \eta _{t-1}(r_{t}-r_{t+1}) g_{t}+ \nabla f(\theta _t)^\top (\eta _{t-1}-\eta _t)r_{t+1}g_{t}|\\ &= |\nabla f(\theta _t)^\top \eta _{t-1}(r_{t}-r_{t+1}) g_{t}+ (\eta _{t-1}-\eta _t)g_t^\top r_{t+1}g_{t} + (\eta _{t-1}-\eta _t)(\nabla f(\theta _t)-g_t)^\top r_{t+1}g_{t}|\\ &\le \Vert \nabla f(\theta _t)\Vert _\infty |\eta _{t-1}|\,\Vert r_{t}-r_{t+1}\Vert _{1,1} \Vert g_{t}\Vert _\infty + |\eta _{t-1}-\eta _t|\,g_t^\top r_{t+1}g_{t} + |\eta _{t-1}-\eta _t|\,|(\nabla f(\theta _t)-g_t)^\top r_{t+1}g_{t}|\\ &\le (\eta G^2_\infty /\sqrt{a})(\Vert r_t\Vert _{1,1}-\Vert r_{t+1}\Vert _{1,1})+(2\eta /\sqrt{a})\,g_t^\top r_{t+1}g_{t} + (2\eta /\sqrt{a})\,|(\nabla f(\theta _t)-g_t)^\top r_{t+1}g_{t}|. \end{aligned}$$
(B.3)

The third inequality holds because, for a positive diagonal matrix \(A\), \(x^\top Ay\le \Vert x\Vert _\infty \Vert A\Vert _{1,1}\Vert y\Vert _\infty \), where \(\Vert A\Vert _{1,1}=\sum _{i}a_{ii}\). The last inequality follows from \(r_{t+1,i}\le r_{t,i}\) for \(i\in [n]\), the assumption \(\Vert g_t\Vert _\infty \le G_\infty \), the bound \(\tilde{F}_t\ge \sqrt{a}\), and (i) in Lemma 1.

For the third term in the RHS of (B.2), we note that

$$\begin{aligned} -\frac{\beta }{1-\beta ^t} \nabla f(\theta _t)^\top \eta _t r_{t+1}m_{t-1} \le \frac{\beta \eta }{(1-\beta )\sqrt{a}}|\nabla f(\theta _t)^\top r_{t+1}m_{t-1}|, \end{aligned}$$

in which

$$\begin{aligned} &\quad |\nabla f(\theta _t)^\top r_{t+1}m_{t-1}|\\ &= |g_{t}^\top r_{t+1}m_{t-1}+(\nabla f(\theta _t)-g_t)^\top r_{t+1}m_{t-1}|\\ &\le \frac{1}{2}g_t^\top r_{t+1}g_t+\frac{1}{2}m_{t-1}^\top r_{t+1}m_{t-1}+ |(\nabla f(\theta _t)-g_t)^\top r_{t+1}m_{t-1}|, \end{aligned}$$
(B.4)

where the last inequality is because for a positive diagonal matrix A, \(x^\top Ay\le \frac{1}{2}x^\top Ax+\frac{1}{2}y^\top Ay\). Substituting (B.3) and (B.4) into (B.2), we get

$$\begin{aligned} \nabla f(\theta _t)^\top (\theta _{t+1}-\theta _t) &\le -\frac{1-\beta }{1-\beta ^t}\nabla f(\theta _t)^\top \eta _{t-1}r_{t} g_{t} + \frac{\eta G^2_\infty }{\sqrt{a}}(\Vert r_{t}\Vert _{1,1}-\Vert r_{t+1}\Vert _{1,1})\\ &\quad +\bigg (\frac{2\eta }{\sqrt{a}}+\frac{\beta \eta }{2(1-\beta )\sqrt{a}}\bigg )g_t^\top r_{t+1}g_t + \frac{\beta \eta }{2(1-\beta )\sqrt{a}}m_{t-1}^\top r_{t+1}m_{t-1}\\ &\quad +\frac{2\eta }{\sqrt{a}}|(\nabla f(\theta _t)-g_t)^\top r_{t+1}g_{t}|+\frac{\beta \eta }{(1-\beta )\sqrt{a}}|(\nabla f(\theta _t)-g_t)^\top r_{t+1}m_{t-1}|. \end{aligned}$$
(B.5)

With (B.5), we take a conditional expectation of (B.1) given \(\theta _t\) and rearrange to get

$$\begin{aligned} &\quad \frac{1-\beta }{1-\beta ^t}\nabla f(\theta _t)^\top \eta _{t-1}r_{t}\nabla f(\theta _t) =\mathbb {E}_{\xi _t}\bigg [\frac{1-\beta }{1-\beta ^t}\nabla f(\theta _t)^\top \eta _{t-1}r_{t}g_t\bigg ]\\ &\le \mathbb {E}_{\xi _t}\Bigg [f(\theta _t)-f(\theta _{t+1})+ \frac{\eta G^2_\infty }{\sqrt{a}}(\Vert r_{t}\Vert _{1,1}-\Vert r_{t+1}\Vert _{1,1})\\ &\quad \quad +\bigg (\frac{2\eta }{\sqrt{a}}+\frac{\beta \eta }{2(1-\beta )\sqrt{a}}\bigg )g_t^\top r_{t+1}g_t + \frac{\beta \eta }{2(1-\beta )\sqrt{a}}m_{t-1}^\top r_{t+1}m_{t-1}\\ &\quad \quad +\frac{2\eta }{\sqrt{a}}|(\nabla f(\theta _t)-g_t)^\top r_{t+1}g_{t}|\\ &\quad \quad +\frac{\beta \eta }{(1-\beta )\sqrt{a}}|(\nabla f(\theta _t)-g_t)^\top r_{t+1}m_{t-1}|+\frac{L}{2}\Vert \theta _{t+1}-\theta _t\Vert ^2\Bigg ], \end{aligned}$$
(B.6)

where the assumption \(\mathbb {E}_{\xi _t}[g_t]=\nabla f(\theta _t)\) is used in the first equality. Since \(\xi _1,\ldots ,\xi _T\) are independent random variables, we set \(\mathbb {E}=\mathbb {E}_{\xi _1}\mathbb {E}_{\xi _2}\cdots \mathbb {E}_{\xi _T}\) and sum (B.6) over \(t\) from 1 to \(T\) to get

$$\begin{aligned} &\quad \mathbb {E}\Bigg [\sum \limits _{t=1}^{T}\frac{1-\beta }{1-\beta ^t}\nabla f(\theta _t)^\top \eta _{t-1}r_{t}\nabla f(\theta _t)\Bigg ]\\ &\le \mathbb {E}\Big [f(\theta _1)-f(\theta _{T+1})\Big ] + \frac{\eta G^2_\infty }{\sqrt{a}}\mathbb {E}\Big [\Vert r_{1}\Vert _{1,1}-\Vert r_{T+1}\Vert _{1,1}\Big ]\\ &\quad + \bigg (\frac{2\eta }{\sqrt{a}}+\frac{\beta \eta }{2(1-\beta )\sqrt{a}}\bigg )\mathbb {E}\Bigg [\sum \limits _{t=1}^{T}g_t^\top r_{t+1}g_t\Bigg ] + \frac{\beta \eta }{2(1-\beta )\sqrt{a}}\mathbb {E}\Bigg [\sum \limits _{t=1}^{T}m_{t-1}^\top r_{t}m_{t-1}\Bigg ]\\ &\quad +\frac{2\eta }{\sqrt{a}}\mathbb {E}\Bigg [\sum \limits _{t=1}^{T}|(\nabla f(\theta _t)-g_t)^\top r_{t+1}g_{t}|\Bigg ]\\ &\quad +\frac{\beta \eta }{(1-\beta )\sqrt{a}}\mathbb {E}\Bigg [\sum \limits _{t=1}^{T}|(\nabla f(\theta _t)-g_t)^\top r_{t+1}m_{t-1}|\Bigg ]+\frac{L}{2}\mathbb {E}\Bigg [\sum \limits _{t=1}^{T}\Vert \theta _{t+1}-\theta _{t}\Vert ^2\Bigg ]. \end{aligned}$$
(B.7)

Below, we bound each term in (B.7) separately. By the Cauchy-Schwarz inequality, we get

$$\begin{aligned} &\quad \mathbb {E}\Bigg [\sum \limits _{t=1}^{T}|(\nabla f(\theta _t)-g_t)^\top r_{t+1}m_{t-1}|\Bigg ]\\ &\le \mathbb {E}\Bigg [\sum \limits _{t=1}^{T}\Vert \nabla f(\theta _t)-g_t\Vert \Vert r_{t+1}m_{t-1}\Vert \Bigg ]\\ &\le \mathbb {E}\Bigg [\bigg (\sum \limits _{t=1}^{T}\Vert \nabla f(\theta _t)-g_t\Vert ^2\bigg )^{1/2}\bigg (\sum \limits _{t=1}^{T}\Vert r_{t+1}m_{t-1}\Vert ^2\bigg )^{1/2}\Bigg ]\\ &\le \Bigg (\mathbb {E}\Bigg [\sum \limits _{t=1}^{T}\Vert \nabla f(\theta _t)-g_t\Vert ^2\Bigg ]\Bigg )^{1/2}\Bigg (\mathbb {E}\Bigg [\sum \limits _{t=1}^{T}\Vert r_{t+1}m_{t-1}\Vert ^2\Bigg ]\Bigg )^{1/2}\\ &\le \sqrt{2BnT/\eta }\,F(\theta _1)\sigma _g, \end{aligned}$$
(B.8)

where Lemma 2 (iii) and the bounded variance assumption were used. Replacing \(m_{t-1}\) in (B.8) by \(g_t\) and using Lemma 2 (v), we get

$$\begin{aligned} &\quad \mathbb {E}\Bigg [\sum \limits _{t=1}^{T}|(\nabla f(\theta _t)-g_t)^\top r_{t+1}g_{t}|\Bigg ] \\ &\le \Bigg (\mathbb {E}\Bigg [\sum \limits _{t=1}^{T}\Vert \nabla f(\theta _t)-g_t\Vert ^2\Bigg ]\Bigg )^{1/2}\Bigg (\mathbb {E}\Bigg [\sum \limits _{t=1}^{T}\Vert r_{t+1}g_{t}\Vert ^2\Bigg ]\Bigg )^{1/2}\\ &\le \frac{2\sqrt{2BnT/\eta }\,F(\theta _1)\sigma _g}{1-\beta }. \end{aligned}$$
(B.9)

By (4.2), the last term in (B.7) is bounded above by

$$\begin{aligned} \frac{L}{2}\mathbb {E}\left[ \sum _{t=0}^\infty \Vert \theta _{t+1}-\theta _t\Vert ^2 \right] \le \frac{L\eta n}{2}F^2(\theta _1). \end{aligned}$$
(B.10)

Substituting Lemma 1 (iii), Lemma 2 (ii) and (iv), and (B.8)–(B.10) into (B.7), we get

$$\begin{aligned} \mathbb {E}\Bigg [\sum \limits _{t=1}^{T}\frac{1-\beta }{1-\beta ^t}\nabla f(\theta _t)^\top \eta _{t-1} r_{t}\nabla f(\theta _t)\Bigg ] &\le (f(\theta _1)-f^*)+\frac{\eta G^2_\infty }{\sqrt{a}}nF(\theta _1)\\ &\quad +\bigg (\frac{2}{\sqrt{a}}+\frac{\beta }{2(1-\beta )\sqrt{a}}\bigg )\frac{8BnF(\theta _1)}{(1-\beta )^2}+\frac{\beta BnF(\theta _1)}{(1-\beta )\sqrt{a}}\\ &\quad +\frac{(4+\beta )\sqrt{2B\eta }}{(1-\beta )\sqrt{a}}F(\theta _1)\sqrt{nT}\sigma _g+\frac{L\eta n}{2}F^2(\theta _1). \end{aligned}$$
(B.11)

Note that the left hand side is bounded from below by

$$\begin{aligned} (1-\beta )\frac{\eta }{\sqrt{B}} \mathbb {E}\Bigg [\min \limits _ir_{T,i}\sum \limits _{t=1}^{T}\Vert \nabla f(\theta _t)\Vert ^2\Bigg ], \end{aligned}$$

where we used \(|\frac{1-\beta }{1-\beta ^t}|\ge 1-\beta \) and \(\eta _t\ge \eta /\sqrt{B}\). Thus, we have

$$\begin{aligned} \mathbb {E}\Bigg [\min \limits _ir_{T,i}\sum \limits _{t=1}^{T}\Vert \nabla f(\theta _t)\Vert ^2\Bigg ] \le \frac{C_1+C_2n+C_3\sigma _g \sqrt{ nT}}{\eta }, \end{aligned}$$
(B.12)

where

$$\begin{aligned} C_1 &= \frac{(f(\theta _1)-f^*)\sqrt{B}}{1-\beta },\\ C_2 &= \frac{\sqrt{B}\eta G^2_\infty F(\theta _1)}{(1-\beta )\sqrt{a}} +\bigg (\frac{2}{\sqrt{a}}+\frac{\beta }{2(1-\beta )\sqrt{a}}\bigg )\frac{8B^{3/2}F(\theta _1)}{(1-\beta )^3} +\frac{\beta B^{3/2}F(\theta _1)}{(1-\beta )^2\sqrt{a}}+\frac{\sqrt{B}L\eta }{2(1-\beta )^2}F^2(\theta _1),\\ C_3 &= \frac{(4+\beta )B\sqrt{2\eta }}{(1-\beta )\sqrt{a}}F(\theta _1). \end{aligned}$$

By the Hölder inequality, we have for any \(\alpha \in (0, 1)\),

$$\begin{aligned} \mathbb {E}[X^\alpha ]\le \mathbb {E}[XY]^{\alpha } \mathbb {E}[Y^{-\alpha /(1-\alpha )}]^{1-\alpha }. \end{aligned}$$
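This form follows from the standard Hölder inequality with exponents \(1/\alpha \) and \(1/(1-\alpha )\) applied to the factorization \(X^\alpha =(XY)^\alpha Y^{-\alpha }\) (a one-line check we include for completeness):

$$\begin{aligned} \mathbb {E}[X^\alpha ]=\mathbb {E}\big [(XY)^\alpha \, Y^{-\alpha }\big ] \le \big (\mathbb {E}[XY]\big )^{\alpha }\big (\mathbb {E}[Y^{-\alpha /(1-\alpha )}]\big )^{1-\alpha }. \end{aligned}$$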

Taking \(X=\Delta :=\sum _{t=1}^{T}\Vert \nabla f(\theta _t)\Vert ^2\) and \(Y=\min _ir_{T,i}\), we obtain

$$\begin{aligned} \mathbb {E}[\Delta ^\alpha ]\le \mathbb {E}[\min _ir_{T,i}\Delta ]^\alpha \mathbb {E}[(\min _ir_{T,i})^{-\alpha /(1-\alpha )}]^{1-\alpha }. \end{aligned}$$

Using (B.12), and lower bounding \(\mathbb {E}[\Delta ^\alpha ]\) by \(T^{\alpha }\mathbb {E}[\min _{1\le t\le T}\Vert \nabla f(\theta _t)\Vert ^{2\alpha }]\), we obtain

$$\begin{aligned} \mathbb {E}\Bigg [\min _{1\le t \le T}\Vert \nabla f(\theta _t)\Vert ^{2\alpha } \Bigg ] \le \left( \frac{C_1+C_2n+C_3\sigma _g \sqrt{ nT}}{\eta T } \right) ^\alpha \mathbb {E}[(\min _ir_{T,i})^{-\alpha /(1-\alpha )}]^{1-\alpha }. \end{aligned}$$

Taking \(\alpha =1-\epsilon \) yields the stated bound.

Appendix 3. Proof of Theorem 4

Using the same argument as for (iv) in Lemma 2, we have

$$\begin{aligned} \sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}r_{t+1,i}g^2_{t,i}\le \frac{8Bn\sqrt{f_1(\theta _1)+c}}{(1-\beta )^2\eta }. \end{aligned}$$

With this estimate and the convexity of \(f_t\), the regret can be bounded by

$$\begin{aligned} R(T) &= \sum \limits _{t=1}^{T}\big (f_t(\theta _t)-f_t(\theta ^*)\big ) \le \sum \limits _{t=1}^{T} g_t^\top (\theta _t-\theta ^*)\\ &\le \sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T}|g_{t,i}|\sqrt{r_{t+1,i}}\, \frac{|\theta _{t,i}-\theta ^*_i|}{\sqrt{r_{t+1,i}}}\\ &\le \left( \sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T} r_{t+1, i} g_{t, i}^2 \right) ^{1/2}\left( \sum \limits _{i=1}^{n}\sum \limits _{t=1}^{T} \frac{|\theta _{t,i}-\theta ^*_i|^2}{r_{t+1,i}}\right) ^{1/2}\\ &\le \frac{2D_\infty \sqrt{2B}}{1-\beta }(f_1(\theta _1)+c)^{1/4}\sqrt{nT/\eta }\left( \sum \limits _{i=1}^{n} \frac{1}{r_{T+1,i}}\right) ^{1/2}, \end{aligned}$$

where the fourth inequality is by the Cauchy-Schwarz inequality, and the assumption \(\Vert x-y\Vert _\infty \le D_\infty \) for all \(x,y\in \mathcal {F}\) is used in the last inequality.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, H., Tian, X. SGEM: stochastic gradient with energy and momentum. Numer Algor 95, 1583–1610 (2024). https://doi.org/10.1007/s11075-023-01621-x

  • DOI: https://doi.org/10.1007/s11075-023-01621-x
