
SAdaBoundNc: an adaptive subgradient online learning algorithm with logarithmic regret bounds

  • Original Article
  • Neural Computing and Applications

Abstract

Adaptive (sub)gradient methods have been widely applied, for example to the training of deep networks, and they achieve square-root regret bounds in convex settings. However, how to exploit strong convexity to improve both the convergence rate and the generalization performance of adaptive (sub)gradient methods remains an open problem. For this reason, we devise an adaptive subgradient online learning algorithm, called SAdaBoundNc, for strongly convex settings. Moreover, we rigorously prove that a logarithmic regret bound can be achieved by choosing a faster diminishing learning rate. Further, we conduct various experiments to evaluate the performance of SAdaBoundNc on four real-world datasets. The results demonstrate that SAdaBoundNc trains faster than stochastic gradient descent and several adaptive gradient methods during the initial stage of training. Moreover, the generalization performance of SAdaBoundNc is also better than that of current state-of-the-art methods on different datasets.


Data availability

The data that support the findings of this study are CIFAR-10 [25], CIFAR-100 [25], Penn TreeBank [29], and VOC2007 [30].

References

  1. Bottou L (2010) Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010. Physica-Verlag, Heidelberg, pp 177–186

  2. Haykin S (2014) Adaptive filter theory, 5th edn. Pearson Education, London

  3. Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22:400–407

  4. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313:504–507

  5. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res 12:2121–2159

  6. Tieleman T, Hinton G (2012) Rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Netw Mach Learn 4:26–31

  7. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: Proceedings of the 3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA, May 7–9, Conference Track Proceedings, 2015. arxiv:1412.6980

  8. Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. In: Proceedings of the 31st international conference on neural information processing systems, Curran Associates, Inc., pp 4148–4158

  9. Reddi SJ, Kale S, Kumar S (2018) On the convergence of Adam and beyond. In: Proceedings of the 6th international conference on learning representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, Conference Track Proceedings. www.OpenReview.net, 2018. https://openreview.net/forum?id=ryQu7f-RZ

  10. Luo L, Xiong Y, Liu Y, Sun X (2019) Adaptive gradient methods with dynamic bound of learning rate. In: Proceedings of the 7th international conference on learning representations, ICLR 2019, New Orleans, LA, USA, May 6–9, 2019, www.OpenReview.net. https://openreview.net/forum?id=Bkg3g2R9FX

  11. Chen J, Zhou D, Tang Y, Yang Z, Cao Y, Gu Q (2020) Closing the generalization gap of adaptive gradient methods in training deep neural networks. In: Proceedings of the twenty-ninth international joint conference on artificial intelligence, IJCAI 2020, www.ijcai.org, pp 3267–3275. https://doi.org/10.24963/ijcai.2020/452

  12. Hazan E (2016) Introduction to online convex optimization. Found Trends Optim 2:157–325

  13. Hazan E, Agarwal A, Kale S (2007) Logarithmic regret algorithms for online convex optimization. Mach Learn 69:169–192

  14. Mukkamala MC, Hein M (2017) Variants of rmsprop and adagrad with logarithmic regret bounds. In: Proceedings of the 34th international conference on machine learning, ICML 2017, Sydney, NSW, Australia, 6–11 August, vol 70 of proceedings of machine learning research, PMLR, 2017, pp 2545–2553. http://proceedings.mlr.press/v70/mukkamala17a.html

  15. Wang G, Lu S, Cheng Q, Tu W, Zhang L (2020) SAdam: a variant of Adam for strongly convex functions. In: Proceedings of the 8th international conference on learning representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, www.OpenReview.net, 2020. https://openreview.net/forum?id=rye5YaEtPr

  16. Zinkevich M (2003) Online convex programming and generalized infinitesimal gradient ascent. In: Proceedings of the Twentieth International conference on machine learning (ICML 2003), August 21–24, Washington, DC, USA, AAAI Press, 2003, pp 928–936. http://www.aaai.org/Library/ICML/2003/icml03-120.php

  17. Cesa-Bianchi N, Conconi A, Gentile C (2004) On the generalization ability of on-line learning algorithms. IEEE Trans Inf Theory 50:2050–2057

  18. Zeiler MD (2012) Adadelta: an adaptive learning rate method. CoRR arxiv:1212.5701

  19. Dozat T (2016) Incorporating Nesterov momentum into Adam. In: Proceedings of the 4th international conference on learning representations, ICLR 2016, San Juan, Puerto Rico, May 2–4, Workshop Track Proceedings, 2016. https://openreview.net/forum?id=OM0jvwB8jIp57Z-JjtNEZ&noteId=OM0jvwB8jIp57ZJjtNEZ

  20. Gregor K, Danihelka I, Graves A, Rezende DJ, Wierstra D (2015) DRAW: a recurrent neural network for image generation. In: Proceedings of the 32nd international conference on machine learning, ICML 2015, Lille, France, 6–11 July, vol 37 of JMLR workshop and conference proceedings, www.JMLR.org, 2015, pp 1462–1471. http://proceedings.mlr.press/v37/gregor15.html

  21. Xu K, Ba J, Kiros R, Cho K, Courville AC, Salakhutdinov R, Zemel RS, Bengio Y (2015) Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the 32nd international conference on machine learning, ICML 2015, Lille, France, 6–11 July, vol 37 of JMLR workshop and conference proceedings, www.JMLR.org, 2015, pp 2048–2057. http://proceedings.mlr.press/v37/xuc15.html

  22. Choi D, Shallue CJ, Nado Z, Lee J, Maddison CJ, Dahl GE (2019) On empirical comparisons of optimizers for deep learning. CoRR arxiv:1910.05446

  23. Keskar NS, Socher R (2017) Improving generalization performance by switching from Adam to SGD. CoRR arxiv:1712.07628

  24. Boyd S, Vandenberghe L (2004) Convex optimization. Cambridge University Press, Cambridge

  25. Krizhevsky A (2009) Learning multiple layers of features from tiny images. Technical Report, Citeseer

  26. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of 2016 IEEE conference on computer vision and pattern recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, IEEE Computer Society, 2016, pp 770–778. https://doi.org/10.1109/CVPR.2016.90

  27. Chen Z, Xu Y, Chen E, Yang T (2018) SADAGRAD: strongly adaptive stochastic gradient method. In: Dy JG, Krause A (Eds.), Proceedings of the 35th international conference on machine learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10–15, vol 80 of proceedings of machine learning research, PMLR, 2018, pp 912–920. http://proceedings.mlr.press/v80/chen18m.html

  28. McMahan HB, Streeter MJ (2010) Adaptive bound optimization for online convex optimization. In: Proceedings of the 23rd conference on learning theory (COLT 2010), Haifa, Israel, June 27–29, Omnipress, 2010, pp 244–256. http://colt2010.haifa.il.ibm.com/papers/COLT2010proceedings.pdf#page=252

  29. Marcus M, Santorini B, Marcinkiewicz MA (1993) Building a large annotated corpus of English: the Penn Treebank. Comput Linguist 19:313–330

  30. Everingham M, Gool LV, Williams CKI, Winn JM, Zisserman A (2009) The Pascal Visual Object Classes (VOC) challenge. Int J Comput Vision 88:303–338

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 62002102, No. 72002133, and No. 62172142, in part by the Leading Talents of Science and Technology in the Central Plain of China under Grant No. 224200510004, in part by the Ministry of Education of China Science Foundation under Grant No. 19YJC630174, and in part by the Program for Science & Technology Innovation Talents in the University of Henan Province under Grant No. 22HASTIT014.

Author information

Authors and Affiliations

Authors

Contributions

LW contributed to methodology and theoretical analysis. XW contributed to validation and software. RZ contributed to conceptualization and writing—review and editing. JZ contributed to resources and investigation. MZ contributed to supervision, project administration, and funding acquisition.

Corresponding author

Correspondence to Xin Wang.

Ethics declarations

Conflict of interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

Proof of Theorem 1

From Eqs. (1) and (8), we have

$$\begin{aligned}w_{k+1}&=P_{{\mathcal {K}},\Delta _k^{-1}}\left( w_k-\delta _k\odot m_k\right) \\&\quad =\arg \min _{w\in {\mathcal {K}}}\left\| w-\left( w_k-\delta _{k}\odot m_k\right) \right\| _{\Delta _k^{-1}}^2. \\ \end{aligned}$$

Moreover, \(P_{{\mathcal {K}},\Delta _k^{-1}}\left( w^{*}\right) =w^{*}\) for all \(w^{*}\in {\mathcal {K}}\). Using the non-expansiveness property of the weighted projection [28], we have

$$ \begin{aligned} \left\| {w_{{k + 1}} - w^{*} } \right\|_{{\Delta _{k}^{{ - 1}} }}^{2} & = \left\| {P_{{{\mathcal{K}},\Delta _{k}^{{ - 1}} }} \left( {w_{k} - \delta _{k} \odot m_{k} } \right) - w^{*} } \right\|_{{\Delta _{k}^{{ - 1}} }}^{2} \\ & \le \left\| {w_{k} - \delta _{k} \odot m_{k} - w^{*} } \right\|_{{\Delta _{k}^{{ - 1}} }}^{2} \\ & = \left\| {w_{k} - w^{*} } \right\|_{{\Delta _{k}^{{ - 1}} }}^{2} + \left\| {m_{k} } \right\|_{{\Delta _{k} }}^{2} \\ & \quad - 2\left\langle {m_{k} ,w_{k} - w^{*} } \right\rangle = \left\| {w_{k} - w^{*} } \right\|_{{\Delta _{k}^{{ - 1}} }}^{2} \\ & \quad + \left\| {m_{k} } \right\|_{{\Delta _{k} }}^{2} - 2\left\langle {\lambda _{{1k}} m_{{k - 1}} + \left( {1 - \lambda _{{1k}} } \right)s_{k} ,w_{k} - w^{*} } \right\rangle . \\ \end{aligned} $$
(12)
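
In the form used here, the non-expansiveness property from [28] states that, for a convex set \({\mathcal {K}}\) and a positive definite diagonal matrix \(A\),

$$\begin{aligned} \left\| P_{{\mathcal {K}},A}\left( u\right) -P_{{\mathcal {K}},A}\left( v\right) \right\| _{A}\le \left\| u-v\right\| _{A}\quad \text {for all } u,v, \end{aligned}$$

and the first inequality in Eq. (12) applies it with \(A=\Delta _k^{-1}\), \(u=w_k-\delta _k\odot m_k\), and \(v=w^{*}\).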

Rearranging Eq. (12), we obtain

$$\begin{aligned} \left\langle s_k,w_k-w^{*}\right\rangle&\le \frac{\left\| w_k-w^{*}\right\| _{\Delta _k^{-1}}^2-\left\| w_{k+1}-w^{*}\right\| _{\Delta _k^{-1}}^2}{2\left( 1-\lambda _{1k}\right) } \nonumber \\&\quad +\frac{\left\| m_k\right\| _{\Delta _k}^2}{2\left( 1-\lambda _{1k}\right) }+\frac{\lambda _{1k}}{1-\lambda _{1k}}\left\langle m_{k-1},w^{*}-w_k\right\rangle . \end{aligned}$$
(13)

Further, by using Cauchy–Schwarz and Young’s inequality, the term \(\frac{\lambda _{1k}}{1-\lambda _{1k}}\left\langle m_{k-1},w^{*}-w_k\right\rangle \) in Eq. (13) can be upper bounded as follows:

When \(k=1\), \(\frac{\lambda _{1k}}{1-\lambda _{1k}}\left\langle m_{k-1},w^{*}-w_k\right\rangle =0\) since \(m_0=0\). When \(k\ge 2\), we have

$$\begin{aligned}\frac{\lambda _{1k}}{1-\lambda _{1k}}\left\langle m_{k-1},w^{*}-w_k\right\rangle&\le \frac{\lambda _{1k}}{2(1-\lambda _{1k})}\left\| m_{k-1}\right\| _{\Delta _{k-1}}^2 \\&\quad +\frac{\lambda _{1k}}{2(1-\lambda _{1k})}\left\| w_{k}-w^{*}\right\| _{\Delta _{k-1}^{-1}}^2. \end{aligned}$$
(14)
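
Concretely, the weighted Young's inequality behind Eq. (14) reads, for any positive definite diagonal matrix \(A\) and any vectors \(a,b\),

$$\begin{aligned} \left\langle a,b\right\rangle \le \frac{1}{2}\left\| a\right\| _{A}^2+\frac{1}{2}\left\| b\right\| _{A^{-1}}^2, \end{aligned}$$

which follows from the Cauchy–Schwarz and Young inequalities; Eq. (14) applies it with \(a=m_{k-1}\), \(b=w^{*}-w_k\), and \(A=\Delta _{k-1}\), and then multiplies both sides by \(\frac{\lambda _{1k}}{1-\lambda _{1k}}\ge 0\).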

Plugging Eq. (14) into (13), we get

$$\begin{aligned}\left\langle s_k,w_k-w^{*}\right\rangle&\le \frac{\left\| w_k-w^{*}\right\| _{\Delta _k^{-1}}^2-\left\| w_{k+1}-w^{*}\right\| _{\Delta _k^{-1}}^2}{2\left( 1-\lambda _{1k}\right) } \\&\quad +\frac{\left\| m_k\right\| _{\Delta _k}^2}{2\left( 1-\lambda _{1k}\right) }+\frac{\lambda _{1k}}{2(1-\lambda _{1k})}\left\| m_{k-1}\right\| _{\Delta _{k-1}}^2 \\&\quad +\frac{\lambda _{1k}}{2(1-\lambda _{1k})}\left\| w_{k}-w^{*}\right\| _{\Delta _{k-1}^{-1}}^2. \end{aligned}$$
(15)

We now derive the regret bound of SAdaBoundNc. By the strong convexity of \(f_k\) for all \(k\in [K]\), we have

$$\begin{aligned}R_K&=\sum \limits _{k=1}^K\left( f_k\left( w_k\right) -f_k\left( w^{*}\right) \right) \\&\quad \le \sum \limits _{k=1}^K\left\langle s_k, w_k-w^{*}\right\rangle -\sum \limits _{k=1}^K\frac{\mu }{2}\left\| w_k-w^{*}\right\| ^2. \\ \end{aligned}$$
(16)

Plugging Eq. (15) into (16), we obtain

$$\begin{aligned} R_K&\le \underbrace{\sum \limits _{k=1}^K\frac{\left\| w_k-w^{*}\right\| _{\Delta _k^{-1}}^2-\left\| w_{k+1}-w^{*}\right\| _{\Delta _k^{-1}}^2}{2\left( 1-\lambda _{1k}\right) }}_{T_1} \nonumber \\&\quad +\underbrace{\sum \limits _{k=1}^K \frac{\left\| m_k\right\| _{\Delta _k}^2}{2\left( 1-\lambda _{1k}\right) }+\sum \limits _{k=2}^K\frac{\lambda _{1k}}{2(1-\lambda _{1k})}\left\| m_{k-1}\right\| _{\Delta _{k-1}}^2}_{T_2} \nonumber \\&\quad +\underbrace{\sum \limits _{k=2}^K\frac{\lambda _{1k}}{2(1-\lambda _{1k})}\left\| w_{k}-w^{*}\right\| _{\Delta _{k-1}^{-1}}^2}_{T_3} \nonumber \\&\quad -\frac{\mu }{2}\sum \limits _{k=1}^K\left\| w_k-w^{*}\right\| ^2. \end{aligned}$$
(17)

To upper bound the right-hand side of Eq. (17), we first derive the following relation:

$$ \begin{aligned} T_{1} & = \frac{{\left\| {w_{1} - w^{*} } \right\|_{{\Delta _{1}^{{ - 1}} }}^{2} }}{{2(1 - \lambda _{{11}} )}} - \frac{{\left\| {w_{{K + 1}} - w^{*} } \right\|_{{\Delta _{K}^{{ - 1}} }}^{2} }}{{2(1 - \lambda _{{1K}} )}} \\ & \quad + \sum\limits_{{k = 2}}^{K} {\left( {\frac{{\left\| {w_{k} - w^{*} } \right\|_{{\Delta _{k}^{{ - 1}} }}^{2} }}{{2(1 - \lambda _{{1k}} )}} - \frac{{\left\| {w_{k} - w^{*} } \right\|_{{\Delta _{{k - 1}}^{{ - 1}} }}^{2} }}{{2(1 - \lambda _{{1(k - 1)}} )}}} \right)} \\ & \le \sum\limits_{{k = 2}}^{K} {\frac{{\left\| {w_{k} - w^{*} } \right\|_{{\Delta _{k}^{{ - 1}} }}^{2} - \left\| {w_{k} - w^{*} } \right\|_{{\Delta _{{k - 1}}^{{ - 1}} }}^{2} }}{{2(1 - \lambda _{{1k}} )}}} \\ & \quad + \frac{{\left\| {w_{1} - w^{*} } \right\|_{{\Delta _{1}^{{ - 1}} }}^{2} }}{{2(1 - \lambda _{1} )}}, \\ \end{aligned} $$
(18)

where the last inequality follows from \(\lambda _{11}=\lambda _1\), \(\lambda _{1k}\le \lambda _{1(k-1)}\) for all \(k\in [K]\), and the fact that the dropped term \(-\left\| w_{K+1}-w^{*}\right\| _{\Delta _K^{-1}}^2/\left( 2(1-\lambda _{1K})\right) \) is non-positive. By using Eq. (18), we further get

$$ T_{1} - \frac{\mu }{2}\sum\limits_{{k = 1}}^{K} {\left\| {w_{k} - w^{*} } \right\|^{2} } \le T_{{11}} + \sum\limits_{{k = 2}}^{K} {T_{{12}} } , $$
(19)

where

$$\begin{aligned} T_{11}=\frac{\left\| w_1-w^{*}\right\| _{\Delta _1^{-1}}^2}{2(1-\lambda _{1})}-\frac{\mu }{2}\left\| w_1-w^{*}\right\| ^2 \end{aligned}$$

and

$$\begin{aligned} T_{12}=\frac{\left\| w_k-w^{*}\right\| _{\Delta _k^{-1}}^2-\left\| w_k-w^{*}\right\| _{\Delta _{k-1}^{-1}}^2}{2(1-\lambda _{1k})}-\frac{\mu }{2}\left\| w_k-w^{*}\right\| ^2. \end{aligned}$$

To bound the right-hand side of Eq. (19), \(T_{11}\) and \(T_{12}\) need to be bounded. Because Condition 1 holds, we have

$$ \begin{aligned} T_{{11}} & = \sum\limits_{{i = 1}}^{d} {\left( {w_{{1,i}} - w_{i}^{*} } \right)^{2} } \frac{{\delta _{{1,i}}^{{ - 1}} - \mu (1 - \lambda _{1} )}}{{2(1 - \lambda _{1} )}} \\ & \le \frac{{D_{\infty }^{2} }}{{2(1 - \lambda _{1} )}}\sum\limits_{{i = 1}}^{d} {\delta _{{1,i}}^{{ - 1}} } , \\ \end{aligned} $$
(20)

where the last inequality follows from Assumption 1. For the term \(T_{12}\), we derive the following relation:

$$\begin{aligned}&\left\| w_k-w^{*}\right\| _{\Delta _k^{-1}}^2-\left\| w_k-w^{*}\right\| _{\Delta _{k-1}^{-1}}^2-\mu (1-\lambda _{1k})\left\| w_k-w^{*}\right\| ^2 \\&\quad =\sum \limits _{i=1}^d\left( w_{k,i}-w_i^{*}\right) ^2\left( \delta _{k,i}^{-1}-\delta _{k-1,i}^{-1}-\mu (1-\lambda _{1k})\right) \\&\quad \le \sum \limits _{i=1}^d\left( w_{k,i}-w_i^{*}\right) ^2\left[ \frac{k}{{\underline{\alpha }}(k)}-\frac{k-1}{{\overline{\alpha }}(k-1)}-\mu (1-\lambda _{1k})\right] \\&\quad \le \sum \limits _{i=1}^d\left( w_{k,i}-w_i^{*}\right) ^2\left( \mu (1-\lambda _1)-\mu (1-\lambda _{1k})\right) \\&\quad \le 0, \end{aligned}$$
(21)

where we have utilized Eq. (7) to derive the first inequality; the second inequality is obtained by utilizing Eq. (9); the last inequality follows from \(\lambda _{1k}\le \lambda _1\) for all \(k\in [K]\). Thus, we immediately get

$$\begin{aligned} T_{12}\le 0. \end{aligned}$$
(22)

Plugging Eqs. (20) and (22) into (19), we have

$$\begin{aligned} T_1-\frac{\mu }{2}\sum \limits _{k=1}^K\left\| w_k-w^{*}\right\| ^2\le \frac{D_{\infty }^2}{2(1-\lambda _1)}\sum \limits _{i=1}^d\delta _{1,i}^{-1}. \end{aligned}$$
(23)

We now bound the term \(T_2\) in Eq. (17). The definition of \(\delta _k\) implies that

$$\begin{aligned} \alpha _{\min }\le k\left\| \delta _k\right\| _{\infty }\le \alpha _{\max }. \end{aligned}$$
(24)
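
In particular, since \(\Delta _k\) is the diagonal matrix with entries \(\delta _{k,i}\), Eq. (24) yields the coordinate-wise bound used in the last step of Eq. (25) below:

$$\begin{aligned} \left\| m_k\right\| _{\Delta _k}^2=\sum \limits _{i=1}^d\delta _{k,i}m_{k,i}^2\le \left\| \delta _k\right\| _{\infty }\left\| m_k\right\| ^2\le \frac{\alpha _{\max }}{k}\left\| m_k\right\| ^2. \end{aligned}$$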

Thus, we obtain

$$ \begin{aligned} T_{2} & \le \sum\limits_{{k = 1}}^{K} {\frac{{\left\| {m_{k} } \right\|_{{\Delta _{k} }}^{2} }}{{2(1 - \lambda _{1} )}}} + \sum\limits_{{k = 2}}^{K} {\frac{{\left\| {m_{{k - 1}} } \right\|_{{\Delta _{{k - 1}} }}^{2} }}{{2(1 - \lambda _{1} )}}} \\ & \le \frac{1}{{1 - \lambda _{1} }}\sum\limits_{{k = 1}}^{K} {\left\| {m_{k} } \right\|_{{\Delta _{k} }}^{2} } \\ & \le \frac{{\alpha _{{\max }} }}{{1 - \lambda _{1} }}\sum\limits_{{k = 1}}^{K} {\frac{{\left\| {m_{k} } \right\|^{2} }}{k}} , \\ \end{aligned} $$
(25)

where the first inequality holds by \(\lambda _{1k}\le \lambda _1<1\), and the last inequality follows from Eq. (24). To bound \(T_2\), we need to upper bound the term \(\sum \nolimits _{k=1}^K\left\| m_k\right\| ^2/k\) in Eq. (25). Hence, by using the definition of \(m_k\) in Eq. (5), we have

$$ \begin{aligned} \sum\limits_{{k = 1}}^{K} {\frac{{\left\| {m_{k} } \right\|^{2} }}{k}} & = \sum\limits_{{k = 1}}^{{K - 1}} {\frac{{\left\| {m_{k} } \right\|^{2} }}{k}} + \sum\limits_{{i = 1}}^{d} {\frac{{m_{{K,i}}^{2} }}{K}} \le \sum\limits_{{k = 1}}^{{K - 1}} {\frac{{\left\| {m_{k} } \right\|^{2} }}{k}} \\ & \quad + \sum\limits_{{i = 1}}^{d} {\underbrace {{\frac{{\left( {\sum\nolimits_{{j = 1}}^{K} {(1 - \lambda _{{1j}} )} \prod\limits_{{t = 1}}^{{K - j}} {\lambda _{{1(K - t + 1)}} } s_{{j,i}} } \right)^{2} }}{K}}}_{{T_{{21}} }}} . \\ \end{aligned} $$
(26)

To further bound Eq. (26), we now bound the term \(T_{21}\) by using the Cauchy–Schwarz inequality as follows:

$$\begin{aligned}T_{21}&\le \frac{\left( \sum \nolimits _{j=1}^K\prod _{t=1}^{K-j}\lambda _{1(K-t+1)}\right) \left( \sum \nolimits _{j=1}^K\prod _{t=1}^{K-j}\lambda _{1(K-t+1)}s_{j,i}^2\right) }{K} \nonumber \\&\quad \le \frac{\left( \sum \nolimits _{j=1}^K\lambda _1^{K-j}\right) \left( \sum \nolimits _{j=1}^K\lambda _1^{K-j}s_{j,i}^2\right) }{K} \nonumber \\&\quad \le \frac{1}{1-\lambda _1}\sum \limits _{j=1}^K\frac{\lambda _1^{K-j}s_{j,i}^2}{K}, \end{aligned}$$
(27)

where the first inequality follows from the Cauchy–Schwarz inequality and \(1-\lambda _{1j}\le 1\); the second inequality holds by utilizing \(\lambda _{1k}\le \lambda _1\) for all \(k\in [K]\); the last inequality is derived from the inequality \(\sum \nolimits _{j=1}^K\lambda _1^{K-j}\le 1/(1-\lambda _1)\). Plugging Eq. (27) into (26), we get

$$\begin{aligned} \sum \limits _{k=1}^K\frac{\left\| m_k\right\| ^2}{k}&\le \sum \limits _{k=1}^{K-1}\frac{\left\| m_k\right\| ^2}{k}+\frac{1}{1-\lambda _1} \sum \limits _{i=1}^d\sum \limits _{j=1}^K\frac{\lambda _1^{K-j}s_{j,i}^2}{K} \nonumber \\&\quad \le \frac{1}{1-\lambda _1}\sum \limits _{i=1}^d\sum \limits _{k=1}^K\sum \limits _{j=1}^k\frac{\lambda _1^{k-j}s_{j,i}^2}{k} \nonumber \\&\quad \le \frac{G_{\infty }^2}{1-\lambda _1}\sum \limits _{i=1}^d\sum \limits _{k=1}^K\sum \limits _{j=1}^k\frac{\lambda _1^{k-j}}{k} \nonumber \\&\quad =\frac{G_{\infty }^2}{1-\lambda _1}\sum \limits _{i=1}^d\sum \limits _{k=1}^K\frac{1}{k}\sum \limits _{j=1}^k\lambda _1^{k-j} \nonumber \\&\quad =\frac{G_{\infty }^2}{1-\lambda _1}\sum \limits _{i=1}^d\sum \limits _{k=1}^K\sum \limits _{j=k}^K\frac{\lambda _1^{j-k}}{j} \nonumber \\&\quad \le \frac{G_{\infty }^2}{1-\lambda _1}\sum \limits _{i=1}^d\sum \limits _{k=1}^K\sum \limits _{j=k}^K\frac{\lambda _1^{j-k}}{k} \nonumber \\&\quad =\frac{G_{\infty }^2}{1-\lambda _1}\sum \limits _{i=1}^d\sum \limits _{k=1}^K\frac{1}{k}\sum \limits _{j=k}^K\lambda _1^{j-k} \nonumber \\&\quad \le \frac{G_{\infty }^2}{(1-\lambda _1)^2}\sum \limits _{i=1}^d\sum \limits _{k=1}^K\frac{1}{k} \nonumber \\&\quad \le \frac{dG_{\infty }^2}{(1-\lambda _1)^2}\left( 1+\log K\right) , \end{aligned}$$
(28)

where the second inequality is obtained by applying the argument of Eqs. (26) and (27) to each term \(\left\| m_k\right\| ^2/k\) with \(k\in [K]\); the third inequality is due to Assumption 2; the fourth inequality uses \(1/j\le 1/k\) for all \(j\ge k\); the fifth inequality is obtained by the inequality \(\sum \nolimits _{j=k}^K\lambda _1^{j-k} \le 1/(1-\lambda _1)\); the sixth inequality is derived by using the following inequality:

$$\begin{aligned} \sum \limits _{k=1}^K\frac{1}{k}\le 1+\int _{k=1}^K\frac{1}{k}dk=1+\log K. \end{aligned}$$
(29)
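
For clarity, the second equality in Eq. (28) interchanges the order of summation over the index set \(\{(k,j):1\le j\le k\le K\}\) and then relabels the indices:

$$\begin{aligned} \sum \limits _{k=1}^K\frac{1}{k}\sum \limits _{j=1}^k\lambda _1^{k-j}=\sum \limits _{j=1}^K\sum \limits _{k=j}^K\frac{\lambda _1^{k-j}}{k}=\sum \limits _{k=1}^K\sum \limits _{j=k}^K\frac{\lambda _1^{j-k}}{j}. \end{aligned}$$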

Hence, plugging Eq. (28) into (25), we have

$$\begin{aligned} T_2\le \frac{d\alpha _{\max }G_{\infty }^2}{(1-\lambda _1)^3}\left( 1+\log K\right) . \end{aligned}$$
(30)

Finally, we bound the term \(T_3\) as follows:

$$\begin{aligned} T_3&=\sum \limits _{k=2}^K\frac{\lambda _{1k}}{2(1-\lambda _{1k})}\left\| w_{k}-w^{*}\right\| _{\Delta _{k-1}^{-1}}^2 \nonumber \\&\quad =\sum \limits _{i=1}^d\sum \limits _{k=2}^K\frac{\lambda _{1k}}{2(1-\lambda _{1k})}\left( w_{k,i}-w_i^{*}\right) ^2\delta _{k-1,i}^{-1} \nonumber \\&\quad \le \frac{D_{\infty }^2}{2(1-\lambda _{1})}\sum \limits _{i=1}^d\sum \limits _{k=1}^K\lambda _{1k}\delta _{k,i}^{-1}, \end{aligned}$$
(31)

where the last inequality follows from Assumption 1 and \(\lambda _{1k}\le \lambda _{1(k-1)}\le \lambda _1\) for all \(k\in [K]\), after shifting the summation index. Plugging Eqs. (23), (30), and (31) into Eq. (17) completes the proof of Theorem 1. \(\Box \)

Appendix B

Proof of Corollary 1

By the definitions of the bound functions \({\underline{\alpha }}(k)\) and \({\overline{\alpha }}(k)\), we get

$$\begin{aligned}\frac{k}{{\underline{\alpha }}(k)}-\frac{k-1}{{\overline{\alpha }}(k-1)}&=k\left( \frac{1}{{\underline{\alpha }}(k)}-\frac{1}{{\overline{\alpha }}(k-1)}\right) +\frac{1}{{\overline{\alpha }}(k-1)} \\&\quad \le k\left( \frac{1}{{\underline{\alpha }}(k)}-\frac{1}{{\overline{\alpha }}(k-1)}\right) +\frac{1}{\alpha ^{\star }} \\&\quad =\frac{1}{\gamma \alpha ^{\star }}\frac{2\gamma k+1-\gamma }{1+\gamma (k-1)}+\frac{1}{\alpha ^{\star }} \\&\quad \le \frac{2}{\gamma \alpha ^{\star }}\frac{1+\gamma k}{1+\gamma (k-1)}+\frac{1}{\alpha ^{\star }} \\&\quad \le \frac{2}{\gamma \alpha ^{\star }}\left( 1+\gamma \right) +\frac{1}{\alpha ^{\star }} \\&\quad =\frac{2}{\gamma \alpha ^{\star }}+\frac{3}{\alpha ^{\star }}, \\ \end{aligned}$$
(32)

where we have used the relation \(1/{\overline{\alpha }}(k)\le 1/\alpha ^{\star }\) for all \(k\in [K]\) to derive the first inequality; in the last inequality, we have utilized the inequality \(\frac{1+\gamma k}{1+\gamma (k-1)}\le 1+\gamma \), which holds because \((1+\gamma )(1+\gamma (k-1))=1+\gamma k+\gamma ^2(k-1)\ge 1+\gamma k\) for all \(\gamma >0\) and \(k\ge 1\). Thus, for any \(\alpha ^{\star }\ge \frac{3+2\gamma ^{-1}}{\mu (1-\lambda _1)}\), the following inequality holds:

$$\begin{aligned} \frac{k}{{\underline{\alpha }}(k)}-\frac{k-1}{{\overline{\alpha }}(k-1)}\le \mu \left( 1-\lambda _1\right) . \end{aligned}$$

Besides, by the definition of \(\delta _k\), \(\forall i\in [d]\) and \(k\in [K]\), we get

$$\begin{aligned} \delta _{k,i}^{-1}=k{\hat{\delta }}_{k,i}^{-1}\le k{\underline{\alpha }}(1)^{-1}=\frac{k(1+\gamma ^{-1})}{\alpha ^{\star }}. \end{aligned}$$
(33)

Because \(\lambda _{1k}=\lambda _1\nu ^{k-1}\), we have

$$\begin{aligned} \sum \limits _{k=1}^K\lambda _{1k}\delta _{k,i}^{-1}&=\lambda _1\sum \limits _{k=1}^K\nu ^{k-1}\delta _{k,i}^{-1} \nonumber \\&\quad \le \frac{\lambda _1(1+\gamma ^{-1})}{\alpha ^{\star }}\sum \limits _{k=1}^Kk\nu ^{k-1} \nonumber \\&\quad \le \frac{\lambda _1(1+\gamma ^{-1})}{\alpha ^{\star }(1-\nu )^2}, \end{aligned}$$
(34)

where the first inequality follows from Eq. (33); the last inequality is derived from the following inequality:

$$\begin{aligned} \sum \limits _{k=1}^Kk\nu ^{k-1}&\le \sum \limits _{k=0}^{\infty }k\nu ^{k-1} \\&\quad =\partial _{\nu }\left( \sum \limits _{k=0}^{\infty }\nu ^{k}\right) \\&\quad =\partial _{\nu }\left( \frac{1}{1-\nu }\right) \\&\quad =\frac{1}{(1-\nu )^2}, \end{aligned}$$

where the notation \(\partial _{\nu }\) denotes the derivative operator with respect to \(\nu \). In addition, Eq. (33) implies that \(\delta _{1,i}^{-1}\le \frac{1+\gamma ^{-1}}{\alpha ^{\star }}\) for all \(i\in [d]\). Thus, plugging Eq. (34) into (10), we have

$$\begin{aligned}R_K&\le \frac{dD_{\infty }^2}{2\alpha ^{\star }(1-\lambda _1)}\left( 1+\gamma ^{-1}\right) \left( 1+\frac{\lambda _1}{(1-\nu )^2}\right) \\&\quad +\frac{d\alpha _{\max }G_{\infty }^2}{(1-\lambda _1)^3}\left( 1+\log K\right) . \end{aligned}$$
(35)

Furthermore, since \(\alpha _{\max }={\overline{\alpha }}(1)=\alpha ^{\star }(1+\gamma ^{-1})\), Corollary 1 is proved. \(\Box \)
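
Taken together, the relations used above — the momentum recursion \(m_k=\lambda _{1k}m_{k-1}+(1-\lambda _{1k})s_k\) of Eq. (5) with \(\lambda _{1k}=\lambda _1\nu ^{k-1}\), step sizes satisfying \(\alpha _{\min }\le k\left\| \delta _k\right\| _{\infty }\le \alpha _{\max }\) as in Eq. (24), and the update \(w_{k+1}=P_{{\mathcal {K}},\Delta _k^{-1}}\left( w_k-\delta _k\odot m_k\right) \) of Eqs. (1) and (8) — suggest the following minimal sketch of an SAdaBoundNc-style step. It is a hypothetical illustration, not the authors' reference implementation: the Adam/RMSprop-style second-moment accumulator \(v\), the exact forms of the bound functions \({\underline{\alpha }}(k)\) and \({\overline{\alpha }}(k)\), and the omission of the projection (taking \({\mathcal {K}}={\mathbb {R}}^d\)) are assumptions made here for concreteness.

import numpy as np

# Hypothetical SAdaBoundNc-style step (illustration only, projection omitted).
def sadaboundnc_step(w, m, v, s, k, alpha_star=0.1, gamma=0.1,
                     lam1=0.9, nu=0.999, lam2=0.999, eps=1e-8):
    lam1k = lam1 * nu ** (k - 1)                 # decaying momentum weight (Corollary 1)
    m = lam1k * m + (1.0 - lam1k) * s            # momentum recursion, Eq. (5)
    v = lam2 * v + (1.0 - lam2) * s * s          # assumed Adam/RMSprop-style accumulator
    hat_delta = alpha_star / (np.sqrt(v) + eps)  # assumed unclipped per-coordinate step size
    lower = alpha_star * (1.0 - 1.0 / (gamma * k + 1.0))  # assumed bound functions; at k = 1 they
    upper = alpha_star * (1.0 + 1.0 / (gamma * k))        # match Eq. (33) and alpha_max = alpha*(1 + 1/gamma)
    delta = np.clip(hat_delta, lower, upper) / k # clipped 1/k step sizes, cf. Eq. (24)
    return w - delta * m, m, v                   # Eqs. (1) and (8) with K = R^d (no projection)

# Toy usage on an online sequence of 1-strongly convex losses f_k(w) = 0.5 * ||w - u_k||^2.
rng = np.random.default_rng(0)
w, m, v = np.zeros(5), np.zeros(5), np.zeros(5)
for k in range(1, 2001):
    u_k = rng.normal(size=5)                     # data revealed at round k
    s = w - u_k                                  # gradient of f_k at the current iterate
    w, m, v = sadaboundnc_step(w, m, v, s, k)
print(np.round(w, 3))                            # iterates should drift toward 0, the minimizer of the average loss

In this sketch, the division by \(k\) plays the role of the faster diminishing learning rate; under strong convexity it is what yields the \(O(\log K)\) regret of Eq. (35) rather than the usual square-root rate.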

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, L., Wang, X., Li, T. et al. SAdaBoundNc: an adaptive subgradient online learning algorithm with logarithmic regret bounds. Neural Comput & Applic 35, 8051–8063 (2023). https://doi.org/10.1007/s00521-022-08082-8
