AdaXod: a new adaptive and momental bound algorithm for training deep neural networks

Liu, Yuanxuan; Li, Dequan

doi:10.1007/s11227-023-05338-5

Published: 09 May 2023

The Journal of Supercomputing, Volume 79, pages 17691–17715 (2023)

Yuanxuan Liu¹ & Dequan Li²

Abstract

Adaptive algorithms are widely used in deep learning because of their fast convergence. Among them, Adam is the most widely used. However, studies have shown that Adam’s generalization ability is weak. AdaX, a variant of Adam, modifies Adam’s second-order moment by introducing a novel second-order momentum and has good generalization ability. However, these algorithms may fail to converge because of instability and extreme learning rates during training. In this paper, we propose a new adaptive and momental bound algorithm, called AdaXod, which is characterized by exponentially averaging the learning rate and is particularly useful for training deep neural networks. By adaptively bounding the learning rate in the AdaX algorithm, AdaXod effectively eliminates excessively large learning rates in the later stages of neural network training and thus yields stable training. We conduct extensive experiments on different datasets and verify the advantages of AdaXod by comparing it with other advanced adaptive optimization algorithms. AdaXod eliminates large learning rates during neural network training and outperforms other optimizers, especially for neural networks with complex structures, such as DenseNet.
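To make the idea in the abstract concrete, the following is a minimal sketch of one update step, not the authors' implementation. It assumes that the second moment follows the published AdaX rule and that the learning rate is bounded by its own exponential moving average, as in the momental bound method of Ding et al. The function names `init_state` and `adaxod_like_step`, the scalar-list representation, and all default hyperparameters are illustrative assumptions; the exact algorithm and constants are given in the full paper, which is not reproduced here.

```python
import math


def init_state(n):
    """Optimizer state for n parameters (hypothetical helper)."""
    return {"t": 0, "m": [0.0] * n, "v": [0.0] * n, "s": [0.0] * n}


def adaxod_like_step(theta, grad, state, lr=0.005, beta1=0.9,
                     beta2=1e-4, beta3=0.999, eps=1e-12):
    """One AdaXod-style step on a list of floats (assumed form, see lead-in).

    Returns the new parameter list and updates `state` in place.
    """
    state["t"] += 1
    t = state["t"]
    new_theta = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        # First moment: exponential moving average of the gradient.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * g
        # AdaX-style second moment with its bias correction (assumption).
        state["v"][i] = (1 + beta2) * state["v"][i] + beta2 * g * g
        v_hat = state["v"][i] / ((1 + beta2) ** t - 1)
        # Raw per-coordinate learning rate.
        eta = lr / (math.sqrt(v_hat) + eps)
        # Momental bound: average the learning rate exponentially,
        # then never let the step size exceed that average.
        state["s"][i] = beta3 * state["s"][i] + (1 - beta3) * eta
        eta_bounded = min(eta, state["s"][i])
        new_theta.append(p - eta_bounded * state["m"][i])
    return new_theta


# Usage: a few steps on f(x) = x^2, starting from x = 1.0.
theta = [1.0]
state = init_state(1)
for _ in range(200):
    grad = [2.0 * theta[0]]
    theta = adaxod_like_step(theta, grad, state)
```

Because the averaged rate `s` starts at zero, the bounded step is very small at first and grows gradually, so the clipping also acts as a warm-up in this sketch.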

Availability of data and materials

The datasets presented in this study are publicly available at http://www.cs.toronto.edu/~kriz/cifar.html and http://yann.lecun.com/exdb/mnist/.

Acknowledgments

This work is supported in part by the Natural Science Foundation of China under Grant No. 61472003, the Anhui Provincial Natural Science Foundation under Grant No. 2208085ME128, the Academic and Technical Leaders and Backup Candidates of Anhui Province under Grant No. 2019h211, and the Innovation Team of ‘50 Star of Science and Technology’ of Huainan, Anhui Province.

Funding

The authors have acknowledged the funds that supported this work. The viewpoints expressed in this work are those of the authors and do not represent the viewpoints of the funding bodies.

Author information

Authors and Affiliations

School of Mathematics and Big Data, Anhui University of Science and Technology, Huainan, 232001, Anhui, China
Yuanxuan Liu
School of Artificial Intelligence, Anhui University of Science and Technology, Huainan, 232001, Anhui, China
Dequan Li

Contributions

Dequan Li presented the main idea of this work and provided the funds. Yuanxuan Liu implemented the idea and wrote the initial manuscript. Dequan Li and Yuanxuan Liu rechecked and improved the final manuscript.

Corresponding author

Correspondence to Dequan Li.

Ethics declarations

Conflict of interest

We declare that the authors have no competing interests.

Ethics approval

This work does not involve ethical or medical issues, so this declaration is not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A Details About Proof

Before proving Theorem 1, we need to establish the following result of Lemma 1.

Lemma 1

For the parameter settings and conditions assumed in Theorem 1, we have

$$\begin{aligned} \sum \limits _{t = 1}^T {\left[\frac{1}{{2(1 - {\beta _{1t}})}}{{\left\| {{\eta _t}^{1/2}{m_t}} \right\| }^2} + \frac{{{\beta _{1t}}}}{{2(1 - {\beta _{1t}})}}{{\left\| {{\eta _t}^{1/2}{m_{t - 1}}} \right\| }^2}\right]} \le \frac{{{R_\infty }G_2^2}}{{1 - {\beta _1}}} \end{aligned}$$

(A1)

Proof

According to the definition of ${\eta _t}$, by the formula ${\eta _t}= {p_t}$, we have

$$\begin{aligned} {\left\| {{\eta _t}} \right\| _\infty } = {\left\| {{p_t}} \right\| _\infty } \le {R_\infty } \end{aligned}$$

(A2)

Hence,

$$\begin{aligned} \begin{aligned}&\sum \limits _{t = 1}^T {\left[\frac{1}{{2(1 - {\beta _{1t}})}}{{\left\| {{\eta _t}^{1/2}{\varvec{m}_t}} \right\| }^2} + \frac{{{\beta _{1t}}}}{{2(1 - {\beta _{1t}})}}{{\left\| {{\eta _t}^{1/2}{\varvec{m}_{t - 1}}} \right\| }^2}\right]} \\&\quad \le \sum \limits _{t = 1}^T {\left[\frac{{{R_\infty }}}{{2(1 - {\beta _1})}}{{\left\| {{\varvec{m}_t}} \right\| }^2} + \frac{{{R_\infty }}}{{2(1 - {\beta _1})}}{{\left\| {{\varvec{m}_{t - 1}}} \right\| }^2}\right]} \\&\quad = \frac{{{R_\infty }}}{{2(1 - {\beta _1})}}\left[\sum \limits _{t = 1}^T {{{\left\| {{\varvec{m}_t}} \right\| }^2}} + \sum \limits _{t = 1}^T {{{\left\| {{\varvec{m}_{t - 1}}} \right\| }^2}} \right]\\&\quad \le \frac{{{R_\infty }}}{{2(1 - {\beta _1})}}\left[\frac{1}{T}{\left[\sum \limits _{t = 1}^T {\left\| {{\varvec{m}_t}} \right\| } \right]^2} + \frac{1}{T}{\left[\sum \limits _{t = 1}^T {\left\| {{\varvec{m}_{t - 1}}} \right\| } \right]^2}\right]\\&\quad \le \frac{{{R_\infty }G_2^2}}{{(1 - {\beta _1})}} \end{aligned} \end{aligned}$$

(A3)

The first inequality follows from (A2) together with ${\beta _{1t}} \le {\beta _1} < 1$. This completes the proof of Lemma 1. $\square$
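The first inequality in (A3) can be unpacked coordinate by coordinate. As a brief sketch, writing $\eta _{t,i}$ and $m_{t,i}$ for the $i$-th coordinates of ${\eta _t}$ and ${\varvec{m}_t}$ (the indexing used in (A9)) and assuming $\eta _{t,i} \ge 0$, the bound (A2) gives

$$\begin{aligned} {\left\| {{\eta _t}^{1/2}{\varvec{m}_t}} \right\| ^2} = \sum \limits _{i = 1}^d {{\eta _{t,i}}\,m_{t,i}^2} \le {\left\| {{\eta _t}} \right\| _\infty }\sum \limits _{i = 1}^d {m_{t,i}^2} \le {R_\infty }{\left\| {{\varvec{m}_t}} \right\| ^2} \end{aligned}$$

The same argument applies to ${\varvec{m}_{t - 1}}$. Since ${\beta _{1t}} \le {\beta _1} < 1$, both coefficients $\frac{1}{2(1 - {\beta _{1t}})}$ and $\frac{{\beta _{1t}}}{2(1 - {\beta _{1t}})}$ are at most $\frac{1}{2(1 - {\beta _1})}$.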

Proof of Theorem 1

Let ${\varvec{\theta }^*} = \arg {\min _{\varvec{\theta } \in F}}\sum \nolimits _{t = 1}^T {{f_t}(\varvec{\theta })}$, which exists since $F$ is closed and convex. From the definition of the projection operation, we observe that

$$\begin{aligned} {\varvec{\theta }_{t + 1}} = \prod \nolimits _F {({\varvec{\theta }_t} - {\eta _t}{\varvec{m}_t})}= \mathop {\arg \min }\limits _{\varvec{\theta } \in F} \left\| {{\eta _t}^{ - 1/2}(\varvec{\theta } - ({\varvec{\theta }_t} - {\eta _t}{\varvec{m}_t}))} \right\| \end{aligned}$$

(A4)

Since ${\varvec{\theta }^*} \in F$ and the projection is non-expansive in this weighted norm, it follows that

$$\begin{aligned} {\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_{t + 1}} - {\varvec{\theta }^*})} \right\| ^2}&\le {\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_t} - {\eta _t}{\varvec{m}_t} - {\varvec{\theta }^*})} \right\| ^2}\\&= {\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_t} - {\varvec{\theta }^*})} \right\| ^2} + {\left\| {{\eta _t}^{1/2}{\varvec{m}_t}} \right\| ^2}- 2\left\langle {{\varvec{m}_t},{\varvec{\theta }_t} - {\varvec{\theta }^*}} \right\rangle \\&= {\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_t} - {\varvec{\theta }^*})} \right\| ^2} + {\left\| {{\eta _t}^{1/2}{\varvec{m}_t}} \right\| ^2} \\&\quad -2\left\langle {{\beta _{1t}}{\varvec{m}_{t - 1}} + (1 - {\beta _{1t}}){\varvec{g}_t},{\varvec{\theta }_t} - {\varvec{\theta }^*}} \right\rangle \end{aligned}$$

(A5)

Rearranging inequality (A5), we get

$$\begin{aligned} \begin{aligned} \left\langle {{\varvec{g}_t},{\varvec{\theta }_t} - {\varvec{\theta }^*}} \right\rangle&\le \frac{1}{{2(1 - {\beta _{1t}})}}\left[{\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_t} - {\varvec{\theta }^*})} \right\| ^2} - {\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_{t + 1}} - {\varvec{\theta }^*})} \right\| ^2}\right]\\&\quad + \frac{1}{{2(1 - {\beta _{1t}})}}{\left\| {{\eta _t}^{1/2}{\varvec{m}_t}} \right\| ^2} - \frac{{{\beta _{1t}}}}{{1 - {\beta _{1t}}}}\left\langle {{\varvec{m}_{t - 1}},{\varvec{\theta }_t} - {\varvec{\theta }^*}} \right\rangle \\&\le \frac{1}{{2(1 - {\beta _{1t}})}}\left[{\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_t} - {\varvec{\theta }^*})} \right\| ^2} - {\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_{t + 1}} - {\varvec{\theta }^*})} \right\| ^2}\right] \\&\quad + \frac{1}{{2(1 - {\beta _{1t}})}}{\left\| {{\eta _t}^{1/2}{\varvec{m}_t}} \right\| ^2} + \frac{{{\beta _{1t}}}}{{2(1 - {\beta _{1t}})}}{\left\| {{\eta _t}^{1/2}{\varvec{m}_{t - 1}}} \right\| ^2} \\&\quad + \frac{{{\beta _{1t}}}}{{2(1 - {\beta _{1t}})}}{\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_t} - {\varvec{\theta }^*})} \right\| ^2} \end{aligned} \end{aligned}$$

(A6)
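The second inequality in (A6) bounds the cross term that involves the previous momentum. As a sketch of that step (assuming every coordinate of ${\eta _t}$ is positive, which this appendix does not restate), the Cauchy-Schwarz and Young inequalities in the norm weighted by ${\eta _t}$ give

$$\begin{aligned} - \left\langle {{\varvec{m}_{t - 1}},{\varvec{\theta }_t} - {\varvec{\theta }^*}} \right\rangle&= - \left\langle {{\eta _t}^{1/2}{\varvec{m}_{t - 1}},{\eta _t}^{ - 1/2}({\varvec{\theta }_t} - {\varvec{\theta }^*})} \right\rangle \\&\le \frac{1}{2}{\left\| {{\eta _t}^{1/2}{\varvec{m}_{t - 1}}} \right\| ^2} + \frac{1}{2}{\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_t} - {\varvec{\theta }^*})} \right\| ^2} \end{aligned}$$

Multiplying both sides by $\frac{{\beta _{1t}}}{1 - {\beta _{1t}}} \ge 0$ produces the two terms with coefficient $\frac{{\beta _{1t}}}{2(1 - {\beta _{1t}})}$.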

Then, by the convexity of ${{f_t}({\varvec{\theta }})}$ at each step and by summing (A6) over $t$, we obtain

$$\begin{aligned} \begin{aligned} \sum \limits _{t = 1}^T {\left[{f_t}({\varvec{\theta }_t}) - {f_t}({\varvec{\theta }^*})\right]}& \le \sum \limits _{t = 1}^T {\left\langle {{\varvec{g}_t},{\varvec{\theta }_t} - {\varvec{\theta }^*}} \right\rangle }\\& \le \sum \limits _{t = 1}^T {\Big[\frac{1}{{2(1 - {\beta _{1t}})}}{{\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_t} - {\varvec{\theta }^*})} \right\| }^2}\Big]}\\& \quad -\sum \limits _{t = 1}^T {\Big[\frac{1}{{2(1 - {\beta _{1t}})}}{{\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_{t + 1}} - {\varvec{\theta }^*})} \right\| }^2}\Big]}\\& \quad +\sum \limits _{t = 1}^T {\Big[\frac{1}{{2(1 - {\beta _{1t}})}}{{\left\| {{\eta _t}^{1/2}{\varvec{m}_t}} \right\| }^2}\Big]}\\& \quad +\sum \limits _{t = 1}^T {\Big[\frac{{{\beta _{1t}}}}{{2(1 - {\beta _{1t}})}}{{\left\| {{\eta _t}^{1/2}{\varvec{m}_{t - 1}}} \right\| }^2}\Big]}\\& \quad + \sum \limits _{t = 1}^T {\frac{{{\beta _{1t}}}}{{2(1 - {\beta _{1t}})}}{{\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_t} - {\varvec{\theta }^*})} \right\| }^2}} \end{aligned} \end{aligned}$$

(A7)

Then, according to Lemma 1, the two momentum sums in (A7) are bounded by $\frac{{{R_\infty }G_2^2}}{{1 - {\beta _1}}}$, so that

$$\begin{aligned} \begin{aligned} \sum \limits _{t = 1}^T {\left[{f_t}({\varvec{\theta }_t}) - {f_t}({\varvec{\theta }^*})\right]}&\le \sum \limits _{t = 1}^T {\Big[\frac{1}{{2(1 - {\beta _{1t}})}}{{\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_t} - {\varvec{\theta }^*})} \right\| }^2}\Big]}\\&\quad -\sum \limits _{t = 1}^T {\Big[\frac{1}{{2(1 - {\beta _{1t}})}}{{\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_{t + 1}} - {\varvec{\theta }^*})} \right\| }^2}\Big]}\\&\quad + \sum \limits _{t = 1}^T {\frac{{{\beta _{1t}}}}{{2(1 - {\beta _{1t}})}}{{\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_t} - {\varvec{\theta }^*})} \right\| }^2}} + \frac{{{R_\infty }G_2^2}}{{(1 - {\beta _1})}} \end{aligned} \end{aligned}$$

(A8)

$$\begin{aligned} \begin{aligned} \sum \limits _{t = 1}^T {\left[{f_t}({\varvec{\theta }_t}) - {f_t}({\varvec{\theta }^*})\right]}& \le \frac{1}{{2(1 - {\beta _1})}}{\left\| {{\eta _1}^{ - 1/2}({\varvec{\theta }_1} - {\varvec{\theta }^*})} \right\| ^2} + \frac{1}{{2(1 - {\beta _1})}}\sum \limits _{t = 2}^T {{{\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta } _t} - {\varvec{\theta } ^*})} \right\| }^2}} \\& \quad - \frac{1}{{2(1 - {\beta _1})}}\sum \limits _{t = 2}^T {{{\left\| {{\eta _{t - 1}}^{ - 1/2}({\varvec{\theta } _t} - {\varvec{\theta } ^*})} \right\| }^2}} \\& \quad + \sum \limits _{t = 1}^T {\frac{{{\beta _{1t}}}}{{2(1 - {\beta _{1t}})}}{{\left\| {{\eta _t}^{ - 1/2}({\varvec{\theta }_t} - {\varvec{\theta }^*})} \right\| }^2}} + \frac{{{R_\infty }G_2^2}}{{(1 - {\beta _1})}} \\& \le \frac{1}{{2(1 - {\beta _1})}}\sum \limits _{i = 1}^d {{\eta _{1,i}}^{ - 1}{{({\varvec{\theta }_{1,i}} - \varvec{\theta }_i^*)}^2}} \\& \quad + \frac{1}{{2(1 - {\beta _1})}}\sum \limits _{t = 2}^T {\sum \limits _{i = 1}^d {{\eta _{t,i}}^{ - 1}{{({\varvec{\theta }_{t,i}} - \varvec{\theta }_i^*)}^2}} } \\& \quad - \frac{1}{{2(1 - {\beta _1})}}\sum \limits _{t = 2}^T {\sum \limits _{i = 1}^d {{\eta _{t - 1,i}}^{ - 1}{{({\varvec{\theta }_{t,i}} - \varvec{\theta }_i^*)}^2}} } \\& \quad + \frac{1}{{2(1 - {\beta _1})}}\sum \limits _{t = 1}^T {\sum \limits _{i = 1}^d {{\beta _{1t}}{\eta _{t,i}}^{ - 1}{{({\varvec{\theta }_{t,i}} - \varvec{\theta }_i^*)}^2}} } + \frac{{{R_\infty }G_2^2}}{{(1 - {\beta _1})}}\\& \le \frac{{D_\infty ^2}}{{2(1 - {\beta _1})}}\Big[\sum \limits _{i = 1}^d {{\eta _{1,i}}^{ - 1}} + \sum \limits _{t = 2}^T {\sum \limits _{i = 1}^d {\big({\eta _{t,i}}^{ - 1} - {\eta _{t - 1,i}}^{ - 1}\big)} } \Big] \\& \quad + \frac{{D_\infty ^2}}{{2(1 - {\beta _1})}}\sum \limits _{t = 1}^T {\sum \limits _{i = 1}^d {{\beta _{1t}}{\eta _{t,i}}^{ - 1}} } + \frac{{{R_\infty }G_2^2}}{{(1 - {\beta _1})}}\\& \le \frac{{D_\infty ^2\sqrt{T} }}{{2(1 - {\beta _1})}}\sum \limits _{i = 1}^d {{{{\hat{\eta }} }_{T,i}}^{ - 1}} + \frac{{{R_\infty }G_2^2}}{{(1 - {\beta _1})}} \\& \quad +\frac{{D_\infty ^2}}{{2(1 - {\beta _1})}}\sum \limits _{t = 1}^T {\sum \limits _{i = 1}^d {{\beta _{1t}}{\eta _{t,i}}^{ - 1}} } \end{aligned} \end{aligned}$$

(A9)

Applying ${\beta _{1t}} \le {\beta _1} < 1$ to (A8) gives the first step of (A9). According to the definition of ${\eta _t}$, we have $\eta _{t,i}^{ - 1} \ge \eta _{t - 1,i}^{ - 1}$, so the ${D_\infty }$ bound on the feasible region can be applied to each coordinate, which gives the remaining steps of (A9). It follows that the regret obtained by AdaXod enjoys an upper bound of O(T). The proof is finished. $\square$
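The role of the monotonicity condition $\eta _{t,i}^{ - 1} \ge \eta _{t - 1,i}^{ - 1}$ can be made explicit with a short worked step. As a sketch, with $\eta _{1,i}$ denoting the per-coordinate learning rate at the first step (an indexing assumed here), each difference $\eta _{t,i}^{ - 1} - \eta _{t - 1,i}^{ - 1}$ is non-negative, so each squared distance may be replaced by its bound $D_\infty ^2$, and the remaining sum telescopes:

$$\begin{aligned} \sum \limits _{i = 1}^d {\eta _{1,i}^{ - 1}} + \sum \limits _{t = 2}^T {\sum \limits _{i = 1}^d {\left( {\eta _{t,i}^{ - 1} - \eta _{t - 1,i}^{ - 1}} \right)} } = \sum \limits _{i = 1}^d {\eta _{T,i}^{ - 1}} \end{aligned}$$

Only the inverse learning rate at the final step $T$ survives, which is why the bound depends on the learning rate schedule only through its last value and through the $\beta _{1t}$-weighted sum.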

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Liu, Y., Li, D. AdaXod: a new adaptive and momental bound algorithm for training deep neural networks. J Supercomput 79, 17691–17715 (2023). https://doi.org/10.1007/s11227-023-05338-5

Accepted: 23 April 2023
Published: 09 May 2023
Issue Date: October 2023
DOI: https://doi.org/10.1007/s11227-023-05338-5
