Abstract
Adaptive optimization algorithms enjoy fast convergence and have been widely exploited in pattern recognition and cognitively inspired machine learning. However, their projection steps can incur high computational cost and low generalization ability, which makes them difficult to apply in big data analytics, as typically arises in cognitively inspired learning, e.g., deep learning tasks. In this paper, we propose a fast and accurate adaptive momentum online algorithm, called LightAdam, to alleviate the drawbacks of the projection steps in adaptive algorithms. The proposed algorithm substantially reduces the computational cost of each iteration by replacing high-order projection operators with one-dimensional linear searches. Moreover, we introduce a novel second-order momentum and employ dynamic learning rate bounds, thereby obtaining higher generalization ability than other adaptive algorithms. We theoretically show that the proposed algorithm has a guaranteed convergence bound, and prove that it has better generalization capability than Adam. We conduct extensive experiments on three public datasets for image pattern classification, and validate the computational benefit and accuracy of the proposed algorithm in comparison with other state-of-the-art adaptive optimization algorithms.
References
McMahan HB, Streeter MJ. Adaptive bound optimization for online convex optimization, in: The 23rd Conference on Learning Theory. 2010:244–256.
Sutskever I, Martens J, Dahl GE, Hinton GE. On the importance of initialization and momentum in deep learning, in: Proceedings of the 30th International Conference on Machine Learning. 2013:1139–1147.
Long M, Cao Y, Cao Z, Wang J, Jordan M. Transferable representation learning with deep adaptation networks. IEEE Trans Pattern Anal Mach Intell. 2019;41:3071–85.
Yang X, Huang K, Zhang R, et al. A Novel Deep Density Model for Unsupervised Learning. Cogn Comput. 2019;11:778–88.
Nguyen B, Morell C, Baets BD. Scalable large-margin distance metric learning using stochastic gradient descent. IEEE Transactions on Cybernetics. 2020;50:1072–83.
Balcan M, Khodak M, Talwalkar A. Provable guarantees for gradient-based meta-learning, in: Proceedings of the 36th International Conference on Machine Learning, 2019:424–433.
Nesterov Y. A method for unconstrained convex minimization problem with the rate of convergence \(O(1/k^2)\). Doklady AN USSR. 1983;269:543–7.
Tieleman T, Hinton G. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, in: COURSERA: Neural Networks for Machine Learning. 2012.
Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res. 2011;12:2121–59.
Ghadimi E, Feyzmahdavian HR, Johansson M. Global convergence of the heavy-ball method for convex optimization, in: Proceedings of The European Control Conference. 2015:310–315.
Yang X, Zheng X, Gao H. SGD-Based Adaptive NN Control Design for Uncertain Nonlinear Systems. IEEE Transactions on Neural Networks and Learning Systems. 2018;29(10):5071–83.
Peng Y, Hao Z, Yun X. Lock-free parallelization for variance-reduced stochastic gradient descent on streaming data. IEEE Trans Parallel Distrib Syst. 2020;31:2220–31.
Perantonis SJ, Karras DA. An efficient constrained learning algorithm with momentum acceleration. Neural Netw. 1995;8:237–49.
Kingma DP, Ba JL. Adam: A method for stochastic optimization, in: Proceedings of the 3rd International Conference on Learning Representations. 2015:1–15.
Gu G, Dogandžić A. Projected nesterov’s proximal-gradient algorithm for sparse signal recovery. IEEE Trans Signal Process. 2017;65:3510–25.
Chen J, Zhou D, Tang Y, et al. Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks, in: Twenty-Ninth International Joint Conference on Artificial Intelligence and Seventeenth Pacific Rim International Conference on Artificial Intelligence. 2020.
Reddi SJ, Kale S, Kumar S. On the convergence of Adam and beyond, in: Proceedings of the Sixth International Conference on Learning Representations. 2018:1–23.
Li W, Zhang Z, Wang X, Luo P. Adax: Adaptive gradient descent with exponential long term memory. 2020. https://arxiv.org/abs/2004.09740
Luo L, Xiong Y, Liu Y, Sun X. Adaptive gradient methods with dynamic bound of learning rate, in: Proceedings of the Seventh International Conference on Learning Representations. 2019:1–19.
Zhou Z, Zhang Q, Lu G, Wang H, Zhang W, Yu Y. AdaShift: Decorrelation and convergence of adaptive learning rate methods, in: Proceedings of the Seventh International Conference on Learning Representations. 2019:1–26.
Hazan E, Kale S. Projection-free online learning, in: Proceedings of the 29th International Conference on Machine Learning, 2012:1–8.
Balles L, Hennig P. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients, in: Proceedings of the 35th International Conference on Machine Learning, PMLR 80:404-413. 2018.
Chen L, Harshaw C, Hassani H, Karbasi A. Projection-free online optimization with stochastic gradient: From convexity to submodularity, in: Proceedings of the 35th International Conference on Machine Learning. 2018:813–822.
Hazan E, Minasyan E. Faster projection-free online learning, in: Proceedings of the 33rd Annual Conference on Learning Theory. 2020:1877–1893.
Zhang M, Zhou Y, Quan W, Zhu J, Zheng R, Wu Q. Online learning for IoT optimization: A Frank-Wolfe Adam based algorithm. IEEE Internet Things J. 2020;7:8228–37.
Zinkevich M. Online convex programming and generalized infinitesimal gradient ascent, in: Proceedings of the Twentieth International Conference on Machine Learning. 2003:928–936.
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016:770–778.
Huang G, Liu Z, Maaten L, Weinberger KQ. Densely connected convolutional networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017:1–9.
Berrada L, Zisserman A, Kumar MP. Deep Frank-Wolfe For Neural Network Optimization, in: Proceedings of the International Conference on Learning Representations. 2019.
Lu H, Jin L, Luo X, et al. RNN for Solving Perturbed Time-Varying Underdetermined Linear System With Double Bound Limits on Residual Errors and State Variables. IEEE Trans Industr Inf. 2019;15(11):5931–42.
Xin L, Zhou M, Shang M, Xia Y. A Novel Approach to Extracting Non-Negative Latent Factors From Non-Negative Big Sparse Matrices. IEEE Access. 2016;4:2649–55.
Luo X, Zhou MC, Li S, et al. Algorithms of Unconstrained Non-negative Latent Factor Analysis for Recommender Systems. IEEE Transactions on Big Data. 2021;7(1):227–40.
Funding
This work was partially supported by Chinese Academy of Sciences under grant No. Y9BEJ11001 and the innovation workstation of Suzhou Institute of Nano-Tech and Nano-Bionics (SINANO) under grant No. E010210101. This work was also partially supported by National Natural Science Foundation of China under no.61876155 and Jiangsu Science and Technology Programme under no. BE2020006-4.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Conflict of Interest
The authors declare that they have no conflict of interest.
Appendix of LightAdam
Proof of Lemma 1
Proof
From the definition of \(z_{t}\) and \(\mathbf {x}_{t+1}\), we obtain
Based on the fact that \(Y_t(\mathbf {x})\) is 2-smooth, we have
Moreover, from the definition of \(\mathbf {s}_t\), we have \(\mathbf {s}_t,\mathbf {x}_t\in \mathcal {F}\). Therefore, from Assumption 3 and the definition of \(\xi _t\), we obtain
According to the definition that \(\mathbf {s}_t:=\arg \min _{\mathbf {x}\in \mathcal {F}}\left\langle \nabla Y_t(\mathbf {x}_t),\mathbf {x} \right\rangle\), we have
Based on Equation (34) and the convexity of \(Y_t(\mathbf {x})\), we obtain
Furthermore, inserting Equation (35) into Equation (33), we obtain
Since
we have \(Y_t(\mathbf {x}_{t}^*)\le Y_t(\mathbf {x}_{t+1}^*).\) Furthermore, from the definition of \(z_{t+1},\) we obtain
Moreover, by the definition of \(Y_{t+1}(\mathbf {x})\), we obtain
In addition, combining Equations (37) with (38), we have
Furthermore, applying the Cauchy-Schwarz inequality into Equation (39), we obtain
Next, according to Assumption 2, and using the recursion in Equation (12), we obtain
By definition, \(Y_t(\mathbf {x})\) is 2-strongly convex. Meanwhile, from the optimality of \(\mathbf {x}_t^*\) and Definition 2, for any \(\mathbf {x}\in \mathcal {F}\), we obtain
Let \(\mathbf {x}=\mathbf {x}_{t+1}\), and for time \(t+1\), we obtain
Therefore, combining Equations (36), (40), (41) and (43), we have
Consequently, we obtain Lemma 1 through the above analysis.
Proof of Lemma 2
Proof
To compare the terms \(\frac{1}{\sqrt{t}}\left( 1-\frac{1}{2\sqrt{t}}\right)\) and \(\frac{1}{\sqrt{t+1}}\), we directly calculate the difference of their squares as follows:
From the fact that \(5t+1\le 4\sqrt{t}(t+1)\) for all \(t\ge 1\), we have
Therefore, by Equation (46), we attain the result of Lemma 2.
The proof of Lemma 2 is completed.
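As a sanity check, the comparison that Lemma 2 establishes, which we read as \(\frac{1}{\sqrt{t}}\left(1-\frac{1}{2\sqrt{t}}\right)\le \frac{1}{\sqrt{t+1}}\) for all \(t\ge 1\) (the direction of the inequality is our assumption, inferred from the squared difference above), can be verified numerically with a minimal Python sketch:

```python
import math

# Numerical check of the comparison in Lemma 2. The direction assumed here is
#   (1/sqrt(t)) * (1 - 1/(2*sqrt(t))) <= 1/sqrt(t+1)   for all t >= 1,
# which follows from 5t + 1 <= 4*sqrt(t)*(t+1) after squaring both sides.
def lhs(t: int) -> float:
    return (1.0 / math.sqrt(t)) * (1.0 - 1.0 / (2.0 * math.sqrt(t)))

def rhs(t: int) -> float:
    return 1.0 / math.sqrt(t + 1)

# Holds for every t checked; the gap shrinks as t grows but stays positive.
assert all(lhs(t) <= rhs(t) for t in range(1, 200000))
```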
Proof of Lemma 3
Proof
Since the parameters chosen by our proposed algorithm satisfy
from Equation (24) we obtain
Next, we prove Equation (26) by mathematical induction. First, for the case \(t=1\), we have
Thus, the case \(t=1\) holds. Second, assume that Equation (26) is true at time \(t\); we now consider time \(t+1\). From Equation (24) and the relationship
we obtain
In addition, applying Lemma 2 to Equation (50), we attain
By Equation (51), Equation (26) is true at time \(t+1\); therefore, by induction, it holds for all \(t\in \{1,\ldots ,T\}\). The proof of Lemma 3 is completed.
Proof of Theorem 1
Proof
Denote \(\mathbf {x}^*:=\arg \min _{\mathbf {x}\in \mathcal {F}}\sum _{t=1}^T f_t(\mathbf {x})\). According to the definition of the regret \(\mathcal {R}(T)\), we have
To get the bound of \(\mathcal {R}(T)\), we first consider the term
in Equation (52). By Assumption 1, the function \(f_t(\mathbf {x})\) is \(L\)-Lipschitz for all \(t\in \{1,\ldots ,T\}\). Moreover, by Definition 3, we obtain
Moreover, summing Equation (53) over \(t\), we have
Setting \(\mathbf {x}=\mathbf {x}_t\) in Equation (42) and applying Lemma 3, we attain
By the definition of the definite integral, we have the relationship
Therefore, combining Equations (54) and (55), we obtain
Now, the bound of \(\sum _{t=1}^T \big \vert f_t(\mathbf {x}_t) - f_t(\mathbf {x}_t^*)\big \vert\) is obtained. Next, we turn to calculate the bound of the term
in Equation (52). By the smoothness of \(f_t(\mathbf {x})\) and Definition 4, we have
Moreover, from the convexity of \(f_t(\mathbf {x})\), Definition 1, and the optimality of \(\mathbf {x}^*\), we further obtain
In addition, from Equation (58), and applying the Cauchy-Schwarz inequality, we attain
Applying Assumptions 2 and 3, and from Equation (55), we further have
Next, summing both sides of Equation (60) over \(t\), we obtain
Substituting the inequalities
and
into Equation (60), we attain
Finally, combining Equations (52), (56) and (62), we have that
Therefore, the stated bound of the regret \(\mathcal {R}(T)\) is obtained.
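For completeness, the definite-integral relationship invoked in the proof (Equation (55)) is presumably the standard bound on the sum of step sizes; a plausible reconstruction, consistent with the \(\sqrt{T}\)-type regret bound, is:

```latex
\sum_{t=1}^{T}\frac{1}{\sqrt{t}}
\;\le\; 1+\int_{1}^{T}\frac{\mathrm{d}t}{\sqrt{t}}
\;=\; 2\sqrt{T}-1
\;\le\; 2\sqrt{T}.
```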
Proof of Theorem 2
Proof
Following [18], we define a loss function as in Equation (30), whose minimum regret is obtained at \(x=0\). For Adam, we set \(\beta _1=0, 0<\sqrt{\beta _2}<\lambda <1,\) and \(\alpha _t = \alpha / \sqrt{t},\) where \(t\in \{1,\ldots ,T\}\). Then the gradient of \(f_t(x_{t,i})\) for \(x_{t,i}\ge 0\) is as follows
Moreover, from Equation (27) and applying the recursion, we obtain the following
Since \(\sqrt{\beta _2}<\lambda\), we have the following
By Equations (28) and (66), we attain the following
Since \(\sum _{\tau =1}^t\frac{1}{\sqrt{\tau }}\) diverges as \(t\rightarrow \infty\), Equation (67) implies that Adam always reaches the negative region when \(t\) is large enough.
For LightAdam, we also set \(\beta _1=0, 0<\sqrt{\beta _2}<\lambda <1,\) and \(\alpha _t = \alpha / \sqrt{t},\) where \(t\in \{1,\ldots ,T\}\). Then, by Equations (13) and (14), we obtain the following
By Equation (68), we further have the following
Moreover, from Equations (15), (18) and (69), we attain the following
From Equation (69), the step size of LightAdam has a lower bound and is therefore not destabilized by extreme gradients. In addition, from Equation (70), we observe that LightAdam is able to converge to the optimal solution if its step size and parameters are initialized suitably. Therefore, the proof of Theorem 2 is completed.
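The lower-bound argument can be illustrated with a small sketch. LightAdam's exact update rules (Equations (13)–(18)) involve quantities not restated in this appendix, so the code below only contrasts plain Adam's effective step size with a dynamically lower-bounded variant in the spirit of [23]; the function, constants, and bound shape are illustrative assumptions, not the paper's implementation:

```python
import math

# Illustrative sketch only: LightAdam's exact updates (Equations (13)-(18)) are
# not restated here. We contrast plain Adam's effective step size with a
# dynamically lower-bounded variant in the spirit of [23] (AdaBound-style).
def effective_steps(grads, alpha=0.1, beta2=0.999, lower_bound=None):
    """Per-iteration effective step size alpha_t / (sqrt(v_t) + eps)."""
    v, steps = 0.0, []
    for t, g in enumerate(grads, start=1):
        v = beta2 * v + (1.0 - beta2) * g * g          # second-order momentum
        step = (alpha / math.sqrt(t)) / (math.sqrt(v) + 1e-8)
        if lower_bound is not None:
            # Dynamic bound: the step never falls below lower_bound / sqrt(t).
            step = max(step, lower_bound / math.sqrt(t))
        steps.append(step)
    return steps

# One extreme gradient (1000.0) collapses the unbounded Adam step ...
grads = [1.0] * 50 + [1000.0] + [1.0] * 50
plain = effective_steps(grads)
bounded = effective_steps(grads, lower_bound=0.01)
assert plain[50] < bounded[50]  # ... but the bounded step cannot collapse
```

A single extreme gradient shrinks the plain Adam step by orders of magnitude, while the bounded step never falls below \(0.01/\sqrt{t}\), mirroring the claim that a lower-bounded step size is insensitive to extreme gradients.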
Zhou, Y., Huang, K., Cheng, C. et al. LightAdam: Towards a Fast and Accurate Adaptive Momentum Online Algorithm. Cogn Comput 14, 764–779 (2022). https://doi.org/10.1007/s12559-021-09985-9