Abstract
Stochastic gradient methods (SGMs) have been extensively used for solving stochastic optimization problems and large-scale machine learning problems. Recent works employ various techniques to improve the convergence rate of SGMs for both convex and nonconvex cases. Most of them require a large number of samples in some or all iterations. In this paper, we propose a new SGM, named PStorm, for solving nonconvex nonsmooth stochastic problems. With a momentum-based variance reduction technique, PStorm achieves the optimal complexity \(O(\varepsilon ^{-3})\) for producing a stochastic \(\varepsilon \)-stationary solution, provided a mean-squared smoothness condition holds. Unlike existing optimal methods, PStorm attains the \({O}(\varepsilon ^{-3})\) result using only one or O(1) samples per update. With this property, PStorm can be applied to online learning problems that favor real-time decisions based on one or O(1) new observations. In addition, for large-scale machine learning problems, PStorm trained with small batches can generalize better than the vanilla SGM and than other optimal methods that require large-batch training, as we demonstrate by training a sparse fully connected neural network and a sparse convolutional neural network.
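The one-sample, momentum-based variance-reduced update described above (in the style of STORM, combined with a proximal step for the nonsmooth term) can be sketched as follows. The toy objective, step sizes, and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of lam*||x||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def storm_prox(grad, sample, x0, steps, eta=0.05, beta=0.1, lam=0.1):
    """Sketch of a one-sample momentum-based variance-reduced proximal SGM.

    Gradient estimator (STORM-style):
        d_t = g(x_t; xi_t) + (1 - beta) * (d_{t-1} - g(x_{t-1}; xi_t)),
    where the SAME single sample xi_t is evaluated at x_t and x_{t-1};
    a proximal gradient step then handles the nonsmooth l1 regularizer.
    """
    x_prev = x0.copy()
    d = grad(x_prev, sample())              # initialize estimator with one sample
    x = prox_l1(x_prev - eta * d, eta * lam)
    for _ in range(steps):
        xi = sample()                       # one fresh sample per iteration
        d = grad(x, xi) + (1.0 - beta) * (d - grad(x_prev, xi))
        x_prev, x = x, prox_l1(x - eta * d, eta * lam)
    return x

# Toy problem: minimize E[0.5*||x - xi||^2] + lam*||x||_1, xi ~ N(mu, 0.01*I);
# the minimizer is mu soft-thresholded at level lam.
rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5, 0.05])
x_hat = storm_prox(
    grad=lambda x, xi: x - xi,
    sample=lambda: mu + 0.1 * rng.standard_normal(3),
    x0=np.zeros(3), steps=3000, lam=0.1,
)
# x_hat approaches the soft-thresholding of mu at level lam
```

Note that every iteration draws exactly one sample; the momentum term `(1 - beta) * (d - grad(x_prev, xi))` is what reduces the variance of the estimator without requiring the large batches used by other optimal methods.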
Notes
Throughout the paper, we use \({\tilde{O}}\) to suppress an additional polynomial term of \(|\log \varepsilon |\).
By “optimal,” we mean that the complexity result matches the lower bound; a result is “near-optimal” if it exceeds the lower bound by a logarithmic factor or a polynomial of logarithmic factors.
Acknowledgements
We thank two anonymous referees for their constructive comments and suggestions to improve the quality and contributions of the paper. This work is partly supported by NSF grants DMS-2053493 and DMS-2208394 and RPI-IBM AIRC.
Additional information
Communicated by Amir Beck.
About this article
Cite this article
Xu, Y., Xu, Y. Momentum-Based Variance-Reduced Proximal Stochastic Gradient Method for Composite Nonconvex Stochastic Optimization. J Optim Theory Appl 196, 266–297 (2023). https://doi.org/10.1007/s10957-022-02132-w