Abstract
Stochastic gradient methods (SGMs) have been extensively used for solving stochastic optimization problems and large-scale machine learning problems. Recent works employ various techniques to improve the convergence rate of SGMs for both convex and nonconvex cases. Most of them require a large number of samples in some or all iterations. In this paper, we propose a new SGM, named PStorm, for solving nonconvex nonsmooth stochastic problems. With a momentum-based variance reduction technique, PStorm achieves the optimal complexity \(O(\varepsilon ^{-3})\) for producing a stochastic \(\varepsilon \)-stationary solution, provided a mean-squared smoothness condition holds. Unlike existing optimal methods, PStorm attains the \({O}(\varepsilon ^{-3})\) result using only one or O(1) samples per update. With this property, PStorm can be applied to online learning problems that favor real-time decisions based on one or O(1) new observations. In addition, for large-scale machine learning problems, PStorm trained with small batches can generalize better than the vanilla SGM and than other optimal methods that require large-batch training, as we demonstrate by training a sparse fully connected neural network and a sparse convolutional neural network.
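The one-sample, momentum-based variance-reduced update described above (in the style of STORM, combined with a proximal step for the nonsmooth term) can be sketched as follows. The toy objective, step sizes, and function names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def prox_l1(x, lam):
    """Proximal operator of lam*||x||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def storm_prox(grad, sample, x0, steps, eta=0.05, beta=0.1, lam=0.1):
    """Sketch of a one-sample momentum-based variance-reduced proximal SGM.

    Gradient estimator (STORM-style):
        d_t = g(x_t; xi_t) + (1 - beta) * (d_{t-1} - g(x_{t-1}; xi_t)),
    where the SAME single sample xi_t is evaluated at x_t and x_{t-1};
    a proximal gradient step then handles the nonsmooth l1 regularizer.
    """
    x_prev = x0.copy()
    d = grad(x_prev, sample())              # initialize estimator with one sample
    x = prox_l1(x_prev - eta * d, eta * lam)
    for _ in range(steps):
        xi = sample()                       # one fresh sample per iteration
        d = grad(x, xi) + (1.0 - beta) * (d - grad(x_prev, xi))
        x_prev, x = x, prox_l1(x - eta * d, eta * lam)
    return x

# Toy problem: minimize E[0.5*||x - xi||^2] + lam*||x||_1, xi ~ N(mu, 0.01*I);
# the minimizer is mu soft-thresholded at level lam.
rng = np.random.default_rng(0)
mu = np.array([1.0, -0.5, 0.05])
x_hat = storm_prox(
    grad=lambda x, xi: x - xi,
    sample=lambda: mu + 0.1 * rng.standard_normal(3),
    x0=np.zeros(3), steps=3000, lam=0.1,
)
# x_hat approaches the soft-thresholding of mu at level lam
```

Note that every iteration draws exactly one sample; the momentum term `(1 - beta) * (d - grad(x_prev, xi))` is what reduces the variance of the estimator without requiring the large batches used by other optimal methods.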
Notes
Throughout the paper, we use \({\tilde{O}}\) to suppress an additional polynomial term of \(|\log \varepsilon |\).
By “optimal,” we mean that the complexity result matches the lower bound; a result is “near-optimal” if it exceeds the lower bound by a logarithmic factor or a polynomial of logarithmic factors.
Acknowledgements
We thank two anonymous referees for their constructive comments and suggestions to improve the quality and contributions of the paper. This work is partly supported by NSF grants DMS-2053493 and DMS-2208394 and RPI-IBM AIRC.
Additional information
Communicated by Amir Beck.
About this article
Cite this article
Xu, Y., Xu, Y. Momentum-Based Variance-Reduced Proximal Stochastic Gradient Method for Composite Nonconvex Stochastic Optimization. J Optim Theory Appl 196, 266–297 (2023). https://doi.org/10.1007/s10957-022-02132-w