Abstract
In this paper, we propose and analyze zeroth-order stochastic approximation algorithms for nonconvex and convex optimization, focusing on constrained optimization, high-dimensional settings, and saddle-point avoidance. To handle constrained optimization, we first propose generalizations of the conditional gradient algorithm that achieve rates similar to those of the standard stochastic gradient algorithm while using only zeroth-order information. To facilitate zeroth-order optimization in high dimensions, we explore the advantages of structural sparsity assumptions. Specifically, (i) we highlight an implicit regularization phenomenon in which the standard stochastic gradient algorithm with zeroth-order information adapts to the sparsity of the problem at hand simply by varying the step size, and (ii) we propose a truncated stochastic gradient algorithm with zeroth-order information whose rate of convergence depends only poly-logarithmically on the dimensionality. We next focus on avoiding saddle points in the nonconvex setting. To that end, we interpret the Gaussian smoothing technique for estimating the gradient from zeroth-order information as an instantiation of the first-order Stein’s identity. Building on this, we provide a novel linear-time (in the dimension) estimator of the Hessian matrix of a function that uses only zeroth-order information and is based on the second-order Stein’s identity. We then provide a zeroth-order variant of the cubic-regularized Newton method for avoiding saddle points and discuss its rate of convergence to local minima.
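To make the two Stein-type estimators mentioned in the abstract concrete, the following is a minimal sketch in Python with NumPy. It assumes the standard Gaussian smoothing forms: a forward-difference gradient estimate weighted by a Gaussian direction u, and a symmetric second-difference Hessian estimate weighted by (u uᵀ − I). The abstract itself does not state these formulas, so the exact expressions, the function names `zo_gradient` and `zo_hessian`, and the parameters `nu` and `n_samples` are illustrative assumptions rather than the authors' definitions.

```python
import numpy as np


def zo_gradient(f, x, nu=1e-3, n_samples=1000, rng=None):
    """Monte Carlo Gaussian-smoothing gradient estimate (first-order Stein form).

    Assumed form: average of (f(x + nu*u) - f(x)) / nu * u over u ~ N(0, I).
    Uses only function evaluations of f.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal((n_samples, x.size))
    fx = f(x)
    # One scalar finite difference per random direction.
    diffs = np.array([f(x + nu * ui) - fx for ui in u]) / nu
    return (diffs[:, None] * u).mean(axis=0)


def zo_hessian(f, x, nu=1e-2, n_samples=1000, rng=None):
    """Monte Carlo Hessian estimate (second-order Stein form).

    Assumed form: average of
        (f(x + nu*u) + f(x - nu*u) - 2 f(x)) / (2 nu^2) * (u u^T - I)
    over u ~ N(0, I). Uses only function evaluations of f.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x.size
    u = rng.standard_normal((n_samples, d))
    fx = f(x)
    coeffs = np.array(
        [f(x + nu * ui) + f(x - nu * ui) - 2.0 * fx for ui in u]
    ) / (2.0 * nu**2)
    # Mean of c_i * u_i u_i^T, then subtract mean(c_i) * I.
    weighted_outer = np.einsum("n,ni,nj->ij", coeffs, u, u) / n_samples
    return weighted_outer - coeffs.mean() * np.eye(d)


if __name__ == "__main__":
    # Hypothetical test problem: a quadratic with known gradient and Hessian.
    A = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 3.0]])
    quad = lambda z: 0.5 * z @ A @ z
    x0 = np.array([1.0, -1.0, 0.5])
    rng = np.random.default_rng(0)
    print("estimated gradient:", zo_gradient(quad, x0, n_samples=20000, rng=rng))
    print("true gradient:     ", A @ x0)
    print("estimated Hessian:\n", zo_hessian(quad, x0, n_samples=20000, rng=rng))
```

Each sample costs a constant number of function evaluations plus work linear in the dimension for the gradient; the sketch forms the full d × d Hessian estimate for readability, which is quadratic in the dimension per sample. Both estimators are noisy, so many samples are averaged here purely for illustration.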
Notes
We remark that our step size choice requires knowledge of a rough upper bound on the true sparsity parameter.
For the definition of an almost-differentiable function, we refer the reader to Definition 1 in [75].
References
Agarwal, A., Dekel, O., Xiao, L.: Optimal algorithms for online convex optimization with multi-point bandit feedback. In: Proceedings of The 23rd Conference on Learning Theory, pp. 28–40 (2010)
Akhavan, A., Pontil, M., Tsybakov, A.: Exploiting higher order smoothness in derivative-free optimization and continuous bandits. In: Advances in Neural Information Processing Systems, vol. 33 (2020)
Allen-Zhu, Z.: Natasha 2: Faster non-convex optimization than SGD. In: Advances in Neural Information Processing Systems, pp. 2680–2691 (2018)
Bach, F., Perchet, V.: Highly-smooth zero-th order online optimization. In: V. Feldman, A. Rakhlin, O. Shamir (eds.) 29th Annual Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 49, pp. 257–283. PMLR (2016)
Beck, A.: First-Order Methods in Optimization, vol. 25. Society for Industrial and Applied Mathematics (SIAM) (2017)
Belloni, A., Liang, T., Narayanan, H., Rakhlin, A.: Escaping the local minima via simulated annealing: Optimization of approximately convex functions. In: P. Grunwald, E. Hazan, S. Kale (eds.) Proceedings of The 28th Conference on Learning Theory, Proceedings of Machine Learning Research, vol. 40, pp. 240–265. PMLR (2015)
Ben-Tal, A., Nemirovski, A.: Lectures on modern convex optimization: analysis, algorithms, and engineering applications, vol. 2. Society for Industrial and Applied Mathematics (SIAM) (2001)
Bertsekas, D.P.: Nonlinear programming. Athena Scientific, Belmont (2016)
Bertsekas, D.P.: Convex optimization algorithms. Athena Scientific, Belmont (2015)
Bhojanapalli, S., Neyshabur, B., Srebro, N.: Global optimality of local search for low rank matrix recovery. In: Advances in Neural Information Processing Systems, pp. 3873–3881 (2016)
Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge University Press (2004)
Bubeck, S., Cesa-Bianchi, N.: Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning 5(1), 1–122 (2012)
Bubeck, S., Lee, Y.T., Eldan, R.: Kernel-based methods for bandit convex optimization. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 72–85 (2017)
Cai, H., Mckenzie, D., Yin, W., Zhang, Z.: Zeroth-order regularized optimization (ZORO): Approximately sparse gradients and adaptive sampling (2020)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for nonconvex optimization. SIAM Journal on Optimization 28(2), 1751–1772 (2018)
Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization, Part I: Motivation, convergence and numerical results. Mathematical Programming 127(2), 245–295 (2011)
Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization, Part II: Worst-case function- and derivative-evaluation complexity. Mathematical Programming 130(2), 295–319 (2011)
Cartis, C., Gould, N.I., Toint, P.L.: Second-order optimality and beyond: Characterization and evaluation complexity in convexly constrained nonlinear optimization. Foundations of Computational Mathematics 18(5), 1073–1107 (2018)
Chen, L., Zhang, M., Hassani, H., Karbasi, A.: Black box submodular maximization: Discrete and continuous settings. In: S. Chiappa, R. Calandra (eds.) Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 108, pp. 1058–1070 (2020)
Chen, P.Y., Zhang, H., Sharma, Y., Yi, J., Hsieh, C.J.: ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In: Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 15–26. ACM (2017)
Choromanski, K., Rowland, M., Sindhwani, V., Turner, R., Weller, A.: Structured evolution with compact architectures for scalable policy optimization. In: Proceedings of the 35th International Conference on Machine Learning. PMLR (2018)
Conn, A., Scheinberg, K., Vicente, L.: Introduction to derivative-free optimization, vol. 8. Society for Industrial and Applied Mathematics (SIAM) (2009)
Dani, V., Kakade, S.M., Hayes, T.P.: The price of bandit information for online optimization. In: Advances in Neural Information Processing Systems, pp. 345–352 (2008)
Demyanov, V., Rubinov, A.: Approximate methods in optimization problems. American Elsevier Publishing (1970)
DeVore, R., Petrova, G., Wojtaszczyk, P.: Approximation of functions of few variables in high dimensions. Constructive Approximation 33(1), 125–143 (2011)
Donoho, D.L.: Compressed sensing. IEEE Transactions on Information Theory 52(4), 1289–1306 (2006)
Duchi, J., Jordan, M., Wainwright, M., Wibisono, A.: Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory 61(5), 2788–2806 (2015)
Elibol, M., Lei, L., Jordan, M.I.: Variance reduction with sparse gradients. In: Proceedings of the 8th International Conference on Learning Representations (ICLR), pp. 1058–1070 (2020)
Erdogdu, M.A.: Newton-Stein method: an optimization method for GLMs via Stein’s lemma. The Journal of Machine Learning Research 17(1), 7565–7616 (2016)
Frank, M., Wolfe, P.: An algorithm for quadratic programming. Naval Research Logistics Quarterly 3, 95–110 (1956)
Gasnikov, A.V., Krymova, E.A., Lagunovskaya, A.A., Usmanova, I.N., Fedorenko, F.A.: Stochastic online optimization. Single-point and multi-point non-linear multi-armed bandits. Convex and strongly-convex case. Automation and Remote Control 78(2), 224–234 (2017)
Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points: Online stochastic gradient for tensor decomposition. In: Conference on Learning Theory, pp. 797–842 (2015)
Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: Advances in Neural Information Processing Systems, pp. 2973–2981 (2016)
Ghadimi, S.: Conditional gradient type methods for composite nonlinear and stochastic optimization. Mathematical Programming (2018). https://doi.org/10.1007/s10107-017-1225-5
Ghadimi, S., Lan, G.: Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4), 2341–2368 (2013)
Han, C., Yuan, M.: Information based complexity for high dimensional sparse functions. Journal of Complexity 57, 101443 (2020)
Hazan, E., Kale, S.: Projection-free online learning. In: Proceedings of the 29th International Conference on Machine Learning, pp. 1843–1850 (2012)
Hazan, E., Levy, K.: Bandit convex optimization: Towards tight bounds. In: Advances in Neural Information Processing Systems, pp. 784–792 (2014)
Hazan, E., Luo, H.: Variance-reduced and projection-free stochastic optimization. In: International Conference on Machine Learning, pp. 1263–1271 (2016)
Hearn, D.: The gap function of a convex program. Operations Research Letters 2, 95–110 (1982)
Hu, X., Prashanth, L.A., György, A., Szepesvari, C.: (Bandit) Convex Optimization with Biased Noisy Gradient Oracles. In: The 19th International Conference on Artificial Intelligence and Statistics, pp. 3420–3428 (2016)
Jaggi, M.: Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In: Proceedings of the 30th International Conference on Machine Learning, pp. 427–435 (2013)
Jain, P., Kar, P.: Non-convex optimization for machine learning. Foundations and Trends® in Machine Learning 10(3-4), 142–336 (2017)
Jain, P., Tewari, A., Kar, P.: On iterative hard thresholding methods for high-dimensional m-estimation. In: Advances in Neural Information Processing Systems, pp. 685–693 (2014)
Jamieson, K., Nowak, R., Recht, B.: Query complexity of derivative-free optimization. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2012)
Jin, C., Ge, R., Netrapalli, P., Kakade, S.M., Jordan, M.I.: How to escape saddle points efficiently. In: International Conference on Machine Learning, pp. 1724–1732 (2017)
Kawaguchi, K., Kaelbling, L.P.: Elimination of all bad local minima in deep learning. arXiv:1901.00279
Lan, G., Zhou, Y.: Conditional gradient sliding for convex optimization. SIAM Journal on Optimization 26(2), 1379–1409 (2016)
Lattimore, T.: Improved regret for zeroth-order adversarial bandit convex optimisation. arXiv:2006.00475
Li, J., Balasubramanian, K., Ma, S.: Stochastic zeroth-order Riemannian derivative estimation and optimization. arXiv:2003.11238 (2020)
Mania, H., Guy, A., Recht, B.: Simple random search provides a competitive approach to reinforcement learning. In: Advances in Neural Information Processing Systems (2018)
Minsker, S.: Sub-gaussian estimators of the mean of a random matrix with heavy-tailed entries. The Annals of Statistics 46(6A), 2871–2903 (2018)
Mockus, J.: Bayesian approach to global optimization: theory and applications, vol. 37. Springer Science & Business Media (2012)
Mokhtari, A., Hassani, H., Karbasi, A.: Conditional gradient method for stochastic submodular maximization: Closing the gap. In: International Conference on Artificial Intelligence and Statistics, pp. 1886–1895 (2018)
Mokhtari, A., Hassani, H., Karbasi, A.: Stochastic conditional gradient methods: From convex minimization to submodular maximization. Journal of Machine Learning Research 21, 1–49 (2020)
Murty, K.G., Kabadi, S.N.: Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming 39(2), 117–129 (1987)
Nemirovski, A.S., Yudin, D.: Problem complexity and method efficiency in optimization. Wiley-Interscience Series in Discrete Mathematics. John Wiley, XV (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization: a basic course. Kluwer Academic Publishers, Massachusetts (2004)
Nesterov, Y.: Introductory lectures on convex optimization: A basic course, vol. 87. Springer Science & Business Media (2013)
Nesterov, Y., Polyak, B.: Cubic regularization of Newton method and its global performance. Mathematical Programming 108(1), 177–205 (2006)
Nesterov, Y., Spokoiny, V.: Random gradient-free minimization of convex functions. Foundations of Computational Mathematics 17, 527–566 (2017)
Nesterov, Y.: Implementable tensor methods in unconstrained convex optimization. Mathematical Programming 186, 157–183 (2021)
Nocedal, J., Wright, S.J.: Numerical optimization. Springer Science & Business Media (2006)
Raskutti, G., Wainwright, M.J., Yu, B.: Minimax-optimal rates for sparse additive models over kernel classes via convex programming. The Journal of Machine Learning Research 13(1), 389–427 (2012)
Reddi, S., Sra, S., Póczos, B., Smola, A.: Stochastic Frank-Wolfe Methods for Nonconvex Optimization. In: Proceedings of the 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1244–1251 (2016)
Reddi, S., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., Smola, A.: A generic approach for escaping saddle points. In: International Conference on Artificial Intelligence and Statistics, pp. 1233–1242 (2018)
Rio, E.: Moment inequalities for sums of dependent random variables under projective conditions. Journal of Theoretical Probability 22(1), 146–163 (2009)
Rubinstein, R., Kroese, D.: Simulation and the Monte Carlo method, vol. 10. John Wiley & Sons, New Jersey (2016)
Saha, A., Tewari, A.: Improved regret guarantees for online smooth convex optimization with bandit feedback. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 636–642 (2011)
Salimans, T., Ho, J., Chen, X., Sidor, S., Sutskever, I.: Evolution strategies as a scalable alternative to reinforcement learning. arXiv:1703.03864
Shamir, O.: On the complexity of bandit and derivative-free stochastic convex optimization. In: Conference on Learning Theory, pp. 3–24 (2013)
Snoek, J., Larochelle, H., Adams, R.: Practical Bayesian optimization of machine learning algorithms. In: Advances in Neural Information Processing Systems, pp. 2951–2959 (2012)
Spall, J.: Introduction to stochastic search and optimization: estimation, simulation, and control, vol. 65. John Wiley & Sons, New Jersey (2005)
Stein, C.: A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In: Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California (1972)
Stein, C.M.: Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pp. 1135–1151 (1981)
Sun, J., Qu, Q., Wright, J.: When are nonconvex problems not scary? arXiv:1510.06096
Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Foundations of Computational Mathematics 18(5), 1131–1198 (2018)
Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.: Stochastic cubic regularization for fast nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 2899–2908 (2018)
Tropp, J.A.: The expected norm of a sum of independent random matrices: An elementary approach. In: High Dimensional Probability VII, pp. 173–202. Springer (2016)
Tyagi, H., Kyrillidis, A., Gärtner, B., Krause, A.: Algorithms for learning sparse additive models with interactions in high dimensions. Information and Inference: A Journal of the IMA 7(2), 183–249 (2018)
Wang, Y., Du, S., Balakrishnan, S., Singh, A.: Stochastic zeroth-order optimization in high dimensions. In: A. Storkey, F. Perez-Cruz (eds.) Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 84, pp. 1356–1365 (2018)
Wojtaszczyk, P.: Complexity of approximation of functions of few variables in high dimensions. Journal of Complexity 27(2), 141–150 (2011)
Xu, P., Roosta-Khorasani, F., Mahoney, M.W.: Newton-type methods for non-convex optimization under inexact Hessian information. Mathematical Programming 184, 35–70 (2020)
Additional information
Communicated by Francis Bach.
About this article
Cite this article
Balasubramanian, K., Ghadimi, S. Zeroth-Order Nonconvex Stochastic Optimization: Handling Constraints, High Dimensionality, and Saddle Points. Found Comput Math 22, 35–76 (2022). https://doi.org/10.1007/s10208-021-09499-8
DOI: https://doi.org/10.1007/s10208-021-09499-8
Keywords
- Zeroth-order methods
- Nonconvex optimization
- Stochastic optimization
- Complexity bounds
- Conditional gradient methods
- Newton method