Abstract
We focus on minimizing nonconvex finite-sum functions that typically arise in machine learning problems. For this class of problems, the adaptive cubic-regularized Newton method has shown strong global convergence guarantees and the ability to escape from strict saddle points. In this paper, we extend this algorithm by incorporating the negative curvature method, so that progress is made even at unsuccessful iterations. We call this new method Stochastic Adaptive cubic regularization with Negative Curvature (SANC). Unlike previous methods, the SANC algorithm uses independent sets of data points of consistent size across all iterations to obtain stochastic gradient and Hessian estimators. This makes the SANC algorithm more practical for solving large-scale machine learning problems. To the best of our knowledge, this is the first approach that combines the negative curvature method with the adaptive cubic-regularized Newton method. Finally, we provide experimental results, including neural network problems, that support the efficiency of our method.
References
Curtis, F.E., Robinson, D.P.: Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412 (2017)
Liu, M., Li, Z., Wang, X., Yi, J., Yang, T.: Adaptive negative curvature descent with applications in non-convex optimization. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, pp. 4854–4863. MIT Press, Cambridge (2018)
Cano, J., Moguerza, J.M., Prieto, F.J.: Using improved directions of negative curvature for the solution of bound-constrained nonconvex problems. J. Optim. Theory Appl. 174(2), 474–499 (2017)
Reddi, S.J., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., Smola, A.J.: A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434 (2017)
Kuczyński, J., Woźniakowski, H.: Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl. 13(4), 1094–1122 (1992)
Oja, E.: Simplified neuron model as a principal component analyzer. J. Math. Biol. 15(3), 267–273 (1982)
Martens, J.: Deep learning via Hessian-free optimization. In: ICML vol. 27, pp. 735–742 (2010)
Martens, J., Sutskever, I.: Learning recurrent neural networks with Hessian-free optimization. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1033–1040. Citeseer (2011)
Agarwal, N., Bullins, B., Hazan, E.: Second-order stochastic optimization for machine learning in linear time. J. Mach. Learn. Res. 18(1), 4148–4187 (2017)
Vinyals, O., Povey, D.: Krylov subspace descent for deep learning. In: Gale, W.A. (ed.) Artificial Intelligence and Statistics, pp. 1261–1268. Addison-Wesley Pub. Co., Boston (2012)
Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)
Pearlmutter, B.A.: Fast exact multiplication by the Hessian. Neural Comput. 6(1), 147–160 (1994)
Griewank, A.: The modification of Newton's method for unconstrained optimization by bounding cubic terms. Technical report, NA/12 (1981)
Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)
Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim. 27(2), 927–956 (2017)
Wang, Z., Zhou, Y., Liang, Y., Lan, G.: Cubic regularization with momentum for nonconvex optimization. arXiv preprint arXiv:1810.03763 (2018)
Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127(2), 245–295 (2011)
Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function-and derivative-evaluation complexity. Math. Program. 130(2), 295–319 (2011)
Kohler, J.M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization. arXiv preprint arXiv:1705.05933 (2017)
Bergou, E.H., Diouane, Y., Gratton, S.: A line-search algorithm inspired by the adaptive cubic regularization framework and complexity analysis. J. Optim. Theory Appl. 178(3), 885–913 (2018)
Wang, X., Fan, N., Pardalos, P.M.: Stochastic subgradient descent method for large-scale robust chance-constrained support vector machines. Optim. Lett. 11(5), 1013–1024 (2017)
Carmon, Y., Duchi, J.C.: Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547 (2016)
Ritz, W.: Über eine neue methode zur lösung gewisser variationsprobleme der mathematischen physik. Journal für die reine und angewandte Mathematik (Crelles J.) 1909(135), 1–61 (1909)
Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915 (2016)
Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inf. Theory 57(3), 1548–1566 (2011)
Ghadimi, S., Liu, H., Zhang, T.: Second-order methods with cubic regularization under inexact information. arXiv preprint arXiv:1710.05782 (2017)
Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled newton methods II: local convergence rates. arXiv preprint arXiv:1601.04738 (2016)
Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM (2017)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2), 1751–1772 (2018)
Allen-Zhu, Z.: Natasha 2: faster non-convex optimization than SGD. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, pp. 2675–2686. MIT Press, Cambridge (2018)
Allen-Zhu, Z., Li, Y.: Neon2: finding local minima via first-order oracles. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, pp. 3716–3726. MIT Press, Cambridge (2018)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, Citeseer (2009)
Acknowledgements
This work was partially supported by the National Research Foundation (NRF) of Korea (NRF-2018R1D1A1B07043406). Panos M. Pardalos was supported by a Humboldt Research Award (Germany).
Additional information
Communicated by Alexander Mitsos.
Appendix A: Experiment Settings on Numerical Results
In “Appendix A”, we present some experimental results to show the effectiveness of the SANC algorithm for stochastic nonconvex optimization problems. In our numerical experiments, we considered three machine learning problems with real datasets, namely (i) logistic regression, (ii) the multilayer perceptron, and (iii) convolutional neural networks (CNN).
For the logistic regression problems, with a binary classification dataset, i.e., \(\{\mathbf {x}_i,y_i\}_{i=1}^n\), \(y_i\in \{0,1\}\), in order to find the optimum \(\mathbf {w}^*\), we solved the following problem,
$$\min _{\mathbf {w}\in \mathbb {R}^d}\ \frac{1}{n}\sum _{i=1}^{n}\Big [-y_i\log \big (\sigma (\mathbf {w}^{\top }\mathbf {x}_i)\big )-(1-y_i)\log \big (1-\sigma (\mathbf {w}^{\top }\mathbf {x}_i)\big )\Big ]+\lambda \,\Omega (\mathbf {w}),$$
where \(\sigma (z)=1/(1+e^{-z})\), \(\Omega (\mathbf {w})=\sum _{i=1}^d \frac{w_i^2}{1+w_i^2}\) is a nonconvex regularizer, and \(\lambda \) is a fixed regularization coefficient. We initialized all variables to \(\mathbf {w}_0=\mathbf {1}\).
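This objective can be sketched in a few lines of NumPy; the sketch assumes the standard cross-entropy loss with the sigmoid link, and the function name `objective` and the small `eps` guard against \(\log 0\) are ours, not from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w, X, y, lam):
    """Cross-entropy loss plus the nonconvex penalty
    Omega(w) = sum_i w_i^2 / (1 + w_i^2), scaled by lam."""
    p = sigmoid(X @ w)
    eps = 1e-12  # numerical guard against log(0)
    ce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    omega = np.sum(w ** 2 / (1.0 + w ** 2))
    return ce + lam * omega
```

Note that each summand of \(\Omega \) is bounded by 1, so the penalty discourages large weights without growing unboundedly, which is precisely what makes the overall problem nonconvex.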
For the multilayer perceptron, we used two hidden layers: the first of size 300 and the second of size 500, with hyperbolic tangent functions as activations. At the output layer, the softmax function was used. The cross-entropy loss was the objective function to be minimized, to which we added the \(\ell _2\) norm as a convex regularization term with coefficient \(\lambda =0.01\).
For the CNN, we used two convolutional receptive filters: the first is \(5\times 5\times 32\)-dimensional, and the second is \(5\times 5\times 64\)-dimensional. A fully connected layer with 1000 neurons was added at the output of the convolutional layers. The settings for the nonlinear activation functions and the \(\ell _2\) regularization are the same as those of the multilayer perceptron.
All the variables, including the weights and the biases in the multilayer perceptron and the CNN, were initialized by the Xavier initialization [32]. We used real datasets from LIBSVM [33] for the logistic regression problem and the multilayer perceptron. For the CNN, we used the MNIST [34] and CIFAR10 [35] datasets.
For the logistic regression problems, the sizes of \(\mathcal {S}_\mathbf {g}\) and \(\mathcal {S}_\mathbf {B}\) are \(\lceil \text {the number of data points}/20\rceil \). For the multilayer perceptron and the CNN, we computed the stochastic gradient and Hessian estimators using independently drawn mini-batches of size 128. These batch-size settings are used for SANC as well as for all the baselines described below.
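The independent, fixed-size sampling of \(\mathcal {S}_\mathbf {g}\) and \(\mathcal {S}_\mathbf {B}\) can be sketched as follows; `grad_fn` and `hess_vec_fn` are assumed user-supplied subsampled oracles, and the helper name is illustrative, not from the paper:

```python
import numpy as np

def subsampled_estimators(grad_fn, hess_vec_fn, n, batch=128, rng=None):
    """Draw two independent index sets of fixed size and return a
    stochastic gradient and a Hessian-vector-product closure.
    grad_fn(idx) / hess_vec_fn(idx, v) evaluate on the subsample idx."""
    rng = np.random.default_rng() if rng is None else rng
    idx_g = rng.choice(n, size=min(batch, n), replace=False)  # S_g
    idx_B = rng.choice(n, size=min(batch, n), replace=False)  # S_B, drawn independently of S_g
    g = grad_fn(idx_g)
    Bv = lambda v: hess_vec_fn(idx_B, v)  # Hessian estimator accessed via products
    return g, Bv
```

Keeping the two sets independent and of constant size is what distinguishes this scheme from sub-sampling strategies with growing sample sizes.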
We compared our method, SANC, with various optimization methods as follows:
SGD A fixed step size is used: 0.01 for the logistic regression problems and 0.001 for the multilayer perceptron and CNN problems, each the best choice within \(10^{-3:1:3}\).
SCR [19] We used the same parameter values as for SANC, and the same sets \(\mathcal {S}_\mathbf {g}\), \(\mathcal {S}_\mathbf {B}\) as in SANC, which differs from the experiments in [19], where sets of increasing size were used.
CR [14] The cubic coefficient was fixed at \(\sigma =5\) for all problems.
CRM [16] The cubic coefficient was fixed at \(\sigma =5\) for all problems. The momentum parameter \(\beta \) is set to \(8\times \Vert \mathbf {s}\Vert \) as in [16].
NCD [2] We used the same parameter values as for SANC.
As a default setting, we used \(\gamma =2\), \(\eta _1=0.2\), \(\eta _2=0.8\), and \(\sigma _0=1\); these parameter settings are the same as those in SCR [19]. For the neural network problems, we used \(\eta _1=0.1\) and \(\eta _2=0.3\), tuned by preliminary experiments. The \(L_1\) and \(L_2\) parameters in SANC were tuned over the search range \(10^{-3:1:3}\). The parameter values used are presented in Table 1.
For the Lanczos method, we truncated at five Lanczos iterations in all experiments. An approximate minimizer of the local cubic model over the Krylov subspace was obtained with the conjugate gradient method in scipy.optimize.minimize. To calculate the left-most eigenpair of \(\mathbf {T}\), we used eigh_tridiagonal from scipy.linalg.
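A sketch of this eigenpair computation, assuming access to a Hessian-vector-product oracle: the helper name and the random start are ours, while `eigh_tridiagonal` is the SciPy routine mentioned above. The Ritz vector of the tridiagonal matrix \(\mathbf {T}\) is mapped back through the Krylov basis, giving an approximate negative-curvature direction whenever the left-most Ritz value is negative.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def lanczos_leftmost(hess_vec, dim, k=5, rng=None):
    """Run up to k Lanczos iterations from a random start and return the
    left-most Ritz value and its vector in the original space."""
    rng = np.random.default_rng(0) if rng is None else rng
    q = rng.standard_normal(dim)
    q /= np.linalg.norm(q)
    Q, alpha, beta = [q], [], []
    for j in range(k):
        w = hess_vec(Q[j])                       # one Hessian-vector product per iteration
        a = Q[j] @ w
        alpha.append(a)
        w = w - a * Q[j] - (beta[-1] * Q[j - 1] if beta else 0.0)
        b = np.linalg.norm(w)
        if b < 1e-10 or j == k - 1:              # invariant subspace found, or budget reached
            break
        beta.append(b)
        Q.append(w / b)
    d, V = eigh_tridiagonal(np.array(alpha), np.array(beta))
    v = np.column_stack(Q) @ V[:, 0]             # lift the left-most Ritz vector back
    return d[0], v / np.linalg.norm(v)
```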
For SANC and SCR, we have to calculate \(f(\mathbf {x}_t)\) and \(f(\mathbf {x}_t+\mathbf {s}_t)\) to measure \(\rho _t\). For the neural network problems, because of memory limitations, we estimated \(f(\mathbf {x}_t)\) and \(f(\mathbf {x}_t+\mathbf {s}_t)\) on an additional, independently drawn set of the same size as \(\mathcal {S}_\mathbf {g}\) and \(\mathcal {S}_\mathbf {B}\).
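The ratio \(\rho _t\) can be sketched as follows, assuming the standard cubic model \(m(\mathbf {s})=\mathbf {g}^{\top }\mathbf {s}+\frac{1}{2}\mathbf {s}^{\top }\mathbf {B}\mathbf {s}+\frac{\sigma }{3}\Vert \mathbf {s}\Vert ^3\) of adaptive cubic regularization; the function name is ours, and `f_x`, `f_xs` stand for the subsampled estimates of \(f(\mathbf {x}_t)\) and \(f(\mathbf {x}_t+\mathbf {s}_t)\):

```python
import numpy as np

def rho(f_x, f_xs, g, Bv, s, sigma):
    """Ratio of actual decrease to the decrease predicted by the cubic
    model m(s) = g's + 0.5 s'Bs + (sigma/3)||s||^3, where Bv(s) = B s."""
    model_decrease = -(g @ s + 0.5 * s @ Bv(s) + sigma / 3.0 * np.linalg.norm(s) ** 3)
    return (f_x - f_xs) / model_decrease
```

An iteration is deemed successful when this ratio exceeds \(\eta _1\), and \(\sigma \) is decreased when it exceeds \(\eta _2\).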
All baselines are our own implementations in TensorFlow. The code is available at https://github.com/seonho-park/Stochastic-Adaptive-cubic-regularization-with-Negative-Curvature.
Cite this article
Park, S., Jung, S.H. & Pardalos, P.M. Combining Stochastic Adaptive Cubic Regularization with Negative Curvature for Nonconvex Optimization. J Optim Theory Appl 184, 953–971 (2020). https://doi.org/10.1007/s10957-019-01624-6
Keywords
- Adaptive cubic-regularized Newton method
- Cubic regularization
- Trust-region method
- Negative curvature
- Nonconvex optimization
- Worst-case complexity