
Combining Stochastic Adaptive Cubic Regularization with Negative Curvature for Nonconvex Optimization

Journal of Optimization Theory and Applications

Abstract

We focus on minimizing nonconvex finite-sum functions that typically arise in machine learning problems. For this class of problems, the adaptive cubic-regularized Newton method offers strong global convergence guarantees and the ability to escape strict saddle points. In this paper, we extend this algorithm by incorporating the negative curvature method, so that an update is performed even at unsuccessful iterations. We call the new method Stochastic Adaptive cubic regularization with Negative Curvature (SANC). Unlike previous methods, the SANC algorithm forms its stochastic gradient and Hessian estimators from independent sets of data points of constant size across all iterations, which makes it more practical for large-scale machine learning problems. To the best of our knowledge, this is the first approach that combines the negative curvature method with the adaptive cubic-regularized Newton method. Finally, we provide experimental results, including neural network problems, that support the efficiency of our method.


References

  1. Curtis, F.E., Robinson, D.P.: Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412 (2017)

  2. Liu, M., Li, Z., Wang, X., Yi, J., Yang, T.: Adaptive negative curvature descent with applications in non-convex optimization. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, pp. 4854–4863. MIT Press, Cambridge (2018)


  3. Cano, J., Moguerza, J.M., Prieto, F.J.: Using improved directions of negative curvature for the solution of bound-constrained nonconvex problems. J. Optim. Theory Appl. 174(2), 474–499 (2017)


  4. Reddi, S.J., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., Smola, A.J.: A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434 (2017)

  5. Kuczyński, J., Woźniakowski, H.: Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl. 13(4), 1094–1122 (1992)


  6. Oja, E.: Simplified neuron model as a principal component analyzer. J. Math. Biol. 15(3), 267–273 (1982)


  7. Martens, J.: Deep learning via Hessian-free optimization. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 735–742 (2010)

  8. Martens, J., Sutskever, I.: Learning recurrent neural networks with Hessian-free optimization. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1033–1040. Citeseer (2011)

  9. Agarwal, N., Bullins, B., Hazan, E.: Second-order stochastic optimization for machine learning in linear time. J. Mach. Learn. Res. 18(1), 4148–4187 (2017)


  10. Vinyals, O., Povey, D.: Krylov subspace descent for deep learning. In: Gale, W.A. (ed.) Artificial Intelligence and Statistics, pp. 1261–1268. Addison-Wesley Pub. Co., Boston (2012)


  11. Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)


  12. Pearlmutter, B.A.: Fast exact multiplication by the Hessian. Neural Comput. 6(1), 147–160 (1994)


  13. Griewank, A.: The modification of Newton's method for unconstrained optimization by bounding cubic terms. Technical report, NA/12 (1981)

  14. Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)


  15. Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim. 27(2), 927–956 (2017)


  16. Wang, Z., Zhou, Y., Liang, Y., Lan, G.: Cubic regularization with momentum for nonconvex optimization. arXiv preprint arXiv:1810.03763 (2018)

  17. Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127(2), 245–295 (2011)


  18. Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math. Program. 130(2), 295–319 (2011)


  19. Kohler, J.M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization. arXiv preprint arXiv:1705.05933 (2017)

  20. Bergou, E.H., Diouane, Y., Gratton, S.: A line-search algorithm inspired by the adaptive cubic regularization framework and complexity analysis. J. Optim. Theory Appl. 178(3), 885–913 (2018)


  21. Wang, X., Fan, N., Pardalos, P.M.: Stochastic subgradient descent method for large-scale robust chance-constrained support vector machines. Optim. Lett. 11(5), 1013–1024 (2017)


  22. Carmon, Y., Duchi, J.C.: Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547 (2016)

  23. Ritz, W.: Über eine neue Methode zur Lösung gewisser Variationsprobleme der mathematischen Physik. Journal für die reine und angewandte Mathematik (Crelle's Journal) 1909(135), 1–61 (1909)


  24. Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915 (2016)

  25. Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inf. Theory 57(3), 1548–1566 (2011)


  26. Ghadimi, S., Liu, H., Zhang, T.: Second-order methods with cubic regularization under inexact information. arXiv preprint arXiv:1710.05782 (2017)

  27. Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled Newton methods II: local convergence rates. arXiv preprint arXiv:1601.04738 (2016)

  28. Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM (2017)

  29. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2), 1751–1772 (2018)


  30. Allen-Zhu, Z.: Natasha 2: faster non-convex optimization than SGD. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, pp. 2675–2686. MIT Press, Cambridge (2018)


  31. Allen-Zhu, Z., Li, Y.: Neon2: finding local minima via first-order oracles. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, pp. 3716–3726. MIT Press, Cambridge (2018)


  32. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010)

  33. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)


  34. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)


  35. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, Citeseer (2009)


Acknowledgements

This work was partially supported by the National Research Foundation (NRF) of Korea (NRF-2018R1D1A1B07043406). Panos M. Pardalos was supported by a Humboldt Research Award (Germany).

Author information

Corresponding author

Correspondence to Seonho Park.

Additional information

Communicated by Alexander Mitsos.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Experiment Settings on Numerical Results

In this appendix, we present experimental results that show the effectiveness of the SANC algorithm on stochastic nonconvex optimization problems. In our numerical experiments, we considered three machine learning problems with real datasets: (i) logistic regression, (ii) a multilayer perceptron, and (iii) convolutional neural networks (CNNs).

For the logistic regression problems, given a binary classification dataset \(\{\mathbf {x}_i,y_i\}_{i=1}^n\) with \(y_i\in \{0,1\}\), we solved the following problem to find the optimum \(\mathbf {w}^*\):

$$\begin{aligned} \min _{\mathbf {w}\in \mathbb {R}^d}\; -\frac{1}{n}\sum _{i=1}^{n}\left[ y_i\log \left( \frac{1}{1+e^{-\mathbf {w}^T\mathbf {x}_i}}\right) +(1-y_i)\log \left( \frac{e^{-\mathbf {w}^T\mathbf {x}_i}}{1+e^{-\mathbf {w}^T\mathbf {x}_i}}\right) \right] +\lambda \Omega (\mathbf {w}) \end{aligned}$$
(68)

where \(\Omega (\mathbf {w})=\sum _{i=1}^d \frac{w_i^2}{1+w_i^2}\) and \(\lambda \) is a fixed regularization coefficient. We initialized all entries of \(\mathbf {w}_0\) to 1.
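For concreteness, the following is a minimal NumPy sketch of this objective with its nonconvex regularizer; the function name, array shapes, and the numerical safeguard are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def logistic_loss_nonconvex_reg(w, X, y, lam):
    """Logistic loss with the nonconvex regularizer Omega(w) = sum_i w_i^2 / (1 + w_i^2).
    Shapes: X is (n, d), y is (n,) with entries in {0, 1}."""
    z = X @ w                                    # logits
    p = 1.0 / (1.0 + np.exp(-z))                 # sigmoid probabilities
    eps = 1e-12                                  # guard against log(0)
    ce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    reg = np.sum(w**2 / (1.0 + w**2))            # nonconvex regularizer Omega(w)
    return ce + lam * reg

# Example usage with w initialized to all ones, as in the experiments:
# X, y = ...                      # e.g., a LIBSVM binary classification dataset
# w0 = np.ones(X.shape[1])
# loss = logistic_loss_nonconvex_reg(w0, X, y, lam=0.01)   # lam is a placeholder value
```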

For the multilayer perceptron, we used two hidden layers: the first with 300 units and the second with 500 units, both with hyperbolic tangent activation functions. Softmax functions were used at the output layer, and the cross-entropy loss was the objective function to be minimized. We also added an l2 norm as a convex regularization term with coefficient \(\lambda =0.01\).
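A minimal tf.keras sketch of this architecture is shown below. The 10-class softmax output is a placeholder (the actual output size depends on the dataset), and the use of the Keras API is our assumption; the released code may construct the graph differently.

```python
import tensorflow as tf

l2 = tf.keras.regularizers.l2(0.01)      # convex l2 regularization, lambda = 0.01
mlp = tf.keras.Sequential([
    tf.keras.layers.Dense(300, activation='tanh', kernel_regularizer=l2,
                          kernel_initializer='glorot_uniform'),   # Xavier initialization [32]
    tf.keras.layers.Dense(500, activation='tanh', kernel_regularizer=l2,
                          kernel_initializer='glorot_uniform'),
    tf.keras.layers.Dense(10, activation='softmax'),   # output size depends on the dataset
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()   # cross-entropy objective
```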

For the CNN, we used two convolutional layers with \(5\times 5\) receptive filters: the first with 32 channels and the second with 64 channels. A fully connected layer with 1000 neurons was added after the convolutional layers. The settings for the nonlinear activation functions and the l2 norm regularization are the same as those of the multilayer perceptron.
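A corresponding tf.keras sketch follows; the pooling layers, 'same' padding, and the 10-class output are assumptions on our part, since the text only specifies the filter sizes, channel counts, and the fully connected layer.

```python
import tensorflow as tf

l2 = tf.keras.regularizers.l2(0.01)
cnn = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (5, 5), activation='tanh', padding='same',
                           kernel_regularizer=l2, kernel_initializer='glorot_uniform'),
    tf.keras.layers.MaxPooling2D((2, 2)),    # pooling is an assumption; not specified in the text
    tf.keras.layers.Conv2D(64, (5, 5), activation='tanh', padding='same',
                           kernel_regularizer=l2, kernel_initializer='glorot_uniform'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1000, activation='tanh', kernel_regularizer=l2),
    tf.keras.layers.Dense(10, activation='softmax'),   # 10 classes for MNIST/CIFAR10
])
```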

All variables, including the weights and biases of the multilayer perceptron and the CNN, were initialized with Xavier initialization [32]. We used real datasets from LIBSVM [33] for the logistic regression problems and the multilayer perceptron. For the CNN, we used the MNIST [34] and CIFAR10 [35] datasets.

The sizes of \(\mathcal {S}_\mathbf {g}\) and \(\mathcal {S}_\mathbf {B}\) are \(\lceil \text {number of data points}/20\rceil \) for the logistic regression problems. For the multilayer perceptron and the CNN, we computed the stochastic gradient and Hessian estimators using independently drawn mini-batches of size 128. These batch-size settings are used for SANC as well as for all the baselines described below.
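The sketch below illustrates how two independent, fixed-size index sets could be drawn at each iteration; the function name and the placeholder dataset size are illustrative and not taken from the released code.

```python
import numpy as np

def draw_index_sets(n, batch_size, rng):
    """Draw two independent index sets S_g (gradient) and S_B (Hessian) of the same fixed size."""
    S_g = rng.choice(n, size=batch_size, replace=False)   # for the stochastic gradient estimator
    S_B = rng.choice(n, size=batch_size, replace=False)   # for the stochastic Hessian estimator
    return S_g, S_B

rng = np.random.default_rng(0)
n = 60000                                                  # number of data points (placeholder)
S_g, S_B = draw_index_sets(n, int(np.ceil(n / 20)), rng)   # logistic regression setting
# Multilayer perceptron and CNN: draw_index_sets(n, 128, rng)
```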

We compared our method, SANC, with the following optimization methods:

  • SGD: a fixed step size is used, 0.01 for the logistic regression problems and 0.001 for the multilayer perceptron and CNN problems; these are the best choices from the grid \(10^{-3:1:3}\).

  • SCR [19]: we used the same parameter values as for SANC and the same sets \(\mathcal {S}_\mathbf {g}\) and \(\mathcal {S}_\mathbf {B}\) as in SANC, which differs from the experiments in [19], where sets of increasing size were used.

  • CR [14]: we used the fixed cubic coefficient \(\sigma =5\) for all problems.

  • CRM [16]: we used the fixed cubic coefficient \(\sigma =5\) for all problems. The momentum parameter \(\beta \) is set to \(8\times \Vert \mathbf {s}\Vert \) as in [16].

  • NCD [2]: we used the same parameter values as those used for SANC.

As a default setting, we used \(\gamma =2\), \(\eta _1=0.2\), \(\eta _2=0.8\), and \(\sigma _0=1\); these parameter settings are the same as those in SCR [19]. For the neural network problems, we used \(\eta _1=0.1\) and \(\eta _2=0.3\), tuned by preliminary experiments. The \(L_1\) and \(L_2\) parameters in SANC were also tuned, over the search range \(10^{-3:1:3}\). The parameter values used are presented in Table 1.

Table 1 Datasets used for numerical experiments

For the Lanczos method, we truncated it at 5 Lanczos iterations for all experiments. An approximate minimizer of the approximate local cubic model over the Krylov subspace was obtained with the conjugate gradient method ('CG') in scipy.optimize.minimize. To calculate the leftmost eigenpair of \(\mathbf {T}\), we used eigh_tridiagonal in scipy.linalg.
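For illustration, here is a minimal sketch of a 5-step Lanczos tridiagonalization built from Hessian-vector products, followed by eigh_tridiagonal to extract the leftmost eigenpair of \(\mathbf {T}\). This is our own reconstruction under the stated truncation, not the authors' code; the function name and the early-termination tolerance are assumptions.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def lanczos_leftmost(hvp, dim, k=5, rng=None):
    """Run k Lanczos iterations using only Hessian-vector products `hvp`, then
    return the leftmost Ritz value and the corresponding vector in R^dim."""
    rng = np.random.default_rng(0) if rng is None else rng
    Q = np.zeros((dim, k))
    alpha = np.zeros(k)          # diagonal of the tridiagonal matrix T
    beta = np.zeros(max(k - 1, 1))   # off-diagonal of T
    q = rng.standard_normal(dim)
    q /= np.linalg.norm(q)
    m = k
    for j in range(k):
        Q[:, j] = q
        v = hvp(q)                           # Hessian-vector product B q
        alpha[j] = q @ v
        v = v - alpha[j] * q
        if j > 0:
            v = v - beta[j - 1] * Q[:, j - 1]
        if j == k - 1:
            break
        beta[j] = np.linalg.norm(v)
        if beta[j] < 1e-10:                  # invariant subspace found; stop early
            m = j + 1
            break
        q = v / beta[j]
    # Leftmost eigenpair of T = tridiag(beta, alpha, beta).
    vals, vecs = eigh_tridiagonal(alpha[:m], beta[:m - 1],
                                  select='i', select_range=(0, 0))
    return vals[0], Q[:, :m] @ vecs[:, 0]

# Example usage on an explicit matrix (in practice `hvp` would be a stochastic
# Hessian-vector product, e.g. computed with Pearlmutter's trick [12]):
# lam_min, v = lanczos_leftmost(lambda x: B @ x, B.shape[0], k=5)
```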

For SANC and SCR, we have to calculate \(f(\mathbf {x}_t)\) and \(f(\mathbf {x}_t+\mathbf {s}_t)\) to measure \(\rho _t\). For the neural network problems, owing to limited memory, we drew another mini-batch of the same size as \(\mathcal {S}_\mathbf {g}\) and \(\mathcal {S}_\mathbf {B}\) to estimate \(f(\mathbf {x}_t)\) and \(f(\mathbf {x}_t+\mathbf {s}_t)\).
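A minimal sketch of this acceptance-ratio estimate is given below, using the standard adaptive cubic regularization convention \(\rho _t=(f(\mathbf {x}_t)-f(\mathbf {x}_t+\mathbf {s}_t))/(f(\mathbf {x}_t)-m_t(\mathbf {s}_t))\) [17, 19]; the helper `f_batch` and the division safeguard are hypothetical, not part of the paper.

```python
def estimate_rho(f_batch, x_t, s_t, model_decrease, batch_idx, eps=1e-12):
    """Estimate rho_t = (f(x_t) - f(x_t + s_t)) / (f(x_t) - m_t(s_t)), where both
    function values are evaluated on the same freshly drawn mini-batch `batch_idx`
    and `model_decrease` = f(x_t) - m_t(s_t) comes from the cubic subproblem solver."""
    f_x = f_batch(x_t, batch_idx)            # mini-batch estimate of f(x_t)
    f_xs = f_batch(x_t + s_t, batch_idx)     # mini-batch estimate of f(x_t + s_t)
    return (f_x - f_xs) / max(model_decrease, eps)   # guard against division by zero
```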

All baselines are our own implementations in TensorFlow. The code is available at https://github.com/seonho-park/Stochastic-Adaptive-cubic-regularization-with-Negative-Curvature.


About this article

Cite this article

Park, S., Jung, S.H. & Pardalos, P.M. Combining Stochastic Adaptive Cubic Regularization with Negative Curvature for Nonconvex Optimization. J Optim Theory Appl 184, 953–971 (2020). https://doi.org/10.1007/s10957-019-01624-6

