Abstract
We focus on minimizing nonconvex finite-sum functions that typically arise in machine learning problems. For this class of problems, the adaptive cubic-regularized Newton method has shown strong global convergence guarantees and the ability to escape from strict saddle points. In this paper, we extend this algorithm by incorporating the negative curvature method, so that progress is made even at unsuccessful iterations. We call this new method Stochastic Adaptive cubic regularization with Negative Curvature (SANC). Unlike previous methods, the SANC algorithm uses independent sets of data points of consistent size across all iterations to obtain stochastic gradient and Hessian estimators. This makes the SANC algorithm more practical for solving large-scale machine learning problems. To the best of our knowledge, this is the first approach that combines the negative curvature method with the adaptive cubic-regularized Newton method. Finally, we provide experimental results, including neural network problems, that support the efficiency of our method.
References
Curtis, F.E., Robinson, D.P.: Exploiting negative curvature in deterministic and stochastic optimization. arXiv preprint arXiv:1703.00412 (2017)
Liu, M., Li, Z., Wang, X., Yi, J., Yang, T.: Adaptive negative curvature descent with applications in non-convex optimization. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, pp. 4854–4863. MIT Press, Cambridge (2018)
Cano, J., Moguerza, J.M., Prieto, F.J.: Using improved directions of negative curvature for the solution of bound-constrained nonconvex problems. J. Optim. Theory Appl. 174(2), 474–499 (2017)
Reddi, S.J., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., Smola, A.J.: A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434 (2017)
Kuczyński, J., Woźniakowski, H.: Estimating the largest eigenvalue by the power and Lanczos algorithms with a random start. SIAM J. Matrix Anal. Appl. 13(4), 1094–1122 (1992)
Oja, E.: Simplified neuron model as a principal component analyzer. J. Math. Biol. 15(3), 267–273 (1982)
Martens, J.: Deep learning via Hessian-free optimization. In: ICML vol. 27, pp. 735–742 (2010)
Martens, J., Sutskever, I.: Learning recurrent neural networks with Hessian-free optimization. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 1033–1040. Citeseer (2011)
Agarwal, N., Bullins, B., Hazan, E.: Second-order stochastic optimization for machine learning in linear time. J. Mach. Learn. Res. 18(1), 4148–4187 (2017)
Vinyals, O., Povey, D.: Krylov subspace descent for deep learning. In: Gale, W.A. (ed.) Artificial Intelligence and Statistics, pp. 1261–1268. Addison-Wesley Pub. Co., Boston (2012)
Byrd, R.H., Hansen, S.L., Nocedal, J., Singer, Y.: A stochastic quasi-Newton method for large-scale optimization. SIAM J. Optim. 26(2), 1008–1031 (2016)
Pearlmutter, B.A.: Fast exact multiplication by the Hessian. Neural Comput. 6(1), 147–160 (1994)
Griewank, A.: The modification of Newton's method for unconstrained optimization by bounding cubic terms. Technical report, NA/12 (1981)
Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Program. 108(1), 177–205 (2006)
Wang, X., Ma, S., Goldfarb, D., Liu, W.: Stochastic quasi-Newton methods for nonconvex stochastic optimization. SIAM J. Optim. 27(2), 927–956 (2017)
Wang, Z., Zhou, Y., Liang, Y., Lan, G.: Cubic regularization with momentum for nonconvex optimization. arXiv preprint arXiv:1810.03763 (2018)
Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127(2), 245–295 (2011)
Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function-and derivative-evaluation complexity. Math. Program. 130(2), 295–319 (2011)
Kohler, J.M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization. arXiv preprint arXiv:1705.05933 (2017)
Bergou, E.H., Diouane, Y., Gratton, S.: A line-search algorithm inspired by the adaptive cubic regularization framework and complexity analysis. J. Optim. Theory Appl. 178(3), 885–913 (2018)
Wang, X., Fan, N., Pardalos, P.M.: Stochastic subgradient descent method for large-scale robust chance-constrained support vector machines. Optim. Lett. 11(5), 1013–1024 (2017)
Carmon, Y., Duchi, J.C.: Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547 (2016)
Ritz, W.: Über eine neue methode zur lösung gewisser variationsprobleme der mathematischen physik. Journal für die reine und angewandte Mathematik (Crelles J.) 1909(135), 1–61 (1909)
Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915 (2016)
Gross, D.: Recovering low-rank matrices from few coefficients in any basis. IEEE Trans. Inf. Theory 57(3), 1548–1566 (2011)
Ghadimi, S., Liu, H., Zhang, T.: Second-order methods with cubic regularization under inexact information. arXiv preprint arXiv:1710.05782 (2017)
Roosta-Khorasani, F., Mahoney, M.W.: Sub-sampled newton methods II: local convergence rates. arXiv preprint arXiv:1601.04738 (2016)
Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM (2017)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for nonconvex optimization. SIAM J. Optim. 28(2), 1751–1772 (2018)
Allen-Zhu, Z.: Natasha 2: faster non-convex optimization than SGD. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, pp. 2675–2686. MIT Press, Cambridge (2018)
Allen-Zhu, Z., Li, Y.: Neon2: finding local minima via first-order oracles. In: Mozer, M.C., Jordan, M.I., Petsche, T. (eds.) Advances in Neural Information Processing Systems, pp. 3716–3726. MIT Press, Cambridge (2018)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256 (2010)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2(3), 27 (2011)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, Citeseer (2009)
Acknowledgements
This work was partially supported by the National Research Foundation (NRF) of Korea (NRF-2018R1D1A1B07043406). Panos M. Pardalos was supported by a Humboldt Research Award (Germany).
Additional information
Communicated by Alexander Mitsos.
Appendix A: Experiment Settings on Numerical Results
In “Appendix A”, we present some experimental results to show the effectiveness of the SANC algorithm for stochastic nonconvex optimization problems. In our numerical experiments, we considered three machine learning problems with real datasets, namely (i) logistic regression, (ii) the multilayer perceptron, and (iii) convolutional neural networks (CNN).
For the logistic regression problems, with a binary classification dataset, i.e., \(\{\mathbf {x}_i,y_i\}_{i=1}^n\), \(y_i\in \{0,1\}\), in order to find the optimum \(\mathbf {w}^*\), we solved the following problem,
$$\min _{\mathbf {w}\in \mathbb {R}^d}\ \frac{1}{n}\sum _{i=1}^{n}\Big [-y_i\log \big (\sigma (\mathbf {w}^{\top }\mathbf {x}_i)\big )-(1-y_i)\log \big (1-\sigma (\mathbf {w}^{\top }\mathbf {x}_i)\big )\Big ]+\lambda \,\Omega (\mathbf {w}),$$
where \(\sigma (z)=1/(1+e^{-z})\), \(\Omega (\mathbf {w})=\sum _{i=1}^d \frac{w_i^2}{1+w_i^2}\) is a nonconvex regularizer, and \(\lambda \) is a fixed regularization coefficient. We initialized all variables to \(\mathbf {w}_0=\mathbf {1}\).
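This objective can be sketched in a few lines of NumPy; the sketch assumes the standard cross-entropy loss with the sigmoid link, and the function name `objective` and the small `eps` guard against \(\log 0\) are ours, not from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w, X, y, lam):
    """Cross-entropy loss plus the nonconvex penalty
    Omega(w) = sum_i w_i^2 / (1 + w_i^2), scaled by lam."""
    p = sigmoid(X @ w)
    eps = 1e-12  # numerical guard against log(0)
    ce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    omega = np.sum(w ** 2 / (1.0 + w ** 2))
    return ce + lam * omega
```

Note that each summand of \(\Omega \) is bounded by 1, so the penalty discourages large weights without growing unboundedly, which is precisely what makes the overall problem nonconvex.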
For the multilayer perceptron, we used two hidden layers: the first of size 300 and the second of size 500, with hyperbolic tangent functions as activations. At the output layer, the softmax function was used. The cross-entropy loss was the objective function to be minimized, to which we added the \(\ell _2\) norm as a convex regularization term with coefficient \(\lambda =0.01\).
For the CNN, we used two convolutional receptive filters: the first is \(5\times 5\times 32\)-dimensional, and the second is \(5\times 5\times 64\)-dimensional. A fully connected layer with 1000 neurons was added at the output of the convolutional layers. The settings for the nonlinear activation functions and the \(\ell _2\) regularization are the same as those of the multilayer perceptron.
All the variables, including the weights and the biases in the multilayer perceptron and the CNN, were initialized by the Xavier initialization [32]. We used real datasets from LIBSVM [33] for the logistic regression problem and the multilayer perceptron. For the CNN, we used the MNIST [34] and CIFAR10 [35] datasets.
For the logistic regression problems, the sizes of \(\mathcal {S}_\mathbf {g}\) and \(\mathcal {S}_\mathbf {B}\) are \(\lceil \text {the number of data points}/20\rceil \). For the multilayer perceptron and the CNN, we computed the stochastic gradient and Hessian estimators using independently drawn mini-batches of size 128. These batch-size settings are used for SANC as well as for all the baselines described below.
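The independent, fixed-size sampling of \(\mathcal {S}_\mathbf {g}\) and \(\mathcal {S}_\mathbf {B}\) can be sketched as follows; `grad_fn` and `hess_vec_fn` are assumed user-supplied subsampled oracles, and the helper name is illustrative, not from the paper:

```python
import numpy as np

def subsampled_estimators(grad_fn, hess_vec_fn, n, batch=128, rng=None):
    """Draw two independent index sets of fixed size and return a
    stochastic gradient and a Hessian-vector-product closure.
    grad_fn(idx) / hess_vec_fn(idx, v) evaluate on the subsample idx."""
    rng = np.random.default_rng() if rng is None else rng
    idx_g = rng.choice(n, size=min(batch, n), replace=False)  # S_g
    idx_B = rng.choice(n, size=min(batch, n), replace=False)  # S_B, drawn independently of S_g
    g = grad_fn(idx_g)
    Bv = lambda v: hess_vec_fn(idx_B, v)  # Hessian estimator accessed via products
    return g, Bv
```

Keeping the two sets independent and of constant size is what distinguishes this scheme from sub-sampling strategies with growing sample sizes.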
We compared our method, SANC, with various optimization methods as follows:
SGD A fixed step size is used: 0.01 for the logistic regression problems and 0.001 for the multilayer perceptron and CNN problems, each the best choice within \(10^{-3:1:3}\).
SCR [19] We used the same parameter values as for SANC, and the same sets \(\mathcal {S}_\mathbf {g}\), \(\mathcal {S}_\mathbf {B}\) as in SANC, which differs from the experiments in [19], where sets of increasing size were used.
CR [14] The cubic coefficient was fixed at \(\sigma =5\) for all problems.
CRM [16] The cubic coefficient was fixed at \(\sigma =5\) for all problems. The momentum parameter \(\beta \) is set to \(8\times \Vert \mathbf {s}\Vert \) as in [16].
NCD [2] We used the same parameter values as for SANC.
As a default setting, we used \(\gamma =2\), \(\eta _1=0.2\), \(\eta _2=0.8\), and \(\sigma _0=1\); these parameter settings are the same as those in SCR [19]. For the neural network problems, we used \(\eta _1=0.1\) and \(\eta _2=0.3\), tuned by preliminary experiments. The \(L_1\) and \(L_2\) parameters in SANC were tuned over the search range \(10^{-3:1:3}\). The parameter values used are presented in Table 1.
For the Lanczos method, we truncated at five Lanczos iterations in all experiments. An approximate minimizer of the local cubic model over the Krylov subspace was obtained with the conjugate gradient method in scipy.optimize.minimize. To calculate the left-most eigenpair of \(\mathbf {T}\), we used eigh_tridiagonal from scipy.linalg.
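A sketch of this eigenpair computation, assuming access to a Hessian-vector-product oracle: the helper name and the random start are ours, while `eigh_tridiagonal` is the SciPy routine mentioned above. The Ritz vector of the tridiagonal matrix \(\mathbf {T}\) is mapped back through the Krylov basis, giving an approximate negative-curvature direction whenever the left-most Ritz value is negative.

```python
import numpy as np
from scipy.linalg import eigh_tridiagonal

def lanczos_leftmost(hess_vec, dim, k=5, rng=None):
    """Run up to k Lanczos iterations from a random start and return the
    left-most Ritz value and its vector in the original space."""
    rng = np.random.default_rng(0) if rng is None else rng
    q = rng.standard_normal(dim)
    q /= np.linalg.norm(q)
    Q, alpha, beta = [q], [], []
    for j in range(k):
        w = hess_vec(Q[j])                       # one Hessian-vector product per iteration
        a = Q[j] @ w
        alpha.append(a)
        w = w - a * Q[j] - (beta[-1] * Q[j - 1] if beta else 0.0)
        b = np.linalg.norm(w)
        if b < 1e-10 or j == k - 1:              # invariant subspace found, or budget reached
            break
        beta.append(b)
        Q.append(w / b)
    d, V = eigh_tridiagonal(np.array(alpha), np.array(beta))
    v = np.column_stack(Q) @ V[:, 0]             # lift the left-most Ritz vector back
    return d[0], v / np.linalg.norm(v)
```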
For SANC and SCR, we have to calculate \(f(\mathbf {x}_t)\) and \(f(\mathbf {x}_t+\mathbf {s}_t)\) to measure \(\rho _t\). For the neural network problems, because of memory limitations, we estimated \(f(\mathbf {x}_t)\) and \(f(\mathbf {x}_t+\mathbf {s}_t)\) on an additional, independently drawn set of the same size as \(\mathcal {S}_\mathbf {g}\) and \(\mathcal {S}_\mathbf {B}\).
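The ratio \(\rho _t\) can be sketched as follows, assuming the standard cubic model \(m(\mathbf {s})=\mathbf {g}^{\top }\mathbf {s}+\frac{1}{2}\mathbf {s}^{\top }\mathbf {B}\mathbf {s}+\frac{\sigma }{3}\Vert \mathbf {s}\Vert ^3\) of adaptive cubic regularization; the function name is ours, and `f_x`, `f_xs` stand for the subsampled estimates of \(f(\mathbf {x}_t)\) and \(f(\mathbf {x}_t+\mathbf {s}_t)\):

```python
import numpy as np

def rho(f_x, f_xs, g, Bv, s, sigma):
    """Ratio of actual decrease to the decrease predicted by the cubic
    model m(s) = g's + 0.5 s'Bs + (sigma/3)||s||^3, where Bv(s) = B s."""
    model_decrease = -(g @ s + 0.5 * s @ Bv(s) + sigma / 3.0 * np.linalg.norm(s) ** 3)
    return (f_x - f_xs) / model_decrease
```

An iteration is deemed successful when this ratio exceeds \(\eta _1\), and \(\sigma \) is decreased when it exceeds \(\eta _2\).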
All baselines are our own implementations in TensorFlow. The code is available at https://github.com/seonho-park/Stochastic-Adaptive-cubic-regularization-with-Negative-Curvature.
Cite this article
Park, S., Jung, S.H. & Pardalos, P.M. Combining Stochastic Adaptive Cubic Regularization with Negative Curvature for Nonconvex Optimization. J Optim Theory Appl 184, 953–971 (2020). https://doi.org/10.1007/s10957-019-01624-6
Keywords
- Adaptive cubic-regularized Newton method
- Cubic regularization
- Trust-region method
- Negative curvature
- Nonconvex optimization
- Worst-case complexity