Abstract
In this chapter we discuss higher-order methods for optimization problems arising in machine learning applications. We present the underlying theoretical background for each of these higher-order methods, along with detailed experimental results and an in-depth comparison against competing methods on real-world datasets. We show that, contrary to popular understanding, higher-order methods can achieve significantly better results than state-of-the-art competing methods in shorter wall-clock times, yielding orders of magnitude of relative speedup on typical real-world datasets.
Notes
1. It converges linearly to the optimum, starting from any initial guess x^{(0)}.
2. If the iterates are close enough to the optimum, it converges with a constant linear rate independent of problem-related quantities.
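For reference, these two statements can be made precise as follows; this is a minimal sketch that assumes x^{(k)} denotes the k-th iterate and x^* the optimum, since the notes themselves do not fix the notation.

Global linear convergence (Note 1), from any starting point x^{(0)}:
  \exists\, \rho \in (0,1): \quad \| x^{(k)} - x^* \| \;\le\; \rho^{k}\, \| x^{(0)} - x^* \| \quad \text{for all } k \ge 0.

Local, problem-independent linear rate (Note 2): there is a radius r > 0 such that
  \| x^{(k+1)} - x^* \| \;\le\; \rho\, \| x^{(k)} - x^* \| \quad \text{whenever } \| x^{(k)} - x^* \| \le r,
where the contraction factor \rho is a fixed constant (e.g., 1/2) that does not depend on problem-related quantities such as the condition number.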
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Kylasa, S., Fang, CH., Roosta, F., Grama, A. (2020). Parallel Optimization Techniques for Machine Learning. In: Grama, A., Sameh, A. (eds) Parallel Algorithms in Computational Science and Engineering. Modeling and Simulation in Science, Engineering and Technology. Birkhäuser, Cham. https://doi.org/10.1007/978-3-030-43736-7_13
DOI: https://doi.org/10.1007/978-3-030-43736-7_13
Publisher Name: Birkhäuser, Cham
Print ISBN: 978-3-030-43735-0
Online ISBN: 978-3-030-43736-7