Abstract
In this chapter we discuss higher-order methods for optimization problems arising in machine learning applications. We present the underlying theoretical background for each of these higher-order methods, along with detailed experimental results and an in-depth comparison against competing methods on real-world datasets. We show that, contrary to popular understanding, higher-order methods can achieve significantly better results than state-of-the-art competing methods in shorter wall-clock times, yielding orders of magnitude of relative speedup on typical real-world datasets.
Notes
1. It converges linearly to the optimum, starting from any initial guess x^{(0)}.
2. If the iterates are close enough to the optimum, it converges with a constant linear rate independent of problem-related quantities.
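For reference, these two statements can be made precise as follows; this is a minimal sketch that assumes x^{(k)} denotes the k-th iterate and x^* the optimum, since the notes themselves do not fix the notation.

Global linear convergence (Note 1), from any starting point x^{(0)}:
  \exists\, \rho \in (0,1): \quad \| x^{(k)} - x^* \| \;\le\; \rho^{k}\, \| x^{(0)} - x^* \| \quad \text{for all } k \ge 0.

Local, problem-independent linear rate (Note 2): there is a radius r > 0 such that
  \| x^{(k+1)} - x^* \| \;\le\; \rho\, \| x^{(k)} - x^* \| \quad \text{whenever } \| x^{(k)} - x^* \| \le r,
where the contraction factor \rho is a fixed constant (e.g., 1/2) that does not depend on problem-related quantities such as the condition number.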
Copyright information
© 2020 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Kylasa, S., Fang, CH., Roosta, F., Grama, A. (2020). Parallel Optimization Techniques for Machine Learning. In: Grama, A., Sameh, A. (eds) Parallel Algorithms in Computational Science and Engineering. Modeling and Simulation in Science, Engineering and Technology. Birkhäuser, Cham. https://doi.org/10.1007/978-3-030-43736-7_13
DOI: https://doi.org/10.1007/978-3-030-43736-7_13
Publisher Name: Birkhäuser, Cham
Print ISBN: 978-3-030-43735-0
Online ISBN: 978-3-030-43736-7