Parallel Optimization Techniques for Machine Learning

Chapter in Parallel Algorithms in Computational Science and Engineering

Abstract

In this chapter, we discuss higher-order methods for optimization problems arising in machine learning applications. For each of these higher-order methods we present the underlying theoretical background along with detailed experimental results, and we compare them in depth with competing methods on real-world datasets. We show that, contrary to popular understanding, higher-order methods can achieve significantly better results than state-of-the-art competing methods in shorter wall-clock time, yielding orders-of-magnitude relative speedups on typical real-world datasets.
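
As a brief, concrete orientation for readers unfamiliar with this class of methods, the snippet below sketches one sub-sampled Newton-CG step for L2-regularized logistic regression, a representative higher-order method. It is a minimal illustration only: the helper functions, the unit step length, the subsample size, and the use of NumPy/SciPy are our own assumptions and do not reflect the parallel implementations studied in this chapter.

    # Illustrative sub-sampled Newton-CG step for L2-regularized logistic regression.
    # All names and parameters here are hypothetical, chosen only for exposition.
    import numpy as np
    from scipy.sparse.linalg import LinearOperator, cg

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def gradient(w, X, y, lam):
        # Full gradient of the regularized logistic loss; X is n x d, y in {0, 1}^n.
        p = sigmoid(X @ w)
        return X.T @ (p - y) / X.shape[0] + lam * w

    def subsampled_newton_step(w, X, y, lam, sample_size, rng, cg_iters=50):
        """One inexact Newton step: full gradient, sub-sampled Hessian, CG solve."""
        n, d = X.shape
        g = gradient(w, X, y, lam)

        # Estimate curvature from a random subsample of the data.
        idx = rng.choice(n, size=min(sample_size, n), replace=False)
        Xs = X[idx]
        ps = sigmoid(Xs @ w)
        curv = ps * (1.0 - ps)  # per-sample curvature of the logistic loss

        def hess_vec(v):
            # Hessian-vector product; the d x d Hessian is never formed explicitly.
            v = np.ravel(v)
            return Xs.T @ (curv * (Xs @ v)) / len(idx) + lam * v

        H = LinearOperator((d, d), matvec=hess_vec)
        direction, _ = cg(H, -g, maxiter=cg_iters)  # approximate Newton direction
        return w + direction  # unit step length; a line search is omitted here

    # Tiny usage example on synthetic data.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 20))
    y = (X @ rng.standard_normal(20) > 0).astype(float)
    w = np.zeros(20)
    for _ in range(20):
        w = subsampled_newton_step(w, X, y, lam=1e-3, sample_size=200, rng=rng)

The essential design choice is that curvature enters only through Hessian-vector products on a subsample, so the per-iteration cost stays close to that of a gradient step while the search direction still exploits second-order information.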

Notes

  1. It converges linearly to the optimum, starting from any initial guess x^(0).

  2. If the iterates are close enough to the optimum, it converges with a constant linear rate that is independent of problem-related quantities (both properties are stated symbolically below).
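
For concreteness, the two properties can be written as follows; here ρ, r, and κ are generic placeholders (a contraction factor, a neighborhood radius, and a condition number) and are not quantities derived in this chapter.

    % Note 1: linear (geometric) convergence from an arbitrary initial guess x^{(0)}:
    % the distance to the optimum x^{*} contracts by a fixed factor at every iteration.
    \[
      \| x^{(k+1)} - x^{*} \| \le \rho \, \| x^{(k)} - x^{*} \|,
      \qquad 0 < \rho < 1, \quad k = 0, 1, 2, \dots
    \]

    % Note 2: local, problem-independent rate -- once an iterate lies within a
    % radius r of the optimum, the contraction factor \rho is a fixed constant
    % that does not depend on problem-related quantities such as the condition
    % number \kappa.
    \[
      \| x^{(k)} - x^{*} \| \le r
      \;\Longrightarrow\;
      \| x^{(k+1)} - x^{*} \| \le \rho \, \| x^{(k)} - x^{*} \|,
      \qquad \text{with } \rho \text{ independent of } \kappa .
    \]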

Author information

Correspondence to Ananth Grama.

Copyright information

© 2020 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Kylasa, S., Fang, CH., Roosta, F., Grama, A. (2020). Parallel Optimization Techniques for Machine Learning. In: Grama, A., Sameh, A. (eds) Parallel Algorithms in Computational Science and Engineering. Modeling and Simulation in Science, Engineering and Technology. Birkhäuser, Cham. https://doi.org/10.1007/978-3-030-43736-7_13
