Skip to main content

Deep Learning Optimization

  • 1885 Accesses

Part of the Mathematics in Industry book series (MATHINDUSTRY,volume 37)

Abstract

In Chap. 6, we discussed various optimization methods for deep neural network training. Although they are in various forms, these algorithms are basically gradient-based local update schemes. However, the biggest obstacle recognized by the entire community is that the loss surfaces of deep neural networks are extremely non-convex and not even smooth. This non-convexity and non-smoothness make the optimization unaffordable to analyze, and the main concern was whether popular gradient-based approaches might fall into local minimizers.

This is a preview of subscription content, access via your institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • DOI: 10.1007/978-981-16-6046-7_11
  • Chapter length: 16 pages
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
eBook
USD   69.99
Price excludes VAT (USA)
  • ISBN: 978-981-16-6046-7
  • Instant PDF download
  • Readable on all devices
  • Own it forever
  • Exclusive offer for individuals only
  • Tax calculation will be finalised during checkout
Hardcover Book
USD   89.99
Price excludes VAT (USA)
Fig. 11.1
Fig. 11.2
Fig. 11.3

References

  1. Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over-parameterization,” in International Conference on Machine Learning. PMLR, 2019, pp. 242–252.

    Google Scholar 

  2. S. Du, J. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima of deep neural networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 1675–1685.

    Google Scholar 

  3. D. Zou, Y. Cao, D. Zhou, and Q. Gu, “Stochastic gradient descent optimizes over-parameterized deep ReLU networks,” arXiv preprint arXiv:1811.08888, 2018.

    Google Scholar 

  4. H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the Polyak-łojasiewicz condition,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2016, pp. 795–811.

    Google Scholar 

  5. Q. Nguyen, “On connected sublevel sets in deep learning,” in International Conference on Machine Learning. PMLR, 2019, pp. 4790–4799.

    Google Scholar 

  6. C. Liu, L. Zhu, and M. Belkin, “Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning,” arXiv preprint arXiv:2003.00307, 2020.

    Google Scholar 

  7. Z. Allen-Zhu, Y. Li, and Y. Liang, “Learning and generalization in overparameterized neural networks, going beyond two layers,” arXiv preprint arXiv:1811.04918, 2018.

    Google Scholar 

  8. M. Soltanolkotabi, A. Javanmard, and J. D. Lee, “Theoretical insights into the optimization landscape of over-parameterized shallow neural networks,” IEEE Transactions on Information Theory, vol. 65, no. 2, pp. 742–769, 2018.

    MathSciNet  CrossRef  Google Scholar 

  9. S. Oymak and M. Soltanolkotabi, “Overparameterized nonlinear learning: Gradient descent takes the shortest path?” in International Conference on Machine Learning. PMLR, 2019, pp. 4951–4960.

    Google Scholar 

  10. S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over-parameterized neural networks,” arXiv preprint arXiv:1810.02054, 2018.

    Google Scholar 

  11. I. Safran, G. Yehudai, and O. Shamir, “The effects of mild over-parameterization on the optimization landscape of shallow ReLU neural networks,” arXiv preprint arXiv:2006.01005, 2020.

    Google Scholar 

  12. A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: convergence and generalization in neural networks,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 8580–8589.

    Google Scholar 

  13. S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang, “On exact computation with an infinitely wide neural net,” arXiv preprint arXiv:1904.11955, 2019.

    Google Scholar 

  14. Y. Li, T. Luo, and N. K. Yip, “Towards an understanding of residual networks using neural tangent hierarchy (NTH),” arXiv preprint arXiv:2007.03714, 2020.

    Google Scholar 

  15. Y. Nesterov, Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2003, vol. 87.

    Google Scholar 

  16. Z.-Q. Luo and P. Tseng, “Error bounds and convergence analysis of feasible descent methods: a general approach,” Annals of Operations Research, vol. 46, no. 1, pp. 157–178, 1993.

    MathSciNet  CrossRef  Google Scholar 

  17. J. Liu, S. Wright, C. Ré, V. Bittorf, and S. Sridhar, “An asynchronous parallel stochastic coordinate descent algorithm,” in International Conference on Machine Learning. PMLR, 2014, pp. 469–477.

    Google Scholar 

  18. I. Necoara, Y. Nesterov, and F. Glineur, “Linear convergence of first order methods for non-strongly convex optimization,” Mathematical Programming, vol. 175, no. 1, pp. 69–107, 2019.

    MathSciNet  CrossRef  Google Scholar 

  19. H. Zhang and W. Yin, “Gradient methods for convex minimization: better rates under weaker conditions,” arXiv preprint arXiv:1303.4645, 2013.

    Google Scholar 

  20. B. T. Polyak, “Gradient methods for minimizing functionals,” Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, vol. 3, no. 4, pp. 643–653, 1963.

    MathSciNet  MATH  Google Scholar 

  21. S. Lojasiewicz, “A topological property of real analytic subsets,” Coll. du CNRS, Les équations aux dérivées partielles, vol. 117, pp. 87–89, 1963.

    MATH  Google Scholar 

  22. B. D. Craven and B. M. Glover, “Invex functions and duality,” Journal of the Australian Mathematical Society, vol. 39, no. 1, pp. 1–20, 1985.

    MathSciNet  CrossRef  Google Scholar 

  23. K. Kawaguchi, “Deep learning without poor local minima,” arXiv preprint arXiv:1605.07110, 2016.

    Google Scholar 

  24. H. Lu and K. Kawaguchi, “Depth creates no bad local minima,” arXiv preprint arXiv:1702.08580, 2017.

    Google Scholar 

  25. Y. Zhou and Y. Liang, “Critical points of neural networks: Analytical forms and landscape properties,” arXiv preprint arXiv:1710.11205, 2017.

    Google Scholar 

  26. C. Yun, S. Sra, and A. Jadbabaie, “Small nonlinearities in activation functions create bad local minima in neural networks,” arXiv preprint arXiv:1802.03487, 2018.

    Google Scholar 

  27. D. Li, T. Ding, and R. Sun, “Over-parameterized deep neural networks have no strict local minima for any continuous activations,” arXiv preprint arXiv:1812.11039, 2018.

    Google Scholar 

  28. N. P. Bhatia and G. P. Szegö, Stability Theory of Dynamical Systems. Springer Science & Business Media, 2002.

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Rights and permissions

Reprints and Permissions

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Verify currency and authenticity via CrossMark

Cite this chapter

Ye, J.C. (2022). Deep Learning Optimization. In: Geometry of Deep Learning. Mathematics in Industry, vol 37. Springer, Singapore. https://doi.org/10.1007/978-981-16-6046-7_11

Download citation