Abstract
In Chap. 6, we discussed various optimization methods for deep neural network training. Although they take different forms, these algorithms are essentially gradient-based local update schemes. However, the biggest obstacle, widely recognized by the community, is that the loss surfaces of deep neural networks are extremely non-convex and not even smooth. This non-convexity and non-smoothness make the optimization difficult to analyze, and a central concern has been whether popular gradient-based approaches might get trapped in poor local minimizers.
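As a toy illustration (not taken from the chapter itself), the following Python sketch runs plain gradient descent on a simple non-convex one-dimensional loss. The function, learning rate, and step count are arbitrary choices for illustration; the point is only that the same local update rule settles into different minimizers depending on the initialization, which is exactly the concern raised above.

```python
# Minimal sketch: gradient descent on the non-convex loss f(w) = w^4 - 3w^2 + w.
# This function has two local minima (near w ≈ -1.30 and w ≈ 1.13), so the
# final iterate depends entirely on where we start.

def loss(w):
    return w**4 - 3 * w**2 + w

def grad(w):
    # Hand-derived gradient of the loss above.
    return 4 * w**3 - 6 * w + 1

def gradient_descent(w0, lr=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)   # local, gradient-based update
    return w

# Two initializations, two different minimizers (one of them is not global).
for w0 in (-2.0, 2.0):
    w_star = gradient_descent(w0)
    print(f"init {w0:+.1f} -> w* = {w_star:+.4f}, loss = {loss(w_star):+.4f}")
```

Running this prints a lower loss for the run started at -2.0 than for the one started at +2.0, even though both runs use the identical update rule.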
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this chapter
Ye, J.C. (2022). Deep Learning Optimization. In: Geometry of Deep Learning. Mathematics in Industry, vol 37. Springer, Singapore. https://doi.org/10.1007/978-981-16-6046-7_11
Print ISBN: 978-981-16-6045-0
Online ISBN: 978-981-16-6046-7
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)