Z. Allen-Zhu, Y. Li, and Z. Song, “A convergence theory for deep learning via over-parameterization,” in International Conference on Machine Learning. PMLR, 2019, pp. 242–252.
S. Du, J. Lee, H. Li, L. Wang, and X. Zhai, “Gradient descent finds global minima of deep neural networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 1675–1685.
D. Zou, Y. Cao, D. Zhou, and Q. Gu, “Stochastic gradient descent optimizes over-parameterized deep ReLU networks,” arXiv preprint arXiv:1811.08888, 2018.
H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2016, pp. 795–811.
Q. Nguyen, “On connected sublevel sets in deep learning,” in International Conference on Machine Learning. PMLR, 2019, pp. 4790–4799.
C. Liu, L. Zhu, and M. Belkin, “Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning,” arXiv preprint arXiv:2003.00307, 2020.
Z. Allen-Zhu, Y. Li, and Y. Liang, “Learning and generalization in overparameterized neural networks, going beyond two layers,” arXiv preprint arXiv:1811.04918, 2018.
M. Soltanolkotabi, A. Javanmard, and J. D. Lee, “Theoretical insights into the optimization landscape of over-parameterized shallow neural networks,” IEEE Transactions on Information Theory, vol. 65, no. 2, pp. 742–769, 2018.
S. Oymak and M. Soltanolkotabi, “Overparameterized nonlinear learning: Gradient descent takes the shortest path?” in International Conference on Machine Learning. PMLR, 2019, pp. 4951–4960.
S. S. Du, X. Zhai, B. Poczos, and A. Singh, “Gradient descent provably optimizes over-parameterized neural networks,” arXiv preprint arXiv:1810.02054, 2018.
I. Safran, G. Yehudai, and O. Shamir, “The effects of mild over-parameterization on the optimization landscape of shallow ReLU neural networks,” arXiv preprint arXiv:2006.01005, 2020.
A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: convergence and generalization in neural networks,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 8580–8589.
S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang, “On exact computation with an infinitely wide neural net,” arXiv preprint arXiv:1904.11955, 2019.
Y. Li, T. Luo, and N. K. Yip, “Towards an understanding of residual networks using neural tangent hierarchy (NTH),” arXiv preprint arXiv:2007.03714, 2020.
Y. Nesterov, Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2003, vol. 87.
Z.-Q. Luo and P. Tseng, “Error bounds and convergence analysis of feasible descent methods: a general approach,” Annals of Operations Research, vol. 46, no. 1, pp. 157–178, 1993.
J. Liu, S. Wright, C. Ré, V. Bittorf, and S. Sridhar, “An asynchronous parallel stochastic coordinate descent algorithm,” in International Conference on Machine Learning. PMLR, 2014, pp. 469–477.
I. Necoara, Y. Nesterov, and F. Glineur, “Linear convergence of first order methods for non-strongly convex optimization,” Mathematical Programming, vol. 175, no. 1, pp. 69–107, 2019.
H. Zhang and W. Yin, “Gradient methods for convex minimization: better rates under weaker conditions,” arXiv preprint arXiv:1303.4645, 2013.
B. T. Polyak, “Gradient methods for minimizing functionals,” Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki, vol. 3, no. 4, pp. 643–653, 1963.
S. Łojasiewicz, “A topological property of real analytic subsets,” Coll. du CNRS, Les équations aux dérivées partielles, vol. 117, pp. 87–89, 1963.
B. D. Craven and B. M. Glover, “Invex functions and duality,” Journal of the Australian Mathematical Society, vol. 39, no. 1, pp. 1–20, 1985.
K. Kawaguchi, “Deep learning without poor local minima,” arXiv preprint arXiv:1605.07110, 2016.
H. Lu and K. Kawaguchi, “Depth creates no bad local minima,” arXiv preprint arXiv:1702.08580, 2017.
Y. Zhou and Y. Liang, “Critical points of neural networks: Analytical forms and landscape properties,” arXiv preprint arXiv:1710.11205, 2017.
C. Yun, S. Sra, and A. Jadbabaie, “Small nonlinearities in activation functions create bad local minima in neural networks,” arXiv preprint arXiv:1802.03487, 2018.
D. Li, T. Ding, and R. Sun, “Over-parameterized deep neural networks have no strict local minima for any continuous activations,” arXiv preprint arXiv:1812.11039, 2018.
N. P. Bhatia and G. P. Szegö, Stability Theory of Dynamical Systems. Springer Science & Business Media, 2002.