Abstract
Average-case analysis computes the complexity of an algorithm averaged over all possible inputs. Compared to worst-case analysis, it is more representative of the typical behavior of an algorithm, but remains largely unexplored in optimization. One difficulty is that the analysis can depend on the probability distribution of the inputs to the model. However, we show that this is not the case for a class of large-scale problems trained with first-order methods including random least squares and one-hidden layer neural networks with random weights. In fact, the halting time exhibits a universality property: it is independent of the probability distribution. With this barrier for average-case analysis removed, we provide the first explicit average-case convergence rates showing a tighter complexity not captured by traditional worst-case analysis. Finally, numerical simulations suggest this universality property holds for a more general class of algorithms and problems.
Notes
The signal \({\widetilde{{{\varvec{x}}}}}\) is not the same as the vector to which the iterates of the algorithm converge as \(k \rightarrow \infty \).
The definition of \({\widetilde{R}}^2\) in Assumption 1 does not imply that \(R^2 \approx \frac{1}{d}\Vert {{\varvec{b}}}\Vert ^2 - {\widetilde{R}}^2\). However, the precise definition of \({\widetilde{R}}\) and this intuitive one yield similar magnitudes and both are generated from similar quantities.
In many situations this deterministic quantity \( \underset{d \rightarrow \infty }{{\mathcal {E}}} [\Vert \nabla f({{\varvec{x}}}_{k})\Vert ^2]\,\) is in fact the limiting expectation of the squared-norm of the gradient. However, under the assumptions that we are using, this does not immediately follow. It is, however, always the limit of the median of the squared-norm of the gradient.
References
Arora, S., Du, S.S., Hu, W., Li, Z., Salakhutdinov, R.R., Wang, R.: On exact computation with an infinitely wide neural net. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 32 (2019)
Bai, Z., Silverstein, J.: No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices. Ann. Probab. 26(1), 316–345 (1998). https://doi.org/10.1214/aop/1022855421
Bai, Z., Silverstein, J.: Exact separation of eigenvalues of large-dimensional sample covariance matrices. Ann. Probab. 27(3), 1536–1555 (1999). https://doi.org/10.1214/aop/1022677458
Bai, Z., Silverstein, J.: CLT for linear spectral statistics of large-dimensional sample covariance matrices. Ann. Probab. 32(1A), 553–605 (2004). https://doi.org/10.1214/aop/1078415845
Bai, Z., Silverstein, J.: Spectral analysis of large dimensional random matrices, second edn. Springer Series in Statistics. Springer, New York (2010). https://doi.org/10.1007/978-1-4419-0661-8
Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009). https://doi.org/10.1137/080716542
Benigni, L., Péché, S.: Eigenvalue distribution of nonlinear models of random matrices. arXiv preprint arXiv:1904.03090 (2019)
Bhojanapalli, S., Boumal, N., Jain, P., Netrapalli, P.: Smoothed analysis for low-rank solutions to semidefinite programs in quadratic penalty form. In: Proceedings of the 31st Conference On Learning Theory (COLT), Proceedings of Machine Learning Research, vol. 75, pp. 3243–3270. PMLR (2018)
Borgwardt, K.: A Probabilistic Analysis of the Simplex Method. Springer-Verlag, Berlin, Heidelberg (1986)
Bottou, L., Curtis, F., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Review 60(2), 223–311 (2018). https://doi.org/10.1137/16M1080173
Bradbury, J., Frostig, R., Hawkins, P., Johnson, M., Leary, C., Maclaurin, D., Wanderman-Milne, S.: JAX: composable transformations of Python+NumPy programs (2018)
Chizat, L., Oyallon, E., Bach, F.: On lazy training in differentiable programming. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 32 (2019)
Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., Bengio, Y.: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 27 (2014)
Deift, P., Menon, G., Olver, S., Trogdon, T.: Universality in numerical computations with random data. Proc. Natl. Acad. Sci. USA 111(42), 14973–14978 (2014). https://doi.org/10.1073/pnas.1413446111
Deift, P., Trogdon, T.: Universality in numerical computation with random data: Case studies, analytical results, and some speculations. Abel Symposia 13(3), 221–231 (2018)
Deift, P., Trogdon, T.: Universality in numerical computation with random data: case studies and analytical results. J. Math. Phys. 60(10), 103306, 14 (2019). https://doi.org/10.1063/1.5117151
Deift, P., Trogdon, T.: The conjugate gradient algorithm on well-conditioned Wishart matrices is almost deterministic. Quart. Appl. Math. 79(1), 125–161 (2021). https://doi.org/10.1090/qam/1574
Demmel, J.W.: The probability that a numerical analysis problem is difficult. Math. Comp. 50(182), 449–480 (1988). https://doi.org/10.2307/2008617
Durrett, R.: Probability—theory and examples, Cambridge Series in Statistical and Probabilistic Mathematics, vol. 49. Cambridge University Press, Cambridge (2019). https://doi.org/10.1017/9781108591034
Edelman, A.: Eigenvalues and condition numbers of random matrices. SIAM J. Matrix Anal. Appl 9(4), 543–560 (1988). https://doi.org/10.1137/0609045
Edelman, A., Rao, N.R.: Random matrix theory. Acta Numer. 14, 233–297 (2005). https://doi.org/10.1017/S0962492904000236
Engeli, M., Ginsburg, T., Rutishauser, H., Stiefel, E.: Refined iterative methods for computation of the solution and the eigenvalues of self-adjoint boundary value problems. Mitt. Inst. Angew. Math. Zürich 8, 107 (1959)
Fischer, B.: Polynomial based iteration methods for symmetric linear systems, Classics in Applied Mathematics, vol. 68. Society for Industrial and Applied Mathematics (SIAM) (2011). https://doi.org/10.1137/1.9781611971927.fm
Flanders, D., Shortley, G.: Numerical determination of fundamental modes. J. Appl. Phys. 21, 1326–1332 (1950)
Ghorbani, B., Krishnan, S., Xiao, Y.: An investigation into neural net optimization via hessian eigenvalue density. In: Proceedings of the 36th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 97, pp. 2232–2241. PMLR (2019)
Golub, G., Varga, R.: Chebyshev semi-iterative methods, successive over-relaxation iterative methods, and second order Richardson iterative methods. I. Numer. Math. 3, 147–156 (1961). https://doi.org/10.1007/BF01386013
Gunasekar, S., Lee, J., Soudry, D., Srebro, N.: Characterizing implicit bias in terms of optimization geometry. In: Proceedings of the 35th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 80, pp. 1832–1841. PMLR (2018)
Hachem, W., Hardy, A., Najim, J.: Large complex correlated Wishart matrices: fluctuations and asymptotic independence at the edges. Ann. Probab. 44(3), 2264–2348 (2016). https://doi.org/10.1214/15-AOP1022
Hastie, T., Montanari, A., Rosset, S., Tibshirani, R.: Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560 (2019)
Hestenes, M., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Research Nat. Bur. Standards 49, 409–436 (1952)
Hoare, C.A.R.: Quicksort. Comput. J. 5, 10–15 (1962). https://doi.org/10.1093/comjnl/5.1.10
Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: Convergence and generalization in neural networks. In: Advances in neural information processing systems (NeurIPS), vol. 31 (2018)
Knowles, A., Yin, J.: Anisotropic local laws for random matrices. Probab. Theory Related Fields 169(1-2), 257–352 (2017). https://doi.org/10.1007/s00440-016-0730-4
Kuijlaars, A.B.J., McLaughlin, K.T.R., Van Assche, W., Vanlessen, M.: The Riemann-Hilbert approach to strong asymptotics for orthogonal polynomials on \([-1,1]\). Adv. Math. 188(2), 337–398 (2004). https://doi.org/10.1016/j.aim.2003.08.015
Lacotte, J., Pilanci, M.: Optimal randomized first-order methods for least-squares problems. In: Proceedings of the 37th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 119, pp. 5587–5597. PMLR (2020)
Liao, Z., Couillet, R.: The dynamics of learning: A random matrix approach. In: Proceedings of the 35th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, vol. 80, pp. 3072–3081. PMLR (2018)
Louart, C., Liao, Z., Couillet, R.: A random matrix approach to neural networks. Ann. Appl. Probab. 28(2), 1190–1248 (2018). https://doi.org/10.1214/17-AAP1328
Marčenko, V., Pastur, L.: Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik (1967)
Martin, C., Mahoney, M.: Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research 22(165), 1–73 (2021)
Mei, S., Montanari, A.: The generalization error of random features regression: Precise asymptotics and double descent curve. Communications on Pure and Applied Mathematics (CPAM) (2019). https://doi.org/10.1002/cpa.22008
Menon, G., Trogdon, T.: Smoothed analysis for the conjugate gradient algorithm. SIGMA Symmetry Integrability Geom. Methods Appl. 12, Paper No. 109, 22 (2016). https://doi.org/10.3842/SIGMA.2016.109
Nemirovski, A.: Information-based complexity of convex programming. Lecture Notes (1995)
Nesterov, Y.: Introductory lectures on convex optimization: A basic course, Applied Optimization, vol. 87. Kluwer Academic Publishers (2004). https://doi.org/10.1007/978-1-4419-8853-9
Nesterov, Y.: How to make the gradients small. Optima 88 pp. 10–11 (2012)
Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., Abolafia, D., Pennington, J., Sohl-Dickstein, J.: Bayesian deep convolutional networks with many channels are gaussian processes. In: Proceedings of the 7th International Conference on Learning Representations (ICLR) (2019)
Papyan, V.: The full spectrum of deepnet hessians at scale: Dynamics with sgd training and sample size. arXiv preprint arXiv:1811.07062 (2018)
Paquette, E., Trogdon, T.: Universality for the conjugate gradient and minres algorithms on sample covariance matrices. arXiv preprint arXiv:2007.00640 (2020)
Pedregosa, F., Scieur, D.: Average-case acceleration through spectral density estimation. In: Proceedings of the 37th International Conference on Machine Learning (ICML), vol. 119, pp. 7553–7562 (2020)
Pennington, J., Worah, P.: Nonlinear random matrix theory for deep learning. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 30 (2017)
Pfrang, C.W., Deift, P., Menon, G.: How long does it take to compute the eigenvalues of a random symmetric matrix? In: Random matrix theory, interacting particle systems, and integrable systems, Math. Sci. Res. Inst. Publ., vol. 65, pp. 411–442. Cambridge Univ. Press, New York (2014)
Polyak, B.: Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 04, 791–803 (1964)
Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 20, pp. 1177–1184 (2008)
Sagun, L., Bottou, L., LeCun, Y.: Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476 (2016)
Sagun, L., Trogdon, T., LeCun, Y.: Universal halting times in optimization and machine learning. Quarterly of Applied Mathematics 76(2), 289–301 (2018). https://doi.org/10.1090/qam/1483
Sankar, A., Spielman, D.A., Teng, S.: Smoothed analysis of the condition numbers and growth factors of matrices. SIAM J. Matrix Anal. Appl. 28(2), 446–476 (2006). https://doi.org/10.1137/S0895479803436202
Schmidt, M., Le Roux, N.: Fast convergence of stochastic gradient descent under a strong growth condition. arXiv preprint arXiv:1308.6370 (2013)
Smale, S.: On the average number of steps of the simplex method of linear programming. Mathematical Programming 27(3), 241–262 (1983). https://doi.org/10.1007/BF02591902
Spielman, D., Teng, S.: Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. J. ACM 51(3), 385-463 (2004). https://doi.org/10.1017/CBO9780511721571.010
Su, W., Boyd, S., Candès, E.: A differential equation for modeling nesterov’s accelerated gradient method: Theory and insights. Journal of Machine Learning Research 17(153), 1–43 (2016)
Tao, T.: Topics in random matrix theory, vol. 132. American Mathematical Soc. (2012). https://doi.org/10.1090/gsm/132
Tao, T., Vu, V.: Random matrices: the distribution of the smallest singular values. Geom. Funct. Anal. 20(1), 260–297 (2010). https://doi.org/10.1007/s00039-010-0057-8
Taylor, A., Hendrickx, J., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1-2, Ser. A), 307–345 (2017). https://doi.org/10.1007/s10107-016-1009-3
Todd, M.J.: Probabilistic models for linear programming. Math. Oper. Res. 16(4), 671–693 (1991). https://doi.org/10.1287/moor.16.4.671
Trefethen, L.N., Schreiber, R.S.: Average-case stability of Gaussian elimination. SIAM J. Matrix Anal. Appl. 11(3), 335–360 (1990). https://doi.org/10.1137/0611023
Walpole, R.E., Myers, R.H.: Probability and statistics for engineers and scientists, second edn. Macmillan Publishing Co., Inc., New York; Collier Macmillan Publishers, London (1978)
Wilson, A., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. In: Advances in Neural Information Processing Systems (NeurIPS), vol. 30 (2017)
Acknowledgements
The authors would like to thank their colleagues Nicolas Le Roux, Ross Goroshin, Zaid Harchaoui, Damien Scieur, and Dmitriy Drusvyatskiy for their feedback on this manuscript, and Henrik Ueberschaer for providing useful random matrix theory references.
Additional information
Communicated by Jim Renegar.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
C. Paquette is a CIFAR AI chair. Research by E. Paquette was supported by a Discovery Grant from the Natural Science and Engineering Research Council (NSERC) of Canada.
Appendices
Derivation of Polynomials
In this section, we construct the residual polynomials for various popular first-order methods, including Nesterov’s accelerated gradient and Polyak momentum.
1.1 Nesterov’s Accelerated Methods
Nesterov accelerated methods generate iterates using the relation
Unrolling this update on the least squares problem (10), we get the following three-term recurrence
with the initial vector \({{\varvec{x}}}_0 \in {{\mathbb {R}}}^d\) and \({{\varvec{x}}}_1 = {{\varvec{x}}}_0-\alpha \nabla f({{\varvec{x}}}_0)\). Using these standard initial conditions, we deduce from Proposition 1 the following
It immediately follows that the residual polynomials satisfy the same three-term recurrence, namely,
By Proposition 1, we only need to derive an explicit expression for the \(P_k\) polynomials.
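Although the displayed recurrences are stated abstractly, any closed form derived below can be sanity-checked numerically: on the noiseless scalar quadratic \(f(x) = \tfrac{\lambda }{2}x^2\) with minimizer \(0\), a first-order method satisfies \(x_k = P_k(\lambda )\,x_0\), so running the method tabulates its residual polynomial. The sketch below does this for Nesterov's method; the update and the momentum schedule \(\beta _k = k/(k+3)\) are assumptions (a standard convex choice), not necessarily the exact scheme in (97).

```python
import numpy as np

def nesterov_residual_poly(lam, alpha, n_iters, beta=lambda k: k / (k + 3)):
    """Tabulate P_k(lam) for Nesterov's method on f(x) = lam/2 * x^2.

    On this scalar quadratic (minimizer 0, no noise) the iterates satisfy
    x_k = P_k(lam) * x_0, so running the method from x_0 = 1 returns the
    residual-polynomial values directly.  The momentum schedule `beta` is an
    assumed standard convex schedule, not necessarily the one in (97).
    """
    x, y = 1.0, 1.0                 # x_0 = y_0 = 1
    P = [1.0]                       # P_0 = 1
    for k in range(n_iters):
        x_next = y - alpha * lam * y                 # x_{k+1} = y_k - alpha * grad f(y_k)
        y = x_next + beta(k + 1) * (x_next - x)      # y_{k+1} = x_{k+1} + beta_{k+1} (x_{k+1} - x_k)
        x = x_next
        P.append(x)                 # P_{k+1}(lam)
    return np.array(P)

# Example: step size alpha = 1/lambda^+ and a small eigenvalue lam.
P = nesterov_residual_poly(lam=0.05, alpha=0.25, n_iters=20)
print(P[:4])    # P_0 = 1, P_1 = 1 - alpha*lam, ...
```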
1.1.1 Strongly Convex Setting
The polynomial recurrence relationship for Nesterov’s accelerated method in the strongly convex setting is given by
We generate an explicit representation for the polynomial by constructing the generating function for the polynomials \(P_k\), namely
We solve this expression for \({\mathfrak {G}}\), which gives
Ultimately, we want to relate the generating function for the polynomials \(P_k\) to a generating function for known polynomials. In this case, the generating functions of the Chebyshev polynomials of the first and second kind, denoted \((T_k(x))\) and \((U_k(x))\) respectively, resemble the generating function for the residual polynomials of Nesterov's accelerated method. The generating function for the Chebyshev polynomials is given as
To give the explicit relationship between (99) and (100), we make the substitution \(t \mapsto \frac{t}{(\beta (1-\alpha \lambda ))^{1/2}}\). A simple calculation yields the following
We can compare (100) with (101) to derive an expression for the polynomials \(P_k\)
where \(T_k\) is the Chebyshev polynomial of the first kind and \(U_k\) is the Chebyshev polynomial of the second kind.
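Since display (100) is not reproduced above, the following sketch records the classical identities the derivation relies on, \(\sum _{k \ge 0} T_k(x)t^k = \frac{1-tx}{1-2tx+t^2}\) and \(\sum _{k \ge 0} U_k(x)t^k = \frac{1}{1-2tx+t^2}\) for \(|t| < 1\) and \(|x| \le 1\), and checks them numerically.

```python
import numpy as np
from scipy.special import eval_chebyt, eval_chebyu

# Classical generating functions of the Chebyshev polynomials on [-1, 1]:
#   sum_k T_k(x) t^k = (1 - t x) / (1 - 2 t x + t^2)
#   sum_k U_k(x) t^k = 1 / (1 - 2 t x + t^2)
x, t = 0.3, 0.5
ks = np.arange(0, 200)            # truncate the series; |t| < 1 so it converges fast

T_series = np.sum(eval_chebyt(ks, x) * t**ks)
U_series = np.sum(eval_chebyu(ks, x) * t**ks)

assert np.isclose(T_series, (1 - t * x) / (1 - 2 * t * x + t**2))
assert np.isclose(U_series, 1.0 / (1 - 2 * t * x + t**2))
```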
1.1.2 Convex Setting: Legendre Polynomials and Bessel Asymptotics
When the objective function is convex (i.e., \(\lambda _{{{\varvec{H}}}}^- = 0\)), the recurrence for the residual polynomial associated with Nesterov’s accelerated method reduces to
We now seek to solve this recurrence.
Nesterov’s polynomials as Legendre polynomials. First we observe that these polynomials are also polynomials in \(u =\alpha \lambda \), so we can define new polynomials \({\widetilde{P}}_k(u)\) such that \({\widetilde{P}}_k(\alpha \lambda ) = P_k(\lambda ; \lambda _{{{\varvec{H}}}}^{\pm })\). Let us define new polynomials \({\widetilde{R}}_k(u) = {\widetilde{P}}_k(u)(1-u)^{-k/2}\). Then the recurrence in (103) can be reformulated as
A simple computation shows that the polynomials \(\{{\widetilde{R}}_k\}\) are polynomials in \(v = (1-u)^{1/2}\). Because of this observation, we define new polynomials \(R_k(v)\) where \(R_k( (1-u)^{1/2}) = {\widetilde{R}}_k(u)\). Now we will find a formula for the polynomials \(R_k\) by constructing its generating function,
The recurrence in (104) together with the definition of \(R_k\) yields the following differential identity
One can see this is a first-order linear ODE with initial conditions given by
Using the integrating factor \(\mu (t) = \tfrac{t}{\sqrt{t^2-2tv +1}}\), the solution to this initial value problem is
At first glance, this does not seem related to any known generating function for a polynomial; however, if we differentiate this function, we get that
and it is known that the generating function for the Legendre polynomials \(\{L_k\}\) is exactly
Hence, it follows that
and so for \(k \ge 1\), by comparing coefficients we deduce that
Replacing all of our substitutions back in, we get the following representation for the Nesterov polynomials for \(k \ge 1\)
where \(\{L_k\}\) are the Legendre polynomials.
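The Legendre generating function used above is the classical identity \(\sum _{k \ge 0} L_k(v)t^k = (1 - 2tv + t^2)^{-1/2}\); a quick numerical check:

```python
import numpy as np
from scipy.special import eval_legendre

# Classical generating function of the Legendre polynomials:
#   sum_k L_k(v) t^k = 1 / sqrt(1 - 2 t v + t^2),   |t| < 1, |v| <= 1
v, t = 0.7, 0.4
ks = np.arange(0, 200)

series = np.sum(eval_legendre(ks, v) * t**ks)
closed_form = 1.0 / np.sqrt(1 - 2 * t * v + t**2)
assert np.isclose(series, closed_form)
```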
Bessel asymptotics for Nesterov’s residual polynomials. In this section, we derive an asymptotic for the residual polynomials of Nesterov’s accelerated method in the convex setting. We will show that the polynomials \(P_k\) in (105) satisfy in a sufficiently strong sense
where \(J_1\) is the Bessel function of the first kind. Another possible way to derive this asymptotic is to extract the second-order asymptotics from [34]. We will show that the Bessel asymptotic (106) follows directly from the Legendre polynomial representation in (105). To see this, recall that the integral representation of the Legendre polynomials is
and so we have
Now define the polynomial \({\widetilde{P}}_k(u) = P_k( \lambda ^+ u; \lambda _{{{\varvec{H}}}}^{\pm })\) where the polynomials \(P_k\) satisfies Nesterov’s recurrence (97). Using the derivation of Nesterov’s polynomial from the previous section (105), we obtain the following expression
We can get an explicit expression for the imaginary part of the k-th power in (107) by expressing \(\sqrt{1-u} + i \sqrt{u} \cos (\phi )\) in terms of its polar form. In particular, we have that
Hence, we have the following
Define the following integral
and note the similarity of this integral with the Bessel function, namely
Using this definition, the polynomial can be written as \({\widetilde{P}}_k(u) = \frac{2(1-u)^{(k+1)/2}}{k\sqrt{u}}I_k(u)\). Since \(I_k\) is always bounded, for \(u \ge \log ^2(k)/k\) the magnitude \(|{\widetilde{P}}_k(u)|\) decays faster than any power of k. This follows from the bound \((1-x)^k \le \text {exp}(-kx)\) and the fact that \(\text {exp}(-\log ^2(k))\) decays faster than any polynomial in k. The interesting asymptotic is therefore for \(u \le \log ^2(k)/k\), and for this range we prove Lemma 9 below.
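The Bessel function \(J_1\) entering the asymptotic has the classical integral representation \(J_1(z) = \frac{1}{\pi }\int _0^{\pi } \cos (\phi - z\sin \phi )\, d\phi \), which is the similarity with \(I_k\) noted above; a quick check of this classical formula (the paper's \(I_k(u)\) from display (109) is not reproduced here):

```python
import numpy as np
from scipy.special import j1
from scipy.integrate import quad

def bessel_j1_integral(z):
    """Classical Bessel integral J_1(z) = (1/pi) * int_0^pi cos(phi - z sin(phi)) dphi."""
    val, _ = quad(lambda phi: np.cos(phi - z * np.sin(phi)), 0.0, np.pi)
    return val / np.pi

for z in [0.5, 2.0, 7.3]:
    assert np.isclose(bessel_j1_integral(z), j1(z))
```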
Lemma 9
There is an absolute constant C so that for all \(k \ge 1\) and \(0 \le u \le \log ^2(k)/k\)
Corollary 1
(Nesterov’s polynomial asymptotic) There is an absolute constant C so that for all \(k \ge 1\) and all \(0 \le u \le \tfrac{\log ^2(k)}{k}\), the following holds
In particular, the following result holds for all \(0 \le u \le \frac{\log ^2(k)}{k}\)
Proof of Corollary 1
First, we have that \({\widetilde{P}}_k(u) = \frac{2(1-u)^{(k+1)/2}}{k \sqrt{u}} I_k(u)\). A simple triangle inequality shows that
The first difference is small because \(|(1-u)^{(k+1)/2} - e^{-uk/2}| \le C e^{-uk/2} (u + ku^2)\) for some absolute constant C and \(I_k\) is bounded. The second difference follows directly from Lemma 9. The second inequality (111) follows from \(|a^2-b^2| \le |a-b| (|a-b| + 2|b|)\) and \(J_1(x) \le \frac{C}{\sqrt{x}}\). \(\square \)
Proof of Lemma 9
First, we observe that \(z \mapsto z^k\) is k-Lipschitz on the interval [0, 1]. Since \(e^{-u}\) and \(1-u\) lie in the interval [0, 1] for any \(u \in [0,1]\), the Lipschitz property of \(z^k\) and a second-order Taylor approximation of \(e^{-u}\) imply that there exists \(\xi \in [0,1]\) such that
Here we used that \(u \le \tfrac{\log ^2(k)}{k}\). Similarly, we have that
We also know that \(\sin (k x)\) is k-Lipschitz. Therefore, again by Taylor approximation on \(\tan ^{-1}(x)\), we deduce the following bound for some \(\xi _{u, \phi } \in [0, \sqrt{\tfrac{u}{1-u}}]\)
Moreover, suppose \(v = \sqrt{u}\) and consider the function \(\sqrt{\frac{u}{1-u}} = \tfrac{v}{\sqrt{1-v^2}}\). Using a Taylor approximation at \(v = 0\) and the k-Lipschitz property of \(\sin (kx)\), we obtain that
Since \(0 \le u \le \tfrac{\log ^2(k)}{k}\), we have that \(\sqrt{\frac{u}{1-u}}\) is bounded by some constant C independent of k and hence, the constant \(\xi _{u, \phi }\) is bounded. This implies that \(\frac{6 \xi _{u, \phi }^2 -2}{(\xi _{u,\phi }+1)^3}\) is bounded by some absolute constant \({\widetilde{C}}\). By putting together (114) and (115), we deduce that
where C is some absolute constant. Here we used that \(u \le \tfrac{\log ^2(k)}{k}\) and \(\log ^2(k) \le C k^{1/3}\). We now consider three cases. Suppose \(u \le k^{-4/3}\). Using \(u \le k^{-4/3}\) and (113) we deduce that
By putting together (116) and (117), we get the following bound for all \(u \le k^{-4/3}\).
The result immediately follows. On the other hand for \(k^{-4/3} \le u \le \log ^2(k)/k\), we cut the range of \(\phi \). Let \(\phi _0\) be such that \(\phi _0 = k^{-2/3} u^{-1/2}\). We know from \(u \ge k^{-4/3}\) that
Now for \(\phi \le \phi _0\), we have in this range that
In the last inequality we used that \(\phi _0 = k^{-2/3} u^{-1/2}\) and \(u \le \log ^2(k)/k\). Since \(\frac{\log ^3(k)}{k^{1/2}} \le C k^{-1/3}\), the result immediately follows.
For larger \(\phi \), we use integration by parts with \(F(\phi ) = \cos (k \sqrt{u} \cos (\phi ))\) and \(G(\phi ) = e^{-k \sin ^2(\phi )u/2} \cot (\phi )\) to express
Since \(u \in [0,1]\), we get the following bound
for some \(C > 0\). We will bound each of the terms in (120) independently. For (a), Taylor’s approximation yields that \(|\cot (\phi _0) - \tfrac{1}{\phi _0} | \le C\) which implies that \(|\cot (\phi _0)| \le \tfrac{C}{\phi _0}\) for some positive constants. Therefore, we deduce that the quantity \(\text {(a)}\) is bounded by \(\frac{C}{k \sqrt{u} \phi _0}\). For (b) since \(|F(\phi )| \le 1\), \(|\cot (\phi )\sin (2\phi )| = |2\cos ^2(\phi )| \le 2\) and, of course, \(| e^{-k \sin ^2 \phi u/2} | \le 1\), we have that the quantity (b) is bounded by \(C \sqrt{u}\). As for the quantity (c), we use the following approximation \(|\csc ^2(\phi )- \tfrac{1}{\phi ^2}| \le C\) so that \(|\csc ^2(\phi )| \le \frac{C}{\phi ^2}\). Hence, the integral (c) is bounded by \(\tfrac{C}{k\sqrt{u} \phi _0}\). Therefore, we conclude that
Here we used that \(k \sqrt{u} \phi _0 = k^{1/3}\) and \(\sqrt{u} \le \log (k)/\sqrt{k} \le C k^{-1/3}\).
Now let us repeat this process, replacing \(G(\phi ) = e^{-k \sin ^2(\phi ) u/2} \cot (\phi )\) with \(G(\phi ) = \cot (\phi )\). This time we have that
Using the same bounds as before, we deduce the following
For \(u \ge k^{-4/3}\), we have the following result
This finishes the proof for the lemma. \(\square \)
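As an informal check on Corollary 1, one can tabulate \({\widetilde{P}}_k(u)\) by running the method on a scalar quadratic with \(\alpha \lambda = u\) (as in the earlier sketch) and compare it with \(\frac{2e^{-uk/2}}{k\sqrt{u}}J_1(k\sqrt{u})\). The momentum schedule below is again an assumed standard convex choice, so the agreement is only indicative.

```python
import numpy as np
from scipy.special import j1

def nesterov_P(u, n_iters, beta=lambda k: k / (k + 3)):
    """P_k evaluated at alpha*lambda = u, obtained by running Nesterov's method
    on a scalar quadratic with alpha*lambda = u, starting at x_0 = 1.
    The momentum schedule is an assumed standard convex choice."""
    x, y = 1.0, 1.0
    vals = [1.0]
    for k in range(n_iters):
        x_next = (1.0 - u) * y
        y = x_next + beta(k + 1) * (x_next - x)
        x = x_next
        vals.append(x)
    return np.array(vals)

k = 1000
for u in [0.2 / k, 1.0 / k, 4.0 / k]:
    exact = nesterov_P(u, k)[k]
    bessel = 2.0 * np.exp(-u * k / 2.0) / (k * np.sqrt(u)) * j1(k * np.sqrt(u))
    print(f"u = {u:.2e}:  P_k = {exact:+.4e}   Bessel approx = {bessel:+.4e}")
```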
1.2 Polyak Momentum (Heavy-Ball) Method
The residual polynomials generated by Polyak’s heavy-ball method satisfy the following three-term recursion
As in the previous examples, we will construct the generating function for the polynomials \(P_k\) using the recurrence in (123)
We solve for the generating function
This generating function for Polyak resembles the generating function for Chebyshev polynomials of the first and second kind (100). First, we set \(t \mapsto \frac{t}{\sqrt{-m}}\) (note that \(m < 0\) by definition in (123)). Under this transformation, we have the following
where \(\sigma (\lambda ) = \frac{\lambda _{{{\varvec{H}}}}^+ + \lambda _{{{\varvec{H}}}}^- - 2 \lambda }{\lambda _{{{\varvec{H}}}}^+ - \lambda _{{{\varvec{H}}}}^-}\). A simple computation shows that
By matching terms in the generating function for Chebyshev polynomials (100) and Polyak’s generating function (124), we derive an expression for the polynomials \(P_k\) in Polyak’s momentum
where \(T_k(x)\) is the Chebyshev polynomial of the first kind and \(U_k(x)\) is the Chebyshev polynomial of the second kind.
Average-Case Complexity
In this section, we compute the average-case complexity for various first-order methods. To do so, we integrate the residual polynomials found in Table 3 against the Marčenko-Pastur density.
Lemma 10
(Average-case: Gradient descent) Let \(\mathrm {d}\mu _{\mathrm {MP}}\) be the Marčenko-Pastur law defined in (2) and \(P_k, Q_k\) be the residual polynomials for gradient descent.
1. For \(r = 1\) and \(\ell \in \{1,2\}\), the following holds
$$\begin{aligned} \int \lambda ^\ell P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}&= \frac{(\lambda ^+)^{\ell +1}}{2 \pi \sigma ^2} \cdot \frac{\varGamma (2k + \tfrac{3}{2}) \varGamma (\ell + \tfrac{1}{2})}{\varGamma (2k+\ell +2)}\\&\sim \frac{(\lambda ^+)^{\ell + 1}}{2 \pi \sigma ^2} \cdot \frac{\varGamma (\ell + \tfrac{1}{2})}{(2k + 3/2)^{\ell + 1/2}}. \end{aligned}$$
2. For \(r \ne 1\), the following holds
$$\begin{aligned} \int \lambda P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}=&\frac{(\lambda ^+-\lambda ^-)^2}{2 \pi \sigma ^2 r} \left( 1 - \frac{\lambda ^-}{\lambda ^+} \right) ^{2k} \frac{\varGamma (2k + \tfrac{3}{2}) \varGamma (\tfrac{3}{2})}{\varGamma (2k + 3)}\\&\sim \frac{(\lambda ^+-\lambda ^-)^2}{2 \pi \sigma ^2 r} \left( 1 - \frac{\lambda ^-}{\lambda ^+} \right) ^{2k} \cdot \frac{\varGamma (\tfrac{3}{2})}{(2k + \tfrac{3}{2})^{3/2}}. \end{aligned}$$
3. For \(r \ne 1\), the following holds
$$\begin{aligned}&\int \lambda ^2 P_k^2(\lambda ; \lambda ^{\pm }) \, d\mu _{\mathrm {MP}}\\&\quad = \frac{(\lambda ^+-\lambda ^-)^2}{2 \pi \sigma ^2 r} \left( 1 - \frac{\lambda ^-}{\lambda ^+} \right) ^{2k} \left( \frac{\lambda ^- \cdot \varGamma (2k + \tfrac{3}{2}) \varGamma (\tfrac{3}{2})}{\varGamma (2k + 3)} + \frac{(\lambda ^+-\lambda ^-) \cdot \varGamma (2k + \tfrac{3}{2}) \varGamma (\tfrac{5}{2})}{\varGamma (2k + 4)} \right) \\&\qquad \sim \frac{(\lambda ^+-\lambda ^-)^2}{2 \pi \sigma ^2 r} \left( 1 - \frac{\lambda ^-}{\lambda ^+} \right) ^{2k} \left( \frac{\lambda ^- \cdot \varGamma (\tfrac{3}{2})}{(2k + \tfrac{3}{2})^{3/2}} + \frac{(\lambda ^+-\lambda ^-) \varGamma (\tfrac{5}{2})}{(2k + \tfrac{3}{2})^{5/2}} \right) . \end{aligned}$$
Proof
The proof relies on writing the integrals in terms of \(\beta \)-functions. Let \(\ell \in \{1,2\}\). Using the change of variables \(\lambda = \lambda ^- + (\lambda ^+-\lambda ^-) w\), we deduce the following expression
We consider cases depending on whether \(\lambda ^- = 0\) (i.e., \(r = 1\)) or not. First suppose \(\lambda ^- = 0\); then by equation (126) we have
The result follows after noting that the integral is a \(\beta \)-function with parameters \(2k + 3/2\) and \(\ell + 1/2\), together with the asymptotics of \(\beta \)-functions, \(\beta (x,y) \sim \varGamma (y) x^{-y}\) as \(x \rightarrow \infty \) with y fixed.
Next consider when \(r \ne 1\) and \(\ell =1\). Using (126), we have that
The integral is a \(\beta \)-function with parameters \(2k + 3/2\) and 3/2. Applying the asymptotics of \(\beta \)-functions finishes this case.
Lastly consider when \(r \ne 1\) and \(\ell = 2\). Similar to the previous case, using (126), the following holds
The first integral is a \(\beta \)-function with parameters \(2k + 3/2\) and 3/2 and the second term is a \(\beta \)-function with parameters \(2k + 3/2\) and 5/2. Again using the asymptotics for \(\beta \)-functions yields the result. \(\square \)
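Part 1 can be double-checked by numerical quadrature. The sketch below assumes the square case \(r = 1\) with \(\lambda ^- = 0\), \(\lambda ^+ = 4\sigma ^2\), the Marčenko-Pastur density \(\frac{\sqrt{\lambda (\lambda ^+ - \lambda )}}{2\pi \sigma ^2 \lambda }\) on \([0, \lambda ^+]\), and the gradient descent residual polynomial \(P_k(\lambda ) = (1-\lambda /\lambda ^+)^k\) with step size \(1/\lambda ^+\); these conventions are assumptions consistent with the closed form above, since definition (2) and Table 3 are not reproduced here.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

sigma2 = 1.0
lam_plus = 4.0 * sigma2           # r = 1, so lambda^- = 0 and lambda^+ = 4 sigma^2 (assumed)

def mp_density(lam):
    # Marchenko-Pastur density for r = 1 (assumed normalization).
    return np.sqrt(lam * (lam_plus - lam)) / (2.0 * np.pi * sigma2 * lam)

def lhs(k, ell):
    # int lambda^ell P_k^2(lambda) dmu_MP with P_k(lambda) = (1 - lambda/lam_plus)^k.
    integrand = lambda lam: lam**ell * (1.0 - lam / lam_plus)**(2 * k) * mp_density(lam)
    val, _ = quad(integrand, 0.0, lam_plus)
    return val

def rhs(k, ell):
    # Closed form from Lemma 10, part 1.
    return (lam_plus**(ell + 1) / (2.0 * np.pi * sigma2)
            * gamma(2 * k + 1.5) * gamma(ell + 0.5) / gamma(2 * k + ell + 2))

for k in [1, 5, 25]:
    for ell in [1, 2]:
        assert np.isclose(lhs(k, ell), rhs(k, ell), rtol=1e-5)
```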
Lemma 11
(Average-case: Nesterov accelerated method (strongly convex)) Let \(d\mu _{\mathrm {MP}}\) be the Marčenko-Pastur law defined in (2) and \(P_k\) be the residual polynomial for Nesterov accelerated method on a strongly convex objective function (102). Then the following holds
and the integral equals
where \(\beta = \frac{\sqrt{\lambda ^+}-\sqrt{\lambda ^-}}{\sqrt{\lambda ^+} + \sqrt{\lambda ^-}}\) and \(\alpha = \frac{1}{\lambda ^+}\).
Proof
To simplify the notation, throughout this proof we write \(P_k(\lambda )\) for \(P_k(\lambda ; \lambda ^{\pm })\). In order to integrate the Chebyshev polynomials, we reduce our integral to trigonometric functions via a series of changes of variables. Under the change of variables that sends \(\lambda = \lambda ^{-} + (\lambda ^+-\lambda ^-) w\), we have that for any \(\ell \ge 1\)
We note that under this transformation \(1-\alpha \lambda = (1-\tfrac{\lambda ^-}{\lambda ^+}) (1-w)\) and \(\tfrac{1+\beta }{2 \sqrt{\beta }} \sqrt{1-\alpha \lambda } = (1-w)^{1/2}\). Moreover, by expanding out Nesterov’s polynomial (102), we deduce the following
First, we consider the setting where \(\ell = 1\) in (128) and hence we deduce that
We will treat each term in the summand separately. Because \(\int _0^{2\pi } e^{i k\theta } \, d\theta = 0 \) for any nonzero integer k, we only need to keep track of the constant terms. From this observation, we get the following
We note that \(\tfrac{1}{4^k} \left( \left( {\begin{array}{c}2k+2\\ k+1\end{array}}\right) - \left( {\begin{array}{c}2k+2\\ k\end{array}}\right) \right) \sim \tfrac{4}{\sqrt{\pi } k^{3/2}}\) and \(\tfrac{1}{4^k} \left( {\begin{array}{c}2k+2\\ k+1\end{array}}\right) \sim \frac{4}{\sqrt{\pi } k^{1/2}}\).
Next, we consider the setting where \(\ell = 2\) and we observe from (128) that
Since we know how to evaluate \(\int \lambda P_k^2(\lambda ) d\mu _{\mathrm {MP}}\), we only need to analyze the remaining integral on the right-hand side. A similar analysis as in (130) applies
As before, we treat each term in the summand separately and use that \(\int _0^{2\pi } e^{ik\theta } \, d\theta = 0\) to keep track only of the constant terms:
The result immediately follows. \(\square \)
Lemma 12
(Average-case: Polyak Momentum) Let \(d\mu _{\mathrm {MP}}\) be the Marčenko-Pastur law defined in (2) and \(P_k\) be the residual polynomial for Polyak’s (heavy-ball) method (125). Then the following holds
Proof
In order to simplify notation, we define the following
Under the change of variables \(u = \sigma (\lambda )\), we deduce the following for any \(\ell \ge 1\)
First, we consider when \(\ell = 1\). We convert this into a trigonometric integral using the substitution \(u = \cos (\theta )\), which interacts nicely with the Chebyshev polynomials. In particular, we deduce the following
Treating each term in the summand separately, we can evaluate each integral
The result follows. Next we consider when \(\ell = 2\). A quick calculation using (135) shows that
Since we evaluated the first integral, it suffices to analyze the second one. Again, we use the trigonometric substitution \(u = \cos (\theta )\) and deduce the following
Treating each term in the summand separately, we can evaluate each integral
\(\square \)
Lemma 13
(Average-case: Nesterov accelerated method (convex)) Let \(d\mu _{\mathrm {MP}}\) be the Marčenko-Pastur law defined in (2). Suppose the polynomials \(P_k\) are the residual polynomials for Nesterov’s accelerated gradient (105). If the ratio \(r = 1\), the following asymptotic holds
Proof
Define the polynomial \({\widetilde{P}}_k(u) = P_k( \lambda ^+ u; \lambda ^{\pm })\) where the polynomial \(P_k\) satisfies Nesterov’s recurrence (105). Using the change of variables \(u = \frac{\lambda }{\lambda ^+}\), we get the following relationship
In the equality above, the first integral will give the stated asymptotic, while the second integral we bound using Corollary 1. We start by bounding the second integral. We break this integral into three components based on the value of u
For (i) in (140), we bound the integrand using Corollary 1 such that for all \(u \le k^{-4/3}\)
Therefore, we get that the integral (i) is bounded by
for sufficiently large k and absolute constant C.
For (ii) in (140), we bound the integrand using Corollary 1 to get for all \(k^{-4/3} \le u \le \log ^2(k)/k\) we have
Therefore, we get that the integral (ii) is bounded by
For (iii) in (140), we use a simple bound on the functions \({\widetilde{P}}_k(u) = \frac{2(1-u)^{(k+1)/2}}{k\sqrt{u}} I_k(u)\), where \(I_k(u)\) is defined in (109), and \(J_1(k\sqrt{u})\) for \(u \ge \log ^2(k)/k\)
In the last inequality, we used that the functions \(I_k(u)\) and \(J_1(k\sqrt{u})\) are bounded and \((1-u)^{k+1} \le e^{-uk}\). Since \(e^{-\log ^2(k)}\) decays faster than any polynomial, we have that
for sufficiently large k and some absolute constant C.
Combining (142), (144), and (146) with (140), we have for \(\ell \ge 1\) the following
All that remains is to integrate the Bessel part in (139) to derive the asymptotic. Here we must consider the cases \(\ell =1\) and \(\ell =2\) separately. For \(\ell = 1\), using the change of variables \(v = k \sqrt{u}\), we have that
where \({\mathcal {E}}_i\) is the exponential integral. It is known that the exponential integral satisfies \(\frac{-1}{2 \pi } {\mathcal {E}}_i(-\tfrac{1}{\sqrt{k}}) \sim \frac{\log (k)}{4\pi }\).
For \(\ell =2\), using the change of variables \(v = uk\), we have the following
The results follow. \(\square \)
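The logarithmic growth in the \(\ell = 1\) case comes from the small-argument behavior of the exponential integral, \(E_1(x) = -{\mathcal {E}}_i(-x) = -\gamma - \log (x) + O(x)\) as \(x \rightarrow 0^+\); evaluated at \(x = 1/\sqrt{k}\) this produces the \(\frac{\log (k)}{4\pi }\) rate quoted above (the choice of argument is the reading consistent with that rate). A quick numerical check of this behavior:

```python
import numpy as np
from scipy.special import exp1    # E_1(x) = -Ei(-x) for x > 0

gamma_euler = 0.5772156649015329

for k in [1e4, 1e8, 1e12]:
    x = 1.0 / np.sqrt(k)
    lhs = exp1(x) / (2.0 * np.pi)                             # = -Ei(-1/sqrt(k)) / (2*pi)
    rhs = (0.5 * np.log(k) - gamma_euler) / (2.0 * np.pi)     # from E_1(x) ~ -gamma - log(x)
    print(k, lhs, rhs, np.log(k) / (4.0 * np.pi))
```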
Adversarial Model Computations
In this section, we derive the adversarial guarantees for gradient descent and Nesterov’s accelerated method.
Lemma 14
(Adversarial model: Gradient descent) Suppose Assumption 1 holds. Let \(\lambda ^+\) (\(\lambda ^-\)) be the upper (lower) edge of the Marčenko-Pastur distribution (2) and \(P_k\) the residual polynomial for gradient descent. Then the adversarial model for the maximal expected squared norm of the gradient is the following.
1. If there is no noise (\({\widetilde{R}} = 0\)), then
$$\begin{aligned} \lim _{d \rightarrow \infty } \max _{{{\varvec{H}}}} {\mathbb {E}} \big [ \Vert \nabla f({{\varvec{x}}}_k)\Vert ^2 \big ] \sim {\left\{ \begin{array}{ll} \frac{R^2 (\lambda ^+)^2}{(k+1)^2} e^{-2}, &{} \text {if }\lambda ^- = 0\\ R^2 (\lambda ^-)^2 \left( 1 - \frac{\lambda ^-}{\lambda ^+} \right) ^{2k}, &{} \text {if }\lambda ^- > 0. \end{array}\right. } \end{aligned}$$
2. If \({\widetilde{R}} > 0\), then the following holds
$$\begin{aligned} \lim _{d \rightarrow \infty } \max _{{{\varvec{H}}}} {\mathbb {E}} \big [ \Vert \nabla f({{\varvec{x}}}_k) \Vert ^2 \big ] \sim {\left\{ \begin{array}{ll} \left[ \frac{R^2 (\lambda ^+)^2}{4} \frac{1}{k^2} + \frac{ {\widetilde{R}}^2 \lambda ^+}{2} \frac{1}{k} \right] e^{-2}, &{} \text {if }\lambda ^- = 0\\ \big [ R^2 (\lambda ^-)^2 + r {\widetilde{R}}^2 \lambda ^- \big ] \big (1- \frac{\lambda ^-}{\lambda ^+} \big )^{2k}, &{} \text {if }\lambda ^- > 0. \end{array}\right. } \end{aligned}$$
Proof
Suppose we are in the noiseless setting. By a change of variables, setting \(u = \lambda /\lambda ^+\), the following holds
Taking derivatives, we get that the maximum of the RHS occurs when \(u = \frac{1}{k+1}\). For sufficiently large k and \(\lambda ^- > 0\), this maximizer lies outside the constraint set \(\big [\tfrac{\lambda ^-}{\lambda ^+}, 1 \big ]\). Hence, the maximum occurs on the boundary, or equivalently, where \(u = \tfrac{\lambda ^-}{\lambda ^+}\). The result in the setting when \(\lambda ^- > 0\) immediately follows from this. When \(\lambda ^- = 0\), the maximum does occur at \(\frac{1}{k+1}\). Plugging this value into the RHS of (148) and noting that for sufficiently large k, \((1-1/(k+1))^{2k} \rightarrow e^{-2}\), we get the other result for the noiseless case.
Now suppose that \({\widetilde{R}} > 0\). By a change variables, setting \(u = \lambda / \lambda ^+\), we have that
The derivative \(h'(u)\) equals 0 at \(u = 1\) (local minimum) and at solutions to the quadratic
There is only one positive root of this quadratic so
We can approximate the square root using Taylor approximation to get that
Putting this together into (150), we get that
As before, for sufficiently large k and \(\lambda ^- > 0\), the root above lies outside the constraint set \(\big [ \frac{\lambda ^-}{\lambda ^+}, 1 \big ]\) and so the maximum occurs on the boundary, or equivalently, at \(u = \frac{\lambda ^-}{\lambda ^+}\). The result immediately follows by plugging this u into (150). When \(\lambda ^- =0\), the maximizer is the root of the above quadratic, which asymptotically equals 1/k. Plugging this value into (150) and noting for sufficiently large k that \((1-1/k)^{2k} \approx e^{-2}\), we get the result for the noisy setting. \(\square \)
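The maximizer computation above is easy to confirm numerically. The sketch below assumes the noiseless objective is proportional to \(u^2(1-u)^{2k}\) after the change of variables, which is consistent with the stated maximizer \(u = \frac{1}{k+1}\) and the limiting factor \(e^{-2}\), although display (148) is not reproduced here.

```python
import numpy as np

# Noiseless adversarial objective for gradient descent, up to constants
# (assumed form u^2 (1 - u)^(2k), consistent with the stated maximizer and limit).
def g(u, k):
    return u**2 * (1.0 - u)**(2 * k)

for k in [10, 100, 1000]:
    u_grid = np.linspace(0.0, 1.0, 200001)
    u_star = u_grid[np.argmax(g(u_grid, k))]
    print(k, u_star, 1.0 / (k + 1),               # numerical vs claimed maximizer
          (k + 1)**2 * g(1.0 / (k + 1), k))       # -> e^{-2} ~ 0.1353
```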
Lemma 15
(Adversarial model: Nesterov (convex)) Suppose Assumption 1 holds. Let \(\lambda ^+\) be the upper edge of the Marčenko-Pastur distribution (2) and \(P_k\) the residual polynomial for Nesterov’s accelerated method (105). Suppose \(r = 1\). Then the adversarial model for the maximal expected squared norm of the gradient is the following.
1. If there is no noise (\({\widetilde{R}} = 0\)), then
$$\begin{aligned} \lim _{d \rightarrow \infty } \max _{{{\varvec{H}}}} {\mathbb {E}} \big [ \Vert \nabla f({{\varvec{x}}}_k)\Vert ^2 \big ] \sim \frac{8e^{-1/2}}{\sqrt{2}\pi } (\lambda ^+)^2 R^2 \frac{1}{k^{7/2}}. \end{aligned}$$
2. If \({\widetilde{R}} > 0\), then the following holds
$$\begin{aligned} \lim _{d \rightarrow \infty } \max _{{{\varvec{H}}}} {\mathbb {E}} \big [ \Vert \nabla f({{\varvec{x}}}_k) \Vert ^2 \big ] \sim \Vert J_1^2(x)\Vert _{\infty } (\lambda ^+){\widetilde{R}}^2 \frac{1}{k^2}. \end{aligned}$$
Proof
First, we claim that
Using the definitions in (108) and (109) for \({\widetilde{P}}_k(u)\) and \(I_k(u)\), respectively, we can write
Now by a change of variables we have the following
Let us first consider the second term in the maximum. Here we use that \(|I_k(u)|\) is bounded, so that
The function h(u) is maximized when \(u = \tfrac{1}{k+2}\) and hence the maximum over the constrained set occurs at the endpoint \(\tfrac{\log ^2(k)}{k}\). With this value, it is immediately clear that the maximum over \(u \in [ \tfrac{\log ^2(k)}{k}, 1]\) of \((\lambda ^+) k^{7/2} u^2 {\widetilde{P}}_k^2(u) \rightarrow 0\). Now we consider the first term in (152). In this regime, the polynomial \({\widetilde{P}}_k^2(u)\) behaves like the Bessel function in (34). We further break up the interval \([0, \log ^2(k)/k]\) into larger or smaller than \(k^{-4/3}\). When \(u \in [0, k^{-4/3}]\), Corollary 1 says there exists a constant C such that
Similarly when \(u \in [k^{-4/3}, \log ^2(k)/k]\), Corollary 1 yields that
Using a change of variables, the relevant asymptotic to compute is
From the uniform boundedness of the function \(y \mapsto \sqrt{y}J_1^2(\sqrt{y}),\) there is a constant \({\mathcal {C}} > 0\) so that
Moreover, the Bessel function satisfies
and so for any fixed \(\delta >0\)
As \(\delta > 0\) is arbitrary, picking it sufficiently small completes the claim.
Next we claim that the following holds
By a similar argument as above, using that in this regime the exponential dominates the polynomial \({\widetilde{P}}_k^2(u)\), we have
Now we need to consider the regime where the Bessel function (34) becomes important. We use our asymptotic in Corollary 1 to show that the polynomial is close to the Bessel, namely,
It remains to compute the maximum of the Bessel expression in (34) for \(u\) between 0 and \(\log ^2(k)/k\). Now there exists an absolute constant \({\mathcal {C}}\) so that \(|J_1(x)^2| \le \tfrac{{\mathcal {C}}}{|x|}\) and there is also an \(\eta > 0\) so that \(\displaystyle \max _{0 \le x \le \tfrac{1}{\eta }} |J_1^2(x)| > \eta \). Moreover, the maximizer of \(J_1^2(x)\) exists. By picking R sufficiently large, we see that
This means that the maximum must occur for u between 0 and \(R/k^2\). Hence, by picking R sufficiently large, we have the following
Consequently, for sufficiently large k, the maximizer of
\(\square \)
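The constant \(\Vert J_1^2(x)\Vert _\infty \) appearing in Lemma 15 is attained at the first positive maximum of \(J_1\) and can be evaluated numerically:

```python
import numpy as np
from scipy.special import j1
from scipy.optimize import minimize_scalar

# ||J_1^2||_inf is attained at the first positive maximum of J_1.
res = minimize_scalar(lambda x: -j1(x)**2, bounds=(0.5, 3.0), method='bounded')
x_star, max_val = res.x, j1(res.x)**2
print(x_star, max_val)     # approximately x ~ 1.8412, ||J_1^2||_inf ~ 0.3386
```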
Simulation Details
For Figs. 1 and 7, which show that the halting time concentrates, we perform \(\frac{2^{12}}{\sqrt{d}}\) training runs for each value of d. In our initial simulations we observed that the empirical standard deviation decreased roughly as \(d^{-1/2}\) as the model size grew. Because the larger models have a significant runtime but very little variance in the halting time, we decided to scale the number of experiments based on this estimate of the variance.
As discussed in the text, the Student’s t-distribution can produce ill-conditioned matrices with large halting times. To make the numerical experiments feasible we limit the number of iterations to 1000 steps for the GD and Nesterov experiments and discard the very few runs that have not converged by this time (less than 0.1%).
For Fig. 3, which shows the convergence rates, we trained 8192 models with \(d = n = 4096\) for \(n = 4096\) steps, both with noise (\({\widetilde{R}}^2 = 0.05\)) and without. The convergence rates were estimated by fitting a line to the second half of the log-log curve.
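A minimal sketch of this rate-fitting step, assuming the per-iteration values \(\Vert \nabla f({{\varvec{x}}}_k)\Vert ^2\) have already been recorded in an array (the variable names are hypothetical; the actual experiment code is not reproduced here):

```python
import numpy as np

def estimate_rate(grad_sq_norms):
    """Fit a line to the second half of the log-log convergence curve.

    `grad_sq_norms[k]` is assumed to hold ||grad f(x_k)||^2; the returned
    slope estimates the exponent p in ||grad f(x_k)||^2 ~ C * k^p.
    """
    k = np.arange(1, len(grad_sq_norms) + 1)
    half = len(k) // 2
    slope, _intercept = np.polyfit(np.log(k[half:]), np.log(grad_sq_norms[half:]), deg=1)
    return slope

# Example with synthetic data decaying like k^{-3/2}:
ks = np.arange(1, 4097)
print(estimate_rate(0.7 * ks**-1.5))     # approximately -1.5
```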
For each run we calculate the worst-case upper bound on \(\Vert \nabla f({{\varvec{x}}}_k)\Vert ^2\) at \(k = n\) using [62, Conjecture 3].
where \({{\varvec{x}}}^{\star }\) is the argmin of f calculated using the linear solver in JAX [11]. To visualize the difference between the worst-case and average-case rates, we draw a log-log histogram of the ratio,
1.1 Step Sizes
In this appendix section, we discuss our choices of step sizes for logistic regression and stochastic gradient descent (SGD).
1.1.1 Logistic Regression
For both gradient descent and Nesterov’s accelerated method (convex) on the least squares problem, we use the step size \(\frac{1}{L}\). The Lipschitz constant \(L\) is equal to the largest eigenvalue of \({{\varvec{H}}}\), which can be quickly approximated using a power iteration method.
For logistic regression the Hessian is equal to \({{\varvec{A}}}^T{{\varvec{D}}}{{\varvec{A}}}\), where \({{\varvec{D}}}\) is the diagonal matrix coming from the derivative of the sigmoid activation function, with entries \({{\varvec{D}}}_{ii} = \sigma (({{\varvec{A}}}{{\varvec{x}}})_i)(1 - \sigma (({{\varvec{A}}}{{\varvec{x}}})_i))\). Since the maximum value of these entries is \(\frac{1}{4}\), the largest eigenvalue of the Hessian is at most \(\tfrac{1}{4}\) of the largest eigenvalue of \({{\varvec{H}}}\), and we use a step size of \(\frac{4}{L}\) for our logistic regression experiments.
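A sketch of this step-size computation, assuming the normalization \({{\varvec{H}}} = \frac{1}{n}{{\varvec{A}}}^T{{\varvec{A}}}\) (the precise definition of \({{\varvec{H}}}\) is in the main text and not reproduced here):

```python
import numpy as np

def power_iteration(H, n_iters=100, seed=0):
    """Approximate the largest eigenvalue of a symmetric PSD matrix H."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(H.shape[0])
    for _ in range(n_iters):
        v = H @ v
        v /= np.linalg.norm(v)
    return v @ H @ v                      # Rayleigh quotient

n, d = 2000, 1000
A = np.random.default_rng(1).standard_normal((n, d))
H = A.T @ A / n                           # assumed normalization of the Hessian

L = power_iteration(H)
step_least_squares = 1.0 / L              # gradient descent / Nesterov on least squares
step_logistic = 4.0 / L                   # logistic regression (sigmoid derivative <= 1/4)
print(L, step_least_squares, step_logistic)
```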
1.1.2 Stochastic Gradient Descent (SGD)
The least squares problem (10) can be reformulated as
where \({{\varvec{a}}}_i\) is the ith row of the matrix \({{\varvec{A}}}\). We perform a mini-batch SGD, i.e., at each iteration we select uniformly at random a subset of the samples \(b_k \subset \{1, \ldots , n\}\) and perform the update
With a slight abuse of notation, we denote by \(\nabla f_i({{\varvec{x}}}_k) = \frac{1}{|b_k|} \sum _{i \in b_k} \nabla f_i({{\varvec{x}}}_k)\) the update direction and use the shorthand \(b = |b_k|\) for the mini-batch size since it is fixed across iterations. The rest of this section is devoted to choosing the step size \(\alpha \) so that the halting time is consistent across dimensions n and d. Contrary to (full) gradient descent, the step size in SGD is dimension-dependent because a typical step size in SGD uses the variance in the gradients which grows as the dimension d increases.
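A sketch of the mini-batch update, assuming the least squares objective is written as the average of \(f_i({{\varvec{x}}}) = \tfrac{1}{2}({{\varvec{a}}}_i^T {{\varvec{x}}} - b_i)^2\) (the exact scaling in (10) is not reproduced here); the step size \(\alpha \) is then chosen as described in the paragraphs below.

```python
import numpy as np

def minibatch_sgd(A, b, alpha, batch_size, n_iters, seed=0):
    """Mini-batch SGD on f(x) = (1/2n) * ||A x - b||^2.

    At each step a batch b_k of `batch_size` rows is sampled uniformly and the
    averaged mini-batch gradient (1/|b_k|) * A_bk^T (A_bk x - b_bk) is applied.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = rng.standard_normal(d) / np.sqrt(d)                   # normalized initialization
    for _ in range(n_iters):
        idx = rng.choice(n, size=batch_size, replace=False)   # sample the batch b_k
        grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch_size
        x -= alpha * grad
    return x
```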
Over-parametrized. If \(n \le d\) we call the model over-parametrized. In this case, the strong growth condition from [56] holds. This implies that training will converge when we use a fixed step size \(\frac{2}{LB^2}\), where B is a constant satisfying, for all \({{\varvec{x}}}\),
To estimate B we will compute the expected values of \(\Vert \nabla f_i({{\varvec{x}}})\Vert ^2\) and \(\Vert \nabla f({{\varvec{x}}})\Vert ^2\). To simplify the derivation, we will assume that \({\widetilde{{{\varvec{x}}}}}\) and \({\varvec{\eta }}\) are normally distributed. At iterate \({{\varvec{x}}}\) we then have
Hence, the expected value of \(\Vert \nabla f({{\varvec{x}}})\Vert ^2\) is \(\Vert {{\varvec{H}}}{{\varvec{x}}}\Vert ^2 + \text {tr}\left( \frac{1}{d}{{\varvec{H}}}^2+\frac{{\widetilde{R}}^2}{n}{{\varvec{H}}}\right) \). Following [5, Equation 3.1.6 and Lemma 3.1] we know that for large values of n and d, the expected trace \(\frac{1}{d}\text {tr}\,{{\varvec{H}}}\approx 1\) and \(\frac{1}{d}\text {tr}\,{{\varvec{H}}}^2 \approx 1 + r\). Further, \({{\mathbb {E}}}\,\left[ \Vert {{\varvec{H}}}{{\varvec{x}}}\Vert ^2\right] = (1 + r)\Vert {{\varvec{x}}}\Vert ^2\) and hence
We can approximate the same value for a mini-batch gradient, where
for batch size b and \(r' = \frac{d}{b}\). Note that \(\Vert {{\varvec{x}}}\Vert \approx 1\) because of the normalization of both the initial point and the solution, so for our experiments we set \(B^2 = \frac{2 + r'(2 + {\widetilde{R}}^2)}{2+r(2 + {\widetilde{R}}^2)}\).
Under-parametrized. In the under-parametrized case SGD will not converge but reach a stationary distribution around the optimum. Given a step size \({\overline{\alpha }} \le \frac{1}{LM_G}\) the expected square norm of the mini-batch gradients will converge to \({\overline{\alpha }}LM\) where M and \(M_G\) are constants such that \({{\mathbb {E}}}\,\left[ \Vert \nabla f_i({{\varvec{x}}})\Vert ^2\right] \le M + M_G\Vert \nabla f({{\varvec{x}}})\Vert ^2\) [10, Theorem 4.8, Equation 4.28]. We will use rough approximations of both M and \(M_G\). In fact, we will set \(M_G = B^2 = \frac{2 + 3r'}{2+3r}\).
To approximate M we will estimate the norm of the mini-batch gradients at the optimum for our least squares model. Set \({{\varvec{x}}}^* = {{\varvec{A}}}^+{{\varvec{b}}}= {{\varvec{A}}}^+{\varvec{\eta }}+ {\widetilde{{{\varvec{x}}}}}\) where \({{\varvec{A}}}^+\) is the Moore-Penrose pseudoinverse. We write the row-sampled matrix used in mini-batch SGD as \({\widetilde{{{\varvec{A}}}}} = {{\varvec{P}}}{{\varvec{A}}}\), where \({{\varvec{P}}}\) consists of exactly b rows of the identity matrix. Note that \({{\varvec{P}}}^T{{\varvec{P}}}\) is idempotent.
To simplify the derivation, we will again assume that \({\varvec{\eta }}\) is normally distributed and that \({\widetilde{R}} = 1\). Thus, we have
By taking the expectation of the squared norm, we derive the following
Now, we must find an approximation of \(\frac{1}{n}\mathrm {tr}\left( {\widetilde{{{\varvec{H}}}}}^2{{\varvec{H}}}^+\right) \). For \(b \approx n\) we have \({\widetilde{{{\varvec{H}}}}}^2{{\varvec{H}}}^+ \approx {{\varvec{H}}}\), whereas for \(b \approx 1\) we argue that \({\widetilde{{{\varvec{H}}}}}\) and \({{\varvec{H}}}^+\) can be seen as independent matrices with \({{\varvec{H}}}^+ \approx {{\varvec{I}}}\). We can linearly interpolate between these two extremes,
Experimentally these approximations work well. Hence, in our simulations we set \(M = {\widetilde{R}}^2(1 - r)(r' - r)\).
Cite this article
Paquette, C., van Merriënboer, B., Paquette, E. et al. Halting Time is Predictable for Large Models: A Universality Property and Average-Case Analysis. Found Comput Math (2022). https://doi.org/10.1007/s10208-022-09554-y
Keywords
- Universality
- Random matrix theory
- Optimization
Mathematics Subject Classification
- 60B20
- 65K10
- 68T07
- 90C06
- 90C25