Abstract
We establish lower bounds on the complexity of finding \(\epsilon \)-stationary points of smooth, non-convex high-dimensional functions using first-order methods. We prove that deterministic first-order methods, even applied to arbitrarily smooth functions, cannot achieve convergence rates in \(\epsilon \) better than \(\epsilon ^{-8/5}\), which is within \(\epsilon ^{-1/15}\log \frac{1}{\epsilon }\) of the best known rate for such methods. Moreover, for functions with Lipschitz first and second derivatives, we prove that no deterministic first-order method can achieve convergence rates better than \(\epsilon ^{-12/7}\), while \(\epsilon ^{-2}\) is a lower bound for functions with only Lipschitz gradient. For convex functions with Lipschitz gradient, accelerated gradient descent achieves a better rate, showing that finding stationary points is easier given convexity.
Notes
Given a bound \(\Vert {x^{(0)}-x^\star }\Vert \le D\), where \(x^\star \) is a minimizer of \(f\), as is standard for convex optimization, the optimal rate is \(\widetilde{\Theta }(\sqrt{L_{1}D}\epsilon ^{-1/2})\) [24]. The two rates are not directly comparable.
References
Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent. In: Proceedings of the Forty-Ninth Annual ACM Symposium on the Theory of Computing (2017)
Allen-Zhu, Z.: Natasha 2: Faster non-convex optimization than SGD (2017). arXiv:1708.08694 [math.OC]
Allen-Zhu, Z., Hazan, E.: Variance reduction for faster non-convex optimization. In: Proceedings of the 33rd International Conference on Machine Learning (2016)
Arjevani, Y., Shamir, O., Shiff, R.: Oracle complexity of second-order methods for smooth convex optimization (2017). arXiv:1705.07260 [math.OC]
Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A., Toint, P.L.: Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Math. Program. 163(1–2), 359–368 (2017)
Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions. In: Proceedings of the 34th International Conference on Machine Learning (2017)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for non-convex optimization. SIAM J. Optim. 28(2), 1751–1772 (2018)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points I. Math. Program. (to appear) (2019). https://doi.org/10.1007/s10107-019-01406-y
Cartis, C., Gould, N.I., Toint, P.L.: On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems. SIAM J. Optim. 20(6), 2833–2852 (2010)
Cartis, C., Gould, N.I., Toint, P.L.: Complexity bounds for second-order optimality in unconstrained optimization. J. Complex. 28(1), 93–108 (2012)
Cartis, C., Gould, N.I.M., Toint, P.L.: How much patience do you have? A worst-case perspective on smooth nonconvex optimization. Optima 88 (2012)
Cartis, C., Gould, N.I.M., Toint, P.L.: Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization (2017). arXiv:1709.07180 [math.OC]
Chowla, S., Herstein, I.N., Moore, W.K.: On recursions connected with symmetric groups I. Can. J. Math. 3, 328–334 (1951)
Chung, F.R.K.: Spectral Graph Theory. AMS (1998)
Hinder, O.: Cutting plane methods can be extended into nonconvex optimization. In: Proceedings of the Thirty First Annual Conference on Computational Learning Theory (2018)
Jarre, F.: On Nesterov’s smooth Chebyshev–Rosenbrock function. Optim. Methods Softw. 28(3), 478–500 (2013)
Jin, C., Ge, R., Netrapalli, P., Kakade, S.M., Jordan, M. I.: How to escape saddle points efficiently. In: Proceedings of the 34th International Conference on Machine Learning (2017)
Lei, L., Ju, C., Chen, J., Jordan, M. I.: Nonconvex finite-sum optimization via SCSG methods. In: Advances in Neural Information Processing Systems, vol. 30 (2017)
Monteiro, R.D., Svaiter, B.F.: An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM J. Optim. 23(2), 1092–1125 (2013)
Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley, Hoboken (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, Dordrecht (2004)
Nesterov, Y.: How to make the gradients small. Optima 88 (2012)
Nesterov, Y., Polyak, B.: Cubic regularization of Newton method and its global performance. Math. Program. Ser. A 108, 177–205 (2006)
Reddi, S.J., Hefny, A., Sra, S., Poczos, B., Smola, A.: Stochastic variance reduction for nonconvex optimization. In: Proceedings of the 33rd International Conference on Machine Learning (2016)
Simchowitz, M.: On the randomized complexity of minimizing a convex quadratic function (2018). arXiv:1807.09386 [cs.LG]
Simchowitz, M., El Alaoui, A., Recht, B.: Tight query complexity lower bounds for PCA via finite sample deformed Wigner law. In: Proceedings of the Fiftieth Annual ACM Symposium on the Theory of Computing (2018)
Vavasis, S.A.: Black-box complexity of local minimization. SIAM J. Optim. 3(1), 60–80 (1993)
Woodworth, B.E., Srebro, N.: Tight complexity bounds for optimizing composite objectives. Adv. Neural Inf. Process. Syst. 29, 3639–3647 (2016)
Zhang, X., Ling, C., Qi, L.: The best rank-1 approximation of a symmetric tensor and related spherical optimization problems. SIAM J. Matrix Anal. Appl. 33(3), 806–821 (2012)
OH was supported by the PACCAR INC fellowship. YC and JCD were partially supported by the SAIL-Toyota Center for AI Research, NSF-CAREER Award 1553086, and a Sloan Foundation Fellowship in Mathematics. YC was partially supported by the Stanford Graduate Fellowship and the Numerical Technologies Fellowship. AS was supported by the National Science Foundation (CCF-1844855).
Appendices
Additional results for convex functions
1.1 An upper bound for finding stationary points of value-bounded functions
Here we give a first-order method that finds \(\epsilon \)-stationary points of a function \(f\in \mathcal {K}_1\left( \Delta , L_{1}\right) \) in \(O(\sqrt{L_{1}\Delta }\epsilon ^{-1}\log \frac{L_{1}\Delta }{\epsilon ^2})\) iterations. The method consists of Nesterov’s accelerated gradient descent (AGD) applied to the sum of f and a standard quadratic regularizer.
Our starting point is AGD for strongly convex functions; a function f is \(\sigma \)-strongly convex if
\(f(y) \ge f(x) + \left\langle \nabla f(x), y-x\right\rangle + \frac{\sigma }{2}\left\| {y-x}\right\| ^2\)
for every x, y in the domain of f. Let \(\mathsf {AGD}_{\sigma , L_{1}}\in \mathcal {A}^{(1)}_{ \textsf {zr} }\cap \mathcal {A}^{(1)}_{ \textsf {det} }\) be the accelerated gradient scheme developed in [23, §2.2.1] for \(\sigma \)-strongly convex functions with \(L_{1}\)-Lipschitz gradient, initialized at \(x^{(1)}=0\) (the exact step size scheme is not important). For any \(L_{1}\)-smooth f with global minimizer \(x^\star _f\), \(\epsilon ^2/(2L_{1})\)-suboptimality guarantees \(\epsilon \)-stationarity, since \(\left\| {\nabla f(x)}\right\| ^2 \le 2L_{1}(f(x)-f(x^\star _f))\) [6, Eq. (9.14)]. Therefore, adapting [23, Thm. 2.2.2] to our notation gives
with \(\log _+(x) :=\max \{0, \log x\}\).
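As an informal numerical aside (not part of the paper's development), the inequality \(\left\| {\nabla f(x)}\right\| ^2 \le 2L_{1}(f(x)-f(x^\star _f))\) invoked above is easy to spot-check on random convex quadratics. The sketch below, which assumes only NumPy, computes the largest observed ratio between the two sides:

```python
import numpy as np

def max_ratio(trials=100, d=5, seed=0):
    """Largest observed value of ||grad f(x)||^2 / (2 L1 (f(x) - f*))
    over random convex quadratics f(x) = 0.5 x^T A x (so f* = 0);
    the inequality predicts this never exceeds 1."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        B = rng.standard_normal((d, d))
        A = B @ B.T                      # PSD Hessian, so f is convex
        L1 = np.linalg.eigvalsh(A)[-1]   # smoothness constant: largest eigenvalue
        x = rng.standard_normal(d)
        ratio = (A @ x) @ (A @ x) / (2.0 * L1 * (0.5 * x @ A @ x))
        worst = max(worst, ratio)
    return worst
```

Running `max_ratio()` returns a value strictly below 1, consistent with the bound.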
Now suppose that f is convex with \(L_{1}\)-Lipschitz gradient but not necessarily strongly-convex. We can add strong convexity to f by means of a proximal term; for any \(\sigma >0\), the function
\(f_\sigma (x) :=f(x) + \frac{\sigma }{2}\left\| {x}\right\| ^2\)
is \(\sigma \)-strongly-convex with \((L_{1}+\sigma )\)-Lipschitz gradient. With this in mind, we define a proximal version of AGD as follows,
Proposition 2
Let \(\Delta ,L_{1}\) and \(\epsilon \) be positive, and let \(\sigma = \frac{\epsilon ^2}{3\Delta }\). Then, algorithm \(\mathsf {PAGD}_{\sigma , L_{1}} \in \mathcal {A}^{(1)}_{ \textsf {det} }\) satisfies
Proof
For any \(f\in \mathcal {K}_1\left( \Delta , L_{1}\right) \), recall that \(f_\sigma (x) :=f(x) +\frac{\sigma }{2}\left\| {x}\right\| ^2\) and let \( \{x^{(t)}\}_{t\in \mathbb {N}} = \mathsf {PAGD}_{\sigma , L_{1}}[f] = \mathsf {AGD}_{\sigma , L_{1}+\sigma }[f_\sigma ] \) be the sequence of iterates \(\mathsf {PAGD}_{\sigma , L_{1}}\) produces on f. Then by guarantee (20), we have
for some T such that
For any point y such that \(f_\sigma (y) = f(y) + \frac{\sigma }{2} \left\| {y}\right\| ^2 \le f_\sigma (0)=f(0)\), we have
Clearly, \(f_\sigma (x_{f_\sigma }^\star ) \le f_\sigma (0)\) and [23, Thm. 2.2.2] also guarantees that \(f_\sigma (x^{(T)}) \le f_\sigma (0)\). Consequently,
and so
In inequality (i) we substituted bounds (21) and (23), and in (ii) we used \(\sigma = \epsilon ^2/(3\Delta )\). We conclude that \(\mathsf {T}_{\epsilon }\big (\mathsf {PAGD}_{\sigma , L_{1}}, f\big ) \le T\), and substituting (23) and the definition of \(\sigma \) into (22) we have
Without loss of generality, we may assume \(\frac{2 L_{1}\Delta }{\epsilon ^2} \ge 1\), as otherwise \(\mathsf {T}_{\epsilon }\big (\mathsf {PAGD}_{\sigma , L_{1}}, f\big ) = 1\). We thus simplify the expression slightly to obtain the proposition. \(\square \)
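To make the scheme concrete, here is an illustrative Python sketch of \(\mathsf {PAGD}\): constant-momentum AGD for strongly convex functions run on the regularized function \(f_\sigma (x) = f(x)+\frac{\sigma }{2}\left\| {x}\right\| ^2\) with \(\sigma = \epsilon ^2/(3\Delta )\). The particular momentum parameterization and the toy quadratic instance are our own illustrative choices, not taken from the paper:

```python
import numpy as np

def agd_strongly_convex(grad, L, sigma, x0, iters):
    """Constant-momentum AGD for a sigma-strongly convex, L-smooth
    function (the constant-step scheme of Nesterov's book, Sec. 2.2.1)."""
    kappa = L / sigma
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)  # momentum weight
    x = x0.copy()
    y = x0.copy()
    for _ in range(iters):
        x_next = y - grad(y) / L          # gradient step from the extrapolated point
        y = x_next + beta * (x_next - x)  # momentum extrapolation
        x = x_next
    return x

def pagd(grad_f, L1, Delta, eps, dim, iters):
    """PAGD: AGD applied to f_sigma(x) = f(x) + (sigma/2)||x||^2
    with sigma = eps^2 / (3 * Delta), initialized at the origin."""
    sigma = eps ** 2 / (3.0 * Delta)
    grad_fs = lambda x: grad_f(x) + sigma * x
    return agd_strongly_convex(grad_fs, L1 + sigma, sigma, np.zeros(dim), iters)

# toy instance: f(x) = 0.5 (x - c)^T A (x - c) with L1 = 1 and f(0) - inf f = 1
A = np.diag([1.0, 0.25])
c = np.array([1.0, 2.0])
grad_f = lambda x: A @ (x - c)
x_out = pagd(grad_f, L1=1.0, Delta=1.0, eps=1e-2, dim=2, iters=20000)
```

On this instance the final iterate satisfies \(\left\| {\nabla f(x_\mathrm{out})}\right\| \le \epsilon \); Proposition 2 only promises this for some iterate, but for a quadratic the iterates converge to the minimizer of \(f_\sigma \), where \(\left\| {\nabla f}\right\| = \sigma \Vert {x^\star _{f_\sigma }}\Vert \ll \epsilon \).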
1.2 The impossibility of approximate optimality without a bounded domain
Lemma 6
Let \(L_{1},\Delta >0\) and \(\epsilon < \Delta \). For any first-order algorithm \(\mathsf {A}\in \mathcal {A}^{(1)}_{ \textsf {det} }\cup \mathcal {A}^{(1)}_{ \textsf {zr} }\) and any \(T\in \mathbb {N}\), there exists a function \(f\in \mathcal {Q}\left( \Delta , L_{1}\right) \) such that the iterates \(\{x^{(t)}\}_{t \in \mathbb {N}} = \mathsf {A}[f]\) satisfy
Proof
By Proposition 1 it suffices to consider \(\mathsf {A}\in \mathcal {A}^{(1)}_{ \textsf {zr} }\) (see additional discussion of the generality of Proposition 1 in Section Pi.3.3). Consider the function \(f:\mathbb {R}^T\rightarrow \mathbb {R}\),
where \(0<\beta < 1\), and we take
Since f(x) is of the form \(\lambda \left\| {A x-b}\right\| ^2\) where \(\left\| {A}\right\| _\mathrm{op}\le 1+\beta \), we have \(\left\| {\nabla ^{{2}}f(x)}\right\| _\mathrm{op}\le 2\lambda \left\| {A}\right\| _\mathrm{op}^2\) for every \(x\in \mathbb {R}^T\) and therefore f has \(2 \lambda (1 + 2 \beta + \beta ^2)\)-Lipschitz gradient. Additionally, f satisfies \(\inf _x f(x) = 0\) and \(f(0) = \lambda \sigma ^2\), and so the above choices of \(\lambda \) and \(\sigma \) guarantee that \(f\in \mathcal {Q}\left( \Delta , L_{1}\right) \). Moreover, f is a first-order zero-chain (Definition 3), and thus for any \(\mathsf {A}\in \mathcal {A}^{(1)}_{ \textsf {zr} }\) and \(\{x^{(t)}\}_{t \in \mathbb {N}} = \mathsf {A}[f]\), we have \(x^{(t)}_{T} = 0\) for \(t\le T\) (Observation 1). Therefore, it suffices to show that \(f(x) > \inf _y f(y) + \epsilon \) whenever \(x_T =0\).
We make the following inductive claim: if \(f(x) \le \inf _y f(y) + \epsilon = \epsilon \), then
for all \(i\le T\). Indeed, each term in the sum (24) defining f is non-negative, so for the base case of the induction \(i=1\), we have \(\lambda (\sigma - \beta x_1)^2 \le \epsilon \), or \(\left| x_1 - \sigma \beta ^{-1}\right| \le \beta ^{-1} \sqrt{\epsilon / \lambda }\). For \(i<T\), assuming that \(x_i\) satisfies the bound (25), we have that \(\lambda (x_i - \beta x_{i + 1})^2 \le \epsilon \), which implies
which is the desired claim (25) for \(x_{i+1}\).
The bound (25) implies \(x_i\ne 0\) for all \(i\le T\) whenever \(\sigma \ge (1-\beta )^{-1}\sqrt{\epsilon /\lambda }\). Therefore, we choose \(\beta \) to satisfy \(\sigma = (1-\beta )^{-1}\sqrt{\epsilon /\lambda }\), that is
\(\beta = 1 - \frac{1}{\sigma }\sqrt{\epsilon /\lambda },\)
for which \(0<\beta <1\) since we assume \(\epsilon <\Delta \). Thus, we guarantee that when \(x_T=0\) we must have \(f(x) > \inf _y f(y) + \epsilon \), giving the result. \(\square \)
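The construction admits a simple numerical illustration. The sketch below assumes the quadratic-chain form \(f(x) = \lambda \bigl ((\sigma -\beta x_1)^2 + \sum _{i=1}^{T-1}(x_i-\beta x_{i+1})^2\bigr )\) suggested by the individual terms appearing in the proof (the displayed definition (24) is not reproduced here), with \(\lambda \) and \(\sigma \) chosen so that \(2\lambda (1+\beta )^2 = L_{1}\) and \(\lambda \sigma ^2 = \Delta \); it verifies that the minimum of f over \(\{x: x_T = 0\}\) indeed exceeds \(\epsilon \):

```python
import numpy as np

def chain_min_given_xT_zero(T, Delta, L1, eps):
    """Minimize lam*((sig - b*x_1)^2 + sum_{i<T} (x_i - b*x_{i+1})^2)
    over x_1, ..., x_{T-1} with x_T fixed to 0 (a least-squares problem)."""
    b = 1.0 - np.sqrt(eps / Delta)        # beta, in (0, 1) since eps < Delta
    lam = L1 / (2.0 * (1.0 + b) ** 2)     # gradient is then L1-Lipschitz
    sig = np.sqrt(Delta / lam)            # f(0) = lam * sig^2 = Delta
    A = np.zeros((T, T - 1))
    rhs = np.zeros(T)
    A[0, 0], rhs[0] = b, sig              # term (sig - b*x_1)^2
    for i in range(1, T):                 # terms (x_i - b*x_{i+1})^2, with x_T = 0
        A[i, i - 1] = 1.0
        if i < T - 1:
            A[i, i] = -b
    x, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return lam * float(np.sum((A @ x - rhs) ** 2))
```

For \(T=10\), \(\Delta = L_{1} = 1\) and \(\epsilon = 1/4\), the computed minimum lies comfortably above \(\epsilon \) (and at most \(\Delta \), since \(x=0\) is feasible), matching the lemma.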
Technical results
1.1 Proof of Lemma 2
Lemma 2
The function \(\Upsilon _r\) satisfies the following.
-
i.
We have \(\Upsilon _r'(0) = \Upsilon _r'(1) = 0\).
-
ii.
For all \(x \le 1\), \(\Upsilon _r'(x) \le 0\), and for all \(x \ge 1\), \(\Upsilon _r'(x) \ge 0\).
-
iii.
For all \(x \in \mathbb {R}\) we have \(\Upsilon _r(x) \ge \Upsilon _r(1) = 0\), and for all r, \(\Upsilon _r(0) \le 10\).
-
iv.
For every \(r\ge 1\), \(\Upsilon _r'(x) < -1\) for every \(x\in (-\infty ,-0.1] \cup [0.1,0.9]\).
-
v.
For every \(r \ge 1\) and every \(p \ge 1\), the p-th order derivatives of \(\Upsilon _r\) are \(r^{3-p}\ell _{p}\)-Lipschitz continuous, where \(\ell _{p} \le \exp (\frac{3}{2}p \log p + c p)\) for a numerical constant \(c < \infty \).
Proof
Parts i and ii are evident from inspection, as
\(\Upsilon _r'(x) = \frac{120\, x^2(x-1)}{1+(x/r)^2}.\)
To see part iii, note that \(\Upsilon _r\) is non-increasing for every \(x<1\) and non-decreasing for every \(x>1\), and therefore \(x=1\) is its global minimum. That \(\Upsilon _r(1)=0\) is immediate from its definition, and, for every r, \(\Upsilon _r(0) = 120\int _0^1 \frac{t^2(1-t)}{1+(t/r)^2}dt \le 120\int _0^1 t^2(1-t)dt = 10\). To see part iv, note that \(|\Upsilon _r'(x)| \ge |\Upsilon _1'(x)|\) for every \(r \ge 1\), and a calculation shows \(|\Upsilon _1'(x)| > 1\) for \(x\in (-\infty ,-0.1] \cup [0.1,0.9]\) (see Fig. 1).
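As a numerical aside, parts iii and iv can be spot-checked directly. The derivative form used below, \(\Upsilon _r'(x) = 120x^2(x-1)/(1+(x/r)^2)\), is inferred from the integral expression for \(\Upsilon _r(0)\) in the proof; treat it as an assumption of this sketch rather than a restatement of the paper's definition:

```python
import numpy as np

def upsilon_prime(x, r):
    # derivative form inferred from the integral expression for Upsilon_r(0):
    # Upsilon_r(x) = 120 * int_1^x t^2 (t - 1) / (1 + (t/r)^2) dt
    return 120.0 * x ** 2 * (x - 1.0) / (1.0 + (x / r) ** 2)

def upsilon_at_zero(r, n=200001):
    # Upsilon_r(0) = 120 * int_0^1 t^2 (1 - t) / (1 + (t/r)^2) dt,
    # evaluated by the trapezoid rule
    t = np.linspace(0.0, 1.0, n)
    f = t ** 2 * (1.0 - t) / (1.0 + (t / r) ** 2)
    return 120.0 * np.sum(f[1:] + f[:-1]) * 0.5 * (t[1] - t[0])
```

Sampling `upsilon_prime` on \([-2,-0.1]\cup [0.1,0.9]\) with \(r=1\) gives values strictly below \(-1\), and `upsilon_at_zero(r)` stays below 10 for several values of \(r\ge 1\), consistent with parts iii and iv.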
To see the fifth part of the claim, note that
where the functions \(\varphi _1\) and \(\varphi _2\) are \(\varphi _1(\xi ) = \xi /(1+\xi ^2)\) and \(\varphi _2(\xi ) = 1/(1+\xi ^2)\). We thus bound the derivatives of \(\varphi _1\) and \(\varphi _2\). We begin with \(\varphi _2\), which we can write as the composition \(\varphi _2(x) = (h \circ g)(x)\) where \(h(x) = \frac{1}{x}\) and \(g(x) = 1 + x^2\). Let \(\mathcal {P}_{k,2}\) denote the collection of all partitions of \(\{1, \ldots , k\}\) where each element of the partition has at most 2 indices. That is, if \(P \in \mathcal {P}_{k,2}\), then \(P = (S_1, \ldots , S_l)\) for some \(l \le k\), the \(S_i\) are disjoint, \(1 \le |S_i| \le 2\), and \(\cup _i S_i = [k]\). The cardinality \(|\mathcal {P}_{k,2}|\) is the number of matchings in the complete graph on k vertices, or the kth telephone number, which has bound [15, Lemma 2]
We may then apply Faà di Bruno’s formula for the chain rule to obtain
where \(\mathsf {C}_i(P)\) denotes the number of sets in P with precisely i elements. Of course, we have \(|x|^{\mathsf {C}_1(P)} / (1 + x^2)^{|P|} \le 1\), and thus
The proof of the upper bound on \(\varphi _1^{(k)}(x)\) is similar (\(2\varphi _1(x)=\frac{d}{dx}[(\hat{h}\circ g)(x)]\) with \(\hat{h}(x) = \log x\) and g as defined above), so for every \(r \ge 1\) and \(p \ge 1\), the \(p+1\)-th derivative of \(\Upsilon _r\) has the bound
where \(c < \infty \) is a numerical constant. \(\square \)
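The telephone numbers counting \(\mathcal {P}_{k,2}\) above satisfy the recurrence \(T(k)=T(k-1)+(k-1)T(k-2)\): condition on whether index k is a singleton or is paired with one of the other \(k-1\) indices. A short sketch, assuming nothing beyond the Python standard library, checks the recurrence against the closed-form sum over the number of pairs:

```python
from math import factorial

def telephone(k):
    """Number of matchings of the complete graph on k vertices, i.e.
    |P_{k,2}|, via the recurrence T(k) = T(k-1) + (k-1) T(k-2)."""
    a, b = 1, 1  # T(0), T(1)
    for n in range(2, k + 1):
        a, b = b, b + (n - 1) * a
    return b if k >= 1 else 1

def telephone_sum(k):
    # closed form: choose j disjoint pairs among k indices, singletons elsewhere
    return sum(factorial(k) // (factorial(j) * factorial(k - 2 * j) * 2 ** j)
               for j in range(k // 2 + 1))
```

The first few values are 1, 1, 2, 4, 10, 26, 76, 232, and the two computations agree for all small k.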
1.2 Proof of Lemma 3
Lemma 3
Let \(r\ge 1\) and \(\mu \le 1\). For any \(x\in \mathbb {R}^{T+1}\) such that \(x_T = x_{T+1} = 0\),
Proof
Throughout the proof, we fix \(x\in \mathbb {R}^{T+1}\) such that \(x_{T}=x_{T+1}=0\); for convenience in notation, we define \(x_{0} :=1\). Our strategy is to carefully pick two indices \(i_1\in \{0, \ldots , T-1\}\) and \(i_2\in \{i_1+1, \ldots , T\}\), such that \(\left\| {\nabla \bar{f}_{T, \mu , r}(x)}\right\| ^2 \ge \sum _{i=i_1+1}^{i_2} \left| \nabla _i \bar{f}_{T, \mu , r}(x) \right| ^2 > (\mu ^{3/4}/4)^2\). We call the set of indices from \(i_1+1\) to \(i_2\) the transition region, and construct it as follows.
so that \(x_{j}\le 0.9\) for every \(j>i\). Note that \(i_1=0\) when \(x_i \le 0.9\) for every \(i\in [T+1]\). This is a somewhat special case due to the coefficient \(\sqrt{\mu }\le 1\) of the first “link” in the quadratic chain term in (9). To handle it cleanly we define
Continuing with the construction of the transition region, we make the following definition.
and let \(m'=i_2'-i_1\), so \(m' \ge 1\). Roughly, our transition region consists of the \(m'\) indices \(i_1+1,\ldots ,i_2'\), but for technical reasons we attach to it the following decreasing ‘tail’.
With these definitions, \(i_2\) is well-defined and \(0 \le i_1< i_2\le T\), since \(x_{T+1}-x_{T}=0\). We denote the transition region and associated length by
We illustrate our definition of the transition region in Fig. 3.
Let us describe the transition region. In the “head” of the region, we have \(0.1\le x_{i}\le 0.9\) for every \(i\in \left\{ i_1+1,\ldots ,i_2'-1\right\} \); a total of \(m'-1\) indices. The “tail” of the transition region is strictly decreasing, \(x_{i_2}<x_{i_2-1}<\cdots <x_{i_2'}\). Moreover, for any \(j \in \{i_2'+1, \ldots , i_2-1\}\) such that \(x_j > -0.1\), the decrease is rapid; \(x_j < x_{j - 1} - 0.2 / (m'-1+1/\alpha )\). This description leads us to the following technical properties.
Lemma 7
Let the transition region \(\mathcal {I}_\mathrm {trans}\) be defined as above (26). Then
-
i.
\(x_{i_1}> 0.9> 0.1 > x_{i_2}\) and \(-x_{i_2} + \left( m-1+\alpha ^{-1}\right) \left( x_{i_2+1}-x_{i_2}\right) > -0.3\).
-
ii.
\(\Upsilon _r'\left( x_{i}\right) \le 0\) for every \(i\in \mathcal {I}_\mathrm {trans}\), and \(\Upsilon _r'\left( x_{i}\right) <-1\) for at least \(\left( m-\alpha ^{-1}\right) /2\) indices in \(\mathcal {I}_\mathrm {trans}\).
We defer the proof of the lemma to the end of this section, continuing the proof assuming it.
We now lower bound \(\Vert {\nabla \bar{f}_{T, \mu , r}(x)}\Vert \). For notational convenience, define \(g_{i}=\mu \Upsilon _r'\left( x_{i}\right) \), and recalling that \(x_{T}=x_{T+1}=0\), we see that the norm of the gradient of \(\bar{f}_{T, \mu , r}\) is
where we made use of the notation \(\alpha :=1\) if \(i_1> 0\) and \(\alpha :=\sqrt{\mu }\) if \(i_1=0\). We obtain a lower bound for the final sum of m squares (27) by fixing \(x_{i_1}\), \(x_{i_2}\), and \(g_{i_1+1}, \ldots , g_{i_2}\), then minimizing the quadratic form explicitly over the \(m - 1\) variables \(x_{i_1+1},\ldots ,x_{i_2-1}\). We obtain
where the matrix A and vector b have definitions
and \(z \in \mathbb {R}^{m}\) is a unit-norm solution to \(A^{\top }z=0\). The vector \(z \in \mathbb {R}^m\) with
is such a solution. Thus
We now bring to bear the properties of the transition region Lemma 7 supplies. By Lemma 7.i,
and by Lemma 7.ii, using \(1 \le \alpha ^{-1} \le 1/\sqrt{\mu }\),
Substituting \(\sum _{i=1}^{m}\left( i-1+\alpha ^{-1}\right) ^{2} \le \frac{1}{2}m\left( m+1/\sqrt{\mu }\right) \left( m+2/\sqrt{\mu }\right) \) and the bounds (29) and (30) into the gradient lower bound (28), we have that
A quick computation reveals that \(\inf _{t> 0}\zeta (t) \approx 0.28 > 1/4\), which gives the result. \(\square \)
Proof of Lemma 7
We have by definition that \(x_{i_1} > 0.9\) and \(x_{i_2} \le x_{i_2'} < 0.1\). To see that
holds, consider the two cases that \(x_{i_2} \le -0.1\) or \(x_{i_2} > -0.1\). In the first case that \(x_{i_2}\le -0.1\), by definition \(x_{i_2+1}\ge x_{i_2}\) so \(-x_{i_2} + \left( m-1+\alpha ^{-1}\right) \left( x_{i_2+1}-x_{i_2}\right)>0.1>-0.3\). The second case that \(x_{i_2}>-0.1\) is a bit more subtle. By definition of the sequence \(x_{i_2}, \ldots , x_{i_2'}\), we have
Combining this bound on \(x_{i_2}\) and the inequality \(x_{i_2+ 1} \ge x_{i_2} - \frac{0.2}{m'-1+1/\alpha }\) due to the construction of \(i_2\), we obtain
We note for the proof of property ii that the chain of inequalities (31) is possible only for \(m \le 2m'-1+1/\alpha \), which implies there are at most \(m'-1+1/\alpha \) indices \(i \in \mathcal {I}_\mathrm {trans}\) such that \(|x_i| < 0.1\).
The first part of property ii follows from Lemma 2.ii, since \(x_{i}\le 0.9 \le 1\) for every \(i\in \mathcal {I}_\mathrm {trans}\). To see that the second part of the property holds, let N be the number of indices in \(i\in \mathcal {I}_\mathrm {trans}\) for which \(\Upsilon _r'\left( x_{i}\right) <-1\). By Lemma 2.iv and the fact that \(0.1\le x_{i}\le 0.9\) for every \(i\in \left\{ i_1+1,\ldots ,i_2'-1\right\} \), \(N\ge m'-1\). Moreover, since there can be at most \(m'-1+1/\alpha \) indices \(i\in \mathcal {I}_\mathrm {trans}\) for which \(\left| x_{i}\right| <0.1\), \(N\ge m-(m'-1+1/\alpha )\). Averaging the two lower bounds gives \(N\ge \left( m-1/\alpha \right) /2\). \(\square \)
1.3 Proof of Theorem 3
Theorem 3
There exists a numerical constant \(c < \infty \) such that the following lower bound holds. Let \(p\ge 2\), \(p \in \mathbb {N}\), and let \(D, L_{1}, L_{2}, \ldots , L_{p}\), and \(\epsilon \) be positive. Then
where \(\tilde{\ell }_{q} \le \exp (c q \log q + c)\).
Proof
The proof builds on those of Theorems 2 and Pi.3. We begin by recalling the following bump function construction
Adding a scaled version of \(-\bar{h}_T\) to our hard instance construction allows us to “plant” a global minimum that is both close to the origin and essentially invisible to zero-respecting methods. For convenience, we restate Lemma Pi.10,
Lemma 8
The function \(\bar{h}_T\) satisfies the following.
-
i.
For all \(x \in \mathbb {R}^T\) we have \(\bar{h}_T(x) \in [0, 1]\), and \(\bar{h}_T(0.8 e^{(T)}) = 1\).
-
ii.
On the set \(\{x\in \mathbb {R}^d \mid x_{T} \le \frac{3}{5} \} \cup \{x \mid \left\| {x}\right\| \ge 1\}\), we have \(\bar{h}_T(x)=0\).
-
iii.
For every \(p \ge 1\), the \(p\hbox {th}\) order derivative of \(\bar{h}_T\) is \(\tilde{\ell }_{p}\)-Lipschitz continuous, where \(\tilde{\ell }_{p} \le e^{c p\log p + c}\) for a numerical constant \(c < \infty \).
With this lemma in place, we follow the broad outline of the proof of Theorem 2, with modifications to make sure the norm of the minimizers of f is small. Indeed, letting \(\lambda , \sigma > 0\), we define our scaled hard instance \(f:\mathbb {R}^{T+ 2}\rightarrow \mathbb {R}\) by
that is, the hard instance we construct in Theorem 2 minus a scaled bump function (32). For every \(p \in \mathbb {N}\), we set the parameters \(\lambda , \sigma , \mu \) and r as in the proof of Theorem 2, so that we satisfy inequality (12), except that we replace \(L_{q}\) with \(L_{q}/2\) for every \(q\in [p]\) (including in the definitions of \(\lambda , \sigma , \mu \)). Thus, as in inequality (12), for each \(q \in \mathbb {N}\) the function \(f_0(x) :=\lambda \sigma ^2\bar{f}_{T, \mu , r}\left( x/\sigma \right) \) has \(L_{q}/2\)-Lipschitz qth order derivative and satisfies \(\left\| {\nabla f_0(x)}\right\| >\epsilon \) for all \(x\in \mathbb {R}^{T+1}\) with \(x_{T} = x_{T+1} = 0\). By Lemma 8.iii, setting
guarantees that the function \(x \mapsto - {\tilde{\lambda }}\cdot \bar{h}_{T+2}(x/D)\) also has \(L_{q}/2\)-Lipschitz qth order derivatives, so that overall, for each \(q \in [p]\) the function f defined in Eq. (33) has \(L_{q}\)-Lipschitz qth order derivative.
We note that by Lemma 8.ii, \(\bar{h}_{T+2}(x)\) is identically 0 in a neighborhood of any x with \(x_{T+2}=0\), which immediately implies that \(\bar{h}_{T+2}\) and f are zero-chains. Therefore for any \(\mathsf {A}\in \mathcal {A}^{(1)}_{ \textsf {zr} }\) producing iterates \(x^{(1)}=0,x^{(2)},x^{(3)},\ldots \) when operating on f, we have \(x^{(t)}_T= x^{(t)}_{T+1}=x^{(t)}_{T+2}=0\) for any \(t\le T\). Thus, by our choices of \(\lambda ,\sigma ,\mu \) and r, \(\left\| {\nabla f(x^{(t)})}\right\| = \left\| {\nabla f_0(x^{(t)})}\right\| > \epsilon \) for every \(t\le T\), and so
To establish that \(f\in \mathcal {F}^\mathrm{dist}_{1: p}(D, L_{1}, ..., L_{p})\), it remains to show that every global minimizer of f has norm at most D. Let \(x^\star \) denote a global minimizer of f, and temporarily assume that
Therefore, \(f(x^\star )< f\left( 0.8D\cdot e^{(T+2)}\right) <0\) and \(\bar{h}_{T+2}(x^\star /D) \ne 0\), as otherwise we have the contradiction \(f(x^\star ) = \lambda \sigma ^2 \bar{f}_{T, \mu , r}(x^\star /\sigma ) \ge 0\). By the definition (32), \(\bar{h}_{T+2}(x^\star /D) \ne 0\) implies that \(1-\frac{25}{2}\left\| {x^\star /D - 0.8e^{(T+2)}}\right\| ^2\ge 0.5\), and therefore \(\left\| {x^\star }\right\| \le D\). To verify the assumed inequality (35), we use Lemma 8.i to obtain
Therefore, if we set
then inequality (35) holds and \(\left\| {x^\star }\right\| \le D\), and so \(f\in \mathcal {F}^\mathrm{dist}_{1: p}(D, L_{1}, ..., L_{p})\). Comparing the setting (36) of T above to the setting (13) of T in the proof of Theorem 2, we see they are identical except that we replace the term \(\Delta \) in (13) with \({\tilde{\lambda }} :=\min _{q\in [p]} (2\tilde{\ell }_{q})^{-1}L_{q}D^{q+1}\). Thus, mimicking the proof of Theorem 2 after the step (13), mutatis mutandis, yields the result. \(\square \)
1.4 Proof of Lemma 5
Lemma 5
Let \(T\in \mathbb {N}\), \(0 < \alpha \le 1\), \(\mu \in [T^{-2}, 1]\) and \(\widetilde{f}_{T, \alpha , \mu }\) be defined as in (19), with \(\Lambda \) and \(\widetilde{\Upsilon }\) satisfying
for \(G>0\) independent of \(T, \alpha \) and \(\mu \). Then there exists \(x \in \mathbb {R}^{T+1}\) such that \(x_T= x_{T+1} = 0\) and
where \(C \le 27 + \sqrt{3}G\).
Proof
We construct x as follows. We let \(x_1 = 1\), and for \(n>1\) let (with \(x_0 :=1\)),
where we take
for some \(m\in \mathbb {N}\) which we will later determine. The elements of \(\nabla \widetilde{f}_{T, \alpha , \mu }\) are given by
where for \(n=1\) we used \(x_1=1\) and \(\Lambda '(0)=0\) to write \(\alpha \cdot \Lambda '(x_1-1) = 0 = \Lambda '(x_1-1)\). Since \(\Lambda '\) is 1-Lipschitz, we have
Moreover, one can readily verify that \(x_n\in [0,1]\) for every n and that \(x_n = 0\) for every \(n > 2m+1\). Therefore, using \(\widetilde{\Upsilon }'(0) = 0\) and \(\max _{z\in [0,1]} | \widetilde{\Upsilon }'(z) | \le G\) we have that \(\left| \widetilde{\Upsilon }'(x_n)\right| \le G\cdot 1_{\left( {n \le 2m+1}\right) }\), which gives the overall bound
and thus,
Taking \(m = \left\lceil \frac{1}{3\sqrt{\mu }} \right\rceil \), we have
where we have used \(\left\lceil 1/(3\sqrt{\mu }) \right\rceil \le 1/\sqrt{\mu }\) since \(\mu \le 1\). Thus, \( \left\| {\nabla \widetilde{f}_{T, \alpha , \mu }(x)}\right\| \le C\mu ^{3/4}\) holds for \(C=27 + \sqrt{3}G\). For \(T \ge 8\), since \(\mu \ge T^{-2}\), we have \(2m+1 \le 2\left\lceil T/3 \right\rceil +1 < T\) and therefore \(x_T= x_{T+1}=0\) holds as required (since \(x_n = 0\) for every \(n > 2m+1\)). In the edge case \(T\le 8\) we have \(\mu \ge T^{-2} \ge 1/64\) and therefore \(x=0\) yields \(\left\| {\nabla \widetilde{f}_{T, \alpha , \mu }(x)}\right\| = \alpha \le 1 \le 27\cdot (1/64)^{3/4} \le C\mu ^{3/4}\). \(\square \)
Cite this article
Carmon, Y., Duchi, J.C., Hinder, O. et al. Lower bounds for finding stationary points II: first-order methods. Math. Program. 185, 315–355 (2021). https://doi.org/10.1007/s10107-019-01431-x
Keywords
- Non-convex optimization
- Information-based complexity
- Dimension-free rates
- Gradient methods
- Accelerated gradient descent