
Lower bounds for finding stationary points II: first-order methods


Abstract

We establish lower bounds on the complexity of finding \(\epsilon \)-stationary points of smooth, non-convex high-dimensional functions using first-order methods. We prove that deterministic first-order methods, even applied to arbitrarily smooth functions, cannot achieve convergence rates in \(\epsilon \) better than \(\epsilon ^{-8/5}\), which is within \(\epsilon ^{-1/15}\log \frac{1}{\epsilon }\) of the best known rate for such methods. Moreover, for functions with Lipschitz first and second derivatives, we prove that no deterministic first-order method can achieve convergence rates better than \(\epsilon ^{-12/7}\), while \(\epsilon ^{-2}\) is a lower bound for functions with only Lipschitz gradient. For convex functions with Lipschitz gradient, accelerated gradient descent achieves a better rate, showing that finding stationary points is easier given convexity.


Notes

  1. Given a bound \(\Vert {x^{(0)}-x^\star }\Vert \le D\), where \(x^\star \) is a global minimizer of \(f\), as is standard for convex optimization, the optimal rate is \(\widetilde{\Theta }(\sqrt{L_{1}D}\epsilon ^{-1/2})\) [24]. The two rates are not directly comparable.

References

  1. Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent. In: Proceedings of the Forty-Ninth Annual ACM Symposium on the Theory of Computing (2017)

  2. Allen-Zhu, Z.: Natasha 2: Faster non-convex optimization than SGD (2017). arXiv:1708.08694 [math.OC]

  3. Allen-Zhu, Z., Hazan, E.: Variance reduction for faster non-convex optimization. In: Proceedings of the 33rd International Conference on Machine Learning (2016)

  4. Arjevani, Y., Shamir, O., Shiff, R.: Oracle complexity of second-order methods for smooth convex optimization (2017). arXiv:1705.07260 [math.OC]

  5. Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A., Toint, P.L.: Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Math. Program. 163(1–2), 359–368 (2017)


  6. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)


  7. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)


  8. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions. In: Proceedings of the 34th International Conference on Machine Learning (2017)

  9. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for non-convex optimization. SIAM J. Optim. 28(2), 1751–1772 (2018)


  10. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points I. Math. Program. (to appear) (2019). https://doi.org/10.1007/s10107-019-01406-y

  11. Cartis, C., Gould, N.I., Toint, P.L.: On the complexity of steepest descent, Newton’s and regularized Newton’s methods for nonconvex unconstrained optimization problems. SIAM J. Optim. 20(6), 2833–2852 (2010)


  12. Cartis, C., Gould, N.I., Toint, P.L.: Complexity bounds for second-order optimality in unconstrained optimization. J. Complex. 28(1), 93–108 (2012)


  13. Cartis, C., Gould, N.I.M., Toint, P.L.: How much patience do you have? A worst-case perspective on smooth nonconvex optimization. Optima 88 (2012)

  14. Cartis, C., Gould, N.I.M., Toint, P.L.: Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization (2017). arXiv:1709.07180 [math.OC]

  15. Chowla, S., Herstein, I.N., Moore, W.K.: On recursions connected with symmetric groups I. Can. J. Math. 3, 328–334 (1951)


  16. Chung, F.R.K.: Spectral Graph Theory. AMS (1998)

  17. Hinder, O.: Cutting plane methods can be extended into nonconvex optimization. In: Proceedings of the Thirty First Annual Conference on Computational Learning Theory (2018)

  18. Jarre, F.: On Nesterov’s smooth Chebyshev–Rosenbrock function. Optim. Methods Softw. 28(3), 478–500 (2013)


  19. Jin, C., Ge, R., Netrapalli, P., Kakade, S.M., Jordan, M. I.: How to escape saddle points efficiently. In: Proceedings of the 34th International Conference on Machine Learning (2017)

  20. Lei, L., Ju, C., Chen, J., Jordan, M. I.: Nonconvex finite-sum optimization via SCSG methods. In: Advances in Neural Information Processing Systems, vol. 31 (2017)

  21. Monteiro, R.D., Svaiter, B.F.: An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM J. Optim. 23(2), 1092–1125 (2013)


  22. Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley, Hoboken (1983)


  23. Nesterov, Y.: Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, Dordrecht (2004)


  24. Nesterov, Y.: How to make the gradients small. Optima 88 (2012)

  25. Nesterov, Y., Polyak, B.: Cubic regularization of Newton method and its global performance. Math. Program. Ser. A 108, 177–205 (2006)


  26. Reddi, S.J., Hefny, A., Sra, S., Poczos, B., Smola, A.: Stochastic variance reduction for nonconvex optimization. In: Proceedings of the 33rd International Conference on Machine Learning (2016)

  27. Simchowitz, M.: On the randomized complexity of minimizing a convex quadratic function (2018). arXiv:1807.09386 [cs.LG]

  28. Simchowitz, M., Aloui, A.E., Recht, B.: Tight query complexity lower bounds for PCA via finite sample deformed Wigner law. In: Proceedings of the Fiftieth Annual ACM Symposium on the Theory of Computing (2018)

  29. Vavasis, S.A.: Black-box complexity of local minimization. SIAM J. Optim. 3(1), 60–80 (1993)


  30. Woodworth, B.E., Srebro, N.: Tight complexity bounds for optimizing composite objectives. Adv. Neural Inf. Process. Syst. 30, 3639–3647 (2016)


  31. Zhang, X., Ling, C., Qi, L.: The best rank-1 approximation of a symmetric tensor and related spherical optimization problems. SIAM J. Matrix Anal. Appl. 33(3), 806–821 (2012)



Author information

Correspondence to Yair Carmon.

Additional information


OH was supported by the PACCAR INC fellowship. YC and JCD were partially supported by the SAIL-Toyota Center for AI Research, NSF-CAREER Award 1553086, and a Sloan Foundation Fellowship in Mathematics. YC was partially supported by the Stanford Graduate Fellowship and the Numerical Technologies Fellowship. AS was supported by the National Science Foundation (CCF-1844855).

Appendices

Additional results for convex functions

1.1 An upper bound for finding stationary points of value-bounded functions

Here we give a first-order method that finds \(\epsilon \)-stationary points of a function \(f\in \mathcal {K}_1\left( \Delta , L_{1}\right) \) in \(O(\sqrt{L_{1}\Delta }\epsilon ^{-1}\log \frac{L_{1}\Delta }{\epsilon ^2})\) iterations. The method consists of Nesterov’s accelerated gradient descent (AGD) applied to the sum of f and a standard quadratic regularizer.

Our starting point is AGD for strongly convex functions; a function f is \(\sigma \)-strongly convex if

$$\begin{aligned} f(y) \ge f(x) + \langle \nabla f(x), y-x\rangle + \frac{\sigma }{2}\left\| {y-x}\right\| ^2, \end{aligned}$$

for every \(x, y\) in the domain of f. Let \(\mathsf {AGD}_{\sigma , L_{1}}\in \mathcal {A}^{(1)}_{ \textsf {zr} }\cap \mathcal {A}^{(1)}_{ \textsf {det} }\) be the accelerated gradient scheme developed in [23, §2.2.1] for \(\sigma \)-strongly convex functions with \(L_{1}\)-Lipschitz gradient, initialized at \(x^{(1)}=0\) (the exact step size scheme is not important). For any \(L_{1}\)-smooth f with global minimizer \(x^\star _f\), \(\epsilon ^2/(2L_{1})\)-suboptimality guarantees \(\epsilon \)-stationarity, since \(\left\| {\nabla f(x)}\right\| ^2 \le 2L_{1}(f(x)-f(x^\star _f))\) [6, Eq. (9.14)]. Therefore, adapting [23, Thm. 2.2.2] to our notation gives

$$\begin{aligned} \mathsf {T}_{\epsilon }\big (\mathsf {AGD}_{\sigma , L_{1}}, f\big ) \le 1+ 2\sqrt{\frac{L_{1}}{\sigma }}\log _+\left( \frac{L_{1}\Vert {x_f^\star }\Vert }{\epsilon }\right) , \end{aligned}$$
(20)

with \(\log _+(x) :=\max \{0, \log x\}\).
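For completeness, the conversion from suboptimality to stationarity used above is the standard one-line consequence of the descent lemma: since \(\nabla f\) is \(L_{1}\)-Lipschitz,

$$\begin{aligned} f(x^\star _f) \le f\Big (x - \tfrac{1}{L_{1}}\nabla f(x)\Big ) \le f(x) - \frac{1}{2L_{1}}\left\| {\nabla f(x)}\right\| ^2, \end{aligned}$$

so that \(f(x)-f(x^\star _f) \le \epsilon ^2/(2L_{1})\) indeed implies \(\left\| {\nabla f(x)}\right\| \le \epsilon \).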

Now suppose that f is convex with \(L_{1}\)-Lipschitz gradient but not necessarily strongly-convex. We can add strong convexity to f by means of a proximal term; for any \(\sigma >0\), the function

$$\begin{aligned} f_\sigma (x):=f(x) + \frac{\sigma }{2}\left\| {x}\right\| ^2 \end{aligned}$$

is \(\sigma \)-strongly-convex with \((L_{1}+\sigma )\)-Lipschitz gradient. With this in mind, we define a proximal version of AGD as follows,

$$\begin{aligned} \mathsf {PAGD}_{\sigma , L_{1}}[f] :=\mathsf {AGD}_{\sigma , L_{1}+\sigma }[f_\sigma ] = \mathsf {AGD}_{\sigma , L_{1}+\sigma }\left[ f(\cdot ) + \frac{\sigma }{2}\left\| {\cdot }\right\| ^2\right] . \end{aligned}$$
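The reduction above is straightforward to simulate. The following is a minimal sketch (ours, not the authors' implementation) of \(\mathsf {PAGD}\): constant-momentum AGD for strongly convex functions applied to the regularized objective \(f_\sigma \). The particular step-size and momentum constants and the stopping rule on \(\left\| {\nabla f_\sigma }\right\| \) are illustrative assumptions; as noted above, the exact scheme is not important.

```python
import numpy as np

def agd_strongly_convex(grad, sigma, L, x0, tol, max_iter=100_000):
    """One standard variant of Nesterov's AGD for a sigma-strongly convex
    function with L-Lipschitz gradient, started at x0."""
    kappa = L / sigma
    beta = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)  # momentum coefficient
    x, y = x0.copy(), x0.copy()
    for t in range(max_iter):
        g = grad(y)
        if np.linalg.norm(g) <= tol:        # stop once the gradient of f_sigma is small
            return y, t + 1
        x_next = y - g / L                  # gradient step from the extrapolated point
        y = x_next + beta * (x_next - x)    # extrapolation (momentum) step
        x = x_next
    return y, max_iter

def pagd(grad_f, L1, Delta, eps, dim):
    """PAGD: AGD on f_sigma(x) = f(x) + (sigma/2)||x||^2 with sigma = eps^2/(3*Delta),
    run to accuracy eps/6 on ||grad f_sigma||, mirroring Proposition 2 and (21)."""
    sigma = eps ** 2 / (3.0 * Delta)
    grad_f_sigma = lambda x: grad_f(x) + sigma * x
    return agd_strongly_convex(grad_f_sigma, sigma, L1 + sigma, np.zeros(dim), eps / 6.0)
```

Monitoring \(\left\| {\nabla f_\sigma }\right\| \) rather than \(\left\| {\nabla f}\right\| \) matches the proof below, which converts an \(\epsilon /6\) bound on the former into an \(\epsilon \) bound on the latter.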

Proposition 2

Let \(\Delta ,L_{1}\) and \(\epsilon \) be positive, and let \(\sigma = \frac{\epsilon ^2}{3\Delta }\). Then, algorithm \(\mathsf {PAGD}_{\sigma , L_{1}} \in \mathcal {A}^{(1)}_{ \textsf {det} }\) satisfies

$$\begin{aligned} \mathcal {T}_{\epsilon }\big (\mathcal {A}^{(1)}_{ \textsf {det} }, \mathcal {K}_1\left( \Delta , L_{1}\right) \big )\le & {} \sup _{f\in \mathcal {K}_1\left( \Delta , L_{1}\right) } \mathsf {T}_{\epsilon }\big (\mathsf {PAGD}_{\sigma , L_{1}}, f\big )\\\le & {} 1+5\frac{\sqrt{L_{1}\Delta }}{\epsilon }\log _+ \left( \frac{25L_{1}\Delta }{\epsilon ^2}\right) . \end{aligned}$$

Proof

For any \(f\in \mathcal {K}_1\left( \Delta , L_{1}\right) \), recall that \(f_\sigma (x) :=f(x) +\frac{\sigma }{2}\left\| {x}\right\| ^2\) and let \( \{x^{(t)}\}_{t\in \mathbb {N}} = \mathsf {PAGD}_{\sigma , L_{1}}[f] = \mathsf {AGD}_{\sigma , L_{1}+\sigma }[f_\sigma ] \) be the sequence of iterates \(\mathsf {PAGD}_{\sigma , L_{1}}\) produces on f. Then by guarantee (20), we have

$$\begin{aligned} \left\| {\nabla f_\sigma (x^{(T)})}\right\| \le \epsilon /6 \end{aligned}$$
(21)

for some T such that

$$\begin{aligned} T \le 1+ 2\sqrt{1+\frac{L_{1}}{\sigma }}\log _+\left( \frac{6(L_{1}+\sigma ) \Vert {x_{f_\sigma }^\star }\Vert }{\epsilon }\right) . \end{aligned}$$
(22)

For any point y such that \(f_\sigma (y) = f(y) + \frac{\sigma }{2} \left\| {y}\right\| ^2 \le f_\sigma (0)=f(0)\), we have

$$\begin{aligned} \left\| {y}\right\| ^2 \le \frac{2(f(0)-f(y))}{\sigma } \le \frac{2(f(0)-\inf _x f(x))}{\sigma } \le \frac{2\Delta }{\sigma }. \end{aligned}$$

Clearly, \(f_\sigma (x_{f_\sigma }^\star ) \le f_\sigma (0)\) and [23, Thm. 2.2.2] also guarantees that \(f_\sigma (x^{(T)}) \le f_\sigma (0)\). Consequently,

$$\begin{aligned} \max \left\{ \Vert {x^{(T)}}\Vert , \Vert {x_{f_\sigma }^\star }\Vert \right\} \le \sqrt{\frac{2\Delta }{\sigma }}, \end{aligned}$$
(23)

and so

$$\begin{aligned} \Vert {\nabla f(x^{(T)})}\Vert = \Vert {\nabla f_\sigma (x^{(T)})-\sigma \cdot x^{(T)}}\Vert \le \Vert {\nabla f_\sigma (x^{(T)})}\Vert + \sigma \Vert {x^{(T)}}\Vert {\mathop {\le }\limits ^{(i)}} \frac{\epsilon }{6} + \sqrt{2\sigma \Delta } {\mathop {\le }\limits ^{(ii)}} \epsilon . \end{aligned}$$

In inequality (i) we substituted bounds (21) and (23), and in (ii) we used \(\sigma = \epsilon ^2/(3\Delta )\). We conclude that \(\mathsf {T}_{\epsilon }\big (\mathsf {PAGD}_{\sigma , L_{1}}, f\big ) \le T\), and substituting (23) and the definition of \(\sigma \) into (22) we have

$$\begin{aligned} T \le 1+ 2\sqrt{1+\frac{3L_{1}\Delta }{\epsilon ^2}}\log _+\left( 6\sqrt{\frac{2}{3}}+\frac{6\sqrt{6}L_{1}\Delta }{\epsilon ^2}\right) . \end{aligned}$$

Without loss of generality, we may assume \(\frac{2 L_{1}\Delta }{\epsilon ^2} \ge 1\), as otherwise \(\mathsf {T}_{\epsilon }\big (\mathsf {PAGD}_{\sigma , L_{1}}, f\big ) = 1\). We thus simplify the expression slightly to obtain the proposition. \(\square \)

1.2 The impossibility of approximate optimality without a bounded domain

Lemma 6

Let \(L_{1},\Delta >0\) and \(\epsilon < \Delta \). For any first-order algorithm \(\mathsf {A}\in \mathcal {A}^{(1)}_{ \textsf {det} }\cup \mathcal {A}^{(1)}_{ \textsf {zr} }\) and any \(T\in \mathbb {N}\), there exists a function \(f\in \mathcal {Q}\left( \Delta , L_{1}\right) \) such that the iterates \(\{x^{(t)}\}_{t \in \mathbb {N}} = \mathsf {A}[f]\) satisfy

$$\begin{aligned} \inf _{t \in \mathbb {N}} \left\{ t \mid f(x^{(t)}) \le \inf _x f(x) + \epsilon \right\} > T. \end{aligned}$$

Proof

By Proposition 1 it suffices to consider \(\mathsf {A}\in \mathcal {A}^{(1)}_{ \textsf {zr} }\) (see additional discussion of the generality of Proposition 1 in Section Pi.3.3). Consider the function \(f:\mathbb {R}^T\rightarrow \mathbb {R}\),

$$\begin{aligned} f(x) = \lambda \left[ (\sigma - \beta x_1)^2 + \sum _{i = 1}^{T-1} (x_i - \beta x_{i + 1})^2 \right] , \end{aligned}$$
(24)

where \(0<\beta < 1\), and we take

$$\begin{aligned} \lambda :=\frac{L_{1}}{2(1+2\beta +\beta ^2)} ~~\text{ and }~~ \sigma :=\sqrt{\frac{\Delta }{\lambda }}. \end{aligned}$$

Since f(x) is of the form \(\lambda \left\| {A x-b}\right\| ^2\) where \(\left\| {A}\right\| _\mathrm{op}\le 1+\beta \), we have \(\left\| {\nabla ^{{2}}f(x)}\right\| _\mathrm{op}\le 2\lambda \left\| {A}\right\| _\mathrm{op}^2\) for every \(x\in \mathbb {R}^T\) and therefore f has \(2 \lambda (1 + 2 \beta + \beta ^2)\)-Lipschitz gradient. Additionally, f satisfies \(\inf _x f(x) = 0\) and \(f(0) = \lambda \sigma ^2\), and so the above choices of \(\lambda \) and \(\sigma \) guarantee that \(f\in \mathcal {Q}\left( \Delta , L_{1}\right) \). Moreover, f is a first-order zero-chain (Definition 3), and thus for any \(\mathsf {A}\in \mathcal {A}^{(1)}_{ \textsf {zr} }\) and \(\{x^{(t)}\}_{t \in \mathbb {N}} = \mathsf {A}[f]\), we have \(x^{(t)}_{T} = 0\) for \(t\le T\) (Observation 1). Therefore, it suffices to show that \(f(x) > \inf _y f(y) + \epsilon \) whenever \(x_T =0\).

We make the following inductive claim: if \(f(x) \le \inf _y f(y) + \epsilon = \epsilon \), then

$$\begin{aligned} \left| x_i - \sigma \beta ^{-i} \right| \le \sum _{j=1}^{i} \beta ^{-j} \sqrt{\frac{\epsilon }{\lambda }} < \frac{\beta ^{-i}}{1-\beta }\sqrt{\frac{\epsilon }{\lambda }} \end{aligned}$$
(25)

for all \(i\le T\). Indeed, each term in the sum (24) defining f is non-negative, so for the base case of the induction \(i=1\), we have \(\lambda (\sigma - \beta x_1)^2 \le \epsilon \), or \(\left| x_1 - \sigma \beta ^{-1}\right| \le \beta ^{-1} \sqrt{\epsilon / \lambda }\). For \(i<T\), assuming that \(x_i\) satisfies the bound (25), we have that \(\lambda (x_i - \beta x_{i + 1})^2 \le \epsilon \), which implies

$$\begin{aligned} \left| x_{i + 1}-\sigma \beta ^{-(i+1)} \right|&\le \left| x_{i + 1}-\beta ^{-1}x_i \right| + \beta ^{-1}\left| x_i-\sigma \beta ^{-i} \right| \\&\le \beta ^{-1}\sqrt{\frac{\epsilon }{\lambda }} + \sum _{j=1}^{i} \beta ^{-(j+1)} \sqrt{\frac{\epsilon }{\lambda }}\\&= \sum _{j=1}^{i+1} \beta ^{-j} \sqrt{\frac{\epsilon }{\lambda }}, \end{aligned}$$

which is the desired claim (25) for \(x_{i+1}\).

The bound (25) implies \(x_i\ne 0\) for all \(i\le T\) whenever \(\sigma \ge (1-\beta )^{-1}\sqrt{\epsilon /\lambda }\). Therefore, we choose \(\beta \) to satisfy \(\sigma = (1-\beta )^{-1}\sqrt{\epsilon /\lambda }\), that is

$$\begin{aligned} \beta :=1- \sqrt{\frac{\epsilon }{\lambda \sigma ^2}} = 1- \sqrt{\frac{\epsilon }{\Delta }}, \end{aligned}$$

for which \(0<\beta <1\) since we assume \(\epsilon <\Delta \). Thus, we guarantee that when \(x_T=0\) we must have \(f(x) > \inf _y f(y) + \epsilon \), giving the result. \(\square \)
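The construction above is simple to check numerically. The sketch below (an illustration under the parameter choices of the proof, not part of the argument) builds the chain function (24), verifies that \(f(0)=\Delta \), that the constant Hessian has operator norm at most \(L_{1}\), and that the gradient is supported on at most one coordinate beyond the support of the query point, which is the zero-chain property used above.

```python
import numpy as np

def f_and_grad(x, lam, sigma, beta):
    """Chain function (24): f(x) = lam * [(sigma - beta*x_1)^2 + sum_i (x_i - beta*x_{i+1})^2]."""
    T = x.size
    r = np.empty(T)
    r[0] = sigma - beta * x[0]              # r_0 = sigma - beta * x_1
    r[1:] = x[:-1] - beta * x[1:]           # r_i = x_i - beta * x_{i+1}
    f = lam * np.sum(r ** 2)
    g = np.zeros(T)
    g[0] += -2.0 * lam * beta * r[0]
    g[:-1] += 2.0 * lam * r[1:]
    g[1:] += -2.0 * lam * beta * r[1:]
    return f, g

Delta, L1, eps, T = 1.0, 1.0, 0.1, 20
beta = 1.0 - np.sqrt(eps / Delta)
lam = L1 / (2.0 * (1.0 + 2.0 * beta + beta ** 2))
sigma = np.sqrt(Delta / lam)

# value bound: f(0) = lam * sigma^2 = Delta and inf_x f(x) = 0
f0, _ = f_and_grad(np.zeros(T), lam, sigma, beta)
assert np.isclose(f0, Delta)

# Lipschitz gradient: the Hessian is 2*lam*A^T A with ||A||_op <= 1 + beta
A = -beta * np.eye(T)
A[np.arange(1, T), np.arange(T - 1)] = 1.0   # sub-diagonal of ones
assert np.linalg.norm(2.0 * lam * A.T @ A, 2) <= L1 + 1e-12

# zero-chain property: x supported on its first k coordinates implies
# grad f(x) supported on the first k+1 coordinates
k = 5
x = np.zeros(T)
x[:k] = np.random.default_rng(0).standard_normal(k)
_, g = f_and_grad(x, lam, sigma, beta)
assert np.allclose(g[k + 1:], 0.0)
```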

Technical results

1.1 Proof of Lemma 2

Lemma 2

The function \(\Upsilon _r\) satisfies the following.

  i. We have \(\Upsilon _r'(0) = \Upsilon _r'(1) = 0\).

  ii. For all \(x \le 1\), \(\Upsilon _r'(x) \le 0\), and for all \(x \ge 1\), \(\Upsilon _r'(x) \ge 0\).

  iii. For all \(x \in \mathbb {R}\) we have \(\Upsilon _r(x) \ge \Upsilon _r(1) = 0\), and for all r, \(\Upsilon _r(0) \le 10\).

  iv. For every \(r\ge 1\), \(\Upsilon _r'(x) < -1\) for every \(x\in (-\infty ,-0.1] \cup [0.1,0.9]\).

  v. For every \(r \ge 1\) and every \(p \ge 1\), the p-th order derivatives of \(\Upsilon _r\) are \(r^{3-p}\ell _{p}\)-Lipschitz continuous, where \(\ell _{p} \le \exp (\frac{3}{2}p \log p + c p)\) for a numerical constant \(c < \infty \).

Proof

Parts i and ii are evident from inspection, as

$$\begin{aligned} \Upsilon _r'(x) = 120 \frac{x^2 (x - 1)}{1 + (x/r)^2}. \end{aligned}$$

To see part iii, note that \(\Upsilon _r\) is non-increasing for every \(x<1\) and non-decreasing for every \(x>1\) and therefore \(x=1\) is its global minimum. That \(\Upsilon _r(1)=0\) is immediate from its definition, and, for every r, \(\Upsilon _r(0) = 120\int _0^1 \frac{t^2(1-t)}{1+(t/r)^2}dt \le 120\int _0^1 t^2(1-t)dt = 10\). To see part iv, note that \(|\Upsilon _r'(x)| \ge |\Upsilon _1'(x)|\) for every \(r \ge 1\), and a calculation shows \(|\Upsilon _1'(x)| > 1\) for \(x\in (-\infty ,-0.1] \cup [0.1,0.9]\) (see Fig. 1).
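The calculation behind part iv is easy to reproduce; the sketch below (illustrative only) checks \(|\Upsilon _1'(x)| > 1\) on a fine grid over \([0.1,0.9]\) and over \([-1,-0.1]\), while for \(x\le -1\) the bound is immediate since \(x^2/(1+x^2)\ge 1/2\) and \(1-x\ge 2\), so \(|\Upsilon _1'(x)|\ge 120\).

```python
import numpy as np

def dUpsilon(x, r=1.0):
    """Derivative of Upsilon_r: 120 * x^2 * (x - 1) / (1 + (x/r)^2)."""
    return 120.0 * x ** 2 * (x - 1.0) / (1.0 + (x / r) ** 2)

grid_pos = np.linspace(0.1, 0.9, 100_001)
grid_neg = np.linspace(-1.0, -0.1, 100_001)
assert np.all(np.abs(dUpsilon(grid_pos)) > 1.0)   # minimum ~1.07, attained at x = 0.1
assert np.all(np.abs(dUpsilon(grid_neg)) > 1.0)   # minimum ~1.31, attained at x = -0.1
```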

To see the fifth part of the claim, note that

$$\begin{aligned} \Upsilon _r'(x)= & {} 120r^2(x-1)\left( 1 - \frac{1}{1+(x/r)^2}\right) \\= & {} 120\left[ r^2 (x-1) - r^3 \varphi _1(x/r) + r^2 \varphi _2(x/r)\right] , \end{aligned}$$

where the functions \(\varphi _1\) and \(\varphi _2\) are \(\varphi _1(\xi ) = \xi /(1+\xi ^2)\) and \(\varphi _2(\xi ) = 1/(1+\xi ^2)\). We thus bound the derivatives of \(\varphi _1\) and \(\varphi _2\). We begin with \(\varphi _2\), which we can write as the composition \(\varphi _2(x) = (h \circ g)(x)\) where \(h(x) = \frac{1}{x}\) and \(g(x) = 1 + x^2\). Let \(\mathcal {P}_{k,2}\) denote the collection of all partitions of \(\{1, \ldots , k\}\) where each element of the partition has at most 2 indices. That is, if \(P \in \mathcal {P}_{k,2}\), then \(P = (S_1, \ldots , S_l)\) for some \(l \le k\), the \(S_i\) are disjoint, \(1 \le |S_i| \le 2\), and \(\cup _i S_i = [k]\). The cardinality \(|\mathcal {P}_{k,2}|\) is the number of matchings in the complete graph on k vertices, or the kth telephone number, which has bound [15, Lemma 2]

$$\begin{aligned} |\mathcal {P}_{k,2}| \le \exp \left( \frac{k}{2} \log k + k \log 2\right) . \end{aligned}$$
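As a side check (not needed for the proof), \(|\mathcal {P}_{k,2}|\) is the kth telephone number, which satisfies the recursion \(T(k) = T(k-1) + (k-1)T(k-2)\) with \(T(0)=T(1)=1\); the sketch below verifies the stated bound for small k.

```python
import math

def telephone(k):
    """Number of partitions of {1,...,k} into blocks of size at most 2
    (equivalently, matchings of the complete graph on k vertices)."""
    a, b = 1, 1                       # T(0), T(1)
    for n in range(2, k + 1):
        a, b = b, b + (n - 1) * a
    return b

for k in range(1, 21):
    assert telephone(k) <= math.exp(0.5 * k * math.log(k) + k * math.log(2.0))
```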

We may then apply Faà di Bruno’s formula for the chain rule to obtain

$$\begin{aligned} \varphi _2^{(k)}(x)= & {} \sum _{P \in \mathcal {P}_k} h^{(|P|)}(g(x)) \prod _{S \in P} g^{(|S|)}(x)\\= & {} \sum _{P \in \mathcal {P}_{k,2}} (-1)^{|P|} \frac{(|P| - 1)!}{(1 + x^2)^{|P|}} (2x)^{\mathsf {C}_1(P)} 2^{\mathsf {C}_2(P)}, \end{aligned}$$

where \(\mathsf {C}_i(P)\) denotes the number of sets in P with precisely i elements. Of course, we have \(|x|^{\mathsf {C}_1(P)} / (1 + x^2)^{|P|} \le 1\), and thus

$$\begin{aligned} |\varphi _2^{(k)}(x)| \le \sum _{P \in \mathcal {P}_{k,2}} (|P| - 1)! 2^{|P|} \le |\mathcal {P}_{k,2}| \cdot (k - 1)! \cdot 2^{k} \le e^{\frac{3k}{2} \log k + 2k \log 2}. \end{aligned}$$

The proof of the upper bound on \(\varphi _1^{(k)}(x)\) is similar (\(2\varphi _1(x)=\frac{d}{dx}[(\hat{h}\circ g)(x)]\) with \(\hat{h}(x) = \log x\) and g as defined above), so for every \(r \ge 1\) and \(p \ge 1\), the \(p+1\)-th derivative of \(\Upsilon _r\) has the bound

$$\begin{aligned} |\Upsilon _r^{(p+1)}(x)|&\le 120\left[ r^2 1_{\left( {p = 1}\right) } + r^{3-p}|\varphi _1^{(p)}(x/r)| + r^{2-p}|\varphi _2^{(p)}(x/r)| \right] \\&\le 120 r^{3 - p}e^{\frac{3}{2}p\log p + cp}, \end{aligned}$$

where \(c < \infty \) is a numerical constant. \(\square \)

1.2 Proof of Lemma 3

Lemma 3

Let \(r\ge 1\) and \(\mu \le 1\). For any \(x\in \mathbb {R}^{T+1}\) such that \(x_T = x_{T+1} = 0\),

$$\begin{aligned} \left\| {\nabla \bar{f}_{T, \mu , r}(x)}\right\| > \mu ^{3/4}/4. \end{aligned}$$

Proof

Throughout the proof, we fix \(x\in \mathbb {R}^{T+1}\) such that \(x_{T}=x_{T+1}=0\); for convenience in notation, we define \(x_{0} :=1\). Our strategy is to carefully pick two indices \(i_1\in \{0, \ldots , T-1\}\) and \(i_2\in \{i_1+1, \ldots , T\}\), such that \(\left\| {\nabla \bar{f}_{T, \mu , r}(x)}\right\| ^2 \ge \sum _{i=i_1+1}^{i_2} \left| \nabla _i \bar{f}_{T, \mu , r}(x) \right| ^2 > (\mu ^{3/4}/4)^2\). We call the set of indices from \(i_1+1\) to \(i_2\) the transition region, and construct it as follows.

$$\begin{aligned} \text{ Let } i_1\ge 0 \text{ be the largest } i \text{ such that } x_{i}>0.9, \end{aligned}$$

so that \(x_{j}\le 0.9\) for every \(j>i_1\). Note that \(i_1=0\) when \(x_i \le 0.9\) for every \(i\in [T+1]\). This is a somewhat special case due to the coefficient \(\sqrt{\mu }\le 1\) of the first “link” in the quadratic chain term in (9). To handle it cleanly we define

$$\begin{aligned} \alpha :={\left\{ \begin{array}{ll} 1 &{} i_1> 0 \\ \sqrt{\mu } &{} i_1= 0. \end{array}\right. } \end{aligned}$$

Continuing with construction of the transition region, we make the following definition.

$$\begin{aligned} \text{ Let } i_2'\le T \text{ be the smallest } j \text{ such that } j>i_1 \text{ and } x_{j}<0.1, \end{aligned}$$

and let \(m'=i_2'-i_1\), so \(m' \ge 1\). Roughly, our transition region consists of the \(m'\) indices \(i_1+1,\ldots ,i_2'\), but for technical reasons we attach to it the following decreasing ‘tail’.

$$\begin{aligned} \text{ Let } i_2 \text{ be the smallest } k \text{ such that } k\ge i_2' \text{ and } x_{k+1}\ge x_{k}-\frac{0.2}{m'-1+1/\alpha }1_{\left( {x_{k} > -0.1}\right) }. \end{aligned}$$

With these definitions, \(i_2\) is well-defined and \(0 \le i_1< i_2\le T\), since \(x_{T+1}-x_{T}=0\). We denote the transition region and associated length by

$$\begin{aligned} \mathcal {I}_\mathrm {trans}:=\left\{ i_1+1,\ldots ,i_2\right\} ~~ \text{ and } ~~ m :=i_2-i_1\ge 1. \end{aligned}$$
(26)

We illustrate our definition of the transition region in Fig. 3.

Fig. 3

The transition region (26) in the proof of Lemma 3. Each plot shows the entries of a vector \(x\in \mathbb {R}^{T+1}\) that satisfies \(x_T= x_{T+1} = 0\). The entries of x belonging to the transition region \(\mathcal {I}_\mathrm {trans}\) are blue (color figure online)

Let us describe the transition region. In the “head” of the region, we have \(0.1\le x_{i}\le 0.9\) for every \(i\in \left\{ i_1+1,\ldots ,i_2'-1\right\} \); a total of \(m'-1\) indices. The “tail” of the transition region is strictly decreasing, \(x_{i_2}<x_{i_2-1}<\cdots <x_{i_2'}\). Moreover, for any \(j \in \{i_2'+1, \ldots , i_2-1\}\) such that \(x_j > -0.1\), the decrease is rapid; \(x_j < x_{j - 1} - 0.2 / (m'-1+1/\alpha )\). This description leads us to the following technical properties.

Lemma 7

Let the transition region \(\mathcal {I}_\mathrm {trans}\) be defined as above (26). Then

  i. \(x_{i_1}> 0.9> 0.1 > x_{i_2}\) and \(-x_{i_2} + \left( m-1+\alpha ^{-1}\right) \left( x_{i_2+1}-x_{i_2}\right) > -0.3\).

  ii. \(\Upsilon _r'\left( x_{i}\right) \le 0\) for every \(i\in \mathcal {I}_\mathrm {trans}\), and \(\Upsilon _r'\left( x_{i}\right) <-1\) for at least \(\left( m-\alpha ^{-1}\right) /2\) indices in \(\mathcal {I}_\mathrm {trans}\).

We defer the proof of the lemma to the end of this section, continuing the proof assuming it.

We now lower bound \(\Vert {\nabla \bar{f}_{T, \mu , r}(x)}\Vert \). For notational convenience, define \(g_{i}=\mu \Upsilon _r'\left( x_{i}\right) \), and recalling that \(x_{T}=x_{T+1}=0\), we see that the norm of the gradient of \(\bar{f}_{T, \mu , r}\) is

$$\begin{aligned} \left\| {\nabla \bar{f}_{T, \mu , r}\left( x\right) }\right\| ^{2}&=\left( (1+\sqrt{\mu })x_{1}-\sqrt{\mu }-x_2+g_1\right) ^{2}+\sum _{i=1}^{T}\left( 2x_{i}-x_{i-1}-x_{i+1}+g_{i}\right) ^{2} \nonumber \\&\ge \left( (1+\alpha )x_{i_1+1}-\alpha x_{i_1}-x_{i_1+2}+g_{i_1+1}\right) ^{2}\nonumber \\&\quad +\sum _{i=i_1+2}^{i_2}\left( 2x_{i}-x_{i-1}-x_{i+1}+g_{i}\right) ^{2}, \end{aligned}$$
(27)

where we made use of the notation \(\alpha :=1\) if \(i_1> 0\) and \(\alpha :=\sqrt{\mu }\) if \(i_1=0\). We obtain a lower bound for the final sum of m squares (27) by fixing \(x_{i_1}\), \(x_{i_2}\), and \(g_{i_1+1}, \ldots , g_{i_2}\), then minimizing the quadratic form explicitly over the \(m - 1\) variables \(x_{i_1+1},\ldots ,x_{i_2-1}\). We obtain

$$\begin{aligned} \left\| {\nabla \bar{f}_{T, \mu , r}\left( x\right) }\right\| ^{2}&\ge \inf _{v \in \mathbb {R}^{m-1} }\Big \{ \left( (1+\alpha )v_{1}-\alpha x_{i_1}-v_{2}+g_{i_1+1}\right) ^{2} \\&\quad + \sum _{j=2}^{m-2}\left( 2v_{j}-v_{j-1}-v_{j+1} + g_{i_1+ j} \right) ^{2} \\&\quad +\left( 2v_{m-1}-v_{m-2}-x_{i_2}+g_{i_2-1}\right) ^{2} + \left( 2x_{i_2}-v_{m-1}-x_{i_2+1}+g_{i_2}\right) ^{2}\Big \} \\&=\inf _{v\in \mathbb {R}^{m-1}} \left\| {Av-b}\right\| ^{2} =b^\top \left( I-A\left( A^{\top }A\right) ^{-1}A^{\top }\right) b =\left( z^{\top }b\right) ^{2}, \end{aligned}$$

where the matrix A and vector b have definitions

$$\begin{aligned} A=\left[ \begin{array}{lllll} 1+\alpha &{} -1\\ -1 &{}\quad 2 &{} -1\\ &{}\quad \ddots &{}\quad \ddots &{}\quad \ddots \\ &{} &{} -1 &{}\quad 2 &{} -1\\ &{} &{} &{} -1 &{}\quad 2\\ &{} &{} &{} &{} -1 \end{array}\right] \in \mathbb {R}^{m\times \left( m-1\right) } ~~ \text{ and } ~~ b=\left[ \begin{array}{c} \alpha x_{i_1}-g_{i_1+1}\\ -g_{i_1+2}\\ \vdots \\ -g_{i_2-2}\\ x_{i_2}-g_{i_2-1}\\ -2x_{i_2}+x_{i_2+1}-g_{i_2} \end{array}\right] \in \mathbb {R}^{m}, \end{aligned}$$

and \(z \in \mathbb {R}^{m}\) is a unit-norm solution to \(A^{\top }z=0\). The vector \(z \in \mathbb {R}^m\) with

$$\begin{aligned} z_{j}= \frac{j-1+\frac{1}{\alpha }}{\sqrt{\sum _{i=1}^{m}(i-1+\frac{1}{\alpha })^{2}}} \end{aligned}$$

is such a solution. Thus

$$\begin{aligned}&\left\| {\nabla \bar{f}_{T, \mu , r}\left( x\right) }\right\| ^{2}\nonumber \\&\quad \ge \frac{\left( x_{i_1}- \sum _{j=1}^{m}\left( j-1+\frac{1}{\alpha }\right) \cdot g_{i_1+j} + \big (m-2+\frac{1}{\alpha }\big )\cdot x_{i_2} +\left( m-1+\frac{1}{\alpha }\right) (-2x_{i_2}+x_{i_2+1})\right) ^{2}}{\sum _{i=1}^{m}(i-1+\frac{1}{\alpha })^{2}} \nonumber \\&\quad =\frac{1}{\sum _{i=1}^{m}(i-1+\frac{1}{\alpha })^{2}} \bigg (x_{i_1}-x_{i_2}+\left( m-1+\frac{1}{\alpha }\right) \left( x_{i_2+1}-x_{i_2}\right) \nonumber \\&\qquad -\sum _{j=1}^{m}\left( j-1+\frac{1}{\alpha }\right) \cdot g_{i_1+j}\bigg )^{2}. \end{aligned}$$
(28)

We now bring to bear the properties of the transition region Lemma 7 supplies. By Lemma 7.i,

$$\begin{aligned} x_{i_1}-x_{i_2}+(m-1+\alpha ^{-1})\left( x_{i_2+1}-x_{i_2}\right) \ge 0.9-0.3=\frac{3}{5}, \end{aligned}$$
(29)

and by Lemma 7.ii, using \(1 \le \alpha ^{-1} \le 1/\sqrt{\mu }\),

$$\begin{aligned} -\sum _{j=1}^{m}(j-1+\alpha ^{-1}) g_{i_1+j}\ge & {} \mu \sum _{j=1}^{\left( m-\alpha ^{-1}\right) /2}(j-1+\alpha ^{-1})\nonumber \\\ge & {} \frac{\mu }{8}\left[ m^2-\frac{1}{\alpha ^2}\right] _+\nonumber \\\ge & {} \frac{1}{8}\left[ \mu m^2 - 1\right] _+. \end{aligned}$$
(30)

Substituting \(\sum _{i=1}^{m}\left( i-1+\alpha ^{-1}\right) ^{2} \le \frac{1}{2}m\left( m+1/\sqrt{\mu }\right) \left( m+2/\sqrt{\mu }\right) \) and the bounds (29) and (30) into the gradient lower bound (28), we have that

$$\begin{aligned} \left\| { \nabla \bar{f}_{T, \mu , r}\left( x\right) }\right\| \ge \mu ^{3/4} \cdot \zeta (m\sqrt{\mu }) ~~\text{ where }~~ \zeta (t) :=\sqrt{\frac{2}{t(t+1)(t+2)}}\left( \frac{3}{5}+\frac{1}{8}\left[ t^2 - 1\right] _+ \right) . \end{aligned}$$

A quick computation reveals that \(\inf _{t> 0}\zeta (t) \approx 0.28 > 1/4\), which gives the result. \(\square \)
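This infimum is easy to reproduce; since \(\zeta (t)\rightarrow \infty \) both as \(t\rightarrow 0^+\) and as \(t\rightarrow \infty \), a grid search over a bounded interval suffices (illustrative sketch):

```python
import numpy as np

def zeta(t):
    """zeta(t) = sqrt(2/(t(t+1)(t+2))) * (3/5 + (1/8)[t^2 - 1]_+)."""
    return np.sqrt(2.0 / (t * (t + 1.0) * (t + 2.0))) * (0.6 + 0.125 * np.maximum(t ** 2 - 1.0, 0.0))

t = np.linspace(1e-4, 20.0, 2_000_000)
print(zeta(t).min())   # ~0.281, attained near t ~ 2.1, so inf zeta > 1/4
```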

Proof of Lemma 7

We have by definition that \(x_{i_1} > 0.9\) and \(x_{i_2} \le x_{i_2'} < 0.1\). To see that

$$\begin{aligned} -x_{i_2} + \left( m-1+\alpha ^{-1}\right) \left( x_{i_2+1}-x_{i_2}\right) \ge -0.3 \end{aligned}$$

holds, consider the two cases that \(x_{i_2} \le -0.1\) or \(x_{i_2} > -0.1\). In the first case that \(x_{i_2}\le -0.1\), by definition \(x_{i_2+1}\ge x_{i_2}\) so \(-x_{i_2} + \left( m-1+\alpha ^{-1}\right) \left( x_{i_2+1}-x_{i_2}\right)>0.1>-0.3\). The second case that \(x_{i_2}>-0.1\) is a bit more subtle. By definition of the sequence \(x_{i_2}, \ldots , x_{i_2'}\), we have

$$\begin{aligned} -0.1< & {} x_{i_2}< x_{i_2- 1} - \frac{0.2}{m'-1+\frac{1}{\alpha }}< \cdots \le x_{i_2'} - \frac{0.2}{m'-1+\frac{1}{\alpha }}(i_2- i_2')\nonumber \\< & {} 0.1 - 0.2\frac{m-m'}{m'-1+\frac{1}{\alpha }}. \end{aligned}$$
(31)

Combining this bound on \(x_{i_2}\) and the inequality \(x_{i_2+ 1} \ge x_{i_2} - \frac{0.2}{m'-1+1/\alpha }\) due to the construction of \(i_2\), we obtain

$$\begin{aligned}&-x_{i_2} + \left( m-1+\alpha ^{-1}\right) \left( x_{i_2+1}-x_{i_2}\right) > -0.1 + 0.2\frac{m-m'}{m'-1+\frac{1}{\alpha }}\\&\quad - 0.2\frac{m-1+\frac{1}{\alpha }}{m'-1+\frac{1}{\alpha }} = -0.3. \end{aligned}$$

We note for the proof of property ii that the chain of inequalities (31) is possible only for \(m \le 2m'-1+1/\alpha \), which implies there are at most \(m'-1+1/\alpha \) indices \(i \in \mathcal {I}_\mathrm {trans}\) such that \(|x_i| < 0.1\).

The first part of property ii follows from Lemma 2.ii, since \(x_{i}\le 0.9 \le 1\) for every \(i\in \mathcal {I}_\mathrm {trans}\). To see that the second part of the property holds, let N be the number of indices in \(i\in \mathcal {I}_\mathrm {trans}\) for which \(\Upsilon _r'\left( x_{i}\right) <-1\). By Lemma 2.iv and the fact that \(0.1\le x_{i}\le 0.9\) for every \(i\in \left\{ i_1+1,\ldots ,i_2'-1\right\} \), \(N\ge m'-1\). Moreover, since there can be at most \(m'-1+1/\alpha \) indices \(i\in \mathcal {I}_\mathrm {trans}\) for which \(\left| x_{i}\right| <0.1\), \(N\ge m-(m'-1+1/\alpha )\). Averaging the two lower bounds gives \(N\ge \left( m-1/\alpha \right) /2\). \(\square \)

1.3 Proof of Theorem 3

Theorem 3

There exists a numerical constant \(c < \infty \) such that the following lower bound holds. Let \(p\ge 2\), \(p \in \mathbb {N}\), and let \(D, L_{1}, L_{2}, \ldots , L_{p}\), and \(\epsilon \) be positive. Then

$$\begin{aligned} \mathcal {T}_{\epsilon }\big (\mathcal {A}^{(1)}_{ \textsf {det} }\cup \mathcal {A}^{(1)}_{ \textsf {zr} }, \mathcal {F}^\mathrm{dist}_{1: p}(D, L_{1}, ..., L_{p})\big ) \ge B_\epsilon \left( \min _{q\in [p]} \left\{ \frac{L_{q}}{2\tilde{\ell }_{q}}D^{q+1}\right\} , \frac{L_{1}}{2}, \frac{L_{2}}{2}, \ldots , \frac{L_{p}}{2} \right) , \end{aligned}$$

where \(\tilde{\ell }_{q} \le \exp (c q \log q + c)\).

Proof

The proof builds on those of Theorems 2 and Pi.3. We begin by recalling the following bump function construction

$$\begin{aligned} \bar{h}_T(x) :=\Psi \left( 1 - \frac{25}{2} \Big \Vert {x - \frac{4}{5} e^{(T)}}\Big \Vert ^2 \right) ~~\text{ where }~~\Psi (t) :=e \cdot \exp \left( -\frac{1}{\left[ {2 t - 1}\right] _+^2}\right) . \end{aligned}$$
(32)

Adding a scaled version of \(-\bar{h}_T\) to our hard instance construction allows us to “plant” a global minimum that is both close to the origin and essentially invisible to zero-respecting methods. For convenience, we restate Lemma Pi.10,

Lemma 8

The function \(\bar{h}_T\) satisfies the following.

  i. For all \(x \in \mathbb {R}^T\) we have \(\bar{h}_T(x) \in [0, 1]\), and \(\bar{h}_T(0.8 e^{(T)}) = 1\).

  ii. On the set \(\{x\in \mathbb {R}^T \mid x_{T} \le \frac{3}{5} \} \cup \{x \mid \left\| {x}\right\| \ge 1\}\), we have \(\bar{h}_T(x)=0\).

  iii. For every \(p \ge 1\), the \(p\hbox {th}\) order derivative of \(\bar{h}_T\) is \(\tilde{\ell }_{p}\)-Lipschitz continuous, where \(\tilde{\ell }_{p} \le e^{c p\log p + c}\) for a numerical constant \(c < \infty \).
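Properties i and ii are easy to confirm numerically; the sketch below (illustrative only) evaluates \(\bar{h}_T\) at a few representative points.

```python
import math
import numpy as np

def Psi(t):
    """Psi(t) = e * exp(-1/[2t-1]_+^2), with Psi(t) = 0 whenever 2t - 1 <= 0."""
    s = 2.0 * t - 1.0
    return math.e * math.exp(-1.0 / s ** 2) if s > 0 else 0.0

def h_bar(x):
    """Bump function (32): h_T(x) = Psi(1 - (25/2) * ||x - 0.8 e^(T)||^2)."""
    e_T = np.zeros_like(x)
    e_T[-1] = 0.8
    return Psi(1.0 - 12.5 * float(np.sum((x - e_T) ** 2)))

T = 10
x = np.zeros(T); x[-1] = 0.8
assert math.isclose(h_bar(x), 1.0)                # property i: h(0.8 e^(T)) = 1
y = np.random.default_rng(0).standard_normal(T)
y *= 1.2 / np.linalg.norm(y)                      # a point with ||y|| >= 1
assert h_bar(y) == 0.0                            # property ii
z = np.zeros(T); z[-1] = 0.6                      # a point with z_T <= 3/5
assert h_bar(z) == 0.0                            # property ii
```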

With this lemma in place, we follow the broad outline of the proof of Theorem 2, with modifications to make sure the norm of the minimizers of f is small. Indeed, letting \(\lambda , \sigma > 0\), we define our scaled hard instance \(f:\mathbb {R}^{T+ 2}\rightarrow \mathbb {R}\) by

$$\begin{aligned} f(x) = {\lambda \sigma ^2} \bar{f}_{T, \mu , r}\left( x_1 / \sigma , \ldots , x_{T+ 1} / \sigma \right) - {\tilde{\lambda }} \bar{h}_{T+2}\left( {x}/{D}\right) , \end{aligned}$$
(33)

that is, the hard instance we construct in Theorem 2 minus a scaled bump function (32). For every \(p \in \mathbb {N}\), we set the parameters \(\lambda , \sigma , \mu \) and r as in the proof of Theorem 2, so that we satisfy inequality (12) except we replace \(L_{q}\) with \(L_{q}/2\) for every \(q\in [p]\) (including in the definitions of \(\lambda , \sigma , \mu \)). Thus, as in inequality (12), for each \(q \in \mathbb {N}\) the function \(f_0(x) :=\lambda \sigma ^2\bar{f}_{T, \mu , r}\left( x/\sigma \right) \) has \(L_{q}/2\)-Lipschitz qth order derivative and satisfies \(\left\| {\nabla f_0(x)}\right\| >\epsilon \) for all \(x\in \mathbb {R}^{T+1}\) with \(x_{T} = x_{T+1} = 0\). By Lemma 8.iii, setting

$$\begin{aligned} {\tilde{\lambda }} = \min _{q\in [p]} \frac{1}{2\tilde{\ell }_{q}}L_{q}D^{q+1} \end{aligned}$$
(34)

guarantees that the function \(x \mapsto - {\tilde{\lambda }}\cdot \bar{h}_{T+2}(x/D)\) also has \(L_{q}/2\)-Lipschitz qth order derivatives, so that overall, for each \(q \in [p]\) the function f defined in Eq. (33) has \(L_{q}\)-Lipschitz qth order derivative.

We note that by Lemma 8.ii, \(\bar{h}_{T+2}(x)\) is identically 0 in a neighborhood of any x with \(x_{T+2}=0\), which immediately implies that \(\bar{h}_{T+2}\) and f are zero-chains. Therefore for any \(\mathsf {A}\in \mathcal {A}^{(1)}_{ \textsf {zr} }\) producing iterates \(x^{(1)}=0,x^{(2)},x^{(3)},\ldots \) when operating on f, we have \(x^{(t)}_T= x^{(t)}_{T+1}=x^{(t)}_{T+2}=0\) for any \(t\le T\). Thus, by our choices of \(\lambda ,\sigma ,\mu \) and r, \(\left\| {\nabla f(x^{(t)})}\right\| = \left\| {\nabla f_0(x^{(t)})}\right\| > \epsilon \) for every \(t\le T\), and so

$$\begin{aligned} \mathsf {T}_{\epsilon }\big (\mathcal {A}^{(1)}_{ \textsf {zr} }, \mathcal {F}^\mathrm{dist}_{1: p}(D, L_{1}, ..., L_{p})\big ) \ge \inf _{\mathsf {A}\in \mathcal {A}^{(1)}_{ \textsf {zr} }} \mathsf {T}_{\epsilon }\big (\mathsf {A}, f\big ) \ge T+1. \end{aligned}$$

To establish that \(f\in \mathcal {F}^\mathrm{dist}_{1: p}(D, L_{1}, ..., L_{p})\), it remains to show that every global minimizer of f has norm at most D. Let \(x^\star \) denote a global minimizer of f, and temporarily assume that

$$\begin{aligned} f\left( 0.8D\cdot e^{(T+2)}\right) < 0. \end{aligned}$$
(35)

Therefore, \(f(x^\star )< f\left( 0.8D\cdot e^{(T+2)}\right) <0\) and \(\bar{h}_{T+2}(x^\star /D) \ne 0\), as otherwise we have the contradiction \(f(x^\star ) = \lambda \sigma ^2 \bar{f}_{T, \mu , r}(x^\star /\sigma ) \ge 0\). By the definition (32), \(\bar{h}_{T+2}(x^\star /D) \ne 0\) implies that \(1-\frac{25}{2}\left\| {x^\star /D - 0.8e^{(T+2)}}\right\| ^2\ge 0.5\), and therefore \(\left\| {x^\star }\right\| \le D\). To verify the assumed inequality (35), we use Lemma 8.i to obtain

$$\begin{aligned} f\left( 0.8D \cdot e^{(T+2)}\right)= & {} {\lambda \sigma ^2}\cdot \bar{f}_{T, \mu , r}\left( 0\right) - {\tilde{\lambda }}\cdot \bar{h}_{T+2}\left( 0.8\cdot e^{(T+2)}\right) \\= & {} \frac{\lambda \sqrt{\mu }\sigma ^2}{2} + 10\lambda \sigma ^2\mu T- {{\tilde{\lambda }}}. \end{aligned}$$

Therefore, if we set

$$\begin{aligned} T = \left\lfloor \frac{{\tilde{\lambda }}-\lambda \sqrt{\mu }\sigma ^2/2}{10\lambda \mu \sigma ^2} \right\rfloor \end{aligned}$$
(36)

then inequality (35) holds and \(\left\| {x^\star }\right\| \le D\), and so \(f\in \mathcal {F}^\mathrm{dist}_{1: p}(D, L_{1}, ..., L_{p})\). Comparing the setting (36) of T above to the setting (13) of T in the proof of Theorem 2, we see they are identical except that we replace the term \(\Delta \) in (13) with \({\tilde{\lambda }} :=\min _{q\in [p]} ({2\tilde{\ell }_{q})^{-1}}L_{q}D^{q+1}\). Thus, mimicking the proof of Theorem 2 after the step (13), mutatis mutandis, yields the result. \(\square \)

1.4 Proof of Lemma 5

Lemma 5

Let \(T\in \mathbb {N}\), \(0 < \alpha \le 1\), \(\mu \in [T^{-2}, 1]\) and \(\widetilde{f}_{T, \alpha , \mu }\) be defined as in (19), with \(\Lambda \) and \(\widetilde{\Upsilon }\) satisfying

$$\begin{aligned} \Lambda '(0) = \widetilde{\Upsilon }'(0) = 0 ~~\text{ and }~~ \Lambda ' \text{ is 1-Lipschitz continuous} ~~\text{ and }~~ \max _{z\in [0,1]} | \widetilde{\Upsilon }'(z) | \le G, \end{aligned}$$

for \(G>0\) independent of \(T, \alpha \) and \(\mu \). Then there exists \(x \in \mathbb {R}^{T+1}\) such that \(x_T= x_{T+1} = 0\) and

$$\begin{aligned} \left\| {\nabla \widetilde{f}_{T, \alpha , \mu }(x)}\right\| < C\mu ^{3/4}, \end{aligned}$$

where \(C \le 27 + \sqrt{3}G\).

Proof

We construct x as follows. We let \(x_1 = 1\), and for \(n>1\) let (with \(x_0 :=1\)),

$$\begin{aligned} x_n = x_{n-1} - (x_{n-2}-x_{n-1}) - \delta _{n-1} = 1 - \sum _{i=1}^{n-1} \sum _{j=1}^i \delta _j \,, \end{aligned}$$

where we take

$$\begin{aligned} \delta _n = \frac{1}{m(m+1)} {\left\{ \begin{array}{ll} 1 &{} n \le m \\ 0 &{} n=m+1 \text { or } n > 2m +1 \\ -1 &{} m+1 < n \le 2m +1 \end{array}\right. } \end{aligned}$$

for some \(m\in \mathbb {N}\) which we will later determine. The elements of \(\nabla \widetilde{f}_{T, \alpha , \mu }\) are given by

$$\begin{aligned} \nabla _n \widetilde{f}_{T, \alpha , \mu }(x) = \Lambda '(x_n -x_{n-1}) - \Lambda '(x_{n+1} -x_n ) + \mu \widetilde{\Upsilon }'(x_n), \end{aligned}$$

where for \(n=1\) we used \(x_1=1\) and \(\Lambda '(0)=0\) to write \(\alpha \cdot \Lambda '(x_1-1) = 0 = \Lambda '(x_1-1)\). Since \(\Lambda '\) is 1-Lipschitz, we have

$$\begin{aligned} \left| \Lambda '(x_n -x_{n-1}) - \Lambda '(x_{n+1} -x_n ) \right| \le \left| (x_n -x_{n-1}) - (x_{n+1} -x_n )\right| = \left| \delta _n \right| . \end{aligned}$$

Moreover, one can readily verify that \(x_n\in [0,1]\) for every n and that \(x_n = 0\) for every \(n > 2m+1\). Therefore, using \(\widetilde{\Upsilon }'(0) = 0\) and \(\max _{z\in [0,1]} | \widetilde{\Upsilon }'(z) | \le G\) we have that \(\left| \widetilde{\Upsilon }'(x_n)\right| \le G\cdot 1_{\left( {n \le 2m+1}\right) }\), which gives the overall bound

$$\begin{aligned} \left| \nabla _n \widetilde{f}_{T, \alpha , \mu }(x)\right| \le \left| \delta _n \right| + \mu \left| \widetilde{\Upsilon }'(x_n)\right| \le \left( \frac{1}{m^2} + G\mu \right) 1_{\left( {n \le 2m+1}\right) }, \end{aligned}$$

and thus,

$$\begin{aligned} \left\| {\nabla \widetilde{f}_{T, \alpha , \mu }(x)}\right\| \le \sqrt{2m+1}\left( m^{-2} + G\mu \right) \le \sqrt{3}\left( m^{-3/2} + \sqrt{m}G\mu \right) . \end{aligned}$$

Taking \(m = \left\lceil \frac{1}{3\sqrt{\mu }} \right\rceil \), we have

$$\begin{aligned} \left\| {\nabla \widetilde{f}_{T, \alpha , \mu }(x)}\right\| \le \sqrt{3}\left( \left\lceil \frac{1}{3\sqrt{\mu }} \right\rceil ^{-3/2} + G\left\lceil \frac{1}{3\sqrt{\mu }} \right\rceil ^{1/2}\mu \right) \le \left( 27 + \sqrt{3}G\right) \mu ^{3/4}, \end{aligned}$$

where we have used \(\left\lceil 1/(3\sqrt{\mu }) \right\rceil \le 1/\sqrt{\mu }\) since \(\mu \le 1\). Thus, \( \left\| {\nabla \widetilde{f}_{T, \alpha , \mu }(x)}\right\| \le C\mu ^{3/4}\) holds for \(C=27 + \sqrt{3}G\). For \(T \ge 8\), since \(\mu \ge T^{-2}\), we have \(2m+1 \le 2\left\lceil T/3 \right\rceil +1 < T\) and therefore \(x_T= x_{T+1}=0\) holds as required (since \(x_n = 0\) for every \(n > 2m+1\)). In the edge case \(T\le 8\) we have \(\mu \ge T^{-2} \ge 1/64\) and therefore \(x=0\) yields \(\left\| {\nabla \widetilde{f}_{T, \alpha , \mu }(x)}\right\| = \alpha \le 1 \le 27\cdot (1/64)^{3/4} \le C\mu ^{3/4}\). \(\square \)
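The sequence constructed in this proof is easy to check directly; the sketch below (illustrative only) builds \(x_n\) from the increments \(\delta _n\) and verifies that \(x_n\in [0,1]\), that \(x_n=0\) for \(n>2m+1\), and that the second differences equal \(-\delta _n\) with \(|\delta _n|\le 1/m^2\), which is what drives the bound on \(\left| \nabla _n \widetilde{f}_{T, \alpha , \mu }(x)\right| \).

```python
import numpy as np

def build_sequence(m, T):
    """Sequence from the proof of Lemma 5: x_0 = x_1 = 1 and
    x_n = 2 x_{n-1} - x_{n-2} - delta_{n-1}, with the increments delta_n below."""
    delta = np.zeros(T + 2)
    delta[1 : m + 1] = 1.0 / (m * (m + 1))            # n = 1, ..., m
    delta[m + 2 : 2 * m + 2] = -1.0 / (m * (m + 1))   # n = m+2, ..., 2m+1
    x = np.ones(T + 2)                                # x[0] = x_0 = 1, x[1] = x_1 = 1
    for n in range(2, T + 2):
        x[n] = 2 * x[n - 1] - x[n - 2] - delta[n - 1]
    return x, delta

m, T = 4, 30
x, delta = build_sequence(m, T)
assert np.all((x >= -1e-12) & (x <= 1 + 1e-12))       # x_n in [0, 1]
assert np.allclose(x[2 * m + 2 :], 0.0)               # x_n = 0 for n > 2m+1
second_diff = x[2:] - 2 * x[1:-1] + x[:-2]            # (x_{n+1}-x_n) - (x_n-x_{n-1})
assert np.allclose(second_diff, -delta[1:-1])
assert np.max(np.abs(delta)) <= 1.0 / m ** 2          # |delta_n| <= 1/m^2
```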


Cite this article

Carmon, Y., Duchi, J.C., Hinder, O. et al. Lower bounds for finding stationary points II: first-order methods. Math. Program. 185, 315–355 (2021). https://doi.org/10.1007/s10107-019-01431-x
