
Lower bounds for finding stationary points I

Abstract

We prove lower bounds on the complexity of finding \(\epsilon \)-stationary points (points x such that \(\Vert \nabla f(x)\Vert \le \epsilon \)) of smooth, high-dimensional, and potentially non-convex functions f. We consider oracle-based complexity measures, where an algorithm is given access to the value and all derivatives of f at a query point x. We show that for any (potentially randomized) algorithm \(\mathsf {A}\), there exists a function f with Lipschitz pth order derivatives such that \(\mathsf {A}\) requires at least \(\epsilon ^{-(p+1)/p}\) queries to find an \(\epsilon \)-stationary point. Our lower bounds are sharp to within constants, and they show that gradient descent, cubic-regularized Newton’s method, and generalized pth order regularization are worst-case optimal within their natural function classes.



Notes

  1.

    Higher order methods can yield improvements under additional smoothness: if in addition f has \(L_{2}\)-Lipschitz Hessian and \(\epsilon \le L_{1}^{7/3}L_{2}^{-4/3}D^{2/3}\), an accelerated Newton method achieves the (optimal) rate \((L_{2}D^3/\epsilon )^{2/7}\) [4, 32].

  2.

    If the initial Hessian approximation is a diagonal matrix, as is typical.

  3.

    We can readily adapt this property for lower bounds on other termination criteria, e.g. require \(f(x)-\inf _y f(y) > \epsilon \) for every x such that \(x_T=0\).

  4.

    In a recent note Woodworth and Srebro [46] independently provide a revision of their proof that is similar, but not identical, to the one we propose here.

  5.

    To simplify notation we allow c to change from equation to equation throughout the proof, always representing a finite numerical constant independent of d, \(T\), k or p.

References

  1. Agarwal, A., Bartlett, P.L., Ravikumar, P., Wainwright, M.J.: Information-theoretic lower bounds on the oracle complexity of convex optimization. IEEE Trans. Inf. Theory 58(5), 3235–3249 (2012)

  2. Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent. In: Proceedings of the Forty-Ninth Annual ACM Symposium on the Theory of Computing (2017)

  3. Arjevani, Y., Shalev-Shwartz, S., Shamir, O.: On lower and upper bounds in smooth and strongly convex optimization. J. Mach. Learn. Res. 17(126), 1–51 (2016)

  4. Arjevani, Y., Shamir, O., Shiff, R.: Oracle complexity of second-order methods for smooth convex optimization (2017). arXiv:1705.07260 [math.OC]

  5. Ball, K.: An elementary introduction to modern convex geometry. In: Levy, S. (ed.) Flavors of Geometry, pp. 1–58. MSRI Publications, Cambridge (1997)

  6. Berend, D., Tassa, T.: Improved bounds on Bell numbers and on moments of sums of random variables. Prob. Math. Stat. 30(2), 185–205 (2010)

  7. Birgin, E.G., Gardenghi, J.L., Martínez, J.M., Santos, S.A., Toint, P.L.: Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Math. Program. 163(1–2), 359–368 (2017)

  8. Boumal, N., Voroninski, V., Bandeira, A.: The non-convex Burer–Monteiro approach works on smooth semidefinite programs. Adv. Neural Inf. Process. Syst. 30, 2757–2765 (2016)

  9. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)

  10. Braun, G., Guzmán, C., Pokutta, S.: Lower bounds on the oracle complexity of nonsmooth convex optimization via information theory. IEEE Trans. Inf. Theory 63(7), 4709–4724 (2017)

  11. Burer, S., Monteiro, R.D.: A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Math. Program. 95(2), 329–357 (2003)

  12. Candès, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via Wirtinger flow: theory and algorithms. IEEE Trans. Inf. Theory 61(4), 1985–2007 (2015)

  13. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions. In: Proceedings of the 34th International Conference on Machine Learning (2017)

  14. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points II: first-order methods (2017). arXiv:1711.00841 [math.OC]

  15. Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for non-convex optimization. SIAM J. Optim. 28(2), 1751–1772 (2018)

  16. Cartis, C., Gould, N.I., Toint, P.L.: On the complexity of steepest descent, Newton's and regularized Newton's methods for nonconvex unconstrained optimization problems. SIAM J. Optim. 20(6), 2833–2852 (2010)

  17. Cartis, C., Gould, N.I., Toint, P.L.: Complexity bounds for second-order optimality in unconstrained optimization. J. Complex. 28(1), 93–108 (2012)

  18. Cartis, C., Gould, N.I.M., Toint, P.L.: How much patience do you have? A worst-case perspective on smooth nonconvex optimization. Optima 88, 1–10 (2012)

  19. Cartis, C., Gould, N.I., Toint, P.L.: A note about the complexity of minimizing Nesterov's smooth Chebyshev–Rosenbrock function. Optim. Methods Softw. 28, 451–457 (2013)

  20. Cartis, C., Gould, N.I.M., Toint, P.L.: Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization (2017). arXiv:1709.07180 [math.OC]

  21. Chowla, S., Herstein, I.N., Moore, W.K.: On recursions connected with symmetric groups I. Can. J. Math. 3, 328–334 (1951)

  22. Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust Region Methods. MPS-SIAM Series on Optimization. SIAM, Philadelphia (2000)

  23. Hager, W.W., Zhang, H.: A survey of nonlinear conjugate gradient methods. Pac. J. Optim. 2(1), 35–58 (2006)

  24. Hinder, O.: Cutting plane methods can be extended into nonconvex optimization. In: Proceedings of the Thirty-First Annual Conference on Computational Learning Theory (2018)

  25. Jarre, F.: On Nesterov's smooth Chebyshev–Rosenbrock function. Optim. Methods Softw. 28(3), 478–500 (2013)

  26. Jin, C., Ge, R., Netrapalli, P., Kakade, S.M., Jordan, M.I.: How to escape saddle points efficiently. In: Proceedings of the 34th International Conference on Machine Learning (2017)

  27. Keshavan, R.H., Montanari, A., Oh, S.: Matrix completion from noisy entries. J. Mach. Learn. Res. 11, 2057–2078 (2010)

  28. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)

  29. Liu, D., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45(1), 503–528 (1989)

  30. Loh, P.-L., Wainwright, M.J.: High-dimensional regression with noisy and missing data: provable guarantees with nonconvexity. Ann. Stat. 40(3), 1637–1664 (2012)

  31. Loh, P.-L., Wainwright, M.J.: Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. J. Mach. Learn. Res. 16, 559–616 (2013)

  32. Monteiro, R.D., Svaiter, B.F.: An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM J. Optim. 23(2), 1092–1125 (2013)

  33. Murty, K., Kabadi, S.: Some NP-complete problems in quadratic and nonlinear programming. Math. Program. 39, 117–129 (1987)

  34. Nemirovski, A.: Efficient methods in convex programming. Technion, Israel Institute of Technology (1994)

  35. Nemirovski, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley, Hoboken (1983)

  36. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \({O}(1/k^2)\). Sov. Math. Dokl. 27(2), 372–376 (1983)

  37. Nesterov, Y.: Introductory Lectures on Convex Optimization. Kluwer Academic Publishers, Cambridge (2004)

  38. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  39. Nesterov, Y.: How to make the gradients small. Optima 88, 10–11 (2012)

  40. Nesterov, Y., Polyak, B.: Cubic regularization of Newton method and its global performance. Math. Program. Ser. A 108, 177–205 (2006)

  41. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, Berlin (2006)

  42. Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Found. Comput. Math. 18(5), 1131–1198 (2018)

  43. Traub, J., Wasilkowski, H., Wozniakowski, H.: Information-Based Complexity. Academic Press, Cambridge (1988)

  44. Vavasis, S.A.: Black-box complexity of local minimization. SIAM J. Optim. 3(1), 60–80 (1993)

  45. Woodworth, B.E., Srebro, N.: Tight complexity bounds for optimizing composite objectives. Adv. Neural Inf. Process. Syst. 30, 3639–3647 (2016)

  46. Woodworth, B.E., Srebro, N.: Lower bound for randomized first order convex optimization (2017). arXiv:1709.03594 [math.OC]

  47. Zhang, X., Ling, C., Qi, L.: The best rank-1 approximation of a symmetric tensor and related spherical optimization problems. SIAM J. Matrix Anal. Appl. 33(3), 806–821 (2012)


Acknowledgements

OH was supported by the PACCAR INC fellowship. YC and JCD were partially supported by the SAIL-Toyota Center for AI Research, NSF-CAREER award 1553086, and a Sloan Foundation Fellowship in Mathematics. YC was partially supported by the Stanford Graduate Fellowship and the Numerical Technologies Fellowship.

Author information


Corresponding author

Correspondence to Yair Carmon.



Appendices

Proof of Propositions 1 and 2

The core of the proofs of Propositions 1 and 2 is the following construction.

Lemma 7

Let \(p\in {\mathbb {N}}\cup \{\infty \}\), \(T_0\in {\mathbb {N}}\) and \({\mathsf {A}}\in \mathcal {A}_{ \textsf {det} }^{(p)}\). There exists an algorithm \({\mathsf {Z}}_{{\mathsf {A}}}\in \mathcal {A}_{ \textsf {zr} }^{(p)}\) with the following property. For every \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) there exists an orthogonal matrix \(U\in {\mathbb {R}}^{(d+T_0)\times d}\) such that, for every \(\epsilon > 0\),

$$\begin{aligned} {\mathsf {T}}_{\epsilon }\big ({\mathsf {A}}, f_U\big ) > T_0 ~~\text{ or }~~ {\mathsf {T}}_{\epsilon }\big ({\mathsf {A}}, f_U\big ) = {\mathsf {T}}_{\epsilon }\big ({\mathsf {Z}}_{{\mathsf {A}}}, f\big ), \end{aligned}$$

where \(f_U(x) :=f(U^\top x)\).

Proof

We explicitly construct \({\mathsf {Z}}_{{\mathsf {A}}}\) with the following slightly stronger property. For every \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\) in \(\mathcal {F}\), there exists an orthogonal \(U\in {\mathbb {R}}^{(d+T_0)\times d}\), \(U^{\top } U = I_d\), such that \(f_U(x) :=f(U^{\top } x)\) satisfies that the first \(T_0\) iterates in the sequences \({\mathsf {Z}}_{{\mathsf {A}}}[f]\) and \(U^{\top } {\mathsf {A}}[f_U]\) are identical. (Recall the notation \({\mathsf {A}}[f] = \{a^{(t)}\}_{t \in {\mathbb {N}}}\), where \(a^{(t)}\) are the iterates of \({\mathsf {A}}\) on f, and we use the obvious shorthand \(U^{\top } \{a^{(t)}\}_{t\in {\mathbb {N}}} = \{U^{\top } a^{(t)}\}_{t\in {\mathbb {N}}}\).)

Before explaining the construction of \({\mathsf {Z}}_{{\mathsf {A}}}\), let us see how its defining property implies the lemma. If \({\mathsf {T}}_{\epsilon }\big ({\mathsf {A}}, f_U\big ) > T_0\), we are done. Otherwise, \({\mathsf {T}}_{\epsilon }\big ({\mathsf {A}}, f_U\big ) \le T_0\) and we have

$$\begin{aligned} {\mathsf {T}}_{\epsilon }\big ({\mathsf {A}}, f_U\big ) :={\mathsf {T}}_{\epsilon }\big ({\mathsf {A}}[f_U], f_U\big ) {\mathop {=}\limits ^{(i)}} {\mathsf {T}}_{\epsilon }\big (U^{\top } {\mathsf {A}}[f_U], f\big ) {\mathop {=}\limits ^{(ii)}} {\mathsf {T}}_{\epsilon }\big ({\mathsf {Z}}_{{\mathsf {A}}}, f\big ), \end{aligned}$$
(19)

as required. The equality (i) follows because \(\left\| {U g}\right\| = \left\| {g}\right\| \) for all orthogonal U, so for any sequence \(\{a^{(t)}\}_{t\in {\mathbb {N}}}\)

$$\begin{aligned} {\mathsf {T}}_{\epsilon }\big (\{a^{(t)}\}_{t \in {\mathbb {N}}}, f_U\big )&= \inf \left\{ t\in {\mathbb {N}}\mid \Vert {\nabla f_U(a^{(t)})}\Vert \le \epsilon \right\} \\&= \inf \left\{ t\in {\mathbb {N}}\mid \Vert { \nabla f(U^{\top } a^{(t)})}\Vert \le \epsilon \right\} = {\mathsf {T}}_{\epsilon }\big (\{U^{\top } a^{(t)}\}_{t \in {\mathbb {N}}}, f\big ) \end{aligned}$$

and in equality (i) we let \(\{a^{(t)}\}_{t \in {\mathbb {N}}} = {\mathsf {A}}[f_U]\). The equality (ii) holds because \({\mathsf {T}}_{\epsilon }\big (\cdot , \cdot \big )\) is a “stopping time”: if \({\mathsf {T}}_{\epsilon }\big (U^{\top }{\mathsf {A}}[f_U], f\big ) \le T_0\) then the first \(T_0\) iterates of \(U^{\top }{\mathsf {A}}[f_U]\) determine \({\mathsf {T}}_{\epsilon }\big (U^{\top } {\mathsf {A}}[f_U], f\big )\), and these \(T_0\) iterates are identical to the first \(T_0\) iterates of \({\mathsf {Z}}_{{\mathsf {A}}}[f]\) by assumption.
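
As a quick sanity check of equality (i), the following numpy sketch (our own illustration; the quadratic f and both gradient routines are hypothetical stand-ins, not the construction of the paper) verifies numerically that \(\Vert \nabla f_U(x)\Vert = \Vert \nabla f(U^{\top } x)\Vert \) whenever \(U^{\top } U = I_d\):

```python
import numpy as np

rng = np.random.default_rng(0)
d, T0 = 5, 3
# U in R^{(d+T0) x d} with orthonormal columns, so that U^T U = I_d
U, _ = np.linalg.qr(rng.standard_normal((d + T0, d)))

A = rng.standard_normal((d, d))
A = A @ A.T + np.eye(d)  # illustrative f(z) = z^T A z / 2, so grad f(z) = A z

def grad_f(z):
    return A @ z

def grad_f_U(x):
    # chain rule for f_U(x) = f(U^T x): grad f_U(x) = U grad f(U^T x)
    return U @ grad_f(U.T @ x)

x = rng.standard_normal(d + T0)
# the two norms agree since ||U g|| = ||g|| for any g when U^T U = I
print(np.linalg.norm(grad_f_U(x)), np.linalg.norm(grad_f(U.T @ x)))
```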

It remains to construct the zero-respecting algorithm \({\mathsf {Z}}_{{\mathsf {A}}}\) with iterates matching those of \({\mathsf {A}}\) under appropriate rotation. We do this by describing its operation inductively on any given \(f:{\mathbb {R}}^d\rightarrow {\mathbb {R}}\), which we denote \(\{z^{(t)}\}_{t\in {\mathbb {N}}} = {\mathsf {Z}}_{{\mathsf {A}}}[f]\). Letting \(d' = d+T_0\), the state of the algorithm \({\mathsf {Z}}_{{\mathsf {A}}}\) at iteration t is determined by a support \(S_t \subseteq [d]\) and orthonormal vectors \(\{u^{(i)}\}_{i \in S_t} \subset {\mathbb {R}}^{d'}\) identified with this support. The support condition (5) defines the set \(S_t\),

$$\begin{aligned} S_t = \bigcup _{q \in [p]} \bigcup _{s < t} \mathop {\text {supp}} \left\{ \nabla ^{{q}}f(z^{(s)})\right\} , \end{aligned}$$

so that \(\emptyset = S_1 \subseteq S_2 \subseteq \cdots \) and the collection \(\{u^{(i)}\}_{i \in S_t}\) grows with t. We let \(U \in {\mathbb {R}}^{d'\times d}\) be the orthogonal matrix whose ith column is \(u^{(i)}\)—even though U may not be completely determined throughout the runtime of \({\mathsf {Z}}_{{\mathsf {A}}}\), our partial knowledge of it will suffice to simulate the operation of \({\mathsf {A}}\) on \(f_U(a) = f(U^{\top } a)\). Letting \(\{a^{(t)}\}_{t \in {\mathbb {N}}} = {\mathsf {A}}[f_U]\), our requirements \({\mathsf {Z}}_{{\mathsf {A}}}[f] = U^{\top } {\mathsf {A}}[f_U]\) and \({\mathsf {Z}}_{{\mathsf {A}}}\in \mathcal {A}_{ \textsf {zr} }\) are equivalent to

$$\begin{aligned} z^{(t)} = U^{\top } a^{(t)} \text{ and } \mathop {\text {supp}} \{z^{(t)}\} \subseteq S_t \end{aligned}$$
(20)

for every \(t \le T_0\) (we set \(z^{(i)}=0\) for every \(i>T_0\) without loss of generality).

Let us proceed with the inductive argument. The iterate \(a^{(1)} \in {\mathbb {R}}^{d'}\) is an arbitrary (but deterministic) vector in \({\mathbb {R}}^{d'}\). We thus satisfy (20) at \(t=1\) by requiring that \(\langle u^{(j)}, a^{(1)}\rangle =0\) for every \(j\in [d]\), whence the first iterate of \({\mathsf {Z}}_{{\mathsf {A}}}\) satisfies \(z^{(1)} = 0 \in {\mathbb {R}}^d\). Assume now that the equality and containment (20) hold for every \(s < t\), where \(t \le T_0\) (implying that \({\mathsf {Z}}_{{\mathsf {A}}}\) has emulated the iterates \(a^{(2)}, \ldots , a^{(t-1)}\) of \({\mathsf {A}}\)); we show how \({\mathsf {Z}}_{{\mathsf {A}}}\) can emulate \(a^{(t)}\), the tth iterate of \({\mathsf {A}}\), and from it construct \(z^{(t)}\) satisfying (20). To obtain \(a^{(t)}\), note that for every \(q\le p\) and every \(s < t\), the derivatives \(\nabla ^{{q}} f_U(a^{(s)})\) are a function of \(\nabla ^{{q}} f(z^{(s)})\) and the orthonormal vectors \(\{u^{(i)}\}_{i \in S_{s+1}}\), because \(\mathop {\text {supp}} \{\nabla ^{{q}} f(z^{(s)})\} \subseteq S_{s+1}\) and therefore the chain rule implies

$$\begin{aligned} \left[ \nabla ^{{q}}f_U(a^{(s)})\right] _{j_1, \ldots , j_q} = \sum _{i_1, \ldots , i_q \in S_{s+1}} \left[ \nabla ^{{q}} f(z^{(s)})\right] _{i_1, \ldots , i_q} u^{(i_1)}_{j_1} \cdots u^{(i_q)}_{j_q}. \end{aligned}$$

Since \({\mathsf {A}}\in \mathcal {A}_{ \textsf {det} }^{(p)}\) is deterministic, \(a^{(t)}\) is a function of \(\nabla ^{{q}} f(z^{(s)})\) for \(q\in [p]\) and \(s\in [t-1]\), and thus \({\mathsf {Z}}_{{\mathsf {A}}}\) can simulate and compute it. To satisfy the support condition \(\mathop {\text {supp}} \{z^{(t)}\} \subseteq S_t\) we require that \(\langle u^{(j)}, a^{(t)}\rangle =0\) for every \(j\not \in S_t\). This also means that to compute \(z^{(t)} = U^{\top } a^{(t)}\) we require only the columns of U indexed by the support \(S_t\).

Finally, we need to show that after computing \(S_{t+1}\) we can find the vectors \(\{u^{(i)}\}_{i \in S_{t+1} {\setminus } S_t}\) satisfying \(\langle u^{(j)}, a^{(s)}\rangle =0\) for every \(s\le t\) and \(j\in S_{t+1}{\setminus } S_t\), and additionally that U be orthogonal. Thus, we need to choose \(\{u^{(i)}\}_{i \in S_{t+1}{\setminus } S_t}\) in the orthogonal complement of \(\mathrm {span}\left\{ a^{(1)}, \ldots , a^{(t)}, \{u^{(i)}\}_{i \in S_t}\right\} \). This orthogonal complement has dimension at least \(d'-t-|S_t| = |S_t^c| + T_0 - t \ge |S_t^c|\). Since \(|S_{t+1}{\setminus } S_t| \le |S_t^c|\), there exist orthonormal vectors \(\{u^{(i)}\}_{i \in S_{t+1}{\setminus } S_t}\) that meet the requirements. This completes the induction.

Finally, note that the arguments above hold unchanged for \(p=\infty \). \(\square \)

With Lemma 7 in hand, the propositions follow easily.

Proposition 1

Let \(p\in {\mathbb {N}}\cup \{\infty \}\), \({\mathcal {F}}\) be an orthogonally invariant function class and \(\epsilon >0\). Then

$$\begin{aligned} \mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {det} }^{(p)}, {\mathcal {F}}\big ) \ge \mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {zr} }^{(p)}, {\mathcal {F}}\big ). \end{aligned}$$

Proof

We may assume that \(\mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {det} }^{(p)}, \mathcal {F}\big ) < T_0\) for some integer \(T_0 < \infty \), as otherwise we have \(\mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {det} }^{(p)}, \mathcal {F}\big )=\infty \) and the result holds trivially. For any \({\mathsf {A}}\in \mathcal {A}_{ \textsf {det} }^{(p)}\) and the value \(T_0\), we invoke Lemma 7 to construct \({\mathsf {Z}}_{{\mathsf {A}}}\in \mathcal {A}_{ \textsf {zr} }^{(p)}\) such that \({\mathsf {T}}_{\epsilon }\big ({\mathsf {A}}, f_U\big ) \ge \min \{T_0, {\mathsf {T}}_{\epsilon }\big ({\mathsf {Z}}_{{\mathsf {A}}}, f\big )\}\) for every \(f\in \mathcal {F}\) and some orthogonal matrix U that depends on f and \({\mathsf {A}}\). Consequently, we have

$$\begin{aligned}&\mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {det} }^{(p)}, \mathcal {F}\big )\\&\; = \inf _{{\mathsf {A}}\in \mathcal {A}_{ \textsf {det} }^{(p)}}\sup _{f\in \mathcal {F}} {\mathsf {T}}_{\epsilon }\big ({\mathsf {A}}, f\big ) {\mathop {\ge }\limits ^{(i)}} \inf _{{\mathsf {A}}\in \mathcal {A}_{ \textsf {det} }^{(p)}}\sup _{f\in \mathcal {F}} {\mathsf {T}}_{\epsilon }\big ({\mathsf {A}}, f_U\big ) {\mathop {\ge }\limits ^{(ii)}} \min \Big \{T_0,\inf _{{\mathsf {A}}\in \mathcal {A}_{ \textsf {det} }^{(p)}}\sup _{f\in \mathcal {F}} {\mathsf {T}}_{\epsilon }\big ({\mathsf {Z}}_{{\mathsf {A}}}, f\big )\Big \} \\&\; {\mathop {\ge }\limits ^{(iii)}} \min \Big \{T_0,\inf _{{\mathsf {B}} \in \mathcal {A}_{ \textsf {zr} }^{(p)}}\sup _{f\in \mathcal {F}}{\mathsf {T}}_{\epsilon }\big ({\mathsf {B}}, f\big )\Big \} = \min \Big \{T_0,\mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {zr} }^{(p)}, \mathcal {F}\big )\Big \}, \end{aligned}$$

where inequality (i) uses that \(f_U \in {\mathcal {F}}\) because \({\mathcal {F}}\) is orthogonally invariant, step (ii) uses \({\mathsf {T}}_{\epsilon }\big ({\mathsf {A}}, f_U\big ) \ge \min \{T_0, {\mathsf {T}}_{\epsilon }\big ({\mathsf {Z}}_{{\mathsf {A}}}, f\big )\}\) and step (iii) is due to \({\mathsf {Z}}_{{\mathsf {A}}}\in \mathcal {A}_{ \textsf {zr} }^{(p)}\) by construction. As we chose \(T_0\) for which \(\mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {det} }^{(p)}, \mathcal {F}\big ) < T_0\), the chain of inequalities implies \( \mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {det} }^{(p)}, \mathcal {F}\big ) \ge \mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {zr} }^{(p)}, \mathcal {F}\big )\), concluding the proof. \(\square \)

Proposition 2

Let \(p\in {\mathbb {N}}\cup \{\infty \}\), \(\mathcal {F}\) be an orthogonally invariant function class, \(f\in \mathcal {F}\) with domain of dimension d, and \(\epsilon >0\). If \(\mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {zr} }^{(p)}, \{f\}\big ) \ge T\), then

$$\begin{aligned} \mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {det} }^{(p)}, \mathcal {F}\big ) \ge \mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {det} }^{(p)}, \{f_U \mid U\in \mathsf {O}(d+T,d)\}\big ) \ge T, \end{aligned}$$

where \(f_U(x) := f(U^\top x)\) and \(\mathsf {O}(d+T,d)\) is the set of \((d+T)\times d\) orthogonal matrices, so that \(\{f_U \mid U\in \mathsf {O}(d+T,d)\}\) contains only functions with domain of dimension \(d+T\).

Proof

For any \({\mathsf {A}}\in \mathcal {A}_{ \textsf {det} }^{(p)}\), we invoke Lemma 7 with \(T_0 = T\) to obtain \({\mathsf {Z}}_{{\mathsf {A}}}\in \mathcal {A}_{ \textsf {zr} }^{(p)}\) and an orthogonal matrix \(U'\) (dependent on f and \({\mathsf {A}}\)) for which

$$\begin{aligned} {\mathsf {T}}_{\epsilon }\big ({\mathsf {A}}, f_{U'}\big ) \ge \min \{T, {\mathsf {T}}_{\epsilon }\big ({\mathsf {Z}}_{{\mathsf {A}}}, f\big )\} = T, \end{aligned}$$

where the last equality is due to \(\inf _{{\mathsf {B}}\in \mathcal {A}_{ \textsf {zr} }^{(p)}}{\mathsf {T}}_{\epsilon }\big ({\mathsf {B}}, f\big ) = \mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {zr} }^{(p)}, \{f\}\big ) \ge T\). Since \(f_{U'} \in \{f_U \mid U\in \mathsf {O}(d+T,d)\}\), we have

$$\begin{aligned} \sup _{f'\in \{f_U \mid U\in \mathsf {O}(d+T,d)\}}{\mathsf {T}}_{\epsilon }\big ({\mathsf {A}}, f'\big ) \ge T, \end{aligned}$$

and taking the infimum over \({\mathsf {A}}\in \mathcal {A}_{ \textsf {det} }^{(p)}\) concludes the proof. \(\square \)

Technical results

Proof of Lemma 1

Lemma 1

The functions \(\varPsi \) and \(\varPhi \) satisfy the following.

  i.

    For all \(x \le \frac{1}{2}\) and all \(k \in {\mathbb {N}}\), \(\varPsi ^{(k)}(x) = 0\).

  ii.

    For all \(x \ge 1\) and \(|y| < 1\), \(\varPsi (x)\varPhi '(y) > 1\).

  3. iii.

    Both \(\varPsi \) and \(\varPhi \) are infinitely differentiable, and for all \(k \in {\mathbb {N}}\) we have

    $$\begin{aligned} \sup _x |\varPsi ^{(k)}(x)| \le \exp \left( \frac{5 k}{2}\log (4 k)\right) ~~\text{ and }~~ \sup _x |\varPhi ^{(k)}(x)| \le \exp \left( \frac{3k}{2} \log \frac{3k}{2} \right) . \end{aligned}$$
  iv.

    The functions and derivatives \(\varPsi , \varPsi ', \varPhi \) and \(\varPhi '\) are non-negative and bounded, with

    $$\begin{aligned} 0 \le \varPsi< e, ~~ 0 \le \varPsi ' \le \sqrt{54/e}, ~~ 0< \varPhi< \sqrt{2\pi e}, ~~ \text{ and } ~~ 0 < \varPhi ' \le \sqrt{e}. \end{aligned}$$

Each of the statements in the lemma is immediate except for part iii. To see this part, we require a few further calculations. We begin by providing bounds on the derivatives of \(\varPhi (x) = \sqrt{e}\int _{-\infty }^x e^{-\frac{1}{2}t^2}\, dt\). To avoid annoyances with scaling factors, we define \(\phi (t) = e^{-\frac{1}{2}t^2}\).

Lemma 8

For all \(k \in {\mathbb {N}}\), there exist constants \(c_i^{(k)}\) satisfying \(|c_i^{(k)}| \le (2\max \{i,1\})^k\) such that

$$\begin{aligned} \phi ^{(k)}(t) = \bigg (\sum _{i = 0}^k c_i^{(k)} t^i\bigg ) \phi (t). \end{aligned}$$

Proof

We prove the result by induction. We have \(\phi '(t) = -t e^{-\frac{1}{2}t^2}\), so that the base case of the induction is satisfied. Now, assume for our induction that

$$\begin{aligned} \phi ^{(k)}(t) = \sum _{i = 0}^k c_i^{(k)} t^i e^{-\frac{1}{2}t^2} = \sum _{i = 0}^k c_i^{(k)} t^i \phi (t), \end{aligned}$$

where \(|c_i^{(k)}| \le (2\max \{i, 1\})^k\). Then taking derivatives, we have

$$\begin{aligned} \phi ^{(k + 1)}(t) = \sum _{i = 1}^k \left[ i \cdot c_i^{(k)} t^{i - 1} - c_i^{(k)} t^{i + 1} \right] \phi (t) - c_0^{(k)} t \phi (t) = \sum _{i = 0}^{k + 1} c_i^{(k + 1)} t^i \phi (t) \end{aligned}$$

where \(c_i^{(k + 1)} = (i + 1) c_{i + 1}^{(k)} -c_{i-1}^{(k)}\) (and we treat \(c_{k + 1}^{(k)} = c_{-1}^{(k)} = 0\)), so that \(|c_{k + 1}^{(k + 1)}| = 1\). With the induction hypothesis that \(|c_i^{(k)}| \le (2\max \{i,1\})^k\), we obtain

$$\begin{aligned} |c_i^{(k + 1)}| \le 2^k (i + 1) (i + 1)^k + 2^k (\max \{i,1\})^k \le 2^{k+1} (i + 1)^{k + 1}. \end{aligned}$$

This gives the result. \(\square \)
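
A short symbolic computation (a sketch of ours, assuming sympy is available) confirms Lemma 8 for small k by recovering the polynomial \(\sum _i c_i^{(k)} t^i = \phi ^{(k)}(t)/\phi (t)\) and checking the coefficient bound:

```python
import sympy as sp

t = sp.symbols('t')
phi = sp.exp(-t**2 / 2)
for k in range(1, 7):
    # phi^{(k)}(t) / phi(t) is the polynomial sum_i c_i^{(k)} t^i of Lemma 8
    poly = sp.Poly(sp.simplify(sp.diff(phi, t, k) * sp.exp(t**2 / 2)), t)
    coeffs = poly.all_coeffs()[::-1]  # c_0^{(k)}, ..., c_k^{(k)}
    print(k, all(abs(c) <= (2 * max(i, 1))**k for i, c in enumerate(coeffs)))
```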

With this result, we find that for any \(k\ge 1\),

$$\begin{aligned} \varPhi ^{(k)}(x) = \sqrt{e} \bigg (\sum _{i = 0}^{k - 1} c_i^{(k-1)} x^i \bigg ) \phi (x). \end{aligned}$$

The function \(\log (x^i \phi (x)) = i \log x - \frac{1}{2}x^2\) is maximized at \(x = \sqrt{i}\), so that \(x^i \phi (x) \le \exp (\frac{i}{2} \log \frac{i}{e})\). We thus obtain the numerically verifiable upper bound

$$\begin{aligned} |\varPhi ^{(k)}(x)|&\le \sqrt{e} \sum _{i = 0}^{k - 1} \left( 2\max \{i,1\}\right) ^{k-1} \exp \left( \frac{i}{2} \log \frac{i}{e} \right) \le \exp \left( 1.5 k \log (1.5k)\right) . \end{aligned}$$

Now, we turn to considering the function \(\varPsi (x)\). We assume w.l.o.g. that \(x > \frac{1}{2}\), as otherwise \(\varPsi ^{(k)}(x) = 0\) for all k. Recall \(\varPsi (x) = \exp \left( 1-\frac{1}{(2x - 1)^2}\right) \) for \(x > \frac{1}{2}\). We have the following lemma regarding its derivatives.

Lemma 9

For all \(k \in {\mathbb {N}}\), there exist constants \(c_i^{(k)}\) satisfying \(|c_i^{(k)}| \le 6^k (2i + k)^k\) such that

$$\begin{aligned} \varPsi ^{(k)}(x) = \bigg (\sum _{i = 1}^k \frac{c_i^{(k)}}{(2x - 1)^{k + 2i}} \bigg ) \varPsi (x). \end{aligned}$$

Proof

We provide the proof by induction over k. For \(k = 1\), we have that

$$\begin{aligned} \varPsi '(x) = \frac{4}{(2x - 1)^3} \exp \left( 1- \frac{1}{(2 x - 1)^2} \right) = \frac{4}{(2 x - 1)^3} \varPsi (x), \end{aligned}$$

which yields the base case of the induction. Now, assume that for some k, we have

$$\begin{aligned} \varPsi ^{(k)}(x) = \left( \sum _{i = 1}^k \frac{c_i^{(k)}}{(2x - 1)^{k + 2i}} \right) \varPsi (x). \end{aligned}$$

Then

$$\begin{aligned} \varPsi ^{(k + 1)}(x)&= \left( -\sum _{i = 1}^k \frac{2 (k + 2i) c_i^{(k)}}{(2x - 1)^{k + 1 + 2i}} + \sum _{i = 1}^k \frac{4 c_i^{(k)}}{(2 x - 1)^{k + 3 + 2i}} \right) \varPsi (x) \\&= \left( \sum _{i = 1}^{k + 1} \frac{4 c_{i-1}^{(k)} - 2 (k + 2i) c_i^{(k)}}{ (2x - 1)^{k + 1 + 2i}}\right) \varPsi (x), \end{aligned}$$

where \(c_{k + 1}^{(k)} = 0\) and \(c_{0}^{(k)} = 0\). Defining \(c_{1}^{(1)} = 4\) and \(c_i^{(k + 1)} = 4 c_{i - 1}^{(k)} - 2 (k + 2i) c_i^{(k)}\), under the inductive hypothesis that \(|c_i^{(k)}| \le 6^k (2i + k)^k\) we have

$$\begin{aligned} |c_i^{(k + 1)}|\le & {} 4 \cdot 6^k (k - 2 + 2i)^k + 2 \cdot 6^k (k + 2i) (k + 2i)^k \\\le & {} 6^{k + 1} (k + 2i)^{k + 1} \le 6^{k + 1} (k + 1 + 2i)^{k + 1} \end{aligned}$$

which gives the result. \(\square \)

As in the derivation immediately following Lemma 8, substituting \(t = \frac{1}{2x - 1}\), we see that \(t^{k + 2i} e^{- t^2}\) is maximized at \(t = \sqrt{(k + 2i)/2}\), so that

$$\begin{aligned} \frac{1}{(2 x - 1)^{k + 2i}} \varPsi (x) \le \exp \left( 1 + \frac{k + 2i}{2} \log \frac{k + 2i}{2e} \right) , \end{aligned}$$

which yields the numerically verifiable upper bound

$$\begin{aligned} |\varPsi ^{(k)}(x)| \le \sum _{i = 1}^k \exp \left( 1 + k\log (6k+12i) + \frac{k + 2i}{2} \log \frac{k + 2i}{2e}\right) \le \exp \left( 2.5 k \log (4 k)\right) . \end{aligned}$$
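
Both bounds in Lemma 1.iii can also be spot-checked numerically for small k; the sketch below (ours, assuming numpy and sympy) maximizes \(|\varPsi ^{(k)}|\) and \(|\varPhi ^{(k)}|\) over a grid, using that \(\varPhi ^{(k)}\) is the \((k-1)\)st derivative of \(\varPhi '(x) = \sqrt{e}\, e^{-x^2/2}\):

```python
import numpy as np
import sympy as sp

x = sp.symbols('x')
Psi = sp.exp(1 - 1 / (2*x - 1)**2)        # the x > 1/2 branch; Psi vanishes for x <= 1/2
dPhi = sp.sqrt(sp.E) * sp.exp(-x**2 / 2)  # Phi'(x)

g_psi = np.linspace(0.5001, 8.0, 40000)   # Psi^{(k)} decays to 0 as x -> 1/2+ and x -> oo
g_phi = np.linspace(-8.0, 8.0, 40000)
for k in range(1, 6):
    psi_k = sp.lambdify(x, sp.diff(Psi, x, k), 'numpy')
    phi_k = sp.lambdify(x, sp.diff(dPhi, x, k - 1), 'numpy')
    print(k,
          np.max(np.abs(psi_k(g_psi))) <= np.exp(2.5 * k * np.log(4 * k)),
          np.max(np.abs(phi_k(g_phi))) <= np.exp(1.5 * k * np.log(1.5 * k)))
```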

Proof of Lemma 3

Lemma 3

The function \(\bar{f}_T\) satisfies the following.

  i.

    We have \(\bar{f}_T(0) - \inf _x \bar{f}_T(x) \le 12T\).

  ii.

For all \(x \in {\mathbb {R}}^{T}\), \(\left\| {\nabla \bar{f}_T(x)}\right\| \le 23\sqrt{T}\).

  iii.

    For every \(p \ge 1\), the p-th order derivatives of \(\bar{f}_T\) are \(\ell _{p}\)-Lipschitz continuous, where \(\ell _{p} \le \exp (\frac{5}{2} p \log p + c p)\) for a numerical constant \(c<\infty \).

Proof

Part i follows because \(\bar{f}_T(0) < 0\) and, since \(0 \le \varPsi (x) \le e\) and \(0 \le \varPhi (x) \le \sqrt{2 \pi e}\),

$$\begin{aligned} \bar{f}_T(x) \ge -\varPsi \left( 1\right) \varPhi \left( x_{1}\right) -\sum _{i=2}^{T} \varPsi \left( x_{i-1}\right) \varPhi \left( x_{i}\right) > -T \cdot e \cdot \sqrt{2 \pi e} \ge - 12T. \end{aligned}$$

Part ii follows additionally from \(\varPsi (x) = 0\) for \(x \le 1/2\), \(0 \le \varPsi '(x) \le \sqrt{54e^{-1}}\) and \(0 \le \varPhi '(x) \le \sqrt{e}\), which when substituted into

$$\begin{aligned} \frac{\partial \bar{f}_T}{\partial x_j}(x)= & {} -\varPsi \left( -x_{j-1}\right) \varPhi '\left( -x_{j}\right) -\varPsi \left( x_{j-1}\right) \varPhi '\left( x_{j}\right) \\&-\varPsi '\left( -x_{j}\right) \varPhi \left( -x_{j+1}\right) -\varPsi '\left( x_{j}\right) \varPhi \left( x_{j+1}\right) \end{aligned}$$

yields

$$\begin{aligned} \left| \frac{\partial \bar{f}_T}{\partial x_j}(x)\right| \le e\cdot \sqrt{e} + \sqrt{54e^{-1}} \cdot \sqrt{2 \pi e} \le 23 \end{aligned}$$

for every x and j. Consequently, \(\left\| {\nabla \bar{f}_T(x)}\right\| \le \sqrt{T\cdot 23^{2}} = 23 \sqrt{T}\).

To establish part iii, fix a point \(x\in {\mathbb {R}}^{T}\) and a unit vector \(v\in {\mathbb {R}}^{T}\). Define the real function \(h_{x,v} : {\mathbb {R}}\rightarrow {\mathbb {R}}\) by the directional projection of \(\bar{f}_T\), \(h_{x,v}(\theta ) :=\bar{f}_T(x+\theta v)\). The function \(\theta \mapsto h_{x,v}(\theta )\) is infinitely differentiable for every x and v. Therefore, \(\bar{f}_T\) has \(\ell _{p}\)-Lipschitz p-th order derivatives if and only if \(|h_{x,v}^{(p+1)}(0)| \le \ell _{p}\) for every x, v. Using the shorthand notation \(\partial _{i_1}\cdots \partial _{i_k}\) for \(\frac{\partial ^k}{\partial x_{i_1} \cdots \partial x_{i_k}}\), we have

$$\begin{aligned} h_{x,v}^{\left( p+1\right) }\left( 0\right) =\sum _{i_{1}, \ldots , i_{p+1}=1}^{T} \partial _{i_{1}}\cdots \partial _{i_{p+1}}\bar{f}_T\left( x\right) v_{i_{1}}\cdots v_{i_{p+1}}\,. \end{aligned}$$

Examining \(\bar{f}_T\), we see that \(\partial _{i_{1}}\cdots \partial _{i_{p+1}}\bar{f}_T\) is non-zero if and only if \(\left| i_{j}-i_{k}\right| \le 1\) for every \(j,k\in \left[ p+1\right] \). Consequently, we can rearrange the above summation as

$$\begin{aligned} h_{x,v}^{\left( p+1\right) }\left( 0\right) = \sum _{\delta _{1},\delta _{2},\ldots ,\delta _{p}\in \left\{ 0,1\right\} ^{p} \cup \left\{ 0,-1\right\} ^{p}} \sum _{i=1}^{T}\partial _{i+\delta _{1}}\cdots \partial _{i+\delta _{p}}\partial _{i} \bar{f}_T\left( x\right) v_{i+\delta _{1}}\cdots v_{i+\delta _{p}}v_{i}, \end{aligned}$$

where we take \(v_0 :=0\) and \(v_{T+1}:=0\). A brief calculation shows that

$$\begin{aligned}&\sup _{x \in {\mathbb {R}}^T} \max _{i \in [T]} \max _{\delta \in \{0, 1\}^p \cup \{0, -1\}^p} \left| \partial _{i+\delta _{1}}\cdots \partial _{i+\delta _{p}}\partial _{i}\bar{f}_T(x)\right| \\&\quad \le \max _{k\in [p+1]} \left\{ 2\sup _{x\in {\mathbb {R}}} \left| \varPsi ^{(k)}(x)\right| \sup _{x'\in {\mathbb {R}}} \left| \varPhi ^{(p+1-k)}(x')\right| \right\} \\&\quad \le 2\sqrt{2\pi e}\cdot e^{2.5(p+1)\log (4 (p+1))} \le \exp \left( 2.5p\log p + 4p + 9\right) , \end{aligned}$$

where the second inequality uses Lemma 1.iii, and \(\varPhi (x') \le \sqrt{2\pi e}\) for the case \(k = p+1\). Defining \(\ell _{p} = 2^{p + 1} e^{2.5p\log p + 4p + 9} \le e^{2.5p\log p + 5p + 10}\), we thus have

$$\begin{aligned} \left| h_{x,v}^{\left( p+1\right) }\left( 0\right) \right|\le & {} \sum _{\delta \in \left\{ 0,1\right\} ^{p}\cup \left\{ 0,-1\right\} ^{p}} 2^{-\left( p+1\right) }\ell _{p} \left| \sum _{i=1}^{T}v_{i+\delta _{1}}\cdots v_{i+\delta _{p}}v_{i}\right| \\\le & {} \left( 2^{p+1}-1\right) 2^{-\left( p+1\right) }\ell _{p}\le \ell _{p}, \end{aligned}$$

where we have used \(|\sum _{i=1}^{T}v_{i+\delta _{1}}\cdots v_{i+\delta _{p}}v_{i}|\le 1\) for every \(\delta \in \{0, 1\}^p \cup \{0, -1\}^p\). To see that this last claim is true, recall that v is a unit vector and note that

$$\begin{aligned} \sum _{i=1}^{T}v_{i+\delta _{1}}\cdots v_{i + \delta _{p}} v_{i} = \sum _{i=1}^{T} v_{i}^{p + 1- \sum _{j=1}^{p} \delta _{j}} v_{i\pm 1}^{\sum _{j=1}^{p}\delta _{j}}. \end{aligned}$$

If \(\delta = 0\) then \(|\sum _{i=1}^{T}v_{i+\delta _{1}}\cdots v_{i+\delta _{p}}v_{i}| = | \sum _{i=1}^{T}v_{i}^{p+1} | \le \sum _{i=1}^{T}v_{i}^{2}=1\). Otherwise, letting \(1\le \sum _{j=1}^{p}|\delta _{j}|=n\le p\), the Cauchy-Schwarz inequality implies

$$\begin{aligned} \left| \sum _{i=1}^{T}v_{i+\delta _{1}}\cdots v_{i+\delta _{p}}v_{i}\right|= & {} \left| \sum _{i=1}^{T}v_{i}^{p+1-n}v_{i+s}^{n}\right| \\\le & {} \sqrt{\sum _{i=1}^{T}v_{i}^{2\left( p+1-n\right) }} \sqrt{\sum _{i=1}^{T}v_{i+s}^{2n}}\le \sum _{i=1}^{T}v_{i}^{2}=1, \end{aligned}$$

where \(s = -1\) or 1. This gives the result. \(\square \)
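
For concreteness, the following numerical sketch spot-checks parts i and ii of Lemma 3. The explicit formula for \(\bar{f}_T\) is not restated in this appendix, so the form used below is our assumption, reconstructed from the partial-derivative expression in the proof above; scipy and numpy are assumed available.

```python
import numpy as np
from scipy.special import erf

def Psi(v):
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    m = v > 0.5
    out[m] = np.exp(1.0 - 1.0 / (2.0 * v[m] - 1.0)**2)
    return out

def Phi(v):
    # sqrt(e) * integral_{-inf}^v e^{-t^2/2} dt
    return np.sqrt(np.e) * np.sqrt(np.pi / 2) * (erf(v / np.sqrt(2.0)) + 1.0)

def f_bar(x):
    # assumed hard-instance form (note Psi(1) = 1):
    # -Psi(1) Phi(x_1) + sum_{i=2}^T [Psi(-x_{i-1}) Phi(-x_i) - Psi(x_{i-1}) Phi(x_i)]
    a, b = x[:-1], x[1:]
    return -Phi(x[0]) + np.sum(Psi(-a) * Phi(-b) - Psi(a) * Phi(b))

def grad_fd(x, h=1e-6):
    # central finite differences; accurate enough to probe the 23*sqrt(T) bound
    g = np.zeros_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x); e[j] = h
        g[j] = (f_bar(x + e) - f_bar(x - e)) / (2 * h)
    return g

T, rng = 10, np.random.default_rng(1)
pts = [rng.uniform(-5.0, 5.0, T) for _ in range(2000)]
print(f_bar(np.zeros(T)) - min(map(f_bar, pts)) <= 12 * T)                    # Lemma 3.i
print(max(np.linalg.norm(grad_fd(p)) for p in pts[:200]) <= 23 * np.sqrt(T))  # Lemma 3.ii
```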

Proof of Lemma 4

The proof of Lemma 4 uses a number of auxiliary arguments, marked as Lemmas 4a, 4b and 4c. Readers looking to gain a high-level view of the proof of Lemma 4 can safely skip the proofs of these sub-lemmas. In the following, recall that \(U\in {\mathbb {R}}^{d\times T}\) is drawn from the uniform distribution over \(d\times T\) orthogonal matrices (satisfying \(U^{\top } U = I\), as \(d>T\)), that the columns of U are denoted \(u^{(1)}, \ldots , u^{(T)}\), and that \(\tilde{f}_{T;U}(x) = \bar{f}_T(U^\top x)\).

Lemma 4

Let \(\delta > 0\) and \(R \ge \sqrt{T}\), and let \(x^{(1)}, \ldots , x^{(T)}\) be informed by \(\tilde{f}_{T;U}\) and bounded, so that \(\Vert {x^{(t)}}\Vert \le R\) for each \(t \le T\). If \(d \ge 52TR^2 \log \frac{2T^2}{\delta }\) then with probability at least \(1-\delta \), for all \(t \le T\) and each \(j \in \{t, \ldots , T\}\), we have

$$\begin{aligned} |\langle u^{(j)}, x^{(t)}\rangle | < 1/2. \end{aligned}$$

Proof

For \(t \in {\mathbb {N}}\), let \(P_t \in {\mathbb {R}}^{d\times d}\) denote the projection operator to the span of \(x^{(1)}, u^{(1)}, \ldots , x^{(t)}, u^{(t)}\), and let \(P_t^\perp = I - P_t\) denote its orthogonal complement. We define the event \(G_t\) as

$$\begin{aligned} G_{t}=\left\{ \max _{j\in \{t,\ldots ,T\}} \left| \left\langle u^{(j)}, P_{t-1}^\perp x^{(t)}\right\rangle \right| \le \alpha \left\| {P_{t-1}^\perp x^{(t)}}\right\| \right\} \text { where }\alpha =\frac{1}{5R\sqrt{T}}. \end{aligned}$$
(21)

For every t, define

$$\begin{aligned} G_{\le t}=\cap _{i\le t}G_{i}\text { and }G_{<t}=\cap _{i<t}G_{i}\,. \end{aligned}$$

The following linear-algebraic result justifies the definition (21) of \(G_t\).

Lemma 4a

For all \(t\le T\), \(G_{\le t}\) implies \(|\langle u^{(j)}, x^{(s)}\rangle | < 1/2\) for every \(s \in \{1,\ldots ,t\}\) and every \(j \in \{s, \ldots , T\}\).

Proof

First, notice that since \(G_{\le t}\) implies \(G_{\le s}\) for every \(s \le t\), it suffices to show that \(G_{\le t}\) implies \(|\langle u^{(j)}, x^{(t)}\rangle | < 1/2\) for every \(j \in \{t, \ldots , T\}\). We will in fact prove a stronger statement:

$$\begin{aligned} \text { For every }t,\, G_{< t}\text { implies }\left\| { P_{t-1}u^{(j)}}\right\| ^{2} \le 2\alpha ^{2}\left( t-1\right) \text { for every }j \in \{t, \ldots , T\}, \end{aligned}$$
(22)

where we recall that \(P_t \in {\mathbb {R}}^{d\times d}\) is the projection operator to the span of \(x^{(1)}, u^{(1)}, \ldots , x^{(t)}, u^{(t)}\), \(P_t^\perp = I_d - P_t\) and \(\alpha = 1/(5R\sqrt{T})\). Before proving (22), let us show that it implies our result. Fixing \(j \in \{t, \ldots , T\}\), we have

$$\begin{aligned} \left| \left\langle u^{(j)}, x^{(t)}\right\rangle \right| \le \left| \left\langle u^{(j)}, P_{t-1}^\perp x^{(t)}\right\rangle \right| + \left| \left\langle u^{(j)}, P_{t-1}x^{(t)}\right\rangle \right| . \end{aligned}$$

Since \(G_t\) holds, its definition (21) implies \(|\langle u^{(j)}, P_{t-1}^\perp x^{(t)}\rangle | \le \alpha \left\| {P_{t-1}^\perp x^{(t)}}\right\| \le \alpha \left\| {x^{(t)}}\right\| \). Moreover, by Cauchy-Schwarz and the implication (22), we have \(|\langle u^{(j)}, P_{t-1}x^{(t)}\rangle | \le \left\| {P_{t-1}u^{(j)}}\right\| \left\| {x^{(t)}}\right\| \le \sqrt{2\alpha ^2(t-1)}\left\| {x^{(t)}}\right\| \). Combining the two bounds, we obtain the result of the lemma,

$$\begin{aligned} \left| \left\langle u^{(j)}, x^{(t)}\right\rangle \right| \le \left\| {x^{(t)}}\right\| (\alpha + \sqrt{2\alpha ^2(t-1)}) < \frac{5}{2}\sqrt{t}R\alpha \le \frac{1}{2}, \end{aligned}$$

where we have used \(\left\| {x^{(t)}}\right\| \le R\) and \(\alpha = 1/(5R\sqrt{T})\).

We prove bound (22) by induction. The basis of the induction, \(t=1\), is trivial, as \(P_{0}=0\). We shall assume (22) holds for \(s\in \{1, \ldots , t-1\}\) and show that it consequently holds for \(s=t\) as well. We may apply the Gram-Schmidt procedure to the sequence \(x^{(1)}, u^{(1)}, \ldots , x^{(t-1)}, u^{(t-1)}\) to write

$$\begin{aligned} \left\| {P_{t-1}u^{(j)}}\right\| ^{2} = \sum _{i=1}^{t-1} \left| \left\langle \frac{P_{i-1}^\perp x^{(i)}}{\left\| {P_{i-1}^\perp x^{(i)}}\right\| }, u^{(j)}\right\rangle \right| ^2 + \sum _{i=1}^{t-1} \left| \left\langle \frac{{\hat{P}}_{i-1}^\perp u^{(i)}}{\left\| {{\hat{P}}_{i-1}^\perp u^{(i)}}\right\| }, u^{(j)}\right\rangle \right| ^2 \end{aligned}$$
(23)

where \({\hat{P}}_{k}\) is the projection to the span of \(\{x^{(1)}, u^{(1)}, \ldots , x^{(k)}, u^{(k)}, x^{(k+1)}\} \),

$$\begin{aligned} {\hat{P}}_{k}=P_{k}+\frac{1}{\left\| { P_{k}^{\perp }x^{(k+1)}}\right\| ^{2}}\left( P_{k}^{\perp }x^{(k+1)}\right) \left( P_{k}^{\perp }x^{(k+1)}\right) ^{\top }. \end{aligned}$$

Then for every \(j > i\) we have

$$\begin{aligned} \left\langle {\hat{P}}_{i-1}^\perp u^{(i)}, u^{(j)}\right\rangle= & {} - \left\langle {\hat{P}}_{i-1} u^{(i)}, u^{(j)}\right\rangle = - \left\langle {P}_{i-1} u^{(i)}, u^{(j)}\right\rangle \\&-\frac{ \left\langle u^{(i)}, P_{i-1}^\perp x^{(i)}\right\rangle \left\langle u^{(j)}, P_{i-1}^\perp x^{(i)}\right\rangle }{\left\| {P_{i-1}^\perp x^{(i)}}\right\| ^2}, \end{aligned}$$

where the equalities hold by \(\left\langle u^{(i)}, u^{(j)}\right\rangle = 0\), \({\hat{P}}_{i-1}^\perp = I - {\hat{P}}_{i-1}\), and the definition of \({\hat{P}}_{i-1}\).

The \(P_i\) matrices are projections, so \({P}_{i-1}^2 = {P}_{i-1}\), and Cauchy-Schwarz and the induction hypothesis imply

$$\begin{aligned} \left| \left\langle {P}_{i-1} u^{(i)}, u^{(j)}\right\rangle \right| = \left| \left\langle {P}_{i-1} u^{(i)}, {P}_{i-1} u^{(j)}\right\rangle \right| \le \left\| { P_{i-1}u^{(i)}}\right\| \left\| { P_{i-1}u^{(j)}}\right\| \le 2\alpha ^{2}\cdot \left( i-1\right) . \end{aligned}$$

Moreover, the event \(G_{i}\) implies \( \left| \langle u^{(i)}, P_{i-1}^\perp x^{(i)}\rangle \langle u^{(j)}, P_{i-1}^\perp x^{(i)}\rangle \right| \le \alpha ^{2}\left\| {P_{i-1}^\perp x^{(i)}}\right\| ^2\), so

$$\begin{aligned} \left| \left\langle {\hat{P}}_{i-1}^\perp u^{(i)}, u^{(j)}\right\rangle \right|\le & {} \left| \left\langle {P}_{i-1} u^{(i)}, u^{(j)}\right\rangle \right| + \left| \frac{ \left\langle u^{(i)}, P_{i-1}^\perp x^{(i)}\right\rangle \left\langle u^{(j)}, P_{i-1}^\perp x^{(i)}\right\rangle }{\left\| {P_{i-1}^\perp x^{(i)}}\right\| ^2}\right| \nonumber \\\le & {} \alpha ^{2}\left( 2i-1\right) \le \frac{\alpha }{2}, \end{aligned}$$
(24a)

where the last transition uses \(\alpha =\frac{1}{5R\sqrt{T}}\le \frac{1}{5T}\le \frac{1}{4i}\), which holds because \(R\ge \sqrt{T}\) and \(i\le T\). We also have the lower bound

$$\begin{aligned} \left\| { {\hat{P}}_{i-1}^{\perp }u^{(i)}}\right\| ^{2}= & {} \left| \left\langle {\hat{P}}_{i-1}^\perp u^{(i)}, u^{(i)}\right\rangle \right| =1-\left\| { P_{i-1}u^{(i)}}\right\| ^{2} - \frac{ \left( \left\langle u^{(i)}, P_{i-1}^\perp x^{(i)}\right\rangle \right) ^{2}}{\left\| { P_{i-1}^{\perp }x^{(i)}}\right\| ^{2}}\nonumber \\\ge & {} 1-\alpha ^{2}\left( 2i-1\right) \ge \frac{1}{2}, \end{aligned}$$
(24b)

where the first equality uses \((P_{i-1}^\perp )^2 = P_{i-1}^\perp \), the second the definition of \({\hat{P}}_{i-1}\), and the inequality uses \(|\langle u^{(i)}, P_{i-1}^\perp x^{(i)}\rangle | \le \alpha \Vert {P_{i-1}^\perp x^{(i)}}\Vert \) and \(\Vert { P_{i-1}u^{(i)}}\Vert ^{2} \le 2\alpha ^{2}\left( i-1\right) \).

Combining the observations (24a) and (24b), we can bound each summand in the second summation in (23). Since the summands in the first summation are bounded by \(\alpha ^2\) by definition (21) of \(G_i\), we obtain

$$\begin{aligned} \big \Vert {P_{t-1}u^{(j)}}\big \Vert ^{2} \le \sum _{i=1}^{t-1}\alpha ^{2} +\sum _{i=1}^{t-1}\frac{\left( \alpha /2\right) ^{2}}{1/2} =\alpha ^{2}\left( t-1+\frac{t-1}{2}\right) \le 2\alpha ^{2}\left( t-1\right) , \end{aligned}$$

which completes the induction. \(\square \)
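
As an aside, the rank-one update defining \({\hat{P}}_{k}\) in the proof above is easy to validate numerically; in this small sketch (our own, with a random basis standing in for \(x^{(1)}, u^{(1)}, \ldots \)) the update reproduces the orthogonal projection onto the enlarged span:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
B = rng.standard_normal((d, 3))        # basis of the current span
Q, _ = np.linalg.qr(B)
P = Q @ Q.T                            # projection P_k onto span(B)
x_new = rng.standard_normal(d)         # the new vector x^{(k+1)}
r = x_new - P @ x_new                  # P_k^perp x^{(k+1)}
P_hat = P + np.outer(r, r) / (r @ r)   # the rank-one update from the proof
Q2, _ = np.linalg.qr(np.column_stack([B, x_new]))
print(np.allclose(P_hat, Q2 @ Q2.T))   # True: P_hat projects onto span(B, x^{(k+1)})
```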

By Lemma 4a the event \(G_{\le T}\) implies our result, so it suffices to show that

$$\begin{aligned} \mathbb {P}\left( G_{\le T}^c\right) \le \sum _{t=1}^{T} \mathbb {P}( G_t^c \mid G_{<t} ) \le \delta . \end{aligned}$$
(25)

Let us therefore consider \(\mathbb {P}\left( G_{t}^{c} \mid G_{<t}\right) \). By the union bound and the fact that \(\left\| { P_{t-1}^{\perp }u^{(j)}}\right\| \le 1\) for every t and j,

$$\begin{aligned}&\mathbb {P}(G_{t}^{c} \mid G_{<t})\nonumber \\&\quad \le \sum _{j \in \{t,\ldots ,T\}} \mathbb {P}\left( \left| \left\langle u^{(j)}, \frac{P_{t-1}^\perp x^{(t)}}{\left\| {P_{t-1}^\perp x^{(t)}}\right\| }\right\rangle \right|> \alpha \mid G_{<t} \right) \nonumber \\&\quad = \sum _{j \in \{t,\ldots ,T\}} \mathbb {E}_{\xi , U_{(<t)}} \mathbb {P}\left( \left| \left\langle u^{(j)}, \frac{P_{t-1}^\perp x^{(t)}}{\left\| {P_{t-1}^\perp x^{(t)}}\right\| }\right\rangle \right|> \alpha \mid \xi , U_{(<t)}, G_{<t} \right) \nonumber \\&\quad \le \sum _{j \in \{t,\ldots ,T\}} \mathbb {E}_{\xi , U_{(<t)}} \mathbb {P}\left( \left| \left\langle \frac{P_{t-1}^\perp u^{(j)}}{\left\| {P_{t-1}^\perp u^{(j)}}\right\| }, \frac{P_{t-1}^\perp x^{(t)}}{\left\| {P_{t-1}^\perp x^{(t)}}\right\| }\right\rangle \right| > \alpha \mid \xi , U_{(<t)}, G_{<t} \right) , \end{aligned}$$
(26)

where \(U_{(<t)}\) is shorthand for \(u^{(1)},\ldots ,u^{(t-1)}\) and \(\xi \) is the random variable generating \(x^{(1)},\ldots ,x^{(T)}\).

In the following lemma, we state formally that conditioned on \(G_{<i}\), the iterate \(x^{(i)}\) depends on U only through its first \((i-1)\) columns.

Lemma 4b

For every \(i\le T\), there exist measurable functions \({\mathsf {A}}^{(i)}_+\) and \({\mathsf {A}}^{(i)}_{-}\) such that

$$\begin{aligned} x^{(i)}={\mathsf {A}}^{(i)}_+\left( \xi ,U_{(<i)}\right) 1_{\left( {G_{<i}}\right) } + {\mathsf {A}}^{(i)}_{-}\left( \xi ,U\right) 1_{\left( {G_{<i}^{c}}\right) }. \end{aligned}$$
(27)

Proof

Since the iterates are informed by \(\tilde{f}_{T;U}\), we may write each one as (recall definition (4))

$$\begin{aligned} x^{(i)}={\mathsf {A}}^{(i)}\left( \xi , \nabla ^{{(0,\ldots ,p)}} \tilde{f}_{T;U}(x^{(1)}), \ldots , \nabla ^{{(0,\ldots ,p)}}\tilde{f}_{T;U}(x^{(i-1)})\right) = {\mathsf {A}}^{(i)}_{-}\left( \xi ,U\right) , \end{aligned}$$

for measurable functions \({\mathsf {A}}^{(i)},{\mathsf {A}}^{(i)}_{-}\), where we recall the shorthand \(\nabla ^{{(0,\ldots ,p)}} h(x)\) for the derivatives of h at x to order p. Crucially, by Lemma 4a, \(G_{<i}\) implies \(|\langle u^{(j)}, x^{(s)}\rangle | < \frac{1}{2}\) for every \(s<i\) and every \(j\ge s\). As \(\bar{f}_T\) is a fixed robust zero-chain (Definition 4), for any \(s<i\), the derivatives of \(\tilde{f}_{T;U}\) at \(x^{(s)}\) can therefore be expressed as functions of \(x^{(s)}\) and \(u^{(1)},\ldots ,u^{(s-1)}\), and—applying this argument recursively—we see that \(x^{(i)}\) is of the form (27) for every \(i\le T\). \(\square \)

Consequently (as \(G_{<t}\) implies \(G_{<i}\) for every \(i\le t\)), conditioned on \(\xi , U_{(<t)}\) and \(G_{<t}\), the iterates \(x^{(1)}, \ldots , x^{(t)}\) are deterministic, and so is \(P_{t-1}^{\perp }x^{(t)}\). If \(P_{t-1}^{\perp }x^{(t)}=0\) then \(G_t\) holds and \(\mathbb {P}(G_{t}^{c}\mid G_{<t})=0\), so we may assume without loss of generality that \(P_{t-1}^{\perp }x^{(t)}\ne 0\). We may therefore regard \({P_{t-1}^\perp x^{(t)}}/{\left\| {P_{t-1}^\perp x^{(t)}}\right\| }\) in (26) as a deterministic unit vector in the subspace \(P_{t-1}^\perp \) projects to. We now characterize the conditional distribution of \({P_{t-1}^\perp u^{(j)}}/{\left\| {P_{t-1}^\perp u^{(j)}}\right\| }\).

Lemma 4c

Let \(t\le T\), and \(j\in \{t, \ldots , T\}\). Then conditioned on \(\xi , U_{(<t)}\) and \(G_{<t}\), the vector \(\frac{P_{t-1}^\perp u^{(j)}}{\left\| {P_{t-1}^\perp u^{(j)}}\right\| }\) is uniformly distributed on the unit sphere in the subspace to which \(P_{t-1}^\perp \) projects.

Proof

This lemma is subtle. The vectors \(u^{(j)}\), \(j\ge t\), conditioned on \(U_{(<t)}\), are certainly uniformly distributed on the unit sphere in the subspace orthogonal to \(U_{(<t)}\). However, the additional conditioning on \(G_{<t}\) requires careful handling. Throughout the proof we fix \(t\le T\) and \(j\in \{t,\ldots ,T\}\). We begin by noting that by (22), \(G_{<t}\) implies

$$\begin{aligned} \left\| {P_{t-1}^\perp u^{(j)} }\right\| ^2 = 1 - \left\| {P_{t-1} u^{(j)} }\right\| ^2 \ge 1 - 2\alpha ^2(t-1) > 0. \end{aligned}$$

Therefore, when \(G_{<t}\) holds we have \(P_{t-1}^\perp u^{(j)} \ne 0\) so \({P_{t-1}^\perp u^{(j)}}/{\left\| {P_{t-1}^\perp u^{(j)}}\right\| }\) is well-defined.

To establish our result, we will show that the density of \(U_{(\ge t)} = [u^{(t)}, \ldots , u^{(T)}]\) conditioned on \(\xi , U_{(< t)}, G_{<t}\) is invariant to rotations that preserve the span of \(x^{(1)},u^{(1)},\ldots ,x^{(t-1)},u^{(t-1)}\). More formally, let \(p_{\ge t}\) denote the density of \(U_{(\ge t)}\) conditional on \(\xi ,U_{(< t)}\) and \(G_{<t}\). We wish to show that

$$\begin{aligned} p_{\ge t} \left( U_{(\ge t)} \mid \xi ,U_{(<t)},G_{<t}\right) = p_{\ge t} \left( Z U_{(\ge t)} \mid \xi ,U_{(<t)},G_{<t}\right) \end{aligned}$$
(28)

for every rotation \(Z\in {\mathbb {R}}^{d\times d}\), \(Z^{\top } Z = I_d\), satisfying

$$\begin{aligned} Zv=v=Z^{\top }v ~~ \text{ for } \text{ all } ~~ v\in \left\{ x^{(1)},u^{(1)},\ldots ,x^{(t-1)},u^{(t-1)}\right\} . \end{aligned}$$

Throughout, we let Z denote such a rotation. Letting \(p_{\xi ,U}\) and \(p_{U}\) denote the densities of \((\xi , U)\) and U, respectively, we have

$$\begin{aligned} p_{\ge t} \left( U_{(\ge t)} \mid \xi , U_{(<t)},G_{<t}\right)= & {} \frac{\mathbb {P}\left( G_{<t} \mid \xi ,U\right) p_{\xi , U}\left( \xi , U \right) }{\mathbb {P}\left( G_{<t} \mid \xi ,U_{(<t)}\right) p_{\xi ,U_{(<t)}}\left( \xi ,U_{(<t)} \right) }\\= & {} \frac{\mathbb {P}\left( G_{<t} \mid \xi ,U\right) p_{U}\left( U\right) }{\mathbb {P}\left( G_{<t} \mid \xi ,U_{(<t)}\right) p_{U_{(<t)}}\left( U_{(<t)}\right) } \end{aligned}$$

where the first equality holds by the definition of conditional probability and the second by the independence of \(\xi \) and U. We have \(Z U_{(<t)} = U_{(<t)} \) and therefore, by the rotational invariance of the distribution of U, \(p_U([U_{(<t)}, Z U_{(\ge t)}]) = p_U(Z U) = p_U(U)\). Hence, replacing U with ZU in the above display yields

$$\begin{aligned} p_{\ge t} \left( Z U_{(\ge t)} \mid \xi ,U_{(<t)},G_{<t}\right) = \frac{\mathbb {P}\left( G_{<t} \mid \xi , Z U\right) p_{U}\left( U\right) }{\mathbb {P}\left( G_{<t} \mid \xi ,U_{(<t)}\right) p_{U_{(<t)}}\left( U_{(<t)}\right) }. \end{aligned}$$

Therefore if we prove \(\mathbb {P}(G_{<t}\mid \xi ,U) = \mathbb {P}(G_{<t} \mid \xi ,Z U)\)—as we proceed to do—then we can conclude the equality (28) holds.

First, note that \(\mathbb {P}\left( G_{<t}\mid \xi ,U\right) \) is supported on \(\{0,1\}\) for every \(\xi ,U\), as they completely determine \(x^{(1)},\ldots ,x^{(T)}\). It therefore suffices to show that \(\mathbb {P}(G_{<t} \mid \xi ,U)=1\) if and only if \(\mathbb {P}\left( G_{<t} \mid \xi ,Z U\right) =1\). Set \(U'=Z U\), observing that \({u'}^{(i)}=Z u^{(i)}=u^{(i)}\) for any \(i<t\), and let \({x'}^{(1)},\ldots ,{x'}^{(T)}\) be the sequence generated from \(\xi \) and \(U'\). We will prove by induction on i that \(\mathbb {P}(G_{<t}\mid \xi ,U)=1\) implies \(\mathbb {P}(G_{<i}\mid \xi ,U')=1\) for every \(i\le t\). The basis of the induction is trivial as \(G_{<1}\) always holds. Suppose now that \(\mathbb {P}(G_{<i}\mid \xi ,U')=1\) for \(i<t\), and therefore \({x'}^{(1)},\ldots ,{x'}^{(i)}\) can be written as functions of \(\xi \) and \({u'}^{(1)},\ldots ,{u'}^{(i-1)}=u^{(1)},\ldots ,u^{(i-1)}\) by Lemma 4b. Consequently, \({x'}^{(l)}=x^{(l)}\) for any \(l\le i\) and also \(P_{i-1}'^{\perp }{x'}^{(i)}=P_{i-1}^{\perp }x^{(i)}\). Therefore, for any \(l\ge i\),

$$\begin{aligned} \left| \left\langle {u'}^{(l)}, \frac{P_{i-1}'^{\perp }{x'}^{(i)}}{\left\| {P_{i-1}'^{\perp }{x'}^{(i)}}\right\| }\right\rangle \right| {\mathop {=}\limits ^{(i)}} \left| \left\langle u^{(l)}, Z^{\top } \frac{P_{i-1}^{\perp }{x}^{(i)}}{\left\| {P_{i-1}^{\perp }{x}^{(i)}}\right\| }\right\rangle \right| {\mathop {=}\limits ^{(ii)}} \left| \left\langle u^{(l)}, \frac{P_{i-1}^{\perp }{x}^{(i)}}{\left\| {P_{i-1}^{\perp }{x}^{(i)}}\right\| }\right\rangle \right| {\mathop {\le }\limits ^{(iii)}} \alpha , \end{aligned}$$

where in (i) we substituted \({u'}^{(l)} = Z u^{(l)}\) and \(P_{i-1}'^{\perp }{x'}^{(i)}=P_{i-1}^{\perp }x^{(i)}\), (ii) is because \(P_{i-1}^{\perp }x^{(i)}=x^{(i)}-P_{i-1}x^{(i)}\) is in the span of vectors \(\left\{ x^{(1)},u^{(1)},\ldots ,x^{(i-1)},u^{(i-1)},x^{(i)}\right\} \) and therefore not modified by \(Z^{\top }\), and (iii) is by our assumption that \(G_{<t}\) holds, and so \(G_{i}\) holds. Therefore \(\mathbb {P}\left( G_{i}\mid \xi ,U'\right) =1\) and \(\mathbb {P}\left( G_{<i+1}\mid \xi ,U'\right) =1\), concluding the induction. An analogous argument shows that \(\mathbb {P}\left( G_{<t}\mid \xi ,U'\right) =1\) implies \(\mathbb {P}\left( G_{<t}\mid \xi ,U\right) =\mathbb {P}\left( G_{<t}\mid \xi ,Z^{\top }U'\right) =1\) and thus \(\mathbb {P}\left( G_{<t}\mid \xi ,U\right) =\mathbb {P}\left( G_{<t}\mid \xi ,Z U\right) \) as required.

Marginalizing the density (28) to obtain a density for \(u^{(j)}\), and recalling that \(P_{t-1}^\perp \) is measurable with respect to \(\xi , U_{(< t)}, G_{<t}\), we conclude that, conditioned on \(\xi , U_{(< t)}, G_{<t}\), the random variable \(\frac{P_{t-1}^\perp u^{(j)} }{\left\| {P_{t-1}^\perp u^{(j)} }\right\| }\) has the same density as \(\frac{ P_{t-1}^\perp Z u^{(j)} }{\left\| { P_{t-1}^\perp Z u^{(j)} }\right\| }\). However, \(P_{t-1}^\perp Z = Z P_{t-1}^\perp \) by assumption on Z, and therefore

$$\begin{aligned} \frac{ P_{t-1}^\perp Z u^{(j)} }{\left\| { P_{t-1}^\perp Z u^{(j)} }\right\| } = Z \frac{ P_{t-1}^\perp u^{(j)} }{\left\| { P_{t-1}^\perp u^{(j)} }\right\| }. \end{aligned}$$

We conclude that the conditional distribution of the unit vector \(\frac{ P_{t-1}^\perp u^{(j)} }{\left\| { P_{t-1}^\perp u^{(j)} }\right\| }\) is invariant to rotations in the subspace to which \(P_{t-1}^\perp \) projects. \(\square \)

Summarizing the discussion above, the conditional probability in (26) measures the inner product of two unit vectors in a subspace of \({\mathbb {R}}^d\) of dimension \(d'=\mathrm {tr}\left( P_{t-1}^{\perp }\right) \ge d-2\left( t-1\right) \), with one of the vectors deterministic and the other uniformly distributed. We may write this as

$$\begin{aligned} \mathbb {P}\left( \left| \left\langle \frac{P_{t-1}^\perp u^{(j)}}{\left\| {P_{t-1}^\perp u^{(j)}}\right\| }, \frac{P_{t-1}^\perp x^{(t)}}{\left\| {P_{t-1}^\perp x^{(t)}}\right\| }\right\rangle \right|> \alpha \mid \xi , U_{(<t)}, G_{<t} \right) = \mathbb {P}( |v_1| > \alpha ), \end{aligned}$$

where v is uniformly distributed on the unit sphere in \({\mathbb {R}}^{d'}\). By a standard concentration of measure bound on the sphere [5, Lecture 8],

$$\begin{aligned} \mathbb {P}( |v_1| > \alpha ) \le 2e^{-d'\alpha ^2/2} \le 2e^{-\frac{\alpha ^{2}}{2}\left( d-2t\right) }. \end{aligned}$$

Substituting this bound back into the probability (26) gives

$$\begin{aligned} \mathbb {P}\left( G_{t}^{c}\mid G_{<t}\right) \le 2\left( T-t+1\right) e^{-\frac{\alpha ^{2}}{2}\left( d-2t\right) } \le 2 Te^{-\frac{\alpha ^{2}}{2}\left( d-2T\right) }. \end{aligned}$$

Substituting this in turn into the bound (25), we have \(\mathbb {P}(G_{\le T}^c) \le \sum _{t = 1}^T\mathbb {P}(G_t^c \mid G_{< t}) \le 2 T^2 e^{-\frac{\alpha ^2}{2} (d - 2T)}\). Setting \(d\ge 52TR^{2}\log \frac{2T^{2}}{\delta }\ge \frac{2}{\alpha ^{2}}\log \frac{2T^{2}}{\delta }+2T\) establishes \(\mathbb {P}(G_{\le T}^c)\le \delta \), concluding the proof of Lemma 4. \(\square \)
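
The concentration inequality \(\mathbb {P}(|v_1| > \alpha ) \le 2e^{-d'\alpha ^2/2}\) used above is also easy to probe by Monte Carlo; the sketch below (ours, assuming numpy) samples uniform unit vectors by normalizing Gaussian vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d_prime, alpha, n = 200, 0.2, 200_000
g = rng.standard_normal((n, d_prime))
v1 = g[:, 0] / np.linalg.norm(g, axis=1)    # first coordinate of a uniform unit vector
print((np.abs(v1) > alpha).mean())          # empirical probability (about 5e-3 here)
print(2 * np.exp(-d_prime * alpha**2 / 2))  # the bound, about 3.7e-2
```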

Proof of Lemma 6

Lemma 6

The function \(\hat{f}_{T;U}\) satisfies the following.

  i.

    We have \(\hat{f}_{T;U}(0) - \inf _x \hat{f}_{T;U}(x) \le 12T\).

  ii.

    For every \(p \ge 1\), the pth order derivatives of \(\hat{f}_{T;U}\) are \(\hat{\ell }_{p}\)-Lipschitz continuous, where \(\hat{\ell }_{p} \le \exp (c p \log p + c )\) for a numerical constant \(c<\infty \).

Proof

Part i holds because \(\hat{f}_{T;U}\left( 0\right) =\bar{f}_T\left( 0\right) \) and \(\hat{f}_{T;U}\left( x\right) \ge \tilde{f}_{T;U}\left( \rho (x)\right) \) for every x, so

$$\begin{aligned} \inf _{x\in {\mathbb {R}}^{d}}\hat{f}_{T;U}\left( x\right) \ge \inf _{x\in {\mathbb {R}}^{d}}\tilde{f}_{T;U}\left( \rho (x)\right) = \inf _{\left\| {x}\right\| \le R} \bar{f}_T\left( x\right) \ge \inf _{x\in {\mathbb {R}}^{d}}\bar{f}_T\left( x\right) , \end{aligned}$$

and therefore by Lemma 3.i, we have \(\hat{f}_{T;U}(0) - \inf _{x}\hat{f}_{T;U}(x) \le \bar{f}_T(0)-\inf _{x}\bar{f}_T(x) \le 12T\).

Establishing part ii requires substantially more work. Since smoothness with respect to Euclidean distances is invariant under orthogonal transformations, we take U to be the first \(T\) columns of the d-dimensional identity matrix, denoted \(U=I_{d,T}\). Recall the scaling \(\rho (x) = R x / \sqrt{R^2 + \left\| {x}\right\| ^2}\) with “radius” \(R = 230 \sqrt{T}\) and the definition \(\hat{f}_{T;U}(x) = \bar{f}_T(U^{\top }\rho (x)) + \frac{1}{10} \left\| {x}\right\| ^2\).

The quadratic term \(\frac{1}{10} \left\| {x}\right\| ^2\) in \(\hat{f}_{T;U}\) has \(\frac{1}{5}\)-Lipschitz first derivative and 0-Lipschitz higher order derivatives (as they are all constant or zero), so it suffices to consider the function

$$\begin{aligned} \hat{f}_{T;I}(x) :=\bar{f}_T(\rho (x)) = \bar{f}_T\left( \left[ \rho \left( x\right) \right] _{1}, \ldots , \left[ \rho \left( x\right) \right] _{T}\right) . \end{aligned}$$

We now compute the partial derivatives of \(\hat{f}_{T;I}\). Defining \(y = \rho (x)\), let \({\widetilde{\nabla }}^{{k}}_{j_1, \ldots , j_k} :=\frac{\partial ^k}{\partial y_{j_1}\cdots \partial y_{j_k}}\) denote derivatives with respect to y. In addition, define \({\mathcal {P}}_k\) to be the set of all partitions of \([k] = \{1, \ldots , k\}\), i.e. \((S_1, \ldots , S_L) \in {\mathcal {P}}_k\) if and only if the \(S_i\) are disjoint and \(\cup _l S_l = [k]\). Using the chain rule, we have for any k and set of indices \(i_1, \ldots , i_k \le T\) that

$$\begin{aligned}&\nabla ^{{k}}_{i_1, \ldots , i_k} \hat{f}_{T;I}(x) \nonumber \\&\quad = \sum _{(S_1, \ldots , S_L) \in {\mathcal {P}}_k} \sum _{j_1, \ldots , j_L =1}^T\bigg (\prod _{l = 1}^L \nabla ^{{|S_l|}}_{i_{{S_l}}} \rho _{j_l}(x)\bigg ) {\widetilde{\nabla }}^{{L}}_{j_1, \ldots , j_L} \bar{f}_T(y), ~~ y = \rho (x), \end{aligned}$$
(29)

where we have used the shorthand \(\nabla ^{{|S|}}_{i_{S}}\) to denote the partial derivatives with respect to each of \(x_{i_j}\) for \(j \in S\). We use the equality (29) to argue that (recall the identity (2))

$$\begin{aligned} \sup _{x} \left\| {\nabla ^{{p+1}} \hat{f}_{T;I}(x)}\right\| _{\mathrm{op}} = \sup _{x,\, \left\| {v}\right\| = 1} \langle \nabla ^{{p+1}} \hat{f}_{T;I}(x), v^{\otimes (p+1)}\rangle :=\hat{\ell }_{p} - \frac{1}{5}1_{\left( {p=1}\right) } \le e^{c p\log p + c}, \end{aligned}$$

for some numerical constant \(0< c < \infty \) (see footnote 5) and every \(p\ge 1\). As explained in Sect. 2.1, this implies \(\hat{f}_{T;U}\) has \(e^{c p\log p + c}\)-Lipschitz pth order derivative, giving part ii of the lemma.

To do this, we begin by considering the partitioned sum (29). Let \(v \in {\mathbb {R}}^d\) be an arbitrary direction with \(\left\| {v}\right\| = 1\). Then for \(j \in [d]\) and \(k \in {\mathbb {N}}\) we define the quantity

$$\begin{aligned} {\widetilde{v}}_j^k = {\widetilde{v}}_j^k(x) :=\langle \nabla ^{{k}} \rho _j(x), v^{\otimes k}\rangle . \end{aligned}$$

With this notation, algebraic manipulations and rearrangement of the sum (29) yield

$$\begin{aligned}&\langle \nabla ^{{k}} \hat{f}_{T;I}(x), v^{\otimes k}\rangle \\&\quad = \sum _{(S_1, \ldots , S_L) \in {\mathcal {P}}_k} \sum _{i_1, \ldots , i_k = 1}^d v_{i_1} v_{i_2} \cdots v_{i_k} \sum _{j_1, \ldots , j_L =1}^T\bigg (\prod _{l = 1}^L \nabla ^{{|S_l|}}_{i_{{S_l}}} \rho _{j_l}(x)\bigg ) {\widetilde{\nabla }}^{{L}}_{j_1, \ldots , j_L} \bar{f}_T(y) \\&\quad = \sum _{(S_1, \ldots , S_L) \in {\mathcal {P}}_k} \sum _{j_1, \ldots , j_L =1}^T{\widetilde{v}}_{j_1}^{|S_1|} \cdots {\widetilde{v}}_{j_L}^{|S_L|} {\widetilde{\nabla }}^{{L}}_{j_1, \ldots , j_L} \bar{f}_T(y) \\&\quad = \sum _{(S_1, \ldots , S_L) \in {\mathcal {P}}_k} \left\langle {\widetilde{\nabla }}^{{L}} \bar{f}_T(y), {\widetilde{v}}^{|S_1|} \otimes \cdots \otimes {\widetilde{v}}^{|S_L|}\right\rangle . \end{aligned}$$

We claim that there exists a numerical constant \(c < \infty \) such that for all \(k \in {\mathbb {N}}\),

$$\begin{aligned} \sup _x \Vert {{\widetilde{v}}^k(x)}\Vert \le \exp (c k \log k + c) R^{1 - k}. \end{aligned}$$
(30)

Before proving inequality (30), we show how it implies the desired lemma. By the preceding display, we have

$$\begin{aligned} |\langle \nabla ^{{p+1}}\hat{f}_{T;I}(x), v^{\otimes (p+1)}\rangle | \le \sum _{(S_1, \ldots , S_L) \in {\mathcal {P}}_{p+1}} \left\| {{\widetilde{\nabla }}^{{L}} \bar{f}_T(y)}\right\| _{\mathrm{op}} \prod _{l = 1}^L \Vert {{\widetilde{v}}^{|S_l|}}\Vert . \end{aligned}$$

Lemma 3 shows that there exists a numerical constant \(c < \infty \) such that

$$\begin{aligned} \left\| {{\widetilde{\nabla }}^{{L}} \bar{f}_T(y)}\right\| _{\mathrm{op}} \le \ell _{L-1} \le \exp (c L \log L + c)~\text{ for } \text{ all } L \ge 2. \end{aligned}$$

When \(L = 1\), i.e. the partition consists of the single set \(S_1 = [p+1]\), we have \(|S_1| = p+1 \ge 2\), and so Lemma 3.ii yields

$$\begin{aligned} \left\| {\nabla \bar{f}_T(y)}\right\| _{\mathrm{op}} \Vert {{\widetilde{v}}^{|S_1|}}\Vert = \left\| {\nabla \bar{f}_T(y)}\right\| \Vert {{\widetilde{v}}^{|S_1|}}\Vert \le 23 \sqrt{T} \cdot R^{-p} \exp (c p \log p + c) \le \exp (c p \log p + c), \end{aligned}$$

where we have used \(R = 230 \sqrt{T}\). Using \(|S_1| + \cdots + |S_L| = p+1\) and the fact that \(q(x) = (x+1)\log (x+1)\) satisfies \(q(x)+q(y) \le q(x+y)\) for every \(x,y>0\), we have

$$\begin{aligned} \left\| {{\widetilde{\nabla }}^{{L}} \bar{f}_T(y)}\right\| _{\mathrm{op}} \prod _{l = 1}^L \Vert {{\widetilde{v}}^{|S_l|}}\Vert \le \exp (c p \log p + c) \end{aligned}$$

for some \(c < \infty \) and every \((S_1, \ldots , S_L) \in {\mathcal {P}}_{p+1}\). Bounds on Bell numbers [6, Thm. 2.1] give that there are at most \(\exp (k \log k)\) partitions in \({\mathcal {P}}_k\), which combined with the bound above gives desired result.

Let us return to the derivation of inequality (30). We begin by recalling Faà di Bruno’s formula for the chain rule. Let \(f, g : {\mathbb {R}}\rightarrow {\mathbb {R}}\) be appropriately smooth functions. Then

$$\begin{aligned} \frac{d^k}{dt^k} f(g(t)) = \sum _{P \in {\mathcal {P}}_k} f^{(|P|)}(g(t)) \cdot \prod _{S \in P} g^{(|S|)}(t), \end{aligned}$$
(31)

where |P| denotes the number of sets in the partition \(P \in {\mathcal {P}}_k\). Define the function \(\overline{\rho }(\xi ) = \xi / \sqrt{1 + \Vert {\xi }\Vert ^2}\), and let \(\lambda (\xi ) = \sqrt{1 + \Vert {\xi }\Vert ^2}\) so that \(\overline{\rho }(\xi ) = \nabla \lambda (\xi )\) and \(\rho (\xi ) = R \overline{\rho }(\xi / R)\). Let \(\overline{v}_j^k(\xi ) = \langle \nabla ^{{k}} \overline{\rho }_j(\xi ), v^{\otimes k}\rangle \), so that

$$\begin{aligned} \overline{v}^k(\xi ) = \nabla \langle \nabla ^{{k}} \lambda (\xi ), v^{\otimes k}\rangle ~~ \text{ and } ~~ {\widetilde{v}}^k = R^{1 - k} \overline{v}^k(x / R). \end{aligned}$$
(32)
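Formula (31) can be verified symbolically for small k. Below is a minimal sympy sketch with an arbitrary choice of f and g (g mimics \(t \mapsto \frac{1}{2}\Vert \xi + tv\Vert ^2\) in one dimension); the set-partition generator is the standard recursive one.

```python
import sympy as sp

def partitions(elements):
    """Yield all set partitions of a list, each as a list of blocks."""
    if len(elements) == 1:
        yield [elements]
        return
    first, rest = elements[0], elements[1:]
    for smaller in partitions(rest):
        for i, block in enumerate(smaller):
            yield smaller[:i] + [[first] + block] + smaller[i + 1:]
        yield [[first]] + smaller

t = sp.symbols('t')
f = sp.sqrt(1 + 2 * t)                 # outer function (the alpha used shortly)
g = sp.Rational(1, 2) * (3 + t) ** 2   # inner function, e.g. ||xi + t v||^2 / 2 in 1-d

for k in (1, 2, 3, 4):
    direct = sp.diff(f.subs(t, g), t, k)
    faa = sum(
        sp.diff(f, t, len(P)).subs(t, g) * sp.Mul(*[sp.diff(g, t, len(S)) for S in P])
        for P in partitions(list(range(k)))
    )
    assert sp.simplify(direct - faa) == 0
print("Faa di Bruno's formula verified for k = 1, ..., 4")
```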

With this in mind, we consider the quantity \(\langle \nabla ^{{k}} \lambda (\xi ), v^{\otimes k}\rangle \). Defining temporarily the functions \(\alpha (r) = \sqrt{1 + 2r}\) and \(\beta (t) = \frac{1}{2}\Vert {\xi + t v}\Vert ^2\), and their composition \(h(t) = \alpha (\beta (t))\), we evidently have

$$\begin{aligned} h^{(k)}(0) = \langle \nabla ^{{k}} \lambda (\xi ), v^{\otimes k}\rangle = \sum _{P \in {\mathcal {P}}_k} \alpha ^{(|P|)}(\beta (0)) \cdot \prod _{S \in P} \beta ^{(|S|)}(0), \end{aligned}$$

where the second equality used Faà di Bruno's formula (31). Now, we note the following immediate facts:

$$\begin{aligned} \alpha ^{(l)}(r) = (-1)^{l-1} \frac{(2l - 3)!!}{(1 + 2r)^{l - 1/2}} ~~ \text{ and } ~~ \beta ^{(l)}(t) = {\left\{ \begin{array}{ll} \langle v, \xi \rangle + t \left\| {v}\right\| ^2 &{} l = 1 \\ \left\| {v}\right\| ^2 &{} l = 2 \\ 0 &{} l > 2, \end{array}\right. } \end{aligned}$$

where \((-1)!! = 1\) by convention.
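The derivative formula for α can be confirmed symbolically; a small sympy sketch for the first few l (recall the convention \((-1)!! = 1\)):

```python
import sympy as sp

r = sp.symbols('r', positive=True)
alpha = sp.sqrt(1 + 2 * r)

for l in range(1, 7):
    claimed = (-1) ** (l - 1) * sp.factorial2(2 * l - 3) / (1 + 2 * r) ** sp.Rational(2 * l - 1, 2)
    assert sp.simplify(sp.diff(alpha, r, l) - claimed) == 0
print("alpha^(l) formula verified for l = 1, ..., 6")
```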

Thus, if we let \({\mathcal {P}}_{k,2}\) denote the partitions of [k] consisting only of subsets with one or two elements, we have

$$\begin{aligned} h^{(k)}(0) = \sum _{P \in {\mathcal {P}}_{k,2}} (-1)^{|P|-1} \frac{(2|P| - 3)!!}{(1 + \left\| {\xi }\right\| ^2)^{|P| - 1/2}} \langle \xi , v\rangle ^{\mathsf {C}_1(P)} \left\| {v}\right\| ^{2\mathsf {C}_2(P)} \end{aligned}$$

where \(\mathsf {C}_i(P)\) denotes the number of sets in P with precisely i elements. Noting that \(\left\| {v}\right\| = 1\), we may rewrite this as

$$\begin{aligned} \langle \nabla ^{{k}} \lambda (\xi ), v^{\otimes k}\rangle = \sum _{l = 0}^k \sum _{P \in {\mathcal {P}}_{k,2}, \mathsf {C}_1(P) = l} (-1)^{|P|-1} \frac{(2|P| - 3)!!}{(1 + \left\| {\xi }\right\| ^2)^{|P| - 1/2}} \langle \xi , v\rangle ^l. \end{aligned}$$

Taking derivatives we obtain

$$\begin{aligned} \overline{v}^k(\xi ) = \nabla \langle \nabla ^{{k}} \lambda (\xi ), v^{\otimes k}\rangle = \bigg (\sum _{l = 1}^k a_l(\xi ) \langle \xi , v\rangle ^{l - 1}\bigg ) v + \bigg (\sum _{l = 0}^k b_l(\xi ) \langle \xi , v\rangle ^l \bigg ) \xi \end{aligned}$$

where

$$\begin{aligned} a_l(\xi )&= l \cdot \sum _{P \in {\mathcal {P}}_{k,2}, \mathsf {C}_1(P) = l} \frac{(-1)^{|P|-1} (2|P| - 3)!!}{(1 + \left\| {\xi }\right\| ^2)^{|P| - 1/2}} ~~ \text{ and } \\ b_l(\xi )&= \sum _{P \in {\mathcal {P}}_{k,2}, \mathsf {C}_1(P) = l} \frac{(-1)^{|P|} (2|P| - 1)!!}{(1 + \left\| {\xi }\right\| ^2)^{|P| + 1/2}}. \end{aligned}$$

We would like to bound \(a_l(\xi ) \langle \xi , v\rangle ^{l-1}\) and \(b_l(\xi ) \langle \xi , v\rangle ^l \xi \). Note that \(|P| \ge \mathsf {C}_1(P)\) for every \(P\in {\mathcal {P}}_{k}\), so \(|P|\ge l\) in the sums above. Moreover, bounds for Bell numbers [6, Thm. 2.1] show that there are at most \(\exp (k \log k)\) partitions of [k], and \((2k - 1)!! \le \exp (k \log k)\) as well. As a consequence, we obtain

$$\begin{aligned} \sup _\xi |a_l(\xi ) \langle \xi , v\rangle ^{l - 1}| \le \exp (c k \log k) \sup _\xi \frac{|\langle \xi , v\rangle |^{l - 1}}{ (1 + \left\| {\xi }\right\| ^2)^{(l - 1)/2}} \le \exp (c k \log k), \end{aligned}$$

where we have used \(|\langle \xi , v\rangle |\le \left\| {\xi }\right\| \) due to \(\left\| {v}\right\| =1\). We similarly bound \(\sup _\xi |b_l(\xi )| |\langle \xi , v\rangle |^l \left\| {\xi }\right\| \). Summing these bounds over \(l \le k\) costs at most an additional factor of \(k+1\), which the constant c absorbs. Returning to expression (32), we have

$$\begin{aligned} \sup _x \Vert {{\widetilde{v}}^k(x)}\Vert \le \exp \left( c k \log k + c\right) R^{1 - k}, \end{aligned}$$

for a numerical constant \(c<\infty \). This is the desired bound (30), completing the proof. \(\square \)

Fig. 2: Two-dimensional cross-section of the bump function \(\bar{h}_T\)

Proof of Theorem 3

Theorem 3

There exist numerical constants \(0< c_0, c_1 < \infty \) such that the following lower bound holds. For any \(p\ge 1\), let \(D, L_{p}\), and \(\epsilon \) be positive. Then

$$\begin{aligned} \mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {rand} }, {\mathcal {F}}^\mathrm{dist}_{p}(D, L_{p})\big ) \ge c_0 \cdot D^{1+p} \left( \frac{L_{p}}{\ell _{p}'}\right) ^{\frac{1+p}{p}} \epsilon ^{-\frac{1+p}{p}}, \end{aligned}$$

where \(\ell _{p}' \le e^{c_1 p \log p + c_1}\). The lower bound holds even if we restrict \({\mathcal {F}}^\mathrm{dist}_{p}(D, L_{p})\) to functions with domain of dimension \(1 + c_2 q\Big (D^{1+p} \left( {L_{p}}/{\ell _{p}'}\right) ^{\frac{1+p}{p}} \epsilon ^{-\frac{1+p}{p}}\Big )\), for some numerical constant \(c_2<\infty \) and \(q(x) = x^2 \log (2x)\).

We divide the proof of the theorem into two parts, as in our previous results, first providing a few building blocks and then giving the proof proper. The basic idea is to introduce a negative “bump” that is challenging to find, but which lies close to the origin.

To make this precise, let \(e^{(j)}\) denote the jth standard basis vector. Then we define the bump function \(\bar{h}_T:{\mathbb {R}}^T\rightarrow {\mathbb {R}}\) by

$$\begin{aligned} \bar{h}_T(x)&= \varPsi \left( 1 - \frac{25}{2}\left\| {x - \frac{4}{5}e^{(T)}}\right\| ^2\right) \nonumber \\&= {\left\{ \begin{array}{ll} 0 &{} \left\| {x -\frac{4}{5}e^{(T)}}\right\| \ge \frac{1}{5}\\ \exp \left( 1-\frac{1}{ \left( 1-25\left\| {x -\frac{4}{5}e^{(T)}}\right\| ^2\right) ^{2}}\right) &{} \text{ otherwise. } \end{array}\right. } \end{aligned}$$
(33)

As Fig. 2 shows, \(\bar{h}_T\) features a unit-height peak centered at \(\frac{4}{5} e^{(T)}\), and it is identically zero when the distance from that peak exceeds \(\frac{1}{5}\). The volume of the peak vanishes exponentially with \(T\), making it hard to find by querying \(\bar{h}_T\) locally. We list the properties of \(\bar{h}_T\) necessary for our analysis.

Lemma 10

The function \(\bar{h}_T\) satisfies the following.

  i. \(\bar{h}_T\left( 0.8 e^{(T)}\right) = 1\) and \(\bar{h}_T(x)\in [0,1]\) for all \(x\in {\mathbb {R}}^T\).

  ii. \(\bar{h}_T(x)=0\) on the set \(\{x\in {\mathbb {R}}^{T} \mid x_{T} \le \frac{3}{5} ~\text{ or }~ \left\| {x}\right\| \ge 1\}\).

  iii. For \(p \ge 1\), the pth order derivatives of \(\bar{h}_T\) are \(\tilde{\ell }_{p}\)-Lipschitz continuous, where \(\tilde{\ell }_{p} < e^{3 p\log p + c p}\) for some numerical constant \(c < \infty \).

We prove the lemma in Sect. C.1 immediately below; the proof is similar to that of Lemma 6. With these properties in hand, we can prove Theorem 3.
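A direct numeric transcription of definition (33) makes properties i and ii easy to spot-check; the snippet below is a sanity sketch, not part of the argument.

```python
import math
import numpy as np

def bump(x):
    """The bump function (33): unit peak at 0.8 e^{(T)}, supported where ||x - 0.8 e^{(T)}|| < 1/5."""
    x = np.asarray(x, dtype=float)
    center = np.zeros(len(x))
    center[-1] = 0.8
    sq = float(np.sum((x - center) ** 2))
    if 25 * sq >= 1:
        return 0.0
    return math.exp(1 - 1 / (1 - 25 * sq) ** 2)

T = 5
peak = np.zeros(T); peak[-1] = 0.8
low = np.zeros(T); low[-1] = 0.55             # x_T <= 3/5
far = 1.2 * np.ones(T) / math.sqrt(T)         # ||x|| >= 1
assert bump(peak) == 1.0                      # Lemma 10.i
assert bump(low) == 0.0 and bump(far) == 0.0  # Lemma 10.ii
```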

Proof of Lemma 10

Properties i and ii are evident from the definition (33) of \(\bar{h}_T\). To show property iii, consider \({h}(x) = \bar{h}_T(\frac{x}{5}+0.8e^{(T)})=\varPsi (1-\frac{1}{2}\Vert x\Vert ^2)\), which is a translation and scaling of \(\bar{h}_T\); thus, if we show that \({h}\) has \((\tilde{\ell }_{p}/5^{p+1})\)-Lipschitz pth order derivatives for every \(p\ge 1\), we obtain the required results. For any \(x,v\in {\mathbb {R}}^T\) with \(\left\| {v}\right\| \le 1\) we define the directional projection \({h}_{x,v}(t) = {h}(x+t\cdot v)\). The required smoothness bound is equivalent to

$$\begin{aligned} \left| {h}_{x,v}^{(p+1)}(0) \right| \le \tilde{\ell }_{p}/5^{p+1} \le e^{c p\log p + c} \end{aligned}$$

for every \(x,v\in {\mathbb {R}}^T\) with \(\left\| {v}\right\| \le 1\), every \(p\ge 1\) and some numerical constant \(c<\infty \) (which we allow to change from equation to equation, caring only that it is finite and independent of \(T\) and p).

As in the proof of Lemma 6, we write \({h}_{x,v}(t) = \varPsi (\beta (t))\) where \(\beta (t) = 1-\frac{1}{2}\left\| {x+tv}\right\| ^2\), and use Faà di Bruno's formula (31) to write, for any \(k\ge 1\),

$$\begin{aligned} {h}_{x,v}^{(k)}(0) = \sum _{P \in {\mathcal {P}}_k} \varPsi ^{(|P|)}(\beta (0)) \cdot \prod _{S \in P} \beta ^{(|S|)}(0), \end{aligned}$$

where \({\mathcal {P}}_k\) is the set of partitions of [k] and |P| denotes the number of sets in the partition P. Noting that \(\beta '(0) = -\langle x, v\rangle \), \(\beta ''(0) = -\left\| {v}\right\| ^2\) and \(\beta ^{(n)}(0)=0\) for any \(n>2\), we have

$$\begin{aligned} {h}_{x,v}^{(k)}(0) = \sum _{P \in {\mathcal {P}}_{k,2}} (-1)^{|P|} \varPsi ^{(|P|)}\left( 1-\frac{1}{2}\left\| {x}\right\| ^2\right) \langle x, v\rangle ^{\mathsf {C}_1(P)} \left\| {v}\right\| ^{2\mathsf {C}_2(P)} \end{aligned}$$

where \({\mathcal {P}}_{k,2}\) denotes the partitions of [k] consisting only of subsets with one or two elements and \(\mathsf {C}_i(P)\) denotes the number of sets in P with precisely i elements.

Noting that \(\varPsi ^{(k)}(1-\frac{1}{2}\left\| {x}\right\| ^2) = 0\) for any \(k\ge 0\) and \(\left\| {x}\right\| > 1\), we may assume \(\left\| {x}\right\| \le 1\). Since \(\left\| {v}\right\| \le 1\), we may bound \(| {h}_{x,v}^{(p+1)}(0)|\) by

$$\begin{aligned} \left| {h}_{x,v}^{(p+1)}(0) \right| \le \left| {\mathcal {P}}_{p+1,2} \right| \cdot \max _{k\in [p+1]}\sup _{x\in {\mathbb {R}}} | \varPsi ^{(k)}(x)| {\mathop {\le }\limits ^{(i)}} e^{\frac{p+1}{2}\log (p+1)} \cdot e^{\frac{5(p+1)}{2}\log (\frac{5}{2}(p+1))} \le e^{3 p \log p + c p} \end{aligned}$$

for some absolute constant \(c<\infty \), where inequality (i) follows from Lemma 1.iv and the bound \(|{\mathcal {P}}_{k,2}| \le e^{\frac{k}{2}\log k}\) on the number of matchings in the complete graph on k vertices (the kth telephone number [21, Lem. 2]). This gives the result. \(\square \)
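The telephone-number bound is also simple to check numerically; a sketch using the standard recurrence \(T(k) = T(k-1) + (k-1)T(k-2)\):

```python
import math

def telephone(n_max):
    """T(k) = |P_{k,2}|: partitions of [k] into blocks of size at most two."""
    vals = [1, 1]  # T(0), T(1)
    for k in range(2, n_max + 1):
        vals.append(vals[k - 1] + (k - 1) * vals[k - 2])
    return vals[1:]

for k, t in enumerate(telephone(12), start=1):
    bound = math.exp(0.5 * k * math.log(k))  # e^{(k/2) log k} = k^{k/2}
    assert t <= bound
    print(f"k = {k:2d}: |P_k,2| = {t:7d} <= k^(k/2) = {bound:11.1f}")
```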

Proof of Theorem 3

For some \(T\in {\mathbb {N}}\) and \(\sigma > 0\) to be specified, and \(d=\left\lceil 52\cdot 230^2 \cdot T^2 \log (4 T^2) \right\rceil \), consider the function \(f_U :{\mathbb {R}}^d \rightarrow {\mathbb {R}}\) indexed by orthogonal matrix \(U\in {\mathbb {R}}^{d\times T}\) and defined as

$$\begin{aligned} f_U(x) = \frac{L_{p}\sigma ^{p+1}}{\ell _{p}'}\hat{f}_{T;U}(x/\sigma ) - \frac{L_{p}D^{p+1}}{\ell _{p}'}\bar{h}_T(U^{\top } x/D), \end{aligned}$$

where \(\hat{f}_{T;U}(x) = \tilde{f}_{T;U}(\rho (x)) + \frac{1}{10} \left\| {x}\right\| ^2\) is the randomized hard instance construction (13) with \(\rho (x) = x / \sqrt{1 + \left\| {x/R}\right\| ^2}\), \(\bar{h}_T\) is the bump function (33) and \(\ell _{p}' = \hat{\ell }_{p} + \tilde{\ell }_{p}\), for \(\hat{\ell }_{p}\) and \(\tilde{\ell }_{p}\) as in Lemmas 6.ii and 10.iii, respectively. By the lemmas, \(f_U\) has \(L_{p}\)-Lipschitz pth order derivatives and \(\ell _{p}' \le e^{c_1 p\log p + c_1}\) for some \(c_1<\infty \). We assume that \(\sigma \le D\); our subsequent choice of \(\sigma \) will obey this constraint.

Following our general proof strategy, we first demonstrate that \(f_U \in {\mathcal {F}}^\mathrm{dist}_{p}(D, L_{p})\), for which all we need do is guarantee that the global minimizers of \(f_U\) have norm at most D. By the constructions (13) and (10) of \(\hat{f}_{T;U}\) and \(\tilde{f}_{T;U}\), Lemma 10.i implies

$$\begin{aligned} f_U\left( (0.8D)u^{(T)}\right)&= \frac{L_{p}\sigma ^{p+1}}{\ell _{p}'}\bar{f}_T(\rho (e^{(T)})) + \frac{L_{p}\sigma ^{p+1}}{10\ell _{p}'}\left\| {\frac{4Du^{(T)}}{5\sigma }}\right\| ^2 - \frac{L_{p}D^{p+1}}{\ell _{p}'}\bar{h}_T(0.8 e^{(T)}) \\&= \frac{L_{p}\sigma ^{p + 1}}{\ell _{p}'} \bar{f}_T(0) + \frac{8 L_{p} \sigma ^{p - 1} D^2}{125 \ell _{p}'} - \frac{L_{p} D^{p + 1}}{\ell _{p}'} < \frac{L_{p}\sigma ^{p + 1}}{\ell _{p}'} \bar{f}_T(0) - \frac{117}{125} \frac{L_{p} D^{p + 1}}{\ell _{p}'} \end{aligned}$$

with the final inequality using our assumption \(\sigma \le D\). On the other hand, for any x such that \(\bar{h}_T(U^{\top } x/D) = 0\), we have by Lemma 6.i (along with \(\hat{f}_{T;U}(0)=\bar{f}_T(0)\)) that

$$\begin{aligned} f_U(x) \ge \frac{L_{p}\sigma ^{p+1}}{\ell _{p}'} \inf _x \hat{f}_{T;U}(x) \ge - 12\frac{L_{p}\sigma ^{p+1}}{\ell _{p}'}T + \frac{L_{p}\sigma ^{p + 1}}{\ell _{p}'} \bar{f}_T(0). \end{aligned}$$

Combining the two displays above, we conclude that if

$$\begin{aligned} 12 \frac{L_{p}\sigma ^{p + 1}}{\ell _{p}'} T\le \frac{117}{125}\frac{L_{p}D^{p + 1}}{\ell _{p}'}, \end{aligned}$$

then all global minima \(x^\star \) of \(f_U\) must satisfy \(\bar{h}_T(U^{\top } x^\star / D) > 0\). Inspecting the definition (33) of \(\bar{h}_T\), this implies \(\left\| {x^\star /D - 0.8u^{(T)}}\right\| < \frac{1}{5}\), and therefore \(\left\| {x^\star }\right\| \le D\). Thus, by setting

$$\begin{aligned} T= \left\lfloor \frac{D^{p+1}}{13\sigma ^{p+1}} \right\rfloor , \end{aligned}$$
(34)

we guarantee that \(f_U\in {\mathcal {F}}^\mathrm{dist}_{p}(D, L_{p})\) as long as \(\sigma \le D\).

It remains to show that, for an appropriately chosen \(\sigma \), any randomized algorithm requires (with high probability) more than \(T\) iterations to find x such that \(\Vert {\nabla f_U(x)}\Vert < \epsilon \). We claim that when \(\sigma \le D\), for any \(x\in {\mathbb {R}}^d\),

$$\begin{aligned} |\langle u^{(T)}, \rho (x/\sigma )\rangle | < \frac{1}{2}~~\text{ implies } ~~ \bar{h}_T(U^{\top } y/D) = 0 ~~ \text{ for } y \text{ in } \text{ a } \text{ neighborhood } \text{ of } x. \end{aligned}$$
(35)

We defer the proof of claim (35) to the end of this section.

Now, let \(U \in {\mathbb {R}}^{d \times T}\) be an orthogonal matrix chosen uniformly at random from \(\mathsf {O}(d,T)\). Let \(x^{(1)},\ldots , x^{(t)}\) be a sequence of iterates generated by algorithm \({\mathsf {A}}\in \mathcal {A}_{ \textsf {rand} }\) applied to \(f_U\). We argue that \(|\langle u^{(T)}, \rho (x^{(t)} / \sigma )\rangle | < 1/2\) for all \(t \le T\), with high probability. To do so, we briefly revisit the proof of Lemma 4 (Sect. B.3), where we replace \(\tilde{f}_{T;U}\) with \(f_U\) and \(x^{(t)}\) with \(\rho (x^{(t)}/\sigma )\). By Lemma 4a we have that for every \(t\le T\) the event \(G_{\le t}\) implies \(|\langle u^{(T)},\rho (x^{(s)} / \sigma )\rangle | < 1/2\) for all \(s\le t\), and therefore by the claim (35) we have that Lemma 4b holds (as we may replace the terms \(\bar{h}_T(U^{\top } x^{(s)}/D)\), \(s<t\), with 0 whenever \(G_{<t}\) holds). The rest of the proof of Lemma 4 proceeds unchanged and gives us that with probability greater than 1/2 (over any randomness in \({\mathsf {A}}\) and the uniform choice of U),

$$\begin{aligned} |\langle u^{(T)}, \rho (x^{(t)}/\sigma )\rangle | < \frac{1}{2}~~ \text{ for } \text{ all } ~ t \le T. \end{aligned}$$

By claim (35), this implies \(\nabla \bar{h}_T(U^{\top } x^{(t)}/D)=0\), and by Lemma 5, \(\Vert {\nabla \hat{f}_{T;U}(x^{(t)}/\sigma )}\Vert > 1/2\). Thus, after scaling,

$$\begin{aligned} \left\| {\nabla f_U(x^{(t)})}\right\| > \frac{L_{p}\sigma ^p}{2\ell _{p}'} \end{aligned}$$

for all \(t\le T\), with probability greater than 1/2. As in the proof of Theorem 2, by taking \(\sigma = (2\ell _{p}'\epsilon /L_{p})^{1/p}\) we guarantee

$$\begin{aligned} \inf _{{\mathsf {A}}\in \mathcal {A}_{ \textsf {det} }}\sup _{U\in \mathsf {O}(d,T)}{\mathsf {T}}_{\epsilon }\big ({\mathsf {A}}, f_U\big ) \ge 1+T, \end{aligned}$$

where \(T= \left\lfloor D^{p + 1} / (13 \sigma ^{p + 1}) \right\rfloor \) is defined in Eq. (34). Thus, as \(f_U \in {\mathcal {F}}^\mathrm{dist}_{p}(D, L_{p})\) for our choice of \(T\), we immediately obtain

$$\begin{aligned} \mathcal {T}_{\epsilon }\big (\mathcal {A}_{ \textsf {rand} }, {\mathcal {F}}^\mathrm{dist}_{p}(D, L_{p})\big ) \ge T+ 1 \ge \frac{D^{1+p}}{52}\left( \frac{L_{p}}{\ell _{p}'}\right) ^{\frac{1+p}{p}} \epsilon ^{-\frac{1+p}{p}}, \end{aligned}$$

as long as our initial assumption \(\sigma \le D\) holds. When \(\sigma > D\), we have \(\frac{2 \ell _{p}'}{L_{p}} \epsilon > D^p\), i.e. \(1 > D^{p + 1} (\frac{L_{p}}{2 \ell _{p}'})^{\frac{1 + p}{p}} \epsilon ^{-\frac{1 + p}{p}}\), so the bound holds trivially in this case: every method must take at least one step.
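For concreteness, the parameter choices above can be tabulated; the following sketch computes σ, T (Eq. (34)), the dimension d used in the proof, and the resulting lower bound, with \(\ell _{p}'\) left as an input since its value is only specified up to constants.

```python
import math

def theorem3_params(D, L_p, eps, p, ell_prime):
    """Parameter choices from the proof of Theorem 3 (a sketch; ell_prime plays the role of l'_p)."""
    sigma = (2 * ell_prime * eps / L_p) ** (1 / p)
    if sigma > D:
        return {"sigma": sigma, "note": "bound vacuous: every method takes at least 1 step"}
    T = math.floor(D ** (p + 1) / (13 * sigma ** (p + 1)))          # Eq. (34)
    d = math.ceil(52 * 230 ** 2 * T ** 2 * math.log(4 * T ** 2)) if T >= 1 else None
    lower = D ** (1 + p) * (L_p / ell_prime) ** ((1 + p) / p) * eps ** (-(1 + p) / p) / 52
    return {"sigma": sigma, "T": T, "d": d, "lower_bound": lower}

print(theorem3_params(D=1.0, L_p=1.0, eps=1e-3, p=1, ell_prime=10.0))
```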

Finally, we return to demonstrate claim (35). Note that \(|\langle u^{(T)}, \rho (x/\sigma )\rangle | < 1/2\) is equivalent to \(|\langle u^{(T)}, x\rangle | < \frac{\sigma }{2} \sqrt{1+\Vert {\frac{x}{\sigma R}}\Vert ^2}\), and consider separately the cases \(\left\| {x/\sigma }\right\| \le R/2\) and \(\left\| {x/\sigma }\right\| > R/2 = 115\sqrt{T}\). In the first case, we have \(|\langle u^{(T)}, x\rangle |< (\sqrt{5}/4)\sigma < (3/5)D\) by our assumption \(\sigma \le D\); therefore, by Lemma 10.ii we have that \(\bar{h}_T(U^{\top } y/D) = 0\) for y in a neighborhood of x. In the second case, \(1+\Vert {\frac{x}{\sigma R}}\Vert ^2 < 5\Vert {\frac{x}{\sigma R}}\Vert ^2\), so that \(\left\| {x}\right\|> (2R/\sqrt{5})|\langle u^{(T)}, x\rangle | \ge 205 |\langle u^{(T)}, x\rangle |\). If in addition \(|\langle u^{(T)}, x\rangle | < (3/5)D\) then our conclusion follows as before. Otherwise, \(\left\| {x}\right\| /D \ge 205\cdot (3/5) > 1\), so again the conclusion follows by Lemma 10.ii. \(\square \)
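The two-case argument can be stress-tested numerically. The sketch below takes \(U^{\top }\) to be a coordinate projection (without loss of generality, by rotation invariance) and checks that every sampled x satisfying the inner-product condition lands in one of the two benign cases, i.e. \(|\langle u^{(T)}, x\rangle | < (3/5)D\) or \(\left\| {x}\right\| > D\); the sampling scheme is an arbitrary choice made to cover many scales.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 200, 4
R = 230 * np.sqrt(T)
D, sigma = 1.0, 0.3                      # any sigma <= D

def rho(z):
    return z / np.sqrt(1 + np.dot(z, z) / R ** 2)

checked = 0
for _ in range(20_000):
    z = rng.standard_normal(d) / np.sqrt(d)
    z[0] = rng.uniform(-1.0, 1.0)        # bias the u^{(T)}-component (taken as coordinate 1)
    x = sigma * 10 ** rng.uniform(-2, 3) * z
    if abs(rho(x / sigma)[0]) < 0.5:     # |<u^{(T)}, rho(x / sigma)>| < 1/2
        checked += 1
        assert abs(x[0]) < 0.6 * D or np.linalg.norm(x) > D
print(f"case analysis verified on {checked} sampled points")
```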

About this article

Cite this article

Carmon, Y., Duchi, J.C., Hinder, O. et al. Lower bounds for finding stationary points I. Math. Program. 184, 71–120 (2020). https://doi.org/10.1007/s10107-019-01406-y

Keywords

  • Non-convex optimization
  • Information-based complexity
  • Dimension-free rates
  • Gradient descent
  • Cubic regularization of Newton’s method

Mathematics Subject Classification

  • 90C06
  • 90C26
  • 90C30
  • 90C60
  • 68Q25