Abstract
We lower bound the complexity of finding \(\epsilon \)-stationary points (with gradient norm at most \(\epsilon \)) using stochastic first-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least \(\epsilon ^{-4}\) queries to find an \(\epsilon \)-stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of \(\epsilon ^{-3}\) queries, establishing the optimality of recently proposed variance reduction techniques.
Notes
As is common in the optimization literature, we describe algorithms which use random coins in their execution as “randomized,” as opposed to “deterministic” algorithms which do not. Likewise, we distinguish between “noiseless” and “stochastic” first-order oracles, which provide exact and noisy gradient information, respectively.
See also the K-parallel model of [35].
The iterates of SPIDER and SNVRG are a linear combination of previously computed gradients, and therefore these algorithms are zero-respecting.
The event holds with probability at least \(1-\delta \) with respect to the random choice of U and the oracle seeds \(\{z^{(t)}\}\), even when conditioned over any randomness in \({\mathsf {A}}\).
Available at arxiv.org/abs/1912.02365v2.
References
Agarwal, A., Bartlett, P.L., Ravikumar, P., Wainwright, M.J.: Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory 58(5), 3235–3249 (2012)
Allen-Zhu, Z.: How to make the gradients small stochastically: even faster convex and nonconvex SGD. In: Advances in Neural Information Processing Systems, pp. 1165–1175 (2018a)
Allen-Zhu, Z.: Natasha 2: Faster non-convex optimization than SGD. In: Advances in Neural Information Processing Systems, pp. 2675–2686 (2018b)
Allen-Zhu, Z., Hazan, E.: Variance reduction for faster non-convex optimization. In: International Conference on Machine Learning, pp. 699–707 (2016)
Allen-Zhu, Z., Li, Y.: Neon2: finding local minima via first-order oracles. In: Advances in Neural Information Processing Systems, pp. 3716–3726 (2018)
Arjevani, Y.: Limitations on variance-reduction and acceleration schemes for finite sums optimization. In: Advances in Neural Information Processing Systems, pp. 3540–3549 (2017)
Arjevani, Y., Shamir, O.: Dimension-free iteration complexity of finite sum optimization problems. In: Advances in Neural Information Processing Systems, pp. 3540–3548 (2016)
Arjevani, Y., Carmon, Y., Duchi, J.C., Foster, D.J., Sekhari, A., Sridharan, K.: Second-order information in non-convex stochastic optimization: power and limitations. In: Conference on Learning Theory, pp. 242–299. PMLR (2020)
Ball, K.: An elementary introduction to modern convex geometry. In: Levy, S. (ed.) Flavors of Geometry, pp. 1–58. MSRI Publications (1997)
Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Advances in Neural Information Processing Systems, pp. 161–168 (2008)
Bottou, L., Curtis, F., Nocedal, J.: Optimization methods for large-scale learning. SIAM Rev. 60(2), 223–311 (2018)
Braun, G., Guzmán, C., Pokutta, S.: Lower bounds on the oracle complexity of nonsmooth convex optimization via information theory. IEEE Trans. Inf. Theory 63(7), 4709–4724 (2017)
Bubeck, S., Jiang, Q., Lee, Y.T., Li, Y., Sidford, A.: Complexity of highly parallel non-smooth convex optimization. In: Advances in Neural Information Processing Systems 32 (2019)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Convex until proven guilty: Dimension-free acceleration of gradient descent on non-convex functions. In: Proceedings of the 34th International Conference on Machine Learning, pp. 654–663 (2017)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points I. Math. Progr. 184(1), 71–120 (2019)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points II: First-order methods. Math. Progr. 185(1), 315–355 (2021)
Cartis, C., Gould, N.I., Toint, P.L.: On the complexity of steepest descent, Newton's and regularized Newton's methods for nonconvex unconstrained optimization problems. SIAM J. Opt. 20(6), 2833–2852 (2010)
Cartis, C., Gould, N.I., Toint, P.L.: Complexity bounds for second-order optimality in unconstrained optimization. J. Complex. 28(1), 93–108 (2012)
Cartis, C., Gould, N.I., Toint, P.L.: How much patience do you have?: a worst-case perspective on smooth nonconvex optimization. Optima 88, 1–10 (2012)
Cartis, C., Gould, N.I., Toint, P.L.: Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization. arXiv preprint arXiv:1709.07180 (2017)
Cutkosky, A., Orabona, F.: Momentum-based variance reduction in non-convex SGD. Adv. Neural Inf. Process. Syst. (2019)
Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems 27, (2014)
Diakonikolas, J., Guzmán, C.: Lower bounds for parallel and randomized convex optimization. In: Proceedings of the Thirty Second Annual Conference on Computational Learning Theory (2019)
Drori, Y., Shamir, O.: The complexity of finding stationary points with stochastic gradient descent. arXiv preprint arXiv:1910.01845 (2019)
Fang, C., Li, C.J., Lin, Z., Zhang, T.: Spider: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In: Advances in Neural Information Processing Systems, pp. 689–699 (2018)
Fang, C., Lin, Z., Zhang, T.: Sharp analysis for nonconvex SGD escaping from saddle points. In: Beygelzimer, A., Hsu, D., (eds) Proceedings of the Thirty-Second Conference on Learning Theory, vol. 99, pp. 1192–1234. PMLR (2019)
Foster, D.J., Sekhari, A., Shamir, O., Srebro, N., Sridharan, K., Woodworth, B.: The complexity of making the gradient small in stochastic convex optimization. In: Proceedings of the Thirty-Second Conference on Learning Theory, pp. 1319–1345 (2019)
Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points: online stochastic gradient for tensor decomposition. In: Conference on Learning Theory, pp. 797–842 (2015)
Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: Advances in Neural Information Processing Systems, pp. 2973–2981 (2016)
Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Opt. 23(4), 2341–2368 (2013)
LeCam, L.: Convergence of estimates under dimensionality restrictions. Ann. Stat. 1(1), 38–53 (1973)
Lei, L., Ju, C., Chen, J., Jordan, M.I.: Non-convex finite-sum optimization via SCSG methods. In: Advances in Neural Information Processing Systems, pp. 2348–2358 (2017)
Ma, C., Wang, K., Chi, Y., Chen, Y.: Implicit regularization in nonconvex statistical estimation: gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. Found. Comput. Math. (2019). https://doi.org/10.1007/s10208-019-09429-9
Murty, K.G., Kabadi, S.N.: Some NP-complete problems in quadratic and nonlinear programming. Math. Progr. 39(2), 117–129 (1987)
Nemirovski, A.: On parallel complexity of nonsmooth convex optimization. J. Complex. 10(4), 451–463 (1994)
Nemirovski, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization. Kluwer Academic Publishers (2004)
Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Progr. 108(1), 177–205 (2006)
Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 27(2), 372–376 (1983)
Nocedal, J., Wright, S.: Numerical Optimization. Springer Science & Business Media (2006)
Raginsky, M., Rakhlin, A.: Information-based complexity, feedback and dynamics in convex programming. IEEE Trans. Inf. Theory 57(10), 7036–7056 (2011)
Reddi, S.J., Hefny, A., Sra, S., Poczos, B., Smola, A.: Stochastic variance reduction for nonconvex optimization. In: International Conference on Machine Learning, pp. 314–323 (2016)
Schmidt, M., Roux, N.L., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems 24 (2011)
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)
Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Found. Comput. Math. 18(5), 1131–1198 (2018)
Traub, J.F., Wasilkowski, G.W., Woźniakowski, H.: Information-Based Complexity. Academic Press (1988)
Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 2899–2908 (2018)
Vavasis, S.A.: Black-box complexity of local minimization. SIAM J. Opt. 3(1), 60–80 (1993)
Wang, Z., Ji, K., Zhou, Y., Liang, Y., Tarokh, V.: Spiderboost: a class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690 (2018)
Woodworth, B., Srebro, N.: Tight complexity bounds for optimizing composite objectives. In: Advances in Neural Information Processing Systems, pp. 3639–3647 (2016)
Woodworth, B., Srebro, N.: Lower bound for randomized first order convex optimization. arXiv preprint arXiv:1709.03594 (2017)
Xu, Y., Rong, J., Yang, T.: First-order stochastic algorithms for escaping from saddle points in almost linear time. In: Advances in Neural Information Processing Systems, pp. 5530–5540 (2018)
Yao, A.C.-C.: Probabilistic computations: toward a unified measure of complexity. In: 18th Annual Symposium on Foundations of Computer Science, pp. 222–227. IEEE (1977)
Yu, B.: Assouad, Fano, and Le Cam. In: Festschrift for Lucien Le Cam, pp. 423–435. Springer (1997)
Zhou, D., Gu, Q.: Lower bounds for smooth nonconvex finite-sum optimization. In: International Conference on Machine Learning (2019)
Zhou, D., Xu, P., Gu, Q.: Stochastic nested variance reduction for nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 3925–3936. Curran Associates Inc. (2018)
Zhou, D., Xu, P., Gu, Q.: Stochastic nested variance reduction for nonconvex optimization. J. Mach. Learn. Res. (2020)
Acknowledgements
Part of this work was completed while the authors were visiting the Simons Institute for the Foundations of Deep Learning program. We thank Ayush Sekhari, Ohad Shamir, Aaron Sidford and Karthik Sridharan for several helpful discussions. YC was supported by the Stanford Graduate Fellowship. JCD acknowledges support from NSF CAREER award 1553086, the Sloan Foundation, and ONR-YIP N00014-19-1-2288. DF was supported by NSF TRIPODS award #1740751. BW was supported by the Google PhD Fellowship program.
Appendices
Appendix
A Proofs from Section 3
1.1 A.1 Basic technical results
Before proving the main results from Sect. 3, we first state two self-contained technical results that will be used in subsequent proofs. The first result bounds component functions \(\Psi \) and \(\Phi \) and gives the calculation for the parameter \(\ell _1\) in Lemma 2.2.
Observation 2
The functions \(\Psi \) and \(\Phi \) in (17) and their derivatives satisfy
Proof of Lemma 2.2
We note that the Hessian of \(F_T\) is tridiagonal. Consequently, for any \(x\in {\mathbb {R}}^d\),
where (i) is a direct calculation using the definition (16) of \(F_T\) and (ii) follows from (36). \(\square \)
The second result is an \(\Omega (\frac{\sigma ^{2}}{\epsilon ^{2}})\) lower bound on the sample complexity of finding stationary points whenever \(\epsilon \le {}O(\sqrt{\Delta {}L})\). This result handles an edge case in the proof of Theorem 2. A similar lower bound appeared in [27], but the result we prove here is slightly stronger because it holds even for dimension \(d=1\).
Lemma 10
There exists a number \(c_0 > 0\) such that for any number of simultaneous queries K, dimension d and \(\epsilon \le {}\sqrt{\frac{{\bar{L}}\Delta }{8}}\), we have
Our approach for proving Lemma 10 is as follows. Given a dimension d, we construct a function \(F:\mathbb {R}^{d}\rightarrow \mathbb {R}\), a family of distributions \(P_z\), and a family of functions f(x, z) for which \(F(x)={\mathbb {E}}_{z}\left[f(x,z)\right]\), and for which the initial suboptimality, variance, and mean-squared smoothness are bounded by \(\Delta , \sigma ^2\) and \({\bar{L}}\), respectively. We then prove a lower bound in the global stochastic model in which at round t the oracle returns the full function \(f(\cdot , z^{(t)})\), rather than just its value and derivatives at the queried point. The global stochastic model is more powerful than the K-query stochastic first-order model (with \(g(x,z)=\nabla _x f(x,z)\)) for every value of K, so this will imply the claimed result as a special case.
Lemma 11
Whenever \(\epsilon \le {}\sqrt{\frac{{\bar{L}}\Delta }{8}}\), the number of samples required to obtain an \(\epsilon \)-stationary point in the global stochastic model defined above is \(\Omega (1) \cdot \frac{\sigma ^{2}}{\epsilon ^{2}}\).
Proof of Lemma 11
The proof follows standard arguments used to derive information-theoretic lower bounds for statistical estimation [31, 54].
We consider a family of functions \(f:{\mathbb {R}}^d\times {\mathbb {R}}\rightarrow {\mathbb {R}}\) given by
where \(r\in (0,\sqrt{2\Delta /{\bar{L}}})\) is a fixed parameter. We take \(P_z\) to have the form \(P_z^{s}:=\mathcal {N}(rs,\frac{\sigma ^2}{{\bar{L}}^2})\), where \(s\in \{-1,1\}\), and let \(\theta _s :=(rs,0,\dots , 0)\in {\mathbb {R}}^d\). Then, when \(P_z=P_z^{s}\), we have \(F_{s}(x) :={\mathbb {E}}_z\left[f(x,z)\right] =\frac{{\bar{L}}}{2} \Vert x-\theta _s\Vert ^2\), and furthermore for any \(x,y\in \mathbb {R}^{d}\) we have
Note that \(F_s\) is indeed \({\bar{L}}\)-smooth, and has initial suboptimality at \(x^{(0)}=0\) bounded as \(F_s(0)-\inf _{x\in {\mathbb {R}}^d} F_s(x)={\bar{L}}r^2/2\le \Delta \).
Now, we provide a distribution over the underlying instance by drawing S uniformly from \(\{\pm {}1\}\), and consider any algorithm that takes as input samples \(z_1,\ldots ,z_T\sim P_z^S\), and returns iterate \({\hat{x}}\). To bound the expected norm of the gradient at \({\hat{x}}\) (over the randomness of the oracle, the randomness of the algorithm, and the choice of the underlying instance S), we define \({\hat{S}} :={{\,\mathrm{arg\,min}\,}}_{s'\in \{1,-1\}} \Vert \nabla F_{s'}({\hat{x}})\Vert \), with ties broken arbitrarily. Observe that we have
where (i) follows by Markov’s inequality and (ii) follows because when \({\hat{S}}\ne {}S\), the definition of \({\hat{S}}\) implies
Next, for \(s\in \{\pm {}1\}\) let \({\mathbb {P}}_{s}=\mathcal {N}^{\otimes T}(rs,\frac{\sigma ^2}{{\bar{L}}^2})\) denote the law of \((z_1,\dots ,z_T)\) conditioned on \(S=s\). We have
where the penultimate step follows by Pinsker’s inequality and the last step uses that \({\mathbb {P}}_{s}=\mathcal {N}^{\otimes T}(rs,\frac{\sigma ^2}{{\bar{L}}^2})\). Combining this lower bound with (39) yields
Finally, setting \(r= \min \{\frac{\sigma }{2{\bar{L}}\sqrt{T}},\sqrt{\frac{2\Delta }{{\bar{L}}}}\}\) implies
Stated equivalently, whenever \(\epsilon \le \sqrt{{{\bar{L}}\Delta }/{8}}\), there exists \(s\in \{-1,1\}\) such that the number of oracle calls T required to ensure \({\mathbb {E}}[\Vert \nabla F_{s}({\hat{x}})\Vert ]\le \epsilon \) satisfies
concluding the proof. \(\square \)
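The statistical content of Lemma 11 can be checked numerically. For the hard instance \(F_s(x)=\frac{{\bar{L}}}{2}\Vert x-\theta _s\Vert ^2\), the natural strategy estimates \(rs\) by the sample mean of \(z_1,\dots ,z_T\sim \mathcal {N}(rs,\sigma ^2/{\bar{L}}^2)\), giving an expected gradient norm on the order of \(\sigma /\sqrt{T}\); driving it below \(\epsilon \) therefore requires \(T\gtrsim \sigma ^2/\epsilon ^2\) samples. A minimal simulation of this scaling (a sketch under our own parameter choices, not part of the proof):

```python
import math
import random

def grad_norm_after_T_samples(T, sigma=1.0, Lbar=1.0, r=0.1, trials=2000, seed=0):
    """Average gradient norm Lbar * |xhat - r*s| when xhat is the sample mean
    of T draws from N(r*s, (sigma/Lbar)^2) -- the natural estimator for the
    one-dimensional hard instance F_s(x) = (Lbar/2) * (x - r*s)^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = rng.choice((-1, 1))
        xhat = sum(rng.gauss(r * s, sigma / Lbar) for _ in range(T)) / T
        total += Lbar * abs(xhat - r * s)
    return total / trials

# Quadrupling T should roughly halve the average gradient norm.
e16, e64 = grad_norm_after_T_samples(16), grad_norm_after_T_samples(64)
print(e16, e64, e16 / e64)
```

The observed ratio near 2 is consistent with the \(\sigma /\sqrt{T}\) rate, i.e., with sample complexity \(\Omega (\sigma ^2/\epsilon ^2)\).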
1.2 A.2 Proof of Lemma 4
First, we note that \({\mathbb {E}}\left[\nu _i (x,z)\right]=1\) for all x and i, and therefore \({\mathbb {E}}\left[{\bar{g}}_T(x,z)\right] = \nabla F_T(x)\). To show the probabilistic zero-chain property, note that, due to Observation 1.1, we have \(\nu _i(x,z) = \nu _i(x_{\le \mathrm {prog}_{\frac{1}{4}}(x)},z)\) for all x and z. Moreover, for \(i\ge 1+\mathrm {prog}_{\frac{1}{4}}(x)\) we have \(\Gamma (|x_{\ge i}|) = 0\) and therefore \(\Theta _i(x)=\Gamma (1)=1\) and \(\nu _i(x,0)=0\).
With these observations, the proof of the zero-chain property is analogous to its proof in Lemma 3: for all x and z we have \(\mathrm {prog}_{0}({\bar{g}}_T(x,z)) \le 1+\mathrm {prog}_{\frac{1}{4}}(x)\) (from Lemma 2.4) and \({\bar{g}}_T(x,z) = {\bar{g}}_T\big (x_{\le 1+\mathrm {prog}_{\frac{1}{4}}(x)},z\big )\) (from Lemma 2.5 and \(\nu _i(x,z) = \nu _i(x_{\le \mathrm {prog}_{\frac{1}{4}}(x)},z)\)), giving Eq. (12); for \(z=0\) we further have \(\mathrm {prog}_{0}({\bar{g}}_T(x,z)) \le \mathrm {prog}_{\frac{1}{4}}(x)\) (from \(\nu _i(x,0)=0\)) and \({\bar{g}}_T(x,z) = {\bar{g}}_T\big (x_{\le \mathrm {prog}_{\frac{1}{4}}(x)},z\big )\) (from Lemma 2.5 and \(\nu _i(x,z) = \nu _i(x_{\le \mathrm {prog}_{\frac{1}{4}}(x)},z)\)), giving Eq. (11).
To bound the variance of the gradient estimator we observe that for all \(i\le \mathrm {prog}_{\frac{1}{2}}(x)\), \(\Vert \Gamma (|x_{\ge i}|)\Vert \ge \Gamma (1/2)=1\) and therefore \(\Theta _{i}(x)=0\) and \(\nu _i(x,z)=1\), so that
On the other hand, Lemma 2.4 gives us that
We conclude that \(\delta (x,z) = {\bar{g}}_T(x,z) - \nabla F_T(x)\) has at most a single nonzero entry in coordinate \(i_x = \mathrm {prog}_{\frac{1}{2}}(x)+1\). Moreover, for every i
Therefore,
where the final transition used Lemma 2.3 and \(\Theta _i^2(x)\le 1\) for all x and i, establishing the variance bound in (23) with \(\varsigma = 23\).
To bound \({\mathbb {E}}\Vert {\bar{g}}_T(x,z) - {\bar{g}}_T(y,z) \Vert ^2\), we use that \({\mathbb {E}}\left[\delta (\cdot ,z)\right]=0\) and that \(\delta (\cdot ,z)\) has at most one nonzero coordinate to write
where \(i_y=\mathrm {prog}_{\frac{1}{2}}(y)+1\) is the nonzero index of \(\delta (y,z)\). For any \(i\le T\), we have
By Observation 1.3, \(\Gamma _i\) is 6-Lipschitz. Since the Euclidean norm \(\Vert \cdot \Vert \) is 1-Lipschitz, we have
That is, \(\Theta _i\) is \(6^2\)-Lipschitz. Since \(\Theta _i^2(y) \le 1\) and \((\nabla _i F_T(x))^2 \le 23^2\) by Lemma 2.3, we have
for all i. Substituting back into (40) we obtain
Recalling that \(\Vert \nabla F_T(x) - \nabla F_T(y)\Vert \le \ell _{1} \Vert x-y\Vert \) by Lemma 2.2, establishes the mean-square smoothness bound in (23) with \({\bar{\ell }}_{1} = \sqrt{2\cdot (\varsigma \cdot 6)^2 + 3\ell _{1}^2}\).
1.3 A.3 Proof of Theorem 2
Let \(\Delta _0, \ell _1,\varsigma \) and \({\bar{\ell }}_{1}\) be the numerical constants in Lemmas 2.1, 2.2 and 4, respectively. Let the accuracy parameter \(\epsilon \), initial suboptimality \(\Delta \), mean-squared smoothness parameter \({\bar{L}}\), and variance parameter \(\sigma ^2 \) be fixed, and let \(L\le {\bar{L}}\) be specified later. We rescale \(F_T\) as in the proof of Theorem 1,
This guarantees that \(F^{\star }_T\in {\mathcal {F}}(\Delta , L)\) and that the corresponding scaled gradient estimator \(g^{\star }_T(x,z)=(L\lambda /\ell _{1}){\bar{g}}_T(x/\lambda ,z)\) is such that every zero respecting algorithm \({\mathsf {A}}\) interacting with \({\mathsf {O}}_{F^{\star }_T}(x,z)=(F^{\star }_T(x),g^{\star }_T(x,z))\) satisfies
for all \(t\le {} {(T-1)}/{2p}\) and \(k\in [K]\). It remains to choose p and \(L\) such that \({\mathsf {O}}_{F^{\star }_T}\) belongs to \(\mathcal {O}(K,\sigma ^{2},{\bar{L}})\). As in the proof of Theorem 1, setting \(\frac{1}{p}=\frac{\sigma ^2}{ (2\varsigma \epsilon )^2} + 1\) and using Lemma 4 guarantees a variance bound of \(\sigma ^{2}\). Moreover, by Lemma 4 we have
Therefore, taking
guarantees membership in the oracle class and implies the lower bound
We consider the cases \(\frac{{\bar{L}}\Delta \sqrt{p}}{4{\bar{\ell }}_{1}\Delta _0\epsilon ^2}\ge 3\) and \(\frac{{\bar{L}}\Delta \sqrt{p}}{4{\bar{\ell }}_{1}\Delta _0\epsilon ^2}<3\) separately. In the former case (which is the more interesting one), we use \(\lfloor x\rfloor -1\ge x/2\) for \(x \ge 3\) and the setting of p to write
Moreover, we choose \(c'=12{\bar{\ell }}_{1}\Delta _0\) so that \(\epsilon \le \sqrt{\frac{{\bar{L}}\Delta }{12{\bar{\ell }}_{1}\Delta _0}}\le \sqrt{\frac{{\bar{L}}\Delta }{8}}\) holds. By Lemma 11,
where \(c_0\) is a universal constant (this lower bound holds for any value of d). Together, the bounds (41) and (42) imply the desired result when \(\frac{{\bar{L}}\Delta \sqrt{p}}{4{\bar{\ell }}_{1}\Delta _0\epsilon ^2}\ge 3\).
Finally, we consider the edge case \(\frac{{\bar{L}}\Delta \sqrt{p}}{4{\bar{\ell }}_{1}\Delta _0\epsilon ^2}<3\). We note that the assumption \(\epsilon \le \sqrt{\frac{{\bar{L}}\Delta }{12{\bar{\ell }}_{1}\Delta _0}}\) precludes the option that \(p=1\) in this case. Therefore we must have \(\frac{{\bar{L}}\Delta \varsigma }{2{\bar{\ell }}_{1}\Delta _0\sigma \epsilon }<3\) or, equivalently, \(\frac{\sigma ^2}{\epsilon ^2} > \frac{\varsigma }{6{\bar{\ell }}_{1}\Delta _0} \cdot \frac{{\bar{L}}\Delta \sigma }{\epsilon ^3}\). Thus, in this case the bound (42) implies (41) up to a constant, concluding the proof.
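The rescaling step at the start of the proof uses a standard fact: if \(\nabla F\) is \(\ell _1\)-Lipschitz, then \(F^{\star }(x) :=\frac{L\lambda ^2}{\ell _1}F(x/\lambda )\) has gradient \(\nabla F^{\star }(x)=\frac{L\lambda }{\ell _1}\nabla F(x/\lambda )\), which is \(L\)-Lipschitz for any \(\lambda >0\). The sketch below checks this numerically on a toy one-dimensional choice \(F=\sin \) (our own example, not the construction \(F_T\)):

```python
import math

def scaled_grad_lipschitz_estimate(L=7.0, lam=3.0, n=20001, span=50.0):
    """Estimate the Lipschitz constant of the gradient of
    F*(x) = (L * lam^2 / ell1) * F(x / lam) for F = sin (so ell1 = 1),
    by maximizing finite-difference slopes of grad F* over a grid.
    The estimate should be close to (and never exceed) L."""
    ell1 = 1.0  # sup |sin''| = 1, so sin has 1-Lipschitz gradient
    def grad_star(x):
        return (L * lam / ell1) * math.cos(x / lam)  # chain rule
    h = span / (n - 1)
    best = 0.0
    for i in range(n - 1):
        x0 = -span / 2 + i * h
        best = max(best, abs(grad_star(x0 + h) - grad_star(x0)) / h)
    return best

print(scaled_grad_lipschitz_estimate())  # close to L = 7
```

By the mean value theorem, every finite-difference slope is bounded by \(\sup |(F^{\star })''| = L\), and the maximum over a fine grid approaches it.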
B Proofs from Section 4
1.1 B.1 Proof of Lemma 5
The proof combines the techniques in the proof of Lemma 1 with random-projection-based lower bounds on the sequential oracle complexity of optimization [15, 51], in their refined form [13, 23], which yields low-dimensional hard instances.
Let us adopt the shorthand \(x^{(i)}:={}x^{(i)}_{{\mathsf {A}}[{\mathsf {O}}_{{\tilde{F}}_U}]}\), which we recall is defined via
where r is the algorithm’s random seed. Following the proof strategy of Lemma 1, we define
and
recalling that, due to Definition 2, \(\{B^{(t)}\}_{t \ge 1}\) are i.i.d. Bernoulli with probability of success at most p, and are independent of any randomization in the algorithm \({\mathsf {A}}\). With the shorthand
We additionally define, for every \(t\ge 1\), the event
Writing
the claim of the lemma is equivalent to the statement that \({\mathbb {P}}(\pi ^{(T_p)} \ge T) \le \delta \). Since
it suffices to show that both
and
The bound (45) follows identically to Eq. (15) in the proof of Lemma 1, and so the remainder of the proof consists of establishing the bound (44).
We begin by rewriting the event \(\big [{\mathfrak {E}}^{(T_p)}\big ]^c\) as follows
Define the \(\sigma \)-field
where we recall that r is the algorithm’s random seed, and note that \(C^{(t)}\in \mathcal {F}\) for all \(t \le T_p\). Conditioning on \(\mathcal {F}\) and applying the union bound, we have
Therefore, to establish the bound (44) and with it the result, it suffices to bound the corresponding conditional probability for every \(t\le T_p\), \(k\in [K]\) and \(j>C^{(t)}\). To do so, we leverage the probabilistic zero-chain property in order to show the following.
Lemma 12
Fix \(t \ge 1\), and condition on \(\mathcal {F}\). If \({\mathfrak {E}}^{(t-1)}\) holds, then for every \(s \le t\) and \(k\in [K]\), \(x^{(s,k)}\) is measurable with respect to \(u^{(1)}, \ldots , u^{(C^{(s)})}\).
Proof
Throughout the proof, we adopt the shorthand \(U_{\le c}\) for \([u^{(1)}; \ldots u^{(c)}; 0, \ldots , 0]\), i.e., a version of U where the last \(T-c\) columns are replaced with zeros. We also recall the notation \(x_{\le i}\) for the replacement of all but the first i coordinates of x with zeros.
The crux of the proof is the following claim: for any \(s<t\) and \(k\in [K]\), if \(x^{(s,k)}\) is measurable w.r.t. \(U_{\le C^{(s)}}\) and \(\mathrm {prog}_{\frac{1}{4}}(U^\top x^{(s,k)}) \le C^{(s)}\), then the oracle response to query \(x^{(s,k)}\) is measurable w.r.t. \(U_{\le C^{(s+1)}}\). To see why this holds, let \(g^{(s,k)} = g( U^\top x^{(s,k)},z^{(s)})\) and note that Definition 2 of the probabilistic zero chain, along with definition (43) of the sequence \((B^{(t)})\), implies that
The assumption \(\mathrm {prog}_{\frac{1}{4}}(U^\top x^{(s,k)}) \le C^{(s)}\) implies that \(B^{(s)} + \mathrm {prog}_{\frac{1}{4}}(U^\top x^{(s,k)}) \le B^{(s)} + C^{(s)} = C^{(s+1)}\), and—noting that \([U^\top v]_{\le c} = U_{\le c}^\top v\)—we consequently have
Therefore, the oracle response to query \(x^{(s,k)}\) has the form
so that if \(x^{(s,k)}\) is measurable w.r.t. \(U_{\le C^{(s)}}\), then \({\mathsf {O}}_{{\tilde{F}}_U}(x^{(s,k)},z^{(s)})\) is measurable w.r.t. \(U_{\le C^{(s+1)}}\).
From here the lemma follows by straightforward induction. The base case \(t=1\) is trivial, since the algorithm’s initial queries do not depend on U. For the induction step, fix s and suppose that \(x^{(s',k')}\) is measurable w.r.t. \(U_{\le C^{(s')}}\) for all \(s' < s \le t\) and \(k'\in [K]\). That \({\mathfrak {E}}^{(t-1)}\) holds implies that \(\mathrm {prog}_{\frac{1}{4}}( U^\top x^{(s',k')}) \le C^{(s')}\) for all \(s'<t\), and hence by the discussion above we conclude that the oracle’s responses to all queries at iterations \(1,\ldots , s-1\) are measurable w.r.t. \(U_{\le C^{(s)}}\). Since \(x^{(s,k)}\) is a (measurable) function of r and the oracle responses up to iteration s, we conclude that it is measurable w.r.t. \(U_{\le C^{(s)}}\) as well. \(\square \)
From Lemma 12, we conclude that there exists a function \({\mathsf {f}}^{(t,k)}: ({\mathbb {R}}^d)^{C^{(t)}}\rightarrow \{x\in {\mathbb {R}}^d | \Vert x\Vert \le R\}\) (implicitly also dependent on \(\mathcal {F}\)), such that \(x^{(t,k)} = {\mathsf {f}}^{(t,k)}(u^{(1)},\ldots , u^{(C^{(t)})})\). Consequently,
Conditional on \(\mathcal {F}\) and \(u^{(1)}, \ldots , u^{(C^{(t)})}\), we have that \({\mathsf {f}}^{(t,k)}(u^{(1)},\ldots , u^{(C^{(t)})})\) is a fixed vector with norm at most R, while for every \(j > C^{(t)}\), the vector \(u^{(j)}\) is uniformly distributed on the \((d-C^{(t)})\)-dimensional unit sphere. Therefore, concentration of measure on the sphere (see, e.g., Lemma 2.2 of [9]) gives
where the last transition follows from our choice of \(d - T \ge 32R^2 \log \frac{2T^2 K}{p\delta } \ge 32R^2 \log \frac{4 T_p K T}{\delta }\). Substituting back into Eq. (46), we obtain the bound (44) and conclude the proof of Lemma 5.
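The concentration step is easy to probe empirically: for \(u\) uniform on the unit sphere in \(\mathbb {R}^d\) and a fixed vector of norm \(R\), the inner product has standard deviation about \(R/\sqrt{d}\), so when \(d\gg R^2\log (\text {number of draws})\) even the largest of many draws stays below \(1/2\). A quick sanity check (our own parameter choices, not the constants of Lemma 2.2 of [9]):

```python
import math
import random

def max_abs_inner_product(d, R, trials=500, seed=0):
    """Empirical max of |<u, x>| over repeated draws of u uniform on the unit
    sphere S^{d-1}, with x a fixed vector of norm R (wlog x = R * e_1 by
    rotational symmetry). A normalized Gaussian vector is uniform on the sphere."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        worst = max(worst, abs(R * g[0] / norm))  # <u, R*e_1> = R * u_1
    return worst

# With d = 2000 >> R^2 * log(trials), all 500 draws fall well below 1/2.
print(max_abs_inner_product(2000, 4.0))
```

This mirrors the union bound over \(t\le T_p\), \(k\in [K]\) and \(j>C^{(t)}\) in the proof: each individual inner product is small with overwhelming probability once \(d\) is large enough.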
1.2 B.2 Proof of Lemma 6
Before proving Lemma 6, we first list the relevant continuity properties of the compression function \(\rho \).
Lemma 13
Let \(J(x)=\frac{\partial \rho }{\partial {}x}(x) = \frac{I - \rho (x)\rho (x)^{\top }/R^{2}}{\sqrt{1+\Vert x\Vert ^{2}/R^{2}}}\). For all \(x,y\in \mathbb {R}^{d}\) we have
Proof of Lemma 13
Note that \(\Vert \rho (x)\Vert \le R\) and therefore \(0 \preceq I - \rho (x)\rho (x)^{\top }/R^{2} \preceq I\). Consequently, we have \( \Vert J(x)\Vert _{\mathrm {op}} = (1+\Vert x\Vert ^{2}/R^{2})^{-1/2} \le 1. \) The guarantee \(\Vert \rho (x)-\rho (y)\Vert \le {}\Vert x-y\Vert \) follows immediately by Taylor's theorem. For the last statement, define \(h(t)=\frac{1}{\sqrt{1+t^{2}}}\), and note that \(\left|h(t)\right|,\left|h'(t)\right|\le {}1\). By the triangle inequality and the aforementioned boundedness and Lipschitzness properties of h, we have
For the first term, observe that for any x, y with \(\Vert x\Vert ,\Vert y\Vert \le {}1\), we have \(\Vert xx^{\top }-yy^{\top }\Vert _{\mathrm {op}}\le {}2\Vert x-y\Vert \); this follows because for any \(\Vert v\Vert =1\), we have \(\Vert (xx^{\top }-yy^{\top })v\Vert \le {}\Vert x-y\Vert \left|\left\langle v,x\right\rangle \right|+\Vert y\Vert \left|\left\langle v,x-y\right\rangle \right|\le {}2\Vert x-y\Vert \). Since \(\Vert \rho (x)/R\Vert \le {}1\), it follows that
For the second term, we again use that \(\Vert \rho (x)\Vert \le {}R\) to write
\(\square \)
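Since the display defining \(\rho \) is not reproduced here, the sketch below assumes the standard form \(\rho (x)=x/\sqrt{1+\Vert x\Vert ^2/R^2}\), which is consistent with the Jacobian \(J(x)\) stated in Lemma 13. It numerically confirms the boundedness and non-expansiveness properties used throughout this section:

```python
import math
import random

def rho(x, R):
    """Compression map x -> x / sqrt(1 + ||x||^2 / R^2). This is our assumed
    form of rho, chosen to match the Jacobian J(x) stated in Lemma 13."""
    scale = math.sqrt(1.0 + sum(v * v for v in x) / R**2)
    return [v / scale for v in x]

def check_compression_properties(R=3.0, d=5, trials=500, seed=0):
    """Numerically confirm ||rho(x)|| <= R and ||rho(x) - rho(y)|| <= ||x - y||
    on random pairs of points (the 1-Lipschitz claim of Lemma 13)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-10, 10) for _ in range(d)]
        y = [rng.uniform(-10, 10) for _ in range(d)]
        rx, ry = rho(x, R), rho(y, R)
        norm_rx = math.sqrt(sum(v * v for v in rx))
        dist_r = math.sqrt(sum((a - b) ** 2 for a, b in zip(rx, ry)))
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        if norm_rx > R + 1e-9 or dist_r > dist + 1e-9:
            return False
    return True

print(check_compression_properties())
```

The range bound \(\Vert \rho (x)\Vert \le R\) follows since \(t\mapsto t/\sqrt{1+t^2/R^2}\) is bounded by \(R\); the 1-Lipschitzness reflects \(\Vert J(x)\Vert _{\mathrm {op}}\le 1\).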
Proof of Lemma 6
The argument here is essentially identical to [15, Lemma 5]. Define \(y^{(i)}=(y^{(i,1)},\ldots ,y^{(i,K)})\), where \(y^{(i,k)}=\rho (x^{(i,k)})\). Observe that for each i and k, the oracle response \((\widehat{F}_{T,U}(x^{(i,k)}),\widehat{g}_{T,U}(x^{(i,k)},z^{(i)}))\) is a measurable function of \(x^{(i,k)}\) and \(({\tilde{F}}_{T,U}(y^{(i,k)}),{\tilde{g}}_{T,U}(y^{(i,k)},z^{(i)}))\). Consequently, we can regard the sequence \(y^{(1)},\ldots ,y^{(T)}\) as realized by some algorithm in \({\mathcal {A}}_{{\textsf {rand}}}(K)\) applied to an oracle with \({\mathsf {O}}_{{\tilde{F}}_{T,U}}(y,z)=({\tilde{F}}_{T,U}(y),{\tilde{g}}_{T,U}(y,z))\). Lemma 5 then implies that as long as \(d\ge {}\lceil (32\cdot 230^2+1)T\log \frac{2KT^2}{p\delta }\rceil \ge {}\lceil T + 32 R^2\log \frac{2KT^2}{p\delta }\rceil \), we have that with probability at least \(1-\delta \),
as long as \(i\le {}(T-\log (2/\delta ))/2p\).
We now show that the gradient must be large for all of the iterates. Let i and k be fixed. We first consider the case where \(\Vert x^{(i,k)}\Vert \le {}R/2\). Observe that (48) implies that \(\mathrm {prog}_{1}(U^{\top }y^{(i,k)})<T\) and so by Lemma 2.6, if we set \(j=\mathrm {prog}_{1}(U^{\top }y^{(i,k)})+1\), we have
Now, observe that we have
Using that \(J(x) = \frac{I - \rho (x)\rho (x)^{\top }/R^{2}}{\sqrt{1+\Vert x\Vert ^{2}/R^{2}}}\), this is equal to
Since \(\Vert y^{(i,k)}\Vert \le {}\Vert x^{(i,k)}\Vert \le {}R/2\), this implies
By Lemma 2 we have \(\Vert \nabla {\tilde{F}}_{T,U}(y^{(i,k)})\Vert \le {}23\sqrt{T}\). At this point, the choice \(\eta =1/5\), \(R=230\sqrt{T}\), as well as (49) imply that \(\left|\left\langle u^{(j)},\nabla {}\widehat{F}_{T,U}(x^{(i,k)})\right\rangle \right| \ge {} \frac{2}{\sqrt{5}} - \left(\frac{1}{20} + \frac{1}{2\sqrt{5}}\right)\ge {}\frac{1}{2}\).
Next, we handle the case where \(\Vert x^{(i,k)}\Vert >R/2\). Here, we have
where the second inequality uses that \(\Vert J(x^{(i,k)})\Vert _{\mathrm {op}}\le {} \frac{1}{\sqrt{1+\Vert x^{(i,k)}\Vert ^{2}/R^{2}}}\le {}2/\sqrt{5}\) which follows from Lemma 13 and \(\Vert x^{(i,k)}\Vert >R/2\). \(\square \)
1.3 B.3 Proof of Lemma 7
To establish Lemma 7 we first prove a generic result showing that composition with the compression function \(\rho \) and an orthogonal transformation U never significantly hurts the regularity requirements in our lower bounds. In the following, we use the notation \(a\vee b :=\max \{a,b\}\).
Lemma 14
Let \(F:\mathbb {R}^{T}\rightarrow \mathbb {R}\) be an arbitrary twice-differentiable function with \(\Vert \nabla {}F(x)\Vert \le {}\ell _0\) and \(\Vert \nabla {}F(x)-\nabla {}F(y)\Vert \le {}\ell _1\cdot \Vert x-y\Vert \), and let g(x, z) and a random variable \(z\sim P_z\) satisfy for all \(x,y\in \mathbb {R}^{T}\),
Let \(R\ge \ell _{0} \vee 1\), \(d\ge T\), and \(U\in \mathsf {Ortho}(d,T)\). Then the functions
satisfy the following properties.
-
1.
\(\widehat{F}_{U}(0) - \inf _{x}\widehat{F}_{U}(x) \le F(0)-\inf _{x}F(x)\).
-
2.
The first derivative of \(\widehat{F}_{U}\) is \((\ell _{1}+3)\)-Lipschitz continuous.
-
3.
\({\mathbb {E}}\big \Vert \widehat{g}_{U}(x,z)-\nabla {}\widehat{F}_{U}(x)\big \Vert ^{2}\le {} \sigma ^2\) for all \(x\in \mathbb {R}^{d}\).
-
4.
\({\mathbb {E}}\Vert \widehat{g}_{U}(x,z)-\widehat{g}_{U}(y,z)\Vert ^{2} \le {} ({\bar{L}}^{2} + 9\sigma ^{2} + 9)\Vert x-y\Vert ^{2}\) for all \(x,y\in \mathbb {R}^{d}\).
Proof of Lemma 14
Property 1 is immediate, since the range of \(\rho \) is a subset of \(\mathbb {R}^{T}\). For property 2, we use the triangle inequality along with Lemma 13 and the assumed smoothness properties of F as follows:
For the variance bound (property 3), observe that we have
Here the second inequality follows from (47) and the fact that \(U\in \mathsf {Ortho}(d,T)\), and the third inequality follows because the variance bound in (50) holds uniformly for all points in the domain \(\mathbb {R}^{T}\) (in particular, those in the range of \(x\mapsto {}U^{\top }\rho (x)\)).
Lastly, to prove property 4 we first invoke the triangle inequality and the elementary inequality \((a+b)^{2}\le {}2a^{2}+2b^{2}\).
For the first term, we use the Jacobian operator norm bound from (47) and the assumed mean-squared smoothness of g:
For the second term, we use the Jacobian Lipschitzness from (47):
We now use the assumed Lipschitzness of F and variance bound for g:
Putting everything together, we have
\(\square \)
Proof of Lemma 7
For property 1, observe that \(\widehat{F}_{T,U}(0)=F_T(0)\), and
For properties 2, 3, and 4 we observe from Lemma 14 that \(\widehat{F}_{T,U}\) and \(\widehat{g}_{T,U}\), ignoring the quadratic regularization term, satisfy the same smoothness, variance, and mean-squared smoothness bounds as in Lemma 2/Lemma 4/Lemma 8 up to constant factors. The additional regularization term in (28) leads to an additional \(\eta =1/5\) factor in the smoothness and mean-squared-smoothness. \(\square \)
B.4 Proof of Theorem 3
We prove the lower bound for the bounded variance and mean-squared smooth settings in turn. The proofs follow the same outline as the proofs of Theorems 1 and 2, relying on Lemmas 6 and 7 rather than Lemmas 1 and 4, respectively. Throughout, let \(\Delta _0, \ell _1,\varsigma \) and \({\bar{\ell }}_{1}\) be the numerical constants in Lemma 7.

Bounded variance setting Given accuracy parameter \(\epsilon \), initial suboptimality \(\Delta \), smoothness parameter \(L\), and variance parameter \(\sigma ^2 \), we define for each \(U\in \mathsf {Ortho}(d,T)\) a scaled instance
We assume \(T\ge 4\), or equivalently \(\epsilon \le \sqrt{\frac{L\Delta }{64\ell _1\Delta _0 }}\). Let \(g^{\star }_T{}(x,z)\) denote the corresponding scaled version of the stochastic gradient function \(\widehat{g}_{T,U}\). Now, by Lemma 7, we have that \(F^{\star }_{T,U}\in {\mathcal {F}}(\Delta , L)\) and moreover,
Therefore, setting \(\frac{1}{p}=\frac{\sigma ^2}{(4\varsigma \epsilon )^2} + 1\) guarantees a variance bound of \(\sigma ^{2}\).
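As a quick sanity check on this choice of \(p\) (an illustrative sketch, not part of the proof, which assumes the pre-scaling variance enters as \((4\varsigma \epsilon )^2(1-p)/p\)), the stated setting of \(1/p\) yields a variance of exactly \(\sigma ^2\):

```python
# Illustrative check: assuming the variance of the scaled instance is
# (4*varsigma*eps)^2 * (1 - p) / p, the choice 1/p = sigma^2/(4*varsigma*eps)^2 + 1
# makes it exactly sigma^2. The parameter values below are arbitrary.
varsigma, eps, sigma = 1e3, 0.01, 2.0
a = (4 * varsigma * eps) ** 2
p = 1.0 / (sigma ** 2 / a + 1.0)
variance = a * (1.0 - p) / p  # equals a * (1/p - 1) = sigma^2
assert abs(variance - sigma ** 2) < 1e-9
```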
Next, let \({\mathsf {O}}\) be an oracle for which \({\mathsf {O}}_{F^{\star }_{T,U}}(x,z)=(F^{\star }_{T,U}(x),g^{\star }_{T,U}(x,z))\) for all \(U\in \mathsf {Ortho}(d,T)\). Observe that for any \({\mathsf {A}}\in {\mathcal {A}}_{{\textsf {rand}}}(K)\), we may regard the sequence \(\big \{x^{(i,k)}_{{\mathsf {A}}[{\mathsf {O}}_{F^{\star }_{T,U}}]}/\lambda \big \}\) as queries made by an algorithm \({\mathsf {A}}'\in {\mathcal {A}}_{{\textsf {rand}}}(K)\) interacting with the unscaled oracle \({\mathsf {O}}_{\widehat{F}_{T,U}}(x,z)=(\widehat{F}_{T,U}(x),\widehat{g}_{T,U}(x,z))\). Instantiating Lemma 6 for \(\delta =\frac{1}{2}\), we have that with probability at least \(\frac{1}{2}\), \(\min _{k\in \left[K\right]}\big \Vert \nabla \widehat{F}_{T,U}\big ( \frac{1}{\lambda } x^{(t,k)}_{{\mathsf {A}}[{\mathsf {O}}_{F^{\star }_{T,U}}]} \big ) \big \Vert >\frac{1}{2}\) for all \(t \le {} \frac{T-2}{2p}\). Therefore,
by which it follows that
where the second inequality uses that \(\lfloor x\rfloor -2\ge {}x/4\) whenever \(x\ge {}4\).
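The elementary inequality used here can be spot-checked numerically (an illustrative check, not part of the proof):

```python
# Spot-check the elementary inequality floor(x) - 2 >= x/4 for all x >= 4,
# sampling x over [4, 40] on a fine grid. Analytically, floor(x) >= x - 1
# and x - 3 >= x/4 whenever x >= 4.
import math

for k in range(40000, 400001):
    x = k / 10000.0
    assert math.floor(x) - 2 >= x / 4
```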
Mean-squared smooth setting We use the scaling (51), choose \(p=\min \left\{ {(4\varsigma \epsilon )^2}/{\sigma ^2},1\right\} \) as above, and let
Using Lemma 7 and the calculation from the proof of Theorem 2, this setting guarantees that \({\mathsf {O}}_{F^{\star }_{T,U}}(x,z)\) is in the class \(\mathcal {O}(K,\sigma ^{2},{\bar{L}})\). Consequently, the inequality (52) implies the lower bound
When \(\frac{{\bar{L}}\Delta \sqrt{p}}{16{\bar{\ell }}_{1}\Delta _0\epsilon ^2}\ge 4\), we have \(T\ge 4\) and (53) along with \(\lfloor x\rfloor -2\ge {}x/4\) for \(x\ge 4\) gives
Moreover, we choose \(c'\) so that \(\epsilon \le \sqrt{\frac{{\bar{L}}\Delta }{64{\bar{\ell }}_{1}\Delta _0}}\le \sqrt{\frac{{\bar{L}}\Delta }{8}}\) holds. Lemma 11 then gives the lower bound
for a universal constant \(c_0\). Together, the bounds (53) and (54) imply the desired result when \(\frac{{\bar{L}}\Delta \sqrt{p}}{16{\bar{\ell }}_{1}\Delta _0\epsilon ^2}\ge 4\). As we argue in the proof of Theorem 2, in the complementary case \(\frac{{\bar{L}}\Delta \sqrt{p}}{16{\bar{\ell }}_{1}\Delta _0\epsilon ^2}< 4\), the bound (54) dominates (53), and consequently the result holds there as well.
C Proofs from Section 5
C.1 Statistical learning oracles
To prove the mean-squared smoothness properties of the construction (32) we must first argue about the continuity of \(\nabla \Theta _i\), where \(\Theta _i:{\mathbb {R}}^T \rightarrow {\mathbb {R}}\) is the “soft indicator” function given by
Lemma 15
For all \(i\ge {}j\), \(\nabla _i\Theta _j(x)\) is well-defined with
Moreover, \(\Theta _j\) satisfies the following properties:
1. \(\Vert \nabla {}\Theta _j(x)\Vert \le {}6^{2}\).
2. \(\Vert \nabla \Theta _j(x)-\nabla \Theta _j(y)\Vert \le {}10^4\cdot \Vert x-y\Vert \).
Proof of Lemma 15
First, we verify that the function \(x_i\mapsto {}\Vert \Gamma (\left|x_{\ge {}j}\right|)\Vert \) is differentiable everywhere for each i. Given this, Observation 1 implies that \(\Theta _j(x)\) is differentiable, and (55) follows from the chain rule. Let \(i\ge {}j\), and let \(a=\sqrt{\sum _{k\ge {}j,k\ne {}i}\Gamma ^{2}(\left|x_k\right|)}\). Then \(\Vert \Gamma (\left|x_{\ge {}j}\right|)\Vert =\sqrt{a^{2}+\Gamma ^{2}\left(\left|x_i\right|\right)}\). This function is clearly differentiable with respect to \(x_i\) when \(a>0\), and when \(a=0\) it is equal to \(\Gamma (\left|x_i\right|)\), which is also differentiable.
Property 1 follows because for all j,
where we have used Observation 1.3.
To prove Property 2, we restrict to the case \(j=1\) so that \(x_{\ge {}j}=x\) and subsequently drop the ‘\(\ge {}j\)’ subscript to simplify notation; the case \(j>1\) follows as an immediate consequence. Define \(\mu (x)\in \mathbb {R}^{T}\) via \(\mu _{i}(x) = \Gamma (\left|x_i\right|)\Gamma '(\left|x_i\right|)\mathrm {sgn}(x_i)\). Assume without loss of generality that \(0<\Vert \Gamma (\left|x\right|)\Vert \le {}\Vert \Gamma (\left|y\right|)\Vert \). By the triangle inequality, we have
To proceed, we state some useful facts, all of which follow from Observation 1.3:
1. \(\Gamma \) is 6-Lipschitz.
2. \(\Gamma '\) is 128-Lipschitz, and in particular \(\Gamma '(1-\Vert \Gamma (\left|x\right|)\Vert )\le {}128\cdot \Vert \Gamma (\left|x\right|)\Vert \) (since \(\Gamma '(1)=0\)).
3. \(\Vert \mu (x)\Vert \le {}6\cdot \Vert \Gamma (\left|x\right|)\Vert \) for all x.
4. \(\Vert \mu (x)-\mu (y)\Vert \le {}(128\cdot 1 + 6^2)\cdot \Vert x-y\Vert =164\cdot \Vert x-y\Vert \) for all x, y.
Using the first, second, and third facts, we bound the first term as
For the second term, we apply the second fact and the triangle inequality to upper bound by
Using the fourth fact and the assumption that \(\Vert \Gamma (\left|x\right|)\Vert \le {}\Vert \Gamma (\left|y\right|)\Vert \), we have
Using the third fact and \(\Vert \Gamma (\left|x\right|)\Vert \le {}\Vert \Gamma (\left|y\right|)\Vert \), we have
Gathering all of the constants, this establishes that
\(\square \)
We are now ready to prove Lemma 8. For ease of reference, we restate the construction (32):
where
Proof of Lemma 8
To begin, we introduce some shorthand. Define
The gradient of the noiseless hard function \(F_T\) can then be written as
Next, define
With these definitions, we have the expression
We begin by noting that \(\nabla f_T\) is unbiased for \(\nabla F_T\): since \({\mathbb {E}}\left[\nu _i(x,z)\right]=1\) for all i and \({\mathbb {E}}\left[\tfrac{z}{p}\right]=1\), it follows immediately from (58) that \({\mathbb {E}}\left[\nabla f_T(x,z)\right]=\nabla {}F_T(x)\).
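The facts about \(z\sim \mathrm {Bernoulli}(p)\) used here (and in the variance argument below) can be verified by direct computation over the two outcomes (an illustrative check, not part of the proof):

```python
# For z ~ Bernoulli(p): E[z/p] = 1 and E[(z/p - 1)^2] = (1-p)/p <= 1/p.
# Computed exactly over the two outcomes z in {0, 1}.
for p in [0.01, 0.1, 0.5, 0.9]:
    mean = p * (1 / p) + (1 - p) * 0                          # E[z/p]
    second = p * (1 / p - 1) ** 2 + (1 - p) * (0 - 1) ** 2    # E[(z/p - 1)^2]
    assert abs(mean - 1) < 1e-12
    assert abs(second - (1 - p) / p) < 1e-9
    assert second <= 1 / p
```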
Next, we show that \(\nabla f_T\) is a probability-p zero-chain, with an argument analogous to the proof of Lemma 4. First, we claim that \(\left[\nabla {}f_T(x,z)\right]_i=0\) for all x, z and \(i>1+\mathrm {prog}_{\frac{1}{4}}(x)\), yielding \(\mathrm {prog}_{0}(\nabla {}f_T(x,z))\le 1+\mathrm {prog}_{\frac{1}{4}}(x)\). For such i we have \(\left|x_{i-1}\right|,\left|x_i\right|<1/4\), so it follows from (57) that \(g_i(x,z)=0\) and from (55) that \(\nabla {}_i\Theta _j(x)=0\) for all j, establishing the first claim. Now, consider the case \(i=\mathrm {prog}_{\frac{1}{4}}(x)+1\) and \(z=0\). Here (since \(\left|x_i\right|<1/4\)) we still have \(\nabla {}_i\Theta _j(x)=0\) for all j, so \(\nabla _if_T(x,z)=g_i(x,z)\). Since \(\Gamma (\left|x_{\ge {}i}\right|)=\Gamma (\left|x_{\ge {}i+1}\right|)=0\), we have \(\nu _{i}(x,0)=\nu _{i+1}(x,0)=0\), so \(g_i(x,z)=0\). It follows immediately that \(\mathrm {prog}_{0}(\nabla {}f_T(x,0))\le \mathrm {prog}_{\frac{1}{4}}(x)\) for all x. Finally, examining the definition (32) of \(f_T\), it is straightforward to verify that \(f_T(y,z) = f_T(y_{\le 1+\mathrm {prog}_{\frac{1}{4}}(x)}, z)\) for all y in a neighborhood of x, and all x and z. This implies \(f_T(x,z) = f_T(x_{\le 1+\mathrm {prog}_{\frac{1}{4}}(x)}, z)\) and, via differentiation, \(\nabla f_T(x,z) = \nabla f_T(x_{\le 1+\mathrm {prog}_{\frac{1}{4}}(x)}, z)\). Similarly, one has \(f_T(y,0) = f_T(y_{\le \mathrm {prog}_{\frac{1}{4}}(x)}, 0)\) for y in a neighborhood of x, concluding the proof of the probabilistic zero-chain property.
To bound the variance and mean-squared smoothness of \(\nabla f_T\), we begin by analyzing the sparsity pattern of the error vector
Let \(i_x=\mathrm {prog}_{\frac{1}{2}}(x)+1\). Observe that if \(j<i_x\), we have \(\Vert \Gamma (\left|x_{\ge {}j}\right|)\Vert \ge {}\Gamma (\left|x_{i_{x}-1}\right|)\ge {}\Gamma (1/2)=1\), and so \(\Gamma '(1-\Vert \Gamma (\left|x_{\ge {}j}\right|)\Vert )=0\) and consequently \(\nabla _{i}\Theta _j(x)=0\) for all i. Note also that if \(j>i_x\), we have \(H(x_{j-1},x_j)=0\). We conclude that (58) simplifies to
As in Lemma 4, we have \(\nu _{i}(x,z)=1\) for all \(i<i_x\) and \(g_i(x,z)=\nabla {}_iF_T(x)=0\) for all \(i>i_x\). Thus, using the expression (57) along with (59), we have
It follows immediately that the variance can be bounded as
From (56) we have \(\Vert \nabla {}\Theta _{i_x}(x)\Vert \le {}6^2\), and from (36) we have \(|H(x,y)|\le {}12\), so the first term contributes at most \(\tfrac{2\cdot {}144\cdot {}6^{4}}{p}\). Since \(|\Theta _i(x)|\le {}1\), Lemma 2 implies that the second and third terms together contribute at most \(\tfrac{4\cdot {}23^{2}}{p}\). To conclude, we may take
where \(\varsigma \le {}10^{3}\).
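The arithmetic behind the bound \(\varsigma \le {}10^{3}\) is easily checked (an illustrative computation, assuming the two contributions above combine additively into a constant of the form \(\varsigma ^2/p\)):

```python
# Collecting the two contributions from the variance bound:
#   first term:            2 * 144 * 6^4 / p
#   second + third terms:  4 * 23^2 / p
# The total constant is 375364, and sqrt(375364) ~ 612.7 <= 10^3.
total = 2 * 144 * 6 ** 4 + 4 * 23 ** 2
assert total == 375364
assert total ** 0.5 <= 10 ** 3
```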
To bound the mean-squared smoothness \({\mathbb {E}}\Vert \nabla f_T(x,z) - \nabla f_T(y,z) \Vert ^2\), we first use that \({\mathbb {E}}\left[\delta (x ,z)\right]=0\), which implies
We have \(\Vert \nabla F_T(x) - \nabla F_T(y)\Vert \le \ell _{1} \Vert x-y\Vert \) by Lemma 2.2. For the other term, we use the sparsity pattern of \(\delta (x,z)\) established in (60) along with the fact that \({\mathbb {E}}\left(\tfrac{z}{p}-1\right)^{2}\le \tfrac{1}{p}\) to show
where \(i_y=\mathrm {prog}_{\frac{1}{2}}(y)+1\).
We bound \(\mathcal {E}_1\) and \(\mathcal {E}_2\) using similar arguments to Lemma 4. Focusing on \(\mathcal {E}_1\), and letting \(i\in \{i_x,i_y\}\) be fixed, we have
Note that by Lemma 15, (i) \(\Theta _i\) is \(6^2\)-Lipschitz and \(\left|\Theta _i\right|\le {}1\), and (ii) \(h_1\) is 23-Lipschitz and \(\left|h_1\right|\le {}5\) (from Observation 2 and Lemma 2). Consequently,
Since \(h_{2}\) is 23-Lipschitz and has \(\left|h_2\right|\le {}20\), an identical argument also yields that
To bound \(\mathcal {E}_3\), we use the earlier observation that for all i and \(j\ne i_x\) we have \(H(x_{j-1},x_j)\nabla _i\Theta _j(x)=0\), and likewise that \(H(y_{j-1},y_j)\nabla _i\Theta _j(y)=0\) for all \(j\ne {}i_y\). This allows us to write
Letting \(j\in \{i_x,i_y\}\) be fixed, we upper bound the inner summation as
We may now upper bound this quantity by applying the following basic results:
1. \(H(x_{j-1},x_j)\le {}12\) by (36).
2. \(\left|H(x_{j-1},x_j)-H(y_{j-1},y_{j})\right|\le {}20\Vert x-y\Vert \), by (36).
3. \(\Vert \nabla \Theta _j(y)\Vert \le {}6^{2}\) by Lemma 15.1.
4. \(\Vert \nabla \Theta _j(x)-\nabla \Theta _j(y)\Vert \le {}10^4\cdot \Vert x-y\Vert \), by Lemma 15.2.
It follows that \(\mathcal {E}_3\le {}3\cdot {}10^{10}\cdot {}\Vert x-y\Vert ^{2}\). Collecting the bounds on \(\mathcal {E}_1\), \(\mathcal {E}_2\), and \(\mathcal {E}_3\), this establishes that
with \({\bar{\ell }}_{1} \le {} \sqrt{10^{11} + \ell _{1}^2}\). \(\square \)
C.2 Active oracles
Proof of Lemma 9
Denoting
we see that the equality \({\mathbb {P}}(\gamma ^{(t)} - \gamma ^{(t-1)} \notin \{0,1\}| \mathcal {G}^{(t-1)})=0\) holds for our setting as well. Moreover, we claim that
Given the bound (61), the remainder of the proof is identical to that of Lemma 1, with 2p replacing p. To see why (61) holds, let \((x^{(1)},i^{(1)}),\ldots ,(x^{(t)},i^{(t)})\in \mathcal {G}^{(t-1)}\) denote the sequence of queries made by the algorithm. We first observe that, by the construction of \(g_\pi \), we have \(\gamma ^{(t)} = 1+ \gamma ^{(t-1)}\) only if \(\zeta _{1+\gamma ^{(t-1)}}(\pi (i^{(t)}))=1\). Therefore,
Next, let \(b\in \{0,1\}^{N^T}\) denote a (random) vector whose ith entry is \(b_i :=\zeta _{1+\gamma ^{(t-1)}}(\pi (i))\). The vector b has \(N^{T-1}\) elements equal to 1 and its distribution is permutation invariant. Note that, by construction, the vector b is independent of \(\{\zeta _j(\pi (i))\}_{j\ne 1+\gamma ^{(t-1)},i\in [N^T]}\). Consequently, the gradient estimates \(g^{(1)},\ldots ,g^{(t-1)}\) depend on b only through their \((1+\gamma ^{(t-1)})\)th coordinate, which for iterate \(t'\le t-1\) is
From this expression we see that \(g^{(t')}\) depends on b only for index queries in the set
Moreover, for every \(i\in S^{(t-1)}\) we have that \(b_i=0\), because otherwise there exists \(t'<t\) such that \(g_{1+\gamma ^{(t-1)}}^{(t')} \ne 0\) which gives the contradiction \(\gamma ^{(t-1)}\ge \gamma ^{(t')}\ge \mathrm {prog}_{0}(g^{(t')})\ge 1+\gamma ^{(t-1)}>\gamma ^{(t-1)}\). In conclusion, we have for every \(i\in [N^T]\)
where the last equality follows from the permutation invariance of b.
Combining the observations above with the fact that \(|S^{(t-1)}| \le t-1 \le \frac{T}{4p} \le \frac{1}{4}NT \le \frac{1}{2}N^T\) gives the desired result (61), since
We remark that the argument above depends crucially on using a different bit for every coordinate. Indeed, had we instead used the original construction \(g_T\) in Eq. (18) and set \(g_\pi (x;i)=g_T(\zeta _1(\pi (i)))\), an algorithm that queried roughly N random indices would find an index \(i^\star \) such that \(\zeta _1(\pi (i^\star ))=1\) and could then continue to query it exclusively, achieving a unit of progress at every query. This would decrease the lower bound from \(\Omega (T/p)=\Omega (NT)\) to \(\Omega (N+T)\). \(\square \)
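The failure mode described in this remark can be illustrated with a small simulation (a sketch under assumed parameters, not part of the proof): a fraction \(N^{T-1}/N^{T}=1/N\) of indices satisfy \(\zeta _1(\pi (i))=1\), so uniformly random index queries find such an index after roughly \(N\) attempts on average.

```python
# Illustration of the remark: among N^T indices, N^(T-1) have the first bit
# set, a fraction of 1/N. The number of uniformly random index queries needed
# to hit one is geometric with mean N, so with the single-bit construction an
# algorithm would need only about N + T queries overall. Parameters are
# arbitrary illustrative choices.
import random

random.seed(0)
N, T, trials = 16, 4, 20000
good = N ** (T - 1)   # number of indices playing the role of bit-1 indices
total = N ** T
counts = []
for _ in range(trials):
    draws = 1
    while random.randrange(total) >= good:
        draws += 1
    counts.append(draws)
avg = sum(counts) / trials
assert abs(avg - N) / N < 0.1  # empirical mean of queries is close to N
```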
Arjevani, Y., Carmon, Y., Duchi, J.C. et al. Lower bounds for non-convex stochastic optimization. Math. Program. 199, 165–214 (2023). https://doi.org/10.1007/s10107-022-01822-7