Abstract
We lower bound the complexity of finding \(\epsilon \)-stationary points (with gradient norm at most \(\epsilon \)) using stochastic first-order methods. In a well-studied model where algorithms access smooth, potentially non-convex functions through queries to an unbiased stochastic gradient oracle with bounded variance, we prove that (in the worst case) any algorithm requires at least \(\epsilon ^{-4}\) queries to find an \(\epsilon \)-stationary point. The lower bound is tight, and establishes that stochastic gradient descent is minimax optimal in this model. In a more restrictive model where the noisy gradient estimates satisfy a mean-squared smoothness property, we prove a lower bound of \(\epsilon ^{-3}\) queries, establishing the optimality of recently proposed variance reduction techniques.
Notes
As is common in the optimization literature, we describe algorithms which use random coins in their execution as “randomized,” as opposed to “deterministic” algorithms which do not. Likewise, we distinguish between “noiseless” and “stochastic” first-order oracles, which provide exact and noisy gradient information, respectively.
See also the K-parallel model of [35].
The iterates of SPIDER and SNVRG are a linear combination of previously computed gradients, and therefore these algorithms are zero-respecting.
The event holds with probability at least \(1-\delta \) with respect to the random choice of U and the oracle seeds \(\{z^{(t)}\}\), even when conditioned over any randomness in \({\mathsf {A}}\).
Available at arxiv.org/abs/1912.02365v2.
References
Agarwal, A., Bartlett, P.L., Ravikumar, P., Wainwright, M.J.: Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory 58(5), 3235–3249 (2012)
Allen-Zhu, Z.: How to make the gradients small stochastically: even faster convex and nonconvex SGD. In: Advances in Neural Information Processing Systems, pp. 1165–1175 (2018a)
Allen-Zhu, Z.: Natasha 2: Faster non-convex optimization than SGD. In: Advances in Neural Information Processing Systems, pp. 2675–2686 (2018b)
Allen-Zhu, Z., Hazan, E.: Variance reduction for faster non-convex optimization. In: International Conference on Machine Learning, pp. 699–707 (2016)
Allen-Zhu, Z., Li, Y.: Neon2: finding local minima via first-order oracles. In: Advances in Neural Information Processing Systems, pp. 3716–3726 (2018)
Arjevani, Y.: Limitations on variance-reduction and acceleration schemes for finite sums optimization. In: Advances in Neural Information Processing Systems, pp. 3540–3549 (2017)
Arjevani, Y., Shamir, O.: Dimension-free iteration complexity of finite sum optimization problems. In: Advances in Neural Information Processing Systems, pp. 3540–3548 (2016)
Arjevani, Y., Carmon, Y., Duchi, J.C., Foster, D.J., Sekhari, A., Sridharan, K.: Second-order information in non-convex stochastic optimization: power and limitations. In: Conference on Learning Theory, pp. 242–299. PMLR (2020)
Ball, K.: An elementary introduction to modern convex geometry. In: Levy, S. (ed.) Flavors of Geometry, pp. 1–58. MSRI Publications (1997)
Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Advances in Neural Information Processing Systems, pp. 161–168 (2008)
Bottou, L., Curtis, F., Nocedal, J.: Optimization methods for large-scale learning. SIAM Rev. 60(2), 223–311 (2018)
Braun, G., Guzmán, C., Pokutta, S.: Lower bounds on the oracle complexity of nonsmooth convex optimization via information theory. IEEE Trans. Inf. Theory 63(7), 4709–4724 (2017)
Bubeck, S., Jiang, Q., Lee, Y.T., Li, Y., Sidford, A.: Complexity of highly parallel non-smooth convex optimization. In: Advances in Neural Information Processing Systems 32 (2019)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Convex until proven guilty: Dimension-free acceleration of gradient descent on non-convex functions. In: Proceedings of the 34th International Conference on Machine Learning, pp. 654–663 (2017)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points I. Math. Progr. 184(1), 71–120 (2019)
Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Lower bounds for finding stationary points II: First-order methods. Math. Progr. 185(1), 315–355 (2021)
Cartis, C., Gould, N.I., Toint, P.L.: On the complexity of steepest descent, Newton's and regularized Newton's methods for nonconvex unconstrained optimization problems. SIAM J. Opt. 20(6), 2833–2852 (2010)
Cartis, C., Gould, N.I., Toint, P.L.: Complexity bounds for second-order optimality in unconstrained optimization. J. Complex. 28(1), 93–108 (2012)
Cartis, C., Gould, N.I., Toint, P.L.: How much patience do you have?: a worst-case perspective on smooth nonconvex optimization. Optima 88, 1–10 (2012)
Cartis, C., Gould, N.I., Toint, P.L.: Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization. arXiv preprint arXiv:1709.07180 (2017)
Cutkosky, A., Orabona, F.: Momentum-based variance reduction in non-convex SGD. Adv. Neural Inf. Process. Syst. (2019)
Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems 27, (2014)
Diakonikolas, J., Guzmán, C.: Lower bounds for parallel and randomized convex optimization. In: Proceedings of the Thirty Second Annual Conference on Computational Learning Theory (2019)
Drori, Y., Shamir, O.: The complexity of finding stationary points with stochastic gradient descent. arXiv preprint arXiv:1910.01845 (2019)
Fang, C., Li, C.J., Lin, Z., Zhang, T.: Spider: near-optimal non-convex optimization via stochastic path-integrated differential estimator. In: Advances in Neural Information Processing Systems, pp. 689–699 (2018)
Fang, C., Lin, Z., Zhang, T.: Sharp analysis for nonconvex SGD escaping from saddle points. In: Beygelzimer, A., Hsu, D., (eds) Proceedings of the Thirty-Second Conference on Learning Theory, vol. 99, pp. 1192–1234. PMLR (2019)
Foster, D.J., Sekhari, A., Shamir, O., Srebro, N., Sridharan, K., Woodworth, B.: The complexity of making the gradient small in stochastic convex optimization. In: Proceedings of the Thirty-Second Conference on Learning Theory, pp. 1319–1345 (2019)
Ge, R., Huang, F., Jin, C., Yuan, Y.: Escaping from saddle points: online stochastic gradient for tensor decomposition. In: Conference on Learning Theory, pp. 797–842 (2015)
Ge, R., Lee, J.D., Ma, T.: Matrix completion has no spurious local minimum. In: Advances in Neural Information Processing Systems, pp. 2973–2981 (2016)
Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Opt. 23(4), 2341–2368 (2013)
LeCam, L.: Convergence of estimates under dimensionality restrictions. Ann. Stat. 1(1), 38–53 (1973)
Lei, L., Ju, C., Chen, J., Jordan, M.I.: Non-convex finite-sum optimization via SCSG methods. In: Advances in Neural Information Processing Systems, pp. 2348–2358 (2017)
Ma, C., Wang, K., Chi, Y., Chen, Y.: Implicit regularization in nonconvex statistical estimation: gradient descent converges linearly for phase retrieval, matrix completion and blind deconvolution. Found. Comput. Math. (2019). https://doi.org/10.1007/s10208-019-09429-9
Murty, K.G., Kabadi, S.N.: Some NP-complete problems in quadratic and nonlinear programming. Math. Progr. 39(2), 117–129 (1987)
Nemirovski, A.: On parallel complexity of nonsmooth convex optimization. J. Complex. 10(4), 451–463 (1994)
Nemirovski, A., Yudin, D.B.: Problem Complexity and Method Efficiency in Optimization. Wiley (1983)
Nesterov, Y.: Introductory Lectures on Convex Optimization. Kluwer Academic Publishers (2004)
Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton method and its global performance. Math. Progr. 108(1), 177–205 (2006)
Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/k^2)\). Sov. Math. Dokl. 27(2), 372–376 (1983)
Nocedal, J., Wright, S.: Numerical Optimization. Springer Science & Business Media (2006)
Raginsky, M., Rakhlin, A.: Information-based complexity, feedback and dynamics in convex programming. IEEE Trans. Inf. Theory 57(10), 7036–7056 (2011)
Reddi, S.J., Hefny, A., Sra, S., Poczos, B., Smola, A.: Stochastic variance reduction for nonconvex optimization. In: International Conference on Machine Learning, pp. 314–323 (2016)
Schmidt, M., Roux, N.L., Bach, F.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems 24 (2011)
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res. 14, 567–599 (2013)
Sun, J., Qu, Q., Wright, J.: A geometric analysis of phase retrieval. Found. Comput. Math. 18(5), 1131–1198 (2018)
Traub, J.F., Wasilkowski, G.W., Woźniakowski, H.: Information-Based Complexity. Academic Press (1988)
Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 2899–2908 (2018)
Vavasis, S.A.: Black-box complexity of local minimization. SIAM J. Opt. 3(1), 60–80 (1993)
Wang, Z., Ji, K., Zhou, Y., Liang, Y., Tarokh, V.: Spiderboost: a class of faster variance-reduced algorithms for nonconvex optimization. arXiv preprint arXiv:1810.10690 (2018)
Woodworth, B., Srebro, N.: Tight complexity bounds for optimizing composite objectives. In: Advances in Neural Information Processing Systems, pp. 3639–3647 (2016)
Woodworth, B., Srebro, N.: Lower bound for randomized first order convex optimization. arXiv preprint arXiv:1709.03594 (2017)
Xu, Y., Rong, J., Yang, T.: First-order stochastic algorithms for escaping from saddle points in almost linear time. In: Advances in Neural Information Processing Systems, pp. 5530–5540 (2018)
Yao, A.C.-C.: Probabilistic computations: toward a unified measure of complexity. In: 18th Annual Symposium on Foundations of Computer Science, pp. 222–227. IEEE (1977)
Yu, B.: Assouad, Fano, and Le Cam. In: Festschrift for Lucien Le Cam, pp. 423–435. Springer (1997)
Zhou, D., Gu, Q.: Lower bounds for smooth nonconvex finite-sum optimization. In: International Conference on Machine Learning (2019)
Zhou, D., Xu, P., Gu, Q.: Stochastic nested variance reduction for nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 3925–3936. Curran Associates Inc. (2018)
Zhou, D., Xu, P., Gu, Q.: Stochastic nested variance reduction for nonconvex optimization. J. Mach. Learn. Res. (2020)
Acknowledgements
Part of this work was completed while the authors were visiting the Simons Institute for the Foundations of Deep Learning program. We thank Ayush Sekhari, Ohad Shamir, Aaron Sidford and Karthik Sridharan for several helpful discussions. YC was supported by the Stanford Graduate Fellowship. JCD acknowledges support from NSF CAREER award 1553086, the Sloan Foundation, and ONR-YIP N00014-19-1-2288. DF was supported by NSF TRIPODS award #1740751. BW was supported by the Google PhD Fellowship program.
Appendices
Appendix
A Proofs from Section 3
1.1 A.1 Basic technical results
Before proving the main results from Sect. 3, we first state two self-contained technical results that will be used in subsequent proofs. The first result bounds component functions \(\Psi \) and \(\Phi \) and gives the calculation for the parameter \(\ell _1\) in Lemma 2.2.
Observation 2
The functions \(\Psi \) and \(\Phi \) in (17) and their derivatives satisfy
Proof of Lemma 2.2
We note that the Hessian of \(F_T\) is tridiagonal. Consequently, for any \(x\in {\mathbb {R}}^d\),
where (i) is a direct calculation using the definition (16) of \(F_T\) and (ii) follows from (36). \(\square \)
The second result is an \(\Omega (\frac{\sigma ^{2}}{\epsilon ^{2}})\) lower bound on the sample complexity of finding stationary points whenever \(\epsilon \le {}O(\sqrt{\Delta {}L})\). This result handles an edge case in the proof of Theorem 2. A similar lower bound appeared in [27], but the result we prove here is slightly stronger because it holds even for dimension \(d=1\).
Lemma 10
There exists a number \(c_0 > 0\) such that for any number of simultaneous queries K, dimension d and \(\epsilon \le {}\sqrt{\frac{{\bar{L}}\Delta }{8}}\), we have
Our approach for proving Lemma 10 is as follows. Given a dimension d, we construct a function \(F:\mathbb {R}^{d}\rightarrow \mathbb {R}\), a family of distributions \(P_z\), and a family of functions f(x, z) for which \(F(x)={\mathbb {E}}_{z}\left[f(x,z)\right]\), and for which the initial suboptimality, variance, and mean-squared smoothness are bounded by \(\Delta , \sigma ^2\) and \({\bar{L}}\), respectively. We then prove a lower bound in the global stochastic model in which at round t the oracle returns the full function \(f(\cdot , z^{(t)})\), rather than just its value and derivatives at the queried point. The global stochastic model is more powerful than the K-query stochastic first-order model (with \(g(x,z)=\nabla _x f(x,z)\)) for every value of K, so this will imply the claimed result as a special case.
Lemma 11
Whenever \(\epsilon \le {}\sqrt{\frac{{\bar{L}}\Delta }{8}}\), the number of samples required to obtain an \(\epsilon \)-stationary point in the global stochastic model defined above is \(\Omega (1) \cdot \frac{\sigma ^{2}}{\epsilon ^{2}}\).
Proof of Lemma 11
The proof follows standard arguments used to derive information-theoretic lower bounds for statistical estimation [31, 54].
We consider a family of functions \(f:{\mathbb {R}}^d\times {\mathbb {R}}\rightarrow {\mathbb {R}}\) given by
where \(r\in (0,\sqrt{2\Delta /{\bar{L}}})\) is a fixed parameter. We take \(P_z\) to have the form \(P_z^{s}:=\mathcal {N}(rs,\frac{\sigma ^2}{{\bar{L}}^2})\), where \(s\in \{-1,1\}\), and let \(\theta _s :=(rs,0,\dots , 0)\in {\mathbb {R}}^d\). Then, when \(P_z=P_z^{s}\), we have \(F_{s}(x) :={\mathbb {E}}_z\left[f(x,z)\right] =\frac{{\bar{L}}}{2} \Vert x-\theta _s\Vert ^2\), and furthermore for any \(x,y\in \mathbb {R}^{d}\) we have
Note that \(F_s\) is indeed \({\bar{L}}\)-smooth, and has initial suboptimality at \(x^{(0)}=0\) bounded as \(F_s(0)-\inf _{x\in {\mathbb {R}}^d} F_s(x)={\bar{L}}r^2/2\le \Delta \).
Now, we provide a distribution over the underlying instance by drawing S uniformly from \(\{\pm {}1\}\), and consider any algorithm that takes as input samples \(z_1,\ldots ,z_T\sim P_z^S\), and returns iterate \({\hat{x}}\). To bound the expected norm of the gradient at \({\hat{x}}\) (over the randomness of the oracle, the randomness of the algorithm, and the choice of the underlying instance S), we define \({\hat{S}} :={{\,\mathrm{arg\,min}\,}}_{s'\in \{1,-1\}} \Vert \nabla F_{s'}({\hat{x}})\Vert \), with ties broken arbitrarily. Observe that we have
where (i) follows by Markov’s inequality and (ii) follows because when \({\hat{S}}\ne {}S\), the definition of \({\hat{S}}\) implies
Next, for \(s\in \{\pm {}1\}\) let \({\mathbb {P}}_{s}=\mathcal {N}^{\otimes T}(rs,\frac{\sigma ^2}{{\bar{L}}^2})\) denote the law of \((z_1,\dots ,z_T)\) conditioned on \(S=s\). We have
where the penultimate step follows by Pinsker’s inequality and the last step uses that \({\mathbb {P}}_{s}=\mathcal {N}^{\otimes T}(rs,\frac{\sigma ^2}{{\bar{L}}^2})\). Combining this lower bound with (39) yields
Finally, setting \(r= \min \{\frac{\sigma }{2{\bar{L}}\sqrt{T}},\sqrt{\frac{2\Delta }{{\bar{L}}}}\}\) implies
Stated equivalently, whenever \(\epsilon \le \sqrt{{{\bar{L}}\Delta }/{8}}\), there exists \(s\in \{-1,1\}\) such that the number of oracle calls T required to ensure \({\mathbb {E}}[\Vert \nabla F_{s}({\hat{x}})\Vert ]\le \epsilon \) satisfies
concluding the proof. \(\square \)
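The statistical content of Lemma 11 can be checked numerically. For the hard instance \(F_s(x)=\frac{{\bar{L}}}{2}\Vert x-\theta _s\Vert ^2\), the natural strategy estimates \(rs\) by the sample mean of \(z_1,\dots ,z_T\sim \mathcal {N}(rs,\sigma ^2/{\bar{L}}^2)\), giving an expected gradient norm on the order of \(\sigma /\sqrt{T}\); driving it below \(\epsilon \) therefore requires \(T\gtrsim \sigma ^2/\epsilon ^2\) samples. A minimal simulation of this scaling (a sketch under our own parameter choices, not part of the proof):

```python
import math
import random

def grad_norm_after_T_samples(T, sigma=1.0, Lbar=1.0, r=0.1, trials=2000, seed=0):
    """Average gradient norm Lbar * |xhat - r*s| when xhat is the sample mean
    of T draws from N(r*s, (sigma/Lbar)^2) -- the natural estimator for the
    one-dimensional hard instance F_s(x) = (Lbar/2) * (x - r*s)^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = rng.choice((-1, 1))
        xhat = sum(rng.gauss(r * s, sigma / Lbar) for _ in range(T)) / T
        total += Lbar * abs(xhat - r * s)
    return total / trials

# Quadrupling T should roughly halve the average gradient norm.
e16, e64 = grad_norm_after_T_samples(16), grad_norm_after_T_samples(64)
print(e16, e64, e16 / e64)
```

The observed ratio near 2 is consistent with the \(\sigma /\sqrt{T}\) rate, i.e., with sample complexity \(\Omega (\sigma ^2/\epsilon ^2)\).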
1.2 A.2 Proof of Lemma 4
First, we note that \({\mathbb {E}}\left[\nu _i (x,z)\right]=1\) for all x and i, and therefore \({\mathbb {E}}\left[{\bar{g}}_T(x,z)\right] = \nabla F_T(x)\). To show the probabilistic zero-chain property, note that, due to Observation 1.1, we have \(\nu _i(x,z) = \nu _i(x_{\le \mathrm {prog}_{\frac{1}{4}}(x)},z)\) for all x and z. Moreover, for \(i\ge 1+\mathrm {prog}_{\frac{1}{4}}(x)\) we have \(\Gamma (|x_{\ge i}|) = 0\) and therefore \(\Theta _i(x)=\Gamma (1)=1\) and \(\nu _i(x,0)=0\).
With these observations, the proof of the zero-chain property is analogous to its proof in Lemma 3: for all x and z we have \(\mathrm {prog}_{0}({\bar{g}}_T(x,z)) \le 1+\mathrm {prog}_{\frac{1}{4}}(x)\) (from Lemma 2.4) and \({\bar{g}}_T(x,z) = {\bar{g}}_T\big (x_{\le 1+\mathrm {prog}_{\frac{1}{4}}(x)},z\big )\) (from Lemma 2.5 and \(\nu _i(x,z) = \nu _i(x_{\le \mathrm {prog}_{\frac{1}{4}}(x)},z)\)), giving Eq. (12); for \(z=0\) we further have \(\mathrm {prog}_{0}({\bar{g}}_T(x,z)) \le \mathrm {prog}_{\frac{1}{4}}(x)\) (from \(\nu _i(x,0)=0\)) and \({\bar{g}}_T(x,z) = {\bar{g}}_T\big (x_{\le \mathrm {prog}_{\frac{1}{4}}(x)},z\big )\) (from Lemma 2.5 and \(\nu _i(x,z) = \nu _i(x_{\le \mathrm {prog}_{\frac{1}{4}}(x)},z)\)), giving Eq. (11).
To bound the variance of the gradient estimator we observe that for all \(i\le \mathrm {prog}_{\frac{1}{2}}(x)\), \(\Vert \Gamma (|x_{\ge i}|)\Vert \ge \Gamma (1/2)=1\) and therefore \(\Theta _{i}(x)=0\) and \(\nu _i(x,z)=1\), so that
On the other hand, Lemma 2.4 gives us that
We conclude that \(\delta (x,z) = {\bar{g}}_T(x,z) - \nabla F_T(x)\) has at most a single nonzero entry in coordinate \(i_x = \mathrm {prog}_{\frac{1}{2}}(x)+1\). Moreover, for every i
Therefore,
where the final transition used Lemma 2.3 and \(\Theta _i^2(x)\le 1\) for all x and i, establishing the variance bound in (23) with \(\varsigma = 23\).
To bound \({\mathbb {E}}\Vert {\bar{g}}_T(x,z) - {\bar{g}}_T(y,z) \Vert ^2\), we use that \({\mathbb {E}}\left[\delta (\cdot ,z)\right]=0\) and that \(\delta (\cdot ,z)\) has at most one nonzero coordinate to write
where \(i_y=\mathrm {prog}_{\frac{1}{2}}(y)+1\) is the nonzero index of \(\delta (y,z)\). For any \(i\le T\), we have
By Observation 1.3, \(\Gamma _i\) is 6-Lipschitz. Since the Euclidean norm \(\Vert \cdot \Vert \) is 1-Lipschitz, we have
That is, \(\Theta _i\) is \(6^2\)-Lipschitz. Since \(\Theta _i^2(y) \le 1\) and \((\nabla _i F_T(x))^2 \le 23^2\) by Lemma 2.3, we have
for all i. Substituting back into (40) we obtain
Recalling that \(\Vert \nabla F_T(x) - \nabla F_T(y)\Vert \le \ell _{1} \Vert x-y\Vert \) by Lemma 2.2, establishes the mean-square smoothness bound in (23) with \({\bar{\ell }}_{1} = \sqrt{2\cdot (\varsigma \cdot 6)^2 + 3\ell _{1}^2}\).
1.3 A.3 Proof of Theorem 2
Let \(\Delta _0, \ell _1,\varsigma \) and \({\bar{\ell }}_{1}\) be the numerical constants in Lemmas 2.1, 2.2 and 4, respectively. Let the accuracy parameter \(\epsilon \), initial suboptimality \(\Delta \), mean-squared smoothness parameter \({\bar{L}}\), and variance parameter \(\sigma ^2 \) be fixed, and let \(L\le {\bar{L}}\) be specified later. We rescale \(F_T\) as in the proof of Theorem 1,
This guarantees that \(F^{\star }_T\in {\mathcal {F}}(\Delta , L)\) and that the corresponding scaled gradient estimator \(g^{\star }_T(x,z)=(L\lambda /\ell _{1}){\bar{g}}_T(x/\lambda ,z)\) is such that every zero respecting algorithm \({\mathsf {A}}\) interacting with \({\mathsf {O}}_{F^{\star }_T}(x,z)=(F^{\star }_T(x),g^{\star }_T(x,z))\) satisfies
for all \(t\le {} {(T-1)}/{2p}\) and \(k\in [K]\). It remains to choose p and \(L\) such that \({\mathsf {O}}_{F^{\star }_T}\) belongs to \(\mathcal {O}(K,\sigma ^{2},{\bar{L}})\). As in the proof of Theorem 1, setting \(\frac{1}{p}=\frac{\sigma ^2}{ (2\varsigma \epsilon )^2} + 1\) and using Lemma 4 guarantees a variance bound of \(\sigma ^{2}\). Moreover, by Lemma 4 we have
Therefore, taking
guarantees membership in the oracle class and implies the lower bound
We consider the cases \(\frac{{\bar{L}}\Delta \sqrt{p}}{4{\bar{\ell }}_{1}\Delta _0\epsilon ^2}\ge 3\) and \(\frac{{\bar{L}}\Delta \sqrt{p}}{4{\bar{\ell }}_{1}\Delta _0\epsilon ^2}<3\) separately. In the former case (which is the more interesting one), we use \(\lfloor x\rfloor -1\ge x/2\) for \(x \ge 3\) and the setting of p to write
Moreover, we choose \(c'=12{\bar{\ell }}_{1}\Delta _0\) so that \(\epsilon \le \sqrt{\frac{{\bar{L}}\Delta }{12{\bar{\ell }}_{1}\Delta _0}}\le \sqrt{\frac{{\bar{L}}\Delta }{8}}\) holds. By Lemma 11,
where \(c_0\) is a universal constant (this lower bound holds for any value of d). Together, the bounds (41) and (42) imply the desired result when \(\frac{{\bar{L}}\Delta \sqrt{p}}{4{\bar{\ell }}_{1}\Delta _0\epsilon ^2}\ge 3\).
Finally, we consider the edge case \(\frac{{\bar{L}}\Delta \sqrt{p}}{4{\bar{\ell }}_{1}\Delta _0\epsilon ^2}<3\). We note that the assumption \(\epsilon \le \sqrt{\frac{{\bar{L}}\Delta }{12{\bar{\ell }}_{1}\Delta _0}}\) precludes the option that \(p=1\) in this case. Therefore we must have \(\frac{{\bar{L}}\Delta \varsigma }{2{\bar{\ell }}_{1}\Delta _0\sigma \epsilon }<3\) or, equivalently, \(\frac{\sigma ^2}{\epsilon ^2} > \frac{\varsigma }{6{\bar{\ell }}_{1}\Delta _0} \cdot \frac{{\bar{L}}\Delta \sigma }{\epsilon ^3}\). Thus, in this case the bound (42) implies (41) up to a constant, concluding the proof.
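The rescaling step at the start of the proof uses a standard fact: if \(\nabla F\) is \(\ell _1\)-Lipschitz, then \(F^{\star }(x) :=\frac{L\lambda ^2}{\ell _1}F(x/\lambda )\) has gradient \(\nabla F^{\star }(x)=\frac{L\lambda }{\ell _1}\nabla F(x/\lambda )\), which is \(L\)-Lipschitz for any \(\lambda >0\). The sketch below checks this numerically on a toy one-dimensional choice \(F=\sin \) (our own example, not the construction \(F_T\)):

```python
import math

def scaled_grad_lipschitz_estimate(L=7.0, lam=3.0, n=20001, span=50.0):
    """Estimate the Lipschitz constant of the gradient of
    F*(x) = (L * lam^2 / ell1) * F(x / lam) for F = sin (so ell1 = 1),
    by maximizing finite-difference slopes of grad F* over a grid.
    The estimate should be close to (and never exceed) L."""
    ell1 = 1.0  # sup |sin''| = 1, so sin has 1-Lipschitz gradient
    def grad_star(x):
        return (L * lam / ell1) * math.cos(x / lam)  # chain rule
    h = span / (n - 1)
    best = 0.0
    for i in range(n - 1):
        x0 = -span / 2 + i * h
        best = max(best, abs(grad_star(x0 + h) - grad_star(x0)) / h)
    return best

print(scaled_grad_lipschitz_estimate())  # close to L = 7
```

By the mean value theorem, every finite-difference slope is bounded by \(\sup |(F^{\star })''| = L\), and the maximum over a fine grid approaches it.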
B Proofs from Section 4
1.1 B.1 Proof of Lemma 5
The proof combines the techniques in the proof of Lemma 1 with random-projection-based lower bounds on the sequential oracle complexity of optimization [15, 51], in their refined form [13, 23], which yields low-dimensional hard instances.
Let us adopt the shorthand \(x^{(i)}:={}x^{(i)}_{{\mathsf {A}}[{\mathsf {O}}_{{\tilde{F}}_U}]}\), which we recall is defined via
where r is the algorithm’s random seed. Following the proof strategy of Lemma 1, we define
and
recalling that, due to Definition 2, \(\{B^{(t)}\}_{t \ge 1}\) are i.i.d. Bernoulli with probability of success at most p, and are independent of any randomization in the algorithm \({\mathsf {A}}\). With the shorthand
We additionally define, for every \(t\ge 1\), the event
Writing
the claim of the lemma is equivalent to the statement that \({\mathbb {P}}(\pi ^{(T_p)} \ge T) \le \delta \). Since
it suffices to show that both
and
The bound (45) follows identically to Eq. (15) in the proof of Lemma 1, and so the remainder of the proof consists of establishing the bound (44).
We begin by rewriting the event \(\big [{\mathfrak {E}}^{(T_p)}\big ]^c\) as follows
Define the \(\sigma \)-field
where we recall that r is the algorithm’s random seed, and note that \(C^{(t)}\in \mathcal {F}\) for all \(t \le T_p\). Conditioning on \(\mathcal {F}\) and applying the union bound, we have
Therefore, to establish the bound (44) and with it the result, it suffices to bound the corresponding conditional probability for every \(t\le T_p\), \(k\in [K]\) and \(j>C^{(t)}\). To do so, we leverage the probabilistic zero-chain property in order to show the following.
Lemma 12
Fix \(t \ge 1\), and condition on \(\mathcal {F}\). If \({\mathfrak {E}}^{(t-1)}\) holds, then for every \(s \le t\) and \(k\in [K]\), \(x^{(s,k)}\) is measurable with respect to \(u^{(1)}, \ldots , u^{(C^{(s)})}\).
Proof
Throughout the proof, we adopt the shorthand \(U_{\le c}\) for \([u^{(1)}; \ldots u^{(c)}; 0, \ldots , 0]\), i.e., a version of U where the last \(T-c\) columns are replaced with zeros. We also recall the notation \(x_{\le i}\) for the replacement of all but the first i coordinates of x with zeros.
The crux of the proof is the following claim: for any \(s<t\) and \(k\in [K]\), if \(x^{(s,k)}\) is measurable w.r.t. \(U_{\le C^{(s)}}\) and \(\mathrm {prog}_{\frac{1}{4}}(U^\top x^{(s,k)}) \le C^{(s)}\), then the oracle response to query \(x^{(s,k)}\) is measurable w.r.t. \(U_{\le C^{(s+1)}}\). To see why this holds, let \(g^{(s,k)} = g( U^\top x^{(s,k)},z^{(s)})\) and note that Definition 2 of the probabilistic zero chain, along with definition (43) of the sequence \((B^{(t)})\), implies that
The assumption \(\mathrm {prog}_{\frac{1}{4}}(U^\top x^{(s,k)}) \le C^{(s)}\) implies that \(B^{(s)} + \mathrm {prog}_{\frac{1}{4}}(U^\top x^{(s,k)}) \le B^{(s)} + C^{(s)} = C^{(s+1)}\), and—noting that \([U^\top v]_{\le c} = U_{\le c}^\top v\)—we consequently have
Therefore, the oracle response to query \(x^{(s,k)}\) has the form
so that if \(x^{(s,k)}\) is measurable w.r.t. \(U_{\le C^{(s)}}\), then \({\mathsf {O}}_{{\tilde{F}}_U}(x^{(s,k)},z^{(s)})\) is measurable w.r.t. \(U_{\le C^{(s+1)}}\).
From here the lemma follows by straightforward induction. The base case \(t=1\) is trivial, since the algorithm’s initial queries do not depend on U. For the induction step, fix s and suppose that \(x^{(s',k')}\) is measurable w.r.t. \(U_{\le C^{(s')}}\) for all \(s' < s \le t\) and \(k'\in [K]\). That \({\mathfrak {E}}^{(t-1)}\) holds implies that \(\mathrm {prog}_{\frac{1}{4}}( U^\top x^{(s',k')}) \le C^{(s')}\) for all \(s'<t\), and hence by the discussion above we conclude that the oracle’s responses to all queries at iterations \(1,\ldots , s-1\) are measurable w.r.t. \(U_{\le C^{(s)}}\). Since \(x^{(s,k)}\) is a (measurable) function of r and the oracle responses up to iteration s, we conclude that it is measurable w.r.t. \(U_{\le C^{(s)}}\) as well. \(\square \)
From Lemma 12, we conclude that there exists a function \({\mathsf {f}}^{(t,k)}: ({\mathbb {R}}^d)^{C^{(t)}}\rightarrow \{x\in {\mathbb {R}}^d | \Vert x\Vert \le R\}\) (implicitly also dependent on \(\mathcal {F}\)), such that \(x^{(t,k)} = {\mathsf {f}}^{(t,k)}(u^{(1)},\ldots , u^{(C^{(t)})})\). Consequently,
Conditional on \(\mathcal {F}\) and \(u^{(1)}, \ldots , u^{(C^{(t)})}\), we have that \({\mathsf {f}}^{(t,k)}(u^{(1)},\ldots , u^{(C^{(t)})})\) is a fixed vector with norm at most R, while for every \(j > C^{(t)}\), the vector \(u^{(j)}\) is uniformly distributed on the \((d-C^{(t)})\)-dimensional unit sphere. Therefore, concentration of measure on the sphere (see, e.g., Lemma 2.2 of [9]) gives
where the last transition follows from our choice of \(d - T \ge 32R^2 \log \frac{2T^2 K}{p\delta } \ge 32R^2 \log \frac{4 T_p K T}{\delta }\). Substituting back into Eq. (46), we obtain the bound (44) and conclude the proof of Lemma 5.
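The concentration step is easy to probe empirically: for \(u\) uniform on the unit sphere in \(\mathbb {R}^d\) and a fixed vector of norm \(R\), the inner product has standard deviation about \(R/\sqrt{d}\), so when \(d\gg R^2\log (\text {number of draws})\) even the largest of many draws stays below \(1/2\). A quick sanity check (our own parameter choices, not the constants of Lemma 2.2 of [9]):

```python
import math
import random

def max_abs_inner_product(d, R, trials=500, seed=0):
    """Empirical max of |<u, x>| over repeated draws of u uniform on the unit
    sphere S^{d-1}, with x a fixed vector of norm R (wlog x = R * e_1 by
    rotational symmetry). A normalized Gaussian vector is uniform on the sphere."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        worst = max(worst, abs(R * g[0] / norm))  # <u, R*e_1> = R * u_1
    return worst

# With d = 2000 >> R^2 * log(trials), all 500 draws fall well below 1/2.
print(max_abs_inner_product(2000, 4.0))
```

This mirrors the union bound over \(t\le T_p\), \(k\in [K]\) and \(j>C^{(t)}\) in the proof: each individual inner product is small with overwhelming probability once \(d\) is large enough.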
1.2 B.2 Proof of Lemma 6
Before proving Lemma 6, we first list the relevant continuity properties of the compression function \(\rho \).
Lemma 13
Let \(J(x)=\frac{\partial \rho }{\partial {}x}(x) = \frac{I - \rho (x)\rho (x)^{\top }/R^{2}}{\sqrt{1+\Vert x\Vert ^{2}/R^{2}}}\). For all \(x,y\in \mathbb {R}^{d}\) we have
Proof of Lemma 13
Note that \(\Vert \rho (x)\Vert \le R\) and therefore \(0 \preceq I - \rho (x)\rho (x)^{\top }/R^{2} \preceq I\). Consequently, we have \( \Vert J(x)\Vert _{\mathrm {op}} = (1+\Vert x\Vert ^{2}/R^{2})^{-1/2} \le 1. \) The guarantee \(\Vert \rho (x)-\rho (y)\Vert \le {}\Vert x-y\Vert \) follows immediately by Taylor's theorem. For the last statement, define \(h(t)=\frac{1}{\sqrt{1+t^{2}}}\), and note that \(\left|h(t)\right|,\left|h'(t)\right|\le {}1\). By the triangle inequality and the aforementioned boundedness and Lipschitzness properties of h, we have
For the first term, observe that for any x, y with \(\Vert x\Vert ,\Vert y\Vert \le {}1\), we have \(\Vert xx^{\top }-yy^{\top }\Vert _{\mathrm {op}}\le {}2\Vert x-y\Vert \); this follows because for any \(\Vert v\Vert =1\), we have \(\Vert (xx^{\top }-yy^{\top })v\Vert \le {}\Vert x-y\Vert \left|\left\langle v,x\right\rangle \right|+\Vert y\Vert \left|\left\langle v,x-y\right\rangle \right|\le {}2\Vert x-y\Vert \). Since \(\Vert \rho (x)/R\Vert \le {}1\), it follows that
For the second term, we again use that \(\Vert \rho (x)\Vert \le {}R\) to write
\(\square \)
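Since the display defining \(\rho \) is not reproduced here, the sketch below assumes the standard form \(\rho (x)=x/\sqrt{1+\Vert x\Vert ^2/R^2}\), which is consistent with the Jacobian \(J(x)\) stated in Lemma 13. It numerically confirms the boundedness and non-expansiveness properties used throughout this section:

```python
import math
import random

def rho(x, R):
    """Compression map x -> x / sqrt(1 + ||x||^2 / R^2). This is our assumed
    form of rho, chosen to match the Jacobian J(x) stated in Lemma 13."""
    scale = math.sqrt(1.0 + sum(v * v for v in x) / R**2)
    return [v / scale for v in x]

def check_compression_properties(R=3.0, d=5, trials=500, seed=0):
    """Numerically confirm ||rho(x)|| <= R and ||rho(x) - rho(y)|| <= ||x - y||
    on random pairs of points (the 1-Lipschitz claim of Lemma 13)."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-10, 10) for _ in range(d)]
        y = [rng.uniform(-10, 10) for _ in range(d)]
        rx, ry = rho(x, R), rho(y, R)
        norm_rx = math.sqrt(sum(v * v for v in rx))
        dist_r = math.sqrt(sum((a - b) ** 2 for a, b in zip(rx, ry)))
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        if norm_rx > R + 1e-9 or dist_r > dist + 1e-9:
            return False
    return True

print(check_compression_properties())
```

The range bound \(\Vert \rho (x)\Vert \le R\) follows since \(t\mapsto t/\sqrt{1+t^2/R^2}\) is bounded by \(R\); the 1-Lipschitzness reflects \(\Vert J(x)\Vert _{\mathrm {op}}\le 1\).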
Proof of Lemma 6
The argument here is essentially identical to [15, Lemma 5]. Define \(y^{(i)}=(y^{(i,1)},\ldots ,y^{(i,K)})\), where \(y^{(i,k)}=\rho (x^{(i,k)})\). Observe that for each i and k, the oracle response \((\widehat{F}_{T,U}(x^{(i,k)}),\widehat{g}_{T,U}(x^{(i,k)},z^{(i)}))\) is a measurable function of \(x^{(i,k)}\) and \(({\tilde{F}}_{T,U}(y^{(i,k)}),{\tilde{g}}_{T,U}(y^{(i,k)},z^{(i)}))\). Consequently, we can regard the sequence \(y^{(1)},\ldots ,y^{(T)}\) as realized by some algorithm in \({\mathcal {A}}_{{\textsf {rand}}}(K)\) applied to an oracle with \({\mathsf {O}}_{{\tilde{F}}_{T,U}}(y,z)=({\tilde{F}}_{T,U}(y),{\tilde{g}}_{T,U}(y,z))\). Lemma 5 then implies that as long as \(d\ge {}\lceil (32\cdot 230^2+1)T\log \frac{2KT^2}{p\delta }\rceil \ge {}\lceil T + 32 R^2\log \frac{2KT^2}{p\delta }\rceil \), we have that with probability at least \(1-\delta \),
as long as \(i\le {}(T-\log (2/\delta ))/2p\).
We now show that the gradient must be large for all of the iterates. Let i and k be fixed. We first consider the case where \(\Vert x^{(i,k)}\Vert \le {}R/2\). Observe that (48) implies that \(\mathrm {prog}_{1}(U^{\top }y^{(i,k)})<T\) and so by Lemma 2.6, if we set \(j=\mathrm {prog}_{1}(U^{\top }y^{(i,k)})+1\), we have
Now, observe that we have
Using that \(J(x) = \frac{I - \rho (x)\rho (x)^{\top }/R^{2}}{\sqrt{1+\Vert x\Vert ^{2}/R^{2}}}\), this is equal to
Since \(\Vert y^{(i,k)}\Vert \le {}\Vert x^{(i,k)}\Vert \le {}R/2\), this implies
By Lemma 2 we have \(\Vert \nabla {\tilde{F}}_{T,U}(y^{(i,k)})\Vert \le {}23\sqrt{T}\). At this point, the choice \(\eta =1/5\), \(R=230\sqrt{T}\), as well as (49) imply that \(\left|\left\langle u^{(j)},\nabla {}\widehat{F}_{T,U}(x^{(i,k)})\right\rangle \right| \ge {} \frac{2}{\sqrt{5}} - \left(\frac{1}{20} + \frac{1}{2\sqrt{5}}\right)\ge {}\frac{1}{2}\).
Next, we handle the case where \(\Vert x^{(i,k)}\Vert >R/2\). Here, we have
where the second inequality uses that \(\Vert J(x^{(i,k)})\Vert _{\mathrm {op}}\le {} \frac{1}{\sqrt{1+\Vert x^{(i,k)}\Vert ^{2}/R^{2}}}\le {}2/\sqrt{5}\) which follows from Lemma 13 and \(\Vert x^{(i,k)}\Vert >R/2\). \(\square \)
1.3 B.3 Proof of Lemma 7
To establish Lemma 7 we first prove a generic result showing that composition with the compression function \(\rho \) and an orthogonal transformation U never significantly hurts the regularity requirements in our lower bounds. In the following, we use the notation \(a\vee b :=\max \{a,b\}\).
Lemma 14
Let \(F:\mathbb {R}^{T}\rightarrow \mathbb {R}\) be an arbitrary twice-differentiable function with \(\Vert \nabla {}F(x)\Vert \le {}\ell _0\) and \(\Vert \nabla {}F(x)-\nabla {}F(y)\Vert \le {}\ell _1\cdot \Vert x-y\Vert \), and let g(x, z) and a random variable \(z\sim P_z\) satisfy for all \(x,y\in \mathbb {R}^{T}\),
Let \(R\ge \ell _{0} \vee 1\), \(d\ge T\), and \(U\in \mathsf {Ortho}(d,T)\). Then the functions
satisfy the following properties.
-
1.
\(\widehat{F}_{U}(0) - \inf _{x}\widehat{F}_{U}(x) \le F(0)-\inf _{x}F(x)\).
-
2.
The first derivative of \(\widehat{F}_{U}\) is \((\ell _{1}+3)\)-Lipschitz continuous.
-
3.
\({\mathbb {E}}\big \Vert \widehat{g}_{U}(x,z)-\nabla {}\widehat{F}_{U}(x)\big \Vert ^{2}\le {} \sigma ^2\) for all \(x\in \mathbb {R}^{d}\).
-
4.
\({\mathbb {E}}\Vert \widehat{g}_{U}(x,z)-\widehat{g}_{U}(y,z)\Vert ^{2} \le {} ({\bar{L}}^{2} + 9\sigma ^{2} + 9)\Vert x-y\Vert ^{2}\) for all \(x,y\in \mathbb {R}^{d}\).
Proof of Lemma 14
Property 1 is immediate, since the range of \(\rho \) is a subset of \(\mathbb {R}^{T}\). For property 2, we use the triangle inequality along with Lemma 13 and the assumed smoothness properties of F as follows:
For the variance bound (property 3), observe that we have
Here the second inequality follows from (47) and the fact that \(U\in \mathsf {Ortho}(d,T)\), and the third inequality follows because the variance bound in (50) holds uniformly for all points in the domain \(\mathbb {R}^{T}\) (in particular, those in the range of \(x\mapsto {}U^{\top }\rho (x)\)).
Lastly, to prove property 4 we first invoke the triangle inequality and the elementary inequality \((a+b)^{2}\le {}2a^{2}+2b^{2}\).
For the first term, we use the Jacobian operator norm bound from (47) and the assumed mean-squared smoothness of g:
For the second term, we use the Jacobian Lipschitzness from (47):
We now use the assumed Lipschitzness of F and variance bound for g:
Putting everything together, we have
\(\square \)
Proof of Lemma 7
For property 1, observe that \(\widehat{F}_{T,U}(0)=F_T(0)\), and
For properties 2, 3, and 4 we observe from Lemma 14 that \(\widehat{F}_{T,U}\) and \(\widehat{g}_{T,U}\), ignoring the quadratic regularization term, satisfy the same smoothness, variance, and mean-squared smoothness bounds as in Lemma 2/Lemma 4/Lemma 8 up to constant factors. The additional regularization term in (28) leads to an additional \(\eta =1/5\) factor in the smoothness and mean-squared-smoothness. \(\square \)
B.4 Proof of Theorem 3
We prove the lower bound for the bounded variance and mean-squared smooth settings in turn. The proofs follow the same outline as the proofs of Theorems 1 and 2, relying on Lemmas 6 and 7 rather than Lemmas 1 and 4, respectively. Throughout, let \(\Delta _0, \ell _1,\varsigma \) and \({\bar{\ell }}_{1}\) be the numerical constants in Lemma 7.

Bounded variance setting Given accuracy parameter \(\epsilon \), initial suboptimality \(\Delta \), smoothness parameter \(L\), and variance parameter \(\sigma ^2 \), we define for each \(U\in \mathsf {Ortho}(d,T)\) a scaled instance
We assume \(T\ge 4\), or equivalently \(\epsilon \le \sqrt{\frac{L\Delta }{64\ell _1\Delta _0 }}\). Let \(g^{\star }_T{}(x,z)\) denote the corresponding scaled version of the stochastic gradient function \(\widehat{g}_{T,U}\). Now, by Lemma 7, we have that \(F^{\star }_{T,U}\in {\mathcal {F}}(\Delta , L)\) and moreover,
Therefore, setting \(\frac{1}{p}=\frac{\sigma ^2}{(4\varsigma \epsilon )^2} + 1\) guarantees a variance bound of \(\sigma ^{2}\).
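As a quick sanity check on this choice of \(p\) (an illustrative sketch, not part of the proof, which assumes the pre-scaling variance enters as \((4\varsigma \epsilon )^2(1-p)/p\)), the stated setting of \(1/p\) yields a variance of exactly \(\sigma ^2\):

```python
# Illustrative check: assuming the variance of the scaled instance is
# (4*varsigma*eps)^2 * (1 - p) / p, the choice 1/p = sigma^2/(4*varsigma*eps)^2 + 1
# makes it exactly sigma^2. The parameter values below are arbitrary.
varsigma, eps, sigma = 1e3, 0.01, 2.0
a = (4 * varsigma * eps) ** 2
p = 1.0 / (sigma ** 2 / a + 1.0)
variance = a * (1.0 - p) / p  # equals a * (1/p - 1) = sigma^2
assert abs(variance - sigma ** 2) < 1e-9
```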
Next, let \({\mathsf {O}}\) be an oracle for which \({\mathsf {O}}_{F^{\star }_{T,U}}(x,z)=(F^{\star }_{T,U}(x),g^{\star }_{T,U}(x,z))\) for all \(U\in \mathsf {Ortho}(d,T)\). Observe that for any \({\mathsf {A}}\in {\mathcal {A}}_{{\textsf {rand}}}(K)\), we may regard the sequence \(\big \{x^{(i,k)}_{{\mathsf {A}}[{\mathsf {O}}_{F^{\star }_{T,U}}]}/\lambda \big \}\) as queries made by an algorithm \({\mathsf {A}}'\in {\mathcal {A}}_{{\textsf {rand}}}(K)\) interacting with the unscaled oracle \({\mathsf {O}}_{\widehat{F}_{T,U}}(x,z)=(\widehat{F}_{T,U}(x),\widehat{g}_{T,U}(x,z))\). Instantiating Lemma 6 for \(\delta =\frac{1}{2}\), we have that with probability at least \(\frac{1}{2}\), \(\min _{k\in \left[K\right]}\big \Vert \nabla \widehat{F}_{T,U}\big ( \frac{1}{\lambda } x^{(t,k)}_{{\mathsf {A}}[{\mathsf {O}}_{F^{\star }_{T,U}}]} \big ) \big \Vert >\frac{1}{2}\) for all \(t \le {} \frac{T-2}{2p}\). Therefore,
by which it follows that
where the second inequality uses that \(\lfloor x\rfloor -2\ge {}x/4\) whenever \(x\ge {}4\).
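The elementary inequality used here can be spot-checked numerically (an illustrative check, not part of the proof):

```python
# Spot-check the elementary inequality floor(x) - 2 >= x/4 for all x >= 4,
# sampling x over [4, 40] on a fine grid. Analytically, floor(x) >= x - 1
# and x - 3 >= x/4 whenever x >= 4.
import math

for k in range(40000, 400001):
    x = k / 10000.0
    assert math.floor(x) - 2 >= x / 4
```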
Mean-squared smooth setting We use the scaling (51), choose \(p=\min \left\{ {(4\varsigma \epsilon )^2}/{\sigma ^2},1\right\} \) as above, and let
Using Lemma 7 and the calculation from the proof of Theorem 2, this setting guarantees that \({\mathsf {O}}_{F^{\star }_{T,U}}(x,z)\) is in the class \(\mathcal {O}(K,\sigma ^{2},{\bar{L}})\). Consequently, the inequality (52) implies the lower bound
When \(\frac{{\bar{L}}\Delta \sqrt{p}}{16{\bar{\ell }}_{1}\Delta _0\epsilon ^2}\ge 4\), we have \(T\ge 4\) and (53) along with \(\lfloor x\rfloor -2\ge {}x/4\) for \(x\ge 4\) gives
Moreover, we choose \(c'\) so that \(\epsilon \le \sqrt{\frac{{\bar{L}}\Delta }{64{\bar{\ell }}_{1}\Delta _0}}\le \sqrt{\frac{{\bar{L}}\Delta }{8}}\) holds. Lemma 11 then gives the lower bound
for a universal constant \(c_0\). Together, the bounds (53) and (54) imply the desired result when \(\frac{{\bar{L}}\Delta \sqrt{p}}{16{\bar{\ell }}_{1}\Delta _0\epsilon ^2}\ge 4\). As we argue in the proof of Theorem 2, in the complementary case \(\frac{{\bar{L}}\Delta \sqrt{p}}{16{\bar{\ell }}_{1}\Delta _0\epsilon ^2}< 4\), the bound (54) dominates (53), and consequently the result holds there as well.
C Proofs from Section 5
C.1 Statistical learning oracles
To prove the mean-squared smoothness properties of the construction (32) we must first argue about the continuity of \(\nabla \Theta _i\), where \(\Theta _i:{\mathbb {R}}^T \rightarrow {\mathbb {R}}\) is the “soft indicator” function given by
Lemma 15
For all \(i\ge {}j\), \(\nabla _i\Theta _j(x)\) is well-defined with
Moreover, \(\Theta _j\) satisfies the following properties:
1. \(\Vert \nabla {}\Theta _j(x)\Vert \le {}6^{2}\).
2. \(\Vert \nabla \Theta _j(x)-\nabla \Theta _j(y)\Vert \le {}10^4\cdot \Vert x-y\Vert \).
Proof of Lemma 15
First, we verify that the function \(x_i\mapsto {}\Vert \Gamma (\left|x_{\ge {}j}\right|)\Vert \) is differentiable everywhere for each i. Given this, Observation 1 implies that \(\Theta _j(x)\) is differentiable, and (55) follows from the chain rule. Let \(i\ge {}j\), and let \(a=\sqrt{\sum _{k\ge {}j,k\ne {}i}\Gamma ^{2}(\left|x_k\right|)}\). Then \(\Vert \Gamma (\left|x_{\ge {}j}\right|)\Vert =\sqrt{a^{2}+\Gamma ^{2}\left(\left|x_i\right|\right)}\). This function is clearly differentiable with respect to \(x_i\) when \(a>0\), and when \(a=0\) it is equal to \(\Gamma (\left|x_i\right|)\), which is also differentiable.
Property 1 follows because for all j,
where we have used Observation 1.3.
To prove Property 2, we restrict to the case \(j=1\) so that \(x_{\ge {}j}=x\) and subsequently drop the ‘\(\ge {}j\)’ subscript to simplify notation; the case \(j>1\) follows as an immediate consequence. Define \(\mu (x)\in \mathbb {R}^{T}\) via \(\mu _{i}(x) = \Gamma (\left|x_i\right|)\Gamma '(\left|x_i\right|)\mathrm {sgn}(x_i)\). Assume without loss of generality that \(0<\Vert \Gamma (\left|x\right|)\Vert \le {}\Vert \Gamma (\left|y\right|)\Vert \). By the triangle inequality, we have
To proceed, we state some useful facts, all of which follow from Observation 1.3:
1. \(\Gamma \) is 6-Lipschitz.
2. \(\Gamma '\) is 128-Lipschitz, and in particular \(\Gamma '(1-\Vert \Gamma (\left|x\right|)\Vert )\le {}128\cdot \Vert \Gamma (\left|x\right|)\Vert \) (since \(\Gamma '(1)=0\)).
3. \(\Vert \mu (x)\Vert \le {}6\cdot \Vert \Gamma (\left|x\right|)\Vert \) for all x.
4. \(\Vert \mu (x)-\mu (y)\Vert \le {}(128\cdot 1 + 6^2)\cdot \Vert x-y\Vert =164\cdot \Vert x-y\Vert \) for all x, y.
Using the first, second, and third facts, we bound the first term as
For the second term, we apply the second fact and the triangle inequality to upper bound by
Using the fourth fact and the assumption that \(\Vert \Gamma (\left|x\right|)\Vert \le {}\Vert \Gamma (\left|y\right|)\Vert \), we have
Using the third fact and \(\Vert \Gamma (\left|x\right|)\Vert \le {}\Vert \Gamma (\left|y\right|)\Vert \), we have
Gathering all of the constants, this establishes that
\(\square \)
We are now ready to prove Lemma 8. For ease of reference, we restate the construction (32):
where
Proof of Lemma 8
To begin, we introduce some shorthand. Define
The gradient of the noiseless hard function \(F_T\) can then be written as
Next, define
With these definitions, we have the expression
We begin by noting that \(\nabla f_T\) is unbiased for \(\nabla F_T\): since \({\mathbb {E}}\left[\nu _i(x,z)\right]=1\) for all i and \({\mathbb {E}}\left[\tfrac{z}{p}\right]=1\), it follows immediately from (58) that \({\mathbb {E}}\left[\nabla f_T(x,z)\right]=\nabla {}F_T(x)\).
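The facts about \(z\sim \mathrm {Bernoulli}(p)\) used here (and in the variance argument below) can be verified by direct computation over the two outcomes (an illustrative check, not part of the proof):

```python
# For z ~ Bernoulli(p): E[z/p] = 1 and E[(z/p - 1)^2] = (1-p)/p <= 1/p.
# Computed exactly over the two outcomes z in {0, 1}.
for p in [0.01, 0.1, 0.5, 0.9]:
    mean = p * (1 / p) + (1 - p) * 0                          # E[z/p]
    second = p * (1 / p - 1) ** 2 + (1 - p) * (0 - 1) ** 2    # E[(z/p - 1)^2]
    assert abs(mean - 1) < 1e-12
    assert abs(second - (1 - p) / p) < 1e-9
    assert second <= 1 / p
```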
Next, we show that \(\nabla f_T\) is a probability-p zero-chain, with an argument analogous to the proof of Lemma 4. First, we claim that \(\left[\nabla {}f_T(x,z)\right]_i=0\) for all x, z and \(i>1+\mathrm {prog}_{\frac{1}{4}}(x)\), yielding \(\mathrm {prog}_{0}(\nabla {}f_T(x,z))\le 1+\mathrm {prog}_{\frac{1}{4}}(x)\). For such i we have \(\left|x_{i-1}\right|,\left|x_i\right|<1/4\), so it follows from (57) that \(g_i(x,z)=0\) and from (55) that \(\nabla {}_i\Theta _j(x)=0\) for all j, establishing the first claim. Now, consider the case \(i=\mathrm {prog}_{\frac{1}{4}}(x)+1\) and \(z=0\). Here (since \(\left|x_i\right|<1/4\)) we still have \(\nabla {}_i\Theta _j(x)=0\) for all j, so \(\nabla _if_T(x,z)=g_i(x,z)\). Since \(\Gamma (\left|x_{\ge {}i}\right|)=\Gamma (\left|x_{\ge {}i+1}\right|)=0\), we have \(\nu _{i}(x,0)=\nu _{i+1}(x,0)=0\), so \(g_i(x,z)=0\). It follows immediately that \(\mathrm {prog}_{0}(\nabla {}f_T(x,0))\le \mathrm {prog}_{\frac{1}{4}}(x)\) for all x. Finally, examining the definition (32) of \(f_T\), it is straightforward to verify that \(f_T(y,z) = f_T(y_{\le 1+\mathrm {prog}_{\frac{1}{4}}(x)}, z)\) for all y in a neighborhood of x, and all x and z. This implies \(f_T(x,z) = f_T(x_{\le 1+\mathrm {prog}_{\frac{1}{4}}(x)}, z)\) and, via differentiation, \(\nabla f_T(x,z) = \nabla f_T(x_{\le 1+\mathrm {prog}_{\frac{1}{4}}(x)}, z)\). Similarly, one has \(f_T(y,0) = f_T(y_{\le \mathrm {prog}_{\frac{1}{4}}(x)}, 0)\) for y in a neighborhood of x, concluding the proof of the probabilistic zero-chain property.
To bound the variance and mean-squared smoothness of \(\nabla f_T\), we begin by analyzing the sparsity pattern of the error vector
Let \(i_x=\mathrm {prog}_{\frac{1}{2}}(x)+1\). Observe that if \(j<i_x\), we have \(\Vert \Gamma (\left|x_{\ge {}j}\right|)\Vert \ge {}\Gamma (\left|x_{i_{x}-1}\right|)\ge {}\Gamma (1/2)=1\), and so \(\Gamma '(1-\Vert \Gamma (\left|x_{\ge {}j}\right|)\Vert )=0\) and consequently \(\nabla _{i}\Theta _j(x)=0\) for all i. Note also that if \(j>i_x\), we have \(H(x_{j-1},x_j)=0\). We conclude that (58) simplifies to
As in Lemma 4, we have \(\nu _{i}(x,z)=1\) for all \(i<i_x\) and \(g_i(x,z)=\nabla {}_iF_T(x)=0\) for all \(i>i_x\). Thus, using the expression (57) along with (59), we have
It follows immediately that the variance can be bounded as
From (56) we have \(\Vert \nabla {}\Theta _{i_x}(x)\Vert \le {}6^2\), and from (36) we have \(|H(x,y)|\le {}12\), so the first term contributes at most \(\tfrac{2\cdot {}144\cdot {}6^{4}}{p}\). Since \(|\Theta _i(x)|\le {}1\), Lemma 2 implies that the second and third terms together contribute at most \(\tfrac{4\cdot {}23^{2}}{p}\). To conclude, we may take
where \(\varsigma \le {}10^{3}\).
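The arithmetic behind the bound \(\varsigma \le {}10^{3}\) is easily checked (an illustrative computation, assuming the two contributions above combine additively into a constant of the form \(\varsigma ^2/p\)):

```python
# Collecting the two contributions from the variance bound:
#   first term:            2 * 144 * 6^4 / p
#   second + third terms:  4 * 23^2 / p
# The total constant is 375364, and sqrt(375364) ~ 612.7 <= 10^3.
total = 2 * 144 * 6 ** 4 + 4 * 23 ** 2
assert total == 375364
assert total ** 0.5 <= 10 ** 3
```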
To bound the mean-squared smoothness \({\mathbb {E}}\Vert \nabla f_T(x,z) - \nabla f_T(y,z) \Vert ^2\), we first use that \({\mathbb {E}}\left[\delta (x ,z)\right]=0\), which implies
We have \(\Vert \nabla F_T(x) - \nabla F_T(y)\Vert \le \ell _{1} \Vert x-y\Vert \) by Lemma 2.2. For the other term, we use the sparsity pattern of \(\delta (x,z)\) established in (60) along with the fact that \({\mathbb {E}}\left(\tfrac{z}{p}-1\right)^{2}\le \tfrac{1}{p}\) to show
where \(i_y=\mathrm {prog}_{\frac{1}{2}}(y)+1\).
We bound \(\mathcal {E}_1\) and \(\mathcal {E}_2\) using similar arguments to Lemma 4. Focusing on \(\mathcal {E}_1\), and letting \(i\in \{i_x,i_y\}\) be fixed, we have
Note that by Lemma 15, (i) \(\Theta _i\) is \(6^2\)-Lipschitz and \(\left|\Theta _i\right|\le {}1\), and (ii) \(h_1\) is 23-Lipschitz and \(\left|h_1\right|\le {}5\) (from Observation 2 and Lemma 2). Consequently,
Since \(h_{2}\) is 23-Lipschitz and has \(\left|h_2\right|\le {}20\), an identical argument also yields that
To bound \(\mathcal {E}_3\), we use the earlier observation that for all i and \(j\ne i_x\) we have \(H(x_{j-1},x_j)\nabla _i\Theta _j(x)=0\), and likewise that \(H(y_{j-1},y_j)\nabla _i\Theta _j(y)=0\) for all \(j\ne {}i_y\). This allows us to write
Letting \(j\in \{i_x,i_y\}\) be fixed, we upper bound the inner summation as
We may now upper bound this quantity by applying the following basic results:
1. \(H(x_{j-1},x_j)\le {}12\) by (36).
2. \(\left|H(x_{j-1},x_j)-H(y_{j-1},y_{j})\right|\le {}20\Vert x-y\Vert \), by (36).
3. \(\Vert \nabla \Theta _j(y)\Vert \le {}6^{2}\) by Lemma 15.1.
4. \(\Vert \nabla \Theta _j(x)-\nabla \Theta _j(y)\Vert \le {}10^4\cdot \Vert x-y\Vert \), by Lemma 15.2.
It follows that \(\mathcal {E}_3\le {}3\cdot {}10^{10}\cdot {}\Vert x-y\Vert ^{2}\). Collecting the bounds on \(\mathcal {E}_1\), \(\mathcal {E}_2\), and \(\mathcal {E}_3\), this establishes that
with \({\bar{\ell }}_{1} \le {} \sqrt{10^{11} + \ell _{1}^2}\). \(\square \)
C.2 Active oracles
Proof of Lemma 9
Denoting
we see that the equality \({\mathbb {P}}(\gamma ^{(t)} - \gamma ^{(t-1)} \notin \{0,1\}| \mathcal {G}^{(t-1)})=0\) holds for our setting as well. Moreover, we claim that
Given the bound (61), the remainder of the proof is identical to that of Lemma 1, with 2p replacing p. To see why (61) holds, let \((x^{(1)},i^{(1)}),\ldots ,(x^{(t)},i^{(t)})\in \mathcal {G}^{(t-1)}\) denote the sequence of queries made by the algorithm. We first observe that, by the construction of \(g_\pi \), we have \(\gamma ^{(t)} = 1+ \gamma ^{(t-1)}\) only if \(\zeta _{1+\gamma ^{(t-1)}}(\pi (i^{(t)}))=1\). Therefore,
Next, let \(b\in \{0,1\}^{N^T}\) denote a (random) vector whose ith entry is \(b_i :=\zeta _{1+\gamma ^{(t-1)}}(\pi (i))\). The vector b has \(N^{T-1}\) elements equal to 1 and its distribution is permutation invariant. Note that, by construction, the vector b is independent of \(\{\zeta _j(\pi (i))\}_{j\ne 1+\gamma ^{(t-1)},i\in [N^T]}\). Consequently, the gradient estimates \(g^{(1)},\ldots ,g^{(t-1)}\) depend on b only through their \((1+\gamma ^{(t-1)})\)th coordinate, which for iterate \(t'\le t-1\) is
From this expression we see that \(g^{(t')}\) depends on b only for index queries in the set
Moreover, for every \(i\in S^{(t-1)}\) we have that \(b_i=0\), because otherwise there exists \(t'<t\) such that \(g_{1+\gamma ^{(t-1)}}^{(t')} \ne 0\) which gives the contradiction \(\gamma ^{(t-1)}\ge \gamma ^{(t')}\ge \mathrm {prog}_{0}(g^{(t')})\ge 1+\gamma ^{(t-1)}>\gamma ^{(t-1)}\). In conclusion, we have for every \(i\in [N^T]\)
where the last equality follows from the permutation invariance of b.
Combining the observations above with the fact that \(|S^{(t-1)}| \le t-1 \le \frac{T}{4p} \le \frac{1}{4}NT \le \frac{1}{2}N^T\) gives the desired result (61), since
We remark that the argument above depends crucially on using a different bit for every coordinate. Indeed, had we instead used the original construction \(g_T\) in Eq. (18) and set \(g_\pi (x;i)=g_T(\zeta _1(\pi (i)))\), an algorithm that queried roughly N random indices would find an index \(i^\star \) such that \(\zeta _1(\pi (i^\star ))=1\) and could then continue to query it exclusively, achieving a unit of progress at every query. This would decrease the lower bound from \(\Omega (T/p)=\Omega (NT)\) to \(\Omega (N+T)\). \(\square \)
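The failure mode described in this remark can be illustrated with a small simulation (a sketch under assumed parameters, not part of the proof): a fraction \(N^{T-1}/N^{T}=1/N\) of indices satisfy \(\zeta _1(\pi (i))=1\), so uniformly random index queries find such an index after roughly \(N\) attempts on average.

```python
# Illustration of the remark: among N^T indices, N^(T-1) have the first bit
# set, a fraction of 1/N. The number of uniformly random index queries needed
# to hit one is geometric with mean N, so with the single-bit construction an
# algorithm would need only about N + T queries overall. Parameters are
# arbitrary illustrative choices.
import random

random.seed(0)
N, T, trials = 16, 4, 20000
good = N ** (T - 1)   # number of indices playing the role of bit-1 indices
total = N ** T
counts = []
for _ in range(trials):
    draws = 1
    while random.randrange(total) >= good:
        draws += 1
    counts.append(draws)
avg = sum(counts) / trials
assert abs(avg - N) / N < 0.1  # empirical mean of queries is close to N
```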
Arjevani, Y., Carmon, Y., Duchi, J.C. et al. Lower bounds for non-convex stochastic optimization. Math. Program. 199, 165–214 (2023). https://doi.org/10.1007/s10107-022-01822-7