
Gradient-Free Algorithms for Solving Stochastic Saddle Optimization Problems with the Polyak–Łojasiewicz Condition


Abstract

This paper focuses on solving a subclass of stochastic nonconvex-nonconcave black-box optimization problems with a saddle point that satisfy the Polyak–Łojasiewicz (PL) condition. To solve such problems, we provide the first, to the best of our knowledge, gradient-free algorithm. The proposed approach is based on applying a gradient approximation (kernel approximation) to a stochastic gradient descent algorithm with a biased oracle. We present theoretical estimates that guarantee its global linear rate of convergence to the desired accuracy. The theoretical results are verified on a model example by comparison with an algorithm using Gaussian approximation.


REFERENCES

  1. Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016.


  2. Dai, B., et al., SBEED: Convergent reinforcement learning with nonlinear function approximation, Proc. Int. Conf. Machine Learning, 2018, pp. 1125–1134.

  3. Namkoong, H. and Duchi, J.C., Variance-based regularization with convex objectives, Adv. Neural Inf. Process. Syst., 2017, vol. 30.

  4. Xu, L., et al., Maximum margin clustering, Adv. Neural Inf. Process. Syst., 2004, vol. 17.

  5. Sinha, A., et al., Certifying some distributional robustness with principled adversarial training, 2017.

  6. Audet, C. and Hare, W., Derivative-free and blackbox optimization, 2017.

  7. Rosenbrock, H.H., An automatic method for finding the greatest or least value of a function, Comput. J., 1960, vol. 3, no. 3, pp. 175–184.


  8. Gasnikov, A., et al., Randomized gradient-free methods in convex optimization, 2022.

  9. Lobanov, A., et al., Gradient-free federated learning methods with l1- and l2-randomization for non-smooth convex stochastic optimization problems, 2022.

  10. Gasnikov, A., et al., The power of first-order smooth optimization for black-box non-smooth problems, Proc. Int. Conf. Machine Learning, 2022, pp. 7241–7265.

  11. Bach, F. and Perchet, V., Highly-smooth zero-th order online optimization, Proc. Conf. Learning Theory, 2016, pp. 257–283.

  12. Beznosikov, A., Novitskii, V., and Gasnikov, A., One-point gradient-free methods for smooth and non-smooth saddle-point problems, Proc. 20th Int. Conf. Mathematical Optimization Theory and Operations Research (MOTOR), Irkutsk, Russia, 2021, pp. 144–158.

  13. Akhavan, A., Pontil, M., and Tsybakov, A., Exploiting higher order smoothness in derivative-free optimization and continuous bandits, Adv. Neural Inf. Process. Syst., 2020, vol. 33, pp. 9017–9027.


  14. Polyak, B.T., Gradient methods for the minimisation of functionals, USSR Comput. Math. Math. Phys., 1963, vol. 3, no. 4, pp. 864–878.


  15. Łojasiewicz, S., Une propriété topologique des sous-ensembles analytiques réels, Les Equations aux Dérivées Partielles, 1963, vol. 117, pp. 87–89.

  16. Ajalloeian, A. and Stich, S.U., On the convergence of SGD with biased gradients, 2020.

  17. Lobanov, A., Gasnikov, A., and Stonyakin, F., Highly smoothness zero-order methods for solving optimization problems under PL condition, 2023.

  18. Yue, P., Fang, C., and Lin, Z., On the lower bound of minimizing Polyak–Łojasiewicz functions, 2022.

  19. Yang, J., Kiyavash, N., and He, N., Global convergence and variance-reduced optimization for a class of nonconvex-nonconcave minimax problems, 2020.

  20. Akhavan, A., et al., Gradient-free optimization of highly smooth functions: Improved analysis and a new algorithm, 2023.

  21. Nouiehed, M., et al., Solving a class of non-convex min-max games using iterative first order methods, Adv. Neural Inf. Process. Syst., 2019, vol. 32.

  22. Osserman, R., The isoperimetric inequality, Bull. Am. Math. Soc., 1978, vol. 84, no. 6, pp. 1182–1238.


  23. Beckner, W., A generalized Poincaré inequality for Gaussian measures, Proc. Am. Math. Soc., 1989, vol. 105, no. 2, pp. 397–400.


  24. Karimi, H., Nutini, J., and Schmidt, M., Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition, Proc. Eur. Conf. Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Riva del Garda, Italy, 2016, pp. 795–811.

  25. Zorich, V.A., Mathematical Analysis II, Berlin: Springer, 2016.



Funding

The work carried out by A.M. Raigorodskii in Sections 1–3 was supported by a grant for leading scientific schools (grant no. NSh775.2022.1.1), while the work carried out in Sections 4–6 was supported by the Russian Science Foundation (project no. 21-71-30005), https://rscf.ru/project/21-71-30005.

Author information


Correspondence to S. I. Sadykov, A. V. Lobanov or A. M. Raigorodskii.

Ethics declarations

The authors declare that they have no conflicts of interest.

Additional information

Translated by Yu. Kornienko

Appendices

AUXILIARY LEMMAS TO PROVE THEOREM 1

Suppose that \(\kappa_{\beta} = \int |u|^{\beta}\,|K(u)|\,du\) and \(\kappa = \int K^{2}(u)\,du\). When \(K\) is a weighted sum of Legendre polynomials, \(\kappa_{\beta}\) and \(\kappa\) do not depend on \(d\) [11] (see A.3) and depend only on \(\beta\); for \(\beta \geqslant 1\),

$${{\kappa }_{\beta }} \leqslant 2\sqrt 2 (\beta - 1),$$
(A.1)
$$\kappa \leqslant 3\beta^{3}.$$
(A.2)
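The following short Python sketch illustrates these constants numerically. The kernel is built as a weighted sum of Legendre polynomials; the specific weighting by the derivatives of the orthonormal Legendre polynomials at zero is our reading of the construction in [11] and is stated here only as an assumption. The script checks the moment conditions used below (\(\int rK(r)dr = 1\) and the "nullification" of the other moments up to order \(\beta\)) and evaluates \(\kappa_{\beta}\) and \(\kappa\) against (A.1) and (A.2) for \(\beta = 3\).

import numpy as np
from numpy.polynomial import legendre as leg
from scipy.integrate import quad

def kernel_coeffs(ell):
    # Legendre-series coefficients of K(u) = sum_m p_m'(0) p_m(u), where
    # p_m(u) = sqrt((2m + 1)/2) * P_m(u) is orthonormal on [-1, 1] (assumed construction).
    c = np.zeros(ell + 1)
    for m in range(1, ell + 1):
        e_m = np.zeros(m + 1); e_m[m] = 1.0
        dPm_at_0 = leg.legval(0.0, leg.legder(e_m))      # P_m'(0)
        c[m] = (2 * m + 1) / 2.0 * dPm_at_0
    return c

beta = 3                                                 # assumed smoothness order (l = beta)
c = kernel_coeffs(beta)
K = lambda u: leg.legval(u, c)

for j in range(beta + 1):                                # moment conditions of the kernel
    val, _ = quad(lambda r: r**j * K(r), -1, 1)
    print(f"int r^{j} K(r) dr = {val:+.4f}")

kappa_beta, _ = quad(lambda r: abs(r)**beta * abs(K(r)), -1, 1)
kappa, _ = quad(lambda r: K(r)**2, -1, 1)
print(f"kappa_beta = {kappa_beta:.3f}  <= 2*sqrt(2)*(beta - 1) = {2 * 2**0.5 * (beta - 1):.3f}")
print(f"kappa      = {kappa:.3f}  <= 3*beta^3 = {3 * beta**3}")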

First, several key lemmas need to be proved.

Lemma 1 ([24]). If \(f( \cdot )\) is \({{L}_{2}}\)-smooth and satisfies the PL condition with constant \(\mu \), then it also satisfies the bounded error condition with \(\mu \), i.e.,

$${\text{||}}\nabla f(x){\text{||}} \geqslant \mu {\text{||}}{{x}_{p}} - x{\text{||}},\quad \forall x,$$

where xp is the projection of x onto the optimal set, and it also satisfies the quadratic growth condition with \(\mu \), i.e.,

$$f(x) - f\text{*} \geqslant \frac{\mu }{2}{\text{||}}{{x}_{p}} - x{\text{|}}{{{\text{|}}}^{2}},\quad \forall x.$$

Conversely, if \(f( \cdot )\) is \({{L}_{2}}\)-smooth and satisfies the bounded error condition with constant \(\mu \), then it satisfies the PL condition with constant \(\mu {\text{/}}{{L}_{2}}\).

It can be seen from this lemma that \({{L}_{2}} \geqslant \mu \).

Lemma 2 ([21]). In the minimax problem, if \( - f(x, \cdot )\) satisfies the PL condition with constant \({{\mu }_{y}}\) for any \(x\) and f satisfies Assumption 1, then function \(g(x)\,: = \,\mathop {\max }\nolimits_y f(x,y)\) is L-smooth with \(L: = {{L}_{2}} + L_{2}^{2}{\text{/}}{{\mu }_{y}}\) and \(\nabla g(x)\) = \({{\nabla }_{x}}f(x,y\text{*}(x))\) for any \(y\text{*}(x) \in \arg \mathop {\max }\limits_y f(x,y)\).
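As a simple illustration of Lemma 2 (our own toy example, not taken from the paper), consider the strongly-convex–strongly-concave quadratic saddle function

$$f(x,y) = \frac{\mu_x}{2}\|x\|^{2} + x^{\top}Cy - \frac{\mu_y}{2}\|y\|^{2}.$$

Here \(-f(x, \cdot)\) is \(\mu_y\)-strongly convex and therefore satisfies the PL condition with \(\mu_y\). Maximization over \(y\) gives \(y\text{*}(x) = C^{\top}x/\mu_y\) and

$$g(x) = \frac{\mu_x}{2}\|x\|^{2} + \frac{\|C^{\top}x\|^{2}}{2\mu_y},\qquad \nabla g(x) = \mu_x x + \frac{CC^{\top}x}{\mu_y} = \nabla_x f(x, y\text{*}(x)),$$

so \(g\) is \(\left(\mu_x + \|C\|^{2}/\mu_y\right)\)-smooth, which is consistent with the constant \(L = L_2 + L_2^{2}/\mu_y\) of Lemma 2 because \(L_2 \geqslant \max(\mu_x, \|C\|)\) for this \(f\).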

For the next lemma, we need to consider problem \(\mathop {\min }\limits_x f(x)\).

Lemma 3. Suppose that \(\{x_k\}_{k \geqslant 0}\) is the sequence of iterates generated by the mini-batch SGD algorithm on a function \(f(\cdot)\) under Assumptions 1–5. Then, for a step size \(\eta \leqslant \frac{1}{(M + 1)L_2}\), the following holds for all \(N \geqslant 0\):

$$\begin{gathered} \mathbb{E}[f({{x}_{N}})] - f\text{*} \\ \, \leqslant {{(1 - \eta \mu )}^{N}}\left( {f({{x}_{0}}) - f\text{*}} \right) + \frac{{{{\zeta }^{2}}}}{{2\mu }} + \frac{{\eta {{L}_{2}}{{\sigma }^{2}}}}{{2B\mu }}. \\ \end{gathered} $$

where L2 is the Lipschitz constant for the gradient such that \({\text{||}}\nabla f(x) - \nabla f(y){\text{||}} \leqslant {{L}_{2}}{\text{||}}x - y{\text{||}}\).

Proof. Due to the \({{L}_{2}}\)-smoothness of f and the choice of step size \(\eta \leqslant \frac{1}{{(M + 1){{L}_{2}}}}\), we have

$$\begin{gathered} \mathbb{E}[f(x_{k + 1})] \leqslant f(x_k) + \langle \nabla f(x_k),x_{k + 1} - x_k\rangle + \frac{L_2}{2}\|x_{k + 1} - x_k\|^2 \\ \leqslant f(x_k) - \eta \langle \nabla f(x_k),\mathbb{E}[{\mathbf{G}}_k]\rangle + \frac{\eta^2 L_2}{2}\left(\mathbb{E}[\|{\mathbf{G}}_k - \mathbb{E}[{\mathbf{G}}_k]\|^2] + \mathbb{E}[\|\mathbb{E}[{\mathbf{G}}_k]\|^2]\right) \\ \overset{(1)}{=} f(x_k) - \eta \langle \nabla f(x_k),\nabla f(x_k) + {\mathbf{b}}(x_k)\rangle + \frac{\eta^2 L_2}{2}\left(\mathbb{E}[\|{\mathbf{n}}(x_k,\xi)\|^2] + \mathbb{E}[\|\nabla f(x_k) + {\mathbf{b}}(x_k)\|^2]\right) \end{gathered}$$
(A.3)
$$\begin{gathered} \overset{(2)}{\leqslant} f(x_k) - \eta \langle \nabla f(x_k),\nabla f(x_k) + {\mathbf{b}}(x_k)\rangle + \frac{\eta^2 L_2}{2}\left((M + 1)\mathbb{E}[\|\nabla f(x_k) + {\mathbf{b}}(x_k)\|^2] + \sigma^2\right) \\ \leqslant f(x_k) + \frac{\eta}{2}\left(\pm\|\nabla f(x_k)\|^2 - 2\langle \nabla f(x_k),\nabla f(x_k) + {\mathbf{b}}(x_k)\rangle + \|\nabla f(x_k) + {\mathbf{b}}(x_k)\|^2\right) + \frac{\eta^2 L_2}{2}\sigma^2 \\ = f(x_k) + \frac{\eta}{2}\left(-\|\nabla f(x_k)\|^2 + \|{\mathbf{b}}(x_k)\|^2\right) + \frac{\eta^2 L_2}{2}\sigma^2 \overset{(3)}{\leqslant} (1 - \eta\mu)\left(f(x_k) - f\text{*}\right) + \frac{\eta\zeta^2}{2} + \frac{\eta^2 L_2}{2}\sigma^2 + f\text{*}, \end{gathered}$$

where (1) uses Definition 2, (2) uses Assumption 4, and (3) uses Assumption 5 together with the PL condition.

By applying recursion to (A.3) and by adding batching (with batch size B), we arrive at

$$\begin{gathered} \mathbb{E}[f({{x}_{N}})] - f\text{*} \\ \, \leqslant {{(1 - \eta \mu )}^{N}}\left( {f({{x}_{0}}) - f\text{*}} \right) + \frac{{{{\zeta }^{2}}}}{{2\mu }} + \frac{{\eta {{L}_{2}}{{\sigma }^{2}}}}{{2B\mu }}. \\ \end{gathered} $$
(A.4)

\(\square \)
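As a numerical illustration of Lemma 3, the following minimal Python sketch (our own toy setup, not part of the paper) runs mini-batch SGD with a biased and noisy gradient oracle on a quadratic objective, which satisfies the PL condition with \(\mu\) equal to the smallest eigenvalue of \(A\), and compares the final value with the right-hand side of (A.4). The constant bias vector and the Gaussian noise model are illustrative assumptions with \(\|{\mathbf{b}}\| = \zeta\), \(\mathbb{E}\|{\mathbf{n}}\|^2 = \sigma^2\), and \(M = 0\).

import numpy as np

rng = np.random.default_rng(0)
d, B, N = 20, 16, 400
A = np.diag(np.linspace(0.5, 5.0, d))          # f(x) = 0.5 x^T A x, so f* = 0
mu, L2 = 0.5, 5.0                              # PL and smoothness constants of f
zeta, sigma = 1e-2, 1.0                        # bias bound and noise level
bias = zeta * np.ones(d) / np.sqrt(d)          # ||bias|| = zeta

def f(x):
    return 0.5 * x @ A @ x

def oracle(x):
    # one biased stochastic gradient: grad f(x) + b(x) + n(x, xi), with E||n||^2 = sigma^2
    return A @ x + bias + sigma * rng.standard_normal(d) / np.sqrt(d)

eta = 1.0 / L2                                 # step size <= 1/((M + 1) L2) with M = 0
x = rng.standard_normal(d)
f0 = f(x)
for k in range(N):
    g = np.mean([oracle(x) for _ in range(B)], axis=0)     # mini-batch of size B
    x = x - eta * g

# (A.4) bounds the expectation; a single run should lie well below it here.
bound = (1 - eta * mu) ** N * f0 + zeta**2 / (2 * mu) + eta * L2 * sigma**2 / (2 * B * mu)
print(f"f(x_N) = {f(x):.2e}   vs   right-hand side of (A.4) = {bound:.2e}")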

Theorem 4. Suppose that Assumptions 1–5 hold and \(f(x,y)\) satisfies the two-sided PL condition with \({{\mu }_{x}}\) and \({{\mu }_{y}}\). If we run one iteration of Algorithm 1 with \(\tau _{x}^{t} = {{\tau }_{x}} \leqslant \frac{1}{{(M + 1)L}}\) (L is specified by Lemma 2) and \(\tau _{y}^{t} = {{\tau }_{y}} \leqslant \frac{1}{{(M + 1){{L}_{2}}}}\), then

$$\begin{gathered} {{a}_{{t + 1}}} + \lambda {{b}_{{t + 1}}} \leqslant \max \{ {{k}_{1}},{{k}_{2}}\} ({{a}_{t}} + \lambda {{b}_{t}}) \\ \, + \lambda \left( {\tau _{y}^{2}{{L}_{2}}\frac{{{{\sigma }^{2}}}}{B} + {{\tau }_{y}}{{\zeta }^{2}}} \right), \\ \end{gathered} $$

where

$${{k}_{1}}: = 1 - {{\mu }_{x}}{{\tau }_{x}}[1 + \lambda (1 - {{\mu }_{y}}{{\tau }_{y}})],$$
(A.5)
$${{k}_{2}}: = 1 + \frac{{L_{2}^{2}{{\tau }_{x}}}}{{{{\mu }_{y}}\lambda }} - {{\mu }_{y}}{{\tau }_{y}} + {{\sigma }^{2}}\frac{{L_{2}^{2}}}{{{{\mu }_{y}}}}{{\tau }_{x}} - {{\tau }_{x}}{{\tau }_{y}}L_{2}^{2}{{\sigma }^{2}}.$$
(A.6)

Proof. With g being L-smooth in accordance with Lemma 2, by choosing a step size such that \({{\tau }_{x}}\, \leqslant \,\frac{1}{{(M\, + \,1)L}}\), we have

$$\begin{gathered} \mathbb{E}[g(x_{k + 1})] \leqslant g(x_k) + \langle \nabla g(x_k),x_{k + 1} - x_k\rangle + \frac{L}{2}\|x_{k + 1} - x_k\|^2 \\ \leqslant g(x_k) - \tau_x\langle \nabla g(x_k),\mathbb{E}[{\mathbf{G}}_k]\rangle + \frac{\tau_x^2 L}{2}\left(\mathbb{E}[\|{\mathbf{G}}_k - \mathbb{E}[{\mathbf{G}}_k]\|^2] + \mathbb{E}[\|\mathbb{E}[{\mathbf{G}}_k]\|^2]\right) \\ \overset{(1)}{=} g(x_k) - \tau_x\langle \nabla g(x_k),\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k)\rangle + \frac{\tau_x^2 L}{2}\left(\mathbb{E}[\|{\mathbf{n}}(x_k,y_k,\xi)\|^2] + \mathbb{E}[\|\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k)\|^2]\right) \\ \overset{(2)}{\leqslant} g(x_k) - \tau_x\langle \nabla g(x_k),\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k)\rangle + \frac{\tau_x^2 L}{2}\left((M + 1)\mathbb{E}[\|\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k)\|^2] + \sigma^2\right) \end{gathered}$$
(A.7)
$$\begin{gathered} \leqslant g(x_k) + \frac{\tau_x}{2}\left(\pm\|\nabla g(x_k)\|^2 - 2\langle \nabla g(x_k),\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k)\rangle + \|\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k)\|^2\right) + \frac{\tau_x^2 L}{2}\sigma^2 \\ = g(x_k) + \frac{\tau_x}{2}\left(-\|\nabla g(x_k)\|^2 + \|\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k) - \nabla g(x_k)\|^2\right) + \frac{\tau_x^2 L}{2}\sigma^2, \end{gathered}$$

where (1) uses Definition 2 and (2) uses Assumption 4.

Now, it is sufficient to express \(\|\nabla g(x_t)\|^{2}\) and \(\|\nabla_x f(x_t,y_t) - \nabla g(x_t)\|^{2}\) in terms of \(a_t\) and \(b_t\). Using Lemma 2, we have

$$\begin{gathered} {\text{||}}{{\nabla }_{x}}f({{x}_{t}},{{y}_{t}}) - \nabla g({{x}_{t}}){\text{|}}{{{\text{|}}}^{2}} \\ \, = {\text{||}}{{\nabla }_{x}}f({{x}_{t}},{{y}_{t}}) - {{\nabla }_{x}}f({{x}_{t}},y\text{*}({{x}_{t}})){\text{|}}{{{\text{|}}}^{2}} \\ \, \leqslant L_{2}^{2}{\text{||}}y\text{*}({{x}_{t}}) - {{y}_{t}}{\text{|}}{{{\text{|}}}^{2}} \\ \end{gathered} $$
(A.8)

for any \(y\text{*}(x_t) \in \arg\max_y f(x_t,y)\). Now, we can fix \(y\text{*}(x_t)\) as the projection of \(y_t\) onto the set \(\arg\max_y f(x_t,y)\). Since \(-f(x_t, \cdot)\) satisfies the PL condition with \(\mu_y\), Lemma 1 implies that it also satisfies the quadratic growth condition with \(\mu_y\), i.e.,

$${\text{||}}y\text{*}({{x}_{t}}) - {{y}_{t}}{\text{|}}{{{\text{|}}}^{2}} \leqslant \frac{2}{{{{\mu }_{y}}}}[g({{x}_{t}}) - f({{x}_{t}},{{y}_{t}})],$$
(A.9)

taking into account (A.8), we obtain

$${\text{||}}{{\nabla }_{x}}f({{x}_{t}},{{y}_{t}}) - \nabla g({{x}_{t}}){\text{|}}{{{\text{|}}}^{2}} \leqslant \frac{{2L_{2}^{2}}}{{{{\mu }_{y}}}}[g({{x}_{t}}) - f({{x}_{t}},{{y}_{t}})].$$
(A.10)

Since g satisfies the PL condition with \({{\mu }_{x}}\),

$${\text{||}}\nabla g({{x}_{t}}){\text{|}}{{{\text{|}}}^{2}} \geqslant 2{{\mu }_{x}}[g({{x}_{t}}) - g\text{*}].$$
(A.11)

By taking the expectation of both sides of (A.7) and substituting (A.10) and (A.11), we obtain

$${{a}_{{t + 1}}} \leqslant (1 - {{\tau }_{x}}{{\mu }_{x}}){{a}_{t}} + {{\tau }_{x}}\frac{{L_{2}^{2}}}{{{{\mu }_{y}}}}{{b}_{t}} + \frac{{{{\tau }_{x}}}}{2}{\text{||}}{{{\mathbf{b}}}_{x}}{\text{|}}{{{\text{|}}}^{2}}.$$
(A.12)

Since \( - f({{x}_{{t + 1}}},y)\) is L2-smooth and \({{\mu }_{y}}\)-PL, based on inequality (A.3) from Lemma 3 for \({{\tau }_{y}} \leqslant \frac{1}{{(M + 1){{L}_{2}}}}\), we have

$$\begin{gathered} \mathbb{E}[g(x_{t + 1}) - f(x_{t + 1},y_{t + 1})] \leqslant (1 - \mu_y\tau_y)\mathbb{E}[g(x_{t + 1}) - f(x_{t + 1},y_t)] + \frac{\tau_y\zeta^2}{2} + \frac{\tau_y^2 L_2}{2}\sigma^2 \\ \leqslant (1 - \mu_y\tau_y)\mathbb{E}[g(x_t) - f(x_t,y_t) + f(x_t,y_t) - f(x_{t + 1},y_t) + g(x_{t + 1}) - g(x_t)] + \frac{\tau_y\zeta^2}{2} + \frac{\tau_y^2 L_2}{2}\sigma^2. \end{gathered}$$
(A.13)

Using Lemma 3, we can bound \(f(x_t,y_t) - f(x_{t + 1},y_t)\) as follows:

$$f({{x}_{t}},{{y}_{t}}) - f({{x}_{{t + 1}}},{{y}_{t}}) \leqslant \frac{{{{\tau }_{x}}}}{2}{{\zeta }^{2}} + \frac{{\tau _{x}^{2}{{L}_{2}}}}{2}{{\sigma }^{2}}.$$
(A.14)

In addition, from (A.12), we have

$$\mathbb{E}[g({{x}_{{t + 1}}}) - g({{x}_{t}})] \leqslant - {{\tau }_{x}}{{\mu }_{x}}{{a}_{t}} + \frac{{{{\tau }_{x}}L_{2}^{2}}}{{{{\mu }_{y}}}}{{b}_{t}} + \frac{{{{\tau }_{x}}}}{2}{{\zeta }^{2}}.$$
(A.15)

By combining (A.13), (A.14), and (A.15), we obtain

$$\begin{gathered} \mathbb{E}[g(x_{t + 1}) - f(x_{t + 1},y_{t + 1})] \leqslant (1 - \mu_y\tau_y)\left(-\tau_x\mu_x a_t + \left(1 + \frac{\tau_x L_2^2}{\mu_y}\sigma^2\right)b_t\right) + (1 - \mu_y\tau_y)\left(\tau_x\zeta^2 + \frac{\tau_x^2 L_2}{2}\sigma^2\right) + \frac{\tau_y^2 L_2}{2}\sigma^2 + \frac{\tau_y\zeta^2}{2} \\ \leqslant (1 - \mu_y\tau_y)\left(-\tau_x\mu_x a_t + \left(1 + \frac{\tau_x L_2^2}{\mu_y}\sigma^2\right)b_t\right) + \tau_y^2 L_2\sigma^2 + \frac{3}{4}\tau_y\zeta^2. \end{gathered}$$
(A.16)

Here, the last inequality takes into account that \({{\tau }_{x}}\) is smaller than \({{\tau }_{y}}\). We can even assume that \({{\tau }_{x}} \leqslant \frac{\lambda }{2}{{\tau }_{y}}\). By combining (A.12) and (A.16), \(\forall \lambda > 0\), we obtain

$$\begin{gathered} {{a}_{{t + 1}}} + \lambda {{b}_{{t + 1}}} \leqslant {{a}_{t}}\left[ {1 - {{\mu }_{x}}{{\tau }_{x}} - \lambda (1 - {{\mu }_{y}}{{\tau }_{y}}){{\mu }_{x}}{{\tau }_{x}}} \right] \\ \, + \lambda {{b}_{t}}\left[ {1 + \frac{{L_{2}^{2}{{\tau }_{x}}}}{{{{\mu }_{y}}\lambda }} - {{\mu }_{y}}{{\tau }_{y}} + \frac{{{{\tau }_{x}}L_{2}^{2}}}{{{{\mu }_{y}}}}{{\sigma }^{2}} - {{\tau }_{x}}{{\tau }_{y}}L_{2}^{2}{{\sigma }^{2}}} \right] \\ \, + \lambda \left( {\tau _{y}^{2}{{L}_{2}}{{\sigma }^{2}} + {{\tau }_{y}}{{\zeta }^{2}}} \right). \\ \end{gathered} $$

By adding batching (with batch size B), we arrive at

$$\begin{gathered} a_{t + 1} + \lambda b_{t + 1} \leqslant a_t\left[1 - \mu_x\tau_x - \lambda(1 - \mu_y\tau_y)\mu_x\tau_x\right] \\ + \lambda b_t\left[1 + \frac{L_2^2\tau_x}{\mu_y\lambda} - \mu_y\tau_y + \frac{\tau_x L_2^2}{\mu_y}\frac{\sigma^2}{B} - \tau_x\tau_y L_2^2\frac{\sigma^2}{B}\right] \\ + \lambda\left(\tau_y^2 L_2\frac{\sigma^2}{B} + \tau_y\zeta^2\right). \end{gathered}$$
(A.17)

\(\square \)

Proof of Theorem 1.

Proof. Under the conditions of Theorem 4 with \(\tau_x^t = \tau_x\) and \(\tau_y^t = \tau_y\) for all \(t\), we only need to select \(\tau_x\), \(\tau_y\), and \(\lambda\) so that \(k_1, k_2 < 1\). We first set \(\lambda = 1/10\). Then,

$${{k}_{1}} = 1 - {{\mu }_{x}}[{{\tau }_{x}} + \lambda (1 - {{\mu }_{y}}{{\tau }_{y}}){{\tau }_{x}}] \leqslant 1 - {{\tau }_{x}}{{\mu }_{x}}.$$
(A.18)

In addition,

$$\begin{gathered} {{k}_{2}} = 1 + \frac{{{{\tau }_{x}}L_{2}^{2}}}{{{{\mu }_{y}}\lambda }} - {{\mu }_{y}}{{\tau }_{y}} + \frac{{{{\tau }_{x}}L_{2}^{2}}}{{{{\mu }_{y}}}}\frac{{{{\sigma }^{2}}}}{B} - {{\tau }_{x}}{{\tau }_{y}}L_{2}^{2}\frac{{{{\sigma }^{2}}}}{B} \\ \, = 1\, - \,\frac{{{{\tau }_{x}}L_{2}^{2}}}{{{{\mu }_{y}}}}\left\{ {\frac{{\mu _{y}^{2}{{\tau }_{y}}}}{{{{\tau }_{x}}L_{2}^{2}}}\, - \,\frac{1}{\lambda }\, - \,\frac{{{{\sigma }^{2}}}}{B}(1\, - \,{{\mu }_{y}}{{\tau }_{y}})} \right\}\, \leqslant \,1\, - \,\frac{{{{\tau }_{x}}L_{2}^{2}}}{{{{\mu }_{y}}}}, \\ \end{gathered} $$
(A.19)

where, in the last inequality, \(\lambda\) is substituted and the bound \(\frac{\mu_y^2\tau_y}{\tau_x L_2^2} \geqslant 12\) is used, which holds due to the choice of \(\tau_x\). By choosing a large \(B\) on the order of \(d^2\), we can make \(\frac{\sigma^2}{B} \leqslant 1\). Note that \(\tau_x\mu_x < \frac{L_2^2\tau_x}{\mu_y}\) because \(\left(\tau_x\mu_x\right)/\left(\frac{L_2^2\tau_x}{\mu_y}\right) = \frac{\mu_x\mu_y}{L_2^2} < 1\). Define \(P_t := a_t + \frac{1}{10}b_t\); then, by Theorem 4,

$${{P}_{{t + 1}}} \leqslant \left( {1 - {{\tau }_{x}}{{\mu }_{x}}} \right){{P}_{t}} + \frac{1}{{10}}\left( {\tau _{y}^{2}{{L}_{2}}\frac{{{{\sigma }^{2}}}}{B} + {{\tau }_{y}}{{\zeta }^{2}}} \right).$$

By simple calculations, we obtain

$${{P}_{t}} \leqslant {{(1 - {{\mu }_{x}}{{\tau }_{x}})}^{t}}{{P}_{0}} + \frac{{\tau _{y}^{2}{{L}_{2}}\frac{{{{\sigma }^{2}}}}{B} + {{\tau }_{y}}{{\zeta }^{2}}}}{{10{{\mu }_{x}}{{\tau }_{x}}}}.$$
(A.20)

The condition \(\tau_x \leqslant \frac{1}{(M + 1)L}\) is verified as follows: \(\tau_x \leqslant \frac{\mu_y^2\tau_y}{12L_2^2} \leqslant \frac{\mu_y^2}{12(M + 1)L_2^3} \leqslant \frac{\mu_y}{2(M + 1)L_2^2}\) and \(L = L_2 + \frac{L_2^2}{\mu_y} \leqslant \frac{2L_2^2}{\mu_y}\), so that indeed \(\tau_x \leqslant \frac{1}{(M + 1)L}\).

\(\square \)
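Algorithm 1 and its gradient approximations are specified in the main text; the following is only a minimal Python sketch of a two-step-size stochastic gradient descent–ascent loop of the kind analysed in Theorem 4, run with exact stochastic first-order oracles on our own toy strongly-convex–strongly-concave quadratic saddle function (which, in particular, satisfies the two-sided PL condition). All names and constants in the snippet are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
dx, dy, B, sigma = 10, 10, 8, 0.1
mu_x, mu_y = 1.0, 1.0
C = 0.3 * rng.standard_normal((dx, dy))
# f(x, y) = 0.5*mu_x*||x||^2 + x^T C y - 0.5*mu_y*||y||^2, saddle point at (0, 0)

def grad_x(x, y): return mu_x * x + C @ y
def grad_y(x, y): return C.T @ x - mu_y * y

L2 = max(mu_x, mu_y) + np.linalg.norm(C, 2)    # smoothness constant of f
L = L2 + L2**2 / mu_y                          # smoothness of g(x) = max_y f(x, y), Lemma 2
tau_x, tau_y = 1.0 / L, 1.0 / L2               # step sizes as in Theorem 4 (M = 0 here)

x, y = rng.standard_normal(dx), rng.standard_normal(dy)
for t in range(2000):
    gx = np.mean([grad_x(x, y) + sigma * rng.standard_normal(dx) for _ in range(B)], axis=0)
    x = x - tau_x * gx                         # descent step in x
    gy = np.mean([grad_y(x, y) + sigma * rng.standard_normal(dy) for _ in range(B)], axis=0)
    y = y + tau_y * gy                         # ascent step in y

print("||x_T||, ||y_T||:", np.linalg.norm(x), np.linalg.norm(y), "(both close to 0)")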

PROOFS FOR ZERO-ORDER METHODS

Here, we prove lemmas for different variants of the problem. In the following lemmas, we do not confine ourselves to the saddle point problem and focus on the kernel approximation of the gradient; for this reason, we consider the problem \(\min_{x \in \mathbb{R}^{d}} f(x)\) in what follows.

Lemma 5 (reduction of an integral over a domain to an integral over its surface). Suppose that \(D\) is an open connected subset of \(\mathbb{R}^{d}\) with piecewise-smooth boundary \(\partial D\) that is oriented along the outer unit normal \({\mathbf{n}} = (n_1, \ldots, n_d)^{\top}\). Suppose also that \(f\) is a smooth function on \(D \cup \partial D\); then

$$\int\limits_D {\nabla f(x)dx} = \int\limits_{\partial D} f (x){\mathbf{n}}(x)dS(x).$$

Remark 4. For the definition of piecewise-smooth surfaces and their orientations, we refer to [25], Section 12.3.2, Definitions 4 and 5, respectively.
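For instance, in the one-dimensional case \(D = (a,b)\), the boundary consists of the two points \(a\) and \(b\) with outer normals \(-1\) and \(+1\), and the lemma reduces to the Newton–Leibniz formula:

$$\int\limits_a^b f'(x)\,dx = f(b) \cdot 1 + f(a) \cdot (-1) = f(b) - f(a).$$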

Lemma 6. Suppose that \(f:{{\mathbb{R}}^{d}} \to \mathbb{R}\) is a continuously differentiable function. Suppose also that \(r,{\mathbf{h}},{\mathbf{e}}\) are uniformly distributed over \([ - 1,1],\mathcal{B}_{2}^{d}\), and \({{\mathcal{S}}^{d}}\), respectively. Then, for any \(\gamma > 0\), we have

$$\mathbb{E}[\nabla f(x + \gamma r{\mathbf{h}})rK(r)] = \frac{d}{\gamma }\mathbb{E}[f(x + \gamma r{\mathbf{e}}){\mathbf{e}}K(r)].$$

Proof. Let us fix \(r \in [ - 1,1]{{\backslash }}\{ 0\} \). We define \(\phi :{{\mathbb{R}}^{d}} \to \mathbb{R}\) as \(\phi ({\mathbf{h}}) = f(x + \gamma r{\mathbf{h}})K(r)\) and note that \(\nabla \phi ({\mathbf{h}}) = \gamma r\nabla f(x + \gamma r{\mathbf{h}})K(r)\). Hence, we have

$$\mathbb{E}[\nabla f(x + \gamma r{\mathbf{h}})K(r)\,|\,r] = \frac{1}{\gamma r}\mathbb{E}[\nabla \phi({\mathbf{h}})\,|\,r] = \frac{d}{\gamma r}\mathbb{E}[\phi({\mathbf{e}}){\mathbf{e}}\,|\,r] = \frac{d}{\gamma r}K(r)\mathbb{E}[f(x + \gamma r{\mathbf{e}}){\mathbf{e}}\,|\,r],$$

where the second equality follows from Lemma 5. The proof is completed by multiplying both sides by \(r\) (note that \(r \ne 0\) almost surely because \(r\) has a continuous distribution) and taking the full expectation.

\(\square \)
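The identity of Lemma 6 is easy to verify by Monte Carlo simulation. In the Python sketch below, the test function \(f\), the kernel \(K(r) = 3r/2\), and all sampling choices are our own illustrative assumptions; the two printed vectors agree up to Monte Carlo error.

import numpy as np

rng = np.random.default_rng(2)
d, gamma, n = 3, 0.5, 500_000
x = np.array([0.3, -0.7, 1.1])
a = np.array([1.0, 2.0, -0.5])

f = lambda z: np.sin(z @ a) + 0.5 * np.sum(z * z, axis=-1)
grad_f = lambda z: np.cos(z @ a)[..., None] * a + z
K = lambda r: 1.5 * r                        # any integrable kernel works for the identity

r = rng.uniform(-1.0, 1.0, size=n)
g = rng.standard_normal((n, d))
e = g / np.linalg.norm(g, axis=1, keepdims=True)           # uniform on the unit sphere
h = e * rng.uniform(0.0, 1.0, size=(n, 1)) ** (1.0 / d)    # uniform in the unit ball

lhs = np.mean(grad_f(x + gamma * r[:, None] * h) * (r * K(r))[:, None], axis=0)
rhs = d / gamma * np.mean(f(x + gamma * r[:, None] * e)[:, None] * e * K(r)[:, None], axis=0)
print("E[grad f(x + gamma r h) r K(r)] :", np.round(lhs, 2))
print("(d/gamma) E[f(x + gamma r e) e K(r)] :", np.round(rhs, 2))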

PROOF OF THEOREM 2

Lemma 7 (kernel approximation bias). Suppose that Assumptions 1–3 hold. Suppose also that \({{x}_{t}}\) and \({\mathbf{G}}({{x}_{t}},{\mathbf{e}})\) are determined by Algorithm 1 at instant \(t \geqslant 1\) with gradient approximation (4.2) for zero-order oracle (4.1). Then,

$$\begin{gathered} {\text{||}}\mathbb{E}[{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}})\,{\text{|}}\,{{x}_{t}}] - \nabla f({{x}_{t}}){\text{||}} \\ \, \leqslant {{\kappa }_{\beta }}\frac{{{{L}_{\beta }}}}{{(l - 1)!}} \cdot \frac{d}{{d + \beta - 1}}{{\gamma }^{{\beta - 1}}} + {{\kappa }_{\beta }}d\frac{\Delta }{\gamma }, \\ \end{gathered} $$
(B.1)

where we recall that \(l=\beta \).

Proof. Using Lemma 6, the fact that \(\int_{ - 1}^1 rK(r)dr\) = 1, and the variational representation of the Euclidean norm, we can write

$${\text{||}}\mathbb{E}[{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}})\,{\text{|}}\,{{x}_{t}}] - \nabla f({{x}_{t}}){\text{||}}$$
$$\begin{gathered} \, = \mathop {\sup }\limits_{{\mathbf{v}} \in {{\mathcal{S}}^{d}}} \mathbb{E}[({{\nabla }_{{\mathbf{v}}}}f(x + \gamma r{\mathbf{h}},\xi ) - {{\nabla }_{{\mathbf{v}}}}f(x,\xi ) \\ \, + \frac{d}{{2\gamma r}}(\delta (x + \gamma r{\mathbf{h}}) - \delta (x - \gamma r{\mathbf{h}})))rK(r)] \\ \end{gathered} $$
(B.2)
$$\, \leqslant \mathop {\sup }\limits_{{\mathbf{v}} \in {{\mathcal{S}}^{d}}} \mathbb{E}[({{\nabla }_{{\mathbf{v}}}}f(x + \gamma r{\mathbf{h}}) - {{\nabla }_{{\mathbf{v}}}}f(x))rK(r)] + {{\kappa }_{\beta }}d\frac{\Delta }{\gamma },$$

where h is uniformly distributed over \(\mathcal{B}_{2}^{d}\). Since \(f(x)\) satisfies the Hölder condition with constants \(\beta \) and \({{L}_{\beta }}\), for any \({\mathbf{v}} \in {{\mathcal{S}}^{d}}\), directed gradient \({{\nabla }_{{\mathbf{v}}}}f( \cdot )\) satisfies the Hölder condition with constants \(\beta - 1\) and \({{L}_{\beta }}\). Thus, the following Taylor expansion holds:

$$\begin{gathered} {{\nabla }_{{\mathbf{v}}}}f({{x}_{t}} + \gamma r{\mathbf{h}}) = {{\nabla }_{{\mathbf{v}}}}f({{x}_{t}}) \\ \, + \sum\limits_{1 \leqslant |{\mathbf{m}}| \leqslant l - 1} \frac{{{{{(r\gamma )}}^{{|{\mathbf{m}}|}}}}}{{{\mathbf{m}}!}}{{D}^{{\mathbf{m}}}}{{\nabla }_{{\mathbf{v}}}}f({{x}_{t}})({\mathbf{h}}{{)}^{{\mathbf{m}}}} + R(\gamma r{\mathbf{h}}), \\ \end{gathered} $$
(B.3)

where remainder term \(R( \cdot )\) satisfies condition \({\text{|}}R(x){\text{|}} \leqslant \frac{{{{L}_{\beta }}}}{{(l - 1)!}}{\text{||}}x{\text{|}}{{{\text{|}}}^{{\beta - 1}}}\).

Substituting equation (B.3) into equation (B.2) and using the “nullification” properties of kernel K, we arrive at

$$\begin{gathered} \|\mathbb{E}[{\mathbf{G}}(x_t,\xi,{\mathbf{e}})\,|\,x_t] - \nabla f(x_t)\| \leqslant \kappa_\beta\gamma^{\beta - 1}\frac{L_\beta}{(l - 1)!}\mathbb{E}\|{\mathbf{h}}\|^{\beta - 1} + \kappa_\beta d\frac{\Delta}{\gamma} \\ = \kappa_\beta\gamma^{\beta - 1}\frac{L_\beta}{(l - 1)!}\frac{d}{d + \beta - 1} + \kappa_\beta d\frac{\Delta}{\gamma}, \end{gathered}$$

where the last equality is obtained from the fact that \(\mathbb{E}{\text{||}}{\mathbf{h}}{\text{|}}{{{\text{|}}}^{q}} = \frac{d}{{d + q}}\) for any \(q \geqslant 0\).

\(\square \)

Lemma 8 (kernel approximation variance). Suppose that Assumptions 1–3 hold and that \(x_t\) and \({\mathbf{G}}(x_t,\xi,{\mathbf{e}})\) are determined by Algorithm 1 with gradient approximation (4.2) for zero-order oracle (4.1). Suppose also that \(f \in \mathcal{F}_2(L_2)\); then, for \(d \geqslant 2\),

$$\mathbb{E}\|{\mathbf{G}}(x_t,\xi,{\mathbf{e}})\|^2 \leqslant \frac{d^2\kappa}{d - 1}\mathbb{E}\left[\left(\|\nabla f(x_t)\| + L_2\gamma\right)^2\right] + \frac{d^2\Delta^2\kappa}{\gamma^2},$$

where we recall that \(\kappa = \int_{ - 1}^1 {{K}^{2}}(r)dr\).

The result of Lemma 8 can be further simplified as follows:

$$\begin{gathered} \mathbb{E}{\text{||}}{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}}){\text{|}}{{{\text{|}}}^{2}} \leqslant 4d\kappa \mathbb{E}{\text{||}}\nabla f({{x}_{t}}){\text{|}}{{{\text{|}}}^{2}} \\ \, + 4d\kappa L_{2}^{2}{{\gamma }^{2}} + \frac{{{{d}^{2}}{{\Delta }^{2}}\kappa }}{{{{\gamma }^{2}}}},\quad d \geqslant 2. \\ \end{gathered} $$
(B.4)
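Indeed, for \(d \geqslant 2\) we have \(\frac{d^2}{d - 1} \leqslant 2d\); combining this with the elementary inequality \((a + b)^2 \leqslant 2a^2 + 2b^2\) applied to the bound of Lemma 8 gives

$$\frac{d^2\kappa}{d - 1}\mathbb{E}\left[\left(\|\nabla f(x_t)\| + L_2\gamma\right)^2\right] + \frac{d^2\Delta^2\kappa}{\gamma^2} \leqslant 2d\kappa\,\mathbb{E}\left[2\|\nabla f(x_t)\|^2 + 2L_2^2\gamma^2\right] + \frac{d^2\Delta^2\kappa}{\gamma^2} = 4d\kappa\,\mathbb{E}\|\nabla f(x_t)\|^2 + 4d\kappa L_2^2\gamma^2 + \frac{d^2\Delta^2\kappa}{\gamma^2},$$

which is exactly (B.4).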

Proof. For simplicity, we omit subscript \(t\) for all quantities. Let us write the second moment of the following quantity:

$$\begin{gathered} \mathbb{E}\|{\mathbf{G}}(x,\xi,{\mathbf{e}})\|^2 = \frac{d^2}{4\gamma^2}\mathbb{E}\left[\left(f(x + \gamma r{\mathbf{e}},\xi) - f(x - \gamma r{\mathbf{e}},\xi) + \delta(x + \gamma r{\mathbf{e}}) - \delta(x - \gamma r{\mathbf{e}})\right)^2 K^2(r)\right] \\ \leqslant \frac{d^2}{4\gamma^2}\left(\mathbb{E}\left[\left(f(x + \gamma r{\mathbf{e}},\xi) - f(x - \gamma r{\mathbf{e}},\xi)\right)^2 K^2(r)\right] + 4\kappa\Delta^2\right). \end{gathered}$$
(B.5)

Below, all expectations are understood conditionally on \(x_t\). It should be noted that, since \(\mathbb{E}[f(x + hr{\mathbf{e}}) - f(x - hr{\mathbf{e}})\,|\,r] = 0\) and \(f \in \mathcal{F}_2(L_2)\), the Wirtinger–Poincaré inequality [22, 23] (see Eq. (3.1) and Theorem 2 therein) leads to

$$\begin{gathered} \mathbb{E}\left[ {{{{(f(x + hr{\mathbf{e}}) - f(x - hr{\mathbf{e}}))}}^{2}}\,{\text{|}}\,r} \right] \\ \, \leqslant \frac{{{{h}^{2}}}}{{d - 1}}\mathbb{E}\left[ {{\text{||}}\nabla f(x + hr{\mathbf{e}}) + \nabla f(x - hr{\mathbf{e}}){\text{|}}{{{\text{|}}}^{2}}\,{\text{|}}\,r} \right]. \\ \end{gathered} $$
(B.6)

Since \(f \in {{\mathcal{F}}_{2}}({{L}_{2}})\), the triangle inequality implies that

$$\begin{gathered} \mathbb{E}\left[ {{\text{||}}\nabla f(x + hr{\mathbf{e}}) + \nabla f(x - hr{\mathbf{e}}){\text{|}}{{{\text{|}}}^{2}}\,{\text{|}}\,r} \right] \\ \, \leqslant 4({\text{||}}\nabla f(x){\text{||}} + {{L}_{2}}\gamma {{)}^{2}}. \\ \end{gathered} $$
(B.7)

Finally, we substitute the above estimate into (B.6) and take (B.5) into account.

\(\square \)

We can now compute the noise and bias of the kernel approximation:

$$M = 4d\beta^3,\quad \sigma^2 = 4d\beta^3 L_2\gamma^2 + \frac{d^2\Delta^2\beta^3}{\gamma^2},$$
(B.8)
$${{\zeta }^{2}} = {{\beta }^{2}}{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}\frac{d}{{d + \beta - 1}}{{\gamma }^{{\beta - 1}}} + d\frac{\Delta }{\gamma }} \right)}^{2}}.$$
(B.9)

A rougher estimate for the bias is

$${{\zeta }^{2}} = {{\beta }^{2}}{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}} \right)}^{2}}{{\gamma }^{{2\beta - 2}}} + {{\beta }^{2}}{{d}^{2}}\frac{{{{\Delta }^{2}}}}{{{{\gamma }^{2}}}}.$$

Now, we can estimate the convergence rate for the kernel approximation by substituting the found constants into the final convergence estimate:

$${{P}_{t}} \leqslant {{(1 - {{\mu }_{x}}{{\tau }_{x}})}^{t}}{{P}_{0}} + \frac{{\tau _{y}^{2}{{L}_{2}}\frac{{{{\sigma }^{2}}}}{B} + {{\tau }_{y}}{{\zeta }^{2}}}}{{10{{\mu }_{x}}{{\tau }_{x}}}} = (1 - {{\mu }_{x}}{{\tau }_{x}}{{)}^{t}}{{P}_{0}}$$
$$\begin{gathered} \, + \frac{{12}}{{5B}}\frac{{L_{2}^{3}d{{\gamma }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}} + \frac{3}{{5B}}\frac{{L_{2}^{2}{{d}^{2}}{{\Delta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}{{\gamma }^{2}}}} \\ \, + \frac{{12}}{5}\frac{{L_{2}^{2}{{\beta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}}{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}} \right)}^{2}}{{\gamma }^{{2\beta - 2}}} + \frac{{12}}{5}\frac{{L_{2}^{2}{{\beta }^{2}}{{d}^{2}}{{\Delta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}{{\gamma }^{2}}}} \\ \end{gathered} $$
(B.10)
$$\, = \mathcal{O}\left( {\frac{{L_{2}^{2}d{{\gamma }^{2}}}}{{B{{\mu }_{x}}\mu _{y}^{2}}} + \frac{{L_{2}^{2}{{\beta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}}{{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}} \right)}}^{2}}{{\gamma }^{{2\beta - 2}}} + \frac{{L_{2}^{2}{{\beta }^{2}}{{d}^{2}}{{\Delta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}{{\gamma }^{2}}}}} \right).$$

Here, we substitute the values for \({{\tau }_{y}} = \frac{1}{{(M + 1){{L}_{2}}}}\) and \({{\tau }_{x}} = \frac{{\mu _{y}^{2}{{\tau }_{y}}}}{{12L_{2}^{2}}}\).

Since \(B\) can be chosen large, the second and third terms determine the asymptotic accuracy. We now find the smoothing parameter γ that minimizes the sum of the last two terms:

$$\begin{gathered} {{P}_{t}} = \mathcal{O}\left( {\frac{{L_{2}^{2}{{\beta }^{2}}{{d}^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}}{{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}} \right)}}^{{\frac{2}{\beta }}}}{{{\left( {\frac{{\beta - 1}}{{d + \beta - 1}}} \right)}}^{{\frac{2}{\beta }}}}{{\Delta }^{{\frac{{2(\beta - 1)}}{\beta }}}}} \right) \\ \, = \mathcal{O}\left( {\frac{1}{{{{\mu }_{x}}\mu _{y}^{2}}}{{d}^{{\frac{{2(\beta - 1)}}{\beta }}}}{{\Delta }^{{\frac{{2(\beta - 1)}}{\beta }}}}} \right) \\ \end{gathered} $$
(B.11)

where \({{\gamma }_{k}} = {{\left( {\frac{{(l - 1)!}}{{{{L}_{\beta }}}}\frac{{d + \beta - 1}}{{\beta - 1}}\Delta } \right)}^{{1{\text{/}}\beta }}}\) is the optimal smoothing parameter. Then, from (B.11), we can find the maximum noise level while assuming that \({{(d\Delta )}^{{\frac{{2(\beta - 1)}}{\beta }}}}\, \leqslant \,\varepsilon \) for \(\varepsilon > 0\). Thus, we have

$$\Delta = \mathcal{O}\left( {{{{({{\mu }_{x}}\mu _{y}^{2})}}^{{\frac{\beta }{{2(\beta - 1)}}}}}{{\varepsilon }^{{\frac{\beta }{{2(\beta - 1)}}}}}{{d}^{{ - 1}}}} \right).$$

With this maximum noise, \({{\gamma }_{k}} = \mathcal{O}\left( {{{{({{\mu }_{x}}\mu _{y}^{2}\varepsilon )}}^{{\frac{1}{{2(\beta - 1)}}}}}} \right)\). Hence, we guarantee that the second and third terms in (B.10) are smaller than \(\varepsilon \) (up to a constant) for the selected parameters. To reduce the number of iterations, we select the batch size on the order of \({{\beta }^{3}}d\). Let us determine the minimum number of iterations by solving the following inequality:

$${{(1 - {{\mu }_{x}}{{\tau }_{x}})}^{N}}{{P}_{0}} \leqslant \varepsilon .$$

Thus, the minimum number of iterations is

$$\begin{gathered} N \geqslant \frac{1}{{{{\tau }_{x}}{{\mu }_{x}}}}\ln \frac{{{{P}_{0}}}}{\varepsilon } = 12\frac{{({{\beta }^{3}}d{\text{/}}B + 1)L_{2}^{3}}}{{{{\mu }_{x}}\mu _{y}^{2}}}\ln \frac{{{{P}_{0}}}}{\varepsilon } \\ \, = \mathcal{O}\left( {\mu _{x}^{{ - 1}}\mu _{y}^{{ - 2}}\ln \frac{1}{\varepsilon }} \right). \\ \end{gathered} $$

In the second equality, we use the fact that \(\tau_x = \frac{\mu_y^2\tau_y}{12L_2^2}\), \(\tau_y = \frac{1}{(M + 1)L_2}\), and \(M = \mathcal{O}(\beta^3 d{\text{/}}B)\), where \(d = \max(d_x, d_y)\). For a sufficiently large \(B\) on the order of \(\beta^3 d\), the dependence on the dimension disappears.

The oracle complexity is found from the iterative complexity through multiplying by the batch size, i.e.,

$$T = \mathcal{O}\left( {{{\beta }^{3}}d\mu _{x}^{{ - 1}}\mu _{y}^{{ - 2}}\ln \frac{1}{\varepsilon }} \right).$$

Thus, all terms in formula (B.10) are smaller than ε.
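For convenience, the parameter choices read off from the proof above (maximum admissible noise level, smoothing parameter, batch size, and the iteration and oracle complexities) can be collected in a small Python helper. The absolute constants are indicative only, since the proof tracks them up to \(\mathcal{O}(\cdot)\); the function name and its interface are our own.

from math import factorial, log, ceil

def theorem2_parameters(mu_x, mu_y, L2, L_beta, beta, d, eps, P0=1.0):
    l = beta                                            # the paper takes l = beta
    delta_max = (mu_x * mu_y**2 * eps) ** (beta / (2.0 * (beta - 1))) / d
    gamma = (factorial(l - 1) / L_beta
             * (d + beta - 1) / (beta - 1) * delta_max) ** (1.0 / beta)
    B = ceil(beta**3 * d)                               # batch size of order beta^3 * d
    N = ceil(12 * (beta**3 * d / B + 1) * L2**3
             / (mu_x * mu_y**2) * log(P0 / eps))        # iteration complexity
    T = N * B                                           # oracle complexity
    return {"delta_max": delta_max, "gamma": gamma, "B": B, "N": N, "T": T}

print(theorem2_parameters(mu_x=0.1, mu_y=0.1, L2=1.0, L_beta=1.0, beta=3, d=50, eps=1e-3))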

PROOF OF THEOREM 3

Lemma 9 (kernel approximation bias). Suppose that Assumptions 1–5 hold. Suppose also that \({{x}_{t}}\) and \({\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}})\) are determined by Algorithm 1 at instant \(t \geqslant 1\) with gradient approximation (4.4) for zero-order oracle (4.3). Then,

$$\begin{gathered} {\text{||}}\mathbb{E}[{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}})\,{\text{|}}\,{{x}_{t}}] - \nabla f({{x}_{t}}){\text{||}} \\ \, \leqslant {{\kappa }_{\beta }}\frac{{{{L}_{\beta }}}}{{(l - 1)!}} \cdot \frac{d}{{d + \beta - 1}}{{\gamma }^{{\beta - 1}}}, \\ \end{gathered} $$
(C.1)

where we recall that \(l=\beta \).

Proof. Using Lemma 6, the fact that \(\int_{ - 1}^1 rK(r)dr\) = 1, and the variational representation of the Euclidean norm, we can write

$$\begin{gathered} {\text{||}}\mathbb{E}[{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}})\,{\text{|}}\,{{x}_{t}}] - \nabla f({{x}_{t}}){\text{||}} \\ \, = \mathop {\sup }\limits_{{\mathbf{v}} \in {{\mathcal{S}}^{d}}} \mathbb{E}[({{\nabla }_{{\mathbf{v}}}}f(x + \gamma r{\mathbf{h}}) - {{\nabla }_{{\mathbf{v}}}}f(x))rK(r)], \\ \end{gathered} $$
(C.2)

where h is uniformly distributed over \(\mathcal{B}_{2}^{d}\). Since \(f(x)\) satisfies the Hölder condition with constants \(\beta \) and \({{L}_{\beta }}\), for any \({\mathbf{v}} \in {{\mathcal{S}}^{d}}\), directed gradient \({{\nabla }_{{\mathbf{v}}}}f( \cdot )\) satisfies the Hölder condition with constants \(\beta - 1\) and \({{L}_{\beta }}\). Thus, the following Taylor expansion holds:

$$\begin{gathered} {{\nabla }_{{\mathbf{v}}}}f({{x}_{t}} + \gamma r{\mathbf{h}}) = {{\nabla }_{{\mathbf{v}}}}f({{x}_{t}}) \\ \, + \sum\limits_{1 \leqslant |{\mathbf{m}}| \leqslant l - 1} \frac{{{{{(r\gamma )}}^{{|{\mathbf{m}}|}}}}}{{{\mathbf{m}}!}}{{D}^{{\mathbf{m}}}}{{\nabla }_{{\mathbf{v}}}}f({{x}_{t}})({\mathbf{h}}{{)}^{{\mathbf{m}}}} + R(\gamma r{\mathbf{h}}), \\ \end{gathered} $$
(C.3)

where remainder term \(R( \cdot )\) satisfies condition \({\text{|}}R(x){\text{|}} \leqslant \frac{{{{L}_{\beta }}}}{{(l - 1)!}}{\text{||}}x{\text{|}}{{{\text{|}}}^{{\beta - 1}}}\).

By substituting equation (C.3) into equation (C.2) and using the “nullification” properties of kernel K, we arrive at

$$\begin{gathered} {\text{||}}\mathbb{E}[{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}})\,{\text{|}}\,{{x}_{t}}] - \nabla f({{x}_{t}}){\text{||}} \\ \leqslant {{\kappa }_{\beta }}{{\gamma }^{{\beta - 1}}}\frac{{{{L}_{\beta }}}}{{(l - 1)!}}\mathbb{E}{\text{||}}{\mathbf{h}}{\text{|}}{{{\text{|}}}^{{\beta - 1}}} = {{\kappa }_{\beta }}{{\gamma }^{{\beta - 1}}}\frac{{{{L}_{\beta }}}}{{(l - 1)!}}\frac{d}{{d + \beta - 1}}, \\ \end{gathered} $$

where the last equality is obtained from the fact that \(\mathbb{E}{\text{||}}{\mathbf{h}}{\text{|}}{{{\text{|}}}^{q}} = \frac{d}{{d + q}}\) for any \(q \geqslant 0\).

\(\square \)

Lemma 10 (kernel approximation variance). Suppose that Assumptions 1–3 hold and that \(x_t\) and \({\mathbf{G}}(x_t,\xi,{\mathbf{e}})\) are determined by Algorithm 1 with gradient approximation (4.4) for zero-order oracle (4.3). Suppose also that \(f \in \mathcal{F}_2(L_2)\). Then, for \(d \geqslant 2\), we have

$$\mathbb{E}\|{\mathbf{G}}(x_t,\xi,{\mathbf{e}})\|^2 \leqslant \frac{d^2\kappa}{d - 1}\mathbb{E}\left[\left(\|\nabla f(x_t)\| + L_2\gamma\right)^2\right] + \frac{d^2{\tilde{\Delta}}^2\kappa}{\gamma^2},$$

where \(\kappa = \int_{ - 1}^1 {{K}^{2}}(r)dr\).

The result of Lemma 10 can be further simplified as follows:

$$\begin{gathered} \mathbb{E}{\text{||}}{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}}){\text{|}}{{{\text{|}}}^{2}} \leqslant 4d\kappa \mathbb{E}{\text{||}}\nabla f({{x}_{t}}){\text{|}}{{{\text{|}}}^{2}} \\ \, + 4d\kappa L_{2}^{2}{{\gamma }^{2}} + \frac{{{{d}^{2}}{{{\tilde {\Delta }}}^{2}}\kappa }}{{{{\gamma }^{2}}}},\quad d \geqslant 2. \\ \end{gathered} $$
(C.4)

Proof. For simplicity, we omit subscript \(t\) for all quantities. Let us write the second moment of the following quantity:

$$\begin{gathered} \mathbb{E}\|{\mathbf{G}}(x,\xi,{\mathbf{e}})\|^2 = \frac{d^2}{4\gamma^2}\mathbb{E}\left[\left(f(x + \gamma r{\mathbf{e}}) - f(x - \gamma r{\mathbf{e}}) + \xi_1 - \xi_2\right)^2 K^2(r)\right] \\ \leqslant \frac{d^2}{4\gamma^2}\left(\mathbb{E}\left[\left(f(x + \gamma r{\mathbf{e}}) - f(x - \gamma r{\mathbf{e}})\right)^2 K^2(r)\right] + 4\kappa{\tilde{\Delta}}^2\right). \end{gathered}$$
(C.5)

Below, all expectations are understood conditionally on \(x_t\). It should be noted that, since \(\mathbb{E}[f(x + hr{\mathbf{e}}) - f(x - hr{\mathbf{e}})\,|\,r] = 0\) and \(f \in \mathcal{F}_2(L_2)\), the Wirtinger–Poincaré inequality [22, 23] (see Eq. (3.1) and Theorem 2 therein) leads to

$$\begin{gathered} \mathbb{E}\left[ {{{{(f(x + hr{\mathbf{e}}) - f(x - hr{\mathbf{e}}))}}^{2}}\,{\text{|}}\,r} \right] \\ \, \leqslant \frac{{{{h}^{2}}}}{{d - 1}}\mathbb{E}\left[ {{\text{||}}\nabla f(x + hr{\mathbf{e}}) + \nabla f(x - hr{\mathbf{e}}){\text{|}}{{{\text{|}}}^{2}}\,{\text{|}}\,r} \right]. \\ \end{gathered} $$
(C.6)

Since \(f \in {{\mathcal{F}}_{2}}({{L}_{2}})\), the triangle inequality implies that

$$\begin{gathered} \mathbb{E}\left[ {{\text{||}}\nabla f(x + hr{\mathbf{e}}) + \nabla f(x - hr{\mathbf{e}}){\text{|}}{{{\text{|}}}^{2}}\,{\text{|}}\,r} \right] \\ \, \leqslant 4({\text{||}}\nabla f(x){\text{||}} + {{L}_{2}}\gamma {{)}^{2}}. \\ \end{gathered} $$
(C.7)

Finally, we substitute the above estimate into (C.6) and take (C.5) into account.

\(\square \)

We can now compute the noise and bias of the kernel approximation:

$$M = 4d\beta^3,\quad \sigma^2 = 4d\beta^3 L_2\gamma^2 + \frac{d^2{\tilde{\Delta}}^2\beta^3}{\gamma^2},$$
(C.8)
$${{\zeta }^{2}} = {{\beta }^{2}}{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}\frac{d}{{d + \beta - 1}}{{\gamma }^{{\beta - 1}}}} \right)}^{2}}.$$
(C.9)

Now, we can estimate the convergence rate for the kernel approximation by substituting the constants found into the final convergence estimate:

$$\begin{gathered} {{P}_{t}}\, \leqslant \,{{(1\, - \,{{\mu }_{x}}{{\tau }_{x}})}^{t}}{{P}_{0}}\, + \,\frac{{\tau _{y}^{2}{{L}_{2}}{\kern 1pt} \frac{{{{\sigma }^{2}}}}{B}\, + \,{{\tau }_{y}}{{\zeta }^{2}}}}{{10{{\mu }_{x}}{{\tau }_{x}}}}\, = \,{{(1 - {{\mu }_{x}}{{\tau }_{x}})}^{t}}{{P}_{0}} \\ \, + \frac{{12}}{{5B}}{\kern 1pt} \frac{{L_{2}^{3}d{{\gamma }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}}\, + \,\frac{3}{{5B}}{\kern 1pt} \frac{{L_{2}^{2}{{d}^{2}}{{{\tilde {\Delta }}}^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}{{\gamma }^{2}}}}\, + \,\frac{{12}}{5}{\kern 1pt} \frac{{L_{2}^{2}{{\beta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}}{{\left( {\frac{{{{L}_{\beta }}}}{{(l\, - \,1)!}}} \right)}^{2}}{{\gamma }^{{2\beta - 2}}} \\ \, = \mathcal{O}\left( {\frac{{L_{2}^{3}{{\gamma }^{2}}}}{{B{{\mu }_{x}}\mu _{y}^{2}}}\, + \,\frac{{L_{2}^{2}d{{{\tilde {\Delta }}}^{2}}}}{{B{{\mu }_{x}}\mu _{y}^{2}{{\gamma }^{2}}}}\, + \,\frac{{L_{2}^{2}{{\beta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}}{{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}} \right)}}^{2}}{{\gamma }^{{2\beta - 2}}}} \right). \\ \end{gathered} $$
(C.10)

Here, we substitute the values \(\tau_y = \frac{1}{(M + 1)L_2}\) and \(\tau_x = \frac{\mu_y^2\tau_y}{12L_2^2}\). Let us now choose the smoothing parameter γ: minimizing the sum of the first two terms in (C.10) gives the optimal value \(\gamma_k = \sqrt[4]{\frac{d{\tilde{\Delta}}^2}{4L_2}}\). Requiring the last term in (C.10) to be at most \(\varepsilon\) then yields the maximum admissible noise level \(\tilde{\Delta} = \mathcal{O}\left(d^{-1/2}(\mu_x\mu_y^2\varepsilon)^{\frac{1}{\beta - 1}}\right)\), for which the smoothing parameter takes the form \(\gamma_k = \mathcal{O}\left((\mu_x\mu_y^2\varepsilon)^{\frac{1}{2(\beta - 1)}}\right)\). With these parameters, the last term is smaller than \(\varepsilon\), and selecting \(B\) on the order of \(\beta^3 d\) makes the first two terms in (C.10) smaller than \(\varepsilon\) as well. Let us determine the minimum number of iterations by solving the following inequality:

$${{(1 - {{\mu }_{x}}{{\tau }_{x}})}^{N}}{{P}_{0}} \leqslant \varepsilon $$

Thus, the minimum number of iterations is

$$\begin{gathered} N \geqslant \frac{1}{{{{\tau }_{x}}{{\mu }_{x}}}}\ln \frac{{{{P}_{0}}}}{\varepsilon } \\ \, = 12\frac{{\left( {{{\beta }^{3}}d{\text{/}}B + 1} \right)L_{2}^{3}}}{{{{\mu }_{x}}\mu _{y}^{2}}}\ln \frac{{{{P}_{0}}}}{\varepsilon } = \mathcal{O}\left( {\mu _{x}^{{ - 1}}\mu _{y}^{{ - 2}}\ln \frac{1}{\varepsilon }} \right). \\ \end{gathered} $$

Here, the second equality uses the fact that \(\tau_x = \frac{\mu_y^2\tau_y}{12L_2^2}\), \(\tau_y = \frac{1}{(M + 1)L_2}\), and \(M = \mathcal{O}(\beta^3 d{\text{/}}B)\), where \(d = \max(d_x, d_y)\). For a sufficiently large \(B\) on the order of \(\beta^3 d\), the dimension dependence disappears.

The oracle complexity is found from the iterative complexity through multiplying by the batch size, i.e.,

$$T = \mathcal{O}\left( {{{\beta }^{3}}d\mu _{x}^{{ - 1}}\mu _{y}^{{ - 2}}\ln \frac{1}{\varepsilon }} \right).$$

For these parameters, Algorithm 1 with gradient approximation (4.4) converges with the desired accuracy in this gradient-free oracle model (4.3).


Cite this article

Sadykov, S.I., Lobanov, A.V. & Raigorodskii, A.M. Gradient-Free Algorithms for Solving Stochastic Saddle Optimization Problems with the Polyak–Łojasiewicz Condition. Program Comput Soft 49, 535–547 (2023). https://doi.org/10.1134/S0361768823060063
