
Gradient-Free Algorithms for Solving Stochastic Saddle Optimization Problems with the Polyak–Łojasiewicz Condition


Abstract

This paper focuses on solving a subclass of stochastic nonconvex-nonconcave black-box optimization problems with a saddle point that satisfy the Polyak–Łojasiewicz (PL) condition. To solve such problems, we provide the first, to the best of our knowledge, gradient-free algorithm. The proposed approach is based on applying a gradient approximation (kernel approximation) to a stochastic gradient descent algorithm with a biased oracle. We present theoretical estimates that guarantee its global linear rate of convergence to the desired accuracy. The theoretical results are verified on a model example by comparison with an algorithm using Gaussian approximation.


REFERENCES

  1. Goodfellow, I., Bengio, Y., and Courville, A., Deep Learning, MIT Press, 2016.


  2. Dai, B., et al., SBEED: Convergent reinforcement learning with nonlinear function approximation, Proc. Int. Conf. Machine Learning, 2018, pp. 1125–1134.

  3. Namkoong, H. and Duchi, J.C., Variance-based regularization with convex objectives, Adv. Neural Inf. Process. Syst., 2017, vol. 30.

  4. Xu, L., et al., Maximum margin clustering, Adv. Neural Inf. Process. Syst., 2004, vol. 17.

  5. Sinha, A., et al., Certifying some distributional robustness with principled adversarial training, 2017.

  6. Audet, C. and Hare, W., Derivative-free and blackbox optimization, 2017.

  7. Rosenbrock, H.H., An automatic method for finding the greatest or least value of a function, Comput. J., 1960, vol. 3, no. 3, pp. 175–184.


  8. Gasnikov, A., et al., Randomized gradient-free methods in convex optimization, 2022.

  9. Lobanov, A., et al., Gradient-free federated learning methods with l1- and l2-randomization for non-smooth convex stochastic optimization problems, 2022.

  10. Gasnikov, A., et al., The power of first-order smooth optimization for black-box non-smooth problems, Proc. Int. Conf. Machine Learning, 2022, pp. 7241–7265.

  11. Bach, F. and Perchet, V., Highly-smooth zero-th order online optimization, Proc. Conf. Learning Theory, 2016, pp. 257–283.

  12. Beznosikov, A., Novitskii, V., and Gasnikov, A., One-point gradient-free methods for smooth and non-smooth saddle-point problems, Proc. 20th Int. Conf. Mathematical Optimization Theory and Operations Research (MOTOR), Irkutsk, Russia, 2021, pp. 144–158.

  13. Akhavan, A., Pontil, M., and Tsybakov, A., Exploiting higher order smoothness in derivative-free optimization and continuous bandits, Adv. Neural Inf. Process. Syst., 2020, vol. 33, pp. 9017–9027.


  14. Polyak, B.T., Gradient methods for the minimisation of functionals, USSR Comput. Math. Math. Phys., 1963, vol. 3, no. 4, pp. 864–878.


  15. Łojasiewicz, S., Une propriété topologique des sous-ensembles analytiques réels, Les Equations aux Dérivées Partielles, 1963, vol. 117, pp. 87–89.

  16. Ajalloeian, A. and Stich, S.U., On the convergence of SGD with biased gradients, 2020.

  17. Lobanov, A., Gasnikov, A., and Stonyakin, F., Highly smoothness zero-order methods for solving optimization problems under PL condition, 2023.

  18. Yue, P., Fang, C., and Lin, Z., On the lower bound of minimizing Polyak–Łojasiewicz functions, 2022.

  19. Yang, J., Kiyavash, N., and He, N., Global convergence and variance-reduced optimization for a class of nonconvex-nonconcave minimax problems, 2020.

  20. Akhavan, A., et al., Gradient-free optimization of highly smooth functions: Improved analysis and a new algorithm, 2023.

  21. Nouiehed, M., et al., Solving a class of non-convex min-max games using iterative first order methods, Adv. Neural Inf. Process. Syst., 2019, vol. 32.

  22. Osserman, R., The isoperimetric inequality, Bull. Am. Math. Soc., 1978, vol. 84, no. 6, pp. 1182–1238.


  23. Beckner, W., A generalized Poincaré inequality for Gaussian measures, Proc. Am. Math. Soc., 1989, vol. 105, no. 2, pp. 397–400.


  24. Karimi, H., Nutini, J., and Schmidt, M., Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition, Proc. Eur. Conf. Machine Learning and Knowledge Discovery in Databases (ECML PKDD), Riva del Garda, Italy, 2016, pp. 795–811.

  25. Zorich, V.A., Mathematical Analysis II, Berlin: Springer, 2016.



Funding

The work carried out by A.M. Raigorodskii in Sections 1–3 was supported by a grant for leading scientific schools (grant no. NSh775.2022.1.1), while the work carried out in Sections 4–6 was supported by the Russian Science Foundation (project no. 21-71-30005), https://rscf.ru/project/21-71-30005.

Author information


Correspondence to S. I. Sadykov, A. V. Lobanov or A. M. Raigorodskii.

Ethics declarations

The authors declare that they have no conflicts of interest.

Additional information

Translated by Yu. Kornienko

Appendices

AUXILIARY LEMMAS TO PROVE THEOREM 1

Suppose that \(\kappa_{\beta} = \int |u|^{\beta}\,|K(u)|\,du\) and \(\kappa = \int K^{2}(u)\,du\). When \(K\) is a weighted sum of Legendre polynomials, \(\kappa_{\beta}\) and \(\kappa\) do not depend on \(d\) [11] (see A.3) and depend only on \(\beta\); for \(\beta \geqslant 1\),

$${{\kappa }_{\beta }} \leqslant 2\sqrt 2 (\beta - 1),$$
(A.1)
$$\kappa \leqslant 3\beta^{3}.$$
(A.2)
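The following short Python sketch illustrates these constants numerically. The kernel is built as a weighted sum of Legendre polynomials; the specific weighting by the derivatives of the orthonormal Legendre polynomials at zero is our reading of the construction in [11] and is stated here only as an assumption. The script checks the moment conditions used below (\(\int rK(r)dr = 1\) and the "nullification" of the other moments up to order \(\beta\)) and evaluates \(\kappa_{\beta}\) and \(\kappa\) against (A.1) and (A.2) for \(\beta = 3\).

import numpy as np
from numpy.polynomial import legendre as leg
from scipy.integrate import quad

def kernel_coeffs(ell):
    # Legendre-series coefficients of K(u) = sum_m p_m'(0) p_m(u), where
    # p_m(u) = sqrt((2m + 1)/2) * P_m(u) is orthonormal on [-1, 1] (assumed construction).
    c = np.zeros(ell + 1)
    for m in range(1, ell + 1):
        e_m = np.zeros(m + 1); e_m[m] = 1.0
        dPm_at_0 = leg.legval(0.0, leg.legder(e_m))      # P_m'(0)
        c[m] = (2 * m + 1) / 2.0 * dPm_at_0
    return c

beta = 3                                                 # assumed smoothness order (l = beta)
c = kernel_coeffs(beta)
K = lambda u: leg.legval(u, c)

for j in range(beta + 1):                                # moment conditions of the kernel
    val, _ = quad(lambda r: r**j * K(r), -1, 1)
    print(f"int r^{j} K(r) dr = {val:+.4f}")

kappa_beta, _ = quad(lambda r: abs(r)**beta * abs(K(r)), -1, 1)
kappa, _ = quad(lambda r: K(r)**2, -1, 1)
print(f"kappa_beta = {kappa_beta:.3f}  <= 2*sqrt(2)*(beta - 1) = {2 * 2**0.5 * (beta - 1):.3f}")
print(f"kappa      = {kappa:.3f}  <= 3*beta^3 = {3 * beta**3}")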

First, several key lemmas need to be proved.

Lemma 1 ([24]). If \(f( \cdot )\) is \({{L}_{2}}\)-smooth and satisfies the PL condition with constant \(\mu \), then it also satisfies the bounded error condition with \(\mu \), i.e.,

$${\text{||}}\nabla f(x){\text{||}} \geqslant \mu {\text{||}}{{x}_{p}} - x{\text{||}},\quad \forall x,$$

where xp is the projection of x onto the optimal set, and it also satisfies the quadratic growth condition with \(\mu \), i.e.,

$$f(x) - f\text{*} \geqslant \frac{\mu }{2}{\text{||}}{{x}_{p}} - x{\text{|}}{{{\text{|}}}^{2}},\quad \forall x.$$

Conversely, if \(f( \cdot )\) is \({{L}_{2}}\)-smooth and satisfies the bounded error condition with constant \(\mu \), then it satisfies the PL condition with constant \(\mu {\text{/}}{{L}_{2}}\).

It can be seen from this lemma that \({{L}_{2}} \geqslant \mu \).

Lemma 2 ([21]). In the minimax problem, if \( - f(x, \cdot )\) satisfies the PL condition with constant \({{\mu }_{y}}\) for any \(x\) and f satisfies Assumption 1, then function \(g(x)\,: = \,\mathop {\max }\nolimits_y f(x,y)\) is L-smooth with \(L: = {{L}_{2}} + L_{2}^{2}{\text{/}}{{\mu }_{y}}\) and \(\nabla g(x)\) = \({{\nabla }_{x}}f(x,y\text{*}(x))\) for any \(y\text{*}(x) \in \arg \mathop {\max }\limits_y f(x,y)\).
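As a simple illustration of Lemma 2 (our own toy example, not taken from the paper), consider the strongly-convex–strongly-concave quadratic saddle function

$$f(x,y) = \frac{\mu_x}{2}\|x\|^{2} + x^{\top}Cy - \frac{\mu_y}{2}\|y\|^{2}.$$

Here \(-f(x, \cdot)\) is \(\mu_y\)-strongly convex and therefore satisfies the PL condition with \(\mu_y\). Maximization over \(y\) gives \(y\text{*}(x) = C^{\top}x/\mu_y\) and

$$g(x) = \frac{\mu_x}{2}\|x\|^{2} + \frac{\|C^{\top}x\|^{2}}{2\mu_y},\qquad \nabla g(x) = \mu_x x + \frac{CC^{\top}x}{\mu_y} = \nabla_x f(x, y\text{*}(x)),$$

so \(g\) is \(\left(\mu_x + \|C\|^{2}/\mu_y\right)\)-smooth, which is consistent with the constant \(L = L_2 + L_2^{2}/\mu_y\) of Lemma 2 because \(L_2 \geqslant \max(\mu_x, \|C\|)\) for this \(f\).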

For the next lemma, we need to consider problem \(\mathop {\min }\limits_x f(x)\).

Lemma 3. Suppose that \(\{x_k\}_{k \geqslant 0}\) is the sequence of iterates generated by the mini-batch SGD algorithm on a function \(f(\cdot)\) under Assumptions 1–5. Then, for a step size \(\eta \leqslant \frac{1}{(M + 1)L_2}\), the following holds for all \(N \geqslant 0\):

$$\begin{gathered} \mathbb{E}[f({{x}_{N}})] - f\text{*} \\ \, \leqslant {{(1 - \eta \mu )}^{N}}\left( {f({{x}_{0}}) - f\text{*}} \right) + \frac{{{{\zeta }^{2}}}}{{2\mu }} + \frac{{\eta {{L}_{2}}{{\sigma }^{2}}}}{{2B\mu }}. \\ \end{gathered} $$

where L2 is the Lipschitz constant for the gradient such that \({\text{||}}\nabla f(x) - \nabla f(y){\text{||}} \leqslant {{L}_{2}}{\text{||}}x - y{\text{||}}\).

Proof. Due to the \({{L}_{2}}\)-smoothness of f and the choice of step size \(\eta \leqslant \frac{1}{{(M + 1){{L}_{2}}}}\), we have

$$\begin{gathered} \mathbb{E}[f(x_{k + 1})] \leqslant f(x_k) + \langle \nabla f(x_k),x_{k + 1} - x_k\rangle + \frac{L_2}{2}\|x_{k + 1} - x_k\|^2 \\ \leqslant f(x_k) - \eta \langle \nabla f(x_k),\mathbb{E}[{\mathbf{G}}_k]\rangle + \frac{\eta^2 L_2}{2}\left(\mathbb{E}[\|{\mathbf{G}}_k - \mathbb{E}[{\mathbf{G}}_k]\|^2] + \mathbb{E}[\|\mathbb{E}[{\mathbf{G}}_k]\|^2]\right) \\ \overset{(1)}{=} f(x_k) - \eta \langle \nabla f(x_k),\nabla f(x_k) + {\mathbf{b}}(x_k)\rangle + \frac{\eta^2 L_2}{2}\left(\mathbb{E}[\|{\mathbf{n}}(x_k,\xi)\|^2] + \mathbb{E}[\|\nabla f(x_k) + {\mathbf{b}}(x_k)\|^2]\right) \end{gathered}$$
(A.3)
$$\begin{gathered} \overset{(2)}{\leqslant} f(x_k) - \eta \langle \nabla f(x_k),\nabla f(x_k) + {\mathbf{b}}(x_k)\rangle + \frac{\eta^2 L_2}{2}\left((M + 1)\mathbb{E}[\|\nabla f(x_k) + {\mathbf{b}}(x_k)\|^2] + \sigma^2\right) \\ \leqslant f(x_k) + \frac{\eta}{2}\left(\pm\|\nabla f(x_k)\|^2 - 2\langle \nabla f(x_k),\nabla f(x_k) + {\mathbf{b}}(x_k)\rangle + \|\nabla f(x_k) + {\mathbf{b}}(x_k)\|^2\right) + \frac{\eta^2 L_2}{2}\sigma^2 \\ = f(x_k) + \frac{\eta}{2}\left(-\|\nabla f(x_k)\|^2 + \|{\mathbf{b}}(x_k)\|^2\right) + \frac{\eta^2 L_2}{2}\sigma^2 \overset{(3)}{\leqslant} (1 - \eta\mu)\left(f(x_k) - f\text{*}\right) + \frac{\eta\zeta^2}{2} + \frac{\eta^2 L_2}{2}\sigma^2 + f\text{*}, \end{gathered}$$

where (1) uses Definition 2, (2) uses Assumption 4, and (3) uses Assumption 5 together with the PL condition.

By applying recursion to (A.3) and by adding batching (with batch size B), we arrive at

$$\begin{gathered} \mathbb{E}[f({{x}_{N}})] - f\text{*} \\ \, \leqslant {{(1 - \eta \mu )}^{N}}\left( {f({{x}_{0}}) - f\text{*}} \right) + \frac{{{{\zeta }^{2}}}}{{2\mu }} + \frac{{\eta {{L}_{2}}{{\sigma }^{2}}}}{{2B\mu }}. \\ \end{gathered} $$
(A.4)

\(\square \)
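As a numerical illustration of Lemma 3, the following minimal Python sketch (our own toy setup, not part of the paper) runs mini-batch SGD with a biased and noisy gradient oracle on a quadratic objective, which satisfies the PL condition with \(\mu\) equal to the smallest eigenvalue of \(A\), and compares the final value with the right-hand side of (A.4). The constant bias vector and the Gaussian noise model are illustrative assumptions with \(\|{\mathbf{b}}\| = \zeta\), \(\mathbb{E}\|{\mathbf{n}}\|^2 = \sigma^2\), and \(M = 0\).

import numpy as np

rng = np.random.default_rng(0)
d, B, N = 20, 16, 400
A = np.diag(np.linspace(0.5, 5.0, d))          # f(x) = 0.5 x^T A x, so f* = 0
mu, L2 = 0.5, 5.0                              # PL and smoothness constants of f
zeta, sigma = 1e-2, 1.0                        # bias bound and noise level
bias = zeta * np.ones(d) / np.sqrt(d)          # ||bias|| = zeta

def f(x):
    return 0.5 * x @ A @ x

def oracle(x):
    # one biased stochastic gradient: grad f(x) + b(x) + n(x, xi), with E||n||^2 = sigma^2
    return A @ x + bias + sigma * rng.standard_normal(d) / np.sqrt(d)

eta = 1.0 / L2                                 # step size <= 1/((M + 1) L2) with M = 0
x = rng.standard_normal(d)
f0 = f(x)
for k in range(N):
    g = np.mean([oracle(x) for _ in range(B)], axis=0)     # mini-batch of size B
    x = x - eta * g

# (A.4) bounds the expectation; a single run should lie well below it here.
bound = (1 - eta * mu) ** N * f0 + zeta**2 / (2 * mu) + eta * L2 * sigma**2 / (2 * B * mu)
print(f"f(x_N) = {f(x):.2e}   vs   right-hand side of (A.4) = {bound:.2e}")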

Theorem 4. Suppose that Assumptions 1–5 hold and \(f(x,y)\) satisfies the two-sided PL condition with \({{\mu }_{x}}\) and \({{\mu }_{y}}\). If we run one iteration of Algorithm 1 with \(\tau _{x}^{t} = {{\tau }_{x}} \leqslant \frac{1}{{(M + 1)L}}\) (L is specified by Lemma 2) and \(\tau _{y}^{t} = {{\tau }_{y}} \leqslant \frac{1}{{(M + 1){{L}_{2}}}}\), then

$$\begin{gathered} {{a}_{{t + 1}}} + \lambda {{b}_{{t + 1}}} \leqslant \max \{ {{k}_{1}},{{k}_{2}}\} ({{a}_{t}} + \lambda {{b}_{t}}) \\ \, + \lambda \left( {\tau _{y}^{2}{{L}_{2}}\frac{{{{\sigma }^{2}}}}{B} + {{\tau }_{y}}{{\zeta }^{2}}} \right), \\ \end{gathered} $$

where

$${{k}_{1}}: = 1 - {{\mu }_{x}}{{\tau }_{x}}[1 + \lambda (1 - {{\mu }_{y}}{{\tau }_{y}})],$$
(A.5)
$${{k}_{2}}: = 1 + \frac{{L_{2}^{2}{{\tau }_{x}}}}{{{{\mu }_{y}}\lambda }} - {{\mu }_{y}}{{\tau }_{y}} + {{\sigma }^{2}}\frac{{L_{2}^{2}}}{{{{\mu }_{y}}}}{{\tau }_{x}} - {{\tau }_{x}}{{\tau }_{y}}L_{2}^{2}{{\sigma }^{2}}.$$
(A.6)

Proof. With g being L-smooth in accordance with Lemma 2, by choosing a step size such that \({{\tau }_{x}}\, \leqslant \,\frac{1}{{(M\, + \,1)L}}\), we have

$$\begin{gathered} \mathbb{E}[g(x_{k + 1})] \leqslant g(x_k) + \langle \nabla g(x_k),x_{k + 1} - x_k\rangle + \frac{L}{2}\|x_{k + 1} - x_k\|^2 \\ \leqslant g(x_k) - \tau_x\langle \nabla g(x_k),\mathbb{E}[{\mathbf{G}}_k]\rangle + \frac{\tau_x^2 L}{2}\left(\mathbb{E}[\|{\mathbf{G}}_k - \mathbb{E}[{\mathbf{G}}_k]\|^2] + \mathbb{E}[\|\mathbb{E}[{\mathbf{G}}_k]\|^2]\right) \\ \overset{(1)}{=} g(x_k) - \tau_x\langle \nabla g(x_k),\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k)\rangle + \frac{\tau_x^2 L}{2}\left(\mathbb{E}[\|{\mathbf{n}}(x_k,y_k,\xi)\|^2] + \mathbb{E}[\|\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k)\|^2]\right) \\ \overset{(2)}{\leqslant} g(x_k) - \tau_x\langle \nabla g(x_k),\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k)\rangle + \frac{\tau_x^2 L}{2}\left((M + 1)\mathbb{E}[\|\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k)\|^2] + \sigma^2\right) \end{gathered}$$
(A.7)
$$\begin{gathered} \leqslant g(x_k) + \frac{\tau_x}{2}\left(\pm\|\nabla g(x_k)\|^2 - 2\langle \nabla g(x_k),\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k)\rangle + \|\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k)\|^2\right) + \frac{\tau_x^2 L}{2}\sigma^2 \\ = g(x_k) + \frac{\tau_x}{2}\left(-\|\nabla g(x_k)\|^2 + \|\nabla_x f(x_k,y_k) + {\mathbf{b}}_x(x_k,y_k) - \nabla g(x_k)\|^2\right) + \frac{\tau_x^2 L}{2}\sigma^2, \end{gathered}$$

where (1) uses Definition 2 and (2) uses Assumption 4.

Now, it is sufficient to express \(\|\nabla g(x_t)\|^{2}\) and \(\|\nabla_x f(x_t,y_t) - \nabla g(x_t)\|^{2}\) in terms of \(a_t\) and \(b_t\). Using Lemma 2, we have

$$\begin{gathered} {\text{||}}{{\nabla }_{x}}f({{x}_{t}},{{y}_{t}}) - \nabla g({{x}_{t}}){\text{|}}{{{\text{|}}}^{2}} \\ \, = {\text{||}}{{\nabla }_{x}}f({{x}_{t}},{{y}_{t}}) - {{\nabla }_{x}}f({{x}_{t}},y\text{*}({{x}_{t}})){\text{|}}{{{\text{|}}}^{2}} \\ \, \leqslant L_{2}^{2}{\text{||}}y\text{*}({{x}_{t}}) - {{y}_{t}}{\text{|}}{{{\text{|}}}^{2}} \\ \end{gathered} $$
(A.8)

for any \(y\text{*}(x_t) \in \arg\max_y f(x_t,y)\). Now, we can fix \(y\text{*}(x_t)\) as the projection of \(y_t\) onto the set \(\arg\max_y f(x_t,y)\). Since \(-f(x_t, \cdot)\) satisfies the PL condition with \(\mu_y\), Lemma 1 implies that it also satisfies the quadratic growth condition with \(\mu_y\), i.e.,

$${\text{||}}y\text{*}({{x}_{t}}) - {{y}_{t}}{\text{|}}{{{\text{|}}}^{2}} \leqslant \frac{2}{{{{\mu }_{y}}}}[g({{x}_{t}}) - f({{x}_{t}},{{y}_{t}})],$$
(A.9)

taking into account (A.8), we obtain

$${\text{||}}{{\nabla }_{x}}f({{x}_{t}},{{y}_{t}}) - \nabla g({{x}_{t}}){\text{|}}{{{\text{|}}}^{2}} \leqslant \frac{{2L_{2}^{2}}}{{{{\mu }_{y}}}}[g({{x}_{t}}) - f({{x}_{t}},{{y}_{t}})].$$
(A.10)

Since g satisfies the PL condition with \({{\mu }_{x}}\),

$${\text{||}}\nabla g({{x}_{t}}){\text{|}}{{{\text{|}}}^{2}} \geqslant 2{{\mu }_{x}}[g({{x}_{t}}) - g\text{*}].$$
(A.11)

By taking the expectation of both sides of (A.7) and substituting (A.10) and (A.11), we obtain

$${{a}_{{t + 1}}} \leqslant (1 - {{\tau }_{x}}{{\mu }_{x}}){{a}_{t}} + {{\tau }_{x}}\frac{{L_{2}^{2}}}{{{{\mu }_{y}}}}{{b}_{t}} + \frac{{{{\tau }_{x}}}}{2}{\text{||}}{{{\mathbf{b}}}_{x}}{\text{|}}{{{\text{|}}}^{2}}.$$
(A.12)

Since \( - f({{x}_{{t + 1}}},y)\) is L2-smooth and \({{\mu }_{y}}\)-PL, based on inequality (A.3) from Lemma 3 for \({{\tau }_{y}} \leqslant \frac{1}{{(M + 1){{L}_{2}}}}\), we have

$$\begin{gathered} \mathbb{E}[g(x_{t + 1}) - f(x_{t + 1},y_{t + 1})] \leqslant (1 - \mu_y\tau_y)\mathbb{E}[g(x_{t + 1}) - f(x_{t + 1},y_t)] + \frac{\tau_y\zeta^2}{2} + \frac{\tau_y^2 L_2}{2}\sigma^2 \\ \leqslant (1 - \mu_y\tau_y)\mathbb{E}[g(x_t) - f(x_t,y_t) + f(x_t,y_t) - f(x_{t + 1},y_t) + g(x_{t + 1}) - g(x_t)] + \frac{\tau_y\zeta^2}{2} + \frac{\tau_y^2 L_2}{2}\sigma^2. \end{gathered}$$
(A.13)

Using Lemma 3, we can bound \(f(x_t,y_t) - f(x_{t + 1},y_t)\) as follows:

$$f({{x}_{t}},{{y}_{t}}) - f({{x}_{{t + 1}}},{{y}_{t}}) \leqslant \frac{{{{\tau }_{x}}}}{2}{{\zeta }^{2}} + \frac{{\tau _{x}^{2}{{L}_{2}}}}{2}{{\sigma }^{2}}.$$
(A.14)

In addition, from (A.12), we have

$$\mathbb{E}[g({{x}_{{t + 1}}}) - g({{x}_{t}})] \leqslant - {{\tau }_{x}}{{\mu }_{x}}{{a}_{t}} + \frac{{{{\tau }_{x}}L_{2}^{2}}}{{{{\mu }_{y}}}}{{b}_{t}} + \frac{{{{\tau }_{x}}}}{2}{{\zeta }^{2}}.$$
(A.15)

By combining (A.13), (A.14), and (A.15), we obtain

$$\begin{gathered} \mathbb{E}[g(x_{t + 1}) - f(x_{t + 1},y_{t + 1})] \leqslant (1 - \mu_y\tau_y)\left(-\tau_x\mu_x a_t + \left(1 + \frac{\tau_x L_2^2}{\mu_y}\sigma^2\right)b_t\right) + (1 - \mu_y\tau_y)\left(\tau_x\zeta^2 + \frac{\tau_x^2 L_2}{2}\sigma^2\right) + \frac{\tau_y^2 L_2}{2}\sigma^2 + \frac{\tau_y\zeta^2}{2} \\ \leqslant (1 - \mu_y\tau_y)\left(-\tau_x\mu_x a_t + \left(1 + \frac{\tau_x L_2^2}{\mu_y}\sigma^2\right)b_t\right) + \tau_y^2 L_2\sigma^2 + \frac{3}{4}\tau_y\zeta^2. \end{gathered}$$
(A.16)

Here, the last inequality takes into account that \({{\tau }_{x}}\) is smaller than \({{\tau }_{y}}\). We can even assume that \({{\tau }_{x}} \leqslant \frac{\lambda }{2}{{\tau }_{y}}\). By combining (A.12) and (A.16), \(\forall \lambda > 0\), we obtain

$$\begin{gathered} {{a}_{{t + 1}}} + \lambda {{b}_{{t + 1}}} \leqslant {{a}_{t}}\left[ {1 - {{\mu }_{x}}{{\tau }_{x}} - \lambda (1 - {{\mu }_{y}}{{\tau }_{y}}){{\mu }_{x}}{{\tau }_{x}}} \right] \\ \, + \lambda {{b}_{t}}\left[ {1 + \frac{{L_{2}^{2}{{\tau }_{x}}}}{{{{\mu }_{y}}\lambda }} - {{\mu }_{y}}{{\tau }_{y}} + \frac{{{{\tau }_{x}}L_{2}^{2}}}{{{{\mu }_{y}}}}{{\sigma }^{2}} - {{\tau }_{x}}{{\tau }_{y}}L_{2}^{2}{{\sigma }^{2}}} \right] \\ \, + \lambda \left( {\tau _{y}^{2}{{L}_{2}}{{\sigma }^{2}} + {{\tau }_{y}}{{\zeta }^{2}}} \right). \\ \end{gathered} $$

By adding batching (with batch size B), we arrive at

$$\begin{gathered} a_{t + 1} + \lambda b_{t + 1} \leqslant a_t\left[1 - \mu_x\tau_x - \lambda(1 - \mu_y\tau_y)\mu_x\tau_x\right] \\ + \lambda b_t\left[1 + \frac{L_2^2\tau_x}{\mu_y\lambda} - \mu_y\tau_y + \frac{\tau_x L_2^2}{\mu_y}\frac{\sigma^2}{B} - \tau_x\tau_y L_2^2\frac{\sigma^2}{B}\right] \\ + \lambda\left(\tau_y^2 L_2\frac{\sigma^2}{B} + \tau_y\zeta^2\right). \end{gathered}$$
(A.17)

\(\square \)

Proof of Theorem 1.

Proof. Under the conditions of Theorem 4 with \(\tau_x^t = \tau_x\) and \(\tau_y^t = \tau_y\) for all \(t\), we only need to select \(\tau_x\), \(\tau_y\), and \(\lambda\) so that \(k_1, k_2 < 1\). We first set \(\lambda = 1/10\). Then,

$${{k}_{1}} = 1 - {{\mu }_{x}}[{{\tau }_{x}} + \lambda (1 - {{\mu }_{y}}{{\tau }_{y}}){{\tau }_{x}}] \leqslant 1 - {{\tau }_{x}}{{\mu }_{x}}.$$
(A.18)

In addition,

$$\begin{gathered} {{k}_{2}} = 1 + \frac{{{{\tau }_{x}}L_{2}^{2}}}{{{{\mu }_{y}}\lambda }} - {{\mu }_{y}}{{\tau }_{y}} + \frac{{{{\tau }_{x}}L_{2}^{2}}}{{{{\mu }_{y}}}}\frac{{{{\sigma }^{2}}}}{B} - {{\tau }_{x}}{{\tau }_{y}}L_{2}^{2}\frac{{{{\sigma }^{2}}}}{B} \\ \, = 1\, - \,\frac{{{{\tau }_{x}}L_{2}^{2}}}{{{{\mu }_{y}}}}\left\{ {\frac{{\mu _{y}^{2}{{\tau }_{y}}}}{{{{\tau }_{x}}L_{2}^{2}}}\, - \,\frac{1}{\lambda }\, - \,\frac{{{{\sigma }^{2}}}}{B}(1\, - \,{{\mu }_{y}}{{\tau }_{y}})} \right\}\, \leqslant \,1\, - \,\frac{{{{\tau }_{x}}L_{2}^{2}}}{{{{\mu }_{y}}}}, \\ \end{gathered} $$
(A.19)

where, in the last inequality, \(\lambda\) is substituted and the bound \(\frac{\mu_y^2\tau_y}{\tau_x L_2^2} \geqslant 12\) is used, which holds due to the choice of \(\tau_x\). By choosing a large \(B\) on the order of \(d^2\), we can make \(\frac{\sigma^2}{B} \leqslant 1\). Note that \(\tau_x\mu_x < \frac{L_2^2\tau_x}{\mu_y}\) because \(\left(\tau_x\mu_x\right)/\left(\frac{L_2^2\tau_x}{\mu_y}\right) = \frac{\mu_x\mu_y}{L_2^2} < 1\). Define \(P_t := a_t + \frac{1}{10}b_t\); then, by Theorem 4,

$${{P}_{{t + 1}}} \leqslant \left( {1 - {{\tau }_{x}}{{\mu }_{x}}} \right){{P}_{t}} + \frac{1}{{10}}\left( {\tau _{y}^{2}{{L}_{2}}\frac{{{{\sigma }^{2}}}}{B} + {{\tau }_{y}}{{\zeta }^{2}}} \right).$$

By simple calculations, we obtain

$${{P}_{t}} \leqslant {{(1 - {{\mu }_{x}}{{\tau }_{x}})}^{t}}{{P}_{0}} + \frac{{\tau _{y}^{2}{{L}_{2}}\frac{{{{\sigma }^{2}}}}{B} + {{\tau }_{y}}{{\zeta }^{2}}}}{{10{{\mu }_{x}}{{\tau }_{x}}}}.$$
(A.20)

The condition \(\tau_x \leqslant \frac{1}{(M + 1)L}\) is verified as follows: \(\tau_x \leqslant \frac{\mu_y^2\tau_y}{12L_2^2} \leqslant \frac{\mu_y^2}{12(M + 1)L_2^3} \leqslant \frac{\mu_y}{2(M + 1)L_2^2}\) and \(L = L_2 + \frac{L_2^2}{\mu_y} \leqslant \frac{2L_2^2}{\mu_y}\), so that indeed \(\tau_x \leqslant \frac{1}{(M + 1)L}\).

\(\square \)
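Algorithm 1 and its gradient approximations are specified in the main text; the following is only a minimal Python sketch of a two-step-size stochastic gradient descent–ascent loop of the kind analysed in Theorem 4, run with exact stochastic first-order oracles on our own toy strongly-convex–strongly-concave quadratic saddle function (which, in particular, satisfies the two-sided PL condition). All names and constants in the snippet are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
dx, dy, B, sigma = 10, 10, 8, 0.1
mu_x, mu_y = 1.0, 1.0
C = 0.3 * rng.standard_normal((dx, dy))
# f(x, y) = 0.5*mu_x*||x||^2 + x^T C y - 0.5*mu_y*||y||^2, saddle point at (0, 0)

def grad_x(x, y): return mu_x * x + C @ y
def grad_y(x, y): return C.T @ x - mu_y * y

L2 = max(mu_x, mu_y) + np.linalg.norm(C, 2)    # smoothness constant of f
L = L2 + L2**2 / mu_y                          # smoothness of g(x) = max_y f(x, y), Lemma 2
tau_x, tau_y = 1.0 / L, 1.0 / L2               # step sizes as in Theorem 4 (M = 0 here)

x, y = rng.standard_normal(dx), rng.standard_normal(dy)
for t in range(2000):
    gx = np.mean([grad_x(x, y) + sigma * rng.standard_normal(dx) for _ in range(B)], axis=0)
    x = x - tau_x * gx                         # descent step in x
    gy = np.mean([grad_y(x, y) + sigma * rng.standard_normal(dy) for _ in range(B)], axis=0)
    y = y + tau_y * gy                         # ascent step in y

print("||x_T||, ||y_T||:", np.linalg.norm(x), np.linalg.norm(y), "(both close to 0)")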

PROOFS FOR ZERO-ORDER METHODS

Here, we prove lemmas for different variants of the problem. In the following lemmas, we do not confine ourselves to the saddle point problem and focus on the kernel approximation of the gradient; for this reason, we consider the problem \(\min_{x \in \mathbb{R}^{d}} f(x)\) in what follows.

Lemma 5 (reduction of an integral over a domain to an integral over its surface). Suppose that \(D\) is an open connected subset of \(\mathbb{R}^{d}\) with piecewise-smooth boundary \(\partial D\) that is oriented along the outer unit normal \({\mathbf{n}} = (n_1, \ldots, n_d)^{\top}\). Suppose also that \(f\) is a smooth function on \(D \cup \partial D\); then

$$\int\limits_D {\nabla f(x)dx} = \int\limits_{\partial D} f (x){\mathbf{n}}(x)dS(x).$$

Remark 4. For the definition of piecewise-smooth surfaces and their orientations, we refer to [25], Section 12.3.2, Definitions 4 and 5, respectively.
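For instance, in the one-dimensional case \(D = (a,b)\), the boundary consists of the two points \(a\) and \(b\) with outer normals \(-1\) and \(+1\), and the lemma reduces to the Newton–Leibniz formula:

$$\int\limits_a^b f'(x)\,dx = f(b) \cdot 1 + f(a) \cdot (-1) = f(b) - f(a).$$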

Lemma 6. Suppose that \(f:{{\mathbb{R}}^{d}} \to \mathbb{R}\) is a continuously differentiable function. Suppose also that \(r,{\mathbf{h}},{\mathbf{e}}\) are uniformly distributed over \([ - 1,1],\mathcal{B}_{2}^{d}\), and \({{\mathcal{S}}^{d}}\), respectively. Then, for any \(\gamma > 0\), we have

$$\mathbb{E}[\nabla f(x + \gamma r{\mathbf{h}})rK(r)] = \frac{d}{\gamma }\mathbb{E}[f(x + \gamma r{\mathbf{e}}){\mathbf{e}}K(r)].$$

Proof. Let us fix \(r \in [ - 1,1]{{\backslash }}\{ 0\} \). We define \(\phi :{{\mathbb{R}}^{d}} \to \mathbb{R}\) as \(\phi ({\mathbf{h}}) = f(x + \gamma r{\mathbf{h}})K(r)\) and note that \(\nabla \phi ({\mathbf{h}}) = \gamma r\nabla f(x + \gamma r{\mathbf{h}})K(r)\). Hence, we have

$$\mathbb{E}[\nabla f(x + \gamma r{\mathbf{h}})K(r)\,|\,r] = \frac{1}{\gamma r}\mathbb{E}[\nabla \phi({\mathbf{h}})\,|\,r] = \frac{d}{\gamma r}\mathbb{E}[\phi({\mathbf{e}}){\mathbf{e}}\,|\,r] = \frac{d}{\gamma r}K(r)\mathbb{E}[f(x + \gamma r{\mathbf{e}}){\mathbf{e}}\,|\,r],$$

where the second equality follows from Lemma 5. The proof is completed by multiplying both sides by \(r\) (note that \(r \ne 0\) almost surely because \(r\) has a continuous distribution) and taking the full expectation.

\(\square \)
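The identity of Lemma 6 is easy to verify by Monte Carlo simulation. In the Python sketch below, the test function \(f\), the kernel \(K(r) = 3r/2\), and all sampling choices are our own illustrative assumptions; the two printed vectors agree up to Monte Carlo error.

import numpy as np

rng = np.random.default_rng(2)
d, gamma, n = 3, 0.5, 500_000
x = np.array([0.3, -0.7, 1.1])
a = np.array([1.0, 2.0, -0.5])

f = lambda z: np.sin(z @ a) + 0.5 * np.sum(z * z, axis=-1)
grad_f = lambda z: np.cos(z @ a)[..., None] * a + z
K = lambda r: 1.5 * r                        # any integrable kernel works for the identity

r = rng.uniform(-1.0, 1.0, size=n)
g = rng.standard_normal((n, d))
e = g / np.linalg.norm(g, axis=1, keepdims=True)           # uniform on the unit sphere
h = e * rng.uniform(0.0, 1.0, size=(n, 1)) ** (1.0 / d)    # uniform in the unit ball

lhs = np.mean(grad_f(x + gamma * r[:, None] * h) * (r * K(r))[:, None], axis=0)
rhs = d / gamma * np.mean(f(x + gamma * r[:, None] * e)[:, None] * e * K(r)[:, None], axis=0)
print("E[grad f(x + gamma r h) r K(r)] :", np.round(lhs, 2))
print("(d/gamma) E[f(x + gamma r e) e K(r)] :", np.round(rhs, 2))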

PROOF OF THEOREM 2

Lemma 7 (kernel approximation bias). Suppose that Assumptions 1–3 hold. Suppose also that \({{x}_{t}}\) and \({\mathbf{G}}({{x}_{t}},{\mathbf{e}})\) are determined by Algorithm 1 at instant \(t \geqslant 1\) with gradient approximation (4.2) for zero-order oracle (4.1). Then,

$$\begin{gathered} {\text{||}}\mathbb{E}[{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}})\,{\text{|}}\,{{x}_{t}}] - \nabla f({{x}_{t}}){\text{||}} \\ \, \leqslant {{\kappa }_{\beta }}\frac{{{{L}_{\beta }}}}{{(l - 1)!}} \cdot \frac{d}{{d + \beta - 1}}{{\gamma }^{{\beta - 1}}} + {{\kappa }_{\beta }}d\frac{\Delta }{\gamma }, \\ \end{gathered} $$
(B.1)

where we recall that \(l=\beta \).

Proof. Using Lemma 6, the fact that \(\int_{ - 1}^1 rK(r)dr\) = 1, and the variational representation of the Euclidean norm, we can write

$${\text{||}}\mathbb{E}[{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}})\,{\text{|}}\,{{x}_{t}}] - \nabla f({{x}_{t}}){\text{||}}$$
$$\begin{gathered} \, = \mathop {\sup }\limits_{{\mathbf{v}} \in {{\mathcal{S}}^{d}}} \mathbb{E}[({{\nabla }_{{\mathbf{v}}}}f(x + \gamma r{\mathbf{h}},\xi ) - {{\nabla }_{{\mathbf{v}}}}f(x,\xi ) \\ \, + \frac{d}{{2\gamma r}}(\delta (x + \gamma r{\mathbf{h}}) - \delta (x - \gamma r{\mathbf{h}})))rK(r)] \\ \end{gathered} $$
(B.2)
$$\, \leqslant \mathop {\sup }\limits_{{\mathbf{v}} \in {{\mathcal{S}}^{d}}} \mathbb{E}[({{\nabla }_{{\mathbf{v}}}}f(x + \gamma r{\mathbf{h}}) - {{\nabla }_{{\mathbf{v}}}}f(x))rK(r)] + {{\kappa }_{\beta }}d\frac{\Delta }{\gamma },$$

where h is uniformly distributed over \(\mathcal{B}_{2}^{d}\). Since \(f(x)\) satisfies the Hölder condition with constants \(\beta \) and \({{L}_{\beta }}\), for any \({\mathbf{v}} \in {{\mathcal{S}}^{d}}\), directed gradient \({{\nabla }_{{\mathbf{v}}}}f( \cdot )\) satisfies the Hölder condition with constants \(\beta - 1\) and \({{L}_{\beta }}\). Thus, the following Taylor expansion holds:

$$\begin{gathered} {{\nabla }_{{\mathbf{v}}}}f({{x}_{t}} + \gamma r{\mathbf{h}}) = {{\nabla }_{{\mathbf{v}}}}f({{x}_{t}}) \\ \, + \sum\limits_{1 \leqslant |{\mathbf{m}}| \leqslant l - 1} \frac{{{{{(r\gamma )}}^{{|{\mathbf{m}}|}}}}}{{{\mathbf{m}}!}}{{D}^{{\mathbf{m}}}}{{\nabla }_{{\mathbf{v}}}}f({{x}_{t}})({\mathbf{h}}{{)}^{{\mathbf{m}}}} + R(\gamma r{\mathbf{h}}), \\ \end{gathered} $$
(B.3)

where remainder term \(R( \cdot )\) satisfies condition \({\text{|}}R(x){\text{|}} \leqslant \frac{{{{L}_{\beta }}}}{{(l - 1)!}}{\text{||}}x{\text{|}}{{{\text{|}}}^{{\beta - 1}}}\).

Substituting equation (B.3) into equation (B.2) and using the “nullification” properties of kernel K, we arrive at

$$\begin{gathered} \|\mathbb{E}[{\mathbf{G}}(x_t,\xi,{\mathbf{e}})\,|\,x_t] - \nabla f(x_t)\| \leqslant \kappa_\beta\gamma^{\beta - 1}\frac{L_\beta}{(l - 1)!}\mathbb{E}\|{\mathbf{h}}\|^{\beta - 1} + \kappa_\beta d\frac{\Delta}{\gamma} \\ = \kappa_\beta\gamma^{\beta - 1}\frac{L_\beta}{(l - 1)!}\frac{d}{d + \beta - 1} + \kappa_\beta d\frac{\Delta}{\gamma}, \end{gathered}$$

where the last equality is obtained from the fact that \(\mathbb{E}{\text{||}}{\mathbf{h}}{\text{|}}{{{\text{|}}}^{q}} = \frac{d}{{d + q}}\) for any \(q \geqslant 0\).

\(\square \)

Lemma 8 (kernel approximation variance). Suppose that Assumptions 1–3 hold and that \(x_t\) and \({\mathbf{G}}(x_t,\xi,{\mathbf{e}})\) are determined by Algorithm 1 with gradient approximation (4.2) for zero-order oracle (4.1). Suppose also that \(f \in \mathcal{F}_2(L_2)\); then, for \(d \geqslant 2\),

$$\mathbb{E}\|{\mathbf{G}}(x_t,\xi,{\mathbf{e}})\|^2 \leqslant \frac{d^2\kappa}{d - 1}\mathbb{E}\left[\left(\|\nabla f(x_t)\| + L_2\gamma\right)^2\right] + \frac{d^2\Delta^2\kappa}{\gamma^2},$$

where we recall that \(\kappa = \int_{ - 1}^1 {{K}^{2}}(r)dr\).

The result of Lemma 8 can be further simplified as follows:

$$\begin{gathered} \mathbb{E}{\text{||}}{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}}){\text{|}}{{{\text{|}}}^{2}} \leqslant 4d\kappa \mathbb{E}{\text{||}}\nabla f({{x}_{t}}){\text{|}}{{{\text{|}}}^{2}} \\ \, + 4d\kappa L_{2}^{2}{{\gamma }^{2}} + \frac{{{{d}^{2}}{{\Delta }^{2}}\kappa }}{{{{\gamma }^{2}}}},\quad d \geqslant 2. \\ \end{gathered} $$
(B.4)
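Indeed, for \(d \geqslant 2\) we have \(\frac{d^2}{d - 1} \leqslant 2d\); combining this with the elementary inequality \((a + b)^2 \leqslant 2a^2 + 2b^2\) applied to the bound of Lemma 8 gives

$$\frac{d^2\kappa}{d - 1}\mathbb{E}\left[\left(\|\nabla f(x_t)\| + L_2\gamma\right)^2\right] + \frac{d^2\Delta^2\kappa}{\gamma^2} \leqslant 2d\kappa\,\mathbb{E}\left[2\|\nabla f(x_t)\|^2 + 2L_2^2\gamma^2\right] + \frac{d^2\Delta^2\kappa}{\gamma^2} = 4d\kappa\,\mathbb{E}\|\nabla f(x_t)\|^2 + 4d\kappa L_2^2\gamma^2 + \frac{d^2\Delta^2\kappa}{\gamma^2},$$

which is exactly (B.4).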

Proof. For simplicity, we omit subscript \(t\) for all quantities. Let us write the second moment of the following quantity:

$$\begin{gathered} \mathbb{E}\|{\mathbf{G}}(x,\xi,{\mathbf{e}})\|^2 = \frac{d^2}{4\gamma^2}\mathbb{E}\left[\left(f(x + \gamma r{\mathbf{e}},\xi) - f(x - \gamma r{\mathbf{e}},\xi) + \delta(x + \gamma r{\mathbf{e}}) - \delta(x - \gamma r{\mathbf{e}})\right)^2 K^2(r)\right] \\ \leqslant \frac{d^2}{4\gamma^2}\left(\mathbb{E}\left[\left(f(x + \gamma r{\mathbf{e}},\xi) - f(x - \gamma r{\mathbf{e}},\xi)\right)^2 K^2(r)\right] + 4\kappa\Delta^2\right). \end{gathered}$$
(B.5)

Below, all expectations are understood conditionally on \(x_t\). It should be noted that, since \(\mathbb{E}[f(x + hr{\mathbf{e}}) - f(x - hr{\mathbf{e}})\,|\,r] = 0\) and \(f \in \mathcal{F}_2(L_2)\), the Wirtinger–Poincaré inequality [22, 23] (see Eq. (3.1) and Theorem 2 therein) leads to

$$\begin{gathered} \mathbb{E}\left[ {{{{(f(x + hr{\mathbf{e}}) - f(x - hr{\mathbf{e}}))}}^{2}}\,{\text{|}}\,r} \right] \\ \, \leqslant \frac{{{{h}^{2}}}}{{d - 1}}\mathbb{E}\left[ {{\text{||}}\nabla f(x + hr{\mathbf{e}}) + \nabla f(x - hr{\mathbf{e}}){\text{|}}{{{\text{|}}}^{2}}\,{\text{|}}\,r} \right]. \\ \end{gathered} $$
(B.6)

Since \(f \in {{\mathcal{F}}_{2}}({{L}_{2}})\), the triangle inequality implies that

$$\begin{gathered} \mathbb{E}\left[ {{\text{||}}\nabla f(x + hr{\mathbf{e}}) + \nabla f(x - hr{\mathbf{e}}){\text{|}}{{{\text{|}}}^{2}}\,{\text{|}}\,r} \right] \\ \, \leqslant 4({\text{||}}\nabla f(x){\text{||}} + {{L}_{2}}\gamma {{)}^{2}}. \\ \end{gathered} $$
(B.7)

Finally, we substitute the above estimate into (B.6) and take (B.5) into account.

\(\square \)

We can now compute the noise and bias of the kernel approximation:

$$M = 4d\beta^3,\quad \sigma^2 = 4d\beta^3 L_2\gamma^2 + \frac{d^2\Delta^2\beta^3}{\gamma^2},$$
(B.8)
$${{\zeta }^{2}} = {{\beta }^{2}}{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}\frac{d}{{d + \beta - 1}}{{\gamma }^{{\beta - 1}}} + d\frac{\Delta }{\gamma }} \right)}^{2}}.$$
(B.9)

A rougher estimate for the bias is

$${{\zeta }^{2}} = {{\beta }^{2}}{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}} \right)}^{2}}{{\gamma }^{{2\beta - 2}}} + {{\beta }^{2}}{{d}^{2}}\frac{{{{\Delta }^{2}}}}{{{{\gamma }^{2}}}}.$$

Now, we can estimate the convergence rate for the kernel approximation by substituting the found constants into the final convergence estimate:

$${{P}_{t}} \leqslant {{(1 - {{\mu }_{x}}{{\tau }_{x}})}^{t}}{{P}_{0}} + \frac{{\tau _{y}^{2}{{L}_{2}}\frac{{{{\sigma }^{2}}}}{B} + {{\tau }_{y}}{{\zeta }^{2}}}}{{10{{\mu }_{x}}{{\tau }_{x}}}} = (1 - {{\mu }_{x}}{{\tau }_{x}}{{)}^{t}}{{P}_{0}}$$
$$\begin{gathered} \, + \frac{{12}}{{5B}}\frac{{L_{2}^{3}d{{\gamma }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}} + \frac{3}{{5B}}\frac{{L_{2}^{2}{{d}^{2}}{{\Delta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}{{\gamma }^{2}}}} \\ \, + \frac{{12}}{5}\frac{{L_{2}^{2}{{\beta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}}{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}} \right)}^{2}}{{\gamma }^{{2\beta - 2}}} + \frac{{12}}{5}\frac{{L_{2}^{2}{{\beta }^{2}}{{d}^{2}}{{\Delta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}{{\gamma }^{2}}}} \\ \end{gathered} $$
(B.10)
$$\, = \mathcal{O}\left( {\frac{{L_{2}^{2}d{{\gamma }^{2}}}}{{B{{\mu }_{x}}\mu _{y}^{2}}} + \frac{{L_{2}^{2}{{\beta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}}{{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}} \right)}}^{2}}{{\gamma }^{{2\beta - 2}}} + \frac{{L_{2}^{2}{{\beta }^{2}}{{d}^{2}}{{\Delta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}{{\gamma }^{2}}}}} \right).$$

Here, we substitute the values for \({{\tau }_{y}} = \frac{1}{{(M + 1){{L}_{2}}}}\) and \({{\tau }_{x}} = \frac{{\mu _{y}^{2}{{\tau }_{y}}}}{{12L_{2}^{2}}}\).

Since \(B\) can be chosen large, the second and third terms determine the asymptotic accuracy. We now find the smoothing parameter γ that minimizes the sum of the last two terms:

$$\begin{gathered} {{P}_{t}} = \mathcal{O}\left( {\frac{{L_{2}^{2}{{\beta }^{2}}{{d}^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}}{{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}} \right)}}^{{\frac{2}{\beta }}}}{{{\left( {\frac{{\beta - 1}}{{d + \beta - 1}}} \right)}}^{{\frac{2}{\beta }}}}{{\Delta }^{{\frac{{2(\beta - 1)}}{\beta }}}}} \right) \\ \, = \mathcal{O}\left( {\frac{1}{{{{\mu }_{x}}\mu _{y}^{2}}}{{d}^{{\frac{{2(\beta - 1)}}{\beta }}}}{{\Delta }^{{\frac{{2(\beta - 1)}}{\beta }}}}} \right) \\ \end{gathered} $$
(B.11)

where \({{\gamma }_{k}} = {{\left( {\frac{{(l - 1)!}}{{{{L}_{\beta }}}}\frac{{d + \beta - 1}}{{\beta - 1}}\Delta } \right)}^{{1{\text{/}}\beta }}}\) is the optimal smoothing parameter. Then, from (B.11), we can find the maximum noise level while assuming that \({{(d\Delta )}^{{\frac{{2(\beta - 1)}}{\beta }}}}\, \leqslant \,\varepsilon \) for \(\varepsilon > 0\). Thus, we have

$$\Delta = \mathcal{O}\left( {{{{({{\mu }_{x}}\mu _{y}^{2})}}^{{\frac{\beta }{{2(\beta - 1)}}}}}{{\varepsilon }^{{\frac{\beta }{{2(\beta - 1)}}}}}{{d}^{{ - 1}}}} \right).$$

With this maximum noise, \({{\gamma }_{k}} = \mathcal{O}\left( {{{{({{\mu }_{x}}\mu _{y}^{2}\varepsilon )}}^{{\frac{1}{{2(\beta - 1)}}}}}} \right)\). Hence, we guarantee that the second and third terms in (B.10) are smaller than \(\varepsilon \) (up to a constant) for the selected parameters. To reduce the number of iterations, we select the batch size on the order of \({{\beta }^{3}}d\). Let us determine the minimum number of iterations by solving the following inequality:

$${{(1 - {{\mu }_{x}}{{\tau }_{x}})}^{N}}{{P}_{0}} \leqslant \varepsilon .$$

Thus, the minimum number of iterations is

$$\begin{gathered} N \geqslant \frac{1}{{{{\tau }_{x}}{{\mu }_{x}}}}\ln \frac{{{{P}_{0}}}}{\varepsilon } = 12\frac{{({{\beta }^{3}}d{\text{/}}B + 1)L_{2}^{3}}}{{{{\mu }_{x}}\mu _{y}^{2}}}\ln \frac{{{{P}_{0}}}}{\varepsilon } \\ \, = \mathcal{O}\left( {\mu _{x}^{{ - 1}}\mu _{y}^{{ - 2}}\ln \frac{1}{\varepsilon }} \right). \\ \end{gathered} $$

In the second equality, we use the fact that \(\tau_x = \frac{\mu_y^2\tau_y}{12L_2^2}\), \(\tau_y = \frac{1}{(M + 1)L_2}\), and \(M = \mathcal{O}(\beta^3 d{\text{/}}B)\), where \(d = \max(d_x, d_y)\). For a sufficiently large \(B\) on the order of \(\beta^3 d\), the dependence on the dimension disappears.

The oracle complexity is found from the iterative complexity through multiplying by the batch size, i.e.,

$$T = \mathcal{O}\left( {{{\beta }^{3}}d\mu _{x}^{{ - 1}}\mu _{y}^{{ - 2}}\ln \frac{1}{\varepsilon }} \right).$$

Thus, all terms in formula (B.10) are smaller than ε.
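For convenience, the parameter choices read off from the proof above (maximum admissible noise level, smoothing parameter, batch size, and the iteration and oracle complexities) can be collected in a small Python helper. The absolute constants are indicative only, since the proof tracks them up to \(\mathcal{O}(\cdot)\); the function name and its interface are our own.

from math import factorial, log, ceil

def theorem2_parameters(mu_x, mu_y, L2, L_beta, beta, d, eps, P0=1.0):
    l = beta                                            # the paper takes l = beta
    delta_max = (mu_x * mu_y**2 * eps) ** (beta / (2.0 * (beta - 1))) / d
    gamma = (factorial(l - 1) / L_beta
             * (d + beta - 1) / (beta - 1) * delta_max) ** (1.0 / beta)
    B = ceil(beta**3 * d)                               # batch size of order beta^3 * d
    N = ceil(12 * (beta**3 * d / B + 1) * L2**3
             / (mu_x * mu_y**2) * log(P0 / eps))        # iteration complexity
    T = N * B                                           # oracle complexity
    return {"delta_max": delta_max, "gamma": gamma, "B": B, "N": N, "T": T}

print(theorem2_parameters(mu_x=0.1, mu_y=0.1, L2=1.0, L_beta=1.0, beta=3, d=50, eps=1e-3))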

PROOF OF THEOREM 3

Lemma 9 (kernel approximation bias). Suppose that Assumptions 1–5 hold. Suppose also that \({{x}_{t}}\) and \({\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}})\) are determined by Algorithm 1 at instant \(t \geqslant 1\) with gradient approximation (4.4) for zero-order oracle (4.3). Then,

$$\begin{gathered} {\text{||}}\mathbb{E}[{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}})\,{\text{|}}\,{{x}_{t}}] - \nabla f({{x}_{t}}){\text{||}} \\ \, \leqslant {{\kappa }_{\beta }}\frac{{{{L}_{\beta }}}}{{(l - 1)!}} \cdot \frac{d}{{d + \beta - 1}}{{\gamma }^{{\beta - 1}}}, \\ \end{gathered} $$
(C.1)

where we recall that \(l=\beta \).

Proof. Using Lemma 6, the fact that \(\int_{ - 1}^1 rK(r)dr\) = 1, and the variational representation of the Euclidean norm, we can write

$$\begin{gathered} {\text{||}}\mathbb{E}[{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}})\,{\text{|}}\,{{x}_{t}}] - \nabla f({{x}_{t}}){\text{||}} \\ \, = \mathop {\sup }\limits_{{\mathbf{v}} \in {{\mathcal{S}}^{d}}} \mathbb{E}[({{\nabla }_{{\mathbf{v}}}}f(x + \gamma r{\mathbf{h}}) - {{\nabla }_{{\mathbf{v}}}}f(x))rK(r)], \\ \end{gathered} $$
(C.2)

where h is uniformly distributed over \(\mathcal{B}_{2}^{d}\). Since \(f(x)\) satisfies the Hölder condition with constants \(\beta \) and \({{L}_{\beta }}\), for any \({\mathbf{v}} \in {{\mathcal{S}}^{d}}\), directed gradient \({{\nabla }_{{\mathbf{v}}}}f( \cdot )\) satisfies the Hölder condition with constants \(\beta - 1\) and \({{L}_{\beta }}\). Thus, the following Taylor expansion holds:

$$\begin{gathered} {{\nabla }_{{\mathbf{v}}}}f({{x}_{t}} + \gamma r{\mathbf{h}}) = {{\nabla }_{{\mathbf{v}}}}f({{x}_{t}}) \\ \, + \sum\limits_{1 \leqslant |{\mathbf{m}}| \leqslant l - 1} \frac{{{{{(r\gamma )}}^{{|{\mathbf{m}}|}}}}}{{{\mathbf{m}}!}}{{D}^{{\mathbf{m}}}}{{\nabla }_{{\mathbf{v}}}}f({{x}_{t}})({\mathbf{h}}{{)}^{{\mathbf{m}}}} + R(\gamma r{\mathbf{h}}), \\ \end{gathered} $$
(C.3)

where remainder term \(R( \cdot )\) satisfies condition \({\text{|}}R(x){\text{|}} \leqslant \frac{{{{L}_{\beta }}}}{{(l - 1)!}}{\text{||}}x{\text{|}}{{{\text{|}}}^{{\beta - 1}}}\).

By substituting equation (C.3) into equation (C.2) and using the “nullification” properties of kernel K, we arrive at

$$\begin{gathered} {\text{||}}\mathbb{E}[{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}})\,{\text{|}}\,{{x}_{t}}] - \nabla f({{x}_{t}}){\text{||}} \\ \leqslant {{\kappa }_{\beta }}{{\gamma }^{{\beta - 1}}}\frac{{{{L}_{\beta }}}}{{(l - 1)!}}\mathbb{E}{\text{||}}{\mathbf{h}}{\text{|}}{{{\text{|}}}^{{\beta - 1}}} = {{\kappa }_{\beta }}{{\gamma }^{{\beta - 1}}}\frac{{{{L}_{\beta }}}}{{(l - 1)!}}\frac{d}{{d + \beta - 1}}, \\ \end{gathered} $$

where the last equality is obtained from the fact that \(\mathbb{E}{\text{||}}{\mathbf{h}}{\text{|}}{{{\text{|}}}^{q}} = \frac{d}{{d + q}}\) for any \(q \geqslant 0\).

\(\square \)

Lemma 10 (kernel approximation variance). Suppose that Assumptions 1–3 hold and that \(x_t\) and \({\mathbf{G}}(x_t,\xi,{\mathbf{e}})\) are determined by Algorithm 1 with gradient approximation (4.4) for zero-order oracle (4.3). Suppose also that \(f \in \mathcal{F}_2(L_2)\). Then, for \(d \geqslant 2\), we have

$$\mathbb{E}\|{\mathbf{G}}(x_t,\xi,{\mathbf{e}})\|^2 \leqslant \frac{d^2\kappa}{d - 1}\mathbb{E}\left[\left(\|\nabla f(x_t)\| + L_2\gamma\right)^2\right] + \frac{d^2{\tilde{\Delta}}^2\kappa}{\gamma^2},$$

where \(\kappa = \int_{ - 1}^1 {{K}^{2}}(r)dr\).

The result of Lemma 10 can be further simplified as follows:

$$\begin{gathered} \mathbb{E}{\text{||}}{\mathbf{G}}({{x}_{t}},\xi ,{\mathbf{e}}){\text{|}}{{{\text{|}}}^{2}} \leqslant 4d\kappa \mathbb{E}{\text{||}}\nabla f({{x}_{t}}){\text{|}}{{{\text{|}}}^{2}} \\ \, + 4d\kappa L_{2}^{2}{{\gamma }^{2}} + \frac{{{{d}^{2}}{{{\tilde {\Delta }}}^{2}}\kappa }}{{{{\gamma }^{2}}}},\quad d \geqslant 2. \\ \end{gathered} $$
(C.4)

Proof. For simplicity, we omit subscript \(t\) for all quantities. Let us write the second moment of the following quantity:

$$\begin{gathered} \mathbb{E}\|{\mathbf{G}}(x,\xi,{\mathbf{e}})\|^2 = \frac{d^2}{4\gamma^2}\mathbb{E}\left[\left(f(x + \gamma r{\mathbf{e}}) - f(x - \gamma r{\mathbf{e}}) + \xi_1 - \xi_2\right)^2 K^2(r)\right] \\ \leqslant \frac{d^2}{4\gamma^2}\left(\mathbb{E}\left[\left(f(x + \gamma r{\mathbf{e}}) - f(x - \gamma r{\mathbf{e}})\right)^2 K^2(r)\right] + 4\kappa{\tilde{\Delta}}^2\right). \end{gathered}$$
(C.5)

Below, all expectations are understood conditionally on \(x_t\). It should be noted that, since \(\mathbb{E}[f(x + hr{\mathbf{e}}) - f(x - hr{\mathbf{e}})\,|\,r] = 0\) and \(f \in \mathcal{F}_2(L_2)\), the Wirtinger–Poincaré inequality [22, 23] (see Eq. (3.1) and Theorem 2 therein) leads to

$$\begin{gathered} \mathbb{E}\left[ {{{{(f(x + hr{\mathbf{e}}) - f(x - hr{\mathbf{e}}))}}^{2}}\,{\text{|}}\,r} \right] \\ \, \leqslant \frac{{{{h}^{2}}}}{{d - 1}}\mathbb{E}\left[ {{\text{||}}\nabla f(x + hr{\mathbf{e}}) + \nabla f(x - hr{\mathbf{e}}){\text{|}}{{{\text{|}}}^{2}}\,{\text{|}}\,r} \right]. \\ \end{gathered} $$
(C.6)

Since \(f \in {{\mathcal{F}}_{2}}({{L}_{2}})\), the triangle inequality implies that

$$\begin{gathered} \mathbb{E}\left[ {{\text{||}}\nabla f(x + hr{\mathbf{e}}) + \nabla f(x - hr{\mathbf{e}}){\text{|}}{{{\text{|}}}^{2}}\,{\text{|}}\,r} \right] \\ \, \leqslant 4({\text{||}}\nabla f(x){\text{||}} + {{L}_{2}}\gamma {{)}^{2}}. \\ \end{gathered} $$
(C.7)

Finally, we substitute the above estimate into (C.6) and take (C.5) into account.

\(\square \)

We can now compute the noise and bias of the kernel approximation:

$$M = 4d\beta^3,\quad \sigma^2 = 4d\beta^3 L_2\gamma^2 + \frac{d^2{\tilde{\Delta}}^2\beta^3}{\gamma^2},$$
(C.8)
$${{\zeta }^{2}} = {{\beta }^{2}}{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}\frac{d}{{d + \beta - 1}}{{\gamma }^{{\beta - 1}}}} \right)}^{2}}.$$
(C.9)

Now, we can estimate the convergence rate for the kernel approximation by substituting the constants found into the final convergence estimate:

$$\begin{gathered} {{P}_{t}}\, \leqslant \,{{(1\, - \,{{\mu }_{x}}{{\tau }_{x}})}^{t}}{{P}_{0}}\, + \,\frac{{\tau _{y}^{2}{{L}_{2}}{\kern 1pt} \frac{{{{\sigma }^{2}}}}{B}\, + \,{{\tau }_{y}}{{\zeta }^{2}}}}{{10{{\mu }_{x}}{{\tau }_{x}}}}\, = \,{{(1 - {{\mu }_{x}}{{\tau }_{x}})}^{t}}{{P}_{0}} \\ \, + \frac{{12}}{{5B}}{\kern 1pt} \frac{{L_{2}^{3}d{{\gamma }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}}\, + \,\frac{3}{{5B}}{\kern 1pt} \frac{{L_{2}^{2}{{d}^{2}}{{{\tilde {\Delta }}}^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}{{\gamma }^{2}}}}\, + \,\frac{{12}}{5}{\kern 1pt} \frac{{L_{2}^{2}{{\beta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}}{{\left( {\frac{{{{L}_{\beta }}}}{{(l\, - \,1)!}}} \right)}^{2}}{{\gamma }^{{2\beta - 2}}} \\ \, = \mathcal{O}\left( {\frac{{L_{2}^{3}{{\gamma }^{2}}}}{{B{{\mu }_{x}}\mu _{y}^{2}}}\, + \,\frac{{L_{2}^{2}d{{{\tilde {\Delta }}}^{2}}}}{{B{{\mu }_{x}}\mu _{y}^{2}{{\gamma }^{2}}}}\, + \,\frac{{L_{2}^{2}{{\beta }^{2}}}}{{{{\mu }_{x}}\mu _{y}^{2}}}{{{\left( {\frac{{{{L}_{\beta }}}}{{(l - 1)!}}} \right)}}^{2}}{{\gamma }^{{2\beta - 2}}}} \right). \\ \end{gathered} $$
(C.10)

Here, we substitute the values \(\tau_y = \frac{1}{(M + 1)L_2}\) and \(\tau_x = \frac{\mu_y^2\tau_y}{12L_2^2}\). Let us now choose the smoothing parameter γ: minimizing the sum of the first two terms in (C.10) gives the optimal value \(\gamma_k = \sqrt[4]{\frac{d{\tilde{\Delta}}^2}{4L_2}}\). Requiring the last term in (C.10) to be at most \(\varepsilon\) then yields the maximum admissible noise level \(\tilde{\Delta} = \mathcal{O}\left(d^{-1/2}(\mu_x\mu_y^2\varepsilon)^{\frac{1}{\beta - 1}}\right)\), for which the smoothing parameter takes the form \(\gamma_k = \mathcal{O}\left((\mu_x\mu_y^2\varepsilon)^{\frac{1}{2(\beta - 1)}}\right)\). With these parameters, the last term is smaller than \(\varepsilon\), and selecting \(B\) on the order of \(\beta^3 d\) makes the first two terms in (C.10) smaller than \(\varepsilon\) as well. Let us determine the minimum number of iterations by solving the following inequality:

$${{(1 - {{\mu }_{x}}{{\tau }_{x}})}^{N}}{{P}_{0}} \leqslant \varepsilon $$

Thus, the minimum number of iterations is

$$\begin{gathered} N \geqslant \frac{1}{{{{\tau }_{x}}{{\mu }_{x}}}}\ln \frac{{{{P}_{0}}}}{\varepsilon } \\ \, = 12\frac{{\left( {{{\beta }^{3}}d{\text{/}}B + 1} \right)L_{2}^{3}}}{{{{\mu }_{x}}\mu _{y}^{2}}}\ln \frac{{{{P}_{0}}}}{\varepsilon } = \mathcal{O}\left( {\mu _{x}^{{ - 1}}\mu _{y}^{{ - 2}}\ln \frac{1}{\varepsilon }} \right). \\ \end{gathered} $$

Here, the second equality uses the fact that \(\tau_x = \frac{\mu_y^2\tau_y}{12L_2^2}\), \(\tau_y = \frac{1}{(M + 1)L_2}\), and \(M = \mathcal{O}(\beta^3 d{\text{/}}B)\), where \(d = \max(d_x, d_y)\). For a sufficiently large \(B\) on the order of \(\beta^3 d\), the dimension dependence disappears.

The oracle complexity is found from the iterative complexity through multiplying by the batch size, i.e.,

$$T = \mathcal{O}\left( {{{\beta }^{3}}d\mu _{x}^{{ - 1}}\mu _{y}^{{ - 2}}\ln \frac{1}{\varepsilon }} \right).$$

For these parameters, Algorithm 1 with gradient approximation (4.4) converges with the desired accuracy in this gradient-free oracle model (4.3).


Cite this article

Sadykov, S.I., Lobanov, A.V. & Raigorodskii, A.M. Gradient-Free Algorithms for Solving Stochastic Saddle Optimization Problems with the Polyak–Łojasiewicz Condition. Program Comput Soft 49, 535–547 (2023). https://doi.org/10.1134/S0361768823060063
