A Theoretical and Empirical Comparison of Gradient Approximations in Derivative-Free Optimization

Abstract

In this paper, we analyze several methods for approximating gradients of noisy functions using only function values. These methods include finite differences, linear interpolation, Gaussian smoothing, and smoothing on a sphere. The methods differ in the number of functions sampled, the choice of the sample points, and the way in which the gradient approximations are derived. For each method, we derive bounds on the number of samples and the sampling radius which guarantee favorable convergence properties for a line search or fixed step size descent method. To this end, we use the results in Berahas et al. (Global convergence rate analysis of a generic line search algorithm with noise, arXiv:1910.04055, 2019) and show how each method can satisfy the sufficient conditions, possibly only with some sufficiently large probability at each iteration, as happens to be the case with Gaussian smoothing and smoothing on a sphere. Finally, we present numerical results evaluating the quality of the gradient approximations as well as their performance in conjunction with a line search derivative-free optimization algorithm.

Notes

  1. Throughout the paper, N denotes the size of the sample set \(\{u_i: i=1, \ldots , N\}\). Note that for the central versions of the gradient approximations, the number of sampled functions is equal to 2N.

  2. The norms used in this paper are Euclidean norms.

  3. The bound (2.10) was presented in [29] without proof; we would like to thank the first author of [29] for providing us with guidance on this proof.

References

  1. Søren Asmussen and Peter W. Glynn. Stochastic Simulation: Algorithms and Analysis, volume 57 of Stochastic Modelling and Applied Probability. Springer, 2007.

  2. Afonso Bandeira, Katya Scheinberg, and Luis N Vicente. Computation of sparse low degree interpolating polynomials and their application to derivative-free optimization. Mathematical Programming, Series B, 134:223–257, 2012.

  3. Anastasia Bayandina, Alexander Gasnikov, Fariman Guliev, and Anastasia Lagunovskaya. Gradient-free two-points optimal method for nonsmooth stochastic convex optimization problem with additional small noise. arXiv preprint arXiv:1701.03821, 2017.

  4. Albert S Berahas, Richard H Byrd, and Jorge Nocedal. Derivative-free optimization of noisy functions via quasi-Newton methods. SIAM Journal on Optimization, 29(2):965–993, 2019.

  5. Albert S Berahas, Liyuan Cao, and Katya Scheinberg. Global convergence rate analysis of a generic line search algorithm with noise. arXiv preprint arXiv:1910.04055, 2019.

  6. Lev Bogolubsky, Pavel Dvurechenskii, Alexander Gasnikov, Gleb Gusev, Yurii Nesterov, Andrei M Raigorodskii, Aleksey Tikhonov, and Maksim Zhukovskii. Learning supervised PageRank with gradient-based and gradient-free optimization methods. Advances in Neural Information Processing Systems, 29:4914–4922, 2016.

  7. Raghu Bollapragada and Stefan M Wild. Adaptive sampling quasi-Newton methods for derivative-free stochastic optimization. arXiv preprint arXiv:1910.13516, 2019.

  8. Richard P Brent. Algorithms for minimization without derivatives. Courier Corporation, 2013.

  9. Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

  10. Richard H Byrd, Gillian M Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155, 2012.

  11. Richard G Carter. On the global convergence of trust region algorithms using inexact gradient information. SIAM Journal on Numerical Analysis, 28(1):251–265, 1991.

  12. Coralia Cartis and Katya Scheinberg. Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Mathematical Programming, pages 1–39, 2018.

  13. Krzysztof Choromanski, Atil Iscen, Vikas Sindhwani, Jie Tan, and Erwin Coumans. Optimizing simulations with noise-tolerant structured exploration. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 2970–2977. IEEE, 2018.

  14. Krzysztof Choromanski, Mark Rowland, Vikas Sindhwani, Richard E Turner, and Adrian Weller. Structured evolution with compact architectures for scalable policy optimization. arXiv preprint arXiv:1804.02395, 2018.

  15. Andrew R Conn, Katya Scheinberg, and Philippe L Toint. On the convergence of derivative-free methods for unconstrained optimization. In A. Iserles and M. Buhmann, editors, Approximation Theory and Optimization: Tributes to M. J. D. Powell, pages 83–108, Cambridge, England, 1997. Cambridge University Press.

  16. Andrew R Conn, Katya Scheinberg, and Philippe L Toint. A derivative free optimization algorithm in practice. Proceedings of the 7th AIAA/USAF/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, St. Louis, Missouri, September 2-4, 1998.

  17. Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Geometry of interpolation sets in derivative free optimization. Mathematical Programming, 111(1-2):141–172, 2008.

  18. Andrew R Conn, Katya Scheinberg, and Luis N Vicente. Introduction to Derivative-free Optimization. MPS-SIAM Optimization series. SIAM, Philadelphia, USA, 2008.

  19. Elizabeth D Dolan and Jorge J Moré. Benchmarking Optimization Software with Performance Profiles. Mathematical Programming, 91(2):201–213, 2002.

  20. John C Duchi, Michael I Jordan, Martin J Wainwright, and Andre Wibisono. Optimal rates for zero-order convex optimization: The power of two function evaluations. IEEE Transactions on Information Theory, 61(5):2788–2806, 2015.

  21. Pavel Dvurechensky, Eduard Gorbunov, and Alexander Gasnikov. An accelerated directional derivative method for smooth stochastic convex optimization. European Journal of Operational Research, 2020.

  22. Maryam Fazel, Rong Ge, Sham M Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039, 2018.

  23. Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.

  24. Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

  25. Kevin G Jamieson, Robert Nowak, and Ben Recht. Query complexity of derivative-free optimization. Advances in Neural Information Processing Systems, 25:2672–2680, 2012.

  26. Jack C Kiefer and Jacob Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.

  27. Jeffrey Larson, Matt Menickelly, and Stefan M Wild. Derivative-free optimization methods. Acta Numerica, 28:287–404, 2019.

  28. Sijia Liu, Bhavya Kailkhura, Pin-Yu Chen, Paishun Ting, Shiyu Chang, and Lisa Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 3727–3737, 2018.

  29. Alvaro Maggiar, Andreas Wächter, Irina S Dolinskaya, and Jeremy Staum. A derivative-free trust-region algorithm for the optimization of functions smoothed via Gaussian convolution using adaptive multiple importance sampling. SIAM Journal on Optimization, 28(2):1478–1507, 2018.

  30. Jorge J Moré and Stefan M Wild. Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization, 20(1):172–191, 2009.

  31. Jorge J Moré and Stefan M Wild. Estimating computational noise. SIAM Journal on Scientific Computing, 33(3):1292–1314, 2011.

  32. Yurii Nesterov and Vladimir Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.

  33. Jorge Nocedal and Stephen J Wright. Numerical Optimization, Second Edition. Springer, 2006.

  34. Courtney Paquette and Katya Scheinberg. A stochastic line search method with expected complexity analysis. SIAM Journal on Optimization, 30(1):349–376, 2020.

  35. Raghu Pasupathy, Peter Glynn, Soumyadip Ghosh, and Fatemeh S Hashemi. On sampling rates in simulation-based recursions. SIAM Journal on Optimization, 28(1):45–73, 2018.

  36. Valentin V Petrov. On lower bounds for tail probabilities. Journal of statistical planning and inference, 137(8):2703–2705, 2007.

  37. Boris T Polyak. Introduction to Optimization. Optimization Software, Inc., New York, 1987.

  38. Michael J D Powell. Unconstrained minimization algorithms without computation of derivatives. Bollettino delle Unione Matematica Italiana, 9:60–69, 1974.

  39. Michael J D Powell. The NEWUOA software for unconstrained optimization without derivatives. In Large-Scale Nonlinear Optimization, volume 83, pages 255–297. Springer, US, 2006.

  40. Mark Rowland, Krzysztof Choromanski, François Chalus, Aldo Pacchiano, Tamás Sarlós, Richard E Turner, and Adrian Weller. Geometrically coupled Monte Carlo sampling. In Advances in Neural Information Processing Systems, pages 195–205, 2018.

  41. Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. Technical Report arXiv:1703.03864, 2016.

  42. Klaus Schittkowski. More test examples for nonlinear programming codes, volume 282. Springer Science & Business Media, 2012.

  43. John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

  44. Ohad Shamir. An optimal algorithm for bandit and zero-order convex optimization with two-point feedback. The Journal of Machine Learning Research, 18(1):1703–1713, 2017.

  45. Sara Shashaani, Fatemeh S Hashemi, and Raghu Pasupathy. ASTRO-DF: A class of adaptive sampling trust-region algorithms for derivative-free stochastic optimization. SIAM Journal on Optimization, 28(4):3145–3176, 2018.

  46. James C Spall. Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control, 45(10):1839–1853, 2000.

  47. James C Spall. Introduction to stochastic search and optimization: estimation, simulation, and control, volume 65. John Wiley & Sons, 2005.

  48. Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2899–2908, 2018.

  49. Joel A Tropp. An introduction to matrix concentration inequalities. arXiv preprint arXiv:1501.01571, 2015.

  50. Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. The Journal of Machine Learning Research, 15(1):949–980, 2014.

  51. Stefan M Wild, Rommel G Regis, and Christine A Shoemaker. ORBIT: optimization by radial basis function interpolation in trust-regions. SIAM Journal on Scientific Computing, 30(6):3197–3219, 2008.

Author information

Corresponding author

Correspondence to Katya Scheinberg.

Additional information

Communicated by Michael Overton.

This work was partially supported by NSF Grants CCF 16-18717 and TRIPODS 17-40796, DARPA Lagrange award HR-001117S0039 and a Google Faculty Award.

Appendices

Derivations

1.1 Derivation of (2.10)

$$\begin{aligned} \Vert \nabla F(x) - \nabla \phi (x)\Vert&= \left\| {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \frac{1}{\sigma } f(x+\sigma u) u \right] - \nabla \phi (x)\right\| \\&= \left\| {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \frac{\phi (x+\sigma u) + \epsilon (x+\sigma u)}{\sigma } u \right] - \nabla \phi (x)\right\| \\&= \left\| {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \nabla \phi (x+\sigma u) - \nabla \phi (x) \right] + {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \frac{\epsilon (x+\sigma u)}{\sigma } u \right] \right\| \\&\le \left\| {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \nabla \phi (x+\sigma u) - \nabla \phi (x) \right] \right\| + \left\| {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \frac{\epsilon (x+\sigma u)}{\sigma } u \right] \right\| \\&\le {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} [\Vert \nabla \phi (x+\sigma u) - \nabla \phi (x)\Vert ] + {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \left\| \frac{\epsilon (x+\sigma u)}{\sigma } u \right\| \right] \\&\le L \sigma {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} [\Vert u \Vert ] + \frac{\epsilon _f}{\sigma } {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} [\Vert u \Vert ] \\&= \left( L \sigma + \frac{\epsilon _f}{\sigma } \right) \sqrt{2}\frac{\varGamma (\frac{n+1}{2})}{\varGamma (\frac{n}{2})} \le \sqrt{n}L \sigma + \frac{\sqrt{n} \epsilon _f}{\sigma }. \end{aligned}$$
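The last step uses \({\mathbb {E}}_{u \sim \mathcal {N}(0,I)} [\Vert u\Vert ] = \sqrt{2}\,\varGamma (\frac{n+1}{2})/\varGamma (\frac{n}{2}) \le \sqrt{n}\). This constant can be checked numerically with the following minimal Python sketch (the dimensions, sample size, and random seed below are arbitrary choices, not taken from the paper's experiments).

import math
import numpy as np

def gaussian_norm_mean(n):
    # Closed form sqrt(2) * Gamma((n+1)/2) / Gamma(n/2) for E||u||, u ~ N(0, I_n).
    return math.exp(0.5 * math.log(2.0) + math.lgamma((n + 1) / 2) - math.lgamma(n / 2))

rng = np.random.default_rng(0)
for n in (2, 10, 100):
    u = rng.standard_normal((200_000, n))
    mc = np.linalg.norm(u, axis=1).mean()      # Monte Carlo estimate of E||u||
    exact = gaussian_norm_mean(n)
    assert exact <= math.sqrt(n) + 1e-12       # the bound used in the final inequality
    print(f"n={n:3d}  MC {mc:.4f}  closed form {exact:.4f}  sqrt(n) {math.sqrt(n):.4f}")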

1.2 Derivation of (2.11)

$$\begin{aligned}&\Vert \nabla F(x) - \nabla \phi (x)\Vert \\&\quad ={} \left\| {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \frac{\phi (x+\sigma u) + \epsilon (x+\sigma u) - \phi (x-\sigma u) - \epsilon (x-\sigma u)}{2\sigma } u \right] - \nabla \phi (x)\right\| \\&\quad ={} \left\| {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \frac{1}{2}\nabla \phi (x+\sigma u) + \frac{1}{2}\nabla \phi (x-\sigma u) - \nabla \phi (x)\right] \right. \\&\qquad \left. + {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \frac{\epsilon (x+\sigma u) - \epsilon (x-\sigma u)}{2\sigma } u \right] \right\| \\&\quad \le {} \frac{1}{2}{\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \Vert \left( \nabla \phi (x+\sigma u) - \nabla \phi (x)\right) - (\nabla \phi (x) - \nabla \phi (x-\sigma u)) \Vert \right] \\&\qquad + {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \left\| \frac{\epsilon (x+\sigma u) - \epsilon (x-\sigma u)}{2\sigma } u \right\| \right] \\&\quad \le {} \frac{1}{2}{\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \Vert \left( \nabla \phi (x+\sigma u) - \nabla \phi (x)\right) - (\nabla \phi (x) - \nabla \phi (x-\sigma u)) \Vert \right] \\&\qquad + {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \frac{\epsilon _f}{\sigma } \Vert u\Vert \right] \\&\quad ={} \frac{1}{2}{\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \Vert \left( \nabla ^2 \phi (x+ \xi _1 u) - \nabla ^2 \phi (x - \xi _2 u )\right) \sigma u \Vert \right] + {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \frac{\epsilon _f}{\sigma } \Vert u\Vert \right] , \end{aligned}$$

for some \(0 \le \xi _1 \le \sigma \) and \(0 \le \xi _2 \le \sigma \) by the intermediate value theorem. Then

$$\begin{aligned} \Vert \nabla F(x) - \nabla \phi (x)\Vert \le {}&\frac{1}{2} {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \Vert \nabla ^2 \phi (x+ \xi _1 u) - \nabla ^2 \phi (x - \xi _2 u ) \Vert \Vert \sigma u \Vert \right] \\&\quad + {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \frac{\epsilon _f}{\sigma } \Vert u\Vert \right] \\ \le {}&\frac{1}{2} {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ M \Vert \xi _1 u + \xi _2 u\Vert \cdot \sigma \Vert u\Vert \right] + {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \frac{\epsilon _f}{\sigma } \Vert u\Vert \right] \\ ={}&\frac{1}{2} {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ |\xi _1 + \xi _2| \cdot \Vert u\Vert ^2 M \sigma \right] + {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \frac{\epsilon _f}{\sigma } \Vert u\Vert \right] \\ \le {}&n M \sigma ^2 + \frac{\sqrt{n} \epsilon _f}{\sigma } . \end{aligned}$$

1.3 Derivation of (2.18)

For the first equality, let \(A = \mathrm {E}_{u \sim {\mathcal {N}}(0,I)} (a^\intercal u)^2 u u ^\intercal \). Then for any \((i,j) \in \{1,2,\dots , n\}^2\) with \(i \ne j\), we have

$$\begin{aligned} A_{ij}= & {} \mathrm {E} \left\{ (a^\intercal u )^2 u _i u _j \right\} \\= & {} \sum _{k=1}^n \sum _{l=1}^n \mathrm {E} \left\{ a_k u _k a_l u _l u _i u _j \right\} \\= & {} \sum _{k=i} \sum _{l=i} \mathrm {E} \left\{ a_k u _k a_l u _l u _i u _j \right\} + \sum _{k\ne i} \sum _{l=i} \mathrm {E} \left\{ a_k u _k a_l u _l u _i u _j \right\} \\&\quad + \sum _{k=i} \sum _{l\ne i} \mathrm {E} \left\{ a_k u _k a_l u _l u _i u _j \right\} + \sum _{k\ne i} \sum _{l\ne i} \mathrm {E} \left\{ a_k u _k a_l u _l u _i u _j \right\} \\= & {} \mathrm {E} \left\{ a_i^2 u _i^3 u _j \right\} + \sum _{k\ne i} \mathrm {E} \left\{ a_k a_i u _k u _i^2 u _j \right\} + \sum _{l\ne i} \mathrm {E} \left\{ a_i a_l u _l u _i^2 u _j \right\} \\&\quad + \mathrm {E} \left\{ u _i \right\} \sum _{k\ne i} \sum _{l\ne i}\mathrm {E} \left\{ a_k u _k a_l u _l u _j \right\} \\= & {} 0 + \sum _{k\ne i} \mathrm {E} \left\{ a_k a_i u _k u _i^2 u _j \right\} + \sum _{l\ne i} \mathrm {E} \left\{ a_i a_l u _l u _i^2 u _j \right\} + 0 \\= & {} \mathrm {E} \left\{ a_i a_j u _i^2 u _j^2 \right\} + \mathrm {E} \left\{ a_i a_j u _i^2 u _j^2 \right\} \\= & {} 2 a_i a_j. \end{aligned}$$

For any \(i \in \{1,2,\dots ,n\}\),

$$\begin{aligned} \begin{aligned} A_{ii}&= \mathrm {E} \left\{ (a^\intercal u )^2 u _i^2 \right\} \\&= \sum _{k=i} \sum _{l=i} \mathrm {E} \left\{ a_k u _k a_l u _l u _i^2 \right\} + \sum _{k\ne i} \sum _{l=i} \mathrm {E} \left\{ a_k u _k a_l u _l u _i^2 \right\} \\&\quad + \sum _{k=i} \sum _{l\ne i} \mathrm {E} \left\{ a_k u _k a_l u _l u _i^2 \right\} + \sum _{k\ne i} \sum _{l\ne i} \mathrm {E} \left\{ a_k u _k a_l u _l u _i^2 \right\} \\&= \mathrm {E} \left\{ a_i^2 u _i^4 \right\} + \sum _{k\ne i} \mathrm {E} \left\{ a_k a_i u _k u _i^3 \right\} + \sum _{l\ne i} \mathrm {E} \left\{ a_i a_l u _l u _i^3 \right\} \\&\quad + \mathrm {E} \left\{ u _i^2 \right\} \sum _{k\ne i} \sum _{l\ne i} \mathrm {E} \left\{ a_k u _k a_l u _l \right\} \\&= 3a_i^2 + 0 + 0 + 1 \times \sum _{k=l\ne i} \mathrm {E} \left\{ a_k u _k a_l u _l \right\} = 3a_i^2 + \sum _{k\ne i} \mathrm {E} \left\{ a_k^2 u_k^2 \right\} \\&= 3a_i^2 + \sum _{k\ne i} a_k^2 = 2 a_i^2 + \sum _{k=1}^n a_k^2. \end{aligned} \end{aligned}$$

Writing the result in matrix form, we get \(\mathrm {E}_{u \sim {\mathcal {N}}(0,I)} \left[ (a^\intercal u)^2 u u ^\intercal \right] = a^\intercal a I + 2 a a^\intercal \). This result remains valid for any distribution of u whose entries \(u_i\), \(i \in \{1,2,\dots ,n\}\), are i.i.d. with \({\mathbb {E}} u_i = 0\), \({\mathbb {E}} u_i^2=1\), and \({\mathbb {E}} u_i^4=3\) (the fourth-moment condition is what is used in the computation of \(A_{ii}\) above).
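As a quick numerical sanity check of this identity (a Monte Carlo sketch with an arbitrary dimension, sample size, and seed; not part of the derivation above), one can compare the empirical average of \((a^\intercal u)^2 u u^\intercal \) against \(a^\intercal a I + 2 a a^\intercal \) in Python.

import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 500_000
a = rng.standard_normal(n)
u = rng.standard_normal((m, n))                 # u ~ N(0, I)
s = u @ a                                       # a^T u for each sample
emp = (u * (s ** 2)[:, None]).T @ u / m         # Monte Carlo estimate of E[(a^T u)^2 u u^T]
exact = (a @ a) * np.eye(n) + 2.0 * np.outer(a, a)
print(np.max(np.abs(emp - exact)))              # small, O(1/sqrt(m))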

For the second equality, since the probability density function of \({\mathcal {N}}(0, I)\) is even while \(a^\intercal u \cdot \Vert u\Vert ^k \cdot u u ^\intercal \) is an odd function of u, the expectation \({\mathbb {E}}_{u \sim {\mathcal {N}}({0},I)} \left[ a^T u \cdot \Vert u\Vert ^k \cdot u u^T \right] \) is zero.

Because \({\mathbb {E}}_{u \sim {\mathcal {N}}({0},I)} \left[ \Vert u\Vert ^k u^\intercal u \right] = {\mathbb {E}}_{u \sim {\mathcal {N}}({0},I)} \left[ \Vert u\Vert ^{k+2} \right] \) is the \((k+2)\)th moment of a chi-distributed random variable with n degrees of freedom for all \(k \in {\mathbb {N}}\), we have

$$\begin{aligned} {\mathbb {E}}_{u \sim {\mathcal {N}}({0},I)} \left[ \Vert u\Vert ^k u^\intercal u \right] = \frac{2^{1 + k/2} \varGamma ((n+k+2)/2)}{\varGamma (n/2)}. \end{aligned}$$

This value is also the trace of the matrix \({\mathbb {E}}_{u \sim {\mathcal {N}}({0},I)} \left[ \Vert u\Vert ^k u u^T \right] \). Since all n diagonal elements of this matrix are equal, we have

$$\begin{aligned} {\mathbb {E}}_{u \sim {\mathcal {N}}({0},I)} \left[ \Vert u\Vert ^k u u^T \right] = \frac{2^{1 + k/2} \varGamma ((n+k+2)/2)}{n \varGamma (n/2)} I \text { for } k = 0,1, 2, \dots . \end{aligned}$$

For even k, this quantity is equal to \(\prod _{i=1}^{k/2} (n + 2i)\). For odd k, it is equal to \(\left[ \sqrt{2} \varGamma \left( \frac{n+1}{2} \right) \big / \varGamma \left( \frac{n}{2} \right) \right] \frac{1}{n} \prod _{i=1}^{(k+1)/2} (n + 2i - 1)\). Using the inequality \(\sqrt{2} \varGamma \left( \frac{n+1}{2} \right) \big / \varGamma \left( \frac{n}{2} \right) \le \sqrt{n}\) for all \(n \in {\mathbb {N}}\), we have

$$\begin{aligned} {\mathbb {E}}_{u \sim {\mathcal {N}}({0},I)} \left[ \Vert u\Vert ^k u u^T \right] \preceq (n+1)(n+3)\cdots (n+k) \cdot n^{-0.5} I \text { for } k = 1, 3, 5, \dots . \end{aligned}$$
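The closed-form coefficient \(2^{1+k/2} \varGamma ((n+k+2)/2) / (n \varGamma (n/2))\) and the even/odd-k expressions above can also be checked by simulation. The following sketch (arbitrary n, values of k, and sample size) compares a Monte Carlo estimate of one diagonal entry of \({\mathbb {E}} [\Vert u\Vert ^k u u^T]\) with the closed form.

import math
import numpy as np

def coeff(n, k):
    # 2^(1 + k/2) * Gamma((n+k+2)/2) / (n * Gamma(n/2))
    return math.exp((1.0 + k / 2.0) * math.log(2.0)
                    + math.lgamma((n + k + 2) / 2) - math.lgamma(n / 2)) / n

rng = np.random.default_rng(2)
n, m = 4, 2_000_000
u = rng.standard_normal((m, n))
norms = np.linalg.norm(u, axis=1)
for k in (1, 2, 3, 4):
    mc = np.mean(norms ** k * u[:, 0] ** 2)    # one diagonal entry of E[||u||^k u u^T]
    print(f"k={k}  Monte Carlo {mc:.3f}  closed form {coeff(n, k):.3f}")

# Even k: coeff(n, k) equals (n+2)(n+4)...(n+k); e.g. coeff(4, 2) == 6 and coeff(4, 4) == 48.
# Odd k:  coeff(n, k) <= (n+1)(n+3)...(n+k) / sqrt(n), the bound stated in the text above.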

1.4 Derivation of (2.28)

$$\begin{aligned} {\mathbb {E}} \left[ \Vert g(x) - \nabla F(x)\Vert ^2\right] ={}&{\mathbb {E}} \left[ \left\| \frac{1}{N} \sum _{i=1}^N \frac{f(x+\sigma u_i) - f(x)}{\sigma } u_i - \nabla F(x) \right\| ^2 \right] \\ ={}&\frac{1}{N} {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \left( \frac{f(x+\sigma u) - f(x)}{\sigma } \right) ^2 u^\intercal u \right] - \frac{1}{N} \nabla F(x)^\intercal \nabla F(x) \\ ={}&\frac{1}{N} {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \left( a^\intercal u\right) ^2 u^\intercal u \right] - \frac{1}{N} a^\intercal a \\ ={}&\frac{1}{N} (n + 1) a^\intercal a. \end{aligned}$$
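Equality (2.28) can be verified by simulation in the setting where it is exact, namely a noiseless linear function: for \(f(y) = a^\intercal y\) one has \((f(x+\sigma u) - f(x))/\sigma = a^\intercal u\) and \(\nabla F(x) = a\). The following sketch (arbitrary n, N, and number of trials) compares the empirical mean squared error of the estimator with \((n+1)\, a^\intercal a / N\).

import numpy as np

rng = np.random.default_rng(3)
n, N, trials = 8, 4, 100_000
a = rng.standard_normal(n)
# For f(y) = a^T y, (f(x + sigma*u) - f(x)) / sigma = a^T u exactly, and nabla F(x) = a.
U = rng.standard_normal((trials, N, n))
coef = np.einsum('tij,j->ti', U, a)             # a^T u_i for each trial and direction
g = np.einsum('ti,tij->tj', coef, U) / N        # forward-difference estimator g(x)
mc = np.mean(np.sum((g - a) ** 2, axis=1))
print(mc, (n + 1) * (a @ a) / N)                # the two numbers should agree closely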

1.5 Derivation of (2.29)

The expression for \({\mathbb {E}}\left[ \Vert g(x) - \nabla F(x)\Vert ^4\right] \) expands into a sum of \(N^4\) terms, each involving a product of four vectors:

$$\begin{aligned}&{\mathbb {E}} \left[ \Vert g(x) - \nabla F(x)\Vert ^4\right] \\&\quad = \frac{1}{N^4} {\mathbb {E}} \left[ \sum _{i=1}^N \sum _{j=1}^N \sum _{k=1}^N \sum _{l=1}^N \prod _{w \in \{i,j,k,l\}} \left( \frac{f(x+\sigma u_w) - f(x)}{\sigma } u_w - \nabla F(x) \right) \right] , \end{aligned}$$

where \(\prod \) denotes the product of the inner products of the two pairs of vectors. Specifically, given four vectors \(a_1, a_2, a_3, a_4\in \mathbb {R}^n\), \(\prod _{i \in \{1,2,3,4\}} a_i = (a_1^\intercal a_2) \cdot (a_3^\intercal a_4)\) and \(\prod _{i \in \{1,1,2,2\}} a_i = (a_1^\intercal a_1) \cdot (a_2^\intercal a_2)\).

We first observe that the expectation of \(\prod _{w \in \{i,j,k,l\}} \left( \frac{f(x+\sigma u_w) - f(x)}{\sigma } u_w - \nabla F(x) \right)\) is zero whenever one of the indices \((i,j,k,l)\) is different from all of the others. This is because the \(u_w\), \(w \in \{i,j,k,l\}\), are mutually independent whenever their indices differ, and

$$\begin{aligned} {\mathbb {E}} \left[ \frac{f(x+\sigma u_w) - f(x)}{\sigma } u_w - \nabla F(x) \right] = 0. \end{aligned}$$

Thus, we only need to consider the terms satisfying one of the following conditions:

  1. \(i=j=k=l\);

  2. \(i=j \ne k=l\);

  3. \(i=k \ne j=l\);

  4. \(i=l \ne j=k\).

First, we consider the case \(i=j \ne k=l\), which can occur only when \(N>1\).

$$\begin{aligned} \begin{aligned}&{\mathbb {E}} \left[ \sum _{i=1}^N \sum _{k=1, k\ne i}^N \prod _{w \in \{i,i,k,k\}} \left( \frac{f(x+\sigma u_w) - f(x)}{\sigma } u_w - \nabla F(x) \right) \right] \\&\quad ={} \sum _{i=1}^N {\mathbb {E}} \left[ \left\| \frac{f(x+\sigma u_i) - f(x)}{\sigma } u_i - \nabla F(x) \right\| ^2 \right] \\&\qquad \cdot \sum _{k=1,k\ne i}^N {\mathbb {E}} \left[ \left\| \frac{f(x+\sigma u_k) - f(x)}{\sigma } u_k - \nabla F(x) \right\| ^2 \right] \\&\quad ={} N(N-1) \left[ (n + 1) a^\intercal a \right] ^2. \end{aligned} \end{aligned}$$

We now consider the two other cases, \(i=k \ne j=l\) and \(i=l \ne j=k\), which are essentially the same. We have

$$\begin{aligned} \begin{aligned}&{\mathbb {E}} \left[ \sum _{i=1}^N \sum _{k=1, k\ne i}^N \prod _{w \in \{i,k,i,k\}} \left( \frac{f(x+\sigma u_w) - f(x)}{\sigma } u_w - \nabla F(x) \right) \right] \\&\quad ={} \sum _{i=1}^N \sum _{k=1, k\ne i}^N {\mathbb {E}} \left\{ \left[ \left( \frac{f(x+\sigma u_i) - f(x)}{\sigma } u_i - \nabla F(x) \right) ^\intercal \right. \right. \\&\quad \left. \left. \left( \frac{f(x+\sigma u_k) - f(x)}{\sigma } u_k - \nabla F(x) \right) \right] ^2 \right\} \\&\quad ={} \sum _{i=1}^N \sum _{k=1, k\ne i}^N {\mathbb {E}} \left( \left\{ \left[ (a^\intercal u_i) u_i - a \right] ^\intercal \left[ (a^\intercal u_k) u_k - a \right] \right\} ^2 \right) \\&\quad ={} \sum _{i=1}^N \sum _{k=1, k\ne i}^N {\mathbb {E}} \left( \left[ (a^\intercal u_i) (a^\intercal u_k) (u_i^\intercal u_k) - (a^\intercal u_i)^2 - (a^\intercal u_k)^2 + a^\intercal a \right] ^2 \right) \\&\quad ={} \sum _{i=1}^N \sum _{k=1, k\ne i}^N {\mathbb {E}} \left[ \begin{array}{rl} &{}(a^\intercal u_i)^2 (a^\intercal u_k)^2 (u_i^\intercal u_k)^2 + (a^\intercal u_i)^4 + (a^\intercal u_k)^4 + (a^\intercal a)^2 \\ +&{} 2 (a^\intercal a) (a^\intercal u_i) (a^\intercal u_k) (u_i^\intercal u_k) - 2(a^\intercal a) (a^\intercal u_i)^2 - 2(a^\intercal a) (a^\intercal u_k)^2 \\ -&{} 2 (a^\intercal u_i)^3 (a^\intercal u_k) (u_i^\intercal u_k) - 2 (a^\intercal u_i) (a^\intercal u_k)^3 (u_i^\intercal u_k) + 2 (a^\intercal u_i)^2 (a^\intercal u_k)^2 \end{array} \right] \\&\qquad ={} \sum _{i=1}^N \sum _{k=1, k\ne i}^N \left[ \begin{array}{rl} &{} (n+8) (a^\intercal a)^2 + 3(a^\intercal a)^2 + 3(a^\intercal a)^2 + (a^\intercal a)^2 \\ +&{} 2 (a^\intercal a)^2 - 2(a^\intercal a)^2 - 2(a^\intercal a)^2 \\ -&{} 6 (a^\intercal a)^2 - 6 (a^\intercal a)^2 + 2 (a^\intercal a)^2 \end{array} \right] \\&\qquad ={} \sum _{i=1}^N \sum _{k=1, k\ne i}^N (n+3) (a^\intercal a)^2 = N(N-1) (n+3) (a^\intercal a)^2 \end{aligned} \end{aligned}$$

Finally, we have the \(i=j=k=l\) case:

$$\begin{aligned} \begin{aligned}&{\mathbb {E}} \left[ \sum _{i=1}^N \prod _{w \in \{i,i,i,i\}} \left( \frac{f(x+\sigma u_w) - f(x)}{\sigma } u_w - \nabla F(x) \right) \right] \\&\quad ={} N {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \left\| \frac{f(x+\sigma u) - f(x)}{\sigma } u - \nabla F(x) \right\| ^4 \right] \\&\quad ={} N {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left\{ \left[ \left( \frac{f(x+\sigma u) - f(x)}{\sigma }\right) ^2 u^\intercal u\right. \right. \\&\qquad \left. \left. - 2 \left( \frac{f(x+\sigma u) - f(x)}{\sigma }\right) u^\intercal \nabla F(x) + \nabla F(x)^\intercal \nabla F(x) \right] ^2 \right\} \\&\quad ={} N {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \begin{array}{rl} &{}\left( \frac{f(x+\sigma u) - f(x)}{\sigma }\right) ^4 (u^\intercal u)^2 + 4 \left( \frac{f(x+\sigma u) - f(x)}{\sigma }\right) ^2 \left( u^\intercal \nabla F(x) \right) ^2 \\ +&{} \left( \nabla F(x)^\intercal \nabla F(x) \right) ^2 - 4 \left( \frac{f(x+\sigma u) - f(x)}{\sigma }\right) ^3 (u^\intercal u) \left( u^\intercal \nabla F(x) \right) \\ -&{} 4 \left( \frac{f(x+\sigma u) - f(x)}{\sigma }\right) (u^\intercal \nabla F(x)) (\nabla F(x)^\intercal \nabla F(x)) \\ +&{} 2 \left( \frac{f(x+\sigma u) - f(x)}{\sigma }\right) ^2 (u^\intercal u) (\nabla F(x)^\intercal \nabla F(x)) \end{array} \right] \\&\quad ={} N {\mathbb {E}}_{u \sim \mathcal {N}(0,I)} \left[ \begin{array}{rl} &{}\left( a^\intercal u \right) ^4 (u^\intercal u)^2 + 4 \left( a^\intercal u \right) ^2 \left( u^\intercal a \right) ^2 + \left( a^\intercal a \right) ^2 - 4 \left( a^\intercal u \right) ^3 (u^\intercal u) \left( u^\intercal a \right) \\ -&{} 4 \left( a^\intercal u \right) (u^\intercal a) (a^\intercal a) + 2 \left( a^\intercal u \right) ^2 (u^\intercal u) (a^\intercal a) \end{array} \right] \\&\quad ={} N \left[ \begin{array}{rl} &{} 3 (n+4)(n+6) (a^\intercal a)^2 + 12 (a^\intercal a)^2 + (a^\intercal a)^2 - 12 (n+4) (a^\intercal a)^2 \\ -&{} 4(a^\intercal a)^2 + 2 (n+2) (a^\intercal a)^2 \end{array} \right] \\&\qquad ={} N (3n^2 + 20n + 37) (a^\intercal a)^2 \end{aligned} \end{aligned}$$

In summary, we have

$$\begin{aligned} \begin{aligned}&N^4 {\mathbb {E}} \left[ \Vert g(x) - \nabla F(x)\Vert ^4 \right] \\&\quad ={} N(N-1) (n+1)^2 (a^\intercal a)^2 + 2N(N-1)(n+3) (a^\intercal a)^2 +N(3n^2 + 20n + 37) (a^\intercal a)^2 \\&\quad ={} N(N-1)(n^2 + 4n + 7) (a^\intercal a)^2 + N(3n^2 + 20n + 37) (a^\intercal a)^2. \end{aligned} \end{aligned}$$
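The fourth-moment expression can be checked in the same noiseless linear setting as (2.28). The sketch below (arbitrary n, N, and number of trials; fourth moments converge slowly, so agreement is only approximate) compares a Monte Carlo estimate with \(\left[ N(N-1)(n^2+4n+7) + N(3n^2+20n+37)\right] (a^\intercal a)^2 / N^4\).

import numpy as np

rng = np.random.default_rng(4)
n, N, trials = 6, 3, 400_000
a = rng.standard_normal(n)
U = rng.standard_normal((trials, N, n))
coef = np.einsum('tij,j->ti', U, a)             # a^T u_i (exact for a linear f)
g = np.einsum('ti,tij->tj', coef, U) / N
mc = np.mean(np.sum((g - a) ** 2, axis=1) ** 2)
aa = float(a @ a)
exact = (N * (N - 1) * (n**2 + 4*n + 7) + N * (3*n**2 + 20*n + 37)) * aa**2 / N**4
print(mc, exact)                                # approximate agreement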

1.6 Derivation of (2.35)

$$\begin{aligned} \Vert \nabla F(x) - \nabla \phi (x)\Vert&= \left\| {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0,1))} \left[ \frac{n}{\sigma } f(x+\sigma u) u \right] - \nabla \phi (x)\right\| \\&= \left\| {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0,1))} \left[ \frac{n}{\sigma } (\phi (x+\sigma u) + \epsilon (x+\sigma u)) u \right] - \nabla \phi (x)\right\| \\&= \left\| {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {B}}(0,1))} \left[ \nabla \phi (x+\sigma u) - \nabla \phi (x) \right] \right. \\&\quad \left. + {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0,1))} \left[ \frac{n\epsilon (x+\sigma u)}{\sigma } u \right] \right\| \\&\le \left\| {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {B}}(0,1))} \left[ \nabla \phi (x+\sigma u) - \nabla \phi (x) \right] \right\| \\&\quad + \left\| {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0,1))} \left[ \frac{n\epsilon (x+\sigma u)}{\sigma } u \right] \right\| \\&\le {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {B}}(0,1))} [\Vert \nabla \phi (x+\sigma u) - \nabla \phi (x)\Vert ]\\&\quad + {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0,1))} \left[ \left\| \frac{n\epsilon (x+\sigma u)}{\sigma } u \right\| \right] \\&\le L \sigma {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {B}}(0,1))} [\Vert u \Vert ] + \frac{n \epsilon _f}{\sigma } {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0,1))} [\Vert u \Vert ] \\&= L \sigma \frac{n}{n+1} + \frac{n \epsilon _f}{\sigma } \le L \sigma + \frac{n \epsilon _f}{\sigma }. \end{aligned}$$
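The final equality uses \({\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {B}}(0,1))} [\Vert u\Vert ] = n/(n+1)\) and \(\Vert u\Vert = 1\) for \(u \sim {\mathcal {U}}({\mathcal {S}}(0,1))\). A small simulation check of the ball expectation follows (the sampling scheme, dimension, and sample size are arbitrary choices).

import numpy as np

rng = np.random.default_rng(5)
n, m = 7, 1_000_000
v = rng.standard_normal((m, n))
directions = v / np.linalg.norm(v, axis=1, keepdims=True)   # uniform on the unit sphere
radii = rng.random(m) ** (1.0 / n)                          # radius of a uniform point in the unit ball
u = radii[:, None] * directions
print(np.linalg.norm(u, axis=1).mean(), n / (n + 1))        # the two values should agree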

1.7 Derivation of (2.36)

$$\begin{aligned}&\Vert \nabla F(x) - \nabla \phi (x)\Vert \\&\quad ={} \left\| {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0, 1))} \left[ \frac{n}{2\sigma } (\phi (x+\sigma u) + \epsilon (x+\sigma u) - \phi (x-\sigma u) - \epsilon (x-\sigma u)) u \right] - \nabla \phi (x)\right\| \\&\quad ={} \left\| {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {B}}(0, 1))} \left[ \frac{1}{2}\nabla \phi (x+\sigma u) + \frac{1}{2}\nabla \phi (x-\sigma u) - \nabla \phi (x)\right] \right. \\&\qquad \left. + {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0, 1))} \left[ \frac{n}{2\sigma } (\epsilon (x+\sigma u) - \epsilon (x-\sigma u)) u \right] \right\| \\&\quad \le {} \frac{1}{2}{\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {B}}(0, 1))} \left[ \Vert \left( \nabla \phi (x+\sigma u) - \nabla \phi (x)\right) - (\nabla \phi (x) - \nabla \phi (x-\sigma u)) \Vert \right] \\&\qquad + {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0, 1))} \left[ \left\| \frac{n}{2\sigma } (\epsilon (x+\sigma u) - \epsilon (x-\sigma u)) u \right\| \right] \\&\quad \le {} \frac{1}{2}{\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {B}}(0, 1))} \left[ \Vert \left( \nabla \phi (x+\sigma u) - \nabla \phi (x)\right) - (\nabla \phi (x) - \nabla \phi (x-\sigma u)) \Vert \right] \\&\qquad + {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0, 1))} \left[ \frac{n \epsilon _f}{\sigma } \Vert u\Vert \right] \\&\quad ={} \frac{1}{2}{\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {B}}(0, 1))} \left[ \Vert \left( \nabla ^2 \phi (x+ \xi _1 u) - \nabla ^2 \phi (x - \xi _2 u )\right) \sigma u \Vert \right] + {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0, 1))} \left[ \frac{n \epsilon _f}{\sigma } \Vert u\Vert \right] , \end{aligned}$$

for some \(0 \le \xi _1 \le \sigma \) and \(0 \le \xi _2 \le \sigma \) by the intermediate value theorem. Then

$$\begin{aligned} \Vert \nabla F(x) - \nabla \phi (x)\Vert \le {}&\frac{1}{2} {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {B}}(0, 1))} \left[ \Vert \nabla ^2 \phi (x+ \xi _1 u) - \nabla ^2 \phi (x - \xi _2 u ) \Vert \Vert \sigma u \Vert \right] \\&\quad + {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0, 1))} \left[ \frac{n \epsilon _f}{\sigma } \Vert u\Vert \right] \\ \le {}&\frac{1}{2} {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {B}}(0, 1))} \left[ M \Vert \xi _1 u + \xi _2 u\Vert \cdot \sigma \Vert u\Vert \right] \\&\quad + {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0, 1))} \left[ \frac{n \epsilon _f}{\sigma } \Vert u\Vert \right] \\ ={}&\frac{1}{2} {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {B}}(0, 1))} \left[ |\xi _1 + \xi _2| \cdot \Vert u\Vert ^2 M \sigma \right] \\&\quad + {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0, 1))} \left[ \frac{n \epsilon _f}{\sigma } \Vert u\Vert \right] \\ \le {}&M \sigma ^2 + \frac{n \epsilon _f}{\sigma } . \end{aligned}$$

1.8 Derivation of (2.39)

The first and third equalities of (A.8) come from the first and third equalities of (2.18). Any vector v with i.i.d. standard Gaussian entries can be written as \(v = \Vert v\Vert u\), where \(u = v / \Vert v\Vert \) is uniformly distributed on the unit sphere; moreover, \(\Vert v\Vert \) and u are independent. Thus, any homogeneous polynomial p of degree k in the entries of u has the property that

$$\begin{aligned} {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0,1))} [ p(u) ] = \frac{{\mathbb {E}}_{v \sim {\mathcal {N}}(0,I)} [ p(v) ]}{{\mathbb {E}}_{v \sim {\mathcal {N}}(0,I)} \Vert v\Vert ^k}. \end{aligned}$$

Then

$$\begin{aligned} \begin{aligned} {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0,1))} \left[ (a^T u)^2 u u^T \right]&= \frac{{\mathbb {E}}_{u \sim {\mathcal {N}}(0,I)} \left[ (a^T u)^2 u u^T \right] }{{\mathbb {E}}_{u \sim {\mathcal {N}}(0,I)} \Vert u \Vert ^4} = \frac{a^T a I + 2 a a^T}{n(n+2)} \\ {\mathbb {E}}_{u \sim {\mathcal {U}}({\mathcal {S}}(0,1))} \left[ \Vert u\Vert ^k u u^T \right]&= \frac{{\mathbb {E}}_{u \sim {\mathcal {N}}(0,I)} \left[ \Vert u\Vert ^k u u^T \right] }{{\mathbb {E}}_{u \sim {\mathcal {N}}(0,I)} \Vert u\Vert ^{k+2}} = \frac{1}{n} I. \end{aligned} \end{aligned}$$

That the expectation in the second equality of (2.39) is zero follows from the same argument as for the second equality of (2.18).
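The first identity in (2.39) can likewise be checked by Monte Carlo simulation, drawing \(u \sim {\mathcal {U}}({\mathcal {S}}(0,1))\) by normalizing Gaussian vectors (a sketch with arbitrary dimension and sample size).

import numpy as np

rng = np.random.default_rng(6)
n, m = 5, 1_000_000
a = rng.standard_normal(n)
v = rng.standard_normal((m, n))
u = v / np.linalg.norm(v, axis=1, keepdims=True)            # u ~ U(S(0,1))
s = u @ a
emp = (u * (s ** 2)[:, None]).T @ u / m                     # Monte Carlo E[(a^T u)^2 u u^T]
exact = ((a @ a) * np.eye(n) + 2.0 * np.outer(a, a)) / (n * (n + 2))
print(np.max(np.abs(emp - exact)))                          # should be small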

Additional Details: RL Experiments

In all RL experiments, the black-box function f takes as input the parameters \(\theta \) of the policy \(\pi _{\theta }:{\mathcal {S}} \rightarrow {\mathcal {A}}\), which maps states to proposed actions. The output of f is the total reward obtained by an agent applying the policy \(\pi _{\theta }\) in the given environment.

To encode the policies \(\pi _{\theta }\), we used fully connected feedforward neural networks with two hidden layers, each with \(h=41\) neurons and \(\mathrm {tanh}\) nonlinearities. The connection matrices were encoded by low-displacement-rank neural networks (see [14]), as in several recent papers that apply orthogonal directions to gradient estimation for ES methods in reinforcement learning. We did not apply any additional techniques such as state/reward renormalization, ranking, or filtering, in order to focus solely on evaluating the presented gradient approximations.
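For concreteness, a minimal dense (unstructured) sketch of such a policy network is given below; the state and action dimensions are placeholders, the weight initialization is an arbitrary choice, and the low-displacement-rank parameterization of [14] used in the actual experiments is not reproduced here.

import numpy as np

def init_policy(state_dim, action_dim, h=41, seed=0):
    # Two hidden layers of h = 41 units; dense weights here, unlike the structured matrices of [14].
    rng = np.random.default_rng(seed)
    dims = [state_dim, h, h, action_dim]
    return [(rng.standard_normal((dims[i], dims[i + 1])) / np.sqrt(dims[i]),
             np.zeros(dims[i + 1])) for i in range(3)]

def policy_action(params, state):
    # Map a state to an action; tanh nonlinearities on the two hidden layers, linear output.
    x = np.asarray(state, dtype=float)
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

# The black-box objective f(theta) is then the total reward of one rollout of this policy.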

All experiments were run with sampling radius \(\sigma =0.1\). Experiments that did not apply a line search used the \(\mathrm {Adam}\) optimizer with step size \(\alpha = 0.01\). For the line search experiments, we used an adaptive \(\alpha \) updated via the Armijo condition with Armijo parameter \(c_{1}=0.2\) and backtracking factor \(\tau =0.3\).
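A minimal sketch of the backtracking rule implied by these hyperparameters is given below (assuming minimization of f, e.g., the negative total reward, and a descent direction d built from the estimated gradient g; the noise-aware relaxations of the Armijo condition analyzed in [5] are not reproduced here).

import numpy as np

def armijo_backtracking(f, x, g, d, alpha0=1.0, c1=0.2, tau=0.3, max_backtracks=30):
    # Shrink alpha by tau until the Armijo sufficient-decrease condition holds.
    fx, alpha = f(x), alpha0
    for _ in range(max_backtracks):
        if f(x + alpha * d) <= fx + c1 * alpha * float(g @ d):
            return alpha
        alpha *= tau
    return alpha

# Tiny usage example on a quadratic with its exact gradient:
f = lambda y: 0.5 * float(y @ y)
x = np.ones(3)
g = x.copy()                         # gradient of f at x
print(armijo_backtracking(f, x, g, -g))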

Finally, in order to construct orthogonal samples, at each iteration we orthogonalized random Gaussian matrices, with entries drawn independently from \({\mathcal {N}}(0,1)\), via the Gram–Schmidt procedure (see [14]). Instead of orthogonalizing Gaussian matrices, one could take advantage of constructions in which orthogonality is embedded into the structure (such as the random Hadamard matrices from [14]); these introduce extra bias but have proved to work well in practice. However, this was not necessary in any of the conducted experiments.
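A possible sketch of this construction is given below; it uses a QR factorization in place of an explicit Gram–Schmidt loop, and the rescaling of each orthogonal direction to the norm of an independent Gaussian vector is an assumption following common practice for orthogonal ES samples, not necessarily the exact variant used in the experiments.

import numpy as np

def orthogonal_gaussian_directions(N, n, rng):
    # Return N pairwise-orthogonal directions in R^n (N <= n). The rows of a Gaussian
    # matrix are orthogonalized (QR stands in for the Gram-Schmidt step), and each
    # direction is rescaled to the norm of an independent Gaussian vector (assumption).
    G = rng.standard_normal((N, n))
    Q, _ = np.linalg.qr(G.T)                      # columns of Q: orthonormal basis of the rows of G
    norms = np.linalg.norm(rng.standard_normal((N, n)), axis=1)
    return Q.T * norms[:, None]

rng = np.random.default_rng(7)
U = orthogonal_gaussian_directions(5, 20, rng)
print(np.round(U @ U.T, 6))                       # off-diagonal entries are numerically zero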

For each environment and each method, we ran \(k=3\) experiments corresponding to different random seeds.

Cite this article

Berahas, A.S., Cao, L., Choromanski, K. et al. A Theoretical and Empirical Comparison of Gradient Approximations in Derivative-Free Optimization. Found Comput Math 22, 507–560 (2022). https://doi.org/10.1007/s10208-021-09513-z
