Abstract
In this paper, we focus on linear functionals defining an approximate version of the gradient of a function. These functionals are often used in optimization problems where computing the gradient of the objective function is costly or the objective function values are affected by some noise. They have recently been considered to estimate the gradient of the objective function as the expected value of the function variations in the space of directions. The expected value is then approximated by a sample average over a proper (random) choice of sample directions in the domain of integration. In this way, the approximation error is characterized by statistical properties of the sample average estimate, typically its variance. Therefore, while useful and attractive bounds on the error variance can be expressed in terms of the number of function evaluations, nothing can be said about the error of a single experiment, which could be quite large. This work instead aims at deriving an approximation scheme for linear functionals approximating the gradient, whose approximation error can be characterized from a deterministic point of view in the case of noise-free data. The linear functionals mentioned above are no longer considered as expected values over the space of directions, but rather as the derivative of the objective function filtered by a Gaussian kernel. Using this new approach, a gradient estimation based on a suitable linear combination of central finite differences at different step sizes is proposed, and deterministic bounds that do not depend on the particular sample of points considered are computed. In the noisy setting, on the other hand, the variance of the estimation error of the proposed method is shown to be strictly lower than that of the estimation error of the Central Finite Difference scheme.
Numerical experiments on a set of test functions are encouraging, showing good performances compared to those of some methods commonly used in the literature, also in the noisy setting.
1 Introduction
Derivative-free optimization (DFO) algorithms have become increasingly important since they provide a proper methodology to tackle many of the optimization problems considered in various fields of application. As reported in [4, 8, 16], typical applications fall within simulation-based optimization problems, such as policy optimization in reinforcement learning. DFO methods arise when derivative information is either unavailable or quite costly to obtain, not to mention when only noisy samples of the objective function are available. In the latter case, it is known that most methods based on finite differences are of little use [11, 19].
One of the approaches in DFO algorithms is that of computing a proper estimate of the gradient of the objective function. Finite difference approximation schemes were already present in early times [15] and have recently been reconsidered as sample average approximations of functionals defining a "filtered version" of the objective function [2, 3, 9, 13]. These functionals arise when defining a gradient approximation as the average of the function variation along all the directions in the whole space. In the most popular methods, the average is performed by weighting the function variations along directions generated either with a uniform kernel on the unit ball [9] or with a Gaussian kernel [2]. These integrals are considered as ensemble averages over the space of the directions of differentiation, and are then approximated by sample averages over a random sample of directions, with various methods. As a general policy, the approximation error is then characterized by its statistical properties (even in the noise-free setting), the variance is expressed in terms of the number of function evaluations, and nice bounds are provided to trade off precision of the gradient estimation against computational cost. Nevertheless, it is plain that the error on a single sample may be quite large, even though its variance is bounded.
In this paper, we focus on a different point of view. The functional defining a filtered version of the objective function is considered as a weak derivative of the objective function rather than as an expected value over the space of directions [20]. The gradient estimation is therefore obtained by considering a numerical approximation of the functional integral, and the estimation error is evaluated in a deterministic fashion. The estimate is obtained by a suitable linear combination of central finite differences at steps of increasing size. Bounds on the approximation error of the proposed method are derived, and the variance of the error in the case of noisy data is also presented.
The goodness of the approximation is experimentally evaluated by comparing the proposed method with those considered benchmarks in the literature, namely: Forward Finite Differences (FFD), Central Finite Differences (CFD) [15], Gaussian Smoothed Gradient (GSG), and Central Gaussian Smoothed Gradient (cGSG) [9, 13], over the benchmark of the Schittkowski functions [17]. Encouraging results are obtained, both in the noise-free and in the noisy setting.
The paper is organized as follows: Sect. 2 formally introduces the gradient estimation problem, highlighting the difference between the approach proposed in this article and that of several estimates proposed in the literature. In Sect. 3, we present the proposed approximation scheme, NMXFD, with an emphasis on its link with the finite difference method. A theoretical comparison between the variances of the estimation errors of the proposed method and of the CFD scheme is presented in Sect. 4. Section 5 presents numerical results, and conclusions are drawn in Sect. 6.
2 The Gradient Estimate
In this paper, we consider the following unconstrained optimization problem in the derivative-free optimization (DFO) setting [6, 12]:
where \(f:\, R^n\mapsto R\) is a continuously differentiable function, i.e., \(f\in {\mathcal {C}}^1(R^n)\), and we denote by \(\nabla f:\, R^n\mapsto R^n\) its gradient, such that for any \(x\in R^n\)
In this section, the problem of a numerical approximation of the gradient \(\nabla f(x)\) is considered. The most popular approximation scheme is the standard finite difference method [15], but interesting alternative schemes are proposed in papers [2, 9]. A general estimate is obtained according to the following formula:
where \(\varphi (s):\, R^n \mapsto R\) denotes either a standard Gaussian kernel \({\mathcal {N}}(0, I_n )\) or a uniform kernel on the unit ball \({\mathcal {B}}(0, 1 )\), \({\text {d}}s={\text {d}}s_1\cdot {\text {d}}s_2\cdot \cdots \cdot {\text {d}}s_n\) is the volume element in \(R^n\), and \(\sigma >0\) is a scale parameter. The approximation error has different bounds depending on the assumptions on f (see [4]). If the function f is continuously differentiable, and its gradient is L-Lipschitz continuous for all \(x \in R^n\), then
where \(C_{\varphi }\) is a positive constant whose value depends on the kernel. If the function f is twice continuously differentiable, and its Hessian is H-Lipschitz continuous for all \(x \in R^n\), then
Both bounds (3) and (4) show that
We will now work out formula (2) considering the (standard) Gaussian kernel
but the considerations that follow hold also if a uniform kernel over the unit ball is considered.
Let us consider this further notation: for any \(x\in R^n\) denote by \({\bar{x}}_i\in R^{n-1}\) the following vector \( \begin{bmatrix} x_1, x_2 \, \ldots \, ,x_{i-1}, x_{i+1}, \, \ldots \, , x_n\end{bmatrix}^\mathrm{T}\). With some abuse of notation, but for the sake of simplicity in the use of formulas, when addressing a given coordinate \(x_i\) in a vector x let us write x as \([x_i\>\bar{x}_i]^\mathrm{T}\) and denote f(x) as \(f(x_i,{\bar{x}}_i)\) and \(\varphi (s)=\varphi (s_i)\varphi ({\bar{s}}_i)\), with \(\varphi (\bar{s}_i)=\prod _{j\ne i}^n \varphi (s_j)\); consistently, the volume element becomes \({\text {d}}s = {\text {d}}s_i\cdot {\text {d}}{\bar{s}}_i\). In the case of a vector function f(z), to address explicitly its i-th entry we write it as \([(f(z))_i\>\overline{(f(z))}_i]^\mathrm{T}\). Then, estimate (2) is rewritten as follows
Let us consider the generic entry of vector (7)
By Fubini's theorem, we can compute it as follows
The expression in parentheses is the estimate of the directional derivative of f(x) along the i-th coordinate \(x_i\), computed at the point \((x_i, {\bar{x}}_i + \sigma {\bar{s}}_i)\), i.e.,
Hence, expression (8) becomes
Therefore, the generic entry of the gradient estimate \(G_{\sigma }(x)\) in formula (7) is the average of function (10) weighted by a \((n-1)\)-dimensional Gaussian kernel \(\varphi ({\bar{s}}_i)={\mathcal {N}}(0, I_{n-1} )\) over the subspace \(R^{n-1}\) of \(R^n\). As a consequence, the computation of any entry of vector \(G_{\sigma }(x)\) implies an integration over \(R^n\). In papers [2, 3], this problem is overcome by considering that (2) is indeed an ensemble average of function \(f(x + \sigma s) s\) over all the directions \(s\in R^n\) weighted by the Gaussian distribution \(\varphi (s)\sim {\mathcal {N}}(0, I_n )\). Therefore, we can write
Now the ensemble average can be well approximated by sampling a set of M independent directions \(\{s_i\}\) in \(R^n\) according to \({\mathcal {N}}(0, I_n )\), and considering the sample average approximation of \(E_{\varphi }[f(x + \sigma s) s]\)
or its symmetric version
The same argument holds if a uniform distribution over the unit ball is considered for the ensemble average [9]. Now, only \(M+1\) function evaluations in case of (13), or 2M in case of (14), are needed, and the convergence properties of the sample estimate to the ensemble average are well established: the sample average is an unbiased estimate, and its accuracy increases with increasing M. In [3], suitable expressions for the estimation error variance are found in terms of the number of samples M and the values of some smoothness parameters of the function f. Therefore, very useful formulas are given that define the sample size required to obtain a chosen accuracy with a fixed level of confidence \(1-\alpha \). This is a typical statistical characterization of the error, which is robust over the whole ensemble of possible trials, but of course leaves a risk \(\alpha \) of a large error on a single experiment.
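As a concrete illustration, the two sample-average estimators can be sketched as follows. This is a minimal sketch, not the paper's code: it assumes (13) is the forward form \((1/M)\sum _k [f(x+\sigma s_k)-f(x)]\,s_k/\sigma \) and (14) its symmetric counterpart, in line with the GSG and cGSG schemes of [9, 13]; all function and variable names are ours.

```python
import numpy as np

def gsg(f, x, sigma, M, rng):
    # forward Gaussian smoothed gradient: directions s_k ~ N(0, I_n)
    s = rng.standard_normal((M, x.size))
    d = np.array([(f(x + sigma * sk) - f(x)) / sigma for sk in s])
    return (d[:, None] * s).mean(axis=0)

def cgsg(f, x, sigma, M, rng):
    # symmetric (central) version: two evaluations per direction
    s = rng.standard_normal((M, x.size))
    d = np.array([(f(x + sigma * sk) - f(x - sigma * sk)) / (2 * sigma)
                  for sk in s])
    return (d[:, None] * s).mean(axis=0)

# On a quadratic the estimator is unbiased, but a single run still carries
# a sampling error that shrinks only as 1/sqrt(M) -- the point made above.
f = lambda x: 0.5 * x @ x                      # grad f(x) = x
x = np.array([1.0, -2.0, 0.5]); grad = x.copy()
rng = np.random.default_rng(1)
est = cgsg(f, x, sigma=1e-3, M=2000, rng=rng)
print(np.linalg.norm(est - grad))              # small but nonzero
```

With M = 2000 the error norm is typically of order 0.1 on this example, illustrating that the bound is statistical rather than per-experiment.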
In this paper, by exploiting formula (10), the following gradient estimate is proposed
where
is obtained from (10) with \({\bar{s}}_i = 0 , \> i = 1,\ldots , n\). This is a different result from estimate (7) and appears to be more practical since only line integrals are involved in the formula.
The following theorem shows that estimate \({\overline{G}}_\sigma (x)\) is close to \(G_\sigma (x)\) and converges to it as \(\sigma \) tends to zero.
Theorem 2.1
Let \(\nabla f(x)\) be Lipschitz continuous with constant L for all \(x \in R^n\). Then we have that
Proof
See Appendix for the proof. \(\square \)
The next theorem shows that \({\overline{G}}_\sigma (x)\) is indeed a good approximation of the true gradient \(\nabla f(x)\) and converges to it as \(\sigma \) tends to zero.
Theorem 2.2
Let f(x) be continuously differentiable for all \(x \in R^n\). The following holds:
Proof
We prove (18) componentwise. By integration by parts, we have
where \(z_i = x_i + \sigma s_i\). By the change of variable \(s_i = \frac{z_i - x_i}{\sigma } \), we obtain that
and therefore, taking into account that a sequence of Gaussians \(\frac{1}{\sigma _n}\varphi (\frac{z_i-x_i}{\sigma _n})\) with \(\sigma _n\rightarrow 0\) defines a Dirac \(\delta \) distribution centered in \(x_i\) [10], we have that
\(\square \)
Any entry of (15) is a weak definition of the derivative of f(x) along \(x_i\) [10]. Note that (19) is well defined even though f(x) is not differentiable at \((x_i,\,{\bar{x}}_i)\) (see footnote 1).
3 A New Estimate of the Gradient
We consider the functional \(g_\sigma (x_i, {\bar{x}}_i)\), which is the i-th component of the gradient estimate (15), and, for the sake of simplicity, we write in a single formula the result of (19) and (20).
Note that \(\frac{1}{\sigma }\varphi (\frac{z_i-x_i}{\sigma })\) is \({\mathcal {N}}(x_i, \sigma ^2)\). Our goal is to find a numerical approximation of the first integral in (22). To do that, we compute the integral over a finite range, namely between \(-S\) and S
For S sufficiently large, the error between (22) and (23) is negligible due to the fast decay of the Gaussian at infinity. The definite integral in (23) can be approximated by a quadrature formula, e.g., the Trapezoidal Rule [1]. Dividing the interval \([-S, S]\) into 2m subintervals, each of size \(h = \frac{S}{m}\), we obtain:
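The resulting quadrature rule (24) can be sketched numerically for a scalar function. This is an illustrative sketch, not the paper's code: it assumes the filtered derivative can be written as \((1/\sigma )\int _{-S}^{S} f(x+\sigma \tau )\,\tau \varphi (\tau )\,{\text {d}}\tau \), which follows from (22) via integration by parts and \(\varphi '(\tau ) = -\tau \varphi (\tau )\); names and the test function are ours.

```python
import numpy as np

phi = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)   # standard Gaussian pdf

def filtered_derivative(f, x, sigma, S=3.0, m=6):
    # trapezoidal rule on [-S, S] with 2m subintervals of size h = S/m
    h = S / m
    tau = np.linspace(-S, S, 2 * m + 1)
    w = np.full(tau.size, h)
    w[0] = w[-1] = h / 2                                  # endpoint weights
    # assumed integrand: (1/sigma) * f(x + sigma*tau) * tau * phi(tau)
    vals = np.array([f(x + sigma * t) for t in tau])
    return float((w * vals * tau * phi(tau)).sum() / sigma)

# smooth 1-D test: f = sin, f'(1) = cos(1)
est = filtered_derivative(np.sin, 1.0, sigma=1e-3, S=3.0, m=6)
print(est, np.cos(1.0))
```

With these parameters the estimate comes out a few percent below \(\cos (1)\): the quadrature weights do not sum exactly to one, which is precisely the small bias that the normalization introduced later in this section removes.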
It is well known that, under very general conditions, the trapezoidal quadrature formula (24) has an error that is \({\mathcal {O}}(1/m^2)\) [5]. Indeed, once \(\sigma \) and S are chosen, we can easily check this property in our case. Let
Note that the derivatives \(\varphi ^{(k)}(\tau )\) of a Gaussian kernel, up to the third order, are all less than 1 in absolute value for any \(\tau \), and decrease rapidly as \(\tau \) increases. Therefore, for f sufficiently smooth in \((x_i\pm \sigma \,S)\), let
We can write:
Let us rewrite (24) as follows
The larger the number of function evaluations m, the smaller the error term \(\epsilon _\sigma (\tau ,m)\). On the other hand, \(\bar{g}_\sigma (x_i)\) can be interpreted as a combination of finite differences with some coefficients. Keeping in mind that \(\varphi '(-t) = -\varphi '(t)\) and that \(\varphi '(0) = 0\), after some simple algebra we can write:
from which
It is clear that \(\bar{g}_\sigma (x_i, {\bar{x}}_i)\) is a linear combination of finite difference approximations, with different step sizes; for \(\sigma h \rightarrow 0\), each one converges to the true value of the partial derivative \({\partial f(x_i, {\bar{x}}_i)}/{\partial x_i}\). Therefore, the estimate \(\bar{g}_\sigma (x_i, {\bar{x}}_i)\) converges to the true value only if the sum of its coefficients equals one. For this reason, it is advisable to normalize the coefficients of the linear combination in (26) to eliminate the estimate bias for \(\sigma \) finite. To this aim, let C be the sum of all the coefficients:
We can then write the normalized version of (26) as:
where
For \(\sigma \) small enough, the normalization of the coefficients may not be necessary, the distortion of the estimate being negligible. Let us now evaluate the error bound corresponding to estimate (28), from here on referred to as NMXFD (Normalized Mixed Finite Difference).
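To make the construction concrete, the following sketch builds the normalized coefficients and the resulting NMXFD estimate of a single partial derivative. The coefficient expression is our reconstruction from the trapezoidal pairing above (pairing the nodes \(\pm jh\) yields one central difference per j, with unnormalized weight proportional to \(w_j (jh)^2 \varphi (jh)\), where the endpoint weight is halved); the paper's displayed formulas (25)-(29) define the exact weights, so treat this as an assumption.

```python
import numpy as np

phi = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def nmxfd_coeffs(m, S=3.0):
    # reconstructed (assumed) coefficients: a_j ∝ w_j * (j*h)^2 * phi(j*h),
    # w_j = h for interior nodes, h/2 at the endpoint j = m; normalized
    # so that sum_j a_j = 1, as required to remove the estimate bias.
    h = S / m
    j = np.arange(1, m + 1)
    w = np.full(m, h)
    w[-1] = h / 2
    c = 2.0 * w * (j * h) ** 2 * phi(j * h)
    return c / c.sum()

def nmxfd_partial(f, x, i, sigma, m=6, S=3.0):
    # normalized linear combination of central differences at steps sigma*j*h
    h = S / m
    a = nmxfd_coeffs(m, S)
    e = np.zeros_like(x)
    e[i] = 1.0
    cfd = np.array([(f(x + sigma * j * h * e) - f(x - sigma * j * h * e))
                    / (2 * sigma * j * h) for j in range(1, m + 1)])
    return float(a @ cfd)

f = lambda x: np.sin(x[0]) + x[0] * x[1]
x = np.array([1.0, 2.0])                   # d f / d x_0 at x is cos(1) + 2
est = nmxfd_partial(f, x, 0, sigma=1e-3)
```

Since the coefficients sum to one, each central difference converges to the true partial derivative as \(\sigma h \rightarrow 0\) and so does their combination, without the bias of the unnormalized rule.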
Theorem 3.1
Let f(x) be twice continuously differentiable and its Hessian be H-Lipschitz for all \(x\in R^n\). Consider the gradient approximation obtained by (28)
We have that
Proof
Any single finite difference term in (28) has an error with respect to the true value \({\partial f(x_i, {\bar{x}}_i)}/{\partial x_i}\) whose bound depends on the step size and on the regularity properties of function f. From [4], we have that
for \(j=1,\ldots ,m\). Therefore, since \(\sum _{j = 1}^{m} a_j = 1\), and \(a_j>0\), \(j=1,\ldots ,m\), we can write
which applied to all entries of \({\widehat{G}}_\sigma (x)\nabla f(x)\), proves the theorem. \(\square \)
Here we used the equality \(m\,h=S\), which implies that the error bound does not depend on the number of function evaluations.
4 Estimation Error with Noisy Data
Let us now evaluate how the performance of the NMXFD gradient estimate (30), here referred to as \({\hat{G}}_\sigma ^{{\text {MXF}}}(x)\), compares with that of the Central Finite Differences (CFD), also taking into account the presence of an additive noise affecting the sampled function values f(x). Let \(\{e_i\}\) be the canonical basis of \(R^n\); then we can write:
With the same notation, we can easily write the gradient estimate according to the CFD scheme, here denoted as \({\hat{G}}_\sigma ^{{\text {CFD}}}(x)\):
Let \(\{\epsilon _i\}\) denote a discrete random field modeling the additive noise on the sampled function values, with the following properties: \(\epsilon _i \sim N(0,\lambda ^2)\) and \(E[\epsilon _i \,\epsilon _j] = 0\) for \(i\ne j\). We now compute the estimation errors for the two schemes and compare them in terms of accuracy (mean value) and precision (variance). The accuracy evaluates the estimate bias, i.e., the systematic source of the error, such as the limited number N of function evaluations used to build the estimate. The precision is the dispersion of the estimation error around its mean value and evaluates the variability of the statistical source of the error.
The CFD scheme
According to (34), a number \(N = 2 n\) of function evaluations is considered to obtain
with \(\epsilon _i^\pm \) denoting the noise on the function values \(f(x_i\pm \sigma \,h,{\bar{x}}_i)\). Let
be the estimation error. We can see that
and
where var[z], \(z \in R^n\) with \(E[z] = 0\), indicates the trace of the covariance matrix \(E[z\,z^\mathrm{T}]\). Now, for functions f as in Theorem 3.1, let us consider property (32), with \(j=1\), for all the components of \(E[e_{{\text {CFD}}}(x)]\). We obtain that
Therefore, as the increment \(\sigma h\rightarrow 0\), the error goes to zero as well on average, but its variance increases without bound as \({\mathcal {O}}\left( 1/(\sigma h)^2\right) \).
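This blow-up of the noise variance is easy to check by simulation. The sketch below uses the noise model above and the structure of (35): each error component carries the noise term \((\epsilon _i^+ - \epsilon _i^-)/(2\sigma h)\), so the trace of the covariance should equal \(n\lambda ^2/(2(\sigma h)^2)\). Parameter values are ours, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, sigma, h = 5, 1e-3, 0.01, 1.0
trials = 200_000

# i.i.d. N(0, lambda^2) noise on the two function values of each coordinate
eps_plus = rng.normal(0.0, lam, size=(trials, n))
eps_minus = rng.normal(0.0, lam, size=(trials, n))
noise_err = (eps_plus - eps_minus) / (2.0 * sigma * h)

empirical = noise_err.var(axis=0).sum()          # trace of error covariance
theoretical = n * lam**2 / (2.0 * (sigma * h) ** 2)
print(empirical, theoretical)
```

Shrinking the increment \(\sigma h\) by a factor of 10 multiplies both quantities by 100, which is the \({\mathcal {O}}(1/(\sigma h)^2)\) behavior stated above.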
The NMXFD scheme
In this case, according to (33), a number \(N = 2 m\,n\) of function evaluations is considered to obtain
with \(\epsilon _{i,j}^\pm \) denoting the error terms on the function values \(f(x_i\pm \sigma \,jh, {\bar{x}}_i)\), \(i=1,\ldots ,n\), \(j=1,\ldots ,m\). For the estimation error
we readily obtain that
Under the assumptions of Theorem 3.1, and taking into account (31), we obtain
As for the error variance, two interesting results can be proved.
Proposition 4.1
For any \(m>1\), the variance of the estimation error of the NMXFD scheme is strictly lower than the variance of the estimation error of the CFD scheme, i.e.,
in any \(x\in R^n\) and for any \(\sigma \), h.
Proof
The sum of squares \(\sum _{j= 1}^{m} a_j^2\) is strictly less than 1, since the coefficients \(a_j\), \(j=1,\ldots ,m\), are all positive and their sum is 1. Therefore, from (36) we obtain that
\(\square \)
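The argument can be checked numerically. The sketch below uses our reconstruction of the coefficients from Sect. 3 (an assumption: \(a_j \propto w_j (jh)^2 \varphi (jh)\) with the endpoint weight halved, normalized to sum to one); since the \(a_j\) are positive and sum to one, \(\sum _j a_j^2 < 1\) for every \(m>1\), as the proof requires.

```python
import numpy as np

phi = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def coeffs(m, S=3.0):
    # reconstructed (assumed) NMXFD coefficients, normalized to sum to one
    h = S / m
    j = np.arange(1, m + 1)
    w = np.full(m, h)
    w[-1] = h / 2
    c = w * (j * h) ** 2 * phi(j * h)
    return c / c.sum()

ratios = [float((coeffs(m) ** 2).sum()) for m in (2, 4, 8, 16)]
print(ratios)   # all strictly below 1, shrinking as m grows
```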
Now we further show that \(var\left[ e_{{\text {MXF}}}(x)\right] \) goes to zero as N increases.
Proposition 4.2
For any \(x\in R^n\), the variance of the estimation error of the NMXFD scheme has the following asymptotic behavior
Proof
By taking into account relations (27), we have that
Let us denote with \(I_{\varphi ^\prime }^{(1)}(m)\) the following quantity
that is the trapezoidal quadrature formula for the integral
Due to the \({\mathcal {O}}(1/N^2) \) property of the error of the trapezoidal rule, we have that
Therefore, from (41), we easily obtain that
so that C is a bounded quantity as \(N=2m\,n\) increases (by increasing m), taking into account that \(mh=S\). Now, according to the relations (29) we can write
Define now \(I_{\varphi ^\prime }^{(2)}(m)\) as follows
It is the trapezoidal quadrature rule for the integral
where \({{\,\mathrm{erf}\,}}(z)=\frac{2}{\sqrt{\pi }}\int _0^z e^{-t^2}\,{\text {d}}t\) is the Gauss error function. Hence, by the usual error property, we can write
Therefore, we obtain that
Now recalling that \(m\,h = S \), and that \(N=2m\,n\), we can write
which along with (42), proves the proposition. \(\square \)
5 Numerical Experiments
We tested our method for estimating the gradient by comparing its performance with those of other methods on 69 functions from the Schittkowski test set [17].
For each function, we did the following: we generated a random starting point \(x^0\) and minimized the function using the quasi-Newton method of Broyden, Fletcher, Goldfarb and Shanno (BFGS) [14], finding the optimal point \(x^*\) with \(\nabla f(x^*) \approx 0\). We then identified the first instance of a point \(x^k\) where
for each of the following values of \(\alpha \): \(10^{0}, 10^{-1},10^{-2},10^{-3},10^{-4},10^{-5},10^{-6}\). In this way, we generated seven different buckets, one for each \(\alpha \), of 69 different points, one for each function. Bucket i denotes the one associated with \(\alpha = 10^{-i}\). Bucket 0 is therefore the one with the points that are farthest from the optimal solution, and bucket 6 is the one with the points closest to the optimal solution.
Then, for each point we computed the gradient approximations obtained with the Normalized MiXed Finite Differences scheme (NMXFD) and with those considered benchmarks in the literature, namely: Forward Finite Differences (FFD), Central Finite Differences (CFD), Gaussian Smoothed Gradient (GSG), and Central Gaussian Smoothed Gradient (cGSG), as defined in [4]. Different tables summarize the results of this comparison.
The tables show, for different values of the number of function evaluations (N) and different buckets (B), the median value of the log of the relative approximation error over all the 69 points in each bucket.
We define relative approximation error as
where g(x) is the generic gradient estimate. The number of function evaluations N is expressed in the following tables as a function of the number of dimensions n. The FFD and CFD schemes only allow for a specific value of N (\(n+1\) and 2n, respectively). In GSG and cGSG, N is linked to the number of directions sampled to build the gradient approximation (\(N=M+1\) in (13) and \(N = 2\,M\) in (14)). In the NMXFD scheme, the value of N is linked to the value of m in formula (28). In particular, we have that \(N = 2mn\). In each table, the lowest entry for every bucket is highlighted in bold, and the second lowest in italics.
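For reference, the error metric and one CFD evaluation of it can be sketched as follows. This assumes the standard form \(\Vert g(x)-\nabla f(x)\Vert /\Vert \nabla f(x)\Vert \) for the relative error; the test function and parameter values are ours.

```python
import numpy as np

def relative_error(g_est, grad_true):
    # relative approximation error of a gradient estimate
    return np.linalg.norm(g_est - grad_true) / np.linalg.norm(grad_true)

# e.g. a CFD estimate (N = 2n evaluations) of f(x) = sum_i sin(x_i)
f = lambda x: np.sin(x).sum()
x = np.array([0.3, 1.1, -0.7])
grad = np.cos(x)
sh = 1e-3                                  # combined increment sigma*h
est = np.array([(f(x + sh * e) - f(x - sh * e)) / (2 * sh)
                for e in np.eye(x.size)])
err = relative_error(est, grad)
print(np.log10(err))    # the tables report the median of this log-error
```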
5.1 NoiseFree Setting
For the noisefree setting, we report three different tables obtained using a different value of \(\sigma \) (shared by all the schemes) to compute the gradient approximation (Tables 1, 2, 3).
It is possible to notice that in a noise-free setting, lower values of \(\sigma \) tend to yield better results, as one would expect from the theory. The closer the point is to the minimum of a function, the harder it is to obtain an accurate estimate of its gradient, unless \(\sigma \) is very small. As a matter of fact, for points belonging to lower index buckets (thus far from the minimum of the function), the value \(\sigma = 10^{-5}\) yields the best performances, while accurate estimates of the gradient at points closer to the minimum of a function require using a lower value of \(\sigma \). We can also see that the error of the proposed method, NMXFD, is of the same order of magnitude as that of CFD, and almost always better than that of the other methods.
In our experiments, we have also produced gradient estimates using two more methods:

- by removing the normalization of the coefficients in the computation of NMXFD, i.e., implementing the gradient approximation as in (26);
- by computing the estimate as the raw average of central finite differences at different step sizes, that is, (28) with \(a_j = \frac{1}{m}\).
Both of these methods performed consistently worse than NMXFD, and they have not been reported in the tables for brevity. Still, the better performances of NMXFD over the raw average of central finite differences seem to confirm that the rationale behind the choice of coefficients used to weight the CFDs in the proposed approach is promising from a computational point of view.
5.2 Noisy Setting
We also show results for the noisy scenario, where the noise term is described in Sect. 4 and has \(\lambda = 0.001\). The estimation procedure is slightly different from that of the noise-free setting. In Table 4, the median log of the relative errors \(\eta _i\) of the 69 different Schittkowski functions is reported. Each \(\eta _i\) is computed as the average of 100 relative approximation errors, resulting from 100 independent noise realizations. The rationale behind this choice was to mitigate the dependence of the results on one particular noise realization. Results are shown in Table 4, where the gradient estimates are obtained with \(\sigma = 0.01\).
Table 4 shows that NMXFD performs better than the other schemes in the presence of noise, although reasonably low relative approximation errors are obtained only for the first three buckets. For the other ones, the error \(\eta \) increases significantly. This is due to the fact that the denominator of \(\eta \) gets smaller as we move to points close to the minimum of the function, while the variance of the approximation error does not change across buckets. Just like in the noise-free setting, increasing the number of function evaluations increases the precision of all the schemes, as expected from the theory.
Different values of \(\sigma \) for estimating the gradient (\(10^{-1}\), \(10^{-3}\), \(10^{-4}\)) have also been used. The associated tables have not been reported for brevity, since they led to the same conclusions and since the performances for almost every method and every bucket with those values of \(\sigma \) are significantly worse. This can be inferred from the theory, since the value of \(\sigma \) influences the bias and the variance of the estimation error in opposite directions, as we can see from (36) and (37) in Sect. 4.
The numerical experiments show the good performances of the proposed method when compared with those of the standard methods commonly used in the literature. In particular, the performances of NMXFD are comparable with those of CFD in absence of noise and better with noisy data and are better than those of other schemes in both scenarios.
The results seem to confirm the idea that performing a combination of finite differences in the noisy setting increases the quality of the gradient estimation. Along this line, the simplest possible combination is the average of a number m of multiple CFDs (mCFD) computed over repeated measurements
where \({\hat{G}}_{\sigma ,k}^{{\text {CFD}}}(x)\) is the CFD in (34) computed at the same points, but with a different independent realization k of the noise. This formula obviously reduces the error variance of CFD by a factor 1/m; it therefore becomes interesting to see whether
Because of the complicated structure of the coefficients \(a_j\), a formal proof of (44) can be involved. In Table 5, we report a numerical verification of (44) for increasing values of m, with a uniform sampling within the range \([-S, S]\) with \(S = mh = 3\) to compute the coefficients \(a_j\).
For \(m = 1\), the reduction of the variance of the two methods is the same. For all \(m > 2\), we can see that the reduction of the error variance of NMXFD is greater than that of mCFD.
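A check in the spirit of Table 5 can be sketched as follows, under our reconstructed coefficients from Sect. 3 (an assumption). Since the j-th central difference of NMXFD uses step \(\sigma jh\), its noise variance scales as \(a_j^2/j^2\), while averaging m CFDs at step \(\sigma h\) gives \(1/m\); under these assumptions (44) amounts to \(\sum _j a_j^2/j^2 < 1/m\).

```python
import numpy as np

phi = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def coeffs(m, S=3.0):
    # reconstructed (assumed) NMXFD coefficients, normalized to sum to one
    h = S / m
    j = np.arange(1, m + 1)
    w = np.full(m, h)
    w[-1] = h / 2
    c = w * (j * h) ** 2 * phi(j * h)
    return c / c.sum()

ratio = {}
for m in (1, 2, 3, 4, 8):
    a = coeffs(m)
    j = np.arange(1, m + 1)
    ratio[m] = float((a**2 / j**2).sum())    # NMXFD noise-variance factor
    print(m, ratio[m], 1.0 / m)              # compare with the mCFD factor
```

With these reconstructed coefficients, both factors equal 1 at \(m=1\) and the inequality holds from \(m=3\) onward, which is consistent with the remark above that the two reductions coincide for \(m=1\) and NMXFD wins for \(m>2\).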
In Table 6, we finally report the comparison of the median log of the relative error between \({\hat{G}}_\sigma ^{{\text {MXF}}}\) and \({\hat{G}}_\sigma ^{{\text {mCFD}}}\) for increasing noise levels \(\lambda \), all computed with \(\sigma = 0.01\) and always using the same function evaluation budget. We do not report the performances of the other methods for brevity, since they confirm the same conclusions provided by Table 4.
Table 6 shows that the basic combination \({\hat{G}}_\sigma ^{{\text {mCFD}}}\) is indeed a good gradient approximation due to the effect of the average that reduces the error variance. As the noise level increases, \({\hat{G}}_\sigma ^{{\text {MXF}}}\) tends to be better than \({\hat{G}}_\sigma ^{{\text {mCFD}}}\). This supports the idea that a good gradient approximation depends on both the coefficients of the linear combination and the sampling points where the differences are computed. In this respect, the analysis developed in Sect. 3 to define the new gradient estimate provides a guide to design a more efficient estimate, depending on the following points:

- the parameter S that determines the range of integration in integral (23);
- the integration formula used to approximate integral (23);
- the filter parameter \(\sigma \);
- the sampling strategy of the function within the integration range \((-S, S)\).
In this early investigation, we heuristically tried several values for the parameters S and \(\sigma \), without trying different integration formulas or sampling criteria. The choice of \(\sigma \) may be difficult and affects the quality of the approximation. When the noise level is known, there are some strategies to make a proper choice of \(\sigma \) as in [18]. When the noise level is not known, the choice of this parameter becomes harder and represents an open question to be further investigated, along with the other points in the list above, to improve the performances of NMXFD.
Data availability statement: Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
6 Conclusions
In this paper, a novel scheme to estimate the gradient of a function is proposed. It is based on linear functionals defining a filtered version of the objective function. Unlike standard methods, where the approximation error is characterized from a statistical point of view and therefore may be quite large in a given experiment, one advantage of the proposed scheme lies in the deterministic characterization of the approximation error in the noise-free setting.
The other advantage lies in its behavior when function evaluations are affected by noise. In fact, the variance of the estimation error of the proposed method is shown to be strictly lower than that of the Central Finite Difference scheme and diminishes as the number of function evaluations increases. The suitable linear combination of finite differences seems to have a filtering role in the case of noisy functions, thus resulting in a more robust estimator.
Numerical experiments on a significant benchmark given by the 69 Schittkowski functions show the good performances of the proposed method when compared with those of the standard methods commonly used in the literature. In particular, the performances of NMXFD are comparable with those of CFD in the absence of noise, better with noisy data, and seem to be better than those of the other schemes in both scenarios. Moreover, we also show the comparison between NMXFD and the average of repeated CFDs, thus using the same budget of function evaluations. As the noise level increases, NMXFD tends to perform better than all the other schemes.
This supports the idea that the theory developed to propose this new scheme can be a suitable framework to design gradient estimates with noisy data. The gradient estimate proposed in this paper can be seen as a first design attempt. A future study could be dedicated to the investigation of the best gradient estimates in this framework, along with the analysis of the impact of the obtained gradient approximation when used in optimization algorithms.
Notes
Any \(L_1\) function satisfying (19), in place of \(\frac{\partial f(z_i, \bar{x}_i)}{\partial z_i}\), is a weak derivative of f(x) along \(x_i\).
References
Atkinson, K.E.: An Introduction to Numerical Analysis. Wiley, New York (2008)
Balasubramanian, K., Ghadimi, S.: Zeroth-order nonconvex stochastic optimization: handling constraints, high dimensionality, and saddle points. Found. Comput. Math. pp. 1–42 (2021)
Berahas, A.S., Cao, L., Choromanski, K., Scheinberg, K.: Linear interpolation gives better gradients than Gaussian smoothing in derivative-free optimization. arXiv preprint arXiv:1905.13043 (2019)
Berahas, A.S., Cao, L., Choromanski, K., Scheinberg, K.: A theoretical and empirical comparison of gradient approximations in derivative-free optimization. Found. Comput. Math. pp. 1–54 (2021)
Boyd, J.P.: Chebyshev and Fourier Spectral Methods. Springer, Berlin (2001)
Conn, A.R., Scheinberg, K., Vicente, L.N.: Geometry of interpolation sets in derivative free optimization. Math. Program. 111(1–2), 141–172 (2008)
Cramér, H.: Mathematical Methods of Statistics, vol. 43. Princeton University Press, Princeton (1999)
Fazel, M., Ge, R., Kakade, S., Mesbahi, M.: Global convergence of policy gradient methods for the linear quadratic regulator. In: International Conference on Machine Learning, pp. 1467–1476. PMLR (2018)
Flaxman, A.D., Kalai, A.T., McMahan, H.B.: Online convex optimization in the bandit setting: gradient descent without a gradient. arXiv:0408.007 (2004)
Gel’fand, I.M., Shilov, G.E.: Generalized Functions, Volume 2: Spaces of Fundamental and Generalized Functions, vol. 261. American Mathematical Soc. (2016)
Kolda, T.G., Lewis, R.M., Torczon, V.: Optimization by direct search: new perspectives on some classical and modern methods. SIAM Rev. 45(3), 385–482 (2003)
Larson, J., Menickelly, M., Wild, S.M.: Derivativefree optimization methods. Acta Numer. 28, 287–404 (2019)
Nesterov, Y., Spokoiny, V.: Random gradientfree minimization of convex functions. Found. Comput. Math. 17(2), 527–566 (2017)
Nocedal, J., Wright, S.J.: Sequential quadratic programming. Numer. Optim. pp. 529–562 (2006)
Polyak, B.T.: Introduction to Optimization, vol. 1. Inc., Publications Division, New York (1987)
Salimans, T., Ho, J., Chen, X., Sidor, S., Sutskever, I.: Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864 (2017)
Schittkowski, K.: More Test Examples for Nonlinear Programming Codes, vol. 282. Springer, Berlin (2012)
Shi, H.J.M., Xuan, M.Q., Oztoprak, F., Nocedal, J.: On the numerical performance of derivativefree optimization methods based on finitedifference approximations. arXiv preprint arXiv:2102.09762 (2021)
Wild, S.M., Regis, R.G., Shoemaker, C.A.: Orbit: optimization by radial basis function interpolation in trustregions. SIAM J. Sci. Comput. 30(6), 3197–3219 (2008)
Ziemer, W.P.: Weakly Differentiable Functions: Sobolev Spaces and Functions of Bounded Variation, vol. 120. Springer, Berlin (2012)
Open Access
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Gianni Di Pillo.
Appendix
Proof of Theorem 2.1
We have that
where \((G_{\sigma }(x))_i\) is given by (11)
and \(({\overline{G}}_\sigma (x))_i = g_\sigma (x_i,\,{\bar{x}}_i )\), by (16). We can write
where the last equality holds since \(\int _{{\mathbb {R}}^{n-1}} \varphi ({\bar{s}}_i)\,{\text {d}}{\bar{s}}_i = 1\). Now, the integrand in (45) has the following expression
and for the argument of the integral we can write
with \(x^\prime \in (x,x+\sigma s)\) and \(x_i^{\prime \prime } \in (x_i,x_i+\sigma s_i)\).
We further have that
Now substituting (47) and (48) into (46), we obtain that
By the Lipschitz property of the gradient, and recalling that
we have:
We can finally substitute (50) into (45) obtaining:
For the first term in (51), we obtain that
By similar computations, the second term in (51) becomes
In (52) and (53), we used the property [7] that for a zero-mean Gaussian z with variance \(\sigma ^2\):
where \((d-1)\,!! = (d-1)(d-3)\cdots 3\cdot 1\), and that for any \(z \sim {\mathcal {N}}(0,I_{n-1})\)
By substituting (52) and (53) in (51), we finally obtain that
which, applied to all the entries, proves the theorem.
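The Gaussian moment identity used in (52) and (53), namely that the even moments of a zero-mean Gaussian satisfy \(E[z^d] = \sigma ^d (d-1)\,!!\), can be checked numerically. The sketch below is illustrative only (the helper name `gauss_even_moment` is hypothetical); it compares Monte Carlo sample moments against the closed form.

```python
import numpy as np

def gauss_even_moment(sigma, d):
    """Exact E[z^d] for z ~ N(0, sigma^2) and even d: sigma^d * (d-1)!!."""
    dfact = 1
    for k in range(d - 1, 1, -2):  # (d-1)(d-3)...3*1
        dfact *= k
    return sigma**d * dfact

rng = np.random.default_rng(0)
sigma = 0.7
z = rng.normal(0.0, sigma, size=2_000_000)

# Empirical vs. exact moments for d = 2, 4, 6 (should agree to ~1%)
moments = {d: (np.mean(z**d), gauss_even_moment(sigma, d)) for d in (2, 4, 6)}
```

For instance, `gauss_even_moment(sigma, 4)` returns \(3\sigma ^4\), matching the \((d-1)\,!!\) factor with \(d=4\).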
Boresta, M., Colombo, T., De Santis, A. et al.: A Mixed Finite Differences Scheme for Gradient Approximation. J. Optim. Theory Appl. 194, 1–24 (2022). https://doi.org/10.1007/s10957-021-01994-w