Abstract
In this paper, we focus on linear functionals defining an approximate version of the gradient of a function. These functionals are often used in optimization problems where computing the gradient of the objective function is costly or the objective function values are affected by some noise. They have recently been considered to estimate the gradient of the objective function as the expected value of the function variations in the space of directions. The expected value is then approximated by a sample average over a proper (random) choice of sample directions in the domain of integration. In this way, the approximation error is characterized by statistical properties of the sample average estimate, typically its variance. Therefore, while useful and attractive bounds on the error variance can be expressed in terms of the number of function evaluations, nothing can be said about the error of a single experiment, which could be quite large. This work instead aims at deriving an approximation scheme for linear functionals approximating the gradient, whose approximation error can be characterized from a deterministic point of view in the case of noise-free data. The linear functionals mentioned above are no longer considered as expected values over the space of directions, but rather as the derivative of the objective function filtered by a Gaussian kernel. Using this new approach, a gradient estimation based on a suitable linear combination of central finite differences at different step sizes is proposed, and deterministic bounds that do not depend on the particular sample of points considered are computed. In the noisy setting, on the other hand, the variance of the estimation error of the proposed method is shown to be strictly lower than that of the estimation error of the Central Finite Difference scheme.
Numerical experiments on a set of test functions are encouraging, showing good performances compared to those of some methods commonly used in the literature, also in the noisy setting.
1 Introduction
Derivative-free optimization (DFO) algorithms have become increasingly important since they provide a proper methodology to tackle many of the optimization problems considered in various fields of application. As reported in [4, 8, 16], typical applications fall within simulation-based optimization problems, such as policy optimization in reinforcement learning. DFO methods arise when derivative information is either unavailable or quite costly to obtain, not to mention when only noisy samples of the objective function are available. In the latter case, it is known that most methods based on finite differences are of little use [11, 19].
One of the approaches in DFO algorithms is that of computing a proper estimate of the gradient of the objective function. Finite difference approximation schemes were already present in early times [15] and have recently been reconsidered as sample average approximations of functionals defining a "filtered version" of the objective function [2, 3, 9, 13]. These functionals arise when defining a gradient approximation as the average of the function variation along all the directions in the whole space. In the most popular methods, the average is performed by weighting the function variations along directions generated either with a uniform kernel on the unit ball [9] or with a Gaussian kernel [2]. These integrals are considered as ensemble averages over the space of the directions of differentiation, and are then approximated by sample averages over a random sample of directions, with various methods. As a general policy, the approximation error is then characterized by its statistical properties (even in the noise-free setting), the variance is expressed in terms of the number of function evaluations, and nice bounds are provided to trade off precision of the gradient estimation against computational cost. Nevertheless, it is plain that the error on a single sample may be quite large, even though its variance is bounded.
In this paper, we focus on a different point of view. The functional defining a filtered version of the objective function is considered as a weak derivative of the objective function rather than as an expected value over the space of directions [20]. The gradient estimation is therefore obtained by considering a numerical approximation of the functional integral, and the estimation error is evaluated in a deterministic fashion. The estimate is obtained by a suitable linear combination of central finite differences at steps of increasing size. Bounds on the approximation error of the proposed method are derived, and the variance of the error in the case of noisy data is also presented.
The goodness of the approximation is experimentally evaluated by comparing the proposed method with those considered benchmarks in the literature, namely: Forward Finite Differences (FFD), Central Finite Differences (CFD) [15], Gaussian Smoothed Gradient (GSG), and Central Gaussian Smoothed Gradient (cGSG) [9, 13], over the benchmark of the Schittkowski functions [17]. Encouraging results are obtained, both in the noise-free and in the noisy setting.
The paper is organized as follows: Sect. 2 formally introduces the gradient estimation problem, highlighting the difference between the approach proposed in this article and that of several estimates proposed in the literature. In Sect. 3, we present the proposed approximation scheme, NMXFD, with an emphasis on its link with the finite difference method. A theoretical comparison between the variances of the estimation errors of the proposed method and of the CFD scheme is presented in Sect. 4. Section 5 presents numerical results, and conclusions are drawn in Sect. 6.
2 The Gradient Estimate
In this paper, we consider the following unconstrained optimization problem in the derivative-free optimization (DFO) setting [6, 12]:
where \(f:\, R^n\mapsto R\) is a continuously differentiable function, i.e., \(f\in {\mathcal {C}}^1(R^n)\), and we denote by \(\nabla f:\, R^n\mapsto R^n\) its gradient, such that for any \(x\in R^n\)
In this section, the problem of a numerical approximation of the gradient \(\nabla f(x)\) is considered. The most popular approximation scheme is the standard finite difference method [15], but interesting alternative schemes are proposed in papers [2, 9]. A general estimate is obtained according to the following formula:
where \(\varphi (s):\, R^n \mapsto R\) denotes either a standard Gaussian kernel \({\mathcal {N}}(0, I_n )\) or a uniform kernel on the unit ball \({\mathcal {B}}(0, 1 )\), \({\text {d}}s={\text {d}}s_1\cdot {\text {d}}s_2\cdot \cdots \cdot {\text {d}}s_n\) is the volume element in \(R^n\), and \(\sigma >0\) is a scale parameter. The approximation error has different bounds depending on the assumptions on f (see [4]). If the function f is continuously differentiable, and its gradient is L-Lipschitz continuous for all \(x \in R^n\), then
where \(C_{\varphi }\) is a positive constant whose value depends on the kernel. If the function f is twice continuously differentiable, and its Hessian is H-Lipschitz continuous for all \(x \in R^n\), then
Both bounds (3) and (4) show that
We will now work out formula (2) considering the (standard) Gaussian kernel
but the considerations that follow hold also if a uniform kernel over the unit ball is considered.
Let us consider this further notation: for any \(x\in R^n\) denote by \({\bar{x}}_i\in R^{n-1}\) the following vector \( \begin{bmatrix} x_1, x_2 \, \ldots \, ,x_{i-1}, x_{i+1}, \, \ldots \, , x_n\end{bmatrix}^\mathrm{T}\). With some abuse of notation, but for the sake of simplicity in the use of formulas, when addressing a given coordinate \(x_i\) in a vector x let us write x as \([x_i\>\bar{x}_i]^\mathrm{T}\) and denote f(x) as \(f(x_i,{\bar{x}}_i)\) and \(\varphi (s)=\varphi (s_i)\varphi ({\bar{s}}_i)\), with \(\varphi (\bar{s}_i)=\prod _{j\ne i}^n \varphi (s_j)\); consistently, the volume element becomes \({\text {d}}s = {\text {d}}s_i\cdot {\text {d}}{\bar{s}}_i\). In the case of a vector function f(z), to address explicitly its i-th entry we write it as \([(f(z))_i\>\overline{(f(z))}_i]^\mathrm{T}\). Then, estimate (2) is rewritten as follows
Let us consider the generic entry of vector (7)
By Fubini's theorem, we can compute it as follows
The expression in parentheses is the estimate of the directional derivative of f(x) along the i-th coordinate \(x_i\), computed at the point \((x_i, {\bar{x}}_i + \sigma {\bar{s}}_i)\), i.e.,
Hence, expression (8) becomes
Therefore, the generic entry of the gradient estimate \(G_{\sigma }(x)\) in formula (7) is the average of function (10) weighted by a \((n-1)\)-dimensional Gaussian kernel \(\varphi ({\bar{s}}_i)={\mathcal {N}}(0, I_{n-1} )\) over the subspace \(R^{n-1}\) of \(R^n\). As a consequence, the computation of any entry of vector \(G_{\sigma }(x)\) implies an integration over \(R^n\). In papers [2, 3], this problem is overcome by considering that (2) is indeed an ensemble average of function \(f(x + \sigma s) s\) over all the directions \(s\in R^n\) weighted by the Gaussian distribution \(\varphi (s)\sim {\mathcal {N}}(0, I_n )\). Therefore, we can write
Now the ensemble average can be well approximated by sampling a set of M independent directions \(\{s_i\}\) in \(R^n\) according to \({\mathcal {N}}(0, I_n )\), and considering the sample average approximation of \(E_{\varphi }[f(x + \sigma s) s]\)
or its symmetric version
The same argument holds if a uniform distribution over the unit ball is considered for the ensemble average [9]. Now, only \(M+1\) function evaluations in case of (13), or 2M in case of (14), are needed, and the convergence properties of the sample estimate to the ensemble average are well established: the sample average is an unbiased estimate, and its accuracy increases with increasing M. In [3], suitable expressions for the estimation error variance are found in terms of the number of samples M and the values of some smoothness parameters of the function f. Therefore, very useful formulas are given that define the sample size required to obtain a chosen accuracy with a fixed level of confidence \(1-\alpha \). This is a typical statistical characterization of the error, which is robust over the whole ensemble of possible trials, but of course leaves a risk \(\alpha \) of a large error on a single experiment.
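As a concrete illustration, the two sample-average estimators can be sketched as follows. This is a minimal sketch, not the paper's code: it assumes (13) is the forward form \((1/M)\sum _k [f(x+\sigma s_k)-f(x)]\,s_k/\sigma \) and (14) its symmetric counterpart, in line with the GSG and cGSG schemes of [9, 13]; all function and variable names are ours.

```python
import numpy as np

def gsg(f, x, sigma, M, rng):
    # forward Gaussian smoothed gradient: directions s_k ~ N(0, I_n)
    s = rng.standard_normal((M, x.size))
    d = np.array([(f(x + sigma * sk) - f(x)) / sigma for sk in s])
    return (d[:, None] * s).mean(axis=0)

def cgsg(f, x, sigma, M, rng):
    # symmetric (central) version: two evaluations per direction
    s = rng.standard_normal((M, x.size))
    d = np.array([(f(x + sigma * sk) - f(x - sigma * sk)) / (2 * sigma)
                  for sk in s])
    return (d[:, None] * s).mean(axis=0)

# On a quadratic the estimator is unbiased, but a single run still carries
# a sampling error that shrinks only as 1/sqrt(M) -- the point made above.
f = lambda x: 0.5 * x @ x                      # grad f(x) = x
x = np.array([1.0, -2.0, 0.5]); grad = x.copy()
rng = np.random.default_rng(1)
est = cgsg(f, x, sigma=1e-3, M=2000, rng=rng)
print(np.linalg.norm(est - grad))              # small but nonzero
```

With M = 2000 the error norm is typically of order 0.1 on this example, illustrating that the bound is statistical rather than per-experiment.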
In this paper, by exploiting formula (10), the following gradient estimate is proposed
where
is obtained from (10) with \({\bar{s}}_i = 0 , \> i = 1,\ldots , n\). This is a different result from estimate (7) and appears to be more practical since only line integrals are involved in the formula.
The following theorem shows that estimate \({\overline{G}}_\sigma (x)\) is close to \(G_\sigma (x)\) and converges to it as \(\sigma \) tends to zero.
Theorem 2.1
Let \(\nabla f(x)\) be Lipschitz continuous with constant L for all \(x \in R^n\). Then we have that
Proof
See Appendix for the proof. \(\square \)
The next theorem shows that \({\overline{G}}_\sigma (x)\) is indeed a good approximation of the true gradient \(\nabla f(x)\) and converges to it as \(\sigma \) tends to zero.
Theorem 2.2
Let f(x) be continuously differentiable for all \(x \in R^n\). The following holds:
Proof
We prove (18) componentwise. By integration by parts, we have
where \(z_i = x_i + \sigma s_i\). By the change of variable \(s_i = \frac{z_i - x_i}{\sigma } \), we obtain that
and therefore, taking into account that a sequence of Gaussians \(\frac{1}{\sigma _n}\varphi (\frac{z_i-x_i}{\sigma _n})\) with \(\sigma _n\rightarrow 0\) defines a Dirac \(\delta \) distribution centered in \(x_i\) [10], we have that
\(\square \)
Any entry of (15) is a weak definition of the derivative of f(x) along \(x_i\) [10]. Note that (19) is well defined even though f(x) is not differentiable at \((x_i,\,{\bar{x}}_i)\) (see footnote 1).
3 A New Estimate of the Gradient
We consider the functional \(g_\sigma (x_i, {\bar{x}}_i)\), which is the i-th component of the gradient estimate (15), and, for the sake of simplicity, we write in a single formula the result of (19) and (20).
Note that \(\frac{1}{\sigma }\varphi (\frac{z_i-x_i}{\sigma })\) is \({\mathcal {N}}(x_i, \sigma ^2)\). Our goal is to find a numerical approximation of the first integral in (22). To do that, we compute the integral over a finite range, namely between \(-S\) and S
For S sufficiently large, the error between (22) and (23) is negligible due to the fast decay of the Gaussian at infinity. The definite integral in (23) can be approximated by a quadrature formula, e.g., the Trapezoidal Rule [1]. Dividing the interval \([-S, S]\) into 2m subintervals, each of size \(h = \frac{S}{m}\), we obtain:
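The resulting quadrature rule (24) can be sketched numerically for a scalar function. This is an illustrative sketch, not the paper's code: it assumes the filtered derivative can be written as \((1/\sigma )\int _{-S}^{S} f(x+\sigma \tau )\,\tau \varphi (\tau )\,{\text {d}}\tau \), which follows from (22) via integration by parts and \(\varphi '(\tau ) = -\tau \varphi (\tau )\); names and the test function are ours.

```python
import numpy as np

phi = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)   # standard Gaussian pdf

def filtered_derivative(f, x, sigma, S=3.0, m=6):
    # trapezoidal rule on [-S, S] with 2m subintervals of size h = S/m
    h = S / m
    tau = np.linspace(-S, S, 2 * m + 1)
    w = np.full(tau.size, h)
    w[0] = w[-1] = h / 2                                  # endpoint weights
    # assumed integrand: (1/sigma) * f(x + sigma*tau) * tau * phi(tau)
    vals = np.array([f(x + sigma * t) for t in tau])
    return float((w * vals * tau * phi(tau)).sum() / sigma)

# smooth 1-D test: f = sin, f'(1) = cos(1)
est = filtered_derivative(np.sin, 1.0, sigma=1e-3, S=3.0, m=6)
print(est, np.cos(1.0))
```

With these parameters the estimate comes out a few percent below \(\cos (1)\): the quadrature weights do not sum exactly to one, which is precisely the small bias that the normalization introduced later in this section removes.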
It is well known that, under very general conditions, the trapezoidal quadrature formula (24) has an error that is \({\mathcal {O}}(1/m^2)\) [5]. Indeed, once \(\sigma \) and S are chosen, we can easily check this property in our case. Let
Note that the derivatives \(\varphi ^{(k)}(\tau )\) of a Gaussian kernel, up to the third order, are all less than 1 in absolute value for any \(\tau \), and decrease rapidly as \(\tau \) increases. Therefore, for f sufficiently smooth in \((x_i\pm \sigma \,S)\), let
We can write:
Let us rewrite (24) as follows
The larger the number of function evaluations m, the smaller the error term \(\epsilon _\sigma (\tau ,m)\). On the other hand, \(\bar{g}_\sigma (x_i)\) can be interpreted as a combination of finite differences with some coefficients. Keeping in mind that \(\varphi '(-t) = -\varphi '(t)\) and that \(\varphi '(0) = 0\), after some simple algebra we can write:
from which
It is clear that \(\bar{g}_\sigma (x_i, {\bar{x}}_i)\) is a linear combination of finite difference approximations, with different step sizes; for \(\sigma h \rightarrow 0\), each one converges to the true value of the partial derivative \({\partial f(x_i, {\bar{x}}_i)}/{\partial x_i}\). Therefore, the estimate \(\bar{g}_\sigma (x_i, {\bar{x}}_i)\) converges to the true value only if the sum of its coefficients equals one. For this reason, it is advisable to normalize the coefficients of the linear combination in (26) to eliminate the estimate bias for \(\sigma \) finite. To this aim, let C be the sum of all the coefficients:
We can then write the normalized version of (26) as:
where
For \(\sigma \) small enough, the normalization of the coefficients may not be necessary, the distortion of the estimate being negligible. Let us now evaluate the error bound corresponding to estimate (28), from here on referred to as NMXFD (Normalized Mixed Finite Difference).
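To make the construction concrete, the following sketch builds the normalized coefficients and the resulting NMXFD estimate of a single partial derivative. The coefficient expression is our reconstruction from the trapezoidal pairing above (pairing the nodes \(\pm jh\) yields one central difference per j, with unnormalized weight proportional to \(w_j (jh)^2 \varphi (jh)\), where the endpoint weight is halved); the paper's displayed formulas (25)-(29) define the exact weights, so treat this as an assumption.

```python
import numpy as np

phi = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def nmxfd_coeffs(m, S=3.0):
    # reconstructed (assumed) coefficients: a_j ∝ w_j * (j*h)^2 * phi(j*h),
    # w_j = h for interior nodes, h/2 at the endpoint j = m; normalized
    # so that sum_j a_j = 1, as required to remove the estimate bias.
    h = S / m
    j = np.arange(1, m + 1)
    w = np.full(m, h)
    w[-1] = h / 2
    c = 2.0 * w * (j * h) ** 2 * phi(j * h)
    return c / c.sum()

def nmxfd_partial(f, x, i, sigma, m=6, S=3.0):
    # normalized linear combination of central differences at steps sigma*j*h
    h = S / m
    a = nmxfd_coeffs(m, S)
    e = np.zeros_like(x)
    e[i] = 1.0
    cfd = np.array([(f(x + sigma * j * h * e) - f(x - sigma * j * h * e))
                    / (2 * sigma * j * h) for j in range(1, m + 1)])
    return float(a @ cfd)

f = lambda x: np.sin(x[0]) + x[0] * x[1]
x = np.array([1.0, 2.0])                   # d f / d x_0 at x is cos(1) + 2
est = nmxfd_partial(f, x, 0, sigma=1e-3)
```

Since the coefficients sum to one, each central difference converges to the true partial derivative as \(\sigma h \rightarrow 0\) and so does their combination, without the bias of the unnormalized rule.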
Theorem 3.1
Let f(x) be twice continuously differentiable and its Hessian be H-Lipschitz for all \(x\in R^n\). Consider the gradient approximation obtained by (28)
We have that
Proof
Any single finite difference term in (28) has an error with respect to the true value \({\partial f(x_i, {\bar{x}}_i)}/{\partial x_i}\) whose bound depends on the step size and on the regularity properties of function f. From [4], we have that
for \(j=1,\ldots ,m\). Therefore, since \(\sum _{j = 1}^{m} a_j = 1\), and \(a_j>0\), \(j=1,\ldots ,m\), we can write
which applied to all entries of \({\widehat{G}}_\sigma (x)\nabla f(x)\), proves the theorem. \(\square \)
Here we used the equality \(m\,h=S\), which implies that the error bound does not depend on the number of function evaluations.
4 Estimation Error with Noisy Data
Let us now evaluate how the performance of the NMXFD gradient estimate (30), here referred to as \({\hat{G}}_\sigma ^{{\text {MXF}}}(x)\), compares with that of the Central Finite Differences (CFD), also taking into account the presence of an additive noise affecting the sampled function values f(x). Let \(\{e_i\}\) be the canonical basis of \(R^n\); then we can write:
With the same notation, we can easily write the gradient estimate according to the CFD scheme, here denoted as \({\hat{G}}_\sigma ^{{\text {CFD}}}(x)\):
Let \(\{\epsilon _i\}\) denote a discrete random field modeling the additive noise on the sampled function values, with the following properties: \(\epsilon _i \sim N(0,\lambda ^2)\) and \(E[\epsilon _i \,\epsilon _j] = 0\) for \(i\ne j\). We now compute the estimation errors for the two schemes and compare them in terms of accuracy (mean value) and precision (variance). The accuracy evaluates the estimate bias, i.e., the systematic source of the error, such as the limited number N of function evaluations used to build the estimate. The precision is the dispersion of the estimation error around its mean value and evaluates the variability of the statistical source of the error.
The CFD scheme
According to (34), a number \(N = 2 n\) of function evaluations is considered to obtain
with \(\epsilon _i^\pm \) denoting the noise on the function values \(f(x_i\pm \sigma \,h,{\bar{x}}_i)\). Let
be the estimation error. We can see that
and
where var[z], \(z \in R^n\) with \(E[z] = 0\), indicates the trace of the covariance matrix \(E[z\,z^\mathrm{T}]\). Now, for functions f as in Theorem 3.1, let us consider property (32), with \(j=1\), for all the components of \(E[e_{{\text {CFD}}}(x)]\). We obtain that
Therefore, as the increment \(\sigma h\rightarrow 0\), the error goes to zero as well on average, but its variance increases without bound as \({\mathcal {O}}\left( 1/(\sigma h)^2\right) \).
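This blow-up of the noise variance is easy to check by simulation. The sketch below uses the noise model above and the structure of (35): each error component carries the noise term \((\epsilon _i^+ - \epsilon _i^-)/(2\sigma h)\), so the trace of the covariance should equal \(n\lambda ^2/(2(\sigma h)^2)\). Parameter values are ours, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam, sigma, h = 5, 1e-3, 0.01, 1.0
trials = 200_000

# i.i.d. N(0, lambda^2) noise on the two function values of each coordinate
eps_plus = rng.normal(0.0, lam, size=(trials, n))
eps_minus = rng.normal(0.0, lam, size=(trials, n))
noise_err = (eps_plus - eps_minus) / (2.0 * sigma * h)

empirical = noise_err.var(axis=0).sum()          # trace of error covariance
theoretical = n * lam**2 / (2.0 * (sigma * h) ** 2)
print(empirical, theoretical)
```

Shrinking the increment \(\sigma h\) by a factor of 10 multiplies both quantities by 100, which is the \({\mathcal {O}}(1/(\sigma h)^2)\) behavior stated above.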
The NMXFD scheme
In this case, according to (33), a number \(N = 2 m\,n\) of function evaluations is considered to obtain
with \(\epsilon _{i,j}^\pm \) denoting the error terms on the function values \(f(x_i\pm \sigma \,jh, {\bar{x}}_i)\), \(i=1,\ldots ,n\), \(j=1,\ldots ,m\). For the estimation error
we readily obtain that
Under the assumptions of Theorem 3.1, and taking into account (31), we obtain
As for the error variance, two interesting results can be proved.
Proposition 4.1
For any \(m>1\), the variance of the estimation error of the NMXFD scheme is strictly lower than the variance of the estimation error of the CFD scheme, i.e.,
in any \(x\in R^n\) and for any \(\sigma \), h.
Proof
The sum of squares \(\sum _{j= 1}^{m} a_j^2\) is strictly less than 1, since the coefficients \(a_j\), \(j=1,\ldots ,m\), are all positive and their sum is 1. Therefore, from (36) we obtain that
\(\square \)
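The argument can be checked numerically. The sketch below uses our reconstruction of the coefficients from Sect. 3 (an assumption: \(a_j \propto w_j (jh)^2 \varphi (jh)\) with the endpoint weight halved, normalized to sum to one); since the \(a_j\) are positive and sum to one, \(\sum _j a_j^2 < 1\) for every \(m>1\), as the proof requires.

```python
import numpy as np

phi = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def coeffs(m, S=3.0):
    # reconstructed (assumed) NMXFD coefficients, normalized to sum to one
    h = S / m
    j = np.arange(1, m + 1)
    w = np.full(m, h)
    w[-1] = h / 2
    c = w * (j * h) ** 2 * phi(j * h)
    return c / c.sum()

ratios = [float((coeffs(m) ** 2).sum()) for m in (2, 4, 8, 16)]
print(ratios)   # all strictly below 1, shrinking as m grows
```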
Now we further show that \(var\left[ e_{{\text {MXF}}}(x)\right] \) goes to zero as N increases.
Proposition 4.2
For any \(x\in R^n\), the variance of the estimation error of the NMXFD scheme has the following asymptotic behavior
Proof
By taking into account relations (27), we have that
Let us denote with \(I_{\varphi ^\prime }^{(1)}(m)\) the following quantity
that is the trapezoidal quadrature formula for the integral
Due to the \({\mathcal {O}}(1/N^2) \) property of the error of the trapezoidal rule, we have that
Therefore, from (41), we easily obtain that
so that C is a bounded quantity as \(N=2m\,n\) increases (by increasing m), taking into account that \(mh=S\). Now, according to the relations (29) we can write
Define now \(I_{\varphi ^\prime }^{(2)}(m)\) as follows
It is the trapezoidal quadrature rule for the integral
where \({{\,\mathrm{erf}\,}}(z)=\frac{2}{\sqrt{\pi }}\int _0^z e^{-t^2}\,{\text {d}}t\) is the Gauss error function. Hence, by the usual error property, we can write
Therefore, we obtain that
Now recalling that \(m\,h = S \), and that \(N=2m\,n\), we can write
which along with (42), proves the proposition. \(\square \)
5 Numerical Experiments
We tested our method for estimating the gradient by comparing its performance with those of other methods on 69 functions from the Schittkowski test set [17].
For each function, we did the following: we generated a random starting point \(x^0\) and minimized the function using the quasi-Newton method of Broyden, Fletcher, Goldfarb and Shanno (BFGS) [14], finding the optimal point \(x^*\) with \(\nabla f(x^*) \approx 0\). We then identified the first instance of a point \(x^k\) where
for each of the following values of \(\alpha \): \(10^{0}, 10^{-1},10^{-2},10^{-3},10^{-4},10^{-5},10^{-6}\). In this way, we generated seven different buckets, one for each \(\alpha \), of 69 different points, one for each function. Bucket i denotes the one associated with \(\alpha = 10^{-i}\). Bucket 0 is therefore the one with the points that are farthest from the optimal solution, and bucket 6 is the one with the points closest to the optimal solution.
Then, for each point we computed the gradient approximations obtained with the Normalized MiXed Finite Differences scheme (NMXFD) and with those considered benchmarks in the literature, namely: Forward Finite Differences (FFD), Central Finite Differences (CFD), Gaussian Smoothed Gradient (GSG), and Central Gaussian Smoothed Gradient (cGSG), as defined in [4]. Different tables summarize the results of this comparison.
The tables show, for different values of the number of function evaluations (N) and different buckets (B), the median value of the log of the relative approximation error over all the 69 points in each bucket.
We define relative approximation error as
where g(x) is the generic gradient estimate. The number of function evaluations N is expressed in the following tables as a function of the number of dimensions n. The FFD and CFD schemes only allow for a specific value of N (\(n+1\) and 2n, respectively). In GSG and cGSG, N is linked to the number of directions sampled to build the gradient approximation (\(N=M+1\) in (13) and \(N = 2\,M\) in (14)). In the NMXFD scheme, the value of N is linked to the value of m in formula (28). In particular, we have that \(N = 2mn\). In each table, the lowest entry for every bucket is highlighted in bold, and the second lowest in italics.
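For reference, the error metric and one CFD evaluation of it can be sketched as follows. This assumes the standard form \(\Vert g(x)-\nabla f(x)\Vert /\Vert \nabla f(x)\Vert \) for the relative error; the test function and parameter values are ours.

```python
import numpy as np

def relative_error(g_est, grad_true):
    # relative approximation error of a gradient estimate
    return np.linalg.norm(g_est - grad_true) / np.linalg.norm(grad_true)

# e.g. a CFD estimate (N = 2n evaluations) of f(x) = sum_i sin(x_i)
f = lambda x: np.sin(x).sum()
x = np.array([0.3, 1.1, -0.7])
grad = np.cos(x)
sh = 1e-3                                  # combined increment sigma*h
est = np.array([(f(x + sh * e) - f(x - sh * e)) / (2 * sh)
                for e in np.eye(x.size)])
err = relative_error(est, grad)
print(np.log10(err))    # the tables report the median of this log-error
```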
5.1 NoiseFree Setting
For the noisefree setting, we report three different tables obtained using a different value of \(\sigma \) (shared by all the schemes) to compute the gradient approximation (Tables 1, 2, 3).
It is possible to notice that in a noise-free setting, lower values of \(\sigma \) tend to yield better results, as one would expect from the theory. The closer the point is to the minimum of a function, the harder it is to obtain an accurate estimate of its gradient, unless \(\sigma \) is very small. As a matter of fact, for points belonging to lower index buckets (thus far from the minimum of the function), the value \(\sigma = 10^{-5}\) yields the best performances, while accurate estimates of the gradient at points closer to the minimum of a function require using a lower value of \(\sigma \). We can also see that the error of the proposed method, NMXFD, is of the same order of magnitude as that of CFD, and almost always better than that of the other methods.
In our experiments, we have also produced gradient estimates using two more methods:

- by removing the normalization of the coefficients in the computation of NMXFD, i.e., implementing the gradient approximation as in (26);
- by computing the estimate as the raw average of central finite differences at different step sizes, that is, (28) with \(a_j = \frac{1}{m}\).
Both of these methods performed consistently worse than NMXFD, and they have not been reported in the tables for brevity. Still, the better performances of NMXFD over the raw average of central finite differences seem to confirm that the rationale behind the choice of coefficients used to weight the CFDs in the proposed approach is promising from a computational point of view.
5.2 Noisy Setting
We also show results for the noisy scenario, where the noise term is described in Sect. 4 and has \(\lambda = 0.001\). The estimation procedure is slightly different from that of the noise-free setting. In Table 4, the median log of the relative errors \(\eta _i\) of the 69 different Schittkowski functions is reported. Each \(\eta _i\) is computed as the average of 100 relative approximation errors, resulting from 100 independent noise realizations. The rationale behind this choice was to mitigate the dependence of the results on one particular noise realization. Results are shown in Table 4, where the gradient estimates are obtained with \(\sigma = 0.01\).
Table 4 shows that NMXFD performs better than the other schemes in the presence of noise, although reasonably low relative approximation errors are obtained only for the first three buckets. For the other ones, the error \(\eta \) increases significantly. This is due to the fact that the denominator of \(\eta \) gets smaller as we move to points close to the minimum of the function, while the variance of the approximation error does not change across buckets. Just like in the noise-free setting, increasing the number of function evaluations increases the precision of all the schemes, as expected from the theory.
Different values of \(\sigma \) for estimating the gradient (\(10^{-1}\), \(10^{-3}\), \(10^{-4}\)) have also been used. The associated tables have not been reported for brevity, since they led to the same conclusions and since the performances for almost every method and every bucket with those values of \(\sigma \) are significantly worse. This can be inferred from the theory, since the value of \(\sigma \) influences the bias and the variance of the estimation error in opposite directions, as we can see from (36) and (37) in Sect. 4.
The numerical experiments show the good performances of the proposed method when compared with those of the standard methods commonly used in the literature. In particular, the performances of NMXFD are comparable with those of CFD in absence of noise and better with noisy data and are better than those of other schemes in both scenarios.
The results seem to confirm the idea that performing a combination of finite differences in the noisy setting increases the quality of the gradient estimation. Along this line, the simplest possible combination is the average of a number m of multiple CFDs (mCFD) computed over repeated measurements
where \({\hat{G}}_{\sigma ,k}^{{\text {CFD}}}(x)\) is the CFD in (34) computed at the same points, but with a different independent realization k of the noise. This formula obviously reduces the error variance of CFD by a factor 1/m; it therefore becomes interesting to see whether
Because of the complicated structure of the coefficients \(a_j\), a formal proof of (44) can be involved. In Table 5, we report a numerical verification of (44) for increasing values of m, with a uniform sampling within the range \([-S, S]\) with \(S = mh = 3\) to compute the coefficients \(a_j\).
For \(m = 1\), the reduction of the variance of the two methods is the same. For all \(m > 2\), we can see that the reduction of the error variance of NMXFD is greater than that of mCFD.
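A check in the spirit of Table 5 can be sketched as follows, under our reconstructed coefficients from Sect. 3 (an assumption). Since the j-th central difference of NMXFD uses step \(\sigma jh\), its noise variance scales as \(a_j^2/j^2\), while averaging m CFDs at step \(\sigma h\) gives \(1/m\); under these assumptions (44) amounts to \(\sum _j a_j^2/j^2 < 1/m\).

```python
import numpy as np

phi = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def coeffs(m, S=3.0):
    # reconstructed (assumed) NMXFD coefficients, normalized to sum to one
    h = S / m
    j = np.arange(1, m + 1)
    w = np.full(m, h)
    w[-1] = h / 2
    c = w * (j * h) ** 2 * phi(j * h)
    return c / c.sum()

ratio = {}
for m in (1, 2, 3, 4, 8):
    a = coeffs(m)
    j = np.arange(1, m + 1)
    ratio[m] = float((a**2 / j**2).sum())    # NMXFD noise-variance factor
    print(m, ratio[m], 1.0 / m)              # compare with the mCFD factor
```

With these reconstructed coefficients, both factors equal 1 at \(m=1\) and the inequality holds from \(m=3\) onward, which is consistent with the remark above that the two reductions coincide for \(m=1\) and NMXFD wins for \(m>2\).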
In Table 6, we finally report the comparison of the median log of the relative error between \({\hat{G}}_\sigma ^{{\text {MXF}}}\) and \({\hat{G}}_\sigma ^{{\text {mCFD}}}\) for increasing noise levels \(\lambda \), all computed with \(\sigma = 0.01\) and always using the same function evaluation budget. We do not report the performances of the other methods for brevity, since they confirm the same conclusions provided by Table 4.
Table 6 shows that the basic combination \({\hat{G}}_\sigma ^{{\text {mCFD}}}\) is indeed a good gradient approximation due to the effect of the average that reduces the error variance. As the noise level increases, \({\hat{G}}_\sigma ^{{\text {MXF}}}\) tends to be better than \({\hat{G}}_\sigma ^{{\text {mCFD}}}\). This supports the idea that a good gradient approximation depends on both the coefficients of the linear combination and the sampling points where the differences are computed. In this respect, the analysis developed in Sect. 3 to define the new gradient estimate provides a guide to design a more efficient estimate, depending on the following points:

- the parameter S that determines the range of integration in integral (23);
- the integration formula used to approximate integral (23);
- the filter parameter \(\sigma \);
- the sampling strategy of the function within the integration range \((-S, S)\).
In this early investigation, we heuristically tried several values for the parameters S and \(\sigma \), without trying different integration formulas or sampling criteria. The choice of \(\sigma \) may be difficult and affects the quality of the approximation. When the noise level is known, there are some strategies to make a proper choice of \(\sigma \) as in [18]. When the noise level is not known, the choice of this parameter becomes harder and represents an open question to be further investigated, along with the other points in the list above, to improve the performances of NMXFD.
Data availability statement: Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
6 Conclusions
In this paper, a novel scheme to estimate the gradient of a function is proposed. It is based on linear functionals defining a filtered version of the objective function. Unlike standard methods, where the approximation error is characterized from a statistical point of view and therefore may be quite large in a given experiment, one advantage of the proposed scheme lies in the deterministic characterization of the approximation error in the noise-free setting.
The other advantage lies in its behavior when function evaluations are affected by noise. In fact, the variance of the estimation error of the proposed method is shown to be strictly lower than that of the Central Finite Difference scheme and diminishes as the number of function evaluations increases. The suitable linear combination of finite differences seems to have a filtering role in the case of noisy functions, thus resulting in a more robust estimator.
Numerical experiments on a significant benchmark given by the 69 Schittkowski functions show the good performances of the proposed method when compared with those of the standard methods commonly used in the literature. In particular, the performances of NMXFD are comparable with those of CFD in the absence of noise, better with noisy data, and seem to be better than those of the other schemes in both scenarios. Moreover, we also show the comparison between NMXFD and the average of repeated CFDs, thus using the same budget of function evaluations. As the noise level increases, NMXFD tends to perform better than all the other schemes.
This supports the idea that the theory developed to propose this new scheme can be a suitable framework to design gradient estimates with noisy data. The gradient estimate proposed in this paper can be seen as a first design attempt. A future study could be dedicated to the investigation of the best gradient estimates in this framework, along with the analysis of the impact of the obtained gradient approximation when used in optimization algorithms.
Notes
Any \(L_1\) function satisfying (19), in place of \(\frac{\partial f(z_i, \bar{x}_i)}{\partial z_i}\), is a weak derivative of f(x) along \(x_i\).
References
Atkinson, K.E.: An Introduction to Numerical Analysis. Wiley, New York (2008)
Balasubramanian, K., Ghadimi, S.: Zeroth-order nonconvex stochastic optimization: handling constraints, high dimensionality, and saddle points. Found. Comput. Math. pp. 1–42 (2021)
Berahas, A.S., Cao, L., Choromanski, K., Scheinberg, K.: Linear interpolation gives better gradients than Gaussian smoothing in derivative-free optimization. arXiv preprint arXiv:1905.13043 (2019)
Berahas, A.S., Cao, L., Choromanski, K., Scheinberg, K.: A theoretical and empirical comparison of gradient approximations in derivative-free optimization. Found. Comput. Math. pp. 1–54 (2021)
Boyd, J.P.: Chebyshev and Fourier Spectral Methods. Springer, Berlin (2001)
Conn, A.R., Scheinberg, K., Vicente, L.N.: Geometry of interpolation sets in derivative free optimization. Math. Program. 111(1–2), 141–172 (2008)
Cramér, H.: Mathematical Methods of Statistics, vol. 43. Princeton University Press, Princeton (1999)
Fazel, M., Ge, R., Kakade, S., Mesbahi, M.: Global convergence of policy gradient methods for the linear quadratic regulator. In: International Conference on Machine Learning, pp. 1467–1476. PMLR (2018)
Flaxman, A.D., Kalai, A.T., McMahan, H.B.: Online convex optimization in the bandit setting: gradient descent without a gradient. arXiv:0408.007 (2004)
Gel’fand, I.M., Shilov, G.E.: Generalized Functions, Volume 2: Spaces of Fundamental and Generalized Functions, vol. 261. American Mathematical Soc. (2016)
Kolda, T.G., Lewis, R.M., Torczon, V.: Optimization by direct search: new perspectives on some classical and modern methods. SIAM Rev. 45(3), 385–482 (2003)
Larson, J., Menickelly, M., Wild, S.M.: Derivativefree optimization methods. Acta Numer. 28, 287–404 (2019)
Nesterov, Y., Spokoiny, V.: Random gradientfree minimization of convex functions. Found. Comput. Math. 17(2), 527–566 (2017)
Nocedal, J., Wright, S.J.: Sequential quadratic programming. Numer. Optim. pp. 529–562 (2006)
Polyak, B.T.: Introduction to Optimization, vol. 1. Inc., Publications Division, New York (1987)
Salimans, T., Ho, J., Chen, X., Sidor, S., Sutskever, I.: Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864 (2017)
Schittkowski, K.: More Test Examples for Nonlinear Programming Codes, vol. 282. Springer, Berlin (2012)
Shi, H.J.M., Xuan, M.Q., Oztoprak, F., Nocedal, J.: On the numerical performance of derivativefree optimization methods based on finitedifference approximations. arXiv preprint arXiv:2102.09762 (2021)
Wild, S.M., Regis, R.G., Shoemaker, C.A.: Orbit: optimization by radial basis function interpolation in trustregions. SIAM J. Sci. Comput. 30(6), 3197–3219 (2008)
Ziemer, W.P.: Weakly Differentiable Functions: Sobolev Spaces and Functions of Bounded Variation, vol. 120. Springer, Berlin (2012)
Open Access
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Gianni Di Pillo.
Appendix
Proof of Theorem 2.1
We have that
where \((G_{\sigma }(x))_i\) is given by (11)
and \(({\overline{G}}_\sigma (x))_i = g_\sigma (x_i,\,{\bar{x}}_i )\), by (16). We can write
where the last equality holds since \(\int _{{\mathbb {R}}^{n-1}} \varphi ({\bar{s}}_i)\,{\text {d}}{\bar{s}}_i = 1\). Now, the integrand in (45) has the following expression
and for the argument of the integral we can write
with \(x^\prime \in (x,x+\sigma s)\) and \(x_i^{\prime \prime } \in (x_i,x_i+\sigma s_i)\).
We further have that
Now substituting (47) and (48) into (46), we obtain that
By the Lipschitz property of the gradient, and recalling that
we have:
We can finally substitute (50) into (45) obtaining:
For the first term in (51), we obtain that
By similar computations, the second term in (51) becomes
In (52) and (53), we used the property [7] that for a zero-mean Gaussian z with variance \(\sigma ^2\):
where \((d-1)\,!! = (d-1)(d-3)\cdots 3\cdot 1\), and that for any \(z \sim {\mathcal {N}}(0,I_{n-1})\)
By substituting (52) and (53) in (51), we finally obtain that
which, applied to all the entries, proves the theorem.
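The Gaussian moment identity used in (52) and (53), namely that the even moments of a zero-mean Gaussian satisfy \(E[z^d] = \sigma ^d (d-1)\,!!\), can be checked numerically. The sketch below is illustrative only (the helper name `gauss_even_moment` is hypothetical); it compares Monte Carlo sample moments against the closed form.

```python
import numpy as np

def gauss_even_moment(sigma, d):
    """Exact E[z^d] for z ~ N(0, sigma^2) and even d: sigma^d * (d-1)!!."""
    dfact = 1
    for k in range(d - 1, 1, -2):  # (d-1)(d-3)...3*1
        dfact *= k
    return sigma**d * dfact

rng = np.random.default_rng(0)
sigma = 0.7
z = rng.normal(0.0, sigma, size=2_000_000)

# Empirical vs. exact moments for d = 2, 4, 6 (should agree to ~1%)
moments = {d: (np.mean(z**d), gauss_even_moment(sigma, d)) for d in (2, 4, 6)}
```

For instance, `gauss_even_moment(sigma, 4)` returns \(3\sigma ^4\), matching the \((d-1)\,!!\) factor with \(d=4\).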
Boresta, M., Colombo, T., De Santis, A. et al.: A Mixed Finite Differences Scheme for Gradient Approximation. J. Optim. Theory Appl. 194, 1–24 (2022). https://doi.org/10.1007/s10957-021-01994-w