A Mixed Finite Differences Scheme for Gradient Approximation

In this paper, we focus on linear functionals defining an approximate version of the gradient of a function. These functionals are often used in optimization problems where computing the gradient of the objective function is costly or where the objective function values are affected by noise. They have recently been employed to estimate the gradient of the objective function as the expected value of the function variations over the space of directions. The expected value is then approximated by a sample average over a proper (random) choice of sample directions in the domain of integration. In this way, the approximation error is characterized by the statistical properties of the sample average estimate, typically its variance. Therefore, while useful and attractive bounds on the error variance can be expressed in terms of the number of function evaluations, nothing can be said about the error of a single experiment, which could be quite large. This work instead aims at deriving an approximation scheme for linear functionals approximating the gradient, whose approximation error can be characterized from a deterministic point of view in the case of noise-free data. The aforementioned linear functionals are no longer considered as expected values over the space of directions, but rather as the filtered derivative of the objective function by a Gaussian kernel. With this new approach, a gradient estimation based on a suitable linear combination of central finite differences at steps of increasing size is derived, and deterministic bounds on its approximation error are provided.


Introduction
Derivative-free optimization (DFO) algorithms have become increasingly important since they provide a proper methodology to tackle optimization problems arising in various fields of application. As reported in [4,8,16], typical applications fall within simulation-based optimization problems, such as policy optimization in reinforcement learning. DFO methods arise when derivative information is either unavailable or quite costly to obtain, or when only noisy samples of the objective function are available. In the latter case, it is known that most methods based on finite differences are of little use [11,19].
One of the approaches in DFO algorithms is that of computing a proper estimate of the gradient of the objective function. Finite difference approximation schemes were already present in early times [15] and have recently been reconsidered as sample average approximations of functionals defining a "filtered version" of the objective function [2,3,9,13]. These functionals arise when defining a gradient approximation as the average of the function variation along all the directions in the whole space. In the most popular methods, the average is performed by weighting the function variations along directions generated either with a uniform kernel on the unit ball [9] or with a Gaussian kernel [2]. These integrals are considered as ensemble averages over the space of the directions of differentiation, and are then approximated, with various methods, by sample averages over a random sample of directions. As a general policy, the approximation error is then characterized by its statistical properties (even in the noise-free setting), the variance is expressed in terms of the number of function evaluations, and nice bounds are provided to trade off precision of the gradient estimation against computational cost. Nevertheless, it is plain that the error on a single sample may be quite large, even though its variance is bounded.
In this paper, we focus on a different point of view. The functional defining a filtered version of the objective function is considered as a weak derivative of the objective function rather than as an expected value over the space of directions [20]. The gradient estimation is therefore obtained by considering a numerical approximation of the functional integral, and the estimation error is evaluated in a deterministic fashion. The estimate is obtained by a suitable linear combination of central finite differences at steps of increasing size. Bounds on the approximation error of the proposed method are derived, and the variance of the error in the case of noisy data is also presented.
The goodness of the approximation is experimentally evaluated by comparing the proposed method with those considered as benchmarks in the literature, namely Forward Finite Differences (FFD), Central Finite Differences (CFD) [15], Gaussian Smoothed Gradient (GSG), and central Gaussian Smoothed Gradient (cGSG) [9,13], over the benchmark set of Schittkowski functions [17]. Encouraging results are obtained, both in the noise-free and in the noisy setting.
The paper is organized as follows: Sect. 2 formally introduces the gradient estimation problem, highlighting the difference between the approach proposed in this article and that of several estimates proposed in the literature. In Sect. 3, we present the proposed approximation scheme, NMXFD, with an emphasis on its link with the finite difference method. A theoretical comparison between the variance of the estimation errors of the proposed method and of the CFD scheme is presented in Sect. 4. Section 5 presents numerical results, and conclusions are drawn in Sect. 6.

The Gradient Estimate
In this paper, we consider the following unconstrained optimization problem in the derivative-free optimization (DFO) setting [6,12]:

$$\min_{x \in \mathbb{R}^n} f(x),$$

where f : R^n → R is a function with continuous derivatives, i.e., f ∈ C^1(R^n), and ∇f : R^n → R^n denotes its gradient. In this section, the problem of numerically approximating the gradient ∇f(x) is considered. The most popular approximation scheme is the standard finite difference method [15], but interesting alternative schemes are proposed in [2,9]. A general estimate is obtained according to the following formula:

$$G_\sigma(x) = \frac{1}{\sigma}\int_{\mathbb{R}^n} f(x + \sigma s)\, s\, \varphi(s)\, ds, \tag{2}$$

where ϕ(s) : R^n → R denotes either a standard Gaussian kernel N(0, I_n) or a uniform kernel on the unit ball B(0, 1), ds = ds_1 · ds_2 ⋯ ds_n is the volume element in R^n, and σ > 0 is a scale parameter. The approximation error has different bounds depending on the assumptions on f (see [4]). If the function f is continuously differentiable and its gradient is L-Lipschitz continuous for all x ∈ R^n, then

$$\| G_\sigma(x) - \nabla f(x) \| \le C_\varphi\, L\, \sigma, \tag{3}$$

where C_ϕ is a positive constant whose value depends on the kernel. If the function f is twice continuously differentiable and its Hessian is H-Lipschitz continuous for all x ∈ R^n, then

$$\| G_\sigma(x) - \nabla f(x) \| \le C_\varphi\, H\, \sigma^2. \tag{4}$$

Both bounds (3) and (4) show that G_σ(x) converges to ∇f(x) as σ tends to zero.

We will now work out formula (2) considering the (standard) Gaussian kernel, but the considerations that follow also hold if a uniform kernel over the unit ball is considered. Let us introduce some further notation: for any x ∈ R^n, denote by x̄_i ∈ R^{n−1} the vector x̄_i = (x_1, …, x_{i−1}, x_{i+1}, …, x_n)^T obtained from x by removing its i-th component. With some abuse of notation, but for the sake of simplicity in the use of formulas, when addressing a given coordinate x_i we write x = (x_i, x̄_i); consistently, s = (s_i, s̄_i) and the volume element becomes ds = ds_i · ds̄_i. In the case of a vector function f(z), to address explicitly its i-th entry we write it as [f(z)]_i. Then, estimate (2) is rewritten as follows:

$$G_\sigma(x) = \frac{1}{\sigma}\int_{\mathbb{R}^n} f(x_i + \sigma s_i,\ \bar{x}_i + \sigma \bar{s}_i)\, s\, \varphi(s_i)\, \varphi(\bar{s}_i)\, ds_i\, d\bar{s}_i. \tag{7}$$

Let us consider the generic entry of vector (7). By the Fubini theorem, we can compute it as follows:

$$[G_\sigma(x)]_i = \int_{\mathbb{R}^{n-1}} \left( \frac{1}{\sigma}\int_{\mathbb{R}} f(x_i + \sigma s_i,\ \bar{x}_i + \sigma \bar{s}_i)\, s_i\, \varphi(s_i)\, ds_i \right) \varphi(\bar{s}_i)\, d\bar{s}_i. \tag{8}$$

The expression in parentheses is the estimate of the directional derivative of f(x) along the i-th coordinate x_i, computed at the point (x_i, x̄_i + σs̄_i), i.e.,

$$g_\sigma(x_i, \bar{x}_i + \sigma \bar{s}_i) = \frac{1}{\sigma}\int_{\mathbb{R}} f(x_i + \sigma s_i,\ \bar{x}_i + \sigma \bar{s}_i)\, s_i\, \varphi(s_i)\, ds_i. \tag{10}$$

Hence, expression (8) becomes

$$[G_\sigma(x)]_i = \int_{\mathbb{R}^{n-1}} g_\sigma(x_i, \bar{x}_i + \sigma \bar{s}_i)\, \varphi(\bar{s}_i)\, d\bar{s}_i.$$

Therefore, the generic entry of the gradient estimate G_σ(x) in formula (7) is the average of function (10), weighted by the (n−1)-dimensional Gaussian kernel ϕ(s̄_i) = N(0, I_{n−1}), over the subspace R^{n−1} of R^n. As a consequence, the computation of any entry of vector G_σ(x) implies an integration over R^n. In papers [2,3], this problem is overcome by considering that (2) is indeed an ensemble average of the function f(x + σs)s over all the directions s ∈ R^n, weighted by the Gaussian distribution ϕ(s) ∼ N(0, I_n). Therefore, we can write

$$G_\sigma(x) = \frac{1}{\sigma}\, E_s\!\left[\, f(x + \sigma s)\, s \,\right].$$

Now the ensemble average can be well approximated by sampling a set of M independent directions {s^j} in R^n according to N(0, I_n) and considering the sample average

$$\hat{G}_\sigma(x) = \frac{1}{M}\sum_{j=1}^{M} \frac{f(x + \sigma s^j) - f(x)}{\sigma}\, s^j, \tag{13}$$

or its symmetric version

$$\hat{G}_\sigma(x) = \frac{1}{M}\sum_{j=1}^{M} \frac{f(x + \sigma s^j) - f(x - \sigma s^j)}{2\sigma}\, s^j. \tag{14}$$

The same argument holds if a uniform distribution over the unit ball is considered for the ensemble average [9]. Now, only M + 1 function evaluations in the case of (13), or 2M in the case of (14), are needed, and the convergence properties of the sample estimate to the ensemble average are well established: the sample average is an unbiased estimate and its accuracy increases with increasing M. In [3], suitable expressions of the estimation error variance are found in terms of the number of samples M and the values of some smoothness parameters of function f.
Therefore, very useful formulas are given that define the sample size required to obtain a chosen accuracy with a fixed confidence level 1 − α. This is a typical statistical characterization of the error: it is robust over the whole ensemble of possible trials, but it of course leaves a risk α of incurring a large error on a single experiment.
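For concreteness, a minimal Python sketch of the sample-average estimators (13) and (14) follows; the function names, test function, and parameter values are illustrative assumptions and not part of the paper.

```python
import numpy as np

def gsg(f, x, sigma=1e-3, M=50, seed=None):
    """Forward Gaussian Smoothed Gradient, cf. (13): average of
    (f(x + sigma*s) - f(x)) / sigma * s over M Gaussian directions."""
    rng = np.random.default_rng(seed)
    fx = f(x)
    S = rng.standard_normal((M, x.size))      # rows are the sampled directions s^j
    diffs = np.array([f(x + sigma * s) - fx for s in S])
    return (diffs[:, None] * S).mean(axis=0) / sigma

def cgsg(f, x, sigma=1e-3, M=50, seed=None):
    """Central (symmetric) version, cf. (14); uses 2M function evaluations."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((M, x.size))
    diffs = np.array([f(x + sigma * s) - f(x - sigma * s) for s in S])
    return (diffs[:, None] * S).mean(axis=0) / (2.0 * sigma)

# Illustrative check on a smooth function with known gradient.
f = lambda z: np.sum(z**2)
x = np.array([1.0, -2.0, 0.5])
print(gsg(f, x), cgsg(f, x), 2 * x)           # both estimates should be close to 2x
```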
In this paper, by exploiting formula (10), the following gradient estimate is proposed:

$$\tilde{G}_\sigma(x) = \left[\, g_\sigma(x_1, \bar{x}_1),\ \dots,\ g_\sigma(x_n, \bar{x}_n) \,\right]^T, \tag{15}$$

where each entry g_σ(x_i, x̄_i) is obtained from (10) with s̄_i = 0, i = 1, …, n. This is a different object from estimate (7) and appears to be more practical, since only line integrals are involved in the formula.
The following theorem shows that estimate G̃_σ(x) is close to G_σ(x) and converges to it as σ tends to zero.

Theorem 2.1 Let f(x) be continuously differentiable with L-Lipschitz continuous gradient for all x ∈ R^n. Then, for any x ∈ R^n,

$$\lim_{\sigma \to 0} \left\| G_\sigma(x) - \tilde{G}_\sigma(x) \right\| = 0.$$
Proof See the Appendix for the proof.

The next theorem shows that G̃_σ(x) is indeed a good approximation of the true gradient ∇f(x) and converges to it as σ tends to zero.

Theorem 2.2 Let f(x) be continuously differentiable for all x ∈ R^n. The following holds for any x ∈ R^n:

$$\lim_{\sigma \to 0} \left\| \tilde{G}_\sigma(x) - \nabla f(x) \right\| = 0. \tag{18}$$

Proof We prove (18) component-wise. Noting that s_i ϕ(s_i) = −ϕ′(s_i) and that ϕ vanishes at infinity, we have

$$g_\sigma(x_i, \bar{x}_i) = -\frac{1}{\sigma}\int_{\mathbb{R}} f(x_i + \sigma s_i, \bar{x}_i)\, \varphi'(s_i)\, ds_i, \tag{19}$$

and therefore, by integration by parts,

$$g_\sigma(x_i, \bar{x}_i) = \int_{\mathbb{R}} \frac{\partial f}{\partial x_i}(x_i + \sigma s_i, \bar{x}_i)\, \varphi(s_i)\, ds_i, \tag{20}$$

so that any entry of (15) is a weak definition of the derivative of f(x) along x_i [10]. Taking into account that the Gaussian kernel integrates to one and that, as σ → 0, the integrand in (20) converges to ∂f(x_i, x̄_i)/∂x_i, the integral in (20) converges to ∂f(x_i, x̄_i)/∂x_i, which proves (18).

A New Estimate of the Gradient
We consider the functional g_σ(x_i, x̄_i), which is the i-th component of the gradient estimate (15), and, for the sake of simplicity, we write in a single formula the result of (19) and (20):

$$g_\sigma(x_i, \bar{x}_i) = -\frac{1}{\sigma}\int_{\mathbb{R}} f(x_i + \sigma\tau, \bar{x}_i)\, \varphi'(\tau)\, d\tau = \int_{\mathbb{R}} \frac{\partial f}{\partial x_i}(x_i + \sigma\tau, \bar{x}_i)\, \varphi(\tau)\, d\tau. \tag{22}$$

Our goal consists in finding a numerical approximation of the first integral in (22). To do that, we compute the integral over a finite range, namely between −S and S:

$$\bar{g}_\sigma(x_i, \bar{x}_i) = -\frac{1}{\sigma}\int_{-S}^{S} f(x_i + \sigma\tau, \bar{x}_i)\, \varphi'(\tau)\, d\tau. \tag{23}$$

For S sufficiently large, the error between (22) and (23) is negligible, due to the fast decay of the Gaussian at infinity. The definite integral in (23) can be approximated by a quadrature formula, e.g., the trapezoidal rule [1]. Dividing the interval [−S, S] into 2m sub-intervals, each of size h = S/m, we obtain:

$$\bar{g}_\sigma(x_i, \bar{x}_i) \simeq -\frac{h}{\sigma}\left[ \frac{1}{2} f(x_i - \sigma S, \bar{x}_i)\, \varphi'(-S) + \sum_{j=-m+1}^{m-1} f(x_i + j\sigma h, \bar{x}_i)\, \varphi'(jh) + \frac{1}{2} f(x_i + \sigma S, \bar{x}_i)\, \varphi'(S) \right]. \tag{24}$$

It is well known that, under very general conditions, the trapezoidal quadrature formula (24) has an error that is O(1/m^2) [5]. Indeed, once σ and S are chosen, we can easily check this property in our case: the derivatives of a Gaussian kernel, |ϕ^(k)(τ)|, up to the third order, are all less than 1 in absolute value for any τ, and decrease rapidly as τ increases. The larger the number of function evaluations m, the smaller the quadrature error term. On the other hand, the right-hand side of (24) can be interpreted as a combination of finite differences with suitable coefficients. Keeping in mind that ϕ′(t) = −ϕ′(−t) and that ϕ′(0) = 0, after some simple algebra we can write:

$$\bar{g}_\sigma(x_i, \bar{x}_i) \simeq \sum_{j=1}^{m} c_j\, \frac{f(x_i + j\sigma h, \bar{x}_i) - f(x_i - j\sigma h, \bar{x}_i)}{2 j \sigma h}, \tag{26}$$

with coefficients

$$c_j = 2 j^2 h^3 \varphi(jh), \quad j = 1, \dots, m-1, \qquad c_m = m^2 h^3 \varphi(mh). \tag{27}$$

It is clear that ḡ_σ(x_i, x̄_i) is a linear combination of central finite difference approximations with different step sizes; for σh → 0, each one converges to the true value of the derivative ∂f(x_i, x̄_i)/∂x_i. We can then write the normalized version of (26) as:

$$\hat{g}_\sigma(x_i, \bar{x}_i) = \sum_{j=1}^{m} a_j\, \frac{f(x_i + j\sigma h, \bar{x}_i) - f(x_i - j\sigma h, \bar{x}_i)}{2 j \sigma h}, \tag{28}$$

with normalized coefficients

$$a_j = \frac{c_j}{\sum_{k=1}^{m} c_k}, \quad j = 1, \dots, m, \tag{29}$$

so that the a_j are all positive and sum to 1. For σ small enough, the normalization of the coefficients may not be necessary, the distortion of the estimate being negligible. Let us now evaluate the error bound corresponding to estimate (28), from here on referred to as NMXFD (Normalized Mixed Finite Differences).
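The construction above can be sketched in a few lines of Python. The coefficient formula below follows the reading of (26)-(29) given above and is an assumption about the implementation, not the authors' code; function names and parameter values are illustrative.

```python
import numpy as np

def nmxfd_component(f, x, i, sigma=1e-3, m=6, S=3.0):
    """NMXFD estimate of the i-th partial derivative, cf. (28)-(29):
    a normalized combination of central finite differences with steps j*sigma*h."""
    h = S / m
    phi = lambda t: np.exp(-t**2 / 2.0) / np.sqrt(2.0 * np.pi)   # standard Gaussian kernel
    # trapezoidal-rule weights, cf. (27); the boundary node j = m gets half weight
    c = np.array([2.0 * j**2 * h**3 * phi(j * h) for j in range(1, m + 1)])
    c[-1] *= 0.5
    a = c / c.sum()                                              # normalized coefficients, cf. (29)
    e_i = np.zeros_like(x); e_i[i] = 1.0
    cfd = np.array([(f(x + j * sigma * h * e_i) - f(x - j * sigma * h * e_i))
                    / (2.0 * j * sigma * h) for j in range(1, m + 1)])
    return a @ cfd                                               # weighted combination of CFDs

def nmxfd(f, x, sigma=1e-3, m=6, S=3.0):
    """Full NMXFD gradient estimate (2*m*n function evaluations)."""
    return np.array([nmxfd_component(f, x, i, sigma, m, S) for i in range(x.size)])

# Illustrative check against a known gradient.
f = lambda z: np.sin(z[0]) + z[0] * z[1]**2
x = np.array([0.7, -1.2])
print(nmxfd(f, x), np.array([np.cos(x[0]) + x[1]**2, 2 * x[0] * x[1]]))
```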

Theorem 3.1 Let f(x) be twice continuously differentiable and its Hessian be H-Lipschitz continuous for all x ∈ R^n. Consider the gradient approximation

$$\hat{G}^{MXF}_\sigma(x) = \left[\, \hat{g}_\sigma(x_1, \bar{x}_1),\ \dots,\ \hat{g}_\sigma(x_n, \bar{x}_n) \,\right]^T, \tag{30}$$

obtained by applying (28) to each component. We have that

$$\left\| \hat{G}^{MXF}_\sigma(x) - \nabla f(x) \right\| \le \frac{\sqrt{n}\, H\, (\sigma S)^2}{6}. \tag{31}$$

Proof Any single finite difference term in (28) has an error with respect to the true value ∂f(x_i, x̄_i)/∂x_i whose bound depends on the step size and on the regularity properties of function f. From [4], we have that

$$\left| \frac{f(x_i + j\sigma h, \bar{x}_i) - f(x_i - j\sigma h, \bar{x}_i)}{2 j \sigma h} - \frac{\partial f(x_i, \bar{x}_i)}{\partial x_i} \right| \le \frac{H\, (j\sigma h)^2}{6}$$

for j = 1, …, m. Therefore, since Σ_{j=1}^m a_j = 1 and a_j > 0, j = 1, …, m, we can write

$$\left| \hat{g}_\sigma(x_i, \bar{x}_i) - \frac{\partial f(x_i, \bar{x}_i)}{\partial x_i} \right| \le \sum_{j=1}^{m} a_j\, \frac{H\, (j\sigma h)^2}{6} \le \frac{H\, (\sigma m h)^2}{6},$$

which, applied to all entries of Ĝ^MXF_σ(x) − ∇f(x), proves the theorem.
Here we used the equality mh = S, which implies that the error bound does not depend on the number of function evaluations.

Estimation Error with Noisy Data
Let us now evaluate how the performance of the gradient estimate NMXFD (30), here referred to as Ĝ^MXF_σ(x), compares with that of the Central Finite Differences (CFD) scheme, taking also into account the presence of an additive noise affecting the sampled function values f(x). Let {e_i} be the canonical basis of R^n; then we can write

$$[\hat{G}^{MXF}_\sigma(x)]_i = \sum_{j=1}^{m} a_j\, \frac{\big(f(x + j\sigma h\, e_i) + \epsilon^{+}_{i,j}\big) - \big(f(x - j\sigma h\, e_i) + \epsilon^{-}_{i,j}\big)}{2 j \sigma h}, \quad i = 1, \dots, n. \tag{33}$$

With the same notation, we can easily write the gradient estimate according to the CFD scheme, here denoted as Ĝ^CFD_σ(x):

$$[\hat{G}^{CFD}_\sigma(x)]_i = \frac{\big(f(x + \sigma h\, e_i) + \epsilon^{+}_{i}\big) - \big(f(x - \sigma h\, e_i) + \epsilon^{-}_{i}\big)}{2 \sigma h}, \quad i = 1, \dots, n. \tag{34}$$

Here {ε_i} denotes a discrete random field modeling the additive noise on the sampled function values, with the following properties: ε_i ∼ N(0, λ²) and E[ε_i ε_j] = 0 for i ≠ j. We now compute the estimation errors of the two schemes and compare them in terms of accuracy (mean value) and precision (variance). The accuracy evaluates the estimate bias, i.e., the systematic source of error, such as the limited number N of function evaluations used to build the estimate. The precision is the dispersion of the estimation error around its mean value and evaluates the variability of the stochastic source of error.

The CFD scheme
According to (34), a number N = 2n of function evaluations is considered to obtain Ĝ^CFD_σ(x). Let e^CFD(x) = Ĝ^CFD_σ(x) − ∇f(x) be the estimation error. We can see that the mean of the error coincides with the noise-free CFD approximation error, while the variance of each of its components is

$$\mathrm{var}\big[e^{CFD}(x)\big]_i = \frac{\lambda^2}{2\,(\sigma h)^2}, \quad i = 1, \dots, n.$$

Therefore, as the increment σh → 0, the error goes to zero on average, but its variance increases without bound as O(1/(σh)²).
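A quick Monte Carlo check of this bias/variance trade-off can be set up as follows; the test function, noise level λ, and step sizes are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda z: np.sum(np.sin(z))             # smooth test function, illustrative
x = np.array([0.3, 1.1, -0.8])
lam = 1e-3                                   # noise standard deviation (assumption)

def cfd_noisy(step):
    """One noisy CFD gradient estimate with increment `step`, cf. (34)."""
    g = np.empty_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = 1.0
        fp = f(x + step * e) + rng.normal(0.0, lam)
        fm = f(x - step * e) + rng.normal(0.0, lam)
        g[i] = (fp - fm) / (2.0 * step)
    return g

for step in (1e-1, 1e-2, 1e-3):
    errs = np.array([cfd_noisy(step) - np.cos(x) for _ in range(2000)])
    # empirical variance per component vs. the theoretical lambda^2 / (2 step^2)
    print(step, errs.var(axis=0).mean(), lam**2 / (2 * step**2))
```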

The NMXFD scheme
In this case, according to (33), a number N = 2mn of function evaluations is considered to obtain Ĝ^MXF_σ(x). Denoting by e^MXF(x) = Ĝ^MXF_σ(x) − ∇f(x) the estimation error, we readily obtain that its mean coincides with the noise-free approximation error of estimate (30), while the variance of each of its components is

$$\mathrm{var}\big[e^{MXF}(x)\big]_i = \frac{\lambda^2}{2\,(\sigma h)^2} \sum_{j=1}^{m} \frac{a_j^2}{j^2}, \quad i = 1, \dots, n. \tag{36}$$

Under the assumptions of Theorem 3.1, and taking into account (31), we obtain

$$\big\| E\big[e^{MXF}(x)\big] \big\| \le \frac{\sqrt{n}\, H\, (\sigma S)^2}{6}.$$

As for the error variance, two interesting results can be proved.
Proposition 4.1 The following holds:

$$\mathrm{var}\big[e^{MXF}(x)\big]_i < \mathrm{var}\big[e^{CFD}(x)\big]_i, \quad i = 1, \dots, n,$$

in any x ∈ R^n and for any σ, h.

Proof
The sum of squares Σ_{j=1}^m a_j² is strictly less than 1, since the coefficients a_j, j = 1, …, m, are all positive and their sum is 1. Therefore, from (36) we obtain that

$$\mathrm{var}\big[e^{MXF}(x)\big]_i \le \frac{\lambda^2}{2\,(\sigma h)^2} \sum_{j=1}^{m} a_j^2 < \frac{\lambda^2}{2\,(\sigma h)^2} = \mathrm{var}\big[e^{CFD}(x)\big]_i.$$

We now further show that var[e^MXF(x)] goes to zero as N increases.

Proposition 4.2 For any x ∈ R^n, the variance of the estimation error of the NMXFD scheme has the following asymptotic behavior:

$$\mathrm{var}\big[e^{MXF}(x)\big]_i = O\!\left(\frac{1}{N}\right), \quad i = 1, \dots, n.$$
Proof By taking into account relations (27), we have that

$$\sum_{k=1}^{m} c_k = 2h \sum_{k=1}^{m} (kh)^2\, \varphi(kh) =: 2\, I^{(1)}_\varphi(m),$$

where I^(1)_ϕ(m) is the trapezoidal quadrature rule for the integral ∫_0^S τ²ϕ(τ) dτ. Due to the O(1/N²) property of the error of the trapezoidal rule, we have that

$$I^{(1)}_\varphi(m) = \int_0^S \tau^2 \varphi(\tau)\, d\tau + O\!\left(\frac{1}{m^2}\right). \tag{41}$$

Therefore, from (41), we easily obtain that

$$C := \frac{1}{\big(\sum_{k=1}^{m} c_k\big)^2} = \frac{1}{\big(2\, I^{(1)}_\varphi(m)\big)^2},$$

so that C is a bounded quantity as N = 2mn increases (by increasing m), taking into account that mh = S. Now, according to the relations (29), we can write

$$\sum_{j=1}^{m} \frac{a_j^2}{j^2} = C \sum_{j=1}^{m} \frac{c_j^2}{j^2} \simeq 4\, C\, h^3\, I^{(2)}_\varphi(m). \tag{42}$$

Define now I^(2)_ϕ(m) as follows:

$$I^{(2)}_\varphi(m) = h \sum_{j=1}^{m} (jh)^2\, \varphi(jh)^2.$$

It is the trapezoidal quadrature rule for the integral

$$\int_0^S \tau^2 \varphi(\tau)^2\, d\tau = \frac{1}{2\pi}\left[ \frac{\sqrt{\pi}}{4}\, \mathrm{erf}(S) - \frac{S}{2}\, e^{-S^2} \right],$$

where erf(z) = (2/√π) ∫_0^z e^{−t²} dt is the Gauss error function. Hence, by the usual property of the quadrature error, I^(2)_ϕ(m) is a bounded quantity as m increases. Therefore, we obtain that

$$\mathrm{var}\big[e^{MXF}(x)\big]_i = \frac{\lambda^2}{2\,(\sigma h)^2} \sum_{j=1}^{m} \frac{a_j^2}{j^2} \simeq \frac{2\, C\, \lambda^2\, h}{\sigma^2}\, I^{(2)}_\varphi(m).$$

Now, recalling that mh = S and that N = 2mn, we can write

$$\mathrm{var}\big[e^{MXF}(x)\big]_i \simeq \frac{4\, C\, \lambda^2\, S\, n}{\sigma^2\, N}\, I^{(2)}_\varphi(m) = O\!\left(\frac{1}{N}\right),$$

which, along with (42), proves the proposition.
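The O(1/N) behavior can also be checked numerically by increasing m while keeping S = mh fixed, using the coefficients and the variance expression (36) as reconstructed above; the parameter values below are illustrative assumptions.

```python
import numpy as np

phi = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
lam, sigma, S, n = 1e-3, 1e-2, 3.0, 1        # noise level and filter parameter (assumptions)

for m in (3, 6, 12, 24):
    h = S / m
    c = np.array([2.0 * j**2 * h**3 * phi(j * h) for j in range(1, m + 1)]); c[-1] *= 0.5
    a = c / c.sum()
    # theoretical variance of one NMXFD error component, cf. (36)
    var = lam**2 / (2 * (sigma * h)**2) * np.sum((a / np.arange(1, m + 1))**2)
    N = 2 * m * n
    print(m, var, var * N)                   # var * N should stay roughly constant
```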

Numerical Experiments
We tested our method for estimating the gradient by comparing its performance with those of other methods on 69 functions from the Schittkowski test set [17]. For each function, we did the following: we generated a random starting point x_0 and minimized the function using the quasi-Newton method of Broyden, Fletcher, Goldfarb and Shanno (BFGS) [14], finding the optimal point x* with ∇f(x*) ≈ 0. We then identified, along the BFGS iterates, the first point x_k such that ‖∇f(x_k)‖ ≤ α, for each of the following values of α: 10^0, 10^−1, 10^−2, 10^−3, 10^−4, 10^−5, 10^−6. In this way, we generated seven different buckets, one for each α, of 69 different points, one for each function. Bucket i indicates the one associated with α = 10^−i. Bucket 0 is therefore the one with the points farthest from the optimal solution, and bucket 6 is the one with the points closest to the optimal solution. Then, for each point, we computed the gradient approximations obtained with the Normalized MiXed Finite Differences scheme (NMXFD) and with the benchmark schemes listed in the Introduction (FFD, CFD, GSG, cGSG). We define the relative approximation error as

$$\eta(x) = \frac{\| g(x) - \nabla f(x) \|}{\| \nabla f(x) \|},$$

where g(x) is the generic gradient estimate. The number of function evaluations N is expressed in the following tables as a function of the number of dimensions n. The FFD and CFD schemes only allow for a specific value of N (n + 1 and 2n, respectively).
In GSG and cGSG, N is linked to the number of directions sampled to build the gradient approximation (N = M + 1 in (13) and N = 2M in (14)). In the NMXFD scheme, the value of N is linked to the value of m in formula (28); in particular, we have N = 2mn. In each table, the lowest entry for every bucket is highlighted in bold, and the second lowest is in italics.
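The error metric and the budget matching can be summarized in a short sketch; the estimator callables and data containers below are placeholders, not part of the reported experimental code.

```python
import numpy as np

def relative_error(g_est, g_true):
    """Relative approximation error eta, as defined above."""
    return np.linalg.norm(g_est - g_true) / np.linalg.norm(g_true)

def median_log_error(estimator, points, grads):
    """Median of log10(eta) over a bucket of points (one per test function)."""
    errs = [relative_error(estimator(x), g) for x, g in zip(points, grads)]
    return np.median(np.log10(errs))

# Budget matching per point of dimension n: FFD uses n+1 evaluations, CFD 2n,
# GSG M+1, cGSG 2M, NMXFD 2mn; e.g. choosing M = m*n makes cGSG and NMXFD comparable.
```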

Noise-Free Setting
For the noise-free setting, we report three different tables, each obtained using a different value of σ (shared by all the schemes) to compute the gradient approximation (Tables 1, 2, 3). It is possible to notice that, in a noise-free setting, lower values of σ tend to yield better results, as one would expect from the theory. The closer the point is to the minimum of a function, the harder it is to obtain an accurate estimate of its gradient, unless σ is very small. As a matter of fact, for points belonging to lower index buckets, thus far from the minimum of the function, the value σ = 10^−5 yields the best performance, while accurate estimates of the gradient at points closer to the minimum of a function require using a lower value of σ. We can also see that the error of the proposed method, NMXFD, is of the same order of magnitude as that of CFD, and almost always better than that of the other methods.
In our experiments, we have also produced gradient estimates using two more methods:
• by removing the normalization of the coefficients in the computation of NMXFD, i.e., implementing the gradient approximation as in (26);
• by computing the estimate as the raw average of central finite differences at different step sizes, that is, (28) with a_j = 1/m.
Both of these methods performed consistently worse than NMXFD, and they have not been reported in the tables for brevity. Still, the better performance of NMXFD over the raw average of central finite differences seems to confirm that the rationale behind the choice of coefficients used to weight the CFDs in the proposed approach is promising from a computational point of view.

Noisy Setting
We also show results for the noisy scenario, where the noise term is described in Sect. 4 and has λ = 0.001. The estimation procedure is slightly different from the one of the noise-free setting. In Table 4, the median log of the relative errors η_i of the 69 different Schittkowski functions is reported. Each η_i is computed as the average of 100 relative approximation errors, resulting from 100 independent noise realizations. The rationale behind this choice was to mitigate the dependence of the results on one particular noise realization. Results are shown in Table 4, where the gradient estimates are obtained with σ = 0.01. Table 4 shows that NMXFD performs better than the other schemes in the presence of noise, although reasonably low relative approximation errors are obtained only for the first three buckets. For the other ones, the error η increases significantly. This is due to the fact that the denominator of η gets smaller as we move to points close to the minimum of the function, while the variance of the approximation error does not change across different buckets. Just as in the noise-free setting, increasing the number of function evaluations increases the precision of all the schemes, as expected from the theory.
Different values of σ for estimating the gradient (10^−1, 10^−3, 10^−4) have also been used. The associated tables have not been reported for brevity, since they led to the same conclusions and since the performances for almost every method and every bucket with those values of σ are significantly worse. The numerical experiments show the good performance of the proposed method when compared with that of the standard methods commonly used in the literature. In particular, the performance of NMXFD is comparable with that of CFD in the absence of noise, better with noisy data, and better than those of the other schemes in both scenarios.
The results seem to confirm the idea that performing a combination of finite differences in the noisy setting increases the quality of the gradient estimation. In this line, the simplest possible combination is the average of a number m of CFDs (mCFD) computed over repeated measures,

$$\hat{G}^{mCFD}_\sigma(x) = \frac{1}{m}\sum_{k=1}^{m} \hat{G}^{CFD}_{\sigma,k}(x),$$

where Ĝ^CFD_{σ,k}(x) is the CFD estimate in (34) computed at the same points, but with a different independent realization k of the noise. This formula obviously reduces the error variance of CFD by a factor 1/m; it therefore becomes interesting to see whether

$$\mathrm{var}\big[e^{MXF}(x)\big]_i < \mathrm{var}\big[e^{mCFD}(x)\big]_i, \quad i = 1, \dots, n. \tag{44}$$

Because of the complicated structure of the coefficients a_j, a formal proof of (44) can be involved. In Table 5, we report a numerical verification of (44) for increasing values of m, with a uniform sampling within the range [−S, S], with S = mh = 3, to compute the coefficients a_j.
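A rough Monte Carlo check of inequality (44), in the spirit of Table 5, can be set up as below; the test function, noise level, and parameter values are illustrative assumptions, and the coefficients a_j follow the construction sketched earlier rather than the authors' exact implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda z: np.sum(np.sin(z)); x = np.array([0.4, -1.3]); i = 0
lam, sigma, m, S = 1e-3, 1e-2, 6, 3.0
h = S / m
phi = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)
c = np.array([2.0 * j**2 * h**3 * phi(j * h) for j in range(1, m + 1)]); c[-1] *= 0.5
a = c / c.sum()
e = np.zeros_like(x); e[i] = 1.0

def noisy(v):                        # one noisy function evaluation
    return f(v) + rng.normal(0.0, lam)

def nmxfd_i():                       # i-th component of the NMXFD estimate, cf. (33)
    return sum(a[j - 1] * (noisy(x + j * sigma * h * e) - noisy(x - j * sigma * h * e))
               / (2 * j * sigma * h) for j in range(1, m + 1))

def mcfd_i():                        # i-th component of mCFD: m repeated CFDs, same step
    return np.mean([(noisy(x + sigma * h * e) - noisy(x - sigma * h * e))
                    / (2 * sigma * h) for _ in range(m)])

true_d = np.cos(x[i])
print(np.var([nmxfd_i() - true_d for _ in range(5000)]),
      np.var([mcfd_i() - true_d for _ in range(5000)]))   # NMXFD variance should be smaller
```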
For m = 1, the reduction of the variance of the two methods is the same. For all m > 2, we can see that the reduction of the error variance of NMXFD is greater than that of mCFD. In Table 6, we finally report the comparison of the median log of the relative error between Ĝ^MXF_σ and Ĝ^mCFD_σ for increasing noise levels λ, all computed with σ = 0.01 and always using the same function evaluation budget. We do not report the performances of the other methods for brevity, since they confirm the same conclusions provided by Table 4. Table 6 shows that the basic combination Ĝ^mCFD_σ is indeed a good gradient approximation, due to the effect of the average that reduces the error variance. As the noise level increases, Ĝ^MXF_σ tends to be better than Ĝ^mCFD_σ. This supports the idea that a good gradient approximation depends on both the coefficients of the linear combination and the sampling points where the differences are computed. In this respect, the analysis developed in Sect. 3 to define the new gradient estimate provides a guide to design a more efficient estimate, depending on the following points:
- the parameter S that determines the range of integration in integral (23);
- the integration formula used to approximate integral (23);
- the filter parameter σ;
- the sampling strategy of the function within the integration range (−S, S).
In this early investigation, we heuristically tried several values for the parameters S and σ, without trying different integration formulas or sampling criteria. The choice of σ may be difficult and affects the quality of the approximation. When the noise level is known, there are strategies to make a proper choice of σ, as in [18]. When the noise level is not known, the choice of this parameter becomes harder and represents an open question to be further investigated, along with the other points in the list above, to improve the performance of NMXFD.
Data availability statement: Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.

Conclusions
In this paper, a novel scheme to estimate the gradient of a function is proposed. It is based on linear functionals defining a filtered version of the objective function. Unlike standard methods, where the approximation error is characterized from a statistical point of view and therefore may be quite large on a given experiment, one advantage of the proposed scheme lies in a deterministic characterization of the approximation error in the noise-free setting.
The other advantage lies in its behavior when function evaluations are affected by noise. In fact, the variance of the estimation error of the proposed method is shown to be strictly lower than that of the Central Finite Difference scheme, and it diminishes as the number of function evaluations increases. The suitable linear combination of finite differences seems to have a filtering role in the case of noisy functions, thus resulting in a more robust estimator.
Numerical experiments on a significant benchmark given by the 69 Schittkowski functions show the good performance of the proposed method when compared with that of the standard methods commonly used in the literature. In particular, the performance of NMXFD is comparable with that of CFD in the absence of noise, better with noisy data, and seems to be better than those of the other schemes in both scenarios. Moreover, we also show the comparison between NMXFD and the average of repeated CFDs, thus using the same budget of function evaluations. As the noise level increases, NMXFD tends to perform better than all the other schemes.
This supports the idea that the theory developed to propose this new scheme can be a suitable framework to design gradient estimates with noisy data. The gradient estimate proposed in this paper can be seen as a first design attempt. A future study could be dedicated to the investigation of the best gradient estimates in this framework, along with the analysis of the impact of the obtained gradient approximation when used in optimization algorithms.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Appendix

By the Lipschitz property of the gradient, and recalling that ∫_R s_i ϕ(s_i) ds_i = 0, we have: ... We can finally substitute (50) into (45), obtaining: ... For the first term in (51), we obtain that ...