Hermite least squares optimization: a modification of BOBYQA for optimization with limited derivative information

Derivative-free optimization tackles problems where the derivatives of the objective function are unknown. However, in practical optimization problems, the derivatives of the objective function are often available with respect to some, but not all, optimization variables. In this work we propose the Hermite least squares optimization method: an optimization method specialized for the case that some partial derivatives of the objective function are available and others are not. The main goal is to reduce the number of objective function calls compared to state-of-the-art derivative-free solvers, while maintaining the convergence properties. The Hermite least squares method is a modification of Powell's derivative-free BOBYQA algorithm. However, instead of (underdetermined) interpolation for building the quadratic subproblem in each iteration, the training data is enriched with first and, if possible, second order derivatives, and then least squares regression is used. Proofs for global convergence are discussed and numerical results are presented. Further, the applicability is verified for a realistic test case in the context of yield optimization. Numerical tests show that the Hermite least squares approach outperforms classic BOBYQA if half or more of the partial derivatives are available. In addition, it achieves more robustness and thus better performance in case of noisy objective functions.


Introduction
In optimization we typically distinguish between gradient based and derivative-free optimization (DFO) approaches. If gradients are available, they provide helpful information about the descent at specific points and thus can improve the performance of the algorithm. By performance of the optimization algorithm we refer to the ability of reaching the optimal solution and the computational effort required to reach this solution. However, in practice (e.g., in engineering problem settings using black-box codes) gradients are often not available. In this research we are interested in optimization problems where the objective function is expensive to evaluate, e.g., involving finite element simulations or Monte Carlo analysis. Compared to evaluating such an objective function, the algebraic cost of the optimization solver itself is negligible. Hence, the computing effort of the methods considered in this work is measured by the number of objective function calls during the optimization. Thus, we focus on the possibility of reducing this number by using all information we have.
DFO methods are often divided into direct and model-based search algorithms, and into stochastic and deterministic algorithms, cf. Conn et al. (2009); Rios and Sahinidis (2013). In deterministic model-based algorithms, the objective function is approximated by a surrogate model and evaluations of this surrogate model are used to determine the search direction. One example is derivative-free trust region methods, to which belong BOBYQA (using interpolation as surrogate model) and the proposed Hermite least squares method (using least squares regression as surrogate model). Alternatively or additionally, approximations of the derivatives can be computed and used, at the cost of additional computing effort. A global deterministic model-based method is Kriging or Bayesian optimization (Currin et al. 1988), which employs stochastic processes as interpolation model. On the other hand, direct search algorithms evaluate the original objective function. Deterministic examples are the Nelder-Mead simplex algorithm (Nelder and Mead 1965) and the DIRECT algorithm (DIvide a hyper-RECTangle) (Jones et al. 1993). Stochastic methods run random search steps, see e.g. simulated annealing (Kirkpatrick et al. 1983), particle swarm optimization (Kennedy and Eberhart 1995) and genetic algorithms (Audet and Hare 2017). For more details, further references and numerical comparisons of DFO methods, we refer to Conn et al. (2009) and Rios and Sahinidis (2013).
In multidimensional optimization it often occurs that the partial derivatives are available for some directions but not for others. In Sect. 6 we will provide a practical example in the context of yield optimization. In this work we provide an optimization strategy well suited to benefit from the known derivatives, without requiring derivatives (or approximations of derivatives) for each direction.
Trust region methods are commonly used for solving gradient based and derivative-free nonlinear optimization problems. These methods are based on approximating the objective function quadratically in each iteration and solving this subproblem in a trust region in order to obtain the new iterate solution for the original problem. In sequential quadratic programming (SQP) gradients are used to build the quadratic approximation based on second order Taylor expansions (Ulbrich and Ulbrich 2012, Chap. 19). For DFO, Powell proposed the BOBYQA (Bound constrained Optimization BY Quadratic Approximation) method using polynomial interpolation for building the quadratic model (Powell 2009).

Problem setting
Even though SQP is able to handle general nonlinear constraints, BOBYQA, as indicated by its name, only accepts bound constraints. Powell also proposed two more methods, LINCOA (LINearly Constrained Optimization Algorithm), which allows linear constraints, and COBYLA (Constrained Optimization BY Linear Approximations), which allows general constraints but uses only linear approximations (Powell 1994, 2015). However, we focus on BOBYQA and thus we consider a bound constrained optimization problem with multiple optimization variables, i.e., for a function f : R^n → R, an optimization variable x ∈ R^n and lower and upper bounds x_lb, x_ub ∈ R^n the optimization problem reads

min_{x ∈ R^n} f(x)  s.t.  x_lb ≤ x ≤ x_ub. (1)

We assume that the derivatives with respect to some directions x_i, i ∈ I = {1, . . ., n}, are known, others are not. We denote the index set of known first order derivative directions by I_d ⊆ I, such that the set of available first order derivatives is defined by

{∂f/∂x_i : i ∈ I_d}. (2)

In order to define the set of known second order derivatives, we introduce the tuple set I_2d ⊆ I × I. Then, the set of available second order derivatives is given by

{∂²f/(∂x_i ∂x_j) : (i, j) ∈ I_2d}. (3)

For the sake of simplicity we will focus on the practically relevant case of first order derivatives. However, the proposed method can be straightforwardly adjusted for using the second order derivatives, cf. Sect. 5.3 and Appendix A. Since we build a quadratic approximation, higher order derivatives are not of concern. In the remainder of this paper, for better readability and without limitation of generality, we assume that the x_i are ordered such that we can define

I_d = {1, . . ., n_kd} (4)

as the index set of directions for which we consider the first partial derivative to be known, where n_kd denotes the number of known derivative directions.

Model-based search derivative-free optimization
Powell's BOBYQA algorithm is a widely used algorithm in the field of DFO (Powell 2009). The original implementation is in Fortran. Cartis et al. published a Python implementation called PyBOBYQA (Cartis et al. 2019, 2021). It contains some simplifications and several modifications (e.g. for noisy data and global optimization), but Powell's main ideas remain unchanged. In this work, on the programming side, we use PyBOBYQA as a basis and add some new features to it. While the original BOBYQA method (and also PyBOBYQA) is efficient in practice, it cannot be proven that it converges globally, i.e., that it converges from an arbitrary starting point to a stationary point (Conn et al. 2009, Chap. 10.3). Conn et al. reformulated the BOBYQA method in (Conn et al. 2009, Chap. 11.3), maintaining the main concept but enabling a proof of convergence, at the cost of practical efficiency and of bound constraints.
For the theoretical considerations in this work, we take Conn's reformulation as a basis. Before we come to our modifications for mixed gradient information, we recall the basics of DFO methods and BOBYQA.

Notation
Let m(x) be a polynomial of degree d with x ∈ R^n, let Φ = {φ_0(x), . . ., φ_q(x)} be a basis of P_n^d and q_1 = q + 1. Further, we define the vector of basis polynomials evaluated in x by φ(x) := (φ_0(x), . . ., φ_q(x))^T. The training data set is denoted by

T = {(y_0, f(y_0)), . . ., (y_p, f(y_p))} (5)

and p_1 = p + 1. The original BOBYQA method is based on interpolation; however, we formulate the problem more generally as an interpolation or least squares regression problem. The system matrix and the right hand side of the interpolation or regression problem are then given by

M = (φ_j(y_i))_{i=0,...,p; j=0,...,q} ∈ R^{p_1×q_1}  and  b = (f(y_0), . . ., f(y_p))^T ∈ R^{p_1}. (6), (7)

If p_1 = q_1, the system matrix M is square and v ∈ R^{q_1} solves the interpolation problem

M v = b. (8)

If p_1 > q_1, this leads to an overdetermined interpolation problem, which can be solved with least squares regression

min_{v ∈ R^{q_1}} ||M v − b||². (9)

If the matrix M in (8) is non-singular, the linear system (8) has a unique solution. Then, following (Conn et al. 2009, Chap. 3), the corresponding training data set is said to be poised. Analogously, if the matrix M in (9) has full column rank, the problem (9) has a unique solution. Then, following (Conn et al. 2009, Chap. 4), the corresponding training data set is said to be poised for polynomial least-squares regression. When talking about training data sets in the following, we always assume them to be poised, if not specifically noted otherwise. Although many of the results hold for any choice of a basis, in the following we use the monomial basis of degree d = 2. Thus, if not mentioned differently, for the remainder of this paper, Φ is defined by the (n + 1)(n + 2)/2-dimensional basis

Φ = {1, x_1, . . ., x_n, x_1², x_1 x_2, . . ., x_n²}. (10)
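To make the notation concrete, the following sketch builds the monomial design matrix of (6) and solves the interpolation problem (8) or the regression problem (9) with NumPy. The function names are illustrative and not part of PyBOBYQA.

```python
import numpy as np

def monomial_basis(x):
    """Evaluate the quadratic monomial basis phi(x) = (1, x_1, ..., x_n,
    x_i * x_j for i <= j) at a point x in R^n."""
    n = len(x)
    terms = [1.0] + list(x)
    for i in range(n):
        for j in range(i, n):
            terms.append(x[i] * x[j])
    return np.array(terms)

def fit_quadratic_model(Y, f_vals):
    """Solve the interpolation (p1 == q1) or least squares regression
    (p1 > q1) problem M v = b for the model coefficients v."""
    M = np.vstack([monomial_basis(y) for y in Y])
    v, *_ = np.linalg.lstsq(M, np.asarray(f_vals, dtype=float), rcond=None)
    return v

```

For a poised set of (n + 1)(n + 2)/2 points the least squares solve reduces to plain interpolation and reproduces any quadratic exactly.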

Λ-poisedness
Many DFO algorithms are model based. Thus, in order to achieve good behavior of the optimization strategy or to even guarantee global convergence, we have to ensure that the model is good enough. In gradient based methods, typically Taylor expansion error bounds are considered. In DFO methods which are based on interpolation or regression, the quality of the model depends on the quality of the training data set. This leads us to the introduction of the term Λ-poisedness. We recall the definitions of Λ-poisedness from (Conn et al. 2009, Def. 3.6, 4.7, 5.3).
Definition 1 (Λ-poisedness in the interpolation sense) Given a poised interpolation problem as defined in Sect. 3.1. Let B ⊂ R^n and Λ > 0. Then the training data set T is Λ-poised in B (in the interpolation sense) if and only if

Λ ≥ max_{0≤i≤p} max_{x∈B} |l_i(x)|, (11)

where the l_i are the Lagrange polynomials associated with T.

Remark 1 Note that in case of using the monomial basis, the equality in Def. 1 can be rewritten as

M^T l(x) = φ(x) (12)

and the l_i(x), i = 0, . . ., p, are uniquely defined by the Lagrange polynomials and can be obtained by solving M λ_i = e_i, where e_i ∈ R^{p_1} denotes the i-th unit vector and the elements of λ_i are the coefficients of the polynomial l_i, which will be evaluated at x.
Definition 2 (Λ-poisedness in the regression sense) Given a poised regression problem as defined in Sect. 3.1. Let B ⊂ R^n and Λ > 0. Then the training data set T is Λ-poised in B (in the regression sense) if and only if

Λ ≥ max_{0≤i≤p} max_{x∈B} |l_i(x)|,  with l(x) satisfying M^T l(x) = φ(x). (13)

Remark 2 Note that the l_i(x), i = 0, . . ., p, are not uniquely defined, since the system in (13) is underdetermined. However, the minimum norm solution corresponds to the Lagrange polynomials (in the regression sense), cf. (Conn et al. 2009, Def. 4.4).
Analogously to Remark 1, they can be computed by solving the least squares problem

min_{λ_i} ||M λ_i − e_i||² (14)

and using the entries of λ_i as coefficients for the polynomial l_i.
It can be shown that the poisedness constant Λ, or rather 1/Λ, can be interpreted as the distance to singularity of the matrix M, or as the distance to linear dependency of the vectors φ(y_i), i = 0, . . ., p, respectively (Conn et al. 2009, Chap. 3).

BOBYQA
The following description of BOBYQA's main ideas follows the one in Cartis et al. (2021). The BOBYQA algorithm is a trust region method based on a (typically underdetermined) quadratic interpolation model. The training data set defined in (5) contains the objective function evaluations for each sample point y_i, i = 0, . . ., p, and the size of the training data set is given by |T| ∈ [n + 2, (n + 1)(n + 2)/2]. Note that setting |T| = n + 1 is possible in PyBOBYQA, but then a fully linear (not quadratic) interpolation model is applied.
Let x^(k) denote the solution at the k-th iteration. At each iteration a local quadratic model is built, i.e.,

m^(k)(x) = c^(k) + g^(k)T (x − x^(k)) + 1/2 (x − x^(k))^T H^(k) (x − x^(k)), (15)

fulfilling the interpolation conditions

m^(k)(y_i) = f(y_i), i = 0, . . ., p. (16)

For |T| = (n + 1)(n + 2)/2 the interpolation problem is fully determined. For |T| < (n + 1)(n + 2)/2 the remaining degrees of freedom are set by minimizing the distance between the current and the last approximation of the Hessian H in the matrix Frobenius norm, i.e.,

min_{c^(k), g^(k), H^(k)} ||H^(k) − H^(k−1)||_F  s.t. (16), (17)

where typically H^(−1) = 0_{n×n}. Once the quadratic model is built, the trust region subproblem

min_x m^(k)(x)  s.t.  ||x − x^(k)|| ≤ Δ^(k),  x_lb ≤ x ≤ x_ub, (18)

is solved, where Δ^(k) > 0 denotes the trust region radius. Then, having calculated the optimal solution x_opt of (18) in the k-th iteration, we check if the decrease in the objective function is sufficient. Therefore, the ratio between actual decrease and expected decrease,

r^(k) = (f(x^(k)) − f(x_opt)) / (m^(k)(x^(k)) − m^(k)(x_opt)),

is calculated. If the ratio r^(k) is sufficiently large, the step is accepted (x^(k+1) = x_opt) and the trust region radius increased (Δ^(k+1) > Δ^(k)). Otherwise, the step is rejected (x^(k+1) = x^(k)) and the trust region radius decreased (Δ^(k+1) < Δ^(k)).
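The accept/reject logic of the ratio test can be sketched as follows. The acceptance threshold eta and the update factors are illustrative choices for this sketch, not PyBOBYQA's actual parameters.

```python
def trust_region_update(f_old, f_new, m_old, m_new, radius,
                        eta=0.1, gamma_inc=2.0, gamma_dec=0.5):
    """Decide whether to accept a trial step based on the ratio r of actual
    to model-predicted decrease, and update the trust region radius."""
    predicted = m_old - m_new          # expected decrease m(x^k) - m(x_opt)
    actual = f_old - f_new             # actual decrease f(x^k) - f(x_opt)
    ratio = actual / predicted if predicted > 0 else -1.0
    if ratio >= eta:
        return True, radius * gamma_inc    # accept step, enlarge trust region
    return False, radius * gamma_dec       # reject step, shrink trust region

```

A large ratio means the model predicted the decrease well, so the step is trusted and the region grows; a small or negative ratio shrinks the region around the current iterate.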
An important question is now how to maintain the training data set. An accepted solution is added to T, i.e., T = T ∪ {(y_add, f(y_add))} with y_add = x_opt, and since |T| is fixed, another data point has to leave the training data set. Hereby, the aim is to achieve the best possible quality of the model. We know from Sect. 3.2 that the quality of the model depends on the training data set and can be expressed by the poisedness constant Λ. Thus, the decision which point is replaced depends on its impact on the Λ-poisedness. Let l_i, i = 0, . . ., p, be the Lagrange polynomials obtained by evaluating (12) and let

σ_i = |l_i(y_add)|, i = 0, . . ., p. (19)

Then, the point y_{i_go} is replaced by the new iterate y_add. This means that the point with the worst (largest) value of the corresponding Lagrange polynomial, evaluated at the new iterate solution, is going to be replaced, i.e., the updated training data set is built by choosing

i_go = argmax_{0≤i≤p} σ_i · max{1, ||y_i − x_opt||^4 / Δ^4}. (20)

Please note that (20) is the formula used in Powell's and Cartis' implementations; however, in Powell's work (Powell 2009, eq. (6.1)), exponent 2 is used in numerator and denominator instead of exponent 4. Regardless of the subproblem's optimal solution, sometimes points are exchanged only in order to improve the training data set. Let y_i be the training data point which shall be replaced, e.g. the point furthest from the current optimal solution. Then we consider the corresponding Lagrange polynomial l_i and choose a new point by solving

y_new = argmax_{x ∈ B(x_opt, Δ)} |l_i(x)|. (21)

For more details we refer to the original work by Powell (2009).
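A minimal sketch of the replacement rule for the square interpolation case, omitting the distance weighting of (20) for brevity (function names are illustrative):

```python
import numpy as np

def lagrange_coefficients(M):
    """Coefficient vectors of all Lagrange polynomials for a square,
    non-singular interpolation matrix M: column i solves M @ lambda_i = e_i."""
    return np.linalg.solve(M, np.eye(M.shape[0]))

def point_to_replace(M, phi_new):
    """Index of the training point whose Lagrange polynomial has the largest
    absolute value at the new point, given phi_new = phi(y_add)."""
    lam = lagrange_coefficients(M)   # columns are coefficient vectors lambda_i
    values = lam.T @ phi_new         # entry i is l_i(y_add)
    return int(np.argmax(np.abs(values)))

```

By construction l_i(y_j) = δ_ij on the training points, so the selected index marks the point whose removal least damages, and whose replacement most improves, the geometry of the set.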

Solving the linear system
For the Hermite least squares method we modify BOBYQA's linear system resulting from the interpolation conditions (16). From its solution, the coefficients c^(k), g^(k) and H^(k) of the quadratic subproblem (15) are determined. Therefore, we recall how the linear system is built in PyBOBYQA (Cartis et al. 2019). We denote the current optimal solution by x_opt ∈ T. Without limitation of generality we assume that x_opt = y_p. The uniquely solvable system (i.e. p_1 = q_1 = (n + 1)(n + 2)/2) reads as follows:

M_I v^(k) = b_I, (22)

with

M_I = (φ_j(y_i − x_opt))_{i,j} (23)  and  b_I = (f(y_i) − f(x_opt))_i, (24)

and with the coefficient vector

v^(k) = (g^(k), H̄^(k))^T, (25)

where H̄^(k) is a vector in R^{(n²+n)/2} containing the lower triangular and the diagonal elements of the symmetric matrix H^(k). The constant part is set to c^(k) = f(x_opt). The case n + 2 ≤ p_1 < (n + 1)(n + 2)/2 is not of further interest for the construction of the Hermite least squares system. Hence, we refrain from a detailed description here and refer to Cartis et al. (2019).

Convergence
The convergence theory for gradient based optimization algorithms like SQP is typically based on error bounds of the Taylor expansion, in order to show the decreasing error between the model m^(k)(x) and the function f(x), and between the corresponding derivatives. In DFO the poisedness constant Λ can be used to formulate a Taylor type error bound. In Conn et al. (2008a) an error bound of the form

||f(x) − m(x)|| ≤ G Λ Δ^{d+1} (26)

is given, where G is a constant depending only on the function f and d is the polynomial degree of the approximation model (i.e. here d = 2). Thus, in order to apply the convergence theory of gradient based methods to DFO methods, it is required to keep Λ uniformly bounded for all training data sets used within the algorithm. The algorithm in (Conn et al. 2009, Algo. 11.2) is a modified version of Powell's DFO method, such that global convergence can be proven. However, in contrast to the original BOBYQA method, bound constraints are not considered in Conn's version. Hence, Conn's algorithm is rather an adaptation of Powell's UOBYQA (Unconstrained Optimization BY Quadratic Approximation) method (Powell 2002). Since UOBYQA is similar to the BOBYQA method described in this section (but without considering constraints), we refrain from a full description. For more details, we refer to the original work by Powell (2002), or the convergent modification by Conn (Conn et al. 2009, Chap. 11.3).
For the global convergence proof of Conn's version, the Λ-poisedness plays an important role. One key aspect is the fact that the interpolation set is Λ-poised in each step of the optimization algorithm, or, within a finite number of steps, it can be transformed into a Λ-poised set (using the so-called model improvement algorithm (Conn et al. 2009, Algo. 6.3)). Since Powell's and Cartis' versions allow bound constraints, the convergence proof from (Conn et al. 2009) cannot simply be applied. Even if bound constraints were not provided in these algorithms, global convergence could not be proven, since the Λ-poisedness of the interpolation set is not always guaranteed. They apply strategies to update the training data set which hopefully reduce the poisedness constant Λ, but they do not provide bounds (Conn et al. 2008a). Therefore, they would require more frequent Λ-poisedness checks and re-evaluations of the whole training data set in situations of poorly balanced training data sets, i.e., a usage of the model improvement algorithm. However, to feature bound constraints and for the benefit of less computational effort and thus efficiency, they forgo provable convergence and rely on heuristics for when and how to check and improve Λ-poisedness.

Hermite least squares method
In Hermite interpolation a linear system is solved in order to find a polynomial approximation of a function, considering function values and partial derivative values at given training data points, cf. (Hermann 2011, Chap. 6.6) or Sauer and Xu (1995). In the following we will build such a system, but with more information than required for a uniquely solvable Hermite interpolation, and solve it with least squares regression. Thus, we call the optimization approach based on this technique Hermite least squares optimization.
As mentioned in Sect. 2, we assume that we know some partial derivatives of the objective function f : R^n → R, i.e., we can calculate them with negligible computational effort compared to the effort of evaluating the objective function itself. We focus on the information in (2), i.e., we neglect the second order derivatives. Our Hermite training data set is then given by

T_H = {(y_i, f(y_i), ∂f/∂x_1(y_i), . . ., ∂f/∂x_{n_kd}(y_i)) : i = 0, . . ., p}. (27)

We want to use this additional information in order to improve the quadratic model. BOBYQA's simple interpolation (22) is extended with derivative information, yielding least squares regression. First we introduce this approach starting with a uniquely solvable interpolation problem, i.e., the number of training data points p_1 coincides with the dimension of the basis q_1. Then, we allow reducing the number of training data points such that the initial interpolation problem is underdetermined, i.e., p_1 < q_1. While for p_1 = q_1 global convergence can be ensured (in the unconstrained case), we observe superior performance for p_1 < q_1, see numerical results in Sect. 5.1.

Build upon interpolation (p_1 = q_1)
First, we consider a training data set with |T_I| = p_1 = q_1, i.e., we could solve an interpolation problem as in (22) based only on function evaluations in order to obtain the quadratic model in (15). Instead, we additionally provide derivative information for the first n_kd partial derivatives of each training data point, i.e., we consider T_H with |T_H| = p_1(1 + n_kd), where p_1 is the number of data points and |T_H| denotes the number of pieces of information. We extend the system with the available gradient information in form of additional rows for the system matrix and the right hand side and obtain

M_H = (M_I; M_H^(1); . . .; M_H^(n_kd)),  b_H = (b_I; b_H^(1); . . .; b_H^(n_kd)), (28)

where M_I and b_I are defined in (23) and (24), respectively, and the entries of the submatrices M_H^(k) and b_H^(k), k = 1, . . ., n_kd, are given by the partial derivatives of the basis polynomials and of the objective function,

(M_H^(k))_{ij} = ∂φ_j/∂x_k (y_i − x_opt),  (b_H^(k))_i = ∂f/∂x_k (y_i). (29), (30)

Solving the overdetermined linear system

min_{v^(k)} ||M_H v^(k) − b_H||² (31)

using least squares regression yields a quadratic model for the trust region subproblem (v^(k) defined as in (25)). The formulation of the system matrix M_H and the right hand side b_H in case of second order derivatives is given in Appendix A. An optional step is the weighting of the least squares information, cf. weighted regression (Björck 1996). Information belonging to a training data point close to the current solution could be given more weight, information belonging to a training data point far from the current solution could be given less weight. However, since we could not observe significant improvements in the numerical tests, weighting has not been further investigated in this work.
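The assembly of the Hermite system (28)-(31) can be sketched as follows, using the quadratic monomial basis and NumPy's least squares solver. Function names are illustrative, and the shift by x_opt and the fixed constant term of the actual implementation are omitted for brevity.

```python
import numpy as np

def monomial_basis(x):
    """Quadratic monomial basis (1, x_1, ..., x_n, x_i * x_j for i <= j)."""
    n = len(x)
    terms = [1.0] + list(x)
    for i in range(n):
        for j in range(i, n):
            terms.append(x[i] * x[j])
    return np.array(terms)

def monomial_basis_grad(x, k):
    """Partial derivative of each quadratic monomial basis function w.r.t. x_k."""
    n = len(x)
    terms = [0.0] + [1.0 if i == k else 0.0 for i in range(n)]
    for i in range(n):
        for j in range(i, n):
            if i == k and j == k:
                terms.append(2.0 * x[k])     # d(x_k^2)/dx_k
            elif i == k:
                terms.append(x[j])           # d(x_k x_j)/dx_k
            elif j == k:
                terms.append(x[i])           # d(x_i x_k)/dx_k
            else:
                terms.append(0.0)
    return np.array(terms)

def hermite_lstsq(Y, f_vals, grads, known):
    """Assemble the Hermite system: one function-value row per point plus one
    derivative row per point and known direction, then solve by regression."""
    rows = [monomial_basis(y) for y in Y]
    rhs = list(f_vals)
    for k in known:                          # known derivative directions
        for y, g in zip(Y, grads):
            rows.append(monomial_basis_grad(y, k))
            rhs.append(g[k])
    v, *_ = np.linalg.lstsq(np.vstack(rows), np.array(rhs), rcond=None)
    return v

```

With derivative rows, four points suffice to determine a full quadratic model in R², whereas pure interpolation would need six.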
In the following, we will discuss how good the quadratic model resulting from solving (31) is. Therefore, we state the following theorem, which generalizes Theorem 4.1 in Conn et al. (2008b).

Theorem 1 Let the regression data set T_R contain a subset T_I that is Λ-poised in B in the interpolation sense. Then T_R is at least Λ-poised in B in the regression sense.

Proof The additional information can be added in form of additional rows to the system matrix and the right hand side. Thus, we can set the system matrix and the right hand side of the regression problem corresponding to T_R to

M_R = (M_I; M_A),  b_R = (b_I; b_A),

where M_I and b_I correspond to T_I and M_A and b_A contain the additional information. Since T_I is Λ-poised in the interpolation sense, by Definition 1 it holds that Λ ≥ max_{i} max_{x∈B} |l_i^I(x)| for the Lagrange polynomials l_i^I of T_I. Setting l_i^R(x) = l_i^I(x) for i ≤ |T_I| and l_i^R(x) = 0 for i > |T_I| yields a solution of the underdetermined system (13) for T_R whose components are bounded by Λ on B, which is exactly the condition of Definition 2. □

Theorem 1 shows that it is enough to ensure that a subset of the regression data set is Λ-poised in the interpolation sense, and then we can deduce that the regression data set is at least Λ-poised in the regression sense. Thus, although there is no analogue of the model improvement algorithm for the regression case (Conn et al. 2009, Chap. 6), we can apply the model improvement algorithm (Conn et al. 2009, Algo. 6.3) for interpolation and the optimization algorithm (Conn et al. 2009, Algo. 11.2). This ensures Λ-poisedness in the interpolation sense of a subset with (n + 1)(n + 2)/2 points, and then we build the quadratic model with least squares regression. And since l_i^R(x) = 0 for i > |T_I|, the type of additional information in T_R has no impact on the proof, as long as the matrix M_R has full column rank. This implies that instead of additional data points and their function evaluations, we can also add derivative information according to (28), and our training data set remains at least Λ-poised. The proof of convergence from (Conn et al. 2009) (holding for the unconstrained case) remains unaffected. In practice we expect faster convergence due to better quadratic models.

Build upon underdetermined interpolation (p_1 < q_1)
For fully determined quadratic interpolation, a large set of training data points is required (p_1 = |T_I| = (n + 1)(n + 2)/2), such that the linear system is uniquely solvable. Since in Hermite least squares we have additional gradient information, we can reduce the number of training data points and still obtain a determined or overdetermined regression system. The number of rows in the Hermite least squares system is given by p_1(1 + n_kd) − 1 and has to be at least as large as the number of columns, i.e., q = |Φ| − 1 = (n + 1)(n + 2)/2 − 1, cf. (28). Thus, the required number of training data points in the Hermite least squares is only

p_1 ≥ ⌈(n + 1)(n + 2) / (2(1 + n_kd))⌉. (37)

This allows the Hermite least squares system to be built as in (28)-(31), with the only difference that p_1 < q_1. Since the regression data set does not contain a subset of Λ-poised interpolation points anymore, the model improvement algorithm cannot be applied to a subset. Thus, even in the unconstrained case, the scheme of the formal convergence proof from (Conn et al. 2009) cannot be transferred. In the next subsection we will discuss how to build and maintain the training data set, taking into account the derivative information.
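The counting argument can be turned into a short helper; the −1 for the fixed constant term appears on both sides and cancels.

```python
import math

def min_training_points(n, n_kd):
    """Minimum number of training points p1 such that the Hermite system has
    at least as many rows as unknowns: p1 * (1 + n_kd) >= (n+1)(n+2)/2."""
    q1 = (n + 1) * (n + 2) // 2      # dimension of the quadratic basis
    return math.ceil(q1 / (1 + n_kd))

```

For example, for n = 10 with n_kd = 5 known derivative directions this gives 11 training points instead of the 66 required for fully determined interpolation.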

Λ-poisedness for Hermite least squares
We aim to include the derivative information into the updating procedure of the training data set. We start with the Hermite interpolation setting introduced in Sect. 4.1, i.e., initially we set

p_1(1 + n_kd) = q_1,

implying that the system matrix M_H is square and hence, (31) is a uniquely solvable Hermite interpolation problem. We adapt the definition of Λ-poisedness to the Hermite interpolation case. However, the following definition does not guarantee the required error bounds for provable convergence (cf. (Conn et al. 2009, Chap. 6.1)). This leads to an approach without a formal convergence proof, as for the common BOBYQA implementations.
We will define Lagrange-type polynomials for Hermite interpolation and show that they solve

M_H^T l(x) = φ(x). (38)
Definition 4 (Lagrange-type polynomial for Hermite interpolation) Let M_H ∈ R^{q_1×q_1} be a Hermite interpolation matrix with respect to the basis Φ as defined in (28) and e_i ∈ R^{q_1} the i-th unit vector. Let λ_i solve

M_H λ_i = e_i. (39)

Then, the polynomial built with the coefficients of λ_i and the basis Φ,

t_i(x) = λ_i^T φ(x), (40)

defines the i-th Lagrange-type polynomial for Hermite interpolation.

Lemma 1 Let t(x) = (t_0(x), . . ., t_q(x))^T be as in (40), φ(x) defined as in Sect. 3.1 and M_H from (28). Then, t(x) solves M_H^T l(x) = φ(x), i.e., t(x) ≡ l(x).
Proof Stacking the systems (39) for i = 0, . . ., q yields M_H Λ = I with Λ = (λ_0, . . ., λ_q), and hence also Λ M_H = I. Transposing yields

M_H^T Λ^T = I,

which, with t(x) = Λ^T φ(x), is per definition equivalent to

M_H^T t(x) = φ(x).

Thus, t solves (38). □
It follows that the (uniquely defined) polynomial solving the Hermite interpolation problem M_H v = b_H with M_H ∈ R^{q_1×q_1} and b_H ∈ R^{q_1} can be written as

m(x) = Σ_{i=0}^{q} (b_H)_i l_i(x), (43)

where p_1 = p + 1 is the number of training data points and q_1 = (1 + n_kd) p_1. Now, let us investigate Hermite least squares as introduced in Sect. 4.2, i.e., the case |T_H| > q_1, implying

p_1(1 + n_kd) > q_1.

Extending the concept above, the Lagrange-type polynomials for Hermite least squares are obtained by solving

min_{λ_i} ||M_H λ_i − e_i||² (44)

instead of (39). We maintain the training data set based on Λ-poisedness in the Hermite least squares sense. This means, we maximize (20) over the first p_1 Lagrange polynomials and replace the chosen data point, with all corresponding information (i.e. function value and derivative information), by the new data point.
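The Lagrange-type values needed for this criterion can be computed column-wise with one least squares solve per index, a sketch of (44) with an illustrative function name; in the square case it reduces to the solve of (39).

```python
import numpy as np

def hermite_lagrange_values(M_H, phi_x, p1):
    """Values of the first p1 Lagrange-type polynomials at a point: column i
    of the coefficient matrix solves min ||M_H @ lambda_i - e_i||."""
    E = np.eye(M_H.shape[0])[:, :p1]             # unit vectors e_0, ..., e_{p1-1}
    coeffs, *_ = np.linalg.lstsq(M_H, E, rcond=None)
    return coeffs.T @ phi_x                      # entry i is l_i(x)

```

Only the first p_1 columns are needed, since only the function-value rows correspond to candidate points for replacement.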

Scaling
In this section we discuss some preconditioning steps for solving the linear systems. In PyBOBYQA (Cartis et al. 2019) the system is scaled in the following way: instead of M v = b, the scaled system

L M R w = L b,  with v = R w, (47)

is solved, where L and R are diagonal matrices of the same dimension as M. Each training data point entry y_i is scaled by the factor 1/Δ, where Δ is the trust region radius of the current step (for simplicity of notation we omit the index k for the current iteration for both the trust region radius and the system). Thus, the scaling matrices in PyBOBYQA (with p_1 = q_1) are given by L = I and

R = diag(1/Δ, . . ., 1/Δ, 1/Δ², . . ., 1/Δ²), (48)

i.e., the columns for the linear part are scaled by 1/Δ and the columns for the quadratic part by 1/Δ². Preserving the same scaling scheme, the following left scaling matrix is obtained for the Hermite least squares approach:

L = diag(I_{p_1}, Δ I_{p_1 · n_kd}), (49)

i.e., the rows containing derivative information are additionally scaled by Δ, while the right scaling matrix remains unchanged as in (48).
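The scaling scheme can be sketched as follows; the factors follow the scheme above, while PyBOBYQA's actual preconditioning differs in details (shifted coordinates, fixed constant term).

```python
import numpy as np

def scaled_lstsq(M, b, n, delta, n_value_rows):
    """Solve the regression system with diagonal preconditioning: columns of
    the linear part scaled by 1/delta, quadratic part by 1/delta**2, and
    derivative rows scaled by delta (illustrative factors, see text)."""
    q1 = M.shape[1]
    col = np.ones(q1)
    col[1:n + 1] = 1.0 / delta          # linear monomial columns
    col[n + 1:] = 1.0 / delta ** 2      # quadratic monomial columns
    row = np.ones(M.shape[0])
    row[n_value_rows:] = delta          # derivative-information rows
    R, L = np.diag(col), np.diag(row)
    w, *_ = np.linalg.lstsq(L @ M @ R, L @ b, rcond=None)
    return R @ w                        # undo right scaling: v = R w

```

Right scaling only reparametrizes the unknowns and leaves the solution unchanged; left scaling reweights the residuals, which matters once derivative rows with a different natural magnitude enter the system.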

Numerical tests
A test set of 29 nonlinear, bound constrained optimization problems with 2, 3, 4, 5 and 10 dimensions has been evaluated and compared. The complete test set and the detailed results can be found at GitHub (Fuhrländer and Schöps 2022). As reference solution we consult the solution of PyBOBYQA (Cartis et al. 2019) using the default setting for the size of the training data set, i.e., p_1 = 2n + 1, in the following referred to as PyBOBYQA. For all remaining parameters we also apply the default settings. Another reference solution is SQP, where the unknown derivatives are calculated with finite differences. For this, the SciPy implementation of the SLSQP algorithm from (Kraft et al. 1988) has been used. Please note that this is a completely different implementation, thus a direct comparison should be handled with caution. In the Hermite least squares approach, the number of training data points has been chosen based on numerical tests. We vary the number of known derivative directions n_kd and always assume that these derivative directions are then available for all training data points.
In the tests we are interested in two aspects: 1) do we find an optimal solution and 2) how much computing effort is needed.For 1) we check if we find the same solution as the reference methods.Please note that we considered only test functions for which the reference PyBOBYQA method was able to find the optimal solution.For 2) we compare the total number of objective function evaluations during the whole optimization process.

Results
Before we evaluate the complete test set, we analyze the accuracy of the quadratic model m^(k)(x) which is built in each iteration. Let us consider the Rosenbrock function in R², given by

f(x) = 100 (x_2 − x_1²)² + (1 − x_1)². (51)

[Fig. 1: Error of the model m^(k) in each iteration compared to the second order Taylor expansion, cf. (52)]

We assume ∂f/∂x_1 to be unknown and ∂f/∂x_2 to be available. The second order Taylor expansion T_2 f(x; x^(k)) in x^(k) is considered as the reference model. We investigate the error between this Taylor reference and the quadratic model of PyBOBYQA and of Hermite least squares, respectively, evaluated using the L_2-norm

e^(k) = ||T_2 f(x; x^(k)) − m^(k)(x)||_{L_2}. (52)

In Fig. 1, the resulting error is plotted over the number of iterations. We observe that the error of the quadratic model decreases first for the Hermite least squares model. After 20 iterations, the error remains below 2.5 · 10^−7. For PyBOBYQA the error is also reduced to this magnitude, but it takes 65 iterations. The errors of the quadratic models reflect the performance of the different optimization methods. Both find the same optimal solution. Hermite least squares is more efficient; it terminates after 43 iterations. The reference method PyBOBYQA terminates after 106 iterations. Here, the number of objective function calls is proportional to the number of iterations.
Test set In the following, the test set from (Fuhrländer and Schöps 2022) is evaluated.
The results are consistent with the observations regarding error and performance for the Rosenbrock function described above. In Figs. 2 and 3 the numerical results for the test set are visualized. In Fig. 2 the arithmetic mean of the number of objective function calls is considered, in Fig. 3 the geometric mean, respectively. We compare PyBOBYQA with Hermite least squares and vary the number of known derivatives n_kd. For example, in Fig. 2b for Hermite least squares with n_kd = 2 we average over all 3-dimensional test problems, solved with Hermite least squares, with three cases each, i.e., the three possible choices of two known derivative directions. For the 10-dimensional test problems we tested three random permutations of known derivatives per n_kd.
For n = 4 and n_kd = 1 some instances did not terminate within the limit of 2000 objective function calls using Hermite least squares. Hence, this case is left out in Figs. 2c and 3c, respectively. However, the corresponding results are reported in Fuhrländer and Schöps (2022). We observe that if less than half of the derivative directions are known, i.e., n_kd < n/2, the Hermite least squares method is not reliably better than BOBYQA. However, if at least half of the derivatives are known, we can significantly save computing effort with the proposed Hermite method. For n_kd ≥ n/2, with Hermite least squares the number of objective function calls can be reduced by 34-80 % compared to PyBOBYQA, depending on dimension n and number of known derivatives n_kd, see Fig. 2. One exception is the case n = 10 and n_kd = 5, where we only see a reduction by 3 %. Although we observe that there are instances of some test problems for which the number of objective function calls increases (cf. Fuhrländer and Schöps (2022)), on average the computational effort can be significantly reduced. The optimal solution has been found in all considered cases. As expected, we observe that the more gradient information we have, the fewer objective function evaluations are needed within one optimization run using Hermite least squares. We can conclude that, assuming about half or more of the partial derivatives are known, using the Hermite least squares approach instead of the classic PyBOBYQA method reduces the computational effort significantly.
In the numerical tests, we also compared the reference BOBYQA and the proposed Hermite modification with SQP. While for Hermite least squares we took PyBOBYQA from Cartis et al. (2019) as a basis and included the required modifications for the Hermite approach, the SQP method from SciPy based on Kraft (1988) is a different implementation, using, for example, different ways to solve the quadratic subproblem. Nevertheless, we observed that in almost all cases the SQP method required fewer objective function calls than PyBOBYQA in order to find the optimal solution, and in most cases also fewer than Hermite least squares. Please note that in the SQP method we provided the known derivatives and only calculated the remaining ones with finite differences.
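The hybrid gradient used for SQP can be sketched as follows: the known partial derivative is supplied analytically, the unknown one is replaced by a forward finite difference, and the combined gradient is passed to SciPy's SLSQP solver. This is a minimal illustration with the standard Rosenbrock function; the paper's scaling in (51) and the exact solver settings may differ.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):
    # Standard Rosenbrock function (assumed form of (51)).
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def df_dx2(x):
    # Partial derivative assumed to be known analytically.
    return 200.0 * (x[1] - x[0]**2)

def hybrid_grad(x, h=1e-7):
    # Unknown component via forward finite differences, known one analytically.
    g = np.empty(2)
    g[0] = (f(x + np.array([h, 0.0])) - f(x)) / h
    g[1] = df_dx2(x)
    return g

res = minimize(f, x0=np.array([-1.2, 1.0]), jac=hybrid_grad,
               method='SLSQP', options={'maxiter': 500})
```

Each SQP iteration then costs one objective evaluation per unknown derivative direction on top of the evaluations used for the line search.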

Noisy data
Let us consider the Rosenbrock function from (51) and investigate the performance of the different methods under noise. In accordance with Cartis et al. (2019), we add random statistical noise to the objective function value and the derivative values by multiplying the results with the factor 1 + ξ, where ξ is a uniformly distributed random variable, i.e., ξ ∼ U(−10^−2, 10^−2). Again, for SQP and Hermite least squares we assume ∂f/∂x_1 to be unknown and ∂f/∂x_2 to be available.
The optimal solution of (51) and the initial point are as before. First we apply the Hermite least squares method; it terminates after only 37 function calls at the optimal solution.

Hermite least squares with second order derivatives
Again we consider the Rosenbrock function (51) as test function and investigate the usage of second order derivatives, according to the formulation in Appendix A. In this example we observe that using second derivatives in addition to first derivatives slightly reduces the computing effort. The results are given in Table 1.
Since second order derivatives are rarely available in practice, we do not extend the numerical tests further.

A practical example: yield optimization
In this section we discuss a practical example in which known and unknown gradients occur. In the field of uncertainty quantification, yield optimization is a common task (Graeb 2007). In the design process of a device, e.g., antennas, electrical machines or filters, geometry and material parameters are chosen such that predefined requirements are fulfilled. However, in the manufacturing process there are often uncertainties which lead to deviations in the optimized design parameters, and this may cause a violation of the requirements. The aim of yield estimation is the quantification of the impact of this uncertainty. The yield is the probability that the device still fulfills the performance requirements under consideration of the manufacturing uncertainties. Thus, the natural goal is to maximize the yield. Please note that the task of yield maximization is equivalent to the task of failure probability minimization. We will formally introduce the yield and discuss the task of yield optimization with an example from the field of electrical engineering: a simple dielectrical waveguide as depicted in Fig. 5. The model of the waveguide originates from Loukrezis (2019) and was used for yield optimization previously, e.g., in Fuhrländer et al. (2020).
The waveguide has four design parameters which shall be modified: two uncertain geometry parameters, the length of the inlay p_1 and the length of the offset p_2, and two deterministic material parameters, d_1 with impact on the relative permittivity and d_2 with impact on the relative permeability. The uncertain parameters are modeled as in (66). Since p_1 and p_2 in (64) are only contained in the probability density function, the derivative with respect to the mean values of the uncertain parameters follows directly. And since the probability density function of the Gaussian distribution is an exponential function, it is continuously differentiable; thus the derivatives with respect to p_1 and p_2 can be calculated easily. Further, according to Graeb (2007), the derivative of the MC yield estimator with respect to the uncertain parameters depends only on p̄_{j,S}, the mean value of all sample points of p_j lying inside the safe domain. This implies that there are not only closed-form expressions of these derivatives, but also numerical expressions which require only the evaluation of the objective function (which is necessary anyway) and no further computational effort. The deterministic variables, on the other hand, are contained in the indicator function in (64); thus, the corresponding partial derivatives are not considered as available. This leads to the situation that two partial derivatives are available and two are unknown. The Hermite least squares approach described above can be applied and compared with standard BOBYQA and SQP. There are two possibilities for the generation of the Monte Carlo sample set: a) the same sample set is used in each iteration and is just shifted according to the current mean value (no noise), and b) the sample set is generated anew each time the mean value is changed (noise). Depending on the size of the sample set, the accuracy can be controlled. In the following we investigate three different settings: 1. no noise: the same sample set (a) and N_MC = 2500; 2. low noise: new sample sets (b) and N_MC = 2500; 3. high noise: new sample sets (b) and N_MC = 100. We start with the no noise setting and compare the different optimization methods with respect to the optimal value reached and the number of objective function calls needed. The initial yield value is Y^(0)_MC = 42.8 %. The results are visualized in Fig. 6. While the optimal yield values are similar (best for Hermite l.s. and SQP), the computational effort varies significantly. SQP performs best with only 36 objective function calls, Hermite l.s. comes second with 50 % more, followed by PyBOBYQA with 100 % more. This coincides with our findings in Sect. 5, where we also observed that Hermite least squares performs best (excluding SQP). In the next step we evaluate the noisy settings. The results for the low noise setting are visualized in Fig. 7a, and for the high noise setting in Fig. 7b, respectively.
As in the no noise setting, PyBOBYQA and Hermite least squares find the optimal solution, and the number of objective function calls does not change significantly when noise is added. While in the low noise setting SQP finds a good optimal solution with the lowest number of objective function calls, in the high noise setting SQP loses its advantage in terms of computational effort and at the same time fails to find an optimum.
In summary, for this practical example the Hermite least squares method performs best in terms of solution quality and computational effort. Further, we observe that the interpolation and regression based methods can handle the noise, while SQP based on finite differences may no longer find the optimum.
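The Monte Carlo yield estimator and its derivative with respect to the mean values, as quoted above from Graeb (2007), can be sketched for independent Gaussian parameters as Ỹ·(p̄_{j,S} − p̄_j)/σ_j², reusing only the samples already evaluated. The `safe` function below is a hypothetical performance requirement, not the waveguide model:

```python
import numpy as np

rng = np.random.default_rng(1)

def safe(p):
    # Hypothetical safe domain: performance requirement p1 + p2 <= 2.
    return p[..., 0] + p[..., 1] <= 2.0

def yield_and_grad(mean, sigma, n_mc=2500):
    # MC yield estimate and its derivative w.r.t. the mean values of the
    # uncertain parameters; no objective calls beyond the estimate itself.
    samples = mean + sigma * rng.standard_normal((n_mc, len(mean)))
    inside = safe(samples)
    y = inside.mean()
    if inside.any():
        p_safe_mean = samples[inside].mean(axis=0)  # mean of safe samples
        grad = y * (p_safe_mean - mean) / sigma**2
    else:
        grad = np.zeros_like(mean)
    return y, grad

y, g = yield_and_grad(np.array([0.9, 0.9]), np.array([0.1, 0.1]))
```

Increasing either mean pushes samples out of this safe domain, so both gradient components come out negative, as expected.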

Conclusion
In this paper, we address the issue that in an optimization problem some partial derivatives of the objective function may be available while others are not. Based on Powell's derivative-free solver BOBYQA, we have developed the Hermite least squares optimization method. Besides function evaluations, we use the available first and second order derivatives of a training data set to build a quadratic approximation of the original objective function. In each iteration, this quadratic model is built by least squares regression, the resulting subproblem is solved in a trust region, and the training data set is updated.
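The model-building step can be illustrated by a minimal two-dimensional sketch: each training point contributes one function-value row, and each available partial derivative contributes an additional row, after which the coefficients are obtained by least squares. This is an illustration of the principle only, not the PyBOBYQA-based implementation.

```python
import numpy as np

# Quadratic model m(x) = c0 + c1*x1 + c2*x2 + c3*x1^2 + c4*x1*x2 + c5*x2^2.
def value_row(x):
    x1, x2 = x
    return [1.0, x1, x2, x1**2, x1 * x2, x2**2]

def deriv_row(x, j):
    # Row expressing dm/dx_j as a linear function of the coefficients.
    x1, x2 = x
    if j == 0:
        return [0.0, 1.0, 0.0, 2 * x1, x2, 0.0]
    return [0.0, 0.0, 1.0, 0.0, x1, 2 * x2]

def hermite_lsq_fit(points, f, known_partials):
    # Enrich the system with one row per available partial derivative,
    # then solve in the least squares sense.
    rows, rhs = [], []
    for x in points:
        rows.append(value_row(x))
        rhs.append(f(x))
        for j, dfj in known_partials.items():
            rows.append(deriv_row(x, j))
            rhs.append(dfj(x))
    c, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return c

# Toy quadratic with known coefficients; only df/dx2 is assumed available.
f = lambda x: 1 + 2*x[0] - 3*x[1] + 0.5*x[0]**2 + x[0]*x[1] + 4*x[1]**2
df_dx2 = lambda x: -3 + x[0] + 8*x[1]
pts = [np.array(p, dtype=float) for p in [(0,0), (1,0), (0,1), (1,1), (2,1), (-1,2)]]
c = hermite_lsq_fit(pts, f, {1: df_dx2})
```

Because the derivative rows add information without adding objective function calls, the same model quality can be reached with fewer training points than by pure interpolation.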
Global convergence of the Hermite least squares method can be proven under the same assumptions as in Conn's BOBYQA version, i.e., for problems without bound constraints. For the proof, the Hermite least squares method additionally requires a comparatively high number of interpolation points (p_1 = q_1). In practice, however, decreasing the number of interpolation points reduces the computational effort and thus increases the practical applicability. Numerical tests on 30 test problems, including a practical example in the field of yield optimization, have been performed. If half or more of the partial derivatives are available, the Hermite least squares approach outperforms (Py)BOBYQA in terms of computational effort while maintaining the ability to find the optimal solution. Depending on the dimension and the number of known derivative directions, the number of objective function calls can be reduced by a factor of up to five. Further, the proposed method is particularly stable with respect to noisy objective functions. In case of noisy data, Hermite least squares finds the optimal solution more reliably and quickly than (Py)BOBYQA or gradient based solvers such as sequential quadratic programming (SQP) using finite differences for missing derivatives.
point by M^i_{2d} and the corresponding right hand side by b^i_{2d}, respectively, with M^i_{2d} ∈ R^{(n_{2kd}^2 + n_{2kd})/2 × q}. Utilizing that the basis is defined by (10) and assuming the second order derivatives are available for all directions, i.e., n_{2kd} = n, M^i_{2d} can be simplified: the linear part vanishes and the quadratic part is given by the identity matrix.

Theorem 1
Given a poised training data set T_I and the monomial basis φ with |T_I| = |φ|, and B ⊂ R^n. Let M_I be the corresponding system matrix of the interpolation problem and b_I the right hand side, respectively. Let T_R ⊃ T_I be a training set containing further information. If T_I is Λ-poised in B in the interpolation sense, then T_R is at least Λ-poised in B in the regression sense.

Definition 3 (Λ-poisedness in the Hermite interpolation sense)
Given a poised Hermite interpolation problem as defined above with p_1 training data points, the training data set T_H and the monomial basis φ with |T_H| = p_1 and p_1(1 + n_kd) = q_1 = |φ|. Let B ⊂ R^n and Λ > 0. Then the training data set T_H is Λ-poised in B (in the Hermite interpolation sense) if and only if the corresponding bound on the associated Lagrange polynomials holds.

Fig. 1 Error of the model m^(k) in each iteration compared to the second order Taylor expansion, cf. (52), with δ = 0.01. Comparison between the proposed Hermite least squares and the reference method PyBOBYQA.

Fig. 2
Fig. 2 Arithmetic mean of the number of function evaluations for all test problems solved with the reference method PyBOBYQA and Hermite least squares (Hermite l.s.) with varying number of known derivatives n_kd.

Fig. 3
Fig. 3 Geometric mean of the number of function evaluations for all test problems solved with the reference method PyBOBYQA and Hermite least squares (Hermite l.s.) with varying number of known derivatives n_kd.

Fig. 5
Fig. 5 Model of a simple waveguide with dielectrical inlay and two geometry parameters p_1 and p_2.

Fig. 6
Fig. 6 Results for yield optimization in the no noise setting, solved with the reference PyBOBYQA method, the reference SQP method and Hermite least squares (Hermite l.s.).

Fig. 7
Fig. 7 Results for yield optimization for noisy settings, solved with the reference PyBOBYQA method, the reference SQP method and Hermite least squares (Hermite l.s.).