Gauss–Newton-type methods for bilevel optimization

This article studies Gauss–Newton-type methods for over-determined systems to find solutions to bilevel programming problems. To proceed, we use the lower-level value function reformulation of bilevel programs and consider necessary optimality conditions under appropriate assumptions. First, under strict complementarity for upper- and lower-level feasibility constraints, we prove the convergence of a Gauss–Newton-type method in computing points satisfying these optimality conditions under additional tractable qualification conditions. Potential approaches to address the shortcomings of the method are then proposed, leading to alternatives such as the pseudo or smoothing Gauss–Newton-type methods for bilevel optimization. Our numerical experiments conducted on 124 examples from the recently released Bilevel Optimization LIBrary (BOLIB) compare the performance of our method under different scenarios and show that it is a tractable approach to solve bilevel optimization problems with continuous variables.

We aim to solve the bilevel programming problem min x, y F(x, y) s.t.G(x, y) 0, y ∈ S(x) := arg min y {f(x, y) : g(x, y) 0}, where F : R n × R m → R, f : R n × R m → R, G : R n × R m → R q , and g : R n × R m → R p .As usual, we refer to F (resp.f) as upper-level (resp.lower-level) objective function and G (resp.g) stands for upper-level (resp.lower-level) constraint function.Solving problem (1.1) is very difficult because of the implicit nature of the lower-level optimal solution mapping S : R n ⇒ R m defined in (1.1).
There are several ways to deal with the complex nature of problem (1.1).One popular technique is to replace the lower-level problem with its Karush-Kuhn-Tucker (KKT) conditions.With this formulation, bilevel programming problems are strongly linked to MPECs (mathematical programs with equilibrium constraints), see, e.g., [6], which are not necessarily easy to handle due in part to the extra variables representing the lower-level Lagrangian multipliers.Interested readers are referred to [1,5,14] and references therein, for results and methods based on this transformation.In this paper, we are going to use the lower-level value function reformulation (LLVF) min x, y F(x, y) s.t.G(x, y) 0, g(x, y) 0, f(x, y) ϕ(x), (1.2) where the optimal value function is defined by ϕ(x) := inf {f(x, y) | g(x, y) 0 } , ( to transform problem (1.1) into a single-level optimization problem.As illustrated in [9], this approach can provide tractable opportunities to develop second order algorithms for the bilevel optimization problem, as it does not involve first order derivatives for lower-level problem, as in the context of the KKT reformulation.
There are recent studies on solution methods for bilevel programs, based on the LLVF reformulation.For example, [17,18,20,23,28] develop global optimisation techniques for (1.1) based on (1.2)- (1.3).[19,29,30] propose algorithms computing stationary points for (1.2)- (1.3), in the case where the upper-level and lower-level feasible sets do not depend on the lower-level and upper-level variable, respectively.[9] is the first paper to propose a Newton-type method for the LLVF reformulation for programs.Numerical results there show that the approach can be very successful.The system of equation build there is square while our method in this paper is based a non-square and overdetermined system of equations.Hence, the need to develop Gauss-Newton-type techniques to capture certain classes of bilevel optimization stationarity points.
One of the main problems in solving (1.2) is that its feasible points systematically fail many constraint qualifications (see, e.g., [3]).To deal with this issue, we will use the partial calmness condition [31], to shift the value function constraint f(x, y) ϕ(x) to the upper-level objective function, as a penalty term with parameter λ.The other major problem with the LLVF reformulation is that ϕ is typically non-differentiable.This will be handled by using upper estimates of the subdifferential of the function; see, e.g., [3,4,5,31].Our Gauss-Newton-type scheme proposed in this paper is based on a relatively simple system of optimality conditions, which depends on λ.
To transform this optimality conditions into a system of equations, we substitute the corresponding complementarity conditions by the standard Fischer-Burmeister function [8].To deal with the non-differentiability of the Fischer-Burmeister function, we consider two approaches in this paper.The first one is to assume strict complementarity for the constraints involved in the upper-and lower-level feasible sets.As second option to avoid non-differentiability, we investigate a smoothing technique by adding a perturbation in the Fischer-Burmeister function.
Another important aspect of the aforementioned system of equations is that it is overdetermined.Since overdetermined systems have non-square Jacobian, we cannot use a classical Newton-type method as in [9].Gauss-Newton and Newton-type methods with Moore-Penrose pseudo inverse are both introduced in Section 3. It will be shown that these methods are well-defined for solving bilevel programs from the perspective of the LLVF reformulation (1.2).In particular, our framework ensuring that the Gauss-Newton method for bilevel optimization is well-defined, does not require any assumption on the lower-level objective function.A strong link between these two methods is then discussed to show that they should perform very similarly for most problems.Based on this relationship, we also expect the Newton method with pseudo inverse to be more robust, the evidence of which we will show in the numerical implementation in Section 5.
We present results of extensive experiments for testing the methods and comparing with Matlab built-in function fsolve.In Section 5 the results are compared with known solutions of the problems to check if obtained stationary points are optimal solutions of the problems or not.For 124 tested problems we obtain more than 80% of satisfying solutions in the sense of recovering known solutions with < 10% error, or obtaining better ones by all methods with CPU time being less than half a second.The number of recovered solutions as well as the performance profiles and feasibility check show that Gauss-Newton and Newton method with pseudo inverse outperform fsolve.It is worth mentioning that it is not typical to conduct such a number of experiments in the literature on testing solution methods for bilevel programming.The conjecture of the similarity of the performance of the tested methods is verified numerically, also showing that Newton's method with pseudo inverse is indeed more robust than the classical Gauss-Newton method.The technique for choosing the penalization parameter λ is a heuristic that might depend on the structure of the problem.
Let us start with some definitions required to state the main theorem of this section.Define full convexity of the lower-level as convexity of lower-level objective and all lower-level constraints with respect to all variables (x, y).Further on, a feasible point (x, ȳ) ∈ R n × R m is said to be lower-level regular if there exists direction d ∈ R m such that (2.1) One can recognize that this is equivalent to MFCQ holding for the lower-level constraints.Similarly, for (x, ȳ) ∈ R n × R m satisfying the upper-level inequality constraints G(x, ȳ), (x, ȳ) is upper-level regular if there exists a direction d ∈ R n+m such that Finally, to write the necessary optimality conditions for problem (1.2), it is standard to use the following partial calmness concept [31]: Definition 2.1.Let (x, ȳ) be a local optimal solution of problem (1.2).This problem is partially calm at (x, ȳ) if there exists λ > 0 and a neighbourhood U of (x, ȳ, 0) such that 3) being partially calm at a local optimal solution (x, ȳ) is equivalent to the existence of a parameter λ > 0 such that (x, ȳ) is also a local optimal solution of problem min x, y It is clear that this is a penalization of only the constraint f(x, y) − ϕ(x) 0 with the penalty parameter λ.Hence, problem (2.3) is a usually labelled as a partial exact penalization of problem (1.2)- (1.3).With this reformulation it is now reasonable to assume standard constraint qualifications to derive optimality conditions.Based on this, we have the following result, see, e.g., [3,4,5,31], based on a particular estimate of the subdifferential of ϕ (1.3).Theorem 2.2.Let (x, ȳ) be a local optimal solution to (1.2)- (1.3), where all function are assumed to be differentiable, ϕ is finite around x and lower-level problem is fully convex.Further assume that the problem is partially calm at (x, ȳ), the lower-level regularity is satisfied at (x, ȳ) and upper-level regularity holds at x. Then there exist λ 0, and Lagrange multipliers u, v, w such that w 0, g(x, ȳ) 0, w T g(x, ȳ) = 0. (2.9) Depending on the assumptions made, we can obtain optimality conditions different from the above.The details of different stationarity concepts can be found in the latter references, as well as in [32].Weaker assumptions will typically lead to more general conditions.However, making stronger assumptions allows us to obtain systems that are easier to handle.For instance, it is harder to deal with more general conditions introduced in Theorem 3.5 of [3] or Theorem 3.1 of [4] because of the presence of the convex hull in the corresponding estimated of the subdifferential of ϕ [3,4,5,31].The other advantage of (2.4)-(2.9) is that, unlike the system studied in [9], these conditions do not require to introduce a new lower-level variable.
The above optimality conditions involve the presence of complementarity conditions (2.7)-(2.9),which result from inequality constraints present in (1.2)- (1.3).In order to reformulate the complementarity conditions in the form of a system of equations, we are going use the concept of NCP-functions; see, e.g., [26].The function φ : R 2 → R is said to be a NCP-function if we have φ(a, b) = 0 ⇐⇒ a 0, b 0, ab = 0.
In this paper, we use φ(a, b) := √ a 2 + b 2 − a − b, known as the Fischer-Burmeister function [8].This leads to the reformulation of the optimality conditions (2.4)-(2.9)into the system of equations: where we have z := (x, y, u, v, w) and ) and w 2 + g(x, y) 2 − w + g(x, y) are defined as in (2.11).The superscript λ is used to emphasize the fact that this number is a parameter and not a variable for equation (2.10).One can easily check that this system made of n + 2m + p + q + p real-valued equations and n + m + p + q + p variables.Clearly, this means that (2.10) is an over-determined system and the Jacobian of Υ λ (z), when it exists, is a non-square matrix.
In Algorithm 3.2, denotes the tolerance and K is the maximum number of iterations.It is clear from (3.1) that for the algorithm to be well-defined, the matrix ∇Υ λ (z) T ∇Υ λ (z) needs to be non-singular.In the next subsection, we provide tractable conditions ensuring that this is possible.
. Nonsingularity of ∇Υ λ (z) T ∇Υ λ (z) and Convergence To proceed, first start by noting the following, for any matrix A with more rows than columns, the matrix A T A has full rank if and only if the columns of A are linearly independent.This result is important as full rank of A T A is equivalent to invertibility of A T A. As ∇Υ λ (z) is a (n + 2m + 2p + q) × (n + m + 2p + q) matrix with m more rows than columns, the linear independence of its columns ensures that ∇Υ λ (z) T ∇Υ λ (z) is non-singular.It therefore suffices for us to provide conditions guarantying the linear independence of the columns of ∇Υ λ (z).
Based on the result above, we can now state the convergence theorem for our Gauss-Newton Algorithm 3.2.To proceed, first note that by implementing the Gauss-Newton method to solve (2.10), leads to a solution to the least-square problem where we define N := n + m + 2p + q.The direction of the Newton method for problem (3.19) can be written as where is the term that is omitted in the Gauss-Newton direction (3.1).It is well known that the Gauss-Newton method converges with the same rate as Newton method if the term T (z) is small enough in comparison with the term ∇Υ λ (z) T ∇Υ λ (z); see, e.g., [7,21,24].This is the basis of the following convergence result of Algorithm 3.2.Theorem 3.5.Let the assumptions in Theorem 3.3 hold and suppose that z is a local optimal solution of problem (3.19) and {z k } be a sequence of points generated by Algorithm 3.2 converging to z.Furthermore, assuming that ∇ 2 L λ and ∇(∇ y )L are well-defined and Lipschitz continuous in the neighbourhood of z, Proof.Start by recalling that under the assumptions of Theorem 3.3, the matrix ∇Υ λ (z) is of full column rank.Hence, it is positive definite.Furthermore, under the strict complementarity condition (made in Theorem 3.3) and well-definiteness of derivatives of ∇ 2 L λ and ∇(∇ y L(z)) in the neighbourhood of z, all components of z → Υ λ (z) and z → ∇Υ λ (z) are differentiable near z.
Hence the term T (z) is well-defined near z.Further on, Lipschitz continuity of derivatives of ∇ 2 L λ (z) and ∇(∇ y L(z)) implies Lipschitz continuity of ∇Υ λ (z) T ∇Υ λ (z) and of T (z).Hence the term ∇Υ λ (z) T ∇Υ λ (z) + T (z) is Lipschitz continuous in the neighbourhood of z.Finally, since ∇Υ λ (z) is differentiable near z, the product ∇Υ λ (z) T ∇Υ λ (z) is also differentiable near z.We also know that ∇Υ λ (z) T ∇Υ λ (z) is non-singular in the neighbourhood of z under assumptions of Theorem 3.3.By the Inverse Function Theorem, differentiability and non-singularity of ∇Υ λ (z) T ∇Υ λ (z) is sufficient to state that ∇Υ λ (z) T ∇Υ λ (z) is a diffeomorphism, and hence has a smooth and differentiable inverse (∇Υ λ (z) T ∇Υ λ (z)) −1 in the neighbourhood of z.Smoothness and differentiability of With Theorem 3.5 it is easy to see that Gauss-Newton method converges quadratically if T (z) = 0 and Q-Linearly if T (z) is small relative to ∇Υ λ (z) T ∇Υ λ (z).Such properties could be satisfied for small residuals problems and for the problems that are not too nonlinear.For the small residuals problems we have that the components Υ λ i (z) are small for all i, which makes the term T (z) small.For the problems with not too much nonlinearity the components ∇ 2 Υ λ i (z) are small for all i, which also results in small T (z).If it turns out that we can obtain an exact solution Υ λ (z) = 0, then T (z) = 0, and we have quadratic convergence.It is worth noting that in general we cannot always have Υ λ i (z) = 0 for all i as the system is overdetermined, but minimizing the sum of the squares of Υ λ i (z) we obtain a solution point z, at which N+m i=1 (Υ λ i (z)) 2 is as small as possible.If the problem has small residuals, then small Υ λ i (z) are naturally obtained by implementing Algorithm 3.2 as this is designed to minimize N+m i=1 Υ λ i (z) 2 .In terms of having small components ∇ 2 Υ λ i (z) we observe that ∇ 2 Υ λ (z) will involve many third derivatives of F(x, y), G(x, y), f(x, y) and g(x, y).Hence, if the original problem is not too nonlinear, the Hessian of the system (2.10) should be small.As a result if there exists reasonable solution with Υ λ i (z) ≈ 0 for all i or if the original problem (1.1) is not too nonlinear, then the Gauss-Newton for Bilevel Programming converges Q-linearly.
The first drawback of Algorithm 3.2 is the requirement of strict complementarity in Assumption 3.1, to help ensure the differentiability of the function Υ λ .Assumption 3.1 is very strong and in the context of the pool of test problems used for our numerical experiments in Section 5, for example, it did not hold at the last iteration for at least one value of λ for the total of 54 out of the 124 problems.If one wants to avoid the strict complementarity assumption, one option is to use smoothing technique for Fischer-Burmeister function; this will be discussed in Section 4. Before we move to this, it is worth mentioning that a second issue faced by our Algorithm 3.2 is the requirement that the matrix ∇Υ λ (z) T ∇Υ λ (z) be nonsingular at each iteration.To deal with this, one option is a Newton step, where the generalized inverse of ∇Υ λ (z) T ∇Υ λ (z), which always exists, is calculated.This is briefly discussed in the next subsection.
. Newton method with Moore-Penrose pseudo inverse Indeed, one of the most challenging aspects of the Gauss-Newton step in Algorithm 3.2 is the computation of the inverse of the matrix ∇Υ λ (z k ) T ∇Υ λ (z k ), as this quantity might not exist at some iterations.To deal with situations where the inverse of the matrix does not exist, various concepts of generalized inverse have been used in the context of the Newton method; see, e.g., [22] for related details.Although we do not directly compute ∇Υ λ (z k ) T ∇Υ λ (z k ) −1 in our implementation of Algorithm 3.2 in Section 5, we would like to compare the pure Gauss-Newton-type method presented in the previous subsection and the Newton method with Moore-Penrose pseudo inverse.Hence, we present the later approach here and its relationship to Algorithm 3.2.
For an arbitrary matrix A ∈ R m×n , its Moore-Penrose pseudo inverse (see, e.g., [13]) is defined by where VΣ + U represents a singular value decomposition of A, where Σ + corresponds to the pseudo-inverse of Σ that can be given by if A has full column rank, we have an additional property that Based on this definition, an iteration of the Newton method with pseudo inverse for system (2.10) can be then stated as We are now going to refer to (3.20) as iteration of the Pseudo-Newton method.The Pseudo-Newton method for bilevel programming can be defined in the same fashion as Algorithm 3.2 with the difference that direction would be given by Clearly, the Pseudo-Newton method is always well-defined, unlike the Gauss-Newton method, and hence, it will produce some result in the case when the Gauss-Newton method diverges [11].Based on this general behaviour and interplay between the two approaches, we will be comparing them in the numerical section.
For details on the convergence of Newton-type methods with pseudo-inverse, the interested reader in referred to [12]. - In this section, we relax the strict complementarity assumption, considering the fact that it often fails for many problems as illustrated in the previous section.However, to ensure the smoothness of the function Υ λ (2.10), the Fischer-Burmeister function is replaced with the smoothing Fischer-Burmeister function (see [15]) defined by where the perturbation parameter µ > 0 helps to guaranty its differentiability at points (x, y, u) satisfying u j = g j (x, y) = 0.It is well-known (see latter reference) that 2) The smoothing system of optimality conditions becomes following the convention in (2.11), where µ is a vector of appropriate dimensions with sufficiently small positive elements.Under the assumption that all the functions involved in problem (1.1) are continuously differentiable, Υ λ µ is also a continuously differentiable function for any λ > 0 and µ > 0. Additionally, we can easily check that Following the smoothing scheme discussed, for example, in [25], our aim is to consider a sequence {µ k } decreasing to 0 such that equation (2.10) is approximately solved: for a fixed value of λ > 0. Hence, we consider the following algorithm for system (4.3): Algorithm 4.1.Smoothing Gauss-Newton Method for Bilevel Optimization Step 0: Choose λ > 0, µ 0 ∈ (0, 1), z 0 := (x 0 , y 0 , u 0 , v 0 , w 0 ), > 0, K > 0, set k := 0.
To implement Algorithm 4.1 numerically, we compute the direction by solving The Pseudo-Newton algorithm for the smoothed optimality conditions (4.3) will be the same as Algorithm 4.1 apart from Step 2, where the corresponding direction is given by Anther way to deal with the non-differentiability of the Fischer-Burmeister NCP-function is to introduce a generalized generalized Jacobian concept for the system (2.10).A semismooth Newtontype method for bilevel optimization following this type of approach is developed in [9].But we are not taking this approach here.
Next, we use this lemma to provide a condition ensuring that the matrix ∇Υ λ µ (z) T ∇Υ λ µ (z) is nonsingular.This will allow the smoothed Gauss-Newton step (4.4) to be well-defined.As in the previous section, it suffices to show that the columns of ∇Υ λ µ (z) are linearly independent.
There is at least one other way to show that ∇Υ λ µ (z) T ∇Υ λ µ (z) is nonsingular.The approach is based on the structure of the matrix, as it will be clear in the next result.To proceed, we need the following two assumptions.Assumption 4.5.Each row of the following matrix is a nonzero vector: Assumption 4.6.For λ > 0 and µ > 0, the diagonal elements of the matrix ∇Υ λ µ (z) T ∇Υ λ µ (z) dominate the other terms row-wise; i.e., a ii > N j=1, j =i |a ij | for i = 1, . . ., N, where a ij denotes the element in the cell (i, j) of ∇Υ λ µ (z) T ∇Υ λ µ (z) for i = 1, . . ., N and j = 1, . . ., N. Lemma 4.7.Let Assumption 4.5 hold at the point z := (x, y, u, v, w).Then for any λ > 0 and µ > 0, the diagonal elements of the matrix ∇Υ λ µ (z) T ∇Υ λ µ (z) are strictly positive.Proof.Considering the Jacobian matrix in (4.7), its transpose can be written as Denote by r i , i = 1, . . ., 4, respectively, the first, second, third, and fourth row-block of this matrix.
Thanks to Lemma 4.2, it is also clear that these items are all strictly positive.
Proof.It is known that the matrix is positive definite if it is symmetric, its diagonal elements are strictly positive, and diagonal elements dominate elements of the matrix in the corresponding row.This property is the consequence of the Gershgorin circle theorem, which can be found in [13, page 320].As ∇Υ λ µ (z) T ∇Υ λ µ (z) is symmetric, then combining Assumptions 4.5 and 4.6 to Lemma 4.7, we have the result.
Next, we provide an example where the assumptions required for Theorem 4.8 are satisfied.
Example 4.9.We consider an instance of problem (1.1) taken from the BOLIB Library [33] with and no upper-level constraint.For this problem the function Υ λ (2.10) can be written as The first item of note about this example is that the optimal solution (x, ȳ) = (1, 0) does not satisfy that the optimality conditions (2.4)-(2.9),given that Υ λ (x, ȳ, ū, w) = 0 for any values of ū and w.However, Algorithm 4.1 identifies the solution for λ taking the values 0.6, 0.7 or 0.8 with the smoothing parameter set to µ = 10 −11 .Indeed, the convergence of the method seems to be justified as for this problem, we can easily check that for x = 1 and ȳ = 0, and subsequently, we have the product Hence, Assumption 4.5 is clearly satisfied and for Assumption 4.6 to hold, we need This holds for any µ > 0, λ > 0, ū > 0, and 1 < w < 2.
We further note that the assumptions made in Theorem 4.3 hold for the problem in this example.Firstly, we observe that ∇ 2 L λ (z) is positive definite if λ w < ū + 1. Subsequently, we can check that both assumptions of Theorem 4.3 are satisfied if For instance, choosing ū = √ 8 × 10 −6 , w = 10 −6 , µ = 4 × 10 −12 , and λ < 2.25 gives the result.To conclude this section, we would like to analyse the Jacobian consistency of Υ λ .Recall that according to [2], the Jacobian consistency property will hold for Υ λ if this mapping is Lipschitz continuous and there exists a constant > 0 such that for any z ∈ R N and µ ∈ R + , we have Note that in (4.20), dist represents the standard distance function while commonly used in this context; see, e.g., [16].In (4.21), N := n + m + 2p + q and ∂Υ λ i , i = 1, . . ., N + m represents the subdifferential in the sense of Clarke.Note that ∂ C Υ λ (z) T contains the generalized Jacobian in the sense of Clarke of the function Υ λ .Roughly speaking, the Jacobian consistency property (4.20) translates to a framework ensuring that when the smoothing parameter µ is getting close to zero, the Jacobian ∇Υ λ µ (z) converges to an element in the C-subdifferential ∂ C Υ λ (z).This property is important in determining the accuracy of the smoothing method in Algorithm 4.1.
Based on (4.21), at all points z := (x, y, u, v, w) satisfying Assumption 3.1, where ∇Υ λ (z) is defined by (3.3).For the case when strict complementarity does not hold, elements of ∂ C Υ λ (z) have the same structure as in (3.3) with the only differences being in the terms τ j , γ j , α j , β j , θ j , and κ j for indices j where strict complementarity does not hold.We still have ∂Υ λ i (z) is the same as ∇Υ λ i (z) for rows i = 1, ..., n + 2m.To determine the remaining rows, consider Obviously τ j and γ j introduced in (3.4) are not well-defined for j ∈ Ω 1 .Similarly, the same holds for α j and β j for j ∈ Ω 2 and θ j and κ j for j ∈ Ω 3 .Following the same procedure as in [16, Proposition 2.1], we define Since we do not assume strict complementarity here, then in contrast to Subsection 3.1, we have Theorem 4.10.For λ > 0, the Jacobian consistency property holds for the approximation Υ λ µ of Υ λ .
The focus of our experiments in this section will be on the smoothing system (4.3),where we set µ := 10 −11 constant throughout all iterations.Based on this system, we test and compare the Gauss-Newton method, Pseudo-Newton method, and the Matlab built-in method called fsolve (with Levenberg-Marquardt chosen as option).The examples used for the experiments are from the Bilevel Optimization LIBrary of Test Problems (BOLIB) [33], which contains 124 nonlinear examples.The experiments are run in MATLAB, version R2016b, on MACI64.Here, we present a summary of the results obtained; more details for each example are reported in [27].
For Step 0 of Algorithm 4.1 and the corresponding smoothed Pseudo-Newton algorithm, we set the tolerance to := 10 −5 (see Subsection 5.4 for a justification) and the maximum number of iterations to be K := 1000.As for stopping criterion of fsolve, the tolerance is set to 10 −5 as well.For the numerical implementation we calculate the direction d k by solving (4.5) with Gaussian elimination.Five different values of the penalty parameter are used for all the experiments; i.e., λ ∈ {100, 10, 1, 0.1, 0.01}, see [27] for details of the values of each solution for a selection of λ.The motivation of using different values of λ comes from the idea of not over-penalizing and not underpenalizing deviation of lower-level objective values from the minimum, as bigger (resp.smaller) values of λ seem to perform better for small (resp.big) values of lower-level objective.After running the experiments for all values of λ ∈ {100, 10, 1, 0.1, 0.01}, the best one is chosen (see Table 1 in [27]), i.e. for which the best feasible solution is produced for the particular problem by the tested algorithms.Later in this section we present the comparison of the performance of the algorithms for the best value of λ.The experiments have shown that the algorithms perform much better if the starting point (x 0 , y 0 ) is feasible.As a default setup, we start with x 0 = 1 n and y 0 = 1 m .If the default starting point does not satisfy at least one constraint, we choose a feasible starting point; see [27].Subsequently, the Lagrange multipliers are initialised at u 0 = max 0.01, −g(x 0 , y 0 ) , v 0 = max {0.01, −G(x, y)}, and u 0 = w 0 .

. Performance profiles
Performance profiles are widely used to compare characteristics of different methods.In this section we consider performance profiles, where t p,s denotes the CPU time to solve problem p by algorithm s.If the optimal solution of problem p is known but it cannot be solved by algorithm s (i.e., upperlevel objective function error > 60%), we set t p,s := ∞.We then define the performance ratio by r p,s := t p,s min{t p,s : s ∈ S} , where S is the set of solvers.Performance ratio is the ratio of how algorithm s performed to solve problem p compared to the performance of the best performed algorithm from the set S. The performance profile can be defined as the cumulative distribution function of the performance ratio:  Comparing line-graphs of the performance profiles, the higher position of the graph means the better performance of the algorithms.The value on the y-axis shows the fraction of examples for which performance ratio is better than T (presented on the x-axis).As we set t p,s := ∞ for the cases when we do not solve the problem, the algorithms would not go much over 90% mark as for the rest of the problem ρ s (T ) = ∞.Figure 1 clearly shows that Gauss-Newton and Pseudo-Newton showed better performance than fsolve.Since the variable for the comparison was CPU time, based on the values of ρ s (1), we can claim that Gauss-Newton was the fastest algorithm for about 70% of the problems, Pseudo-Newton for about 60% of the problems and fsolve was the quickest for about 20% of the problems.From the graph, one can also see that Gauss-Newton and Pseudo-Newton methods have ρ s (2) = 80%, while fsolve only has the value ρ s (2) = 30%, meaning that fsolve was more than twice worse than the best algorithm for 70% of the problems.Approaching T = 6, Gauss-Newton and Pseudo-Newton getting performance ratio close to ρ s (T ) = 90%, where fsolve has the value ρ s (6) ≈ 65%.This clearly shows that Gauss-Newton and Pseudo-Newton methods show quite similar performance in terms of CPU time.Both these algorithms clearly outperform fsolve for solving the set of test problems in terms of the measure of performance profiles.

. Feasibility check
Considering the structure of the feasible set of problem (1.2), it is critical to check whether the points computed by our algorithms satisfy the value function constraint f(x, y) ϕ(x).If the lower-level problem is convex in y and a constraint qualification (e.g., the MFCQ) holds at a solution point generated by our algorithms, then this point will automatically satisfy the value function constraint, provided (2.7) and (2.9) hold.Note that the latter conditions are incorporated in the stopping criterion of Algorithm There are 14 problems, where convex f(•, y) and g i (•, y) with i = 1, . . ., p, are convex, but the constraints are not all linear w.r.t.y.For these examples the MFCQ has been shown to hold at the point computed by our algorithms.The rest of the problems have non-convex lower-level objective or some of the lower-level constraints being nonconvex.For these examples, we compare the obtained solutions with the known ones from the literature.Let f A stand for f(x, ȳ) obtained by one of the tested algorithms and f K to be the known optimal value of lower-level objective function.In the graph below we have the lower-level relative error, (f A − f K )/(1 + |f K |), on the y-axis.The x-axis starts from 25th example and the error is plotted in increasing order.From the figure above we can see that for 30 problems the relative error of lower-level objective is negligible (< 5%) for all three methods.Almost for all of the remaining 19 examples Gauss-Newton and Pseudo-Newton have smaller errors than fsolve and Pseudo-Newton seems to have slightly smaller errors than Gauss-Newton method for some of the examples.We have seen that convexity and a CQ hold for the lower-level hold for 74 test examples.We consider solutions for these problems to be feasible for the lower-level problem.Taking satisfying feasibility error to be < 20%, we claim that feasibility is satisfied for 113 (91.13%) problems for Gauss-Newton and Pseudo-Newton methods, and for 110 (88.71%) problems for fsolve. .

Accuracy of the upper-level objective function
Here, we compare the values of the upper-level objective functions at points computed by the algorithms; i.e., Gauss-Newton and Pseudo-Newton algorithms, and fsolve.For this comparison purpose, we focus our attention only on 116 BOLIB examples [33], as solutions are not known for six of them and the Gauss-Newton algorithm diverges for NieEtal2017e, possibly due the singularity of the matrix ∇Υ λ µ (z) T ∇Υ λ µ (z); see [27] for more details.To proceed, let FA be the value of upper-level objective function at the point (x, ȳ) obtained by one of the algorithms (Gauss-Newton, Pseudo-Newton or fsolve) and FK the value of this function at the known best solution point reported in the literature (see corresponding references in [33]).The comparison is shown in Figure 3, where we have the relative error ( FA − FK )/(1 + | FK |) on the y − axis and number of examples on the x-axis, starting from 85th example.The graph is plotted in the order of increasing error.From Figure 3, we can see that most of the known best values of upper-level objective functions were recovered by all the methods, as the relative error is close to zero.Precisely, for 93 of the tested problems, the upper-level objective function error is negligible (i.e., less than 5%) for the solutions obtained by all the three methods.For the remaining 23 examples, it is clear that the errors resulting from fsolve are much higher than the ones from the Gauss-Newton and Pseudo-Newton methods.It is worth noting that algorithms perform fairly well for most of the problems.With the accuracy error of 20% our algorithms recovered solutions for 92.31% of the problems, while fsolve recovered only 88.03% of the solutions. .

Variation of the tolerance in the stopping criterion
We are now going to evaluate the performance of Algorithm 4.1 as we relax the tolerance in the stopping criterion.Precisely, we set := 10 −8 (as opposed to := 10 −5 used so far) and it turns out that for most of the examples, this is not achievable.Hence, the algorithms then stop after the maximum number of iterations or if the gap between improvement from step to step is too small.For those examples, fsolve recovered solution with better tolerance than Gauss-Newton and Pseudo-Newton algorithms.This shows that whenever we are able to solve a problem almost exactly, fsolve's stopping criteria is more strict and obtains slightly better values of the system.This can be explained by an additional stopping criteria that we use for Gauss-Newton and Pseudo-Newton if the improvement between the steps of the algorithms gets too small.This shows that whenever we are able to solve the system Υ λ µ (z) = 0 almost exactly, fsolve's stopping criteria is a bit less strict and iterations keep going to produce values of ||Υ λ µ (z)|| that are closer to 0 than for the other two methods.The explanation could be that due to an additional stopping criteria of Gauss-Newton and Pseudo-Newton the algorithms stop earlier once significant improvement of the solution is not observed from step to step.If we now look at the performance of algorithms between tolerances of 10 −8 and 10 −6 , the Pseudo-Newton method shows better performance than the other algorithms and fsolve is the weakest of the three algorithms.This means that if we want to solve a problem with the tolerance of 10 −6 or better, the Pseudo-Newton algorithm is more likely to recover solutions than the other two methods.The other important observation from Figure 4 is that choosing := 10 −5 is the most sensible tolerance as better tolerance only allow about 50% of the examples to be solved (i.e., just over 60 examples as we can see from the graph). .

Checking assumption on λ
Considering the importance of the requirement that λ < It suffices to check that λ < c µ (λ).We set λ − c µ (λ) := 100 if the difference is undefined, while discarding problems where g(x, y) is not present.Note that the assumption on λ is not necessary in Theorem 4.3 for problems with no lower-level constraint.Let us also introduce the notions of the best λ and optimal λ, where by the best λ we mean the values of λ for which λ − c µ (λ) is the smallest and by the optimal we mean the values of λ that were the best to obtain the solution according to [27].In the figure below we present the difference λ − c µ (λ) on the y-axis and number of the example on the x-axis, following ascending order w.r.t. the values on the y-axis.Clearly, condition λ < c µ (λ) holds for the values of λ − c µ (λ) lying below the x-axis.From Figure 5 we can see that the assumption holds for 42 (out of 116) problems for the optimal λ.This means that the condition can hold for many examples.But, obviously, as it is not a necessary condition, our Algorithm 4.1 still converges for many other problems, where the condition is not necessarily satisfied.For the best values of λ, condition λ < c µ (λ) holds for 101 problems.Hence, showing that for most of the examples, there is at least one value of λ ∈ {100, 10, 1, 0.1, 0.01} for which the condition is satisfied. .

Final comments
In this paper, a class of the LLVF-based optimality conditions for bilevel optimization problems has been reformulated as a system of equations using the Fischer-Burmeister function.It was shown that the Gauss-Newton method can be well-defined and the framework for convergence is provided.
We have tested the method and its smoothed version numerically, alongside with Newton method with Moore-Penrose pseudo inverse.The comparison of the obtained solutions with known best ones showed that the methods are appropriate to be used for bilevel programs, recovering optimal solutions (when known) for most of the tested problems.It is worth mentioning that whenever ∇Υ λ µ (z) T ∇Υ λ µ (z) is not ill-conditioned throughout all iterations, Gauss-Newton and Pseudo-Newton methods produced the same results as expected.More interestingly, for the 38 problems for which Gauss-Newton could not be implemented due to singularity of the direction matrix for one or more values of λ (see [27]), our conjecture that Pseudo-Newton would produce reasonable solutions for these cases worked well for 14 problems (e.g.'CalamaiVicente1994c', 'DempeDutta2012b', 'Dempe-Franke2011a' in [27]) and failed for the remaining 24 examples (e.g.'Bard1988c', 'Colson2002BIPA3', 'DempeDutta2012a' in [27]).Nevertheless, we can say that the Pseudo-Newton method is indeed slightly more robust compared to the Gauss-Newton method.This, together with slightly better feasibility (seen in Figure 2) and very similar computation time, allows us to say that Pseudo-Newton method is indeed slightly more robust than Gauss-Newton method.
All the three methods (Gauss-Newton, Pseudo-Newton, and fsolve) show very fast performance with average CPU time of less than 0.5 seconds for any of the five values of λ used.However, the Gauss-Newton and Pseudo-Newton methods are more efficient at recovering solutions for the 124 BOLIB test problems from [33].Apart from recovering more solutions they also showed better CPU time than fsolve as seen from performance profiles.Our methods also showed to produce more feasible solutions than fsolve in the sense of the feasibility for the lower-level problem.

> 0 .
conflict with the requirement that λ be strictly positive, as due to Lemma 4.2, we have In Subsection 5.5, a numerical analysis of this condition is conducted.Next, provide an example of bilevel program, where the assumptions made in the Theorem 4.3 can be satisfied.Example 4.4.
ρ s (τ) := p ∈ P : r p,s τ n p , where P is the set of problems.Performance profile, ρ s (τ), is counting the number of examples for which the performance ratio of the algorithm s is better (smaller) than τ.The perfomance profile ρ s : → [0, 1] is non-decreasing function, where the value of ρ s (1) shows the fraction of the problems for which solver s was the best.

Figure 1 :
Figure 1: Performance profiles of the methods for 124 problems

Figure 2 :
Figure 2: Lower-level optimality check for examples with a nonconvex lower-level problem

Figure 3 :
Figure 3: Comparison of upper-level objective values for examples with known solutions

Figure 5 :
Figure 5: Checking assumption λ < c µ (λ) for best and optimal values of λ 19))Then with the result in Lemma 4.2, under the assumptions that ∇ 2 L λ (z) is positive definite and = 1, ..., p, equation (4.19)is the sum of non-negative terms, which can only be a sum of zeros if all components of d 1 , d 2 and d 3 are zeros.Since all components of d 2 are zeros, we can look back to (4.17) or (4.12) to deduce that d 4 j = 0 for j = 1, . . .p, completing the proof.

Table 1 .
4.1.To check whether the points obtained are feasible, we first identify the BOLIB examples, where the lower-level problem is convex w.r.t.y; see the summary of these checks in It turns out that significant part of test examples have linear lower-level constraints.For these examples, a constraint qualification (CQ) is automatically satisfied.

Table 1 :
Convexity of the lower-level functions