Finding global solutions of some inverse optimal control problems using penalization and semismooth Newton methods

We present a method to solve a special class of parameter identification problems for an elliptic optimal control problem to global optimality. The bilevel problem is reformulated via the optimal-value function of the lower-level problem. The reformulated problem is nonconvex and standard regularity conditions like Robinson’s CQ are violated. Via a relaxation of the constraints, the problem can be decomposed into a family of convex problems and this is the basis for a solution algorithm. The convergence properties are analyzed. It is shown that a penalty method can be employed to solve this family of problems while maintaining convergence speed. For an example problem, the use of the identity as penalty function allows for the solution by a semismooth Newton method. Numerical results are presented. Difficulties and limitations of our approach to solve a nonconvex problem to global optimality are discussed.


Introduction
In this paper we study an inverse problem in which we aim to identify finitely many parameters of an optimal control problem with a linear partial differential equation. This results in an infinite-dimensional bilevel optimal control problem. The concept of bilevel optimization is discussed in [Shimizu, Ishizuka, Bard, 1997; Bard, 1998; Dempe, 2002; Dempe, Kalashnikov, et al., 2015], while [Troutman, 1996; Hinze et al., 2009; Tröltzsch, 2009; Lewis, Vrabie, Syrmos, 2012] present a comprehensive introduction to optimal control. Bilevel optimal control problems are also studied in [Knauer, Büskens, 2010; Fisch et al., 2012; Hatz, 2014; Kalashnikov, Benita, Mehlitz, 2015], for example. To be more precise, we consider the parametric optimization problem (LL(β)), where β ∈ Q ⊂ R^n is a parameter, and the sets Q, U_ad, the linear operators A, B, the spaces U, Y, and the function f are such that Assumption 2.1 is satisfied. Here u ∈ U_ad is the control, y ∈ Y is the state, and Ay = Bu describes an elliptic PDE. Assumption 2.1 guarantees that the solution of (LL(β)) is unique for each β ∈ Q, see Lemma 2.2.
The problem (LL(β)) is also called the lower-level problem. The upper-level problem under investigation is the minimization problem (UL), where Ψ(β) describes the unique solution of (LL(β)). Our main motivation for studying (UL) is the purpose of identifying an unknown parameter β from some (possibly perturbed) measurements of Ψ(β), see also Section 5.
Together, the problems (LL(β)) and (UL) constitute the bilevel optimization problem. Necessary optimality conditions of bilevel optimal control problems, i.e. hierarchical optimization problems with two decision layers, where at least one decision maker has to solve an optimal control problem, are derived in [Ye, 1995; 1997; Benita, Mehlitz, 2016; Mehlitz, Wachsmuth, 2016; Mehlitz, 2017; Harder, 2021]. Recently, solution theory for inverse optimal control problems of partial differential equations was developed in [Harder, Wachsmuth, 2018b; Holler, Kunisch, Barnard, 2018]. We also note that optimal control problems with variational inequality constraints, such as optimal control of the obstacle problem (see [Harder, Wachsmuth, 2018a]), can be viewed as bilevel optimal control problems. Regarding the numerical solution of the presented problem type, there mainly exist (to the best of our knowledge) methods for inverse optimal control problems with ordinary differential equations, see [Albrecht, Leibold, Ulbrich, 2012; Hatz, Schlöder, Bock, 2012; Hatz, 2014; Albrecht, Ulbrich, 2017]. The corresponding algorithms tend to replace the lower-level problem with its optimality conditions. A different approach was introduced in [Dempe, Harder, et al., 2019], where the authors solved a special class of inverse problems of partial differential equations by exploiting the optimal-value function of the parametric optimal control problem. The optimal-value function ϕ : Q → R of (LL(β)) is defined by

ϕ(β) := inf { f(β, y, u) | (y, u) ∈ Y × U_ad, Ay = Bu } = f(β, Ψ(β)). (1.1)

The idea of using the optimal-value function in bilevel optimization problems can be traced back to [Outrata, 1990]. With the help of the optimal-value function, the hierarchical problem (UL) can be transformed into the single-level problem

min_{β,y,u} F(β, y, u)  s.t.  Ay − Bu = 0,  u ∈ U_ad,  f(β, y, u) ≤ ϕ(β).
(OVR) We call this optimization problem the optimal-value reformulation of (UL). The resulting nonconvex surrogate problem does not satisfy standard constraint qualifications such as Robinson's CQ. However, in [Dempe, Harder, et al., 2019, Theorem 5.12] the authors were able to prove necessary optimality conditions of Clarke-stationary type via a relaxation approach. Furthermore, [Dempe, Harder, et al., 2019, Algorithm 1] introduces a solution algorithm using a piecewise affine approximation ξ of the optimal-value function ϕ with ξ ≥ ϕ, which leads to the relaxed optimization problem (OVR(ξ)), in which ϕ is replaced by ξ. If f and F are convex, this problem can be split into finitely many convex subproblems for which a global solution can be obtained. The original problem can then be solved by iteratively improving the approximation ξ of the optimal-value function, see [Dempe, Harder, et al., 2019, Theorem 6.5]. In this paper we start with the same approach to derive a global solution scheme. We slightly deviate in the construction of the piecewise affine approximation by starting with a triangulation of the admissible set for the upper-level control variable and by subsequently enforcing some regularity on further subdivisions. In addition to proving convergence of the global solution scheme in Theorem 3.1, this will allow us to link the convergence speed to the size of the elements of the partition (see Theorem 3.5).
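As a toy illustration of the relaxation ξ ≥ ϕ, the following sketch (hypothetical one-dimensional data, not the PDE setting of this paper) interpolates a convex value function ϕ piecewise affinely on a grid; by convexity, the interpolant lies above ϕ, so the feasible set of the relaxed problem contains that of (OVR):

```python
def phi(beta):
    # toy convex value function: min over u of (u - beta)**2 + u**2,
    # attained at u = beta/2 with value beta**2 / 2
    return beta * beta / 2.0

def xi(beta, grid):
    # piecewise affine interpolant of phi over consecutive grid cells;
    # for convex phi, each chord lies above the graph, hence xi >= phi
    for a, b in zip(grid, grid[1:]):
        if a <= beta <= b:
            t = (beta - a) / (b - a)
            return (1.0 - t) * phi(a) + t * phi(b)
    raise ValueError("beta outside grid")
```

On a grid of nodes, `xi` coincides with `phi` at the nodes and dominates it in between, which is exactly the property ξ ≥ ϕ used to construct the relaxation.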
In order to solve (OVR(ξ)), we also consider the penalty problem

min_{β,y,u} F(β, y, u) + γ P(f(β, y, u) − ξ(β)). (OVRP(ξ))

Here, P : R → R is a penalty function and γ > 0. Interestingly, we will see that it is possible to choose the identity P(x) = x as a penalty function. This has several benefits. On the one hand, we show in Lemma 4.7 that a finite penalty parameter can be chosen such that one obtains the solution of (OVR(ξ)). On the other hand, the choice of the identity results in much simpler derivatives of the objective of (OVRP(ξ)), and this enables us to use a semismooth Newton method to solve the subproblems efficiently, see Section 5.3.
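The exactness of a linear penalization can already be seen in a scalar toy problem (illustrative only, unrelated to the function-space setting of the paper): minimizing F(x) = x² subject to g(x) = 1 − x ≤ 0 has the solution x* = 1 with Lagrange multiplier 2, and the penalized objective F + γ·g recovers x* exactly once γ reaches this multiplier:

```python
def penalized_argmin(gamma):
    # minimizer of x**2 + gamma * (1 - x); setting the derivative
    # 2*x - gamma to zero yields x = gamma / 2
    return gamma / 2.0

# below the multiplier value 2 the penalized minimizer is infeasible
# (g(x) > 0); at gamma = 2 it coincides with the constrained solution
```

This mirrors, in the simplest possible setting, why a finite penalty parameter can suffice for the identity penalty.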
Solving nonconvex problems to global optimality is an intricate issue, and, hence, we expect difficulties. Indeed, our approach has some limitations concerning the obtained convergence speed, see Remark 3.6. Especially in a practical setting, the convergence speed deteriorates with an increasing dimension of the upper-level variable (curse of dimensionality).
Let us describe the structure of this paper. In Section 2 we present the used notation as well as the main governing assumption in addition to some preliminary theory related to optimal control problems. We proceed by introducing a global solution algorithm (Algorithm 1) in Section 3 and prove its convergence in Theorem 3.1. Further, we present some convergence speed estimates in Theorem 3.5 related to the size and regularity of the elements in the partition. To ensure this property, we derive a simple method for refining the partition in arbitrary finite dimensions while keeping some regularity properties of the elements, see Lemma 3.2. On top of this foundation we introduce our penalty approach (Algorithm 2) in Section 4. We show that there exists a choice of the penalty parameter (see Lemma 4.7) for which one can expect to find the solution of the subproblems from Algorithm 1. A method for solving the penalty subproblems by means of a semismooth Newton method is presented in Section 5. We show its superlinear convergence in Theorem 5.8. The corresponding implementation of our algorithm for solving the inverse optimal control problem and a numerical example are covered in Section 6.

Notation
The norm in a (real) Banach space X is denoted by ‖·‖_X. Let B_ε^X(x) denote the closed ε-ball centered at x ∈ X with respect to ‖·‖_X. Furthermore, X* is the topological dual of X and ⟨·,·⟩_X : X* × X → R denotes the corresponding dual pairing. For a set A ⊂ X we denote by conv A, cone A, cl A, int A and ∂A the convex hull, the conical hull, the closure, the interior and the boundary of A, respectively. For a Banach space Y, the space of all bounded linear operators from X to Y is denoted by L[X, Y]. We denote the radial cone and the normal cone to the set C at the point x ∈ C in the usual way. For x ∉ C, we set N_C(x) := ∅.
The set R^n denotes the usual n-dimensional real vector space, equipped with the Euclidean norm ‖·‖_{R^n}. The sets R_+, R_− represent the nonnegative and the nonpositive numbers, respectively. For an arbitrary bounded and open set Ω ⊂ R^d, the space of equivalence classes of measurable, p-integrable functions is denoted by L^p(Ω), p ∈ [1, ∞). Similarly, L^∞(Ω) denotes the space of essentially bounded (equivalence classes of) measurable functions. The space of functions on the set Ω for which the m-th derivatives exist in the Sobolev sense in L^p(Ω) is denoted by W^{m,p}(Ω). Furthermore, we use the usual notation for the Sobolev space with first-order derivatives and homogeneous boundary conditions and for its dual space.

A mapping J : X → Y between Banach spaces X, Y is called Fréchet differentiable at x ∈ X if there exists an operator J'(x) ∈ L[X, Y] such that

lim_{h→0} ‖J(x + h) − J(x) − J'(x)h‖_Y / ‖h‖_X = 0.

In this case, J'(x) is called the Fréchet derivative of J at x. If the mapping x ↦ J'(x) is well defined and continuous in a neighborhood of x, then J is said to be continuously Fréchet differentiable at x.

Assumptions
Throughout this work we utilize the following standing assumption. (e) The functionals are assumed to be bounded from below, convex and continuously Fréchet differentiable.
(f) The upper-level objective functional F and the partial derivative f_β are assumed to be Lipschitz continuous on bounded sets, whereas f_u and f_y are Lipschitz continuous w.r.t. β on bounded sets, i.e., for every M ≥ 0 there exists a constant L_M ≥ 0 such that the corresponding Lipschitz estimates hold. The reduced lower-level objective u ↦ f(β, S(u), u) is assumed to be strongly convex with respect to the control with constant μ > 0 independent of β ∈ Q, i.e., the strong convexity inequality holds for all β ∈ Q and u_1, u_2 ∈ U_ad. Here, f_y(·) and f_u(·) denote the partial derivatives of f w.r.t. y and u at the point (β, S(u_1), u_1).

Preliminary results
Let the optimization problem

min_x J(x)  subject to  g(x) ∈ C (OP)

be given, with continuously Fréchet differentiable mappings J : X → R, g : X → Y between Banach spaces X, Y and C ⊂ Y being nonempty, closed and convex. A feasible point x ∈ X of (OP) satisfies the Karush–Kuhn–Tucker (KKT) conditions if the corresponding stationarity system holds. If Robinson's CQ is satisfied at a local minimizer of (OP), then the KKT conditions hold, see [Zowe, Kurcyusz, 1979] and [Bonnans, Shapiro, 2000, Theorem 3.9]. Due to Assumption 2.1, the lower-level problem fits into the setting of (OP). The KKT system of the lower level for a parameter β at a solution (ỹ, ũ) then reads (2.4), where p ∈ Y (we identify Y* with Y) and ν ∈ U are multipliers. Note that Robinson's CQ is satisfied due to the surjectivity of A. Thus, for a minimizer of the lower-level problem there exist multipliers such that the KKT system (2.4) is satisfied.
We can now prove that the assumption of strong convexity for the lower level implies a quadratic growth condition in the solution.
Proof. Existence of a solution follows from the direct method of the calculus of variations.
Note that the boundedness of the minimizing sequence follows from the strong convexity.
Let (y_β, u_β) denote a solution of (LL(β)). Utilizing the strong convexity at the solution (β, y_β, u_β) yields, for all u ∈ U_ad, an estimate in which f_u(·) and f_y(·) denote the partial derivatives of f at (β, y_β, u_β). By using the KKT conditions with multipliers p, ν we obtain a chain of inequalities; the last inequality holds since ν ∈ N_{U_ad}(u_β) and u ∈ U_ad. Hence, one gets the quadratic growth condition (2.5). This also yields uniqueness of the solution.
Next, we introduce the solution operator for (LL(β)).
Definition 2.3. We denote by Ψ : Q → Y × U the solution mapping of the lower-level problem which maps β ∈ Q to the corresponding unique solution (y_β, u_β) given in Lemma 2.2. We further denote by ψ_y(β) ∈ Y and ψ_u(β) ∈ U the components of Ψ(β).
We will now prove that the function Ψ is globally Lipschitz continuous. Local Lipschitz continuity already follows from [Harder, 2021, Lemma 3.1.6]. However, by Assumption 2.1(f) we have a stronger assumption on the derivative of f. Thus, we can adapt the arguments from [Harder, 2021, Lemma 3.1.6] to obtain global Lipschitz continuity.
Lemma 2.4. Let X, V be Banach spaces, and let C ⊂ X, Q ⊂ V be nonempty, closed and convex sets. Further, let J : X × V → R and μ > 0 be given such that for all p ∈ Q, the function J(·, p) is strongly convex with parameter μ on the feasible set C and Fréchet differentiable. Then, the solution operator ψ : Q → X of the parametrized optimization problem min_{x ∈ C} J(x, p) exists and we have the estimate

‖ψ(p_1) − ψ(p_2)‖_X ≤ μ^{−1} ‖J_x(ψ(p_2), p_1) − J_x(ψ(p_2), p_2)‖_{X*}  for all p_1, p_2 ∈ Q.

Proof. The existence of ψ follows by standard arguments for convex optimization problems with strongly convex objectives.
We now consider fixed elements p_1, p_2 ∈ Q and their corresponding unique minimizers x_1 := ψ(p_1) and x_2 := ψ(p_2). The associated optimality conditions are the variational inequalities ⟨J_x(x_i, p_i), x − x_i⟩ ≥ 0 for all x ∈ C, i = 1, 2. If we now add these inequalities with the special choices x = x_{3−i}, we obtain the estimate

μ ‖x_2 − x_1‖_X² ≤ ⟨J_x(x_2, p_1) − J_x(x_2, p_2), x_2 − x_1⟩ ≤ ‖J_x(x_2, p_1) − J_x(x_2, p_2)‖_{X*} ‖x_2 − x_1‖_X.

In the first step, we have used the strong convexity of J(·, p_1). Dividing the last inequality by μ ‖x_2 − x_1‖_X yields the claim.
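The stability estimate of Lemma 2.4 can be checked on a one-dimensional quadratic model (a hypothetical toy with C = R, so the variational inequality becomes a first-order equation): for J(x, p) = (μ/2)x² − p·x, the unique minimizer is x(p) = p/μ, and the distance of two minimizers is bounded by the gradient gap divided by μ:

```python
mu = 4.0  # strong convexity parameter of the toy objective

def minimizer(p):
    # J(x, p) = (mu/2)*x**2 - p*x is strongly convex with parameter mu;
    # its unique minimizer solves the stationarity equation mu*x - p = 0
    return p / mu

def grad_gap(x, p1, p2):
    # |J_x(x, p1) - J_x(x, p2)| with J_x(x, p) = mu*x - p;
    # here the gap is |p1 - p2| independently of x
    return abs((mu * x - p1) - (mu * x - p2))
```

In this toy the estimate holds with equality, which shows that the factor 1/μ cannot be improved in general.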
Corollary 2.5. The function Ψ from Definition 2.3 is Lipschitz continuous on Q. Moreover, there exists a constant M_Ψ ≥ 0 such that ‖β‖_{R^n}, ‖ψ_y(β)‖_Y and ‖ψ_u(β)‖_U are bounded by M_Ψ for all β ∈ Q.

Proof. We start by proving the boundedness. From Lemma 2.2, we get a bound on the lower-level objective values. Together with the assumption that f is bounded from below (see Assumption 2.1(e)), we get an upper bound for ‖ψ_u(β)‖_U = ‖u_β‖_U. This also allows us to bound ‖ψ_y(β)‖_Y = ‖S(ψ_u(β))‖_Y ≤ ‖S‖ ‖ψ_u(β)‖_U, since S is a linear bounded operator by assumption. Since Q is bounded, β ∈ Q is bounded as well. We choose M_Ψ to be the largest of the previously discussed bounds for ‖β‖_{R^n}, ‖ψ_y(β)‖_Y and ‖ψ_u(β)‖_U.
In order to prove the Lipschitz continuity of Ψ, we want to apply Lemma 2.4 to the state-reduced lower-level problem, i.e., with the setting X = U, C = U_ad, V = R^n and J(u, β) = f(β, S(u), u). Assumption 2.1 yields that the assumptions of Lemma 2.4 are satisfied. From the chain rule, we get the derivative of the reduced objective. Now, Lemma 2.4 yields an estimate for the distance of two lower-level solutions. Owing to Assumption 2.1(f) with M = M_Ψ, this yields the desired Lipschitz continuity of ψ_u. Consequently, the Lipschitz continuity of ψ_y follows due to the continuity of S.
We can use this property to prove the existence of solutions for (OVR).
Proof. The lower-level problem admits a unique solution. Therefore, the solution operator Ψ of the lower-level optimization problem can be used to reduce (UL) to an optimization problem in R^n. The existence of a solution follows from the celebrated Weierstraß theorem.
We finally mention that more general results on the existence of solutions for bilevel optimal control problems are given in [Mehlitz, Wachsmuth, 2020].In particular, our result is covered by the second part of [Mehlitz, Wachsmuth, 2020, Theorem 16.3.5].
In order to use interpolation error estimates, we prove regularity of the optimal-value function ϕ.
Corollary 2.7. The optimal-value function ϕ is Fréchet differentiable on the interior of Q and its derivative is Lipschitz continuous. In particular, we have ϕ ∈ W^{2,∞}(Q).

Algorithm
In this section, we present an algorithm to solve (OVR) under the given Assumption 2.1. The algorithm is similar to [Dempe, Harder, et al., 2019, Algorithm 1], with the main difference being the choice of the function ξ which approximates the value function ϕ. In that reference, the functions ξ_k were defined as piecewise affine interpolations of ϕ based on a finite set X_k = {x_1, . . ., x_m} ⊂ R^n. The sets X_k are assumed to be increasing w.r.t. k and, in order to achieve a uniform Lipschitz bound of ξ_k on Q, one has to require Q ⊂ int conv X_1, see [Dempe, Harder, et al., 2019, Lemma 6.1, Example 6.1]. The reason for this extra assumption is that it is not possible to a priori control the shape of the simplices on which ξ_k is affine.
We use a different method to obtain a bounded aspect ratio of all the simplices. We choose a subdivision T_k of Q (recall that Q is a bounded polyhedron) into simplices. On each simplex T ∈ T_k, we define ξ_T : T → R as the affine interpolant of ϕ in the vertices of T. The function ξ_{T_k} is obtained by combining ξ_T for all T ∈ T_k, see (3.1) below. The advantage of this approach is that the approximation quality of ξ_k can be controlled by the quality of the subdivision, which is measured by the aspect ratio

ρ(T) := diam(B_T) / diam(T),

where B_T is the largest ball contained in T ∈ T_k, see [Brenner, Scott, 2008, Def. (4.2.16) and Eq. (4.4.16)].
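For triangles, the aspect ratio diam(B_T)/diam(T) is easy to compute, since the inradius equals the area divided by the semiperimeter. The sketch below (our own helper, not taken from the paper) also illustrates that the midpoint refinement discussed later produces children with the same ratio:

```python
import math

def aspect_ratio(a, b, c):
    # rho(T) = diam(B_T) / diam(T) for the triangle with vertices a, b, c;
    # B_T is the inscribed circle with radius r = area / s (s = semiperimeter),
    # and diam(T) is the longest edge
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    la, lb, lc = dist(b, c), dist(a, c), dist(a, b)
    s = (la + lb + lc) / 2.0
    area = math.sqrt(max(s * (s - la) * (s - lb) * (s - lc), 0.0))  # Heron
    return 2.0 * (area / s) / max(la, lb, lc)
```

For an equilateral triangle the ratio is 1/√3 ≈ 0.577, and the four triangles obtained from the edge midpoints are similar to the parent, hence have the same aspect ratio.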
If (β_k, y_k, u_k) is feasible for (OVR), then it is a global solution of (OVR) (and, thus, of (UL)) and the algorithm terminates. Otherwise, we construct T_{k+1} from T_k by a refinement of T̃_k such that vol(T) ≤ q · vol(T̃_k) and ρ(T) ≥ ρ for all T ∈ T_{k+1} \ T_k. Set k := k + 1 and go to (S2).
We mention that our approach does not require continuity of ξ_{T_k}. Therefore, we do not need any special assumptions on the subdivision; in particular, we allow for hanging nodes. In fact, it is enough to require that the elements of T_k cover Q. Therefore, if we have two elements T, S ∈ T_k with T ∩ S ≠ ∅, the values of ξ_T and ξ_S may not agree on T ∩ S. For the definition of ξ_{T_k} : Q → R, we choose

ξ_{T_k}(β) := max { ξ_T(β) | T ∈ T_k, β ∈ T }. (3.1)

This definition of ξ_{T_k} ensures upper semicontinuity.
The main idea in Algorithm 1 is to solve (OVR(ξ)) with ξ = ξ_{T_k} and to successively refine a simplex on which a solution is found. In order for Algorithm 1 to be well defined, we need to guarantee the existence of global minimizers of (OVR(ξ, T)). This can be shown by the direct method of the calculus of variations. The boundedness of β follows from β ∈ T, and the boundedness of (y, u) follows from f(β, y, u) ≤ ξ_T(β), cf. Assumption 2.1(g).
Under very mild assumptions we can show the convergence towards global minimizers.
Theorem 3.1. Algorithm 1 either stops at a global solution of (OVR) or the computed sequence (β_k, y_k, u_k) contains a subsequence converging strongly in R^n × Y × U to a global solution of (OVR). If (OVR) has a unique global solution (β̄, ȳ, ū), then the entire sequence (β_k, y_k, u_k) converges strongly to (β̄, ȳ, ū).
Proof. The value function ϕ is convex and therefore ξ_{T_k}(β) ≥ ϕ(β). Thus, the feasible set of (OVR(ξ_{T_k})) contains the feasible set of (OVR). If the solution (β_k, y_k, u_k) of (OVR(ξ_{T_k})) is feasible for (OVR), it is globally optimal for (OVR). Hence, the stopping criterion of the algorithm ensures that (β_k, y_k, u_k) is globally optimal for (OVR). It remains to discuss the case where Algorithm 1 does not terminate. We denote by (β̄, ȳ, ū) a global solution of (OVR). Then

F(β_k, y_k, u_k) ≤ F(β̄, ȳ, ū) (3.2)

by the same argument. The feasible set Q is compact by Assumption 2.1(b). This implies the existence of N ∈ R with ϕ(β) ≤ N for all β ∈ Q. Therefore, the resulting estimate (where we used (2.5) in the last step) together with the boundedness of u_{β_k} shows the boundedness of u_k in U. The boundedness of y_k in Y follows from the properties of the linear operators A and B. Therefore, the sequence (β_k, y_k, u_k) is bounded by a constant M ≥ 0 and contains a weakly convergent subsequence (without relabeling) (β_k, y_k, u_k) ⇀ (β̂, ŷ, û). In particular, one has the strong convergence β_k → β̂ in R^n. In order to estimate the distance between ϕ and its interpolant ξ_{T_k}, we use the interpolation error estimate [Brenner, Scott, 2008, Theorem 4.4.20] (the required condition [Brenner, Scott, 2008, (4.4.16)] is satisfied due to (S3) in Algorithm 1). We apply this result (for polynomial degree one with m = 2, s = 0, p = ∞) on each simplex T ∈ T_k and obtain

‖ϕ − ξ_T‖_{L^∞(T)} ≤ C_ρ diam(T)² |ϕ|_{W^{2,∞}(T)}, (3.3)

where C_ρ > 0 is a constant that depends on the regularity parameter ρ. Corollary 2.7 provides an upper bound for |ϕ|_{W^{2,∞}(T)}. We want to apply (3.3) for T̃_k ∈ T_k, where T̃_k is chosen as in the algorithm, and also intend to show diam(T̃_k) → 0. We will use the relation between diameter and volume given by the aspect ratio of the simplices and argue by contradiction. We assume that v := lim sup_{k→∞} vol(T̃_k) > 0.
Thus the set T̃_0 := { T̃_k | k ∈ N, vol(T̃_k) ≥ v } is infinite. Now there has to be at least one simplex T_0 ∈ T_1 that contains infinitely many simplices from T̃_0, i.e., the set T̃_1 := {T ∈ T̃_0 | T ⊂ T_0} is infinite. These simplices are refined at least once and thus we have vol(T) ≤ q vol(T_0) for all T ∈ T̃_1. Again, one simplex in T̃_1 has to contain infinitely many of the simplices from T̃_1 and we can repeat the above argument. This leads to a contradiction: after l repetitions, the volume of the simplices is bounded from above by q^l vol(T_0), and since q < 1 this contradicts the lower bound v > 0. Hence, we have shown vol(T̃_k) → 0. Using the bound on the aspect ratio, this implies diam(T̃_k) → 0. Indeed, ρ diam(T̃_k) ≤ diam(B_{T̃_k}) and the volume of the ball B_{T̃_k} is bounded by vol(T̃_k) → 0. Now we are in a position to apply (3.3) on T̃_k. This yields

f(β̂, ŷ, û) ≤ lim inf_{k→∞} f(β_k, y_k, u_k) ≤ lim_{k→∞} ξ_{T̃_k}(β_k) = ϕ(β̂). (3.4)

Note that we have used the sequential weak lower semicontinuity of f, which follows from convexity and continuity in Assumption 2.1(e). Thus, (3.4) yields feasibility of (β̂, ŷ, û) for (OVR). Similarly, F is sequentially weakly lower semicontinuous. Therefore, we can pass to the limit k → ∞ in (3.2) and obtain F(β̂, ŷ, û) ≤ F(β̄, ȳ, ū). This shows that (β̂, ŷ, û) is a global solution for (OVR).
Next, we prove the strong convergence of y_k and u_k. Strong convergence of the control u_k can be obtained by exploiting the quadratic growth condition from Lemma 2.2. Note that y_k = S(u_k) by feasibility of (β_k, y_k, u_k) for (OVR(ξ, T̃_k)). Thus, Lemma 2.2 and the Lipschitz continuity of Ψ yield the strong convergence of the controls. The continuity of the solution operator S now implies strong convergence of the states.
If the solution to (OVR) is unique, the convergence of the entire sequence follows from a usual subsequence-subsequence argument.
An important ingredient of Algorithm 1 is the refinement of the simplices in (S3) such that the properties involving the constants q and ρ are obtained. In the two-dimensional case Q ⊂ R², this can be done by splitting the triangle T̃_k into 4 similar triangles by using the midpoints of the edges. However, already in three dimensions this is not straightforward, since a general tetrahedron cannot be divided into similar tetrahedra.
In particular, a regular tetrahedron cannot be split into smaller regular tetrahedra.One, however, can use hypercubes to construct a method of refinement that maintains a bounded aspect ratio.
Proof. Let S_n denote the set of permutations of {1, 2, . . ., n}. We consider the hypercube [0, 1]^n and a permutation π ∈ S_n. Then we define the simplex

T_π := { x ∈ [0, 1]^n | x_{π(1)} ≤ x_{π(2)} ≤ . . . ≤ x_{π(n)} }.

For each point x in the hypercube there exists at least one permutation π for which the definition of T_π is consistent with the "≤"-ordering of the components of x, i.e., x ∈ T_π. Therefore, ∪_{π ∈ S_n} T_π = [0, 1]^n. If we consider a point x ∈ [0, 1]^n with x_i ≠ x_j for all i ≠ j, then there exists only one permutation π such that x ∈ T_π, since the components of x have a uniquely determined order. Furthermore, those points are dense in [0, 1]^n and this implies that two simplices constructed with two different permutations cannot have an n-dimensional intersection. Moreover, different simplices T_π can be matched by a permutation of the coordinates and this implies that the volume of each T_π is equal to 1/n! and the aspect ratio ρ(T_π) is independent of π.
The hypercube can be split into 2^n smaller cubes. By dividing these smaller cubes again into simplices, we arrive at the decomposition

[0, 1]^n = ∪_{t, π} T_π^t  with  T_π^t := (1/2) T_π + t,

where we consider all possible t ∈ {0, 0.5}^n and π ∈ S_n. We observe that these simplices are translated and scaled versions of T_π. In particular, T_π^t = (1/2) T_π + t implies vol(T_π^t) = 2^{−n} vol(T_π) = 2^{−n}/n! and ρ(T_π^t) = ρ(T_π). We argue that for all π ∈ S_n and t ∈ {0, 0.5}^n, there exists π̃ ∈ S_n with T_π^t ⊂ T_{π̃}. Indeed, for x ∈ T_π^t, the coordinates x_i with t_i = 0 are smaller than (or equal to) the coordinates x_j with t_j = 0.5. Further, we have x_{π(i_1)} ≤ x_{π(i_2)} if t_{π(i_1)} = t_{π(i_2)} and i_1 ≤ i_2. Thus, we can construct π̃ by first taking the indices π(i) with t_{π(i)} = 0 and afterwards the indices π(j) with t_{π(j)} = 0.5. Due to vol(T_π^t) = 2^{−n} vol(T_π), this implies that every T_π can be divided into 2^n smaller simplices T_{π^{(i)}}^{t^{(i)}} with i = 1, . . ., 2^n. Again, these smaller simplices have the same aspect ratio as T_π.
Repeating this subdivision proves the assertion in the case that all elements of T_1 are of the form T_π. In the general case, we map each simplex T ∈ T_1 to T_π for some fixed π ∈ S_n by an (invertible) affine transformation a : T → T_π. The first part of the proof showed that T_π can be divided repeatedly into smaller simplices. In each subdivision step, the volume is scaled down by 2^{−n} whereas the aspect ratio is constant. By applying the inverse transformation a^{−1}, we get a subdivision of T. The ratio of volumes is invariant w.r.t. the affine transformation a^{−1}, thus we can take q = 2^{−n}. It remains to study the effect of the affine transformation a^{−1} on the aspect ratio. Every simplex that is the result of repeated refinement of T has the form a^{−1}(T̃), where T̃ ⊂ T_π is a simplex which has the same aspect ratio as T_π. We denote the largest balls in T̃ and a^{−1}(T̃) by B_{T̃} and B_{a^{−1}(T̃)}. The ellipsoid a^{−1}(B_{T̃}) is contained in a^{−1}(T̃) and it contains a ball of diameter θ_min diam(B_{T̃}), where θ_min is the smallest singular value of the matrix a'(0)^{−1}. Similarly, we have diam(a^{−1}(T̃)) ≤ θ_max diam(T̃), where θ_max is the largest singular value of the matrix a'(0)^{−1}. This yields the estimate

ρ(a^{−1}(T̃)) ≥ (θ_min / θ_max) ρ(T̃) = (θ_min / θ_max) ρ(T_π) =: ρ_T.

Thus, the aspect ratio of every simplex that is the result of repeated refinement of T can be bounded from below by ρ_T. Since T_1 is finite, we can choose ρ = min{ρ_T | T ∈ T_1} > 0.
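The Kuhn-type construction in this proof can be sketched directly (a small verification script under the stated conventions; indices are 0-based here):

```python
from itertools import permutations, product

def kuhn_vertices(perm):
    # vertices of T_perm = {x in [0,1]^n : x[perm[0]] <= ... <= x[perm[n-1]]}:
    # start at the origin and switch the coordinates on from the top of the
    # ordering, i.e. perm[n-1] first, then perm[n-2], and so on
    n = len(perm)
    v = [0.0] * n
    verts = [tuple(v)]
    for i in reversed(range(n)):
        v[perm[i]] = 1.0
        verts.append(tuple(v))
    return verts

def in_simplex(x, perm):
    # membership test for the closed simplex T_perm via coordinate ordering
    return all(x[perm[i]] <= x[perm[i + 1]] for i in range(len(perm) - 1))

def refine(perm, t):
    # child simplex (1/2)*T_perm + t for a shift t in {0, 0.5}^n, together
    # with a parent permutation containing the child: first the indices with
    # t = 0, then those with t = 0.5, each group in the order given by perm
    verts = [tuple(0.5 * c + s for c, s in zip(v, t))
             for v in kuhn_vertices(perm)]
    parent = tuple(i for i in perm if t[i] == 0.0) + \
             tuple(i for i in perm if t[i] == 0.5)
    return verts, parent
```

Running the membership test over all permutations confirms that a point with pairwise distinct coordinates lies in exactly one T_π, and that every child simplex is contained in some parent T_{π̃}, exactly as used in the proof.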
Remark 3.3. The refinement technique of Lemma 3.2 always generates hanging nodes. The presented method is consistent with splitting a triangle into 4 similar parts using the midpoints of the edges. In higher dimensions there might exist more advanced methods.
Since Algorithm 1 only requires a bound on the aspect ratio, we can use the simple strategy from Lemma 3.2.
After we have proven the convergence of Algorithm 1, we want to get an estimate on the convergence speed.We establish a preliminary result on the error in the upper-level objective induced by the approximation ξ T of ϕ.
Lemma 3.4. Let T be a subdivision of Q. For T ∈ T and any feasible point (β, y, u) of (OVR(ξ, T)) we have the stated estimate, where (y_β, u_β) is the solution of the lower-level problem associated with the parameter β, see Definition 2.3. Here, C_ρ is as in (3.3) and M is the constant defined in the proof. The constant M does not depend directly on T but only on ρ(T).
Finding global solutions of IOCs · Friedemann, Harder, Wachsmuth

Proof. We use the quadratic growth condition from Lemma 2.2. Next, we apply the interpolation estimate (3.3) to get (3.9). In order to apply the Lipschitz assumption from Assumption 2.1, we define the constant M, where M_Ψ is given in Corollary 2.5. Due to (3.9), all quantities are bounded by M. Thus, the claim follows.

Theorem 3.5. Let T be a subdivision of Q and suppose that the upper-level objective functional satisfies a quadratic growth condition (3.10) at a solution (β̄, ȳ, ū) of (OVR) with some constant G > 0. Let T ∈ T be an element satisfying the condition (3.11). Then, for any feasible point (β, y, u) of the relaxed problem (OVR(ξ, T)) we have the estimate (3.12). The constants appearing in (3.11) have the same meaning as in Lemma 3.4.
Remark 3.6. We give some interpretation of Theorem 3.5. Let (β̄, ȳ, ū) be a solution to (OVR) satisfying the growth condition (3.10). Let T ∈ T satisfy (3.11) and let (β, y, u) be a feasible point of (OVR(ξ, T)). Further, let T̄ ∈ T be a simplex with β̄ ∈ T̄. Then, a solution (β_{T̄}, y_{T̄}, u_{T̄}) of (OVR(ξ, T̄)) satisfies F(β_{T̄}, y_{T̄}, u_{T̄}) < F(β, y, u). Hence, Algorithm 1 will never refine the simplex T and, consequently, this simplex will be ignored in the subsequent iterations of the algorithm.
Theorem 3.5 also has a quantitative implication. We consider a subdivision of Q into simplices of diameter h. According to (3.11), the minimizer β̄ cannot occur in simplices T with h < C dist(T, β̄)², with some constant C > 0. That is, we only have to consider simplices with dist(T, β̄) ≤ (h/C)^{1/2}. The number of simplices satisfying this condition is roughly of the order h^{−n/2}. If we are able to improve (3.11) to diam(T) < C dist(T, β̄)^α for some α ∈ [1, 2), see the discussion below, this number of simplices improves to h^{−n(1−1/α)}. In particular, in the case α = 1, we expect a constant number of simplices.
Remark 3.7. There are two possibilities to improve condition (3.11). First, one can exploit a stronger growth condition for the upper-level objective functional. In particular, α = 1 might be possible if β̄ is located on the boundary of Q or if the reduced objective is non-smooth at β̄.
Second, we can improve Theorem 3.5 if F'(β̄, ȳ, ū) = 0. For simplicity, we discuss the case that F is quadratic; in particular, the second derivative is constant. Together with the Lipschitz continuity of F and Ψ (see Corollary 2.5), we readily obtain an estimate. Using this estimate and (3.9) in (3.14), and then using the result in (3.12), we see that (3.11) can be replaced by diam(T) < c dist(T, β̄) for some c > 0. Note that F'(β̄, ȳ, ū) = 0 is highly restrictive. However, the positive influence on the convergence speed can already be expected if the first derivative of F is close to zero in the solution. The approach can be applied to non-quadratic objective functionals F by replacing (3.14) by a Taylor expansion and requiring that F'' is bounded on bounded subsets.
Algorithm 1 can still be sped up substantially without additional restrictions. In (S3), we have to evaluate ϕ(β_k), and for this purpose we calculate the lower-level solution (y_{β_k}, u_{β_k}) = Ψ(β_k). The value F(β_k, y_{β_k}, u_{β_k}) at this feasible point of (OVR) is an upper bound for the minimal objective value of (OVR). On the other hand, the computed values F(β_T, y_T, u_T) for T ∈ T are lower bounds for the possible objective value of (OVR) restricted to T. Hence, all elements T ∈ T with F(β_T, y_T, u_T) exceeding this upper bound cannot contain a solution of (OVR) and can be ignored in later iterations. Furthermore, the simplices can be sorted by F(β_T, y_T, u_T) and multiple simplices may be refined in each iteration. This results in a larger number of auxiliary problems which have to be solved in the next iteration (recall that (OVR(ξ, T)) has to be solved on refined elements only). These problems are independent of each other and can be solved in parallel.
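The element-discarding rule just described can be sketched as a plain filter on bounds (an illustrative helper; the cell bookkeeping in an actual implementation may of course differ):

```python
def prune_and_sort(cells, upper_bound):
    # cells: (lower_bound, cell) pairs, where lower_bound stands for the
    # optimal value F(beta_T, y_T, u_T) of the subproblem on cell T, and
    # upper_bound for F evaluated at a lower-level solution, i.e. at a
    # feasible point of (OVR). Cells whose lower bound exceeds the upper
    # bound cannot contain a global solution; the survivors are sorted,
    # most promising first, so several can be refined (in parallel).
    kept = [(lb, cell) for lb, cell in cells if lb <= upper_bound]
    kept.sort(key=lambda pair: pair[0])
    return kept
```

This is the usual branch-and-bound pruning pattern: the lower bounds come from the relaxed subproblems, the upper bound from any feasible point.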
Finally, we demonstrate that in most cases, the value-function constraint in (OVR(ξ, T)) will be satisfied with equality. To study the issue, we introduce the problem

min_{β,y,u} F(β, y, u)  s.t.  β ∈ Q,  Ay − Bu = 0,  u ∈ U_ad. (3.15)

This problem is a relaxation of (OVR), since we have neglected the optimality of (y, u) for the lower level. We expect that this problem has a smaller optimal value than (OVR).
Lemma 3.8. Suppose that the infimal value of (3.15) is smaller than the infimal value of (OVR). Let (β_k, y_k, u_k) be defined as in Algorithm 1 (S2). Then, the constraint f(β_k, y_k, u_k) ≤ ξ_{T_k}(β_k) is satisfied with equality for all k large enough and for which ξ_{T_k} is continuous at β_k.
Proof. Let (β̃, ỹ, ũ) be a global solution of (3.15). Note that global solutions (β̄, ȳ, ū) of (OVR) are not globally optimal for (3.15). The construction of the sequence (β_k, y_k, u_k) according to Algorithm 1 yields a monotonically increasing sequence of objective values F(β_k, y_k, u_k). We argue by contradiction and assume that f(β_k, y_k, u_k) < ξ_{T_k}(β_k). We consider the point (β_s, y_s, u_s) := (β_k, y_k, u_k) + s ((β̃, ỹ, ũ) − (β_k, y_k, u_k)) and check that it is a feasible point of (OVR(ξ_{T_k})) for s small enough. The constraint Ay = Bu is linear and the admissible sets Q and U_ad are convex. Moreover, since f is continuous (see Assumption 2.1) and since ξ_{T_k} is continuous by assumption, we have f(β_s, y_s, u_s) ≤ ξ_{T_k}(β_s) for all s ∈ [0, ε] for some ε > 0. Now the convexity of the upper-level objective functional F (see Assumption 2.1(e)) implies F(β_s, y_s, u_s) < F(β_k, y_k, u_k) for all s ∈ (0, ε]. This contradicts the optimality of (β_k, y_k, u_k) from Algorithm 1 (S2).
Note that the piecewise linear function ξ T k is continuous if the triangulation T k does not possess hanging nodes.Otherwise, it might be discontinuous at all facets containing hanging nodes.

Penalty approach
The subproblems (OVR(ξ, T)) presented in Algorithm 1 are already subject to convex constraints; however, the nonlinear inequality constraint f(β, y, u) ≤ ξ(β) may still introduce difficulties when implementing the solution algorithm. In particular, this constraint is of a rather unusual form in an optimal control context, see Section 5. Using a penalty method for this complicated constraint, the treatment of the subproblems (OVR(ξ, T)) can be simplified, since this inequality constraint is incorporated into the objective functional. Any additional error that is introduced by the penalty approach has to be compared to the error induced by the relaxation of the problem with the affine interpolation of the optimal-value function.
By replacing the subproblems in Algorithm 1 with a penalty approach, we arrive at Algorithm 2, for which we now provide some further comments. In a classical penalty method the penalty parameter depends only on the iteration counter k. In Algorithm 2, we allow an additional dependence on the simplex T. Indeed, if γ_{k,T} is independent of k, it is sufficient to solve the auxiliary problems (OVRP(T, γ_{k,T})) only on the new cells T ∈ T_{k+1} \ T_k. Otherwise, we would need to solve these problems on all cells in each iteration. The stopping criterion in (S3) is justified in the first part of the proof of the upcoming Theorem 4.2.
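To fix ideas, the bookkeeping of Algorithm 2 can be sketched on a one-dimensional toy instance. Everything problem-specific here is a hypothetical stand-in: F plays the role of the upper-level objective, and solve_subproblem replaces the penalized PDE subproblem (OVRP(T, γ_{k,T})) by an exact minimization over the cell; only the structure (one solve per new cell, vertex-based upper bounds, refinement of the best cell by a factor q) mirrors the algorithm.

```python
def F(b):                      # stand-in for the upper-level objective
    return (b - 0.3) ** 2

def solve_subproblem(a, b):    # stand-in for (OVRP(T, gamma)): exact cell minimum
    beta = min(max(0.3, a), b) # project the unconstrained minimizer onto [a, b]
    return F(beta), beta

def algorithm2(q=0.5, tol=1e-9, max_iter=200):
    # each cell is solved exactly once; its value is a lower bound on the cell
    cells = {(0.0, 1.0): solve_subproblem(0.0, 1.0)}
    upper = min(F(0.0), F(1.0))           # vertices give upper bounds
    for _ in range(max_iter):
        # select the cell with the smallest (relaxed) objective value
        (a, b), (lower, beta) = min(cells.items(), key=lambda it: it[1][0])
        upper = min(upper, F(a), F(b))
        if upper - lower <= tol:
            break
        # refine the selected cell: vol(child) <= q * vol(parent)
        del cells[(a, b)]
        m = a + q * (b - a)
        cells[(a, m)] = solve_subproblem(a, m)
        cells[(m, b)] = solve_subproblem(m, b)
    return beta, lower, upper

beta, lower, upper = algorithm2()
```

In this toy the gap between the vertex-based upper bound and the cell-based lower bound shrinks as the cell containing the global minimizer is refined, which is exactly the termination mechanism exploited by the bounds in Section 6.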
Lemma 4.1. Let the penalty function P: R → R be non-constant, non-decreasing and convex. Then, for every simplex T ⊂ Q and every γ_{k,T} > 0, the problem (OVRP(T, γ_{k,T})) possesses a solution.
Proof. From the monotonicity and convexity of P, we get P(s) → ∞ as s → ∞.
For a minimizing sequence (β_k, y_k, u_k), the boundedness of β_k follows from β_k ∈ T. Since F is bounded from below by Assumption 2.1(e) and since γ_{k,T} > 0, the penalized objective is bounded from above. Due to the properties of P, the sequence f(β_k, y_k, u_k) is bounded from above. Thus, the boundedness of (y_k, u_k) follows from Assumption 2.1(g). Now, the remaining part of the proof is clear since the objective is continuous and convex, hence weakly sequentially lower semicontinuous.

Algorithm 2 (Computation of global solutions to (UL) with penalty approach).
(S1) Let T_1 be a subdivision of Q, select parameters q, ρ ∈ (0, 1) and a non-decreasing penalty function P: R → R with P(0) = 0, and set k := 1.
(S2) Solve the subproblems (OVRP(T, γ_{k,T})) on the new cells of T_k.
(S3) Select T̂_k as a cell whose subproblem attains the smallest objective value, with corresponding solution (β_k, y_k, u_k). If f(β_k, y_k, u_k) = ϕ(β_k), then (β_k, y_k, u_k) is a global solution of (OVR) (and, thus, of (UL)) and the algorithm terminates. Otherwise, we construct T_{k+1} from T_k by a refinement of T̂_k such that vol(T) ≤ q · vol(T̂_k) and ρ(T) ≥ ρ for all T ∈ T_{k+1} \ T_k. Set k := k + 1 and go to (S2).

Standard penalization
We first prove the convergence of Algorithm 2 for a typical penalty function P .
Theorem 4.2. Let the penalty function P: R → R be monotone and convex such that P(s) = 0 for all s ≤ 0 and P(s) > 0 for all s > 0. If γ_{k,T̂_k} → ∞, Algorithm 2 either stops at a global solution of (OVR) or the computed sequence (β_k, y_k, u_k) contains a subsequence converging strongly in R^n × Y × U to a global solution of (OVR). If (OVR) has a unique global solution (β̄, ȳ, ū), then the entire sequence (β_k, y_k, u_k) converges strongly to (β̄, ȳ, ū).
Proof. A global solution (β̄, ȳ, ū) to (OVR) is feasible for (OVRP(T, γ_{k,T})) if β̄ ∈ T. By definition of (β_k, y_k, u_k) and the assumed properties of the penalty function P, one obtains the estimate (4.1). If Algorithm 2 terminates in (S3), then the condition f(β_k, y_k, u_k) = ϕ(β_k) implies feasibility of (β_k, y_k, u_k) for (OVR), while (4.1) ensures global optimality.
It remains to check the case that Algorithm 2 does not terminate. From (4.1) and Assumption 2.1(e) we get a constant C ≥ 0 bounding the penalized objective. Using that P is non-decreasing and that ξ_{T̂_k}(β_k) is bounded from below (since ϕ is bounded from below on Q), we get that f(β_k, y_k, u_k) is bounded from above. From Lemma 2.2 we obtain a corresponding estimate. Since f is bounded from below and since u_{β_k} is bounded by Corollary 2.5, we obtain the boundedness of u_k in U. The boundedness of the solution operator S then implies boundedness of the state y_k = S u_k in Y. Thus, the sequence (β_k, y_k, u_k) is bounded and contains a weakly convergent subsequence (without relabeling) (β_k, y_k, u_k) ⇀ (β̂, ŷ, û). The parameter β_k converges strongly because β ∈ Q ⊂ R^n is finite dimensional. It remains to check optimality of the weak limit (β̂, ŷ, û) and the strong convergence.
Analogously to Theorem 3.1, the usual subsequence–subsequence argument can be used to obtain strong convergence of the entire sequence if the solution to (OVR) is unique.
Remark 4.3. We observe from Theorem 4.2 that it is sufficient for the penalty parameter γ_{k,T} to depend solely on the simplex T. A possible choice is γ_{k,T} = υ(diam(T)) with a function υ satisfying υ(t) → ∞ as t → 0. A direct benefit is that the solution of the subproblem on a fixed simplex is then independent of the iteration and only needs to be computed once, as in Algorithm 1.
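As a concrete (hypothetical) instance of such a rule, υ(t) = 1/t satisfies υ(t) → ∞ as t → 0; note that this particular choice also satisfies γ_T · diam(T)² = diam(T) → 0, as required later for the direct penalization in Corollary 4.4.

```python
def upsilon(t):
    # one admissible choice with upsilon(t) -> infinity as t -> 0
    return 1.0 / t

def gamma_for_simplex(vertices):
    # gamma depends on the simplex only through its diameter (Remark 4.3)
    diam = max(sum((a - b) ** 2 for a, b in zip(v, w)) ** 0.5
               for v in vertices for w in vertices)
    return upsilon(diam)

# example: unit right triangle with diameter sqrt(2)
gamma = gamma_for_simplex([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
```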

Direct penalization
The problem (OVRP(T, γ_{k,T})) can be simplified further if, instead of a penalty function as described in Theorem 4.2, the direct penalization P = Id is considered. It is clear that we cannot use P = Id as a penalty function for a general optimization problem: this function would reward overachieving the penalized constraint. However, our constraint residual f(β, y, u) − ξ_T(β) cannot be arbitrarily negative, and this renders the usage of P = Id possible. This choice has implications for the choice of the penalty parameter. Since the difference between the lower-level objective functional and the interpolation of the optimal-value function, f(β, y, u) − ξ_T(β), can be negative, arbitrarily increasing γ does not work. The penalty parameter γ needs to be set specifically for each simplex.
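The reason this works can be sketched as follows, assuming (as the role of (3.3) in the proof of Corollary 4.4 suggests) an interpolation error estimate of the form 0 ≤ ξ_T − ϕ ≤ C diam(T)² on T:

```latex
f(\beta, y, u) - \xi_T(\beta)
  \;\ge\; \varphi(\beta) - \xi_T(\beta)
  \;\ge\; -C\,\operatorname{diam}(T)^2
  \qquad\text{for lower-level feasible } (\beta, y, u),
```

and hence the penalty contribution with P = Id obeys

```latex
\gamma_T \bigl( f(\beta, y, u) - \xi_T(\beta) \bigr)
  \;\ge\; -C\,\gamma_T \operatorname{diam}(T)^2 \;\to\; 0,
```

so the artificial gain from overachieving the constraint vanishes under the parameter choice of the upcoming corollary.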
Corollary 4.4. We consider Algorithm 2 with P = Id and assume that the penalty parameters satisfy γ_{T̂_k} → ∞ and γ_{T̂_k} diam(T̂_k)² → 0 as k → ∞. Then, (β_k, y_k, u_k) contains a strongly convergent subsequence and all accumulation points are globally optimal for (OVR). If (OVR) admits a unique global minimizer (β̄, ȳ, ū), then the entire sequence (β_k, y_k, u_k) converges strongly towards this minimizer.
Proof. The argumentation follows the lines of the proof of Theorem 4.2, so we only comment on the differences. The interpolation error estimate (3.3) provides a lower bound (4.3) for the violation of the constraint.
When using P = Id, an upper bound (4.4) follows as in (4.2). We can now argue as in Theorem 4.2 and obtain (β_k, y_k, u_k) → (β̂, ŷ, û) along a subsequence, where (β̂, ŷ, û) is a feasible point of (OVR). In order to obtain optimality of (β̂, ŷ, û), we combine (4.3) and (4.4). The remaining part of the proof follows the proof of Theorem 4.2.
The next lemma addresses the continuous dependence of the solution on the penalty parameter.
Lemma 4.5. We suppose that F(·, S(u), u) is strongly convex w.r.t. β with constant µ_β > 0, independent of the control u. Then, (OVRP(T, γ)) has a unique solution (β_γ, y_γ, u_γ) for all γ > 0. Further, let 0 < γ_a ≤ γ_T < ∞ and γ_a ≤ γ; then the solutions depend Lipschitz continuously on the penalty parameter.

Proof. The existence of a solution to (OVRP(T, γ)) follows from Lemma 4.1. For γ_a ≤ γ, the strong convexity of f implies that the reduced objective of (OVRP(T, γ)) is strongly convex w.r.t. u with constant γ_a µ on the feasible set. This gives uniqueness of the state y_γ = S(u_γ) and of the control u_γ. With the additional assumption on F, we get the uniqueness of β_γ.
Next, we want to apply Lemma 2.4 to the state-reduced variant of (OVRP(T, γ)), i.e., we apply the corresponding setting. Assumption 2.1 ensures that the assumptions of Lemma 2.4 are satisfied; thus, Lemma 2.4 yields an estimate. Now, the derivative J_x((β, u), γ) contains two components, so the above estimate implies the claim.
The problem (OVRP(T, γ_T)) is a relaxation of (OVR(ξ, T)); consequently, its objective functional attains a smaller minimal value, which is a lower bound for the minimal objective value of (OVR(ξ, T)). Since this lower bound depends on the chosen penalty parameter γ_T, we try to adjust this parameter to obtain the largest possible lower bound.
We will now show that it is reasonable to aim for a choice of the penalty parameter such that the equality f(β_{γ_T}, y_{γ_T}, u_{γ_T}) = ξ(β_{γ_T}) holds for the solution (β_{γ_T}, y_{γ_T}, u_{γ_T}) of (OVRP(T, γ_T)). In the expected case where no solution to (3.15) is feasible for (OVR), this specific penalty parameter results in the largest possible minimal objective value for (OVRP(T, γ_T)).
Lemma 4.6. Let the state-reduced functional F be strongly convex with respect to β with constant µ_β independent of the control u. Let a simplex T be given and, again, P = Id. Further, we assume the existence of β ∈ T with ϕ(β) < ξ_T(β). For γ ≥ 0, we denote a solution to (OVRP(T, γ)) by (β_γ, y_γ, u_γ).
The existence of β ∈ T with ϕ(β) < ξ_T(β) is equivalent to ϕ not being affine on T. Thus, this assumption is not very restrictive.

Proof.
In general, it is not possible to check which case of Lemma 4.6 applies. However, the proof suggests that after solving (OVRP(T, γ)) the value f(β_γ, y_γ, u_γ) − ξ_T(β_γ) can be checked to infer whether the choice of the penalty parameter γ was adequate, too small, or too large. Furthermore, when splitting the simplices in Algorithm 2, the approximation ξ_T of the optimal-value function ϕ cannot increase at any point β ∈ Q. Together with the feasibility of the solutions of the refined problems for the problem on the original simplex T, this yields that the minimal objective value can only remain constant or increase if the same penalty parameter γ_T is used for a subproblem. We therefore suggest starting with γ = 0 and then using a heuristic to find γ_T. The refined problems can inherit the parameter γ_T as a starting value instead of zero. This approach covers both cases of Lemma 4.6 without the need to compute all solutions of (3.15). Once a γ_T is found such that f(β_γ, y_γ, u_γ) − ξ_T(β_{γ_T}) > 0, one can be sure that all subproblems fall under case (b) of Lemma 4.6, because ξ decreases under further refinement of the simplices.
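The suggested heuristic can be sketched as follows. The solver interface is hypothetical: residual(gamma) stands for solving (OVRP(T, γ)) and returning f(β_γ, y_γ, u_γ) − ξ_T(β_γ); we additionally assume, consistent with the discussion above, that increasing γ does not decrease this residual.

```python
def find_gamma(residual, gamma_start=0.0, factor=4.0, max_steps=50):
    # start from the inherited parameter (gamma = 0 on the initial simplex)
    gamma = gamma_start
    if residual(gamma) > 0.0:
        return gamma           # already in case (b) of Lemma 4.6
    gamma = max(gamma, 1.0)
    for _ in range(max_steps):
        if residual(gamma) > 0.0:
            return gamma       # constraint violated: gamma is large enough
        gamma *= factor
    return gamma

# toy monotone residual for illustration (hypothetical)
gamma = find_gamma(lambda g: g - 10.0)
```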
This lemma shows that the problem (OVR(ξ, T)) is equivalent (in some sense) to (OVRP(T, γ_T)) for the "optimal" value of γ_T, cf. Lemma 4.6. In the application we have in mind, the structure of (OVRP(T, γ_T)) is much nicer, since the "complicated" function f appears in the objective and not in the constraints.

Parameter identification in an optimal control problem
In the previous section we discussed how a global minimizer of (OVR) can be found using Algorithm 2. However, so far we did not introduce a solution scheme for the subproblems (OVRP(T, γ_{k,T})). In this section we show that one of the main advantages of the direct penalization (see Section 4.2) is that the semismooth Newton method becomes applicable. This is demonstrated by means of a class of example problems.

Problem formulation and properties
We consider the bilevel optimization problem with the lower-level problem (LL(α)) and the upper-level problem of minimizing F subject to the constraint that (y, u) solves (LL(α)).
We observe that the lower-level objective functional f is not convex with respect to all variables; in particular, Assumption 2.1(e) is not satisfied. Additionally, the corresponding optimal-value function is usually not convex either. As Algorithm 1 depends on the convexity of the optimal-value function, one first has to transform the problem in such a way that the new lower-level objective functional is convex. For this purpose, we consider the simple substitution β_i = 1/α_i. We also define σ_β := σ_α. The upper-level objective and the constraint α ∈ Q_α have to be transformed accordingly. Here we use that for a Banach space Y, the function g: Y → R, y ↦ ½‖y‖²_Y is convex and that for λ > 0 the so-called perspective of g is convex. It is known that the perspective of a convex function is convex (e.g., one can simply generalize the proof of [Dacorogna, Maréchal, 2008, Lemma 2] to Banach spaces). Moreover, convexity is preserved under composition with the affine function y ↦ Cy − y_d. The convexity of f follows. With the above setting and observations, one can show that the transformed problem satisfies Assumption 2.1.
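The perspective construction can be sketched as follows (the elided display in the text may use a different scaling):

```latex
\tilde g(\lambda, y) \;:=\; \lambda\, g\!\left(\frac{y}{\lambda}\right)
  \;=\; \frac{\|y\|_Y^2}{2\lambda}, \qquad \lambda > 0,
```

which is jointly convex in (λ, y). Composing with the affine map y ↦ Cy − y_d preserves convexity, so terms of the form ‖Cy − y_d‖²_Y / (2β_i), as arise, e.g., from substituting α_i = 1/β_i into a term (α_i/2)‖Cy − y_d‖²_Y, are convex for β_i > 0.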

Stationarity system for the direct penalization
Classic choices of the penalty function for (OVRP(T, γ_{k,T})), e.g., P = max(0, ·)², result in subproblems that are difficult to handle; in particular, the optimality system cannot be reformulated as a simple projection formula. We will see that the direct penalization P = Id results in an easy-to-implement solution algorithm for (OVRP(T, γ_{k,T})). Computing the solution of (OVRP(T, γ_{k,T})) requires the construction of ξ and thereby the evaluation of ϕ(β) at certain points. This amounts to solving single-level optimal control problems.
In order to state the stationarity conditions, we first reformulate the condition β ∈ T. Recall that T is a (non-degenerate) simplex. Thus, T can be written as the intersection of n + 1 half-spaces, K_T β ≤ b_T, where K_T ∈ R^{(n+1)×n} is a suitable matrix. Clearly, at most n of these constraints may hold with equality simultaneously, and all constraints that are satisfied with equality are linearly independent. With this reformulation, (OVRP(T, γ_{k,T})) with P = Id takes the form of a convex minimization problem in (β, y, u). The KKT system for (OVRP(T, γ_{k,T})) with direct penalization (P = Id) is given by (5.3), where p ∈ H¹₀(Ω), z ∈ R^{n+1}, and ν ∈ L²(Ω) are the Lagrange multipliers. The vector a_T denotes the derivative of the affine function ξ_T on the simplex T.
The solution and the corresponding multipliers are unique.
Proof. "⇒": We check that the Robinson regularity condition for the reformulated problem is satisfied. The two lines of this condition are independent of each other. By assumption, A is bijective, i.e., A(H¹₀(Ω)) = H⁻¹(Ω). For the second line we recall that the Robinson regularity condition is equivalent to the Mangasarian–Fromovitz condition for standard nonlinear optimization problems, see [Bonnans, Shapiro, 2000, p. 71]. Thus, the second line is satisfied since we have assumed that the simplex T is non-degenerate, i.e., we even have the linear-independence constraint qualification for the system K_T β ≤ b_T. This shows the existence of multipliers, see [Bonnans, Shapiro, 2000, Theorem 3.9]. "⇐": This is clear since (OVRP(T, γ_{k,T})) is a convex problem.
It remains to address the uniqueness. The uniqueness of the solution follows from the strict convexity of the objective. The second line of the KKT system gives uniqueness of the adjoint p, since A is an isomorphism. Similarly, one gets uniqueness of ν from the third line. Regarding the uniqueness of z, we observe that the matrix K_T describing a non-degenerate simplex has rank n, even after removing an arbitrary row. Additionally, there exists at least one inactive constraint, so that z equals zero in the corresponding component. After removing this component from z and the respective column in the first line of (5.3), z is obtained by inverting a square matrix of full rank. Thus, z is unique.
We introduce two auxiliary functions h and ĥ in (5.4). Note that h represents the part of the objective function of (OVRP(T, γ_{k,T})) that does not depend on u.
Recall that the involved operators are bounded and linear and that A is invertible. We define the function W in (5.5), with σ := σ_u + γ_{k,T} σ_l. Now we discuss the relation between the roots of W and the optimality system.
Proof. In view of Lemma 5.1, we have to check that (5.3) is equivalent to W(β, y, u, z, p) = 0.
It is clear that (5.3a), (5.3b) and (5.3d) are equivalent to lines 1, 2 and 5 of (5.5). The complementarity conditions (5.3e) on z and b_T − K_T β can be reformulated via a projection formula. A similar reformulation is standard for treating the gradient equation (5.3c) in combination with the inclusion (5.3f), see [Tröltzsch, 2009, Theorem 2.28]. These two equations are equivalent to the projection formula in line 3 of (5.5). Note that ν does not appear in (5.5), but it is uniquely determined by (5.3c). This shows that the KKT system is equivalent to W(β, y, u, z, p) = 0 and finishes the proof.
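The complementarity reformulation follows the standard pattern: for any c > 0, the conditions z ≥ 0, g ≤ 0, z⊤g = 0 (with g = K_Tβ − b_T) hold if and only if z = max(0, z + c·g) componentwise. A small numerical sanity check of this equivalence (the vectors are arbitrary illustrative data):

```python
import numpy as np

def comp_residual(z, g, c=1.0):
    # vanishes iff z >= 0, g <= 0 and z * g = 0 hold componentwise
    return z - np.maximum(0.0, z + c * g)

# complementary pair: z = 0 wherever g < 0, g = 0 wherever z > 0
z = np.array([0.0, 2.0, 0.0])
g = np.array([-1.0, 0.0, -3.0])
ok = bool(np.allclose(comp_residual(z, g), 0.0))

# violating complementarity (z > 0 and g < 0 simultaneously) is detected
bad = bool(np.linalg.norm(comp_residual(np.array([1.0]), np.array([-2.0]))) > 0)
```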

Semismooth Newton method for the subproblems
We have shown in Lemma 5.2 that we can characterize the solution of the subproblem (OVRP(T, γ_{k,T})) via the nonlinear operator W. An established way to solve problems with this structure is the semismooth Newton method, cf. [Hintermüller, Ito, Kunisch, 2002]. To this end, we verify the Newton differentiability of W and the invertibility of the Newton derivative. In order to state the Newton derivative of W, we define the index sets A_1(β, z), A_2(β, z), A_3(p) and corresponding operators. For i ∈ {1, 2} we write χ_{A_i(β,z)} ∈ R^{(n+1)×(n+1)} for the diagonal matrix whose k-th diagonal entry is 1 if k ∈ A_i(β, z) and 0 otherwise. Similarly, we write χ_{A_3(p)}: L²(Ω) → L²(Ω) for the multiplication operator corresponding to the characteristic function of A_3(p) on the space L²(Ω).
Lemma 5.3. The mapping W is Newton differentiable and a Newton derivative of W at a point (β, y, u, z, p) is given by a block operator.

Proof. To show the Newton differentiability of W, one only has to pay attention to the third and fourth lines, as the others are Fréchet differentiable. For the fourth line one can use that in finite dimensions the composition of Newton differentiable functions is Newton differentiable, cf. [Ulbrich, 2011, Proposition 2.9], and combine this with the fact that max(·, ·) is Newton differentiable, see [Ulbrich, 2011, Proposition 2.26]. Furthermore, [Ulbrich, 2011, Theorem 3.49] can be used to show the Newton differentiability of the third line: if we use m = 3, ψ(s) = min(max(s_1, s_2), s_3), r = r_i = 2, G(p) = ((B*p + σ_u u_m)/σ, u_a, u_b) in the setting of [Ulbrich, 2011, Section 3.3], then the required [Ulbrich, 2011, Assumption 3.32] is satisfied with q_i = q > 2, by the higher regularity of B. Now a Newton derivative can be obtained by direct calculations utilizing the index sets introduced above.
The proof required a norm gap, which was ensured by the higher regularity B ∈ L[H¹₀(Ω), L^q(Ω)] with q > 2; this regularity is intrinsic to our problem setting. It allowed us to prove the Newton differentiability of W in the spaces in which W is defined. In particular, when adapting the algorithm from [Ulbrich, 2011, Algorithm 3.10], see Algorithm 3, this allows the smoothing step to be skipped. The smoothing step is designed to treat the more general case in which Newton differentiability can only be shown by artificially introducing a norm gap, while the boundedness of the inverse of the derivative can only be shown in the original setting (cf. the introduction to Section 3 in [Ulbrich, 2011]). Note that (S3) is well defined as long as β_i is positive, since the function W is only defined for positive β. This, however, does not influence the local convergence of Algorithm 3.
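The mechanics of such a semismooth Newton iteration can be illustrated on a one-dimensional toy residual; this mirrors only the structure of Algorithm 3, not the actual operator W from (5.5), which couples β, y, u, z, p. The Newton "derivative" picks an element of the generalized derivative of max(0, ·), and for piecewise-linear residuals the method even terminates after finitely many steps.

```python
def W(x):
    # piecewise-linear toy residual with root x = 0.5
    return x + max(0.0, x) - 1.0

def dW(x):
    # a Newton derivative: choose an element of the generalized derivative
    return 1.0 + (1.0 if x >= 0.0 else 0.0)

def semismooth_newton(x, tol=1e-12, max_iter=50):
    for it in range(max_iter):
        r = W(x)
        if abs(r) <= tol:
            return x, it
        x -= r / dW(x)   # Newton step with the active-set derivative
    return x, max_iter

x, iters = semismooth_newton(-3.0)
```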

Algorithm 3 (Semismooth Newton method for solving W(β, y, u, z, p) = 0, adapted from [Ulbrich, 2011, Algorithm 3.10]): iterate the Newton step and go to step (S2).
To prove fast convergence of the semismooth Newton method, the uniform invertibility of the Newton derivative W′(β, y, u, z, p) is needed. For this purpose, we convert the Newton derivative into a self-adjoint operator, since this type of operator is easier to handle. To that end, we fix a point (β, y, u, z, p) and use the notation I_1, I_2, I_3, I_4 for the canonical embedding operators that correspond to the index sets introduced above. Here, l_1 denotes the cardinality of A_1(β, z). We mention that the corresponding restriction operators are the adjoints of I_1, I_2, I_3, I_4. We define the linear operator Ŵ and note that Ŵ is self-adjoint. The spaces on which Ŵ operates depend on β, z, p. The next lemma gives a relation between Ŵ and W′(β, y, u, z, p).
Lemma 5.4. Let (β, y, u, z, p) be given.

Proof. The proof can be carried out by direct calculation. We first assume (5.6) to be valid. Computing the application of Ŵ and using the definition of the index sets, we obtain the equivalent expression (5.8), where we used (5.9). Note that the claimed relations I_4 u_1 = I_4 u_2 and −I_2 z_1 = I_2 z_2 follow from the corresponding identity equations. With these relations, we directly get (5.7) from (5.9).
For the other direction, we first get (5.9) directly from (5.7). Then, a comparison with (5.8) yields the equations for β_2, y_2, p_2, and the final expression (5.6) follows by utilizing I_4 u_1 = I_4 u_2 and −I_2 z_1 = I_2 z_2 again.
In order to ensure the uniform invertibility of the operators Ŵ , we state an auxiliary lemma.
Lemma 5.5. Let X, Y be Hilbert spaces and let Â: X → X*, B: X → Y be bounded linear operators. Let D denote the associated saddle-point block operator. Suppose that B is surjective and that Â is coercive on ker B, i.e., there exists a constant γ > 0 such that ⟨Âx, x⟩ ≥ γ‖x‖²_X for all x ∈ ker B. Then D is continuously invertible. Moreover, the estimate ‖D⁻¹‖ ≤ 4c⁵ holds, where c is a suitable constant and γ > 0 is the coercivity constant from above.
Proof. This result follows from [Brezzi, Fortin, 1991, Proposition II.1.3]. Note that we have B(X) = Y and ker B* = {0}.

Lemma 5.6. Let (β, y, u, z, p) be given and suppose that I_1 K_T is surjective, i.e., that the rows of K_T which correspond to the index set A_1(β, z) are linearly independent. Then, the operator W′(β, y, u, z, p) is continuously invertible. Moreover, we have ‖W′(β, y, u, z, p)⁻¹‖ ≤ C for a constant C > 0, which does not depend on β, y, u, z, p but can depend on an upper bound on ‖y‖, on the upper and lower bounds of β, and on K_T, A, B, h, σ, σ_r.
Proof. We start by showing that Ŵ is continuously invertible, which we do using Lemma 5.5. We notice that the operator Ŵ has the required block structure if we choose Â and B accordingly. Since A is invertible and I_1 K_T is surjective by assumption, it follows that B is surjective.
In order to show that Ŵ is continuously invertible, it remains to show that Â is coercive on ker B.
Let (β̂, ŷ, û) ∈ ker B be given. Recall from (5.4) that h(β, y) = ĥ(β, y), that ĥ is convex, and that we can directly calculate its second derivative, which is a diagonal matrix with strictly positive entries if β_i > 0. Therefore, there exists a constant σ_r > 0 for which h″(β, y)(β̂, ŷ)(β̂, ŷ) ≥ σ_r β̂⊤β̂ (5.10) holds, where σ_r depends on the upper bound of β_i. This implies the coercivity of Â on ker B with a suitable constant γ > 0. It follows from Lemma 5.5 that Ŵ is continuously invertible. Because B is surjective, there exists a constant α > 0 such that B_1(0) ⊂ B(B_α(0)). Since there are only finitely many possibilities for I_1, and I_3 is not needed for surjectivity, the constant α can be chosen independently of I_1 and I_3. For Â we note that its norm can be bounded by a constant which can depend on an upper bound on ‖y‖_{H¹₀(Ω)} and a lower bound on β_i.
It follows from Lemma 5.5 that the estimate ‖Ŵ⁻¹‖ ≤ 4c⁵ holds for a suitable constant c > 0 which does not depend on β, y, u, z, p but can depend on an upper bound of ‖y‖_{H¹₀(Ω)}, the lower bound of β_i, and on K_T, A, B, h, σ, σ_r. Next, we combine this result with Lemma 5.4 to show the invertibility of W′(β, y, u, z, p). Let (β_2, y_2, u_2, z_2, p_2) be a right-hand side as in (5.6). Since Ŵ is invertible, Lemma 5.4 yields a unique solution (β_1, y_1, u_1, z_1, p_1) of (5.6). Using the estimate ‖Ŵ⁻¹‖ ≤ 4c⁵ and (5.7), one gets an estimate of the form ‖(β_1, y_1, u_1, z_1, p_1)‖ ≤ C‖(β_2, y_2, u_2, z_2, p_2)‖, where C > 0 is a suitable constant that can depend on c, σ, K_T, B, the upper bound of ‖y‖_{H¹₀(Ω)} and the bounds of β. The constant C, however, does not depend on (β, y, u, z, p) or on any of the embedding operators I_1, I_2, I_3, I_4. Since we can estimate the norm of the unique solution of (5.6) by the norm of the right-hand side, the claimed invertibility and the estimate ‖W′(β, y, u, z, p)⁻¹‖ ≤ C follow.
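The saddle-point mechanism of Lemma 5.5 that drives these invertibility proofs can be illustrated in finite dimensions: the block operator D = [[Â, B⊤], [B, 0]] is invertible although Â is singular on all of X, because Â is coercive on ker B and B is surjective (the matrices are arbitrary illustrative data):

```python
import numpy as np

A_hat = np.diag([0.0, 1.0, 1.0])   # singular on X = R^3 ...
B = np.array([[1.0, 0.0, 0.0]])    # ... but ker(B) = span{e2, e3},
                                   # where A_hat acts as the identity
D = np.block([[A_hat, B.T],
              [B, np.zeros((1, 1))]])
invertible = bool(abs(np.linalg.det(D)) > 1e-12)
```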
Lemma 5.7. Let (β, y, u, z, p) be a root of W. Then there exist a neighborhood of this point and a constant C > 0 such that the Newton derivative is invertible with ‖W′(·)⁻¹‖ ≤ C throughout this neighborhood.

Proof. We want to apply Lemma 5.6. We need to verify that I_1 K_T (which can depend on β and z) is surjective in a neighborhood.
From the definition of W, we get z ≥ 0, K_T β − b_T ≤ 0 and z⊤(K_T β − b_T) = 0; in particular, β ∈ T. Recall that T is a non-degenerate simplex. Thus, at most n constraints in the system K_T β ≤ b_T are active, and these active constraints are linearly independent. Furthermore, if i ∈ {1, …, n + 1} is the index of an inactive constraint, we have z_i = 0 due to the complementarity condition, and therefore i ∈ A_2(β, z) and i ∉ A_1(β, z). Thus, A_1(β, z) contains at most n elements. Therefore, the rows of K_T which correspond to the index set A_1(β, z) are linearly independent, which yields that I_1 K_T is surjective for this particular pair (β, z).
If i ∈ A_2(β, z), then i ∈ A_2(β̂, ẑ) also holds for all (β̂, ẑ) sufficiently close to (β, z). Thus, A_1(β, z) cannot get larger in a neighborhood of (β, z). Hence, the rows of K_T that correspond to A_1(β, z) stay linearly independent in a neighborhood, i.e., I_1 K_T is surjective in a neighborhood of (β, z). To apply Lemma 5.6, we restrict the neighborhood, if necessary, such that the components of β stay bounded away from zero; this guarantees the required lower bound on β_i. The upper bound on ‖y‖_{H¹₀(Ω)} is obtained from the coercivity of f with constant γ_{k,T} µ (cf. Assumption 2.1(g)). Hence, by Lemma 5.6 there exists a constant C > 0 such that ‖W′(β, y, u, z, p)⁻¹‖ ≤ C in the considered neighborhood of (β, y, u, z, p). Now we are ready to state our final theorem, which asserts that Algorithm 3 converges superlinearly.
Theorem 5.8. Let the function W be given as in (5.5). Further, we denote by (β_{k,T}, y_{k,T}, u_{k,T}) the unique global solution of (OVRP(T, γ_{k,T})) and by z_{k,T}, p_{k,T} the corresponding multipliers satisfying (5.3). Then there exists a neighborhood of the point (β_{k,T}, y_{k,T}, u_{k,T}, z_{k,T}, p_{k,T}) such that for all initial values (β_0, y_0, u_0, z_0, p_0) from this neighborhood, the semismooth Newton method from Algorithm 3 either terminates in the i-th step with (β_i, y_i, u_i, z_i, p_i) = (β_{k,T}, y_{k,T}, u_{k,T}, z_{k,T}, p_{k,T}) or generates a sequence that converges q-superlinearly to (β_{k,T}, y_{k,T}, u_{k,T}, z_{k,T}, p_{k,T}).

Proof. We already established that the function W is semismooth at the solution of (OVRP(T, γ_{k,T})), see Lemma 5.3. We have proven in Lemma 5.7 that the Newton derivative from Lemma 5.3 is invertible and that the norm of its inverse is bounded on a neighborhood of the solution. The result is now a direct application of [Ulbrich, 2011, Theorem 3.13].
In particular, we do not need a smoothing step, since the spaces in which W is Newton differentiable coincide with the spaces in which the Newton derivative is uniformly invertible, see Lemmas 5.3 and 5.6.

Numerical experiments
In this section we present an example for Algorithm 2 to illustrate the convergence behavior towards a global minimizer. To this end, we consider a parameter identification problem of the class introduced in Section 5. It turns out that the control constraints are active on parts of the domain for the choice of the parameter β = (0.6, 0.3)⊤. For the upper level we fix the parameters σ_u = 0.05 and σ_β = 10⁻⁵. We also choose β_m := (0.6, 0.3)⊤ and (y_m, u_m) := Ψ((0.6, 0.3)⊤), i.e., the objective value of F_1 is zero for the solution of the lower-level problem with β = β_m. We call this setting the "fully reachable target state". We mention that in the implementation of this setting, the functions y_m, u_m are not the analytical solutions but are computed directly from the finite element solutions of the lower level.
For the setting of this section, Assumption 2.1 is valid. Additionally, for the chosen functionals and parameters we can apply the semismooth Newton method from Section 5.3 to solve the subproblems (OVRP(T, γ_{k,T})). In order to illustrate some fundamental properties of the proposed solution algorithm, we consider two additional problems that differ only in the choice of the objective functional, i.e., the functionals F_2 and F_3 are used instead of F_1. In the second objective functional F_2, the β term enters only as a regularization. This setting is called the "reachable target state". The functional F_3 is set up with desired states ŷ_m and û_m given by ŷ_m: Ω → R, ŷ_m(x) = (x_1 − 1)(x_1 + 1) sin(πx_2), and û_m: Ω → R, û_m(x) = 0.
This state and control have the property that they do not arise as a solution of the lower-level problem. This setting is named the "unreachable target state". We expect a noticeable difference in the convergence speed for the introduced settings, see Remark 3.7.
The refinement of the subdivision is implemented by splitting the triangles at the midpoints of their edges. This refinement procedure is the application of Lemma 3.2 to the two-dimensional case. In this special case we can even guarantee that the diameter of the simplices is halved in each refinement. We initialize Algorithm 2 with the domain Q split into two triangles.
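The refinement rule can be sketched as follows: splitting a triangle at its edge midpoints produces four children, each with exactly half the diameter (and a quarter of the volume) of the parent, so the requirements vol(T) ≤ q · vol(T̂) and ρ(T) ≥ ρ of Algorithm 2 are easy to satisfy.

```python
import itertools, math

def diam(tri):
    # diameter of a triangle = longest pairwise vertex distance
    return max(math.dist(p, q) for p, q in itertools.combinations(tri, 2))

def refine(tri):
    # split at the edge midpoints ("red" refinement)
    a, b, c = tri
    mid = lambda p, q: tuple((pi + qi) / 2 for pi, qi in zip(p, q))
    ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]

T = ((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
children = refine(T)
halved = all(abs(diam(t) - diam(T) / 2) < 1e-12 for t in children)
```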
We use an implementation with the suggested improvements mentioned at the end of Section 3. In each iteration we get a lower bound on the optimal objective value from the element with the lowest objective value of the solution to (OVRP(T, γ_{k,T})). We obtain an upper bound from the vertex with the lowest upper-level objective value. Hence, every element whose lower bound exceeds the current upper bound can be dismissed. For the fully reachable target state, the lower bound in Figure 6.1 is close to zero with respect to machine accuracy, which explains the slightly perturbed behavior.
We observe a stark difference in convergence speed between the settings introduced in this section. Additionally, there is a noticeable difference between the vertex that provides the upper bound and the furthest active vertex. Note that only for the latter is the distance to β̄ guaranteed to be nonincreasing, while the vertex providing the upper bound might be more interesting from a heuristic point of view if one considers a depth-first search. The splitting of the domain can be seen in Figure 6.4. For better visualization in the settings of F_1 and F_2, the algorithm was continued for Figure 6.4 until every element either had a vertex for which the corresponding upper-level objective was close (within 10⁻⁹) to the upper bound or was dismissed. We show the difference between lower and upper bounds for all discussed cases in Figure 6.3.
Finally, we give some explanation for the differences in convergence speed. As discussed in Remarks 3.6 and 3.7, a growth condition for the upper-level objective functional at a solution w.r.t. β allows for an estimate of the convergence speed. This is exactly what we have in the setting of F_1. Thus, we get the estimate from Remark 3.6 and the number of active subproblems does not substantially increase between iterations. For the case of F_2, we have the second case from Remark 3.7, where the derivative of F_2 is close to zero at the solution. This is because the term ‖β‖²_{R^n} enters only as a regularization with a small parameter in the upper-level objective functional; the solution of the parameter estimation problem is still close to (y_m, u_m). For the case of F_3, we no longer have a setting in which we obtain a nice bound on the number of subproblems required to reach a certain accuracy. In particular, the number of active subproblems might increase heavily during the runtime of Algorithm 2. This can be seen well in Figure 6.4. Finally, Figure 6.3 indicates that the important property in the setting of F_3 is that the solution is no longer close to (ŷ_m, û_m), i.e., that the target state is "unreachable", and that the choice of the regularization term is of minor importance regarding the convergence speed in this case.

Assumption 2.1 (Standing assumption). (a) The spaces Y and U are (real) Hilbert spaces. (b) The set Q ⊂ R^n is a nonempty bounded polyhedron, i.e., a nonempty and bounded intersection of finitely many closed half-spaces. We assume that Q possesses a nonempty interior. (c) The set U_ad ⊂ U is nonempty, closed and convex. (d) The operator A ∈ L[Y, Y*] is an isomorphism and B ∈ L[U, Y*] is a bounded linear operator. We denote by S := A⁻¹B ∈ L[U, Y] the control-to-state map.

Figure 6.3: Difference of upper and lower bound for the settings of F_1, F_2 and F_3 w.r.t. the number of solved subproblems. For the setting of F_3, the results for two different regularization terms are displayed.

Figure 6.4: From left to right: progression of the splitting of the domain Q for (OVRP(ξ_T)) for the settings of F_1, F_2 and F_3. Simplices are distinguished by the color of their outline: dismissed (blue), relevant (red), split in the last iteration (yellow), difference of lower and upper bound for the element within 10⁻⁹ (green). The element with the current best objective value is marked with a pink dot.