Convergence of the SQP method for quasilinear parabolic optimal control problems

Based on the theoretical framework recently proposed by Bonifacius and Neitzel (Math Control Relat Fields 8(1):1–34, 2018. https://doi.org/10.3934/mcrf.2018001) we discuss the sequential quadratic programming (SQP) method for the numerical solution of an optimal control problem governed by a quasilinear parabolic partial differential equation. Following well-known techniques, convergence of the method in appropriate function spaces is proven under some common technical restrictions. Particular attention is paid to how the second order sufficient conditions for the optimal control problem and the resulting $L^2$-local quadratic growth condition influence the notion of “locality” in the SQP method. Further, a new regularity result for the adjoint state, which is required during the convergence analysis, is proven. Numerical examples illustrate the theoretical results.

Existence and regularity of their solutions are well understood, first order necessary and second order sufficient optimality conditions have been proven, and discretization errors for different types of discretization are available, see e.g. the pioneering work of Lions (1971) concerned with linear PDEs and Hinze et al. (2009), or Tröltzsch (2010) for a recent overview covering theoretical and numerical aspects of both linear and nonlinear problems. Recently, optimal control of quasilinear parabolic equations was addressed by Bonifacius and Neitzel (2018), Casas and Chrysafinos (2018), Meinlschmidt et al. (2017a, b), and Meinlschmidt and Rehberg (2016). The functional analytic framework for the analysis of the state equation is provided by the concept of maximal parabolic regularity of nonautonomous operators, see e.g. the work of Amann (2003, 2004, 2005), Meinlschmidt and Rehberg (2016), or further references in Bonifacius and Neitzel (2018). The highly non-trivial existence and regularity theory for solutions of the underlying PDE poses the main difficulty in the theoretical analysis of such problems. For a discussion of previous literature concerning optimal control of quasilinear PDEs see the introductions of Bonifacius and Neitzel (2018) and Casas and Chrysafinos (2018), respectively. In particular, optimal control of quasilinear elliptic equations has been considered by Casas and Tröltzsch (2009, 2011), Casas and Dhamo (2011), Yousept (2013), de Los Reyes and Dhamo (2016), and Nicaise and Tröltzsch (2017). Several physical models lead to quasilinear PDEs (e.g. temperature-dependent thermal conductivity), which motivates the analysis of this challenging class of problems from the applied point of view, see e.g. the so-called thermistor problem (Meinlschmidt et al. 2017a, b).
For the efficient numerical solution of nonlinear optimal control problems sequential quadratic programming (SQP) methods form a prominent class of state of the art algorithms: The nonlinear optimization problem is approximated by a sequence of linear quadratic subproblems that can be solved e.g. by application of the well-understood primal dual active set strategy. The analysis of such SQP methods for nonlinear optimal control problems has been addressed by several researchers, see e.g. Tröltzsch (1999), Goldberg and Tröltzsch (1998) for semilinear parabolic equations, Hintermüller and Hinze (2006), Hinze and Kunisch (2001), Wachsmuth (2007) for optimal control of time-dependent Navier-Stokes equation, Griesse et al. (2010), Griesse et al. (2008) for semilinear elliptic problems with mixed constraints, and Heinkenschloss and Tröltzsch (1999) for optimal control of a phase field equation. For an overview concerning the origins of SQP methods in the context of PDE-constrained optimization we also refer to the introduction of Goldberg and Tröltzsch (1998). As further second order methods for the solution of nonlinear optimal control problems we mention the semismooth Newton method and versions of the primal dual active set strategy, respectively, see e.g. Hinze and Kunisch (2001), Hintermüller et al. (2007), Ito and Kunisch (2004).
In the present paper, we focus on the numerical solution of quasilinear parabolic optimal control problems by the SQP method. To the best of our knowledge, a corresponding convergence analysis in function space has not been carried out in the existing literature. The most closely related existing publications are those by Ulbrich and Ziems (Ulbrich and Ziems 2017; Ziems 2013; Ziems and Ulbrich 2011) and chapter 8 in the thesis of Feldhordt (2017), respectively. Ulbrich and Ziems consider trust-region and trust-region SQP methods for optimal control of general nonlinear PDEs.
The main difference to our result is that they include discretization in their work and prove convergence of adaptive multilevel algorithms, whereas we stick to the function space setting. In return, we are able to prove locally superlinear convergence around local minima fulfilling certain second order conditions avoiding the two norm gap (Ioffe 1979), whereas Ulbrich and Ziems establish global convergence to a point fulfilling first order optimality conditions, but without an explicit rate. Feldhordt (2017) considers optimal control of the so-called chemotaxis system and proves convergence of the SQP method assuming a rather strong second order sufficient condition. This corresponds to our interim result in Sect. 6.1, whereas our main focus during the rest of the paper is on the interplay of weaker second order conditions and the notion of "locality" in the SQP method. The second order sufficient conditions we refer to in the present paper are due to Bonifacius and Neitzel (2018). For the topic of second order conditions in PDE-constrained optimization in general we refer to Goldberg and Tröltzsch (1989), Bonnans (1998), or the recent survey by Casas and Tröltzsch (2015) and the references therein.
Many of our arguments in the present paper are similar to those known from earlier publications. However, we believe that our consideration is of interest for three main reasons: First, we demonstrate that the results on optimal control of quasilinear parabolic PDEs obtained by Bonifacius and Neitzel (2018) allow us to derive convergence of the SQP method. In particular, the existence and regularity theory of quasilinear parabolic PDEs is much more involved than the corresponding treatment of semilinear PDEs. This makes the choice of the correct function spaces more complicated than in previous work on SQP methods, and we believe that it is not clear a priori that, in the end, the arguments from the existing literature apply to the present model problem as well.
Second, we show a new regularity result for the adjoint state in Sect. 7. The proof relies on maximal parabolic regularity arguments and is based on the work of Bonifacius and Neitzel (2018). The result is crucial for our further analysis, because the improved regularity allows us to estimate the second derivative of the nonlinearity of the state equation in an appropriate way.
Finally, most proofs concerning convergence of the SQP method were published before the introduction of a framework for second order sufficient conditions without two norm gap. As shown by Bonifacius and Neitzel (2018), our model problem fits into this framework, and hence it is natural to revisit the convergence theory of the SQP method, in particular the question of localization of the quadratic subproblems, in view of the absence of the two norm gap: Since quadratic growth of the reduced objective functional holds $L^2$-locally (instead of $L^\infty$-locally) around the optimal control, one may wonder whether it is possible to replace the $L^\infty$-neighbourhoods from previous convergence proofs for the SQP method by $L^2$-neighbourhoods. The answer to this question is not straightforward, because convergence of the SQP method is, as usual, established by showing convergence of a generalized Newton method for a certain generalized (set-valued) equation: In order to obtain a differentiable map in this generalized equation we still need to measure controls in a norm stronger than the $L^2$-norm. In contrast, the regularity property (analogous to the invertibility of the Hessian in the classical Newton method) relies on the $L^2$-coercivity property due to the second order sufficient conditions. For our model problem, we give an answer to this question in Sect. 6.3, which is our main result.
The rest of this paper is organized as follows and keeps the main structure of previous results concerning the analysis of SQP methods, cf. in particular the work of Tröltzsch (1999), Wachsmuth (2007) and Goldberg and Tröltzsch (1998): In Sects. 2 and 3 we briefly recall the assumptions and the model problem as well as its first order optimality conditions from Bonifacius and Neitzel (2018). The idea of the SQP method is outlined together with appropriate second order sufficient conditions. To prepare the analysis of the convergence properties of the SQP method, we provide some auxiliary results that are specifically related to our quasilinear parabolic model problem in Sect. 4. The proof of a new regularity result for the adjoint state is postponed to Sect. 7. After that, we follow the standard argument to prove convergence of the SQP method in Sects. 5 and 6: We utilize the connection to the Josephy-Newton method for a generalized equation originating from the first order optimality conditions. Convergence of this Newton method is proven in Sect. 5 and the interpretation of the iterates as the solutions of certain quadratic optimization problems is topic of Sect. 6. Assuming strong second order sufficient conditions we formulate our first main result in Sect. 6.1. The remaining two theoretical Sects. 6.2 and 6.3 of the paper are devoted to the analysis of the generalized Newton and the SQP method under weaker second order assumptions. In particular we are able to replace the L ∞ -neighbourhoods in the results of Tröltzsch (1999) and Wachsmuth (2007) by L 2 -neighbourhoods in our final result in Sect. 6.3. For a detailed overview of this part of the paper we refer to the introduction of Sect. 6. Finally, we give short numerical examples that illustrate our theoretical findings in Sect. 8. 
Notation For a Lipschitz domain $\Omega$ and $\theta \in (0, 1]$, $k \in \mathbb{N}$, $p \in [1, \infty]$ we denote by $L^p = L^p(\Omega)$, $H^{\theta,p} = H^{\theta,p}(\Omega)$ and $W^{k,p} = W^{k,p}(\Omega)$ the usual Lebesgue, Bessel potential and Sobolev spaces, respectively. For the two latter families of spaces a subscript $D$ denotes incorporation of previously defined homogeneous Dirichlet boundary conditions. With $H^{-\theta,p}_D$ and $W^{-1,p}_D$ we refer to the topological dual spaces of $H^{\theta,p'}_D$ and $W^{1,p'}_D$, where $\langle \cdot, \cdot \rangle$ stands for the duality pairing and, in case of Hilbert spaces, the scalar product. Norms $\|\cdot\|$ are indexed by the space they refer to. For some integrability exponent $r \in [1, \infty]$, we define the conjugate exponent $r'$ by $1/r + 1/r' = 1$. Spaces of continuously differentiable resp. Hölder continuous functions are denoted as usual by $C^\alpha$.
The open and closed balls of radius $r > 0$ around $x_0$ in a Banach space $X$ are denoted by $B^X_r(x_0)$ and $\overline{B}^X_r(x_0)$, respectively. With $(X, Y)_{r,s}$ or $[X, Y]_r$ we refer to real or complex interpolation spaces of two normed spaces $X$, $Y$, respectively. Given $I \subset \mathbb{R}$, a Banach space $X$, and a function $\varphi\colon I \to X$, we denote by $\operatorname{tr}_t \varphi$, $t \in I$, the trace $\varphi(t) \in X$, if such a pointwise evaluation is well-defined.
The notation "$\ldots \lesssim \ldots$" will be used to express that "$\ldots \le C \cdot \ldots$" holds with a generic constant $C > 0$ whose dependencies are not relevant in the present context. We use double arrows "⇒" to indicate set-valued maps.

The model problem
Our model problem is the same as the one in Example 2.5 of Bonifacius and Neitzel (2018), and reads as follows: subject to u ∈ U ad and ∂ t y + A (y)y = Bu Here, the quasilinear part A of the state equation is defined by The control operator B, Λ, and the admissible set U ad will be specified in the following section.

Assumptions
We rely on the following assumptions that we repeat from Bonifacius and Neitzel (2018) with minor changes, cf. the following remark.
∂Ω is relatively open and denotes the Neumann boundary part whereas Γ D = ∂Ω \Γ N denotes the Dirichlet boundary part equipped with homogeneous Dirichlet boundary conditions. We assume that Ω ∪ Γ N is Gröger regular such that every chart map in the definition of Gröger regularity can be chosen volume preserving. The time interval For the definition of Gröger regularity we refer to the work of Gröger (1989). It has been used in a notation similar to this work e.g. in Bonifacius and Neitzel (2018, Definition A.1). In Section 5 of  an alternative characterization can be found. The additional requirement of volume preserving chart maps is satisfied in particular by all domains with Lipschitz boundary ("strong Lipschitz domain"), but also e.g. by two crossing beams in dimension three (Haller-Dintelmann and Rehberg 2009, Remark 3.3 and Section 7.3).

Assumption 2.2
The function $\xi\colon \mathbb{R} \to \mathbb{R}$ is twice differentiable with $\xi''$ being Lipschitz continuous on bounded subsets of $\mathbb{R}$. Let $\mu\colon \Omega \to \mathbb{R}^{d \times d}$, $\mu = \mu^T$, be measurable and uniformly bounded and coercive in the following sense: We assume a coercivity condition $0 < \xi_\bullet \le \xi \le \xi^\bullet$ for $\xi$ as well. With this we define as above whenever $y$ is a measurable function on $\Omega$.
is a topological isomorphism and fix this choice of p.
Assumption 2.4 Let ζ ∈ (0, 1) and s > 2 be fixed such that holds. By D we denote the domain of the unbounded operator −div(μ∇·) + 1 in the Bessel potential space H −ζ, p D . The desired state y d ∈ L ∞ (I , L p/2 ), the initial condition for the state equation y 0 ∈ (H −ζ, p D , D) 1−1/s,s and the regularization parameter γ > 0 are fixed.
We introduce the measure space (Λ, ρ) by Λ = {•} m × I equipped with measure ρ being the product of the counting measure on the m−element set {•} m with the Lebesgue measure on I . Within the control space U := L s (Λ, ρ) = L s (I , R m ) the set of admissible controls is given by we define the bounded linear control operator by

Remark 2.5
The choice of control space and operator ("purely time-dependent controls") corresponds to Example 2.5 of Bonifacius and Neitzel (2018), where the control space is chosen as L ∞ (Λ) instead of L s (Λ). We will make use of measuring controls in L s instead of L ∞ when applying the Riesz-Thorin interpolation theorem, see the remark concluding Sect. 6.1. The reason for choosing purely time-dependent controls-apart from practical motivation, see e.g. de Los Reyes et al. (2008)-is outlined in the remark at the end of Sect. 5.1. The symmetry property μ = μ T as well as the slightly higher spatial integrability of the desired state y d (L p/2 instead of L 2 ) are required to derive improved regularity for the adjoint state in Sect. 7.

Optimality conditions and SQP method
We follow Tröltzsch (1998), Tröltzsch (1999), Wachsmuth (2007). From Bonifacius and Neitzel (2018), Section 4.1, recall the following notation: and a measurable function y on I × Ω. The divergence operators have to be understood in weak form, of course.

First order necessary optimality conditions
In Bonifacius and Neitzel (2018), Lemma 4.1, the existence of a global solution to (OCP) is established. Further, any local solution to (OCP) fulfills the following system of equations, cf. Bonifacius and Neitzel (2018, Lemmas 4.6-4.8): By L OC P (y, u, p) := J (y, u) − p, e 1 (y, u) we denote the Lagrangian of (OCP).

Generalized equation and SQP method
We reformulate the optimality system as the generalized equation with the maps where N U ad (u) denotes the normal cone of the closed convex set U ad at the point To make the definition of F and N precise, F is understood as map F: X s → Z s with and Accordingly, N is understood as set valued map X s ⇒ Z s . We equip X s and Z s with the canonical norms Having chosen these spaces, the following result holds: Lemma 3.1 F: X s → Z s is continuously Fréchet differentiable and N: X s ⇒ Z s has closed graph.
Proof Differentiability has been used implicitly in Bonifacius and Neitzel (2018, Lemma 4.5), where the differentiability of the control-to-state map is shown by the implicit function theorem. The closed graph property is standard.
Remark 3.2 Note that we require time integrability s 1 as in Assumption 2.4 for the y-component in order to have the embedding cf. Bonifacius and Neitzel (2018, Proposition 3.3). The latter is needed to ensure differentiability of the superposition operators associated with ξ and ξ , and hence differentiability of F. In return, this implies by the definition of F that we have to consider L s -integrable control functions u, i.e. (GE) cannot be stated with controls measured in L 2 .
Sometimes we will need the following subspaces X ∞ and Z ∞ of X s , Z s : equipped with the canonical norms similarly as above. Note that changing from X s , Z s to X ∞ , Z ∞ means nothing more than replacing the L s (Λ)-factors by L ∞ (Λ)-factors, i.e. considering controls in the L ∞ -instead of the L s -norm. The same result as before holds: Lemma 3.3 F: X ∞ → Z ∞ is continuously Fréchet differentiable and N: X ∞ ⇒ Z ∞ has closed graph.
Due to Lemma 3.1 we can formulate the ansatz of the SQP method in its abstract form as the Josephy-Newton method for generalized equations, see Josephy (1979), Dontchev (1996), Alt (1990), or Hinze et al. (2009): Given an iterate (y k , u k , p k ) ∈ X s , solve to obtain the new iterate (y k+1 , u k+1 , p k+1 ) ∈ X s . Writing down the full system of equations for (2) we find: Obviously, the current u-iterate u k has canceled out, which implies that the next iterate (y, u, p) depends on y k and p k but not on u k . This is due to the structure of our model problem. Note that the first two equations (3) are equivalent to the linearized state equation A standard computation shows that is equal (up to addition of constants) to the expression that finally fulfills: The system of equations (3), (4), (5) is the formal optimality system of the following optimal control problem: subject to u ∈ U ad and equation (3).

(QP)
This is the classical formulation of the SQP method as a sequence of quadratic problems. Note that these computations were completely formal in the sense that we do not know whether (QP) is convex or not. Hence, we cannot say whether there is a unique minimizer or whether the optimality system (3), (4), (5) is a sufficient characterization of this minimizer. This issue will be addressed in the following section utilizing the assumption of second order sufficient conditions.
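To make the structure of this iteration concrete, the following is a minimal finite-dimensional sketch. It is not the function space method analysed in this paper: it assumes a separable toy objective f(u) = Σ_i (γ/2 u_i² + exp(u_i) − c_i u_i) on a box, and the objective, the names `josephy_newton_box` and `stationarity_residual`, and all parameter values are hypothetical choices made only for illustration. Since the Hessian of this toy objective is diagonal and positive, each quadratic subproblem decouples into one-dimensional convex problems, whose minimizer on the box is the Newton point clipped to the bounds; this clipped point solves the linearized generalized equation 0 ∈ f'(u_k) + f''(u_k)(u − u_k) + N(u).

```python
import math


def josephy_newton_box(c, gamma=0.1, lo=-1.0, hi=1.0, u0=None, tol=1e-12, max_iter=50):
    """Josephy-Newton (SQP) iteration for a separable toy problem on a box.

    Toy objective (an assumption for illustration only):
        f(u) = sum_i gamma/2 * u_i**2 + exp(u_i) - c_i * u_i,  lo <= u_i <= hi.
    """
    u = [0.5] * len(c) if u0 is None else list(u0)
    steps = []  # sup-norm of u_{k+1} - u_k, recorded per iteration
    for _ in range(max_iter):
        new_u = []
        for ui, ci in zip(u, c):
            grad = gamma * ui + math.exp(ui) - ci  # f_i'(u_i)
            hess = gamma + math.exp(ui)            # f_i''(u_i) > 0
            # Solution of the 1D quadratic subproblem on [lo, hi]:
            # the unconstrained Newton point, projected onto the box.
            new_u.append(min(hi, max(lo, ui - grad / hess)))
        steps.append(max(abs(a - b) for a, b in zip(new_u, u)))
        u = new_u
        if steps[-1] <= tol:
            break
    return u, steps


def stationarity_residual(u, c, gamma=0.1, lo=-1.0, hi=1.0):
    """Sup-norm of u - P_[lo,hi](u - f'(u)); it vanishes iff 0 is in f'(u) + N(u)."""
    return max(
        abs(ui - min(hi, max(lo, ui - (gamma * ui + math.exp(ui) - ci))))
        for ui, ci in zip(u, c)
    )


if __name__ == "__main__":
    c = [0.2, 1.0, 5.0]
    u, steps = josephy_newton_box(c)
    print(u, len(steps), stationarity_residual(u, c))
```

In this toy setting the subproblems are convex for every iterate, so none of the convexity issues discussed in the text arise; the sketch only illustrates how a Newton step and the constraint set interact in a single subproblem.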

Second order sufficient conditions and SQP
Depending on second order sufficient conditions (SSCs) for (OCP) based on those derived in Bonifacius and Neitzel (2018) we have to restrict the admissible set for (QP) to ensure convexity.
Assumption 3.4 From now on let ū ∈ U ad be a fixed L 2 -local minimizer for (OCP), i.e. there is r > 0 such that Let ȳ and p̄ be the state and adjoint state associated with ū. For σ ∈ [0, ∞] we define the σ -active set of ū as A σ (ū) := {x ∈ Λ: |γ ū + B * p̄|(x) > σ } and the corresponding subspace of directions vanishing on A σ (ū). We assume that the following second order sufficient condition for (OCP) is satisfied at ū: There is a fixed σ ∈ [0, ∞] (whether we allow the case σ = 0 or not will be stated in our further results) such that there exists δ > 0 such that Condition (SSC-σ ) is stronger than the second order sufficient condition derived by Bonifacius and Neitzel (2018, Theorem 4.14), which has the smallest possible gap to the corresponding necessary condition. However, we conclude from the cited result: Theorem 3.5 Let Assumption 3.4 hold with some σ ∈ [0, ∞]. Then there are , η > 0 such that the quadratic growth condition holds for all u ∈ U ad ∩ B L 2 (ū).
We also mention the work of Casas and Chrysafinos (2018) in which second order optimality conditions analogous to those of Bonifacius and Neitzel (2018), but for a slightly different setting w.r.t. the domain, the boundary conditions and the boundedness properties of the nonlinearity, were derived. Casas and Chrysafinos deal with C 1,1 -smooth domains, homogeneous Dirichlet boundary conditions and locally Lipschitz continuous coefficients for the state equation, which enables them to consider W 2 -regularity of the states.
Given σ ∈ [0, ∞] that will become clear from the context, we introduce the modified admissible set as and define the corresponding restricted quadratic problem as follows: min y,u J k (y, u) subject to u ∈ U σ ad and Equation (3) (QP-σ ) Using the relation of J k to the second derivative of the Lagrangian of (OCP) (see (7) and (8)) it is clear that (QP-σ ) is a linear quadratic and under Assumption 3.4 strictly coercive and therefore strictly convex optimal control problem, at least for (y k , u k , p k ) = (ȳ,ū,p). This will be crucial for the convergence analysis of the SQP method.
Remark 3.6 Second order sufficient conditions related to strongly active sets turned out to be suitable assumptions for the analysis of SQP methods, see e.g. Tröltzsch (1999), Goldberg and Tröltzsch (1998), Wachsmuth (2007), which work with the same assumption as we do. That we do not work with the SSCs formulated by Bonifacius and Neitzel (2018) directly has two reasons: First, we require the coercivity condition in (SSC-σ ) to hold on a vector space instead of just a cone in the proof of the L 2 -stability result in Sect. 5.1. Second, in Sect. 6.2 we will make use of the fact that strongly active sets behave well under small perturbations for σ > 0.
Remark 3.7 Strongest possible second order conditions, i.e. coercivity of L OC P on the whole space L 2 (Λ), will be referred to by σ = ∞. In this case it holds C ∞ (ū) = L 2 (Λ) and U ∞ ad = U ad . See e.g. Griesse et al. (2008, 2010), Feldhordt (2017) or Heinkenschloss and Tröltzsch (1999) for such an assumption in the context of SQP methods. In Sect. 6.1 we state our main theorem for this special case.

Auxiliary results
Before going into the details of the convergence analysis for the SQP method we collect some auxiliary results in the following section.

Regularity of the adjoint state
For our further analysis we will heavily rely on L ∞ (I , W 1, p )-regularity of the adjoint statep associated with the optimal controlū, cf. the remarks in Sect. 4.3. For better readability we postpone the proof of the corresponding regularity theorem to Sect. 7 and state here only

A property of the control operator
Recall from Assumption 2.4 the definition of the control operator that refers to the case of purely time-dependent controls. Obviously, B is continuous from L 2 (Λ) to L 2 (I , W −1, p D ) and therefore its adjoint B * is defined on L 2 (I , W 1, p D ) with values in L 2 (Λ). To derive the L ∞ -stability result from the L 2 -stability result in Sect. 5.1, we need to perform a bootstrapping argument that requires us to know how B * behaves restricted to a space of more regular functions.
To simplify notation, let B: We need the following lemma: Proof This is a direct consequence of Griepentrog et al. (2002, Theorem 3.5).
For the two embeddings we refer e.g. to Amann (2003, formula (1.2)). We will come back to this in Sect. 5.1: Given an estimate on the control in L r , we have estimates for linearized state and adjoint state in respectively. Application of B * either yields an estimate for the control in L q with some q > r or in L ∞ if r already was large enough.

Some properties of A
Recall the definition of A from the beginning of Sect. 3. For the proof of the $L^2$- and $L^\infty$-stability results in Sect. 5.1 we need the following

Lemma 4.3 It holds
The constant C can be chosen uniformly with respect to y for y's coming from a bounded subset of W 1,s (I , W [v, w], p for an arbitrary testfunction w ∈ L r (I , W 1, p D ) utilizing Hölders inequality.

In Lemma 4.3 we bounded the norm of
. This generality will be necessary in the bootstrapping argument in the proof of the L ∞ -stability, which was already mentioned in the previous Sect. 4.2. As explained in the remark after Lemma 4.4, this requires bounds for y in L ∞ (I , W 1, p D ) and p in L ∞ (I , W 1, p ). However, in the next section we will require an estimate of A (y) [v, w], p directly (and not of A (y) [v, ·] * p) which allows us to use the arguments v and w from the space In that case we can exploit more regularity of v, w, which allows to relax the assumptions on y and p.

The constant C can be chosen uniformly with respect to y for y's coming from a bounded subset of W
The proof works similarly to that of Lemma 4.3, but now we try to exploit more regularity of v and w. Using embeddings due to Amann (2003, formula (1.2)) and Griepentrog et al. (2002, Theorem 3.5) we find Now, an application of Hölder's inequality (the temporal integrability exponents match due to (10)) yields the desired result. The uniform choice of the constant with respect to y follows from the boundedness of ξ and its derivatives on bounded sets of R and the compactness of the embedding W 1,s (I , W

Remark 4.5
The difference in the regularities assumed for y and p in the two lemmas is essential: Lemma 4.3 will be applied in Sect. 5.1 only for y =ȳ and p =p, i.e. the required regularity is guaranteed by Lemma 4.1 forp and Theorem 7.1 (1), (2b) forȳ, respectively. In Sect. 4.4 we will have to apply Lemma 4.4 for y = y k , p = p k with y k , p k being iterates of the SQP method, i.e. y k and p k are solutions of the linearized state and adjoint equation. Hence, the regularity requirements for Lemma 4.4 are met, but not immediately those of Lemma 4.3.
Remark 4.6 (Necessity of higher regularity for the adjoint state) Note that Lemma 4.3 cannot be improved: The limiting factor is the summand The function w has temporal integrability r and spatial integrability ∞, whereas ∇v has temporal integrability r and spatial integrability p, which is the best we can expect from the assumptions each. This implies that we require p ∈ L ∞ (I , W 1, p D ) in order to be able to estimate the above integral.

Derivatives associated to (QP)
In this section we provide results on the first and second derivatives of the reduced objective functionals associated to the quadratic subproblems (QP). We will apply them in Sect. 6.3 briefly before obtaining our main result.
Recall the definition of the space X s from Sect. 3.2 and denote by j k : L 2 (Λ) → R the reduced functional associated with the linear quadratic optimal control problem (QP) at (y k , u k , p k ) ∈ X s . Note in particular that the second derivative $j_k''$ is constant, because j k is a quadratic functional. We therefore write $j_k''$ instead of $j_k''(v)$, since $v \mapsto j_k''(v)[\cdot, \cdot]$ is independent of v.
Proposition 4.7 Let Assumptions 2.1-2.4 and 3.4 be satisfied. Then, it holds uniformly in u ∈ L 2 (Λ) as y k →ȳ, p k →p in the above norms.
Proof Recall by (7) that j k · u 2 = L OC P (y k , u k , p k )(y, u) 2 with holds. We expand this as From the definition of the Lagrangian we know (I ) = j (ū)u 2 . Hence it remains to show that the contribution of (I I ) and (I I I ) gets uniformly small as claimed above. By definition we have , , wherein the summands can be estimated using the boundedness of the solution operator of the linearized state equation (Bonifacius and Neitzel 2018, Proposition 4.4) and applying Lemma 4.4 and a similar argument as in the proof of Lemma 4.4. In particular recall Remark 4.5. In the same way one can treat (I I I ) as well.
For the gradient of j k we find: and estimate both summands. For some v ∈ U ad , e.g. v = v k , introducing the following quantities will be helpful: Regarding (B) we know from (Bonifacius and Neitzel 2018, Proposition 4.9) that . This is shown using the convergence of the solution operators of the linearized state equation (Bonifacius and Neitzel 2018, Proposition 4.9). Utilizing similar techniques as before the desired result follows after some straight forward computations. We omit the details.

Generalized Newton method on U ad
Following the standard arguments, see e.g. Tröltzsch (1999, 2000), Goldberg and Tröltzsch (1998), Alt et al. (2010), Griesse et al. (2008, 2010), Wachsmuth (2007) and Hintermüller and Hinze (2006), we show that the Newton-Josephy method applied to a modified version of the generalized equation (GE), see Sect. 3.2, converges. Our own contribution here is to verify that, under the correct choice of spaces and with the help of suitable auxiliary results from the previous section, existing arguments apply to the quasilinear case as well. Proving convergence of this generalized Newton method is a central step towards showing convergence of the SQP method: The iterates of the generalized Newton method will be interpreted as iterates of the SQP method in Sect. 6.
From formula (9) in Sect. 3.3 recall the definition of the modified admissible set U σ ad for some σ ∈ [0, ∞]. We consider the generalized equation with this modified admissible set, i.e. we replace (GE) by where U ad is replaced by U σ ad in the definition of the normal cone map N , i.e.
where N U σ ad (u) denotes the normal cone of U σ ad at u. The map F: X s → Z s as well as the spaces X s , Z s , see Sect. 3.2 for the definitions, do not change.
To prove convergence of the generalized Newton method, strong regularity in the sense of Robinson has to be shown at an optimal point (ȳ,ū,p) ∈ X s , i.e. for every perturbation d ∈ Z s sufficiently close to 0 the generalized equation needs to have a unique solution that depends Lipschitz continuously on d ∈ Z s . For the definition of strong regularity we refer e.g. to Robinson (1980) or Hinze et al. (2009, Definition 2.5).
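For orientation, the abstract objects behind this paragraph can be written out in generic notation. This is a sketch of the standard formulation, not a reproduction of the paper's displayed formulas: the symbols $x = (y, u, p)$, $\bar x = (\bar y, \bar u, \bar p)$, $N^\sigma$ for the normal cone map built from $U^\sigma_{ad}$, and the constants $\varepsilon, \rho, L$ are notational assumptions introduced here.

```latex
% Josephy--Newton step for the generalized equation 0 \in F(x) + N^\sigma(x):
\[
  0 \in F(x_k) + F'(x_k)(x_{k+1} - x_k) + N^\sigma(x_{k+1}).
\]
% Strong regularity at \bar{x} (Robinson): there exist \varepsilon, \rho, L > 0
% such that for every d \in Z_s with \|d\|_{Z_s} < \varepsilon the linearized,
% perturbed inclusion
\[
  d \in F(\bar{x}) + F'(\bar{x})(x - \bar{x}) + N^\sigma(x)
\]
% has exactly one solution x(d) in the ball B^{X_s}_\rho(\bar{x}), and
\[
  \| x(d_1) - x(d_2) \|_{X_s} \le L \, \| d_1 - d_2 \|_{Z_s}
  \quad \text{for all admissible } d_1, d_2 .
\]
```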
Translating back this generalized equation for (y, u, p) into an optimal control problem yields for a given perturbation vector d = (d y , d 0 , d p , d T , d u ) ∈ Z s with components coming from the corresponding spaces. Note that (GE-σ -D) is indeed the first order necessary and (due to convexity) sufficient optimality condition for (QP-σ -D), because (QP-σ -D) is convex since only linear perturbation terms have been added to the convex objective function from (QP-σ ). The perturbation in the corresponding affine linear state equation is only a constant and does not destroy convexity as well.

Stability of the quadratic problems (QP-)
We fix d 0 = 0 and d T = 0, i.e. we assume that initial and final conditions are met exactly during the application of the SQP method, which is reasonable from the numerical point of view.
The hidden constant depends on the data of (OCP) and (ȳ,ū,p), but not on d i .
Proof The proof relies on the linear quadratic structure of (QP-σ -D) and regularity results for the linearized state equation resp. the adjoint equation.
Hence it is completely analogous to the proof in Goldberg and Tröltzsch (1998). We omit the details and only mention the required regularity results (Bonifacius and Neitzel 2018, Propositions 4.4 resp. 4.7) and that terms containing A are estimated with the help of Lemma 4.3.
This shows L 2 -stability of the quadratic problems (QP-σ ) with respect to perturbations measured in corresponding norms. Utilizing a standard bootstrapping argument as e.g. in Tröltzsch (2000) we can show the corresponding L s -resp. L ∞ -stability result: Theorem 5.2 Let Assumptions 2.1-2.4 and 3.4 with some σ ∈ [0, ∞] hold. Then, for the (y i , u i , p i ), i = 1, 2, from the previous proposition we have In particular, the generalized equation (GE-σ ) is strongly regular at its solution (ȳ,ū,p) with respect to the spaces X s , Z s and X ∞ , Z ∞ .
Proof Again, the proof follows the techniques from Goldberg and Tröltzsch (1998); Tröltzsch (2000). From the projection formula holds pointwise on Λ. Thus, we can bound Δ u in the L q (Λ)-norm, if we can bound B * Δ p and Δ d u in the L q (Λ)-norm. We apply a bootstrapping argument that relies on the property of B * from Sect. 4.2: Assume that we already know for some r ∈ [2, s). Using the regularity theory of the linearized state resp. adjoint equation for (16) resp. (17) we conclude with some q fulfilling 1/q > 1/r + (ζ − 1)/2 holds. In the first case it follows and we are done. In the second case we have and we repeat the procedure with r = q as long as the first holds, which is clearly the case due to Assumption 2.4 if r = s is reached. Note that (ζ − 1)/2 < 0 is fixed and that we can avoid q being equal to the exceptional cases of Lemma 4.2 due to the strict inequality that allows small perturbations.
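The bootstrapping argument terminates after finitely many steps because each step lowers 1/r by a fixed amount. The following sketch illustrates only this counting argument. The relation 1/q > 1/r + (ζ − 1)/2 is taken from the text; the stopping rule (switch to the L ∞ case once the right-hand side becomes nonpositive), the `slack` parameter that models the strict inequality, and the function name `bootstrap_exponents` are assumptions made for illustration and do not reproduce the precise case distinction of the proof.

```python
import math


def bootstrap_exponents(zeta, r0=2.0, slack=1e-3):
    """Exponents r0 < r1 < ... obtained from 1/q = 1/r - (1 - zeta)/2 + slack.

    The chain ends with math.inf as soon as the right-hand side is <= 0,
    which is the (assumed) point at which an L^infinity bound is available.
    """
    if not 0.0 < zeta < 1.0:
        raise ValueError("zeta must lie in (0, 1)")
    gain = (1.0 - zeta) / 2.0 - slack  # guaranteed decrease of 1/r per step
    if gain <= 0.0:
        raise ValueError("slack must be smaller than (1 - zeta)/2")
    exponents = [r0]
    inv = 1.0 / r0
    while True:
        inv -= gain
        if inv <= 0.0:
            exponents.append(math.inf)
            return exponents
        exponents.append(1.0 / inv)


if __name__ == "__main__":
    # Hypothetical value zeta = 0.6, chosen only to show the mechanism.
    print(bootstrap_exponents(0.6))
```

Starting from r = 2, the number of steps is bounded in terms of ζ alone, which is why the hidden constants in the final estimate stay under control.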

Remark 5.3
In addition to the case of purely time-dependent control Bonifacius and Neitzel (2018)  The main difficulty when generalizing our results to the setting of distributed control lies in keeping the arguments for Proposition 5.1 and Theorem 5.2 working. In that case, B * is the embedding L s (I , W 1, p ) → L s (I × Ω) and a similar discussion as in Sect. 4.2 has to be done. Sufficiently good estimates for Δ p could be obtained using the regularity theorem from Sect. 7, whereas the corresponding estimates for Δ y would require an analogous analysis of the linearized state equation on H −ζ, p -spaces, which is beyond scope and focus of this paper.

Convergence of the generalized Newton method
Invoking a general result on the convergence of generalized Newton methods, e.g. Hinze et al. (2009) the sequence of iterates generated by the Newton-Josephy method for equation (GE-σ ) with (y 0 , u 0 , p 0 ) as start is well-defined, stays in the ball B X s r Newton ((ȳ,ū,p)) and converges q-superlinearly to (ȳ,ū,p) in X s . (1) holds with X ∞ instead of X s .

Convergence of the SQP method
The well-definedness of the iterates in Theorem 5.4 is so far only ensured by some generalized implicit function theorem and the strong regularity of (GE-σ ) at (ȳ,ū,p). Convexity of the quadratic subproblems (QP-σ ) is so far only known in the case (y k , u k , p k ) = (ȳ,ū,p), i.e. the relation of possible minimizers of (QP-σ ) and solutions of (GE-σ ) is unclear at the moment.
Therefore, this final section is devoted to an extended analysis of the generalized Newton method for (GE) and the interpretation of the Newton iterates as solutions of some linear quadratic optimal control problems. To make the flow of the argument clearer, we give a short summary of this section: In a first step (Sect. 6.1) we consider the quadratic problems restricted to U σ ad , i.e. the set of those controls from U ad that coincide with the optimal control ū on the σ -active set of ū. The main argument here is that the quadratic problems sufficiently close to the true KKT-triple become strictly convex when restricted to U σ ad . Hence, their unique solution is characterized by the corresponding first order necessary optimality condition, which coincides with the generalized equation originating from the Newton method discussed in Sect. 5. The restriction to U σ ad can be slightly relaxed if (SSC-σ ) holds for a positive σ : The quadratic subproblems have to be restricted to U ad ∩ B L 2 ρ (ū) with some radius ρ > 0, as shown in Sect. 6.3, and the generalized Newton method for (GE) converges locally, even without further restrictions, see Sect. 6.2. That the restriction of the quadratic subproblems can be done in terms of L 2 -balls around ū (instead of L ∞ -balls as in previous results) is, to the best of our knowledge, a new result that we obtain by careful application of the SSCs. The main steps of the argument are as follows: First, we establish convergence of the generalized Newton method for the corresponding set valued equation (GE) in Sect. 6.2, Theorem 6.10. Here the proof of strong regularity is the crucial part; it essentially relies on the observation that L 2 -local quadratic growth and L 2 -local uniqueness of critical points implied by SSCs for certain quadratic problems also stay valid uniformly under perturbation (Proposition 6.7).
This and the fact that the set of strongly active points behaves sufficiently well under perturbation (Lemma 6.6) allow us to carry over results on U σ ad to U ad ∩ B L 2 ρ (ū) in Corollary 6.8. Finally, in Sect. 6.3 the iterates of the generalized Newton method are identified with the solutions of the quadratic subproblems, see Proposition 6.14. We start with the iterates of the SQP method with subproblems restricted to U σ ad from Sect. 6.1. Using perturbation arguments analogous to those from Sect. 6.2 it is shown that sufficiently close to the true KKT-triple these iterates can also be obtained as the unique solution of the quadratic subproblems on U ad ∩ B L 2 ρ (ū) with appropriate ρ > 0, or as the unique local solution of the global quadratic subproblem that is contained in the aforementioned set, see Proposition 6.14.
Note that for theoretical reasons it is not possible to avoid technical restrictions as the above ones completely, even in finite dimensions, cf. the example given by Goldberg and Tröltzsch (1998, Section 6). In the infinite dimensional case an additional difficulty arises as pointed out by Tröltzsch (1999, final Remark): Unlike in finite dimensions we cannot assume that the possibly infinite set of active constraints is correctly identified after the first iteration, and therefore technical restrictions encoding some a-priori knowledge on the correct active set have to be imposed.

SQP method on U ad
In this section we relate the iterates of the Newton method from Sect. 5 to solutions of (QP-σ ), see Sect. 3.3 for the definition of U σ ad and (QP-σ ). To do so we will show that the formal optimality conditions for (QP-σ ) encoded in the Newton equations for (GE-σ ) are indeed sufficient optimality conditions for (QP-σ ). Following again the work of Tröltzsch (1999), Goldberg and Tröltzsch (1998), and Wachsmuth (2007) this is done by showing strict convexity for (QP-σ ) for (y k , u k , p k ) sufficiently close to (ȳ,ū,p). We prove convergence of the SQP method under the technical restriction to replace U ad by U σ ad . Assuming strongest possible SSCs, i.e. U ad = U σ ad , this yields our first main result.
Recall the definition of the space X s from Sect. 3.2. The following result corresponds to Lemma 6.2, Corollary 6.3 by Tröltzsch (1999).

Proposition 6.1 Let Assumptions 2.1-2.4 and 3.4 with some σ ∈ [0, ∞] be satisfied. Then, the linear quadratic SQP problem (QP-σ ) is a strictly convex optimization problem as long as (y k , u k , p k ) is sufficiently close to (ȳ,ū,p) in X s .
Proof The optimization problems (QP-σ ) are of linear quadratic type. To show strict convexity it suffices to show coercivity, but the latter is an immediate consequence of the second order sufficient condition (SSC-σ ) and the uniform estimate from Proposition 4.7.

Now we are ready to show locally superlinear convergence of the SQP method with quadratic problems on U σ ad :

Theorem 6.2 Let the assumptions of Theorem 5.4 be fulfilled.
1. There is a radius r SQP−σ > 0 such that for any start triple (y 0 , u 0 , p 0 ) ∈ X s fulfilling the sequences of iterates generated by the generalized Newton method applied to (GE-σ ) resp. generated by the SQP method with quadratic subproblems (QP-σ ) are both well-defined, coincide, stay in the ball B X s r SQP−σ ((ȳ,ū,p)) and converge superlinearly to (ȳ,ū,p) in X s .
2. The statement analogous to (1) with X s replaced by X ∞ is true, too.
3. There is a radius r SQP−σ > 0 such that the SQP method with quadratic subproblems (QP-σ ) and initial iterate ≤ r SQP−σ converges superlinearly in X s and X ∞ to (ȳ,ū,p). In particular we can choose u 0 ∈ U ad , u 0 −ū L 2 (Λ) sufficiently small, y 0 , p 0 state and adjoint state associated to u 0 .
Proof For (1) and (2) the proof works analogously to that of Theorem 6.4 in Tröltzsch (1999). For (3) note that (QP-σ ) is actually independent of the current control iterate u k , cf. also the remark after (5), which shows the first statement in (3). Since U ad is bounded in L ∞ and s > 2 by Assumption 2.4 it holds by the Riesz-Thorin interpolation theorem, cf. also the remark after the next theorem.
Here, C > 0 is a constant depending only on the L ∞ -bound of U ad , i.e. on u a and u b . From this we conclude by continuity which shows the second statement of (3).
Assuming strongest possible second order sufficient conditions, i.e. coercivity of the second derivative of the Lagrangian on the whole space instead of only on a subspace, we are able to state our first main result. Note that it is possible to formulate all "closeness" required for convergence of the SQP method with respect to L 2 -norms. Theorem 6.3 Let the Assumptions 2.1-2.4 be fulfilled and let the second order sufficient condition (SSC-σ ) from Assumption 3.4 hold on the whole space L 2 (Λ) (i.e. σ = ∞). Then the SQP method for (OCP) started in (y 0 , u 0 , p 0 ) ∈ X s , u 0 ∈ U ad , u 0 −ū L 2 (Λ) sufficiently small, y 0 , p 0 state and adjoint state associated to u 0 , converges superlinearly in X s and X ∞ to (ȳ,ū,p).
Proof Use Theorem 6.2 (3) together with U σ ad = U ad .

Remark 6.4
That the topologies generated by the L 2 -and the L s -norm (s > 2), respectively, coincide on an L ∞ -bounded set by the Riesz-Thorin interpolation theorem, is a well known fact. However, this observation is a key argument for many proofs concerning second order conditions without two norm gap, see e.g. (Casas and Tröltzsch 2012, Proposition 3.4) or (Bonifacius and Neitzel 2018, Theorem 4.14). Here, we made use of this technique in Theorem 6.2 (3) and 6.3 to tighten the unsatisfying gap between the quadratic growth condition for j implied by (SSC-σ )-this growth condition holds L 2 -locally-and the L s -local convergence of the SQP method.
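The interpolation fact used here can be written out in a few lines. The following is a sketch under the assumptions that U ad consists of functions bounded pointwise by u a and u b, that Λ has finite measure |Λ|, and that 2 < s < ∞; the explicit constants are worked out for illustration and are not quoted from the paper.

```latex
% Assumptions (for illustration): u, v \in U_{ad} = \{ w : u_a \le w \le u_b \text{ a.e. on } \Lambda \},
% |\Lambda| < \infty, \; 2 < s < \infty.
\begin{align*}
\|u - v\|_{L^s(\Lambda)}^s
  &= \int_\Lambda |u - v|^{s-2}\, |u - v|^2 \,\mathrm{d}x
   \le \|u - v\|_{L^\infty(\Lambda)}^{s-2}\, \|u - v\|_{L^2(\Lambda)}^2, \\
\|u - v\|_{L^s(\Lambda)}
  &\le \|u_b - u_a\|_{L^\infty(\Lambda)}^{1 - 2/s}\, \|u - v\|_{L^2(\Lambda)}^{2/s}, \\
\|u - v\|_{L^2(\Lambda)}
  &\le |\Lambda|^{1/2 - 1/s}\, \|u - v\|_{L^s(\Lambda)} .
\end{align*}
```

The second line shows that L 2 -convergence within U ad implies L s -convergence, with a constant depending only on u a and u b; the third line (Hölder's inequality) gives the converse, so both norms generate the same topology on U ad.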
For the rest of Sect. 6 we will be concerned with relaxing this rather abstract and technical condition towards a more natural restriction.

Generalized Newton method on U ad resp. U ad ∩ B L 2 (ū)
Before showing convergence of the SQP method restricted to U ad ∩ B L 2 ρ (ū) we first consider convergence of the Newton method for the associated generalized equation. Our arguments follow in particular the presentation by Wachsmuth (2007), but similar results are also due to Goldberg and Tröltzsch (1998) and Tröltzsch (1999). Replacing L ∞ -locality by L 2 -locality in the statements of Proposition 6.7 is, to the best of our knowledge, a new result. An analogous technique will be utilized afterwards in Sect. 6.3 to prove also convergence of the SQP method under certain localization conditions.
In the following we consider the perturbed generalized equation Note that we now use the normal cone map N associated with the true set of admissible controls U ad instead of the normal cone map N σ associated with the modified admissible set U σ ad that was used for the definition of (GE-σ -D) in the previous sections. Furthermore, note that (GE-σ -D) can be understood as generalized equation in the spaces X s , Z s resp. X ∞ , Z ∞ both. For the definition of these spaces see Sect. 3.2. As before, the generalized equation (GE-σ -D) is the formal optimality system of the following perturbed optimal control problem: The reduced objective function for (QP-D) will be denoted by j d . Note that we did not discuss properties of this optimization problem so far. Further, we introduce the following notation for the strongly active sets: i.e. d = 0 in the definition above.
Here, p denotes the adjoint state associated with u with respect to (QP-D) with perturbation vector d, see (15). Note that A σ 0 (ū) coincides with the strongly active set for u defined in Assumption 3.4.
In Sect. 5 we observed that under Assumptions 2.1-2.4 and 3.4 the restricted optimal control problem (QP-σ -D), i.e. problem (QP-D) restricted to U σ ad , is strictly convex and admits a unique solution (ȳ d ,ū d ,p d ). This holds true for arbitrarily large perturbation vectors d. In particular, the map d = (d y , d p , d u ) → (ȳ d ,ū d ,p d ) was shown to be Lipschitz from Z ∞ to X ∞ in Theorem 5.2, say with modulus L > 0. It follows that the mapping is Lipschitz as well, say with modulus L > 0.

Remark 6.5 Of course, even the map
is Lipschitz continuous as shown in Theorem 5.2, which implies that d → γū d + B * p d − d u is Lipschitz continuous from Z s to L s (Λ). Unfortunately, we rely on L ∞ -estimates in the following.

The solution (ȳ d ,ū d ,p d ) of (QP-σ -D) is a solution of (GE-D) as well, i.e. it holds
Proof Completely analogous to Wachsmuth (2007).
Lemma 6.6 shows that our solution of (QP-σ -D) that depends Z ∞ -X ∞ -Lipschitz on d is a solution of (GE-D) as well, if the perturbation d is small enough in Z ∞ . To establish strong regularity of (GE) (with spaces X ∞ , Z ∞ ) from this result we have to show that this solution is locally unique. This is done by proving that (ȳ d ,ū d ,p d ) is not only a global solution of (QP-σ -D) but even a local solution of (QP-D) fulfilling a quadratic growth condition on a ball around (ȳ d ,ū d ,p d ) with radius independent of d.
Proposition 6.7 Let the assumptions of Lemma 6.6 be satisfied.

Then there exist
i.e. the solution of (QP-σ -D), is also a L 2 -local solution of (QP-D) and satisfies the quadratic growth condition The first statement of this proposition corresponds to Theorem 5.5 (Wachsmuth 2007) with the L ∞ -ball aroundū d replaced by an L 2 -ball. To establish quadratic growth L ∞ -locally aroundū d , one could follow the direct proof of Theorem 5.17 (Tröltzsch 2010). Avoiding the two norm gap-which is our aim-can be done following ideas due to Casas and Tröltzsch (2012, Theorem 2.3), see also Tröltzsch and Wachsmuth (2006, Theorem 3.22), utilizing a proof by contradiction. We mention that similar arguments were also used by  in the context of abstract finite element errors.
Note that for every single perturbation d ∈ Z ∞ , both properties in the proposition are directly implied by Theorem 2.3 resp. Corollary 2.6 from . The crucial point here is to guarantee that the radii of the respective balls can be chosen independently of the choice of d as long as d Z ∞ is small enough.
Proof For the proof of (1) we extended the technique presented by (Casas and Tröltzsch 2012, Theorem 2.3) to our needs. First, note that due to the quadratic structure of (QP-D) it holds We are going to argue by contradiction and assume the contrary of our claim: There Define $v_n := h_n / \|h_n\|_{L^2}$ and $\rho_n := \|h_n\|_{L^2}$. It holds $d_n = (d_{y,n}, d_{p,n}, d_{u,n}) \to 0$ strongly in Z ∞ , which implies $\bar u_{d_n} \to \bar u$ and $\nabla j_{d_n}(\bar u_{d_n}) \to \nabla j(\bar u)$ strongly in L ∞ (Λ). Due to $\|v_n\|_{L^2} = 1$ for all n ∈ N we can w.l.o.g. assume that $v_n \rightharpoonup v^*$ weakly in L 2 (Λ) for some v * ∈ L 2 (Λ).
Step 1: We prove j (ū)v * = 0. We have , h n L 2 ≥ 0 holds for every n due tō u d n + h n ∈ U ad and Lemma 6.6 (2), for which we can assume w.l.o.g. that d n Z ∞ < σ 2L . Further, using the mean value theorem there are θ n ∈ (0, 1) such that Due to the structure of (QP-D)-see e.g. (16), (17) and use regularity results as in the proof of Theorem 5.2-we know that ∇ j d n (ū d n + θ n ρ n v n ) → ∇ j(ū) strongly in L 2 (Λ), which implies On the other hand it holds by assumption (19): which together with (21) yields j (ū)v * ≤ 0 first and then together with (20): Step 2: Step 1 the desired property: For σ > 0 arbitrary define A σ ,a (ū) := {x ∈ Λ : ∇ j(ū) > σ }. As in the proof of Lemma 6.6 we conclude that ∇ j d n (ū d n ) > 0 on A σ ,a (ū) for all sufficiently large n, which implies h n , v n ≥ 0 on A σ ,a (ū) for all such n. Because weak convergence preserves signs we conclude v * ≥ 0 on A σ ,a (ū). Since σ > 0 was arbitrary it follows v * ≥ 0 whenever ∇ j(ū) > 0, as stated. The case ∇ j(ū) < 0 is shown in the same way.
Step 3: In Step 2 we have shown that v * ∈ C 0 (ū) ⊂ C σ (ū) holds. For the definition of C 0 (ū) and C σ (ū) see Assumption 3.4. In the present step we will arrive at the final contradiction. First observe that by our assumption where we used the linear quadratic structure of (QP-D) at ( ) and the first order optimality condition at ( ). It follows where the first inequality comes from the weak lower semicontinuity of j (ū), see Proposition 4.10 (Bonifacius and Neitzel 2018). Since v * ∈ C σ (ū) we can apply (SSC-σ ) and conclude from (23) that v * = 0. Using property (4.11) by Bonifacius and Neitzel (2018) at ( ) we obtain which is the desired contradiction. The second part of the proposition is shown similarly adapting the proof of Corollary 2.6 by . We leave the details to the reader.
Given a radius ρ > 0 we introduce another modification of the perturbed linear quadratic problem (QP-D) and for which the following result holds:

Corollary 6.8 Let the assumptions of Lemma 6.6 be satisfied.
1. There are $\varepsilon, \rho > 0$ such that the triple $(\bar y_d, \bar u_d, \bar p_d)$, i.e. the unique solution of (QP-σ -D), is also the unique solution of (QP-D-ρ) if $\|d\|_{Z_\infty} < \varepsilon$.
2. There are $\varepsilon, \tau > 0$ such that for $\|d\|_{Z_\infty} < \varepsilon$ the control $\bar u_d$ is the unique solution of (GE-D) that is contained in the set $U_{ad} \cap B^{L^2}_\tau(\bar u)$.
A result similar to (2)-but with L ∞ -instead of L 2 -balls-was proven by (Goldberg and Tröltzsch 1998, Theorem 5.4) using a different argument that relies on strongly active sets and continuity of (18).
Proof 1. Choose $\rho = \tfrac{2}{3}\tilde\rho$ and $\varepsilon < \min\{\tilde\varepsilon, \tfrac{\tilde\rho}{3C}\}$, where $\tilde\varepsilon$, $\tilde\rho$ denote the constants from Proposition 6.7 and $C > 0$ is the Z ∞ -L 2 -Lipschitz constant for the map $d \mapsto \bar u_d$, cf. Theorem 5.2 for the Lipschitz continuity. Then, it holds in particular $\|d\|_{Z_\infty} < \tilde\varepsilon$, i.e. the previous Proposition applies, and for all $\|d\|_{Z_\infty} < \varepsilon$. Since $\bar u_d$ is the unique minimizer of (QP-D) restricted to $U_{ad} \cap B^{L^2}_{\tilde\rho}(\bar u_d)$ by quadratic growth (Proposition 6.7 (1)) and this minimizer is contained in the smaller set $U_{ad} \cap B^{L^2}_{\rho}(\bar u)$, we finally proved that $\bar u_d$ is the unique minimizer of (QP-D) restricted to $U_{ad} \cap B^{L^2}_{\rho}(\bar u)$, i.e. the unique minimizer of (QP-D-ρ). (1). Now make use of Proposition 6.7 (2).
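The role of the two radii in part 1 can be made explicit by a triangle inequality. The sketch below reads the choice of constants as $\rho = \tfrac{2}{3}\tilde\rho$ and $\varepsilon < \min\{\tilde\varepsilon, \tilde\rho/(3C)\}$, with $\tilde\varepsilon$, $\tilde\rho$ taken to be the constants from Proposition 6.7; the accents on these symbols are an assumption about the notation, and the display is a reconstruction for illustration, not a quotation.

```latex
% Assumptions: \bar u_0 = \bar u, \; \|\bar u_d - \bar u\|_{L^2(\Lambda)} \le C \|d\|_{Z_\infty},
% \; \rho = \tfrac{2}{3}\tilde\rho, \; \|d\|_{Z_\infty} < \varepsilon \le \tilde\rho / (3C).
\begin{align*}
\|u - \bar u_d\|_{L^2(\Lambda)}
  &\le \|u - \bar u\|_{L^2(\Lambda)} + \|\bar u - \bar u_d\|_{L^2(\Lambda)} \\
  &\le \rho + C\,\|d\|_{Z_\infty}
   < \tfrac{2}{3}\tilde\rho + \tfrac{1}{3}\tilde\rho = \tilde\rho
  \qquad \text{for all } u \in B^{L^2}_{\rho}(\bar u), \\
\text{hence}\quad
B^{L^2}_{\rho}(\bar u) &\subset B^{L^2}_{\tilde\rho}(\bar u_d),
\qquad
\|\bar u_d - \bar u\|_{L^2(\Lambda)} < \tfrac{1}{3}\tilde\rho < \rho .
\end{align*}
```

Under these assumptions the ball around ū is contained in the ball around ū d on which quadratic growth holds, and ū d itself lies in the smaller ball, which is exactly what the uniqueness argument needs.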

Similarly as for
We introduce another variation of (GE): (u) denotes the normal cone of the closed convex set U ad ∩B L 2 ρ (ū) at some point u. The first part of the following result is similar to Corollary 5.6 (Wachsmuth 2007), the second part to the observation on top of p. 240 by Goldberg and Tröltzsch (1998). Theorem 6.9 Let the assumptions of Lemma 6.6 be fulfilled. It holds: 1. The generalized equation (GE) in the spaces X ∞ , Z ∞ is strongly regular at (ȳ,ū,p). 2. There is a ρ > 0 such that the generalized equation (GE-ρ) in the spaces X ∞ , Z ∞ is strongly regular at (ȳ,ū,p).
Proof Both statements are consequences of Corollary 6.8 resp. Theorem 5.2. The first part is proven in the same way as in Wachsmuth (2007). We have to use that the L ∞ -norm is stronger than the L 2 -norm. For the second part note that for all u in the L 2 -interior of the ball B L 2 ρ (ū), i.e. in particular for all u sufficiently close toū in the L ∞ -norm, the equality N U ad ∩B L 2 ρ (ū) (u) = N U ad (u) holds, as already mentioned by Goldberg and Tröltzsch (1998).
The following result is an immediate consequence of an abstract result (Hinze et al. 2009, Theorem 2.19) and Theorem 6.9. The closed graph property for the normal cone map N ρ is standard. Theorem 6.10 Let Assumptions 2.1-2.4 and 3.4 with some σ ∈ (0, ∞) hold. For any (y 0 , p 0 ) sufficiently close to (ȳ,p) in the space it holds: 1. The sequence of iterates generated by the Newton-Josephy method for (GE) with initial iterate (y 0 , u 0 , p 0 ) is well-defined and converges superlinearly in X ∞ to (ȳ,ū,p). 2. The same holds true for the sequence of iterates generated by the Newton-Josephy method for (GE-ρ) with ρ from Theorem 6.9 (2).
From Lemma 6.6 on we had to consider perturbations in Z ∞ , i.e. we had to measure the control in L ∞ (Λ). This is the reason why we have to show strong regularity only in Z ∞ , X ∞ and not in Z s , X s as well, as we did before. That we impose no condition on u 0 is due to the fact that the Newton update equations for (GE) resp. (GE-ρ) are independent of the current u-iterate u k , see the comment after equation (5).

SQP method on U ad ∩ B L 2 (ū)
Finally, we investigate how the iterates of the generalized Newton method from Theorem 6.10 can be computed by solving linear quadratic optimal control problems restricted to U ad ∩ B L 2 ρ (ū). For analogous results in the case of semilinear equations (but with L ∞ -instead of L 2 -balls) we refer to Tröltzsch (1999) and Goldberg and Tröltzsch (1998).

Lemma 6.11
Let the assumptions of Theorem 6.10 hold. Let (y k , u k , p k ) ∈ X ∞ be a given triple and consider the restricted quadratic subproblem (QP-σ ) associated with this triple. There exists an X ∞ -neighbourhood V 1 of (ȳ,ū,p) such that the map is well-defined on V 1 and Lipschitz continuous, where (y σ k+1 , u σ k+1 , p σ k+1 ) denotes the unique solution of (QP-σ ).
With the previous lemma we have shown in particular that is Lipschitz continuous on the X ∞ -neighbourhood V 1 of (ȳ,ū,p). With j k we denoted the reduced functional of (QP-σ ) and p σ k+1 is the adjoint state (w.r.t. (QP-σ )) associated with the control u σ k+1 , see Eqs.

is the only stationary point for
Proof We proceed as in the proofs of Proposition 6.7 (1) and (2) and argue by contradiction. Instead of j d n andū d n we have to consider j k and u k+1 . We only mention the essential ingredients that keep all the previous arguments working: This was shown in Proposition 4.8; use the Riesz-Thorin interpolation theorem (see Remark 6.4) to obtain the required L s -convergence w k →ū from the given L 2 -convergence. (ii) If u k →ū strongly in L 2 and v k v * weakly in L 2 we have: Using the boundedness of (v k ) k , this is a consequence of Proposition 4.7 and the weak lower semicontinuity of j , see Bonifacius and Neitzel (2018), (4.10): This is shown by the same argument as above.
Next, we obtain with the same argument as for Corollary 6.8: Proposition 6.14 Let the assumptions of Theorem 6.10 hold.
1. There is an X ∞ -neighbourhood V 4 of (ȳ,ū,p) and a radius ρ > 0 such that for all (y k , u k , p k ) ∈ V 4 the next SQP iterate (y k+1 , u k+1 , p k+1 ) given by the unique solution of (QP-σ ) is also the unique solution of (QP) with admissible set U ad ∩ B L 2 ρ (ū). 2. There is an X ∞ -neighbourhood V 5 of (ȳ,ū,p) and a radius ρ > 0, such that for all (y k , u k , p k ) ∈ V 5 the next SQP iterate (y k+1 , u k+1 , p k+1 ) given by the unique solution of (QP-σ ) is also the unique L 2 -local solution of the global quadratic problem (QP) that is contained in U ad ∩ B L 2 ρ (ū).
For convenience of the reader we write down the quadratic problem which we will refer to in our final theorem: and ∂ t y + A (y k )y + A (y k )y = Bu + A (y k )y k y(0) = y 0 (QP(ρ, y k , p k )) Theorem 6.15 Let Assumptions 2.1-2.4 and 3.4 with some σ ∈ (0, ∞) hold. Then there are radii ρ > 0, r S Q P > 0 such that for any initial guess ≤ r S Q P the sequence of iterates generated by the successive solution of the SQP subproblems (QP(ρ, y k , p k )) converges superlinearly in X ∞ to (ȳ,ū,p).
A possible choice of y 0 , p 0 are state y 0 and adjoint state p 0 associated to some

Proof Combine Proposition 6.14 with Theorem 5.4.

This theorem is our main result. Note in particular that we tightened the gap between the L 2 -local growth condition originating from the second order sufficient conditions and the "closeness"-conditions in the SQP method. The latter had been formulated with respect to L ∞ in the existing literature. Now, in Theorem 6.15 above all required "closeness" can be formulated with respect to the L 2 -norm.

Regularity of the adjoint state
In this section we prove the regularity required for the adjoint state in our analysis. In (Bonifacius and Neitzel 2018, Proposition 4.7) it was shown that whereas we need additional regularityp ∈ L ∞ (I , W 1, p ) as explained in Remark 4.6. In fact, we will show even higher regularity forp in the theorem below than necessary.
To improve readability of our arguments, we start with a collection of results from Bonifacius and Neitzel (2018). As further reference for maximal parabolic regularity on H −ζ, p -spaces we mention the work of Haller-Dintelmann and Rehberg (2009). Some of the results cited below are originally due to them.
We want to obtain more regularity for the adjoint state and to do so we consider restrictions of the above isomorphism onto smaller spaces of more regular functions. First, note that a short computation shows A (y) * | L r (I ,W 1, p D ) = A (y)| L r (I ,W 1, p D ) and similarly we can express A (y) * restricted to L r (I , W 1, p D ) as first order differential operator A (y) * ϕ = ξ (y)μ∇ y∇ϕ. Standard Sobolev embeddings imply that under Assumption 2.4 it holds We already know by Theorem 7.1 (4a) that −div(ξ(y)μ∇·) has maximal parabolic regularity on L r (I , H −ζ, p D ) and that t → −div(ξ(y)μ∇·) is continuous as map I → L (D, H −ζ, p D ), which follows from Theorem 7.1 (2a) and (3). As above, we infer from Amann (2004) that is a topological isomorphism. Now, choose 1+ζ 2 < θ < 1 such that 1 r > 1 − θ . It follows by (Amann 2003, formula (1.2)) and Theorem 7.1 (2a), (5) that holds. Hence, the operator is compact as it can be expressed as composition of linear operators of which one is a compact embedding. We conclude that the sum is a Fredholm-operator of index 0 for every r ∈ (1, ∞). Since it is the restriction of the isomorphism (25) above, its kernel is trivial and therefore we actually have an isomorphism. To sum this up we have shown the following regularity result: Remark 7.3 Note that we did not need more assumptions than Bonifacius and Neitzel (2018) except for the slightly higher integrability of y d . In the framework of maximal parabolic regularity on W −1, p D -spaces they discuss first order necessary and second order sufficient optimality conditions, but in order to deal with the adjoint equation in the maximal parabolic regularity context (Bonifacius and Neitzel 2018, Lemma 4.6, Proposition 4.7) they required states in C α (I , W 1, p D ) which was achieved by consideration of the state equation on H −ζ, p D spaces. 
Since we aim at SQP methods, having an adjoint equation with corresponding regularity theory is necessary anyway, and therefore the restriction to the H −ζ, p D -setting is not superfluous.

Remark 7.4
Since μ was assumed to be symmetric we could identify A (y) * with A (y) etc. directly. In fact, all arguments go through if we postulate the same assumptions for μ T as already done for μ.

Numerical Examples
In this final section we present numerical examples in order to illustrate our theoretical results. To do so we have constructed so-called manufactured solution examples, i.e. an optimal control problem with analytically known solution, see (Tröltzsch 2010, Section 2.9) for the construction of such examples. Further, we test with an example based on real-world parameters, cf. Sect. 8.2.
We implemented the SQP algorithm in Python using an optimize-then-discretize approach and FEniCS (Alnaes et al. 2015; Logg et al. 2012) for the finite element discretization of the problem. Following the approach of Hintermüller and Hinze (2006) the implemented algorithm consists of three nested loops: The outermost iteration is given by the SQP method. The quadratic subproblem of each SQP iteration is solved iteratively by application of the semismooth Newton method (SSN), see e.g. Ulbrich (2011). Finally, the innermost loop consists of the iterative solution of the Newton update equation by the CG method in every semismooth Newton iteration.
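The three nested loops can be sketched on a small finite-dimensional model problem. The following is not the FEniCS implementation described above: it assumes a hypothetical box-constrained toy objective f(u) = ½ uᵀKu − bᵀu + (β/4) Σ uᵢ⁴, uses a primal-dual active set iteration as the semismooth Newton method for the box-constrained quadratic subproblem, and a hand-written CG solver with tight tolerances. All names and parameter values are illustrative assumptions.

```python
import numpy as np


def cg(matvec, rhs, rel_tol=1e-12, max_iter=500):
    """Innermost loop: conjugate gradients for a symmetric positive definite system."""
    x = np.zeros_like(rhs)
    r = rhs.copy()
    p = r.copy()
    rr = r @ r
    stop = rel_tol**2 * rr
    if rr == 0.0:
        return x
    for _ in range(max_iter):
        Ap = matvec(p)
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if rr_new <= stop:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x


def solve_box_qp(H, c, lo, hi, v, max_iter=50):
    """Middle loop: primal-dual active set iteration (a semismooth Newton method)
    for  min 0.5 v'Hv - c'v  subject to  lo <= v <= hi."""
    mu = c - H @ v                          # multiplier estimate from stationarity
    previous = None
    for _ in range(max_iter):
        upper = mu + (v - hi) > 0           # predicted active at the upper bound
        lower = mu + (v - lo) < 0           # predicted active at the lower bound
        if previous is not None and np.array_equal(upper, previous[0]) \
                and np.array_equal(lower, previous[1]):
            break                           # active sets repeat: v satisfies the KKT system
        previous = (upper, lower)
        active = upper | lower
        free = ~active
        v = np.where(upper, hi, np.where(lower, lo, 0.0))
        rhs = c[free] - H[np.ix_(free, active)] @ v[active]
        v[free] = cg(lambda x: H[np.ix_(free, free)] @ x, rhs)
        mu = np.where(active, c - H @ v, 0.0)
    return v


def sqp(grad, hess, lo, hi, u, tol=1e-9, max_iter=30):
    """Outer loop: SQP iteration; returns the final iterate and the increment norms."""
    increments = []
    for _ in range(max_iter):
        g, H = grad(u), hess(u)
        # min g'(v-u) + 0.5 (v-u)'H(v-u) over the box equals min 0.5 v'Hv - (Hu-g)'v.
        u_new = solve_box_qp(H, H @ u - g, lo, hi, u)
        increments.append(np.linalg.norm(u_new - u))
        u = u_new
        if increments[-1] < tol:
            break
    return u, increments


# Hypothetical toy data: K is tridiagonal and positive definite, beta is small.
n = 50
x = np.linspace(0.0, 1.0, n)
K = 3.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = 4.0 * np.sin(2.0 * np.pi * x)
beta = 0.1


def grad(u):
    return K @ u - b + beta * u**3


def hess(u):
    return K + 3.0 * beta * np.diag(u**2)


u_opt, increments = sqp(grad, hess, -1.0, 1.0, np.zeros(n))
residual = np.linalg.norm(u_opt - np.clip(u_opt - grad(u_opt), -1.0, 1.0))
print(len(increments), residual)
```

The design mirrors the structure described in the text: the SQP loop only assembles a quadratic model, the active set loop only handles the box constraints, and CG only sees a linear system on the currently inactive indices.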
In order to solve the quadratic subproblems accurately enough we choose the relative tolerance for SSN to be $10^{-5}$, i.e. the solver of the quadratic subproblems terminates either when the L 2 -norm of the projection residual (of the subproblem) has been reduced by a factor of $10^{-5}$ or when the maximum of 20 SSN iterations is reached. To avoid problems in case of already very small initial residual norms, the SSN iteration also ends when the residual norm gets smaller than $10^{-12}$ (absolute tolerance). Similarly, the CG method terminates if the initial CG residual is decreased by a factor of at least $10^{-2}$. To enhance stability, SSN is combined with an Armijo linesearch with the squared L 2 -norm of the projection residual (of the subproblem) as merit function.
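The stopping rules and the linesearch can be sketched as follows. The tolerances 1e-5, 1e-12 and the limit of 20 iterations are the ones stated above; the Armijo constant `c1`, the backtracking factor `shrink`, the function names, and the scalar test function are assumptions made for this illustration. The acceptance test uses that, for a smooth residual F and the Newton direction, the directional derivative of the merit function ‖F‖² equals −2‖F‖².

```python
import math


def ssn_should_stop(res_norm, initial_res_norm, iteration,
                    rel_tol=1e-5, abs_tol=1e-12, max_iter=20):
    """Combined relative, absolute and iteration-count stopping rule."""
    return (res_norm <= rel_tol * initial_res_norm
            or res_norm < abs_tol
            or iteration >= max_iter)


def armijo_step(merit, x, direction, c1=1e-4, shrink=0.5, max_backtracks=30):
    """Backtracking linesearch for a Newton direction with merit(x) = ||F(x)||^2."""
    m0 = merit(x)
    t = 1.0
    for _ in range(max_backtracks):
        # Armijo condition: merit(x + t d) <= merit(x) + c1 * t * (-2 * merit(x)).
        if merit(x + t * direction) <= (1.0 - 2.0 * c1 * t) * m0:
            return t
        t *= shrink
    return t


# Scalar illustration: the full Newton step for F(x) = atan(x) overshoots from x0 = 10.
F = math.atan
x0 = 10.0
newton_dir = -F(x0) * (1.0 + x0 * x0)      # -F(x0) / F'(x0) with F'(x) = 1 / (1 + x^2)
t = armijo_step(lambda x: F(x) ** 2, x0, newton_dir)
print(t, F(x0) ** 2, F(x0 + t * newton_dir) ** 2)
```

In this example the full step is rejected and a damped step is accepted, which is the stabilising effect the linesearch is meant to provide.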
As observed in the existing literature, the restriction of the quadratic subproblems to L ∞ -balls or, in our case, L 2 -balls is only required to prove convergence of the algorithm in function space. Fortunately, we can omit this additional constraint in practice and solve the quadratic subproblems on U ad without losing convergence, i.e. the subproblems in our implementation are given by (QP), cf. the end of Sect. 3.2.
In all three examples the initial guess for the SQP method is (y 0 , u 0 , p 0 ) := (0, 0, 0). To measure optimality of some iterate u k we compute the L 2 -norm of the residual of the projection formula where the adjoint state p(u k ) associated to u k is computed using the implicit Euler scheme. The nonlinear equations appearing at each timestep during the solution of the state equation are solved by the built-in nonlinear solver of FEniCS. Convergence of the SQP algorithm is measured by the increments Note that we do not compute the norm of the increments with respect to the norms appearing in Theorem 6.15 because we do not have the abstract exponents p, s at hand in a practical context. To illustrate our theoretical results, we show for all examples both increments and residuals for different discretizations. Convergence behaviour of the SQP method that is uniform with respect to sufficiently fine discretization strongly indicates convergence in function space.
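A discrete version of such a projection residual can be sketched in a few lines. The sketch assumes the usual form of the projection formula for a gradient of the form γu + B*p, namely u = P_[u_a, u_b](−B*p/γ), and a weighted sum as discrete L 2 -norm; the function name, the argument `weights` (quadrature weights, for instance the timestep size for piecewise constant controls) and the sample data are hypothetical.

```python
import numpy as np


def projection_residual(u, Bstar_p, gamma, u_a, u_b, weights):
    """Discrete L2-norm of u - P_[u_a, u_b](-B*p / gamma)."""
    projected = np.clip(-Bstar_p / gamma, u_a, u_b)
    return np.sqrt(np.sum(weights * (u - projected) ** 2))


# Hypothetical data on a uniform time grid with 100 steps.
n_steps = 100
weights = np.full(n_steps, 1.0 / n_steps)
Bstar_p = np.linspace(-0.05, 0.05, n_steps)   # stand-in for the discrete values of B*p
gamma, u_a, u_b = 1e-2, -2.0, 2.0

u_consistent = np.clip(-Bstar_p / gamma, u_a, u_b)
res_zero = projection_residual(u_consistent, Bstar_p, gamma, u_a, u_b, weights)
res_perturbed = projection_residual(u_consistent + 0.1, Bstar_p, gamma, u_a, u_b, weights)
print(res_zero, res_perturbed)
```

A control that satisfies the projection formula gives a residual of zero, and a constant perturbation of size 0.1 gives a residual of about 0.1 on this grid of total length one.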
With the help of Wolfram Mathematica we compute the remaining quantities y d , f ,ū such that the optimality system for (27) is fulfilled. In particular it holds Note that all our theoretical results remain true for a problem of type (27) since addition of the term f to the model problem (OCP) does not change its structural properties. Discretization of spatial functions is done with piecewise linear finite elements on an equidistant partition of Ω = [0, 1] into N h subintervals. For time discretization we apply an implicit Euler discretization with $N_t = N_h^2$ timesteps, where the number of timesteps is chosen in order to roughly balance spatial and temporal discretization errors, cf. Casas and Chrysafinos (2019). The behaviour of the increments incr ∞ k and incr 2 k during the SQP iteration is shown in Table 1, whereas L 2 -residuals and errors of the SQP-iterates with respect to the interpolated true KKT-triple are shown in Table 2. Note that increments (Table 1a, b) and their decrease factors (Table 1c, d) indicate superlinear convergence and behave uniformly with respect to the different discretization levels, which illustrates superlinear convergence in function space. Also, the residuals (Table 2a) and errors (Table 2b-f) seem to behave uniformly, at least until their convergence stagnates due to the limited accuracy given by discretization.
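How decrease factors of increments indicate superlinear convergence can be illustrated with a small Python sketch. It does not use the data of Table 1; instead the increments are generated by Newton's method for the scalar equation x² = 2, which serves as a stand-in for a superlinearly convergent iteration. The helper name `decrease_factors` is hypothetical.

```python
def decrease_factors(increments):
    """Quotients incr_{k+1} / incr_k of consecutive increments."""
    return [b / a for a, b in zip(increments[:-1], increments[1:])]


# Stand-in iteration: Newton's method for x^2 = 2 started at x = 2.
x = 2.0
iterates = [x]
for _ in range(5):
    x = x - (x * x - 2.0) / (2.0 * x)
    iterates.append(x)

increments = [abs(b - a) for a, b in zip(iterates[:-1], iterates[1:])]
factors = decrease_factors(increments)
print(increments)
print(factors)
```

For a merely linearly convergent method the factors would settle at a positive constant; factors that tend to zero, as here, are the signature of superlinear convergence.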

Example 2
For I = [0, 1] and $\Omega = [0, 1]^2$ we consider a problem of the same structure as the 1D manufactured solution example (27), but now with y 0 (x) = sin(π x 1 ) sin(π x 2 ), y(t, x) = cos(2π t) sin(π x 1 ) sin(π x 2 ), and the regularization parameter $\gamma = 2 \cdot 10^{-3}$ in (27) replaced by $\gamma = 10^{-2}$. As before, the remaining quantities are computed utilizing Wolfram Mathematica and the optimal control is given by Discretization of spatial functions is now done with piecewise linear finite elements on a triangular mesh generated by mshr, the mesh-generation tool of FEniCS, with maximum element diameter h max . For time discretization we apply an implicit Euler discretization with N t timesteps, where the size of timesteps $\tau = N_t^{-1} \approx h_{\max}^2$ is chosen in order to roughly balance spatial and temporal discretization errors, cf. Casas and Chrysafinos (2019). Maximum element diameter and number of timesteps of the four different discretization levels used in our numerical experiments can be found in Table 3. In Table 4 we display increments and their decrease rates during the SQP iteration. Similarly to the 1D manufactured solution example these quantities behave uniformly with respect to different discretization levels, which illustrates convergence in function space. Moreover, residuals (Table 5a) and errors of the iterates with respect to the interpolated true KKT-triple (Table 5b-f) show uniform behaviour until stagnation due to the respective discretization occurs.

Example 3
This final example is chosen to demonstrate that our assumptions also cover an example with real-world parameters. We consider the following problem related to heat conduction in a block of silicon modelled according to Selberherr (1984):
and α = 0.0146647. In order to make ξ formally fulfill Assumption 2.2 we can choose a C 2 -continuous continuation of the above ξ outside the relevant values of y that is uniformly bounded from below and above.

Fig. 1 Example 3 (Sect. 8.2) on the finest discretization level: a Optimal control and optimal state y opt evaluated at the points x corner = (−0, 5, 2, 0) and x middle = (0, 0, 0) (left hand side plot). As comparison we also display the state y naive associated with the "naive" first guess $u_{naive}(t) := 10 - \tfrac{71}{400}\,t$ at the same points and the desired trajectory y d . b Control iterates during the SQP method on a certain subinterval of I
Measuring temperature in units of 100 K, length in units of 0.1 m and time in units of 60 s, the state equation of (28) describes the evolution of the temperature $y$ of a block $\Omega$ of silicon with initial temperature 1000 K when the temperature of the surrounding air is given by the control variable $u$. Hence, the optimal control problem aims at finding the optimal temperature trajectory for the ambient air in order to cool down the block $\Omega$ following the desired temperature trajectory $y_d$ from 1000 K to room temperature 290 K. Density, specific heat, and temperature-dependent thermal conductivity are taken from Selberherr (1984, Chapters 2.5 and 4.3) and rescaled according to the above-mentioned units. For the heat transfer coefficient between silicon and air (forced convection) we guess the value 40 W m$^{-2}$ K$^{-1}$, which results in the value given for $\alpha$.
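The rescaling can be checked with a few lines of arithmetic. The sketch below converts the physical temperatures to the dimensionless units used above and evaluates the "naive" first guess from the caption of Fig. 1. The helper names are hypothetical. That the naive guess reaches room temperature at the scaled time $t = 40$ (i.e. after 40 minutes) follows from its formula alone; whether this coincides with the end of the time interval is an assumption on our part here.

```python
KELVIN_PER_UNIT = 100.0   # temperature is measured in units of 100 K
SECONDS_PER_UNIT = 60.0   # time is measured in units of 60 s


def to_scaled_temperature(kelvin):
    """Convert a temperature in Kelvin to the scaled unit (100 K)."""
    return kelvin / KELVIN_PER_UNIT


def to_seconds(scaled_time):
    """Convert a scaled time (unit 60 s) back to seconds."""
    return scaled_time * SECONDS_PER_UNIT


def u_naive(t):
    """Naive first guess for the control in scaled units (caption of Fig. 1)."""
    return 10.0 - 71.0 / 400.0 * t


initial_temperature = to_scaled_temperature(1000.0)  # scaled value of 1000 K
room_temperature = to_scaled_temperature(290.0)      # scaled value of 290 K

# Scaled time at which the linear naive guess reaches room temperature.
t_room = (initial_temperature - room_temperature) / (71.0 / 400.0)

print(initial_temperature, room_temperature, t_room, to_seconds(t_room))
```

In these units the naive guess is simply the straight line connecting the scaled initial temperature with the scaled room temperature.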
As pointed out after Assumption 2.1, the domain under consideration fulfills our assumptions although it is not a domain with Lipschitz boundary. The Robin boundary condition in (28) is not covered by our assumptions, but since it differs from Neumann boundary conditions only by a linear term, this can be tackled by straightforward modifications of our arguments, cf. also Meinlschmidt et al. (2017a, b). All computations were performed on tetrahedral meshes generated by mshr with maximal cell diameter $h_{\max}$ and $N_t$ implicit Euler timesteps, see Table 6 for the different discretization levels. The numerically determined optimal control and associated optimal state are shown in Figure 1 a). Due to the three-dimensionality of the problem we were not able to choose discretizations as fine as in the previous examples, and therefore the behaviour of the increments (Table 7) and residuals (Table 8) is not as illustrative as in 1D or 2D. Figure 1 b) shows an enlarged section of the control iterates near the change from inactive to active set at $t \approx 17.1$: it can be seen that once the correct active set is identified after the third iteration, convergence is so fast that there is no visible difference between the further iterates. This might be seen as an illustration of the importance of detecting the correct active sets in infinite dimensions, which has been discussed at the beginning of Sect. 6. The small kinks in the plots at the border between active and inactive set are due to the fact that the time discretization (timestep size $\tau \approx 3.38 \cdot 10^{-2}$) does not resolve the active/inactive sets exactly.
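The origin of the kinks can be illustrated by locating the switching time within the time grid. The following minimal sketch assumes a uniform grid starting at $t = 0$ and uses the rounded values $\tau \approx 3.38 \cdot 10^{-2}$ and $t \approx 17.1$ quoted above, so the resulting interval is only illustrative. The function name `enclosing_time_interval` is hypothetical.

```python
import math


def enclosing_time_interval(t_switch, tau):
    """Return the interval [k * tau, (k + 1) * tau) of a uniform grid
    starting at t = 0 that contains the time t_switch."""
    index = math.floor(t_switch / tau)
    return index * tau, (index + 1) * tau


tau = 3.38e-2     # rounded timestep size quoted in the text
t_switch = 17.1   # rounded switching time between inactive and active set

left, right = enclosing_time_interval(t_switch, tau)
print(left, right)
```

Unless the switching time happens to coincide with a grid point, a discrete control can change between active and inactive only at the ends of such an interval, so the transition is located with an error of up to one timestep $\tau$.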