A hybrid semismooth quasi-Newton method for nonsmooth optimal control with PDEs

We propose a semismooth Newton-type method for nonsmooth optimal control problems. Its particular feature is the combination of a quasi-Newton method with a semismooth Newton method. This reduces the computational costs in comparison to semismooth Newton methods while maintaining local superlinear convergence. The method applies to Hilbert space problems whose objective is the sum of a smooth function, a regularization term, and a nonsmooth convex function. In the theoretical part of this work we establish the local superlinear convergence of the method in an infinite-dimensional setting and discuss its application to sparse optimal control of the heat equation subject to box constraints. We verify that the assumptions for local superlinear convergence are satisfied in this application and we prove that convergence can take place in stronger norms than that of the Hilbert space if initial error and problem data permit. In the numerical part we provide a thorough study of the hybrid approach on two optimal control problems, including an engineering problem from magnetic resonance imaging that involves bilinear control of the Bloch equations. We use this problem to demonstrate that the new method is capable of solving nonconvex, nonsmooth large-scale real-world problems. Among other topics, the study addresses mesh independence, globalization techniques, and limited-memory methods. We observe throughout that algorithms based on the hybrid methodology are several times faster in runtime than their semismooth Newton counterparts.


Introduction
In Mannel and Rund (2019) we developed the convergence theory of a hybrid semismooth quasi-Newton method to solve structured operator equations in Banach spaces. In the present paper we address applications and numerical realizations of that work. Specifically, we propose a novel algorithm to solve infinite-dimensional nonsmooth optimization problems of the form

(P)   min_{u ∈ U} f(u) + (γ/2)‖u‖_U² + ϕ(u),

show that its theoretical requirements are met by certain optimal control problems that are special instances of (P), and provide an extensive numerical study for the new method on convex and nonconvex optimal control problems. In (P), U is a Hilbert space, γ > 0, ϕ: U → R ∪ {+∞} is convex but possibly nonsmooth, and f: U → R is smooth but possibly nonconvex. The precise problem setting is given in Sect. 3. A prototypical example from PDE-constrained optimal control is

min_{u ∈ L²(Ω)} (1/2)‖y(u) − y_d‖²_{L²(Ω)} + (α/2)‖u‖²_{L²(Ω)} + β‖u‖_{L¹(Ω)}   subject to a ≤ u ≤ b a.e. in Ω,

where a < 0 < b and α, β > 0 are real numbers, Ω is a bounded Lipschitz domain, y_d ∈ L²(Ω), and y = y(u) is for u ∈ L²(Ω) the solution of the semilinear equation −Δy + y³ = u with appropriate boundary conditions. As f(u) = (1/2)‖y(u) − y_d‖²_{L²(Ω)} in this instance of (P), the evaluation of f and its derivatives requires PDE solves.
The new algorithm is a semismooth Newton-type method that exploits the presence of the smooth term ∇f in the optimality conditions of (P) by applying a quasi-Newton method. Specifically, the operator ∇²f(u_k) that appears in semismooth Newton methods is replaced by a quasi-Newton approximation B_k. In PDE-constrained optimal control problems this lowers the runtime significantly because it omits the PDE solves that occur in the evaluation of Hessian-vector products ∇²f(u_k)d while maintaining superlinear convergence. Note that a direct application of quasi-Newton methods to semismooth equations cannot ensure superlinear convergence. For instance, Broyden's method applied to a piecewise affine (hence semismooth) equation in R may converge only r-linearly, cf. (Griewank 1987, Introduction). We present the hybrid method and its convergence properties for problems of the form (P) in Sect. 3. Its application to a model problem from PDE-constrained optimal control, the time-dependent sparse optimal control of linear and semilinear heat equations subject to box constraints, constitutes Sect. 4 and concludes the theoretical part.
The remainder of the paper is devoted to numerics. We devise various numerical realizations of the hybrid method and compare them as part of an extensive numerical study. This study addresses many theoretical and practical aspects of the hybrid approach: We verify experimentally the superlinear convergence with respect to different norms, we investigate mesh independence properties, we compare different globalization strategies including a trust-region method, and we examine different quasi-Newton updates (Broyden, SR1, BFGS).
The numerical study is based on two optimal control problems, starting with the time-dependent box-constrained sparse optimal control of the linear heat equation from the theoretical part. This model problem allows us to display very clearly the convergence properties of the hybrid method, e.g., its local superlinear convergence. The second problem deals with the design of radio-frequency pulses for magnetic resonance imaging, a topic from medical engineering. A realistic modeling yields a nonsmooth and nonconvex optimal control problem that serves as a benchmark for the performance of the hybrid approach on real-world applications. The numerical results underline that the hybrid approach can be competitive on such problems. Indeed, a previous version of the presented trust-region method formed the kernel of the code Rund et al. (2018a, b) that won the ISMRM challenge on radio-frequency pulse design in magnetic resonance imaging Grissom et al. (2017). Here we provide an improved successor.
Since quasi-Newton methods involve the Hilbert space structure of U in an essential way, it may be surprising that the numerical results for both problems clearly indicate convergence of the control u with respect to stronger norms than that of U . In Sect. 4 we establish rigorous theoretical results that explain this behavior for the control of the heat equation. It is related to the regularity of the problem and the quality of the initial approximation.
State-of-the-art methods for solving optimal control problems of the form (P) are semismooth Newton methods, cf. Ito and Kunisch (2008), Hinze et al. (2009), Ulbrich (2011), and De los Reyes (2015). In particular, they have been applied successfully to sparse optimal control problems, cf., e.g., Stadler (2009), Amstutz and Laurain (2013), Herzog et al. (2012), Herzog et al. (2015), Kunisch et al. (2016), and Boulanger and Trautmann (2017). Since the problems that we address in the numerical study are of this type, we consistently compare the hybrid approach to semismooth Newton methods.
In finite-dimensional settings there are many works available that treat (modified and unmodified) quasi-Newton methods for nonsmooth equations. The idea to apply a quasi-Newton method to the smooth part of a structured nonsmooth equation, which is at the core of the hybrid approach, appears in Chen and Yamamoto (1992), Wang et al. (2011), Qi and Jiang (1997), Han and Sun (1997), and Sun and Han (1997). Among these contributions, Han and Sun (1997) is the closest to our work because it is based on finding the root of a normal map; after reformulating the optimality conditions of (P) we are faced with the same task (which we tackle by the hybrid semismooth quasi-Newton method). However, Han and Sun (1997) does not address the infinite-dimensional setting and, in finite dimensions, is less general than our approach in regard to the set of constraints, which in Han and Sun (1997) has to be polyhedral. On the other hand, Han and Sun (1997) offers a deeper treatment of the specific setting of normal maps with polyhedral sets in finite dimensions.

This paper is organized as follows. Section 2 specifies some notions that are important for this work, e.g., semismoothness. In Sect. 3 we provide the problem under consideration in full detail, introduce the hybrid method, and establish convergence results for it. Section 4 discusses the application of the method to sparse optimal control of linear and semilinear heat equations with box constraints. In Sect. 5 we comment on implementation issues. Section 6 contains the numerical study, and in Sect. 7 we draw conclusions from this work. In "Appendix A" we provide the algorithm that we found most effective in the numerical studies: a matrix-free limited-memory truncated trust-region variant of the hybrid method.

Preliminaries
In this short section we fix the notation, recall the notion of proximal maps, and specify which concept of semismoothness we use.
We set N := {1, 2, 3, . . .}. All linear spaces are real linear spaces. Let X and Y be Banach spaces. We denote B_δ(x̄) := {x ∈ X : ‖x − x̄‖_X < δ} for δ > 0 and x̄ ∈ X. Moreover, L(X, Y) represents the space of bounded linear maps from X to Y. If X can be continuously embedded into Y, this is indicated by X ↪ Y. In the Hilbert space U we write (v, w)_U for the scalar product of v, w ∈ U and (v, ·)_U for the linear operator w ↦ (v, w)_U from U to R. Furthermore, we recall the definition and elementary properties of the proximal map.
Definition 1 Let U be a Hilbert space and let γ > 0. Let ϕ: U → R ∪ {+∞} be a proper, convex, and lower semicontinuous function and denote its effective domain by C := {u ∈ U : ϕ(u) < +∞}. The proximal mapping of ϕ_γ := ϕ/γ is given by

Prox_{ϕγ}(u) := argmin_{v ∈ U} ( ϕ_γ(v) + (1/2)‖v − u‖_U² ).

It is easy to see that Prox_{ϕγ} is single-valued, has image C, and satisfies for u, û ∈ U and arbitrary γ > 0 the relation

û = Prox_{ϕγ}(u)  ⟺  γ(u − û) ∈ ∂ϕ(û),

where ∂ϕ denotes the convex subdifferential of ϕ.
If ϕ is the characteristic function of a closed convex set, then Prox ϕ γ is the projection onto that set. More on proximal mappings in Hilbert spaces can be found in (Bauschke and Combettes 2017, Section 24), for instance. We will use the rather general notion of semismoothness from (Ulbrich 2011, Definition 3.1) that includes, for instance, Newton differentiability, cf. (Ito and Kunisch 2008, Definition 8.10).
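For readers who want to experiment, the projection case can be checked numerically. The sketch below (our own illustration in Python/NumPy, not part of the original exposition; the helper name is hypothetical) compares the prox of the characteristic function of a box, computed by brute-force minimization of v ↦ (1/2)(v − q)² over the box, with its closed form, the pointwise projection (clipping).

```python
import numpy as np

def prox_box_indicator(q, a, b):
    # Prox of the characteristic function of [a, b] is the projection (clipping).
    return np.clip(q, a, b)

# Brute-force check on a grid: argmin_{v in [a,b]} 0.5*(v - q)^2 equals clip(q, a, b).
a, b = -1.0, 2.0
for q in [-3.0, 0.5, 4.0]:
    grid = np.linspace(a, b, 100001)
    v_star = grid[np.argmin(0.5 * (grid - q) ** 2)]
    assert abs(v_star - prox_box_indicator(q, a, b)) < 1e-4
```

The same brute-force comparison can be repeated for other convex ϕ, e.g. for scaled absolute values, which leads to the soft-shrinkage operators used in Sect. 4.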
Definition 2 Let X, Y be Banach spaces and let x̄ ∈ X. Let G: X → Y be continuous in an open neighborhood of x̄. Moreover, let ∂G: X ⇒ L(X, Y) satisfy ∂G(x) ≠ ∅ for all x ∈ X. We say that G is semismooth at x̄ with respect to ∂G iff there holds

sup_{M ∈ ∂G(x̄+h)} ‖G(x̄ + h) − G(x̄) − Mh‖_Y = o(‖h‖_X)   for ‖h‖_X → 0.
The set-valued mapping ∂G: X ⇒ L(X, Y) is called a generalized differential of G.

Problem setting, algorithm, and convergence results
In this section we introduce the problem class in full detail and present the hybrid method to solve it. We provide its convergence properties and recall a result concerning local uniform invertibility of generalized differentials.

Problem setting and algorithm
Throughout this work we consider optimization problems of the form (P), the details of which are contained in the following assumption.
Assumption 1 Let the following conditions be satisfied.
1) U is a Hilbert space.
3) The function ϕ: U → R ∪ {+∞} is proper, convex, and lower semicontinuous. 4) The function f: U → R is continuously differentiable. 5) There is a Banach space Q ↪ U such that ∇f: U → Q is differentiable, and such that for each q ∈ B_δ(q̄) and all M ∈ ∂Prox_{ϕγ}(q) there holds, where the final implication involves (1). The last identity yields 0 ∈ ∇f(û) + ∂ϕ(û). The assertion concerning convexity is true since it just restates that the necessary optimality condition is sufficient for global optimality in convex optimization.
Next we provide the new algorithm. It aims at solving the operator equation H(q) = 0. Note that H acts on the artificial variable q that is related to the control u by u_k = Prox_{ϕγ}(q_k) for k ≥ 1, respectively, q̄ = −(1/γ)∇f(ū). The key idea of the new method is to replace the Hessian ∇²f(u_k) that appears in semismooth Newton methods by a quasi-Newton approximation B_k. In contrast, the generalized derivative of the proximal map Prox_{ϕγ} is left unchanged. The algorithm thus combines a quasi-Newton method with a semismooth Newton method and can be regarded as a hybrid approach. It reads as follows.
Algorithm 1: Hybrid semismooth quasi-Newton method for (P); 11 else let B_{k+1} := B_k; 12 end. Output: ū. In the numerical study in Sect. 6 we work exclusively with (σ_k) ≡ 1 in line 9, i.e., the classical Broyden update, as already this simple choice results in efficient algorithms. Also, we will compare this update to other update formulas, specifically the SR1 and the BFGS formulas. Moreover, it should be clear that Algorithm 1 needs to be globalized for the numerical experiments. We will observe in the first half of Sect. 6 that if f is convex, then a simple line search suffices for this. In contrast, if (P) is severely nonlinear, then a trust-region globalization yields better results, cf. the optimal control of the Bloch equation in the second half of Sect. 6.
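To make the structure of the hybrid iteration concrete, the following Python sketch (our own illustration, not the paper's implementation) runs it on a small finite-dimensional instance of (P): f(u) = (1/2)‖Au − y_d‖² is smooth and quadratic, γ = 1, and ϕ is the characteristic function of a box, so that Prox_{ϕγ} is clipping. The residual safeguard that falls back to a fixed-point step is a simplification of the globalizations discussed in Sect. 6; the scaling of A makes this fallback a contraction, so the run is guaranteed to converge.

```python
import numpy as np

# Toy instance of (P): f(u) = 0.5*||A u - y_d||^2, gamma = 1, phi = indicator of [a, b]^n.
rng = np.random.default_rng(0)
n = 5
G = rng.standard_normal((8, n))
A = np.sqrt(0.9) * G / np.linalg.norm(G, 2)   # ensures ||grad^2 f||_2 <= 0.9 < gamma
yd = rng.standard_normal(8)
gamma, a, b = 1.0, -1.0, 1.0

grad_f = lambda u: A.T @ (A @ u - yd)
prox = lambda q: np.clip(q, a, b)             # Prox of the box indicator
H = lambda q: grad_f(prox(q)) + gamma * q     # semismooth operator equation H(q) = 0

q, B = np.zeros(n), np.zeros((n, n))          # B is the quasi-Newton model of grad^2 f
for k in range(200):
    r = np.linalg.norm(H(q))
    if r < 1e-10:
        break
    M = np.diag(((q > a) & (q < b)).astype(float))   # element of the generalized dProx(q)
    s = np.linalg.lstsq(B @ M + gamma * np.eye(n), -H(q), rcond=None)[0]  # hybrid system
    if np.linalg.norm(H(q + s)) <= 0.9 * r:
        q_new = q + s                          # hybrid semismooth quasi-Newton step
    else:
        q_new = -grad_f(prox(q)) / gamma       # fallback fixed-point step (a contraction here)
    su = prox(q_new) - prox(q)                 # secant pair for the smooth part only
    y = grad_f(prox(q_new)) - grad_f(prox(q))
    if su @ su > 1e-14:
        B += np.outer(y - B @ su, su) / (su @ su)   # Broyden's update (sigma_k = 1)
    q = q_new
```

Near the solution the hybrid step is accepted and the Broyden model drives the fast local convergence; the fallback only safeguards the early iterations.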

Convergence results
For the iterates (q_k) of Algorithm 1 we have the following convergence result.

A general approach for local uniform invertibility
The following result is taken from (Pieper 2015, Section 3). It allows us to conveniently establish condition 7) of Assumption 1.
Lemma 2 Let conditions 1)-6) of Assumption 1 hold and let H and ∂H be given by (3), respectively, (4). Moreover, suppose that ∂Prox_{ϕγ} can be extended to U in such a way that ∂Prox_{ϕγ}(u) ⊂ L(U, U) for every u ∈ U. Then condition 7) of Assumption 1 is satisfied if there exist ν, δ̄ > 0 such that for each u ∈ B_δ̄(ū) and all M ∈ ∂Prox_{ϕγ}(u) we have (5). Proof This is (Pieper 2015, Lemma 3.15) for the situation at hand. Condition (5) is, in particular, valid if ∇²f(ū) is positive semidefinite.

Application to PDE-constrained optimal control
In this section we show for a model problem how the hybrid approach can be applied to PDE-constrained optimal control problems. These problems are well-suited for the application of the hybrid method because the assumptions for fast local convergence are typically satisfied.

An important proximal map
To facilitate the discussion of the model problem in Sect. 4.2, we study the associated proximal map in this section. To this end, let N ∈ N, T > 0, and U := L²(I)^N, where I := (0, T). We are interested in the proximal map of the function ϕ, where β_i, 1 ≤ i ≤ N, are nonnegative real numbers and U_ad is the admissible set; U is endowed with the norm (Σ_{i=1}^N α_i ‖u_i‖²_{L²(I)})^{1/2}. This norm is equivalent to the standard norm in U and it is derived from a scalar product, hence U is a Hilbert space with respect to it. We write Π_{U_ad}: U → U for the projection onto U_ad, i.e., Π_{U_ad}(u) := max(a, min(u, b)) a.e. in I. Furthermore, let us introduce the soft-shrinkage operator σ: U → U, which is given componentwise for 1 ≤ i ≤ N and the constants α_i > 0 and β_i ≥ 0 by

σ_i(q) := (q_i − β_i/α_i)_+ + (q_i + β_i/α_i)_−,

where (r)_+ := max(0, r) and (r)_− := min(0, r) for all r ∈ R. The proximal map for γ = 1 can now be characterized as follows.

Lemma 3
The proximal map Prox_{ϕ1}: U → U_ad is given by Prox_{ϕ1} = Π_{U_ad} ∘ σ. Proof This can be established as in (Pieper 2015, Section 3.3) or through direct computation.
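In code, Lemma 3 says that the proximal map is componentwise soft-shrinkage followed by clipping. The following Python sketch (our own finite-dimensional illustration, with scalar α, β standing in for the α_i, β_i) makes this explicit:

```python
import numpy as np

def prox(q, alpha, beta, a, b):
    # Soft-shrinkage with threshold beta/alpha, then projection onto [a, b]
    # (Lemma 3 with gamma = 1, specialized to one control component).
    thr = beta / alpha
    sigma = np.maximum(q - thr, 0.0) + np.minimum(q + thr, 0.0)
    return np.clip(sigma, a, b)

q = np.array([-2.0, -0.1, 0.0, 0.3, 5.0])
# components: -2 -> -1 (shrink, then clip), -0.1 and 0 -> 0 (shrunk to zero),
# 0.3 -> 0.1 (shrunk), 5 -> 1 (clip)
print(prox(q, alpha=1.0, beta=0.2, a=-1.0, b=1.0))
```

For β = 0 the shrinkage reduces to the identity and only the projection remains, matching the remark after Lemma 4.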
It is important to note that Prox_{ϕ1} is semismooth.
Lemma 4 Prox_{ϕ1} is semismooth at every q ∈ Q when considered as a mapping from Q to U. If β_i = 0, then the corresponding terms have to be dropped in (6) and there holds σ_i(q) = q_i. In any case, (2) holds for each q ∈ Q and all M ∈ ∂Prox_{ϕ1}(q).
Proof Since Prox_{ϕ1} = Π_{U_ad} ∘ σ, respectively, Prox_{ϕ1} = Π_{U_ad}, is a superposition operator, the representation (6) can, for instance, be deduced from (Hinze et al. 2009, Theorem 2.13). The estimate for ‖M‖_{L(Q,U)} follows since for h ∈ Q with ‖h‖_Q ≤ 1 we have |(Mh)_i(t)| ≤ 1 for a.e. t ∈ I and all 1 ≤ i ≤ N.

Remark 5
We emphasize that the superposition operator Π_{U_ad} ∘ σ is not semismooth from U to U, cf. Ulbrich (2011) and Schiela (2008). Correspondingly, the role of the additional space Q in Assumption 1 is to ensure the necessary norm gap. In turn, the demand for ∇f to map U to Q, cf. 5) of Assumption 1, is a smoothing property.

The linear heat equation
Let N ∈ N, Q := C([0, T])^N, U := L²(I)^N, and Y := P := W(I; L²(Ω), H¹₀(Ω)), where W(I; L²(Ω), H¹₀(Ω)) is the usual solution space for weak solutions of the heat equation, cf., e.g., (Hinze et al. 2009, (1.53)) or (Chipot 2000, (11.12)). We consider the optimal tracking of the linear heat equation, posed in the space-time cylinder I × Ω with lateral boundary Σ := I × ∂Ω. Moreover, y_d ∈ L²(I × Ω_obs) is the desired state, Ω_obs ⊂ Ω is the observation domain, α_i > 0 are the control cost parameters per control function, β_i ≥ 0 influences the size of the support of u_i, y_0 ∈ L²(Ω) is the initial state, and g_i ∈ L²(Ω) are fixed spatial functions whose support is denoted by ω_i ⊂ Ω, 1 ≤ i ≤ N. For instance, g_i could be the characteristic function χ_{ω_i} of a given control domain ω_i ⊂ Ω. The set of admissible controls is given by

U_ad := {u ∈ U : a ≤ u ≤ b a.e. in I}   (7)

with functions a, b ∈ L^∞(I)^N that satisfy a ≤ b a.e. in I. By using the same norm on U as in Sect. 4.1 we can regard (OCP) as a special instance of (P) with γ = 1. From (Tröltzsch 2010, Theorem 3.13) we obtain that for every u ∈ U there exists a unique y = y(u) ∈ Y such that the PDE constraints in (OCP) are satisfied; the dependence u ↦ y(u) is linear and continuous from U to Y. This implies that the solution operator u ↦ y(u) is infinitely many times continuously differentiable. Since the control reduced version (P) of (OCP) is a convex problem with strongly convex objective, it is standard to show that it possesses a unique solution ū ∈ U_ad; the associated state is denoted by ȳ := y(ū) ∈ Y. We can now derive the following result.

Lemma 5 For the control reduced version of (OCP) the mapping H defined in (3) is for γ = 1 given by (8).
Proof Here, p = p(u) ∈ P is the adjoint state, i.e., the unique solution of the adjoint equation (9). Inserting this in H(q) = ∇f(Prox_{ϕ1}(q)) + q yields (8) due to Lemma 3. Moreover, the assumptions on the problem data imply p(u) ∈ P for the solution of the adjoint equation, cf. (Tröltzsch 2010, Lemma 3.17).
To show that H maps to Q we use the continuous embedding P ↪ C([0, T]; L²(Ω)), cf. (Chipot 2000, Theorem 11.4). With this embedding it follows that ∇f maps U to Q and hence that H maps to Q.

Remark 6
The proof of Lemma 5 demonstrates that we have to choose Q in such a way that t ↦ ∫_{ω_i} g_i(x) p(t, x) dx belongs to Q, where p solves (9). Thus, the available regularity of the adjoint state p, respectively, of the multiplier λ, restricts the choice of Q. If additional regularity is available, then Q may be chosen as a space of smoother functions than C([0, T])^N; see, for instance, (Hinze et al. 2009). We stress that Lemma 5 is valid for all these choices of Q. Assumption 1 holds unconditionally for (OCP).

Lemma 6
The control reduced version of (OCP) fulfills Assumption 1 with H : Q → Q given by (8). Moreover, H has a unique rootq ∈ Q.
Proof Assumption 1 holds: Conditions 1)-4) of Assumption 1 were already established, cf. the remarks above Lemma 5. In the proof of Lemma 5 we have demonstrated that ∇f maps U to Q. Since ∇f: U → Q is linear and continuous, it is infinitely many times continuously differentiable. This yields 5). Condition 6) follows from Lemma 4. To establish 7) we use Lemma 2. The representation in Lemma 4 shows that ∂Prox_{ϕ1}(q), q ∈ Q, can be extended in a canonical way to ∂Prox_{ϕ1}(u), u ∈ U. From the linearity of u ↦ y(u) we obtain that f(u) = (1/2)‖y(u) − y_d‖²_{L²(I×Ω_obs)} is convex, hence (5) is fulfilled. Also, we readily check that the first three properties listed in Lemma 2 are satisfied by the elements of ∂Prox_{ϕ1}(u), u ∈ U. Thus, 7) holds. H has a unique root: From Lemma 1 we infer by convexity that the unique solution ū of the control reduced version of (OCP) corresponds to a unique root q̄ of H.
We obtain the following convergence result for Algorithm 1, in which we write ∇²f for the constant Hessian. Note in parts 2) and 3) that (u_k) converges in various norms.

Theorem 2
1) Let (ȳ, ū) ∈ Y × U_ad be the solution of (OCP), let H be given by (8), and denote by q̄ ∈ Q the unique root of H. Moreover, let μ ∈ (0, 1). Then there exists ε > 0 such that for every admissible initial pair (q_0, B_0) Algorithm 1 either terminates after finitely many iterations or generates a sequence of iterates (q_k) that converges q-linearly with rate μ to q̄ in Q. If, in addition, σ_min, σ_max ∈ (0, 2) in Algorithm 1 and B_0 − ∇²f is compact as an operator in L(U, Q), then the convergence is q-superlinear. In particular, if (q_k) is generated by Algorithm 1 and converges q-linearly (q-superlinearly) to q̄ in Q, then (u_k), (y(u_k)), and (p(u_k)) converge r-linearly (r-superlinearly) in L^s(I)^N, respectively, Y and P, where s is as in 2). Moreover, (H(q_k)) then converges r-linearly (q-superlinearly) to zero in Q.
Proof of 1) The claim of 1) follows from Theorem 1, part 1), which can be applied since Assumption 1 is satisfied, cf. Lemma 6. Proof of 2) The feasibility of the u_k is valid since Prox_{ϕ1}(Q) ⊂ U_ad, proving the first error bound in 2). Moreover, this also implies (10) provided that q_k → q̄ in Q. This property of Π_{U_ad} ∘ σ is elementary to see, for instance by showing it separately for σ and Π_{U_ad}. The second error bound follows from the first for s = 2 by use of Hölder's inequality, where |I| is the Lebesgue measure of I. Since u ↦ y(u) and u ↦ p(u) are linear and continuous from U to Y = P, they are globally Lipschitz, too. This in combination with the estimate ‖u_k − ū‖_{L²(I)^N} ≤ ‖q_k − q̄‖_{L²(I)^N} and the continuous embedding Q ↪ L²(I)^N yields the third and fourth error bound.

Proof of 3)
The claims follow from the error estimates in 2) and, for (H(q_k)), from part 3) of Corollary 1.

Remark 8
It is not difficult to argue that Theorem 2 also holds for Q := L^ŝ(I)^N for any ŝ ∈ (2, ∞]. Of course, the L^s estimates of that theorem are then only true for s ∈ [1, ŝ]. The use of a weaker norm in Q relaxes the assumption on (u_0, B_0) and the compactness requirement in that theorem.
In Theorem 2 we have worked with Q = C([0, T])^N, P = W(I; L²(Ω), H¹₀(Ω)), and y_d ∈ L²(I × Ω_obs), and we have allowed Ω_obs = Ω. If Ω_obs = Ω and y_d is more regular, then we can exploit this in two ways. First, by letting Q and P be spaces of smoother functions the convergence results of Theorem 2, which contain the norms of Q and P, become stronger (but a better quality of the initial approximation (u_0, B_0) is required for the theorem to apply). Second, the operator ∇²f(ū) becomes compact. In view of Theorem 2, part 1), this suggests choosing a compact initial B_0 to achieve q-superlinear convergence, for instance B_0 = 0. In contrast, B_0 = εI for ε > 0 can only be expected to yield q-linear convergence, presumably the faster the smaller ε is. Indeed, the numerical results in Sect. 6 show that B_0 = 0 is most often superior to scaled identities and that the results for B_0 = εI improve as ε becomes smaller. The following two lemmas offer several choices for P and Q and provide the corresponding convergence results. For simplicity we consider constant bounds.

Proof for Q := H¹(I)^N The proof is completely analogous to the one of Theorem 2, except for the claim above (10) and the one below (10). That is, we have to establish that (u_k)_{k≥1}, {ū} ⊂ Q = H¹(I)^N and u_k → ū in Q provided that q_k → q̄ in Q. In fact, this follows since Prox_{ϕ1} = Π_{U_ad} ∘ σ satisfies Prox_{ϕ1}(Q) ⊂ Q and since Prox_{ϕ1}: Q → Q is continuous. Both properties can be proven separately for Π_{U_ad} and σ. For Π_{U_ad} these properties follow from (Kinderlehrer and Stampacchia 2000, Chapter II, Corollary A.5).

Lemma 8 Suppose that Ω_obs = Ω and y_d ∈ Y, and let a_i, b_i be constant for each 1 ≤ i ≤ N. Then all claims of Theorem 2 except (10) and the convergence assertion below it are true. Moreover, q_k → q̄ in Q implies u_k → ū. Proof The main task is to establish that Prox_{ϕ1} = Π_{U_ad} ∘ σ satisfies Prox_{ϕ1}(Q) ⊂ Q and that Prox_{ϕ1}: Q → C^{0,ŝ−τ}([0, T])^N is continuous at q̄ for any τ ∈ (0, ŝ).

The proof can be undertaken separately for Π_{U_ad} and σ. The property Prox_{ϕ1}(Q) ⊂ Q follows from (Appell and Zabrejko 1990, Theorem 7.1), and the continuity at q̄ is a consequence of (Appell and Zabrejko 1990, first part of Theorem 7.6). The compact-

The semilinear heat equation
We show that our method also applies to nonlinear state equations. To this end, we consider the same setting as in Sect. 4.2.1 but replace the linear heat equation in (OCP) by a semilinear variant. For ease of presentation we use the rather concrete semilinear state equation (sem), defined on the nonempty and bounded Lipschitz domain Ω ⊂ R^d, 1 ≤ d ≤ 3, and the time interval I. The set of admissible controls is still given by (7) with a, b ∈ L^∞(I)^N, a ≤ b a.e. in I. From (Casas et al. 2017, Proposition 2.1) we obtain that for every u ∈ U there exists a unique y = y(u) ∈ Y such that (sem) is satisfied. The solution operator u ↦ y(u) is C² from U to Y by the implicit function theorem, cf. also (Casas et al. 2017, Proposition 2.2).
Since the control reduced problem (P) no longer has a convex objective function, it is possible that (P) has multiple local and global minimizers; it is, however, standard to show that at least one global minimizer ū ∈ U_ad exists, for instance by following the arguments of (Tröltzsch 2010, Theorem 5.7). The associated state is ȳ := y(ū) ∈ Y. Similarly to Lemma 5 the following holds.
Lemma 9 For the control reduced version of (OCP) with (sem) the mapping H defined in (3) is for γ = 1 given by (8). Here, the adjoint state p = p(u) ∈ P uniquely solves a linear adjoint problem. Inserting this in H(q) = ∇f(Prox_{ϕ1}(q)) + q and using Lemma 3 establishes the formula for H. Moreover, the assumptions on the problem data imply p(u) ∈ P. Assumption 1 holds for the semilinear problem under the following condition.
Lemma 10 Assumption 1 holds for the control reduced version of (OCP) with state equation (sem) provided an additional condition on the problem data holds. Proof Conditions 1)-4) of Assumption 1 were already established, cf. the remarks above Lemma 9 and those above Lemma 5. In the proof of Lemma 9 we have demonstrated that ∇f maps U to Q. This yields 5). Condition 6) follows from Lemma 4. To establish 7) we use Lemma 2. Except for (5), the assumptions of Lemma 2 follow as in the proof of Lemma 6. A computation shows that (5) is fulfilled for ν := γ. Thus, 7) holds.
We obtain the following convergence result for Algorithm 1. Note in parts 2) and 3) that (u_k) converges in various norms.
Proof of 1) The claim of 1) follows from Theorem 1, part 1), which can be applied since Assumption 1 is satisfied, cf. Lemma 10.

Proof of 2)
The proof is almost identical to the one for part 2) of Theorem 2. The only necessary change is that u ↦ y(u) and u ↦ p(u) are no longer globally Lipschitz, but only locally Lipschitz. The existence of L_y and L_p thus follows from u_k → ū in U, the latter being a consequence of the assumption that q_k → q̄ in Q.
Proof of 3) The claims follow from the error estimates in 2) and, for (H(q_k)), from part 3) of Corollary 1.

Implementation
The hybrid framework is tested with three different quasi-Newton updates: Broyden, SR1, and BFGS. The methods are implemented matrix-free with limited-memory techniques, storing up to the last L updates (L is called the limit) as vectors. Consequently, the Newton systems are solved with iterative methods, specifically with GMRES or cg. The limited-memory BFGS method is implemented in the compact variant according to Byrd et al. (1994), see also (Nocedal and Wright 2006, (7.24)). Additionally, the quasi-Newton methods are compared to Newton's method itself, i.e., setting B_k = ∇²f(u_k) for all k and dropping lines 8-11 in Algorithm 1. Here, the matrix-free evaluation of ∇²f(u_k) in a direction is implemented via a forward-backward solve. We stress that when Newton's method is used, Algorithm 1 is a standard semismooth Newton method.
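The limited-memory storage can be sketched as follows (an illustrative Python class of our own, with B_0 = 0; the names are not from the paper's code): instead of a dense matrix, Broyden's update is kept as pairs of vectors, and B_k is applied matrix-free. As long as no pair has been discarded, this reproduces the dense Broyden matrix exactly; once the limit L is exceeded and old pairs are dropped, it is only an approximation.

```python
import numpy as np

class LimitedMemoryBroyden:
    """Matrix-free Broyden operator: stores up to `limit` update-vector pairs
    instead of a dense matrix (B0 = 0, matching the compact initializations)."""
    def __init__(self, n, limit=30):
        self.n, self.limit = n, limit
        self.S, self.U = [], []          # pairs s_i and u_i = (y_i - B_i s_i)/(s_i' s_i)

    def apply(self, v):
        w = np.zeros(self.n)             # B_k v = sum_i u_i (s_i' v) since B0 = 0
        for s, u in zip(self.S, self.U):
            w += u * (s @ v)
        return w

    def update(self, s, y):
        if s @ s < 1e-14:
            return
        u = (y - self.apply(s)) / (s @ s)
        if len(self.S) == self.limit:    # memory full: discard the oldest pair
            self.S.pop(0)
            self.U.pop(0)
        self.S.append(s)
        self.U.append(u)

# With enough memory the limited-memory operator reproduces dense Broyden exactly:
rng = np.random.default_rng(1)
n = 6
Bdense = np.zeros((n, n))
lm = LimitedMemoryBroyden(n, limit=10)
for _ in range(8):
    s, y = rng.standard_normal(n), rng.standard_normal(n)
    Bdense += np.outer(y - Bdense @ s, s) / (s @ s)
    lm.update(s, y)
v = rng.standard_normal(n)
assert np.allclose(lm.apply(v), Bdense @ v)
```

In the infinite-dimensional setting the vectors s_i, u_i are discretized control-space functions, and `apply` is exactly the operation handed to the iterative linear solver.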
The methods are applied with three different globalization techniques. First, a standard backtracking line search on the residual norm ‖H‖_U is used together with GMRES (ls-GMRES). The line search selects the smallest integer 0 ≤ j ≤ 18 with ‖H(q_k + 0.5^j s_k)‖_U < ‖H(q_k)‖_U, and j = 19 otherwise. GMRES from Matlab is used with a tolerance of 10⁻¹⁰ and a maximum number of 50 iterations to solve the full Newton system. Second, a non-monotone line search (nls-GMRES) with N_ls ∈ N steps is used, where the step size ρ_j = 0.5^j, 0 ≤ j ≤ 18, is accepted as soon as R(q_k + ρ_j s_k) < max{R(q_k), R(q_{k−1}), . . ., R(q_{k−N_ls+1})} holds, and j = 19 otherwise. Therein, R will be the residual norm ‖H(·)‖_U or the objective f(·) + ϕ(·) of (P). Third, a trust-region method is investigated based on Steihaug-cg (tr-cg), cf. Steihaug (1983). The precise algorithm is included in "Appendix A". It is started with a radius Δ₀ = 0.1 and stopped with a relative tolerance of 10⁻⁵. The parameters are σ₁ = 0.05, σ₂ = 0.25, σ₃ = 0.7, radius factors f₁ = 0.4, f₂ = 2, f₃ = 0.6, and a maximum radius Δ_max = 2. Up to 300 iterations are allowed. The update of the quasi-Newton matrix is also carried out at rejected steps. Additionally, in case of BFGS, the update is only applied if the curvature condition (y_k, s_k^u)_U > 0 holds. To be able to use Steihaug-cg, the linear system is first reduced to a symmetric one by restricting to the Hilbert space induced by the inner product (·, M_k ·)_U with M_k ∈ ∂Prox_{ϕγ}(q_k), cf. Lemma 2 and (Pieper 2015, Def. 3.4). Then a correction step gives the full update, cf. (Pieper 2015, (3.25)). The cg method is limited to 100 iterations and stopped with a relative tolerance of 10⁻¹⁰. We use small tolerances to suppress the influence of inexact linear system solves.
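The acceptance rule of the non-monotone line search can be sketched in a few lines (an illustrative stand-alone Python helper of our own; R is the merit function and history the list of its previously recorded values):

```python
def nonmonotone_backtracking(R, q, s, history, N_ls=5, max_j=18):
    """Return the step size rho = 0.5**j for the smallest j in 0..max_j with
    R(q + rho*s) < max of the last N_ls recorded merit values; if no such j
    exists, return the next smaller step size (mirroring 'j = 19 otherwise')."""
    ref = max(history[-N_ls:])
    for j in range(max_j + 1):
        rho = 0.5 ** j
        if R(q + rho * s) < ref:
            return rho
    return 0.5 ** (max_j + 1)
```

With N_ls = 1 this reduces to the monotone backtracking of ls-GMRES.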

Numerical experiments
The numerical experiments below show the application of the hybrid method to optimal control problems. The first example is the heat equation, the second the bilinear control of the Bloch equation in magnetic resonance imaging. As nonsmooth problem parts we deal with pointwise box constraints on the controls and sparsity promoting objectives. All computations are carried out using Matlab 2017a on a workstation with two Intel Xeon X5675 CPUs (24 GB RAM, twelve cores at 3.06 GHz). All time measurements are performed on a single CPU without multi-threading.

Time-dependent tracking of the heat equation
The first example problem is sparse control of the linear heat equation with Ω_obs = Ω = (−1, 1)² and time domain I = (0, 1). Here, g is the characteristic function of the right half ω = (0, 1) × (−1, 1), and y₀ ≡ 0. The example is a special case of (OCP) with N = 1, however using Neumann instead of Dirichlet boundary conditions. We emphasize that the results of Sect. 4.2 can also be developed for these boundary conditions. In view of Lemmas 7 and 8 we expect convergence in rather strong norms. Specifically, we are interested in q-superlinear convergence of (q_k) in H¹(I) and in C^{0,s}([0, T]) for all s ∈ (0, 1), convergence of (u_k) in H¹(I) and in C^{0,s−τ}([0, T]) for all τ ∈ (0, s), and r-superlinear convergence of (u_k) in L²(I) and L^∞(I). Furthermore, we should be able to observe the error estimates ‖u_k − ū‖_{L^s(I)} ≤ ‖q_k − q̄‖_{L^s(I)} for s ∈ {2, ∞}, cf. part 2) of Theorem 2. We will investigate these properties numerically.
We use an unstructured triangular mesh with 725 P1 elements generated by Matlab's initmesh with Hmax=0.1, see Fig. 1. As time-stepping scheme the CG(1) method is chosen (corresponding to the Crank-Nicolson method) with 101 equidistant time points and a piecewise constant discretization of q and u. We initialize the algorithm with u₀ = 0, corresponding to q₀ = 0. Since the problem is strongly convex, the solution is possible based on the simple globalization ls-GMRES (for β = 0 and sufficiently large α we expect, based on a global convergence result in Kunisch and Rösch (2002, Theorem 3), that it would even be possible to drop the line search and use only full steps). The optimal controls for different α, β are depicted in Fig. 1. We will use varying parameters α, β, with α = 0.01 as default value. For the limit we use L = 30. The proximal mapping is Π_{U_ad}(σ(q)) from Lemma 3. In the algorithm we select M_k with (M_k h)(t) = h(t) where |q_k(t)| ≥ β/α and a < σ(q_k)(t) < b, and (M_k h)(t) = 0 otherwise. In (6) this corresponds to the choice r(t) = 1 for all t satisfying |q_k(t)| = β/α and a < σ(q_k)(t) < b, and r(t) = 0 for all remaining t ∈ I.
The first study shows the performance of the optimization methods for different initializations of B₀. Table 1 shows the results for β = 0. The first two columns show α and the resulting number of active points |A| (optimal control on upper or lower bound) out of 101. The other columns depict the iteration numbers that are needed to reach the relative tolerance 10⁻⁸, separately for semismooth Newton (SN) and the three hybrid methods. Here, four different initializations are applied, including the scaled identity αI, the zero matrix 0 (except for BFGS, where this produces a vanishing denominator), and the two formulas B₀ = (yᵀs)/(sᵀs) I, cf. (Byrd et al. 1994, Eq. (3.23)), and B₀ = (yᵀy)/(yᵀs) I, cf. (Nocedal and Wright 2006, Eq. (7.20)). We note that both formulas are implemented in the first step with B₀ = 0 for Broyden/SR1 and B₀ = αI for BFGS. A good performance for any α and for all three methods can be obtained by choosing B₀ = αI. However, Broyden and SR1 show faster convergence with the zero initialization, and BFGS needs on average fewer iterations with B₀ = (yᵀs)/(sᵀs) I; these are the default initializations for all other studies below. We stress that from an infinite-dimensional point of view the choice B₀ = 0 results in B₀ − ∇²f(ū) ∈ L(U, Q) being compact, cf. Lemma 7 and Lemma 8, whereas this is not the case for the scaled identities. Therefore, BFGS may be inherently slower than Broyden and SR1 in the following experiments, cf. also the reasoning provided above Lemma 7. This will indeed turn out to be true. The results for β > 0 are depicted in Table 2. Here, |O| collects the number of time points with zero control out of 101. We note that the inequality constraints are in general inactive (|A| = 0) in the optimum if α ≥ 1. On the other hand, smaller values α ≤ 0.01 result in only one or two inactive points (the number of inactive points is 101 − |A| − |O|).
Increasing the parameter β from 0.1 to 1 increases the sparsity from around 20% to 80%. For β ≥ 2 the optimal solution is zero. The last four columns show the iteration counts of the four methods with default initializations. All methods converge quickly for all values of α and β. They tend to require fewer iterations for larger β, which corresponds to more degrees of freedom being fixed to zero. The desired tolerance is reached after at most 8 iterations for Broyden (Br) and 7 iterations for SR1. BFGS needs up to 17 iterations.
The iteration counts of the hybrid methods for different discretizations are depicted in the upper part of Table 3 for α = 0.01, β = 0, using N_c + 1 equidistant time points and three different meshes from initmesh with N_x nodes. The results show mesh independence for all three quasi-Newton methods with respect to both the spatial and the temporal discretization. The lower part of the table shows the corresponding runtimes in seconds. All values are averages of five runs. Broyden and SR1 show nearly identical runtimes in this example and are twice as fast as BFGS. All three methods outperform the semismooth Newton method in runtime. For all time and space discretizations a speedup factor of five to six is observed for Broyden and SR1, and of three to four for BFGS. The next study analyzes the superlinear convergence properties numerically. For comparison, the optimal solution q̄ is first computed in high precision with the semismooth Newton method and GMRES, using fine relative tolerances of 10^{-14} for both. Then the indicators of superlinear convergence are computed for each method for the norms Z = L^2(I), L^∞(I) and the seminorms Z = H^1(I), C^{0,1}([0, T]). For q-superlinear convergence these indicators should converge to zero in the last steps of an optimization run. Table 4 depicts the indicators for the last four iterations. The results are obtained for α = 0.1, β = 0.2, and a relative tolerance of 10^{-8}. We observe that the semismooth Newton method converges in one step as soon as the active set has converged. Broyden and SR1 show fast superlinear convergence with final indicators between 7 × 10^{-3} and 6 × 10^{-4}, both for the control u and the optimization variable q. For BFGS the indicators are slightly larger, but they also decrease towards the end. If α is further reduced to 0.01, we observe one-step convergence for all three limited-memory methods, too, which can be explained by the fact that all but one time point are then either active or zero.
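The indicators can be computed from the iterates and the precomputed reference solution. The following sketch assumes the standard definition, the ratio of consecutive errors ||z_{k+1} − z̄||_Z / ||z_k − z̄||_Z in a chosen (semi-)norm Z, which is consistent with the requirement that the indicators tend to zero under q-superlinear convergence.

```python
def superlinear_indicators(iterates, zbar, norm):
    """Ratios ||z_{k+1} - zbar|| / ||z_k - zbar|| for a chosen (semi-)norm;
    q-superlinear convergence corresponds to these ratios tending to zero.
    (Assumes this standard definition of the indicator.)"""
    errs = [norm([zi - zbi for zi, zbi in zip(z, zbar)]) for z in iterates]
    return [errs[k + 1] / errs[k] for k in range(len(errs) - 1) if errs[k] > 0]
```

In practice the norm argument would be a discrete approximation of the L^2(I)-, L^∞(I)-, H^1(I)-, or C^{0,1}-(semi-)norm.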
Next we consider the convergence of u and q in different norms. Table 5 displays the following errors for the last four steps of each optimization method: e^k_{u,L^2} := ||u^k − ū||_{L^2(I)}, e^k_{u,H^1} := |u^k − ū|_{H^1(I)}, e^k_{u,L^∞} := ||u^k − ū||_{L^∞(I)}, with analogous definitions e^k_{q,L^2}, e^k_{q,H^1}, e^k_{q,L^∞} for q. The semismooth Newton method exhibits one-step convergence as soon as the active sets have converged. The other methods show a quick decrease of all values towards the relative tolerance during the last four steps of the optimization run. (Table 4 caption: Indicators of superlinear convergence; each group of four rows shows these indicators for different (semi-)norms.) In particular, SR1 yields the fastest reduction, while BFGS shows a significantly slower convergence here. As predicted by Theorem 2, part (2), we have e^k_{u,L^2} ≤ e^k_{q,L^2} and e^k_{u,L^∞} ≤ e^k_{q,L^∞} (since these inequalities are a consequence of the fact that Prox_{ϕ_1} : L^s → L^s is nonexpansive for any s ∈ [1, ∞], they hold for all four methods under consideration). We observe that, in contrast, e^k_{u,H^1} ≤ e^k_{q,H^1} does not hold in general. However, the results indicate that e^k_{u,H^1} still tends to zero, which agrees with part (2) of Theorem 2 for the choice Q = H^1(I) described in Lemma 7.

Sparse control of the Bloch equation
As example for a nonconvex optimization problem we investigate the bilinear control of the Bloch equations in magnetic resonance imaging (without relaxation, in the rotating frame, and on-resonance). A realistic optimal control model for radiofrequency (RF) pulse design in slice-selective imaging is considered, based on Rund et al. (2018a). However, we add sparsity to the control model, which is a desirable feature in practice since the duty cycle of the RF amplifier is often limited. For details on magnetic resonance imaging we refer to Bernstein et al. (2004). As model problem we consider slice-selective imaging with a single slice, where imaging data of a whole slice is to be acquired. The spatial field of view is described by its extent Ω ⊂ R perpendicular to the slice direction. The slice itself is described by Ω_in ⊂ Ω, while the remaining part of Ω is denoted by Ω_out = Ω \ Ω_in. The latter should not contribute to the data acquisition. The magnetization M = M(t, x) is governed by the Bloch equation (14) with the effective field B(t, x) = (u(t), v(t), w(t)x)^T and thus depends on the RF pulse (u, v) ∈ L^2(I)^2 and the slice-selective gradient amplitude w = w(t) ∈ L^2(I). While these three time-dependent functions can often be controlled, we consider for simplicity the situation in which w ≡ 2 is given and the RF pulses are real-valued, i.e., v ≡ 0. Hence, u is the control variable. Technical limitations of the RF amplifier are modeled as control constraints U_ad = {u ∈ L^2(I) : |u| ≤ u_max} with u_max = 1.2 [10^2 μT]. This value reflects typical 3T magnetic resonance scanner hardware.
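For reference, the relaxation-free Bloch dynamics can be sketched as the cross-product ODE dM/dt = γ M × B with B(t, x) = (u(t), v(t), w(t)x)^T; the sign convention and the unit gyromagnetic factor below are assumptions of this sketch, not taken from the paper.

```python
def bloch_rhs(M, u, v, w, x, gamma=1.0):
    """Right-hand side of the relaxation-free Bloch equation in the rotating
    frame: dM/dt = gamma * M x B with effective field B = (u, v, w*x).
    (Cross-product sign convention is an assumption of this sketch.)"""
    bx, by, bz = u, v, w * x
    return [gamma * (M[1] * bz - M[2] * by),
            gamma * (M[2] * bx - M[0] * bz),
            gamma * (M[0] * by - M[1] * bx)]
```

An equilibrium magnetization M = (0, 0, 1)^T exposed to a purely transverse field starts to precess, i.e., the right-hand side is transverse as well.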
The specific example here is the optimization of a refocusing pulse, which is, among others, a central building block of the clinically important turbo spin echo based sequences. The initial condition results from assuming that an ideal 90°-excitation pulse for the same slice has been applied before, keeping the net magnetization vectors outside of the slice in the steady state (0, 0, 1)^T while exciting the slice itself. In particular, we set M_0 = χ_{Ω_out}(x)(0, 0, 1)^T + χ_{Ω_in}(x)(0, 1, 0)^T. A slice of 1.65 [cm] thickness is assumed: Ω_in = [−0.00825, 0.00825]. The aim of the refocusing is to flip the magnetization vectors in the x-y-plane in the interior Ω_in of the slice, which is modeled as a rotation around the x-axis with angle π. The desired magnetization pattern at the end time t = T of the refocusing pulse is obtained by applying this rotation, recalling that M = M(u). However, this tracking term for basic refocusing pulses is typically not used in numerical practice. Instead, we apply a more involved formulation of the desired state at the terminal time for advanced refocusing pulses, which we describe now. For practical reasons including robustness issues, refocusing pulses are generally applied within crusher gradients, cf. Bernstein et al. (2004), which are additional sequence elements surrounding the RF pulse. These crusher gradients cannot be modeled by the tracking term f(u) depicted above. It seems that the only practical way to model tracking terms for refocusing pulses with ideal crusher gradients is to define them in the spin domain, cf. Bernstein et al. (2004), Rund et al. (2018a). Therefore, we choose an equidistant time grid t_k = (k − 1)τ, k = 1, ..., N_t with N_t = 270 points and step size τ = T/(N_t − 1) = 0.01 [ms], together with piecewise constant w and control u with values w_m, u_m, m = 1, ..., N_t − 1. This implies that the magnetic field B is piecewise constant.
It is well known that for a piecewise constant magnetic field the Bloch equations (14) in a spatial point x_0 can be solved analytically as a sequence of rotations. This is expressed via the Cayley-Klein parameters a_m, b_m ∈ C, m = 1, ..., N_t − 1, which evolve by one rotation per time interval with initial conditions a_0 = 1, b_0 = 0, cf. Pauly et al. (1991). For the formula relating a_m, b_m and M(t_{m−1}, x_0) see (Bernstein et al. 2004, Eq. (2.15)). The coefficients α_m, β_m of these rotations are determined by the piecewise constant field values. Since it is well known that perfect refocusing with ideal crusher gradients is obtained through |b(T, x)|^2 = χ_{Ω_in}(x) for a.e. x ∈ Ω, the tracking term f is defined accordingly in (15). Note that b = b(u). In the numerical experiments we use f as defined in (15). The adjoint equation and the reduced gradient for this formulation are derived in the appendix of Rund et al. (2018a). The spatial domain is discretized equidistantly with N_x = 481 points.
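The hard-pulse propagation of the Cayley-Klein parameters can be sketched as follows, treating each interval as one rotation in the spin domain in the spirit of Pauly et al. (1991); the specific sign and unit conventions (and the function name) are assumptions of this sketch and may differ from the formulation used in the paper.

```python
import math

def cayley_klein(u, v, w, x, tau, gamma=1.0):
    """Propagate (a, b) through the piecewise constant field B_m =
    (u_m, v_m, w_m * x): one rotation per time interval (hard-pulse
    approximation). Conventions are assumptions of this sketch."""
    a, b = 1.0 + 0.0j, 0.0 + 0.0j              # initial conditions a_0 = 1, b_0 = 0
    for um, vm, wm in zip(u, v, w):
        bx, by, bz = um, vm, wm * x
        nrm = math.sqrt(bx * bx + by * by + bz * bz)
        if nrm == 0.0:
            continue                            # zero field: no rotation
        nx, ny, nz = bx / nrm, by / nrm, bz / nrm
        phi = -gamma * tau * nrm                # rotation angle of this interval
        al = math.cos(phi / 2) - 1j * nz * math.sin(phi / 2)
        be = -1j * (nx + 1j * ny) * math.sin(phi / 2)
        a, b = al * a - be.conjugate() * b, be * a + al.conjugate() * b
    return a, b
```

A single on-resonance interval realizing a flip angle of π about the x-axis yields |b|^2 = 1, which matches the refocusing condition |b(T, x)|^2 = 1 inside the slice; the invariant |a|^2 + |b|^2 = 1 holds throughout.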
In accordance with the presentation in Sect. 4 we use a control-reduced problem formulation for (13)-(14). For the problem at hand this has the advantage that we iterate only on the small control vector with N_t − 1 = 269 entries, but not on the large state vector with 3 N_x N_t = 389,610 entries. Consequently, B_k is a rather small matrix of format 269 × 269, and the linear algebra operations for its update and evaluation are cheap. In effect, the numerical effort is clearly dominated by the state and adjoint solves; concrete values are reported below.
We consider the same four optimization methods as in Sect. 6.1, i.e., a semismooth Newton method and the hybrid method with the Broyden, SR1 and BFGS update, respectively. If not mentioned otherwise, these methods are globalized with tr-cg. The stopping criterion is a relative tolerance of 10^{-5} for the residual norm ||H||_{L^2(I)}. Unless declared otherwise, the following settings are applied for the hybrid methods: a limit of L = 75, B_0 = 0 for the Broyden and SR1 methods, and B_0 = α I for BFGS. These initializations are selected because they turned out to be the most effective for the respective methods on this problem. Note that for Broyden and SR1 this is consistent with the numerical results for the optimal control of the heat equation in Sect. 6.1, cf. Table 1. Also, let us stress one more time that B_0 = 0 can be expected to yield a better performance than scaled identities, so we anticipate that Broyden and SR1 may be somewhat superior to BFGS in the following experiments, where B_0 = 0, respectively, B_0 = α I are used. This turns out to be true.

Comparison of the optimization methods
The optimization is initialized with a sinc-shaped RF pulse f_0(t) = 1.8 · sinc(−2.2 + 4t/T). To maintain a good initial slice profile we use q_0 = f_0 + sign(f_0) β/α, which implies f_0 = σ(q_0) and u_0 = P_{U_ad}(σ(q_0)) = P_{U_ad}(f_0). The initialization, the corresponding slice profile M_3(T, x), and the desired slice profile are depicted at the top of Fig. 2. Also depicted are optimal controls for different α, β. The sparsity and bound properties of the solutions for different α, β are shown in Table 6. If not mentioned otherwise, we use α = 5 × 10^{-4} and β = 10^{-4} below. In all experiments we monitor that only runs leading to the same local minimizer are compared, which is important since the problem possesses several different minimizers.
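The initialization can be sketched as follows, assuming Matlab's normalized sinc, sinc(z) = sin(πz)/(πz); the function names are ours.

```python
import math

def sinc(z):
    # normalized sinc as in Matlab: sin(pi z) / (pi z), with sinc(0) = 1
    return 1.0 if z == 0.0 else math.sin(math.pi * z) / (math.pi * z)

def initial_q(t, T, alpha, beta):
    """q_0 = f_0 + sign(f_0) * beta/alpha for f_0(t) = 1.8 * sinc(-2.2 + 4t/T),
    so that the soft shrinkage with threshold beta/alpha gives sigma(q_0) = f_0."""
    sgn = lambda z: (z > 0) - (z < 0)
    f0 = [1.8 * sinc(-2.2 + 4.0 * ti / T) for ti in t]
    return [fi + sgn(fi) * beta / alpha for fi in f0]
```

By construction the shift by sign(f_0) β/α is exactly undone by the soft shrinkage, so the first evaluated control coincides with the sinc pulse f_0 clipped to U_ad.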
The first study compares the performance of four different semismooth Newton-type methods embedded in a trust-region cg framework for varying α, β and for different initializations of the quasi-Newton matrix B_0. First, the limited-memory methods are analyzed in Table 7 and compared to the semismooth Newton method. The first three columns display the parameters α, β and the iteration number of SN. The up to four columns per method show the iteration counts for different B_0; the last row shows the mean value per column, taken over the converged runs only, and runs that do not converge are marked with −. The next two column groups show that the hybrid method with Broyden updates behaves quite similarly to the variant with SR1 updates, the latter often requiring slightly fewer iterations. The choices B_0 = 0 and B_0 = α I yield fast convergence throughout all (α, β)-pairings, while the other two formulas turn out to be less efficient in this setting. (Table 6 caption: Number of time instances with sparsity (|O|), respectively, active box constraint (|A|) out of 269 for different α (rows) and β (five columns each).) Looking at the hybrid method with BFGS in the last column group, we observe that its performance degenerates for large values of α. In view of the comment before Lemma 7 this is not entirely unexpected, but the extent of this behavior still appears somewhat surprising. Apart from this phenomenon, the scaled identity also yields good results for BFGS. Using the best choice of B_0 for each method, Broyden requires 41 iterations on average, SR1 35, and BFGS 83, compared to 16 for the semismooth Newton method. Since the latter has much more costly iterations due to the forward-backward solve of the second-order equations, it is important to compare the corresponding runtimes. They are included below, cf. Table 9.
To address the choice of the limit parameter L we compare the performance of the hybrid methods for different limits in Table 8 based on the iteration counts. Depicted are four columns per method, which differ in the choice of the limit, ranging from L = 25 to L = 100. The rows show results for different α while keeping β = 10^{-4} fixed. We stress that in combination with tr-cg it is appropriate to choose a limit L that is larger than typical values from the literature for globalization by line search methods. This is due to the fact that Steihaug-cg terminates the cg iteration early, leading to smaller but more steps. We observe that a limit of 25 is only adequate for Broyden and SR1 in the case of large α ≥ 5 × 10^{-4}. For smaller α the limit should be increased to 50. In contrast, the performance of BFGS is less sensitive to the investigated values of L. (Table 8 caption: The symbol − indicates that the relative tolerance is not met within 300 iterations; the last line depicts the mean iteration counts per column for the converged runs.) In the particular setting of optimal control problems with many state variables and few control variables, which is typical in practical applications of PDE-constrained optimization and of Bloch models, it pays off in runtime to use larger limits, since this saves some iterations of the trust-region method while increasing the required time per trust-region iteration only marginally. This is due to the fact that the costs per trust-region iteration are largely dominated by the evaluation of objective and gradient. For example, the runtime of Broyden with α = 5 × 10^{-5} for a limit of 75 (56 iterations) is 5.1 seconds, which is significantly lower than the 6.4 seconds needed with a limit of 50 (73 iterations). In both cases, around 90% of the runtime is spent on the evaluation of objective and gradient.
Therefore, a limit of L = 75 is chosen for all subsequent studies. We emphasize that all runs converge to the same local minimizer independently of the limit parameter.
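A matrix-free limited-memory Broyden operator with such a limit can be sketched as follows. Both the restart-on-limit strategy and the rank-one storage scheme (store s_i together with v_i = (y_i − B_i s_i)/(s_i^T s_i), so that B_k h = B_0 h + Σ_i v_i (s_i^T h)) are plausible realizations under our assumptions, not necessarily the exact implementation used in the paper.

```python
class LimitedMemoryBroyden:
    """Matrix-free Broyden operator B_{k+1} = B_k + (y - B_k s) s^T / (s^T s)
    with B_0 = c0 * I, storing at most L update pairs. When the limit is
    reached the memory is cleared (restart from B_0); this restart strategy
    is an assumption of this sketch."""
    def __init__(self, c0, L):
        self.c0, self.L, self.pairs = c0, L, []

    def apply(self, h):
        """Compute B_k h = B_0 h + sum_i v_i (s_i^T h)."""
        out = [self.c0 * hi for hi in h]
        for s, v in self.pairs:
            sth = sum(si * hi for si, hi in zip(s, h))
            out = [oi + vi * sth for oi, vi in zip(out, v)]
        return out

    def update(self, s, y):
        """Incorporate a new pair (s, y); restart once L pairs are stored."""
        if len(self.pairs) >= self.L:
            self.pairs = []                 # discard memory, restart from B_0
        bs = self.apply(s)
        sts = sum(si * si for si in s)
        v = [(yi - bi) / sts for yi, bi in zip(y, bs)]
        self.pairs.append((s, v))
```

After each update the secant condition B s = y holds by construction, and both apply and update cost O(L n) operations for n unknowns, in line with the cheap linear algebra noted above.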
To investigate mesh independence properties of the hybrid methods we perform runs with different temporal and spatial mesh sizes. The results are shown in Table 9, using iteration counts in the left table and runtimes (mean of five runs, in seconds) in the right table. The rows depict results for different temporal refinements with N_c = N_t − 1 control points. The two columns per method show different spatial grids with N_x = 481 and N_x = 4811 points. The finest example, with N_c = 17216 and N_x = 4811, features 248 million degrees of freedom for the state variable. In all cases the same initial guess is used. Furthermore, the same local minimizer is attained in all runs, which allows for a direct comparison. The left table shows that the iteration counts of all four methods do not increase with N_c or N_x. In particular, SN, Broyden and SR1 exhibit nearly the same iteration count for all of the discretizations. The hybrid method with BFGS displays a constant iteration number per column, with a reduced iteration count for the right column (larger N_x). Interestingly, its performance is rather similar to that of Broyden and SR1 for N_x = 4811, while for N_x = 481 it requires roughly twice as many iterations and twice as much runtime as Broyden and SR1.
The right table shows the mean values of five runtimes in seconds, measured for a single CPU without parallelization. Despite their higher iteration counts, the three hybrid methods have much smaller runtimes than the semismooth Newton method. The bottom line depicts the mean value per column of the runtime in microseconds divided by N_x N_c; this quantity varies only slightly per column (and per row, although not displayed), and we regard it as an efficiency index, denoted by E. In contrast, the runtime of the semismooth Newton method increases faster than linearly in N_c. We attribute this to the fact that the cg method requires more iterations for larger systems and that these iterations involve expensive operations for SN. Therefore, the speedup factor of the Broyden variant of the hybrid method over the semismooth Newton method rises with the number of time instances, starting at 7 and reaching 70 for N_c = 17216 and N_x = 481. Using SR1 updates leads to similar runtimes with a speedup of up to 68. As already seen in the left table, the use of BFGS updates produces higher iteration counts, resulting in an increased runtime. Still, a speedup factor of up to 37 over the semismooth Newton method is reached.
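The efficiency index is simply the runtime per space-time unknown; a one-line restatement (function name ours):

```python
def efficiency_index(runtime_seconds, nx, nc):
    """Efficiency index E: mean runtime in microseconds divided by N_x * N_c."""
    return 1e6 * runtime_seconds / (nx * nc)
```

A nearly constant E across discretizations indicates runtime scaling linearly in the total number of space-time unknowns.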
Let us take a closer look at the convergence properties of the different quasi-Newton updates in the trust-region method. To this end, the solution is first computed in high precision with the semismooth Newton method and a relative tolerance of 10^{-13}. Then the different optimization runs are performed with a relative tolerance of 10^{-7}, measured in ||H(·)||_{L^2(I)}. The results are displayed in Table 10, with the error norms defined as in the first example, see (12). The table shows that all methods reduce all six errors to approximately the size of the relative tolerance, but the semismooth Newton method needs far fewer iterations to achieve this, cf. Table 9. We attribute this to the strong nonconvexity of the bilinear problem at hand. As in the first example we have e^k_{u,L^2} ≤ e^k_{q,L^2} and e^k_{u,L^∞} ≤ e^k_{q,L^∞}, but this relationship is not satisfied for the H^1(I)-norm, which, however, does not impede e^k_{u,H^1} → 0.

Comparison of the globalization techniques
This section is devoted to comparing the three globalizations tr-cg, ls-GMRES and nls-GMRES on the bilinear control problem of the Bloch equations. The globalizations are paired with the semismooth Newton method and the three limited-memory quasi-Newton methods. For each of these twelve combinations, 2000 optimization runs from a random initial guess q_0 (Matlab rand) are performed with the following parameters: N_c = 269, N_x = 481, α = 5 × 10^{-4}, β = 10^{-4}, up to 300 iterations, relative tolerance 10^{-4}, cg/GMRES tolerance 10^{-10}, up to 100 cg/GMRES iterations. The monotone line search operates on the residual ||H(q)||_{L^2(I)}, while the non-monotone line search is tested with N_ls = 2, 3, 4, 5, based both on the objective f(u) + ϕ(u) and on the residual ||H(q)||_{L^2(I)}. Due to space limitations we show only the best results, which are obtained with N_ls = 2 and R(u) = f(u) + ϕ(u). Throughout the 8000 optimization runs with tr-cg, twelve different stationary points are observed, whose controls are depicted in Fig. 3, divided into three sets (top row). Every set of four controls yields identical optimal values m̄ := f(ū) + ϕ(ū), control norms ||ū||_U, and final magnetization (bottom row). The four controls are related by axial symmetry with respect to the t-axis and the axis t = T/2. Since these four controls are equivalent in practical application, they are henceforth counted as one solution with multiplicity four. The relative occurrences of the resulting three candidates for a minimizer are depicted in the upper three rows of Table 11 in columns five to eight. The lower part of the table additionally displays the relative occurrences of runs that do not reach the prescribed relative tolerance within 300 iterations; they are labeled "not converged". The average objective value ∅m̄ returned by the optimizer and the average runtime in seconds are also shown.
In contrast to previous experiments, these averages are taken over all runs here, i.e., they also include the "not converged" runs.
In particular, we observe that SN, Br and SR always meet the relative tolerance. In contrast, one seventh of the BFGS runs do not converge. In fact, these runs find the same twelve minimizers but fail to reach the prescribed tolerance, which is underlined by the agreeing values of ∅m̄. This mean objective value is ∅m̄ = 0.64 × 10^{-3} for all four methods with tr-cg, which is significantly smaller than the values achieved with the other globalization techniques. The mean runtime shows a clear speedup of the hybrid methods compared to SN, despite the fact that this is a small-scale example with N_t = 270 and N_x = 481. In contrast to the trust-region method, the line search globalizations find many more stationary points, 78 in total. However, 49 of these have a prohibitively high cost; they are summarized in rows 10 and 11 of Table 11. The results with the monotone line search (ls-GMRES) are depicted in the fourth column group of Table 11. The semismooth Newton method with a basic monotone line search on the residual ||H(q)||_{L^2(I)} quickly converges to a noncompetitive minimizer in nearly all cases. The three quasi-Newton methods yield smaller cost values in the mean, but most of the runs fail to reach the prescribed relative tolerance. The mean optimal values of all four methods are much larger than those obtained with tr-cg.
The non-monotone line search nls-GMRES is more effective than ls-GMRES for all three quasi-Newton methods, see the last column group of Table 11. We note that the number of runs that do not converge is smaller than for the monotone line search. In particular, Broyden and BFGS converge in most of the cases. Moreover, the best control and the top three controls are found more often, leading to much better average optimal values compared to ls-GMRES. However, excellent values similar to those of tr-cg are attained only for BFGS. Notably, the semismooth Newton method does not benefit from the non-monotone line search; it behaves similarly as with ls-GMRES. (Table 11 caption: Quality of the solutions for different optimization and globalization methods; 2000 optimization runs from random initializations are performed for each combination. Depicted are the multiplicity of the minimizer, the optimal value m̄, and the norm of the optimal control. The next three column groups show the relative occurrence (%) of the respective solution, separately for tr-cg, ls-GMRES and nls-GMRES. Each column group is divided into the four methods semismooth Newton (SN), Broyden (Br), SR1 (SR), and BFGS (BF). The three bottom lines depict the percentage of runs that did not converge, the average objective value ∅m̄, and the average runtime, both averaged over all runs.) It is also worth mentioning that the average runtimes of the tr-cg hybrid quasi-Newton methods are only slightly above those of the nls-GMRES variants. Summarizing, in this application problem the tr-cg globalization robustly delivers the top three candidates for all four optimization methods. In contrast, the line search globalizations often have difficulties with convergence for the quasi-Newton methods, and tend to noncompetitive solutions for the semismooth Newton method.
Thus, for optimal control of the Bloch equations, tr-cg should clearly be preferred over a line search globalization for the semismooth Newton, the hybrid Broyden, and the hybrid SR1 method. For BFGS, both tr-cg and nls-GMRES work equally well.

Conclusions
In this paper we have studied a hybrid approach for nonsmooth optimal control problems that blends semismooth Newton and quasi-Newton methods. We established its local superlinear convergence and provided numerical results showing that it has significantly lower runtime than semismooth Newton methods. A matrix-free limited-memory truncated trust-region variant seems particularly promising.