A path-following inexact Newton method for PDE-constrained optimal control in BV (extended version)

We study a PDE-constrained optimal control problem that involves functions of bounded variation as controls and includes the TV seminorm of the control in the objective. We apply a path-following inexact Newton method to the problems that arise from smoothing the TV seminorm and adding an $H^1$ regularization. We prove in an infinite-dimensional setting that, first, the solutions of these auxiliary problems converge to the solution of the original problem and, second, an inexact Newton method enjoys fast local convergence when applied to a reformulation of the optimality system in which the control appears as an implicit function of the adjoint state. We show convergence of a Finite Element approximation, provide a globalized preconditioned inexact Newton method as solver for the discretized auxiliary problems, and embed it into an inexact path-following scheme. We construct a two-dimensional test problem with fully explicit solution and present numerical results to illustrate the accuracy and robustness of the approach. This is an extended version of the corresponding journal article appearing in "Computational Optimization and Applications"; it contains some proofs that are omitted in the journal version.


Problem setting and introduction
This work is concerned with the optimal control problem
$$\min_{(y,u)\in H^1_0(\Omega)\times BV(\Omega)} J(y,u):=\tfrac12\|y-y_\Omega\|^2_{L^2(\Omega)}+\beta|u|_{BV(\Omega)}\quad\text{subject to}\quad Ay=u,\tag{OC}$$
where throughout $\Omega\subset\mathbb{R}^N$ is a bounded $C^{1,1}$ domain and $N\in\{1,2,3\}$. The control $u$ belongs to the space of functions of bounded variation $BV(\Omega)$, the state $y$ lives in $Y:=H^1_0(\Omega)$, the parameter $\beta$ is positive, and $Ay=u$ is a partial differential equation of the form $Ay+c_0y=u$ in $\Omega$, $y=0$ on $\partial\Omega$, with a non-negative function $c_0\in L^\infty(\Omega)$ and a linear and uniformly elliptic operator of second order in divergence form $A:H^1_0(\Omega)\to H^{-1}(\Omega)$,
$$Ay(\varphi)=\int_\Omega\sum_{i,j=1}^N a_{ij}\,\partial_i y\,\partial_j\varphi\,\mathrm{d}x,$$
whose coefficients satisfy $a_{ij}=a_{ji}\in C^{0,1}(\bar\Omega)$ for all $i,j\in\{1,\dots,N\}$. The specific feature of (OC) is the appearance of the BV seminorm $|u|_{BV(\Omega)}$ in the cost functional, which favors piecewise constant controls and has recently attracted considerable interest in PDE-constrained optimal control, cf. [7,13,14,16,23,25,26,29,30,33] and the earlier works [17,18]. The majority of these contributions focuses on deriving optimality conditions and studying Finite Element approximations. In contrast, the main focus of this work is on a path-following method. Specifically, we propose to smooth the TV seminorm in $J$ and to add an $H^1$ regularization, and we show in an infinite-dimensional setting that the solutions of the resulting family of auxiliary problems converge to the solution of (OC); for any given auxiliary problem we prove that an infinite-dimensional inexact Newton method converges locally; and we derive a practical path-following method that yields accurate solutions for (OC) and illustrate its capabilities in numerical examples for $\Omega\subset\mathbb{R}^2$.
To the best of our knowledge, these aspects have only been investigated partially for optimal control problems that involve the TV seminorm in the objective. In particular, there are few works that address the numerical solution when the measure $\nabla u$ is supported in a two-dimensional set. In fact, we are only aware of [23], where a doubly-regularized version of the Fenchel predual of (OC) is solved for fixed regularization parameters, but path-following is not applied. We stress that in our numerical experience the two-dimensional case is significantly more challenging than the one-dimensional case. A FEniCS implementation of our path-following method is available at https://arxiv.org/abs/2010.11628. It includes all the features that we discuss in section 6, e.g., a preconditioner for the Newton systems, a non-monotone line search globalization, and inexact path-following. A further contribution of this work is that we provide an example of (OC) for $N=2$ with fully explicit solution.
For the case that ∇u is defined in an interval (N = 1) such examples are available, e.g. [14,33], but for N = 2 this is new. Let us briefly address three difficulties associated with (OC). First, the fact that (OC) is posed in the non-reflexive space BV(Ω) complicates the proof of existence of optimal solutions. By now it is, however, well understood how to deal with this issue also in more complicated situations, cf. e.g. [14,16].
Third, our numerical experience with PDE-constrained optimal control involving the TV seminorm [14,22,29,30,33] suggests that path-following Newton methods work significantly better if the optimality systems of the auxiliary problems do not contain the control as independent variable. Therefore, we express the auxiliary optimality conditions in terms of state and adjoint state by regarding the control as an implicit function of the adjoint state.
Let us set our work in perspective with the available literature. As one of the main contributions we show that the solutions of the auxiliary problems converge to the solution of (OC), cf. section 2.3. The asymptotic convergence for vanishing H 1 seminorm regularization is analyzed in [16,Section 6] for a more general problem than (OC), but the fact that our setting is less general allows us to prove convergence in stronger norms than the corresponding [16,Theorem 10]. The asymptotic convergence for a doubly-regularized version of the predual of (OC) is established in [23,Appendix A], but one of the regularizations is left untouched, so convergence is towards the solution of a regularized problem, not towards the solution of (OC). Next, we demonstrate local convergence of an infinite-dimensional inexact Newton method applied to the optimality system of the auxiliary problem. Because the control and the adjoint state are coupled by a quasilinear PDE, this convergence analysis is non-trivial; among others, it relies on Hölder estimates for the gradient of the control that are rather technical to derive. A related result is [23,Theorem 3.5], where local q-superlinear convergence of a semismooth Newton method is shown for the doubly-regularized Fenchel predual for fixed regularization parameters. Yet, since we work with a different optimality system, the overlap is small. Nonetheless, [23] is closely related to the present paper, and it would be interesting to embed the semismooth Newton method [23, Algorithm 2] in a path-following scheme and compare it to our algorithm. The concept to view the control as an implicit function of the adjoint state or to eliminate it, is well-known in optimal control, cf., e.g., [14,23,35,36,37,38,48,49,53,56].
Turning to the discrete level we provide a Finite Element approximation and demonstrate that the Finite Element solutions of the auxiliary problems converge to their non-discrete counterparts. Finite Element approximations for optimal control in BV involving the TV seminorm have also been studied in [7,13,14,16,25,26,29,30,33], but in our assessment the regularization of (OC) that we propose is not covered by these studies. The papers [24,43] study the linear systems that arise in split Bregman methods when applied to a discretization of (OC) with homogeneous Neumann boundary conditions.
TV-regularization is also of significant importance in imaging problems and its usefulness for, e.g., noise removal has long been known [50]. However, we take the point of view that imaging problems are different in nature from optimal control problems, for instance because their forward operator is usually cheap to evaluate and non-compact.
This paper is organized as follows. After preliminaries in section 1, we consider existence, optimality conditions and convergence of solutions in section 2. In section 3 we study differentiability of the adjoint-to-control mapping, which paves the way for proving local convergence of an inexact Newton method in section 4. Section 5 addresses the Finite Element approximation and its convergence, while section 6 provides the path-following method. Numerical experiments are presented in section 7, including for the test problem with explicit solution. In section 8 we summarize. Several technical results such as Hölder continuity of solutions to quasilinear PDEs are deferred to the appendix.

Preliminaries
We recall facts about the space BV(Ω), introduce the smoothed BV seminorm that we use in this work, and collect properties of the solution operator associated to the PDE in (OC).
Convergence with respect to $d_{BV,1}$ is called strict convergence.
We will also use the following density property.
Proof By straightforward modifications the proof for the special case r = 1 in [5, Thm. 10.1.2] can be extended, using that the sequence of mollifiers constructed in the proof converges in L r , see [5,Prop. 2.2.4].
For the remainder of this work we fix a number $s=s(N)\in(1,\frac{N}{N-1})$ with
$$BV(\Omega)\hookrightarrow\hookrightarrow L^s(\Omega)\hookrightarrow H^{-1}(\Omega),$$
where the first embedding is compact and the second is continuous. For $N=1$ we interpret $\frac{N}{N-1}$ as $\infty$.

The smoothed BV seminorm
We will replace the BV seminorm in (OC) by the function $\psi_\delta:BV(\Omega)\to\mathbb{R}$, where $\delta\ge0$; for $u\in H^1(\Omega)$ it is given by $\psi_\delta(u)=\int_\Omega\sqrt{\delta+|\nabla u|^2}\,\mathrm{d}x$. We stress that $\psi_\delta$ is frequently employed in imaging problems for the same purpose, for instance in [1,21]. It has the following properties.

Lemma 2
The following statements are true for all δ ≥ 0.
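These properties can be illustrated numerically. Assuming that for smooth $u$ the smoothed seminorm reads $\psi_\delta(u)=\int_\Omega\sqrt{\delta+|\nabla u|^2}\,\mathrm{d}x$ (a reconstruction of the omitted definition, consistent with the Hessian computation in the proof of Theorem 7), the following one-dimensional numpy sketch shows three key facts: $\psi_\delta$ dominates the TV seminorm, decreases monotonically as $\delta\downarrow0$, and exceeds the TV seminorm by at most $\sqrt{\delta}\,|\Omega|$.

```python
import numpy as np

def smoothed_tv(u, h, delta):
    """Discrete analogue of psi_delta(u) = \int sqrt(delta + |u'|^2) dx
    on a uniform 1D grid with spacing h (forward differences)."""
    du = np.diff(u) / h
    return np.sum(np.sqrt(delta + du**2)) * h

x = np.linspace(0.0, 1.0, 1001)
h = x[1] - x[0]
u = (x > 0.5).astype(float)      # unit step: TV seminorm equals the jump height 1

for delta in [1.0, 1e-2, 1e-4, 0.0]:
    # values decrease monotonically towards the TV seminorm as delta -> 0
    print(delta, smoothed_tv(u, h, delta))
```

The bound $\psi_\delta(u)\le|u|_{BV(\Omega)}+\sqrt{\delta}\,|\Omega|$ follows from $\sqrt{\delta+t^2}\le\sqrt{\delta}+|t|$ and explains why replacing $|u|_{BV(\Omega)}$ by $\psi_\delta$ perturbs the objective only mildly for small $\delta$.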

The solution operator of the state equation
Lemma 3 For every $u\in H^{-1}(\Omega)$ the operator equation $Ay=u$ in (OC) has a unique solution $y=y(u)\in Y$. The solution operator $S:H^{-1}(\Omega)\to Y$, $Su:=y(u)$, is linear, continuous, and bijective. In particular, $S$ is $L^s$-$L^2$ continuous. Moreover, for given $q\in(1,\infty)$ there is a constant $C>0$ such that $\|Su\|_{W^{2,q}(\Omega)}\le C\|u\|_{L^q(\Omega)}$ is satisfied for all $u\in L^q(\Omega)$.

Remark 3 From $BV(\Omega)\hookrightarrow L^s(\Omega)\hookrightarrow H^{-1}(\Omega)$ and Lemma 3 we obtain that (OC) has a nonempty feasible set.

The solutions of original and regularized problems
In this section we prove the existence of solutions for (OC) and the associated regularized problems, characterize the solutions by optimality conditions, and show their convergence in appropriate function spaces.

The original problem: Existence of solutions and optimality conditions
To establish the existence of a solution for (OC) we use the reduced problem
$$\min_{u\in BV(\Omega)} j(u):=\tfrac12\|Su-y_\Omega\|_{L^2(\Omega)}^2+\beta|u|_{BV(\Omega)}.\tag{ROC}$$
Lemma 4 The function $j:BV(\Omega)\to\mathbb{R}$ is well-defined, strictly convex, continuous with respect to $d_{BV,s}$, and coercive with respect to $\|\cdot\|_{BV(\Omega)}$.
Proof The term $\frac12\|Su-y_\Omega\|_{L^2(\Omega)}^2$ is well-defined by Remark 3 and strictly convex in $u$ due to the injectivity of $S$. Since $|\cdot|_{BV(\Omega)}$ is convex, the strict convexity of $j$ follows. The continuity holds because $S$ is $L^s$-$L^2$ continuous. The coercivity follows by virtue of [1, Lemma 4.1], using again that $S$ is injective and $L^s$-$L^2$ continuous.
The strict convexity implies that j has at most one (local=global) minimizer.
Proof The proof is included in the proof of Theorem 2.

Remark 4
The optimality conditions of (ROC) are only needed for the construction of the test problem with explicit solution in appendix D, but not for the following analysis. They are deferred to appendix C. We stress, however, that they allow us to discuss the sparsity of $\nabla\bar u$, cf. Remark 9.
Proof The well-definedness and strict convexity of $j_{\gamma,\delta}$ follow similarly as for $j$ in Lemma 4. Using part 1 of Lemma 2 we find $j\le j_{\gamma,\delta}$, so $j_{\gamma,\delta}$ inherits the coercivity from $j$. The continuity follows term by term: for the first term it is enough to recall from Lemma 3 the $L^2$-$L^2$ continuity of $S$; the second term is Lipschitz in $H^1$ by Lemma 2; the continuity of the third term is obvious.
We obtain existence of unique and global solutions.
Proof Since $j_{\gamma,\delta}$ is strictly convex by Lemma 5, there is at most one minimizer. For $\gamma>0$ the existence of $\bar u_{\gamma,\delta}\in H^1(\Omega)$ follows from standard arguments since $j_{\gamma,\delta}|_{H^1(\Omega)}$ is strongly convex and $H^1$ continuous by Lemma 5. For $\gamma=0$ and any $\delta\ge0$, the existence of a minimizer follows from [1, Theorem 4.1] by use of the injectivity and boundedness of $S$. While $\Omega$ is convex in [1], [1, Theorem 4.1] remains true without this assumption.
Lemma 6 For $\gamma,\delta>0$ the functional $j_{\gamma,\delta}$ is continuously Fréchet differentiable on $H^1(\Omega)$. Proof It suffices to argue for $\psi_\delta$, which we do in Lemma 15 in appendix A. The other terms are standard.
For differentiable convex functions a vanishing derivative is necessary and sufficient for a global minimizer. This yields the following optimality conditions.
Proof Let $\varepsilon>0$ and let $((\gamma_k,\delta_k))_{k\in\mathbb{N}}\subset\mathbb{R}^2_{\ge0}$ converge to $(0,0)$. Using $j(\bar u)\le j(\bar u_{\gamma_k,\delta_k})\le j_{\gamma_k,\delta_k}(\bar u_{\gamma_k,\delta_k})$, the difference $j_{\gamma_k,\delta_k}(\bar u_{\gamma_k,\delta_k})-j(\bar u)$ can be bounded by two terms in brackets. The first term in brackets is estimated by means of Lemma 2. For the second term in brackets we deduce from Lemma 1 and the $d_{BV,s}$ continuity of $j$ established in Lemma 4 that there is $u_\varepsilon\in C^\infty(\bar\Omega)$ such that $|j(\bar u)-j(u_\varepsilon)|<\varepsilon$. Putting the estimates for the two terms together and passing to the limit $k\to\infty$ implies the claim since $\varepsilon>0$ was arbitrary.

We infer that the optimal controls $\bar u_{\gamma,\delta}$ converge to $\bar u$ in $L^r$ for suitable $r$.
Proof Let $((\gamma_k,\delta_k))_{k\in\mathbb{N}}\subset\mathbb{R}^2_{\ge0}$ converge to $(0,0)$. Let $C>0$ be so large that $\gamma_k,\delta_k\le C$ for all $k$. The optimality of $\bar u_{\gamma_k,\delta_k}$ and Lemma 2 yield for each $k\in\mathbb{N}$ a bound on $j(\bar u_{\gamma_k,\delta_k})$ that is uniform in $k$. As $(j(\bar u_{\gamma_k,\delta_k}))_{k\in\mathbb{N}}$ is bounded, we obtain from Lemma 4 that $(\|\bar u_{\gamma_k,\delta_k}\|_{BV(\Omega)})_{k\in\mathbb{N}}$ is bounded, too. The compact embedding of $BV(\Omega)$ into $L^r(\Omega)$, $r\in[1,\frac{N}{N-1})$, thus implies that there exists $\tilde u\in L^r(\Omega)$ such that a subsequence of $(\bar u_{\gamma_k,\delta_k})_{k\in\mathbb{N}}$, denoted in the same way, converges to $\tilde u$ in $L^r(\Omega)$. It is therefore enough to show $\tilde u=\bar u$. Since $j$ is lower semi-continuous in the $L^s$ topology, cf. Lemma 2 and the continuity and convexity of the other terms, we have
$$j(\tilde u)\le\liminf_{k\to\infty}j(\bar u_{\gamma_k,\delta_k})\le\lim_{k\to\infty}j_{\gamma_k,\delta_k}(\bar u_{\gamma_k,\delta_k})=j(\bar u),$$
where we used Lemma 7 to derive the equality. This shows $\tilde u\in BV(\Omega)$, hence Theorem 1 implies $\tilde u=\bar u$.
Theorem 5 For any $r\in[1,\frac{N}{N-1})$ and any $\hat r\in[1,\infty)$ we have
$$\lim_{(\gamma,\delta)\to(0,0)}\|\bar y_{\gamma,\delta}-\bar y\|_{W^{2,r}(\Omega)}=0\quad\text{and}\quad\lim_{(\gamma,\delta)\to(0,0)}\|\bar p_{\gamma,\delta}-\bar p\|_{W^{2,\hat r}(\Omega)}=0.$$
Proof The continuity of $S$ from $L^q$ to $W^{2,q}$ for any $q\in(1,\infty)$, see Lemma 3, implies with Lemma 8 that $\lim_{(\gamma,\delta)\to(0,0)}\|\bar y_{\gamma,\delta}-\bar y\|_{W^{2,r}(\Omega)}=0$ for any $r\in(1,\frac{N}{N-1})$; the case $r=1$ follows since $\Omega$ is bounded. The convergence of the adjoint states is obtained analogously.

Remark 5 The results of section 2 can also be established for nonsmooth domains $\Omega$, but $\bar y,\bar p,\bar y_{\gamma,\delta},\bar p_{\gamma,\delta}$ may be less regular since $S$ may not provide the regularity stated in Lemma 3. A careful inspection reveals that only Theorem 5 has to be modified.

Differentiability of the adjoint-to-control mapping
The main goal of this section is to show that the PDE (2) below has a unique weak solution $u=u(p)\in C^{1,\alpha}(\bar\Omega)$ for every right-hand side $p\in L^\infty(\Omega)$, and that $p\mapsto u(p)$ is Lipschitz continuously Fréchet differentiable in any open ball, where the Lipschitz constant is independent of $\gamma$ and $\delta$ provided $\gamma>0$ and $\delta>0$ are bounded away from zero. This is accomplished in Theorem 8. Note that we suppress the dependency on $\gamma,\delta$ in $u=u(p;\gamma,\delta)$.
We write $B$ for the open ball of radius $b_0>0$ centered at the origin in $L^\infty(\Omega)$.
We first show that p → u(p) is well-defined and satisfies a Lipschitz estimate.
Lemma 9 Let Assumption 6 hold. Then there exist $L>0$ and $\alpha\in(0,1)$ such that for each $(\gamma,\delta)\in I$ and all $p_1,p_2\in B$ the PDE (2) has unique weak solutions $u_1=u_1(p_1)\in C^{1,\alpha}(\bar\Omega)$ and $u_2=u_2(p_2)\in C^{1,\alpha}(\bar\Omega)$ that satisfy a Lipschitz estimate with constant $L$. In particular, we have a corresponding stability estimate. Proof Unique existence and the first estimate are established in Theorem 15 in appendix B. The second estimate follows from the first for $p_2=0$.
We introduce the function $f:\mathbb{R}^N\to\mathbb{R}^N$, $f(v):=v/\sqrt{\delta+|v|^2}$, the gradient of $v\mapsto\sqrt{\delta+|v|^2}$. The following two results prove that the adjoint-to-control mapping is differentiable and has a locally Lipschitz continuous derivative whose Lipschitz constant is bounded on bounded sets uniformly in $I$.
Theorem 7 Let Assumption 6 hold and let $\alpha\in(0,1)$ be the constant from Lemma 9. Then $u:B\to C^{1,\alpha}(\bar\Omega)$ is Fréchet differentiable, $z:=u'(p)d$ is characterized as the unique weak solution of the linearized PDE (4), and there exists $C>0$ such that for all $(\gamma,\delta)\in I$, all $p\in B$, and all $d\in L^\infty(\Omega)$ the corresponding estimate holds.
Proof We have the estimate $\|A\|_{C^{0,\alpha}(\bar\Omega)}\le a_0$ for $A:=\gamma I+f'(\nabla u(p))$ with a constant $a_0$ that does not depend on $\gamma,\delta,p$. Since $f'(v)$, $v\in\mathbb{R}^N$, is the Hessian of the convex function $v\mapsto\sqrt{\delta+|v|^2}$, it is positive semi-definite. It follows that Theorem 13 is applicable. Thus, the PDE (4) has a unique weak solution $z\in C^{1,\alpha}(\bar\Omega)$ that satisfies the claimed estimate. Concerning the Fréchet differentiability we consider the remainder $r(d):=u(p+d)-u(p)-z$ and set $w:=w(p,d):=\nabla u(p+d)-\nabla u(p)$. Theorem 13 implies that there is $C>0$, independent of $d$, such that $r(d)$ is bounded by the norm of a residual term; the expression in this norm satisfies an identity pointwise in $\Omega$, and Lemma 16 yields an estimate for it. As $f\in C^3(\mathbb{R}^N,\mathbb{R}^N)$ with bounded derivatives, $f'$ is Lipschitz continuous and bounded. We infer from Lemma 16 and Lemma 9 that $\|r(d)\|_{C^{1,\alpha}(\bar\Omega)}=o(\|d\|_{L^\infty(\Omega)})$, which establishes the Fréchet differentiability.

Theorem 8 Let Assumption 6 hold and let $\alpha\in(0,1)$ be the constant from Lemma 9. Then the mapping $u':B\to L(L^\infty(\Omega),C^{1,\alpha}(\bar\Omega))$ is Lipschitz continuous and the Lipschitz constant does not depend on $(\gamma,\delta)$, but only on $\Omega$, $N$, and the remaining data.
Proof Let $p,q\in B$ and $d\in L^\infty(\Omega)$. Set $z_p:=u'(p)d$ and $z_q:=u'(q)d$. Then $z_p-z_q$ solves a PDE of the type covered by Theorem 13. By the same arguments as below (5), $A:=\gamma I+f'(\nabla u(p))$ satisfies $\|A\|_{C^{0,\alpha}(\bar\Omega)}\le a_0$ with a constant $a_0$ that does not depend on $\gamma,\delta,p,q$. Moreover, $A$ is elliptic with constant $\gamma_0$. By Theorem 13 this yields an estimate for $z_p-z_q$ in which $C>0$ does not depend on $p,q$, but only on the desired quantities. From Lemma 16 and Theorem 7 we can bound the right-hand side of this estimate; Lemma 16 and Lemma 9 then imply a bound proportional to $\|p-q\|_{L^\infty(\Omega)}\|d\|_{L^\infty(\Omega)}$, where the first factor appearing in this step is bounded since $f'$ is bounded and Lipschitz. This demonstrates the asserted Lipschitz continuity.
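The central structural fact used above is that the Jacobian of the adjoint-to-control nonlinearity is a Hessian of a convex function and hence positive semi-definite. Assuming, as the proof suggests, that $f(v)=v/\sqrt{\delta+|v|^2}$ (a reconstruction of the omitted definition), a short numpy check confirms numerically that $f'$ agrees with the Hessian of $v\mapsto\sqrt{\delta+|v|^2}$ and has nonnegative eigenvalues:

```python
import numpy as np

delta = 0.5

def f(v):
    # assumed form: gradient of v -> sqrt(delta + |v|^2)
    return v / np.sqrt(delta + v @ v)

def fprime(v):
    # closed-form Jacobian of f = Hessian of the smoothed norm:
    # I/n - v v^T / n^3 with n = sqrt(delta + |v|^2)
    n = np.sqrt(delta + v @ v)
    return np.eye(len(v)) / n - np.outer(v, v) / n**3

v = np.array([0.3, -1.2])

# central finite-difference Jacobian of f for comparison
eps = 1e-6
J = np.column_stack([(f(v + eps * e) - f(v - eps * e)) / (2 * eps)
                     for e in np.eye(2)])

print(np.max(np.abs(J - fprime(v))))   # tiny discretization error
print(np.linalg.eigvalsh(fprime(v)))   # eigenvalues delta/n^3 and 1/n, both >= 0
```

The eigenvalues are $\delta/n^3$ and $1/n$, which also makes visible why the estimates degenerate as $\delta\downarrow0$: the smallest eigenvalue vanishes, and uniformity of the constants requires $\delta$ bounded away from zero.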
Remark 6 Theorem 8 stays valid, for some different α, if Ω is of class C 1,α for some α > 0.

An inexact Newton method for the regularized problems
In this section we introduce the formulation of the optimality system of (ROC$_{\gamma,\delta}$) on which our numerical method is based, and we show that the application of an inexact Newton method to this formulation is globally well-defined and locally convergent. We use the following assumption.
To find the root of $F$ we apply the inexact Newton method Algorithm 1. The remainder of this section is devoted to the convergence analysis for Algorithm 1. A similar analysis can be carried out if the optimality system of (ROC$_{\gamma,\delta}$) and the inexact Newton method are based on $u$ instead of $(y,p)$. However, in our numerical experiments the path-following method based on (6), cf. section 6, was clearly superior to its counterpart based on $u$: the former could reduce $\gamma,\delta$ to much smaller values than the latter and was also significantly more robust. Both observations are well in line with our previous experience [14,22,29,30,33] on PDE-constrained optimal control problems involving the TV seminorm.

Algorithm 1: An inexact Newton method for (ROC$_{\gamma,\delta}$)
    Set $(y_{k+1},p_{k+1})=(y_k,p_k)+(\delta y_k,\delta p_k)$
5 end
Output: $(y^*,p^*)$
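The structure of such an inexact Newton iteration can be sketched on a small model system. The map $F$ below, the Jacobi inner iteration, and the forcing rule $\eta_k=\min\{0.5,\|F(w_k)\|\}$ are illustrative stand-ins, not the paper's $F$ or its gmres setup; the point is that the linear system is solved only up to a relative residual $\eta_k$, which still yields fast local convergence.

```python
import numpy as np

def F(w):
    # toy residual map with unique root w = 0 (gradient of a strongly convex function)
    return np.array([2*w[0] + w[1] + np.sin(w[0]),
                     w[0] + 3*w[1] + w[1]**3])

def J(w):
    return np.array([[2 + np.cos(w[0]), 1.0],
                     [1.0, 3 + 3*w[1]**2]])

def inexact_newton(w, eta=lambda k, r: min(0.5, r), tol=1e-12, maxit=50):
    """Inexact Newton: solve J(w) d = -F(w) only up to relative residual eta_k
    (here via a Jacobi inner iteration), then take the full step."""
    for k in range(maxit):
        Fw = F(w)
        nrm = np.linalg.norm(Fw)
        if nrm <= tol:
            return w, k
        A, b = J(w), -Fw
        d = np.zeros(2)
        D = np.diag(A)
        while np.linalg.norm(A @ d - b) > eta(k, nrm) * nrm:  # forcing condition
            d = d + (b - A @ d) / D                           # one Jacobi sweep
        w = w + d
    return w, maxit

w, its = inexact_newton(np.array([0.8, -0.5]))
print(w, its)
```

With the residual-proportional forcing term the iteration is q-superlinear; a constant, very small $\eta_k$ would push it towards q-quadratic behavior at the price of more inner iterations, mirroring the trade-off discussed for Algorithm 2 in section 7.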
There are many works in which the control is considered as an implicit variable of some sort or avoided altogether, e.g., [14,23,35,36,37,38,48,49,53,56]. Concerning the optimal triple $(\bar y,\bar p,\bar u)$ for (OC) we share with those works the idea to base the optimality system on the smoother variables $(y,p)$. In contrast to those works, however, (6) neither improves the regularity of the controls that appear as iterates (in comparison to a formulation based on $u$) nor circumvents their use.
The next two lemmas yield convergence of Algorithm 1.
Lemma 10 Let Assumption 9 hold. Then $F$ defined in (6) is locally Lipschitz continuously Fréchet differentiable, and its derivative at $(y,p)\in Y\times P$ is obtained by componentwise linearization of (6). Proof Only $p\mapsto u(-p)$ is nonlinear, so the claims follow from Theorem 8.
Lemma 11 Let Assumption 9 hold. Then $F'(y,p)$ is invertible for all $(y,p)\in Y\times P$.
Proof The proof consists of two parts. First we show that $F'(y,p)$ is injective, and second that it is a Fredholm operator of index 0, see [39, Chapter IV, Section 5]. These two facts imply the bijectivity of $F'(y,p)$. For the injectivity, let $(\delta y,\delta p)\in Y\times P$ with $F'(y,p)(\delta y,\delta p)=0$. The representation of $z:=u'(-p)\delta p$ from Theorem 7 then yields, since $f'$ is positive semi-definite, $\|\delta y\|_{L^2(\Omega)}^2\le0$. This shows $\delta y=0$. By (7) this yields $A^*\delta p=0$ in $L^2(\Omega)$, hence $\delta p=0$, which proves the injectivity. To apply Fredholm theory we decompose $F'(y,p)$ into two operators. We want to use [39, Chapter IV, Theorem 5.26], which states: if the first operator is a Fredholm operator of index 0 and the second operator is compact with respect to the first operator (see [39, Chapter IV, Introduction to Section 3]), then their sum $F'(y,p)$ is also a Fredholm operator of index 0. By the injectivity of $F'(y,p)$ this implies its bijectivity. The operators $A:Y\to Y^*$ and $A^*:P\to L^2(\Omega)$ are invertible by Lemma 3, and thus the first operator is invertible and in particular a Fredholm operator of index 0. It remains to show that the second operator is compact with respect to the first. Thus, we have to establish that for any sequence $((\delta y_n,\delta p_n))_{n\in\mathbb{N}}\subset Y\times P$ such that there exists a $C>0$ with
$$\|\delta y_n\|_Y+\|\delta p_n\|_P+\|A\delta y_n\|_{Y^*}+\|A^*\delta p_n\|_{L^2(\Omega)}\le C\quad\forall n\in\mathbb{N},\tag{9}$$
the sequence $((u'(-p)\delta p_n,\delta y_n))_{n\in\mathbb{N}}\subset Y^*\times L^2(\Omega)$ contains a convergent subsequence. By (9), $(\|\delta y_n\|_Y)_{n\in\mathbb{N}}$ is bounded. The compact embedding $Y\hookrightarrow\hookrightarrow L^2(\Omega)$ therefore implies the existence of $\hat y\in L^2(\Omega)$ and a subsequence, denoted in the same way, such that $\|\delta y_n-\hat y\|_{L^2(\Omega)}\to0$. We also have that $(\|\delta p_n\|_P)_{n\in\mathbb{N}}$ is bounded; in particular $\|\delta p_n\|_{L^\infty(\Omega)}\le b_0$ for all $n\in\mathbb{N}$ for some $b_0>0$. By Theorem 7 this implies that $(u'(-p)\delta p_n)_{n\in\mathbb{N}}$ is bounded in $C^{1,\alpha}(\bar\Omega)$. Since $C^{1,\alpha}(\bar\Omega)\hookrightarrow\hookrightarrow Y^*$, the proof is complete.
Remark 7 Lemma 11 implies that Algorithm 1 is globally well-defined.
It is well-known that the properties established in Lemma 10 and Lemma 11 are sufficient for local linear/q-superlinear/q-quadratic convergence of the inexact Newton method if the residual in iteration k is of appropriate order, e.g. [40,Theorem 6.1.4]. Thus, we obtain the following result.

Finite Element approximation
In this section we provide a discretization scheme for (ROC γ,δ ) and prove its convergence. Throughout, we work with a fixed pair (γ, δ) ∈ R 2 >0 .

Discretization
We use Finite Elements for the discretization of (ROC$_{\gamma,\delta}$). Control, state and adjoint state are discretized by piecewise linear and globally continuous elements on a triangular grid. We point out that discretizing the control by piecewise constant Finite Elements will, in general, not ensure convergence to the optimal control $\bar u_{\gamma,\delta}$; cf. [6, Section 4]. For all $h\in(0,h_0]$ and a suitable $h_0>0$ let $\mathcal{T}_h$ denote a collection of open triangular cells $T\subset\Omega$ with $h=\max_{T\in\mathcal{T}_h}\operatorname{diam}(T)$. We write $\Omega_h:=\operatorname{int}(\cup_{T\in\mathcal{T}_h}\bar T)$. We assume that there are constants $C>0$ and $c>\frac12$ such that (10) holds. Because $V_h\hookrightarrow H^1(\Omega_h)$ it follows that $Y_h$ contains precisely those functions of $V_h$ that vanish on $\partial\Omega_h$. We use the standard nodal basis $\varphi_1,\varphi_2,\dots,\varphi_{\dim(V_h)}$ in $V_h$ and assume that it is ordered appropriately; by defining $S_hu:=y_h$ we obtain the discrete solution operator $S_h:L^2(\Omega_h)\to Y_h$ to the PDE in (OC). The discretized version of (ROC$_{\gamma,\delta}$) is (ROC$_{\gamma,\delta,h}$), where $\psi_{\delta,h}:H^1(\Omega_h)\to\mathbb{R}$ is defined in the same way as $\psi_\delta$, but with $\Omega$ replaced by $\Omega_h$. By standard arguments this problem has a unique optimal solution $\bar u_{\gamma,\delta,h}$. Based on $\bar u_{\gamma,\delta,h}$ we define $\bar y_{\gamma,\delta,h}:=S_h\bar u_{\gamma,\delta,h}$ and $\bar p_{\gamma,\delta,h}:=S_h^*(S_h\bar u_{\gamma,\delta,h}-y_\Omega)$. For $h\to0$ the triple $(\bar u_{\gamma,\delta,h},\bar y_{\gamma,\delta,h},\bar p_{\gamma,\delta,h})$ converges to the continuous optimal triple $(\bar u_{\gamma,\delta},\bar y_{\gamma,\delta},\bar p_{\gamma,\delta})$ in an appropriate sense, as we show next.
Proof Let $R_h\bar y_{\gamma,\delta}\in Y_h$ denote the Ritz projection with respect to $A$. Extending $\bar y_{\gamma,\delta,h}\in Y_h$ and $R_h\bar y_{\gamma,\delta}$ by zero to $\Omega$, choosing $\varphi_h=\bar y_{\gamma,\delta,h}-R_h\bar y_{\gamma,\delta}$ and using the ellipticity of $A$ and $c_0\ge0$ in $\Omega$ together with the Poincaré inequality in $\Omega$ yields an estimate with a constant $C>0$ independent of $h$, where we also used extension by zero and Theorem 11. Since $R_h\bar y_{\gamma,\delta}$ converges to $\bar y_{\gamma,\delta}$ as $h\to0$, the claim follows. The proof for $\bar p_{\gamma,\delta,h}-\bar p_{\gamma,\delta}$ is analogous.
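For readers less familiar with the ingredients $S_h$ and the mass/stiffness matrices used later, the following is a minimal one-dimensional analogue of the piecewise linear Finite Element discretization (the paper works with triangles in 2D via FEniCS; the mesh and data here are purely illustrative). For $-y''=u$ with $u\equiv1$ and homogeneous Dirichlet conditions, the P1 solution is nodally exact, so the computed midpoint value matches $y(x)=x(1-x)/2$.

```python
import numpy as np

def p1_matrices(nodes):
    """Assemble 1D P1 stiffness (A) and mass (M) matrices on a mesh given by
    sorted node coordinates (Dirichlet dofs not yet eliminated)."""
    n = len(nodes)
    A = np.zeros((n, n)); M = np.zeros((n, n))
    for e in range(n - 1):
        h = nodes[e + 1] - nodes[e]
        Ae = np.array([[1, -1], [-1, 1]]) / h        # local stiffness
        Me = h * np.array([[2, 1], [1, 2]]) / 6      # local mass
        idx = [e, e + 1]
        A[np.ix_(idx, idx)] += Ae
        M[np.ix_(idx, idx)] += Me
    return A, M

nodes = np.linspace(0.0, 1.0, 11)
A, M = p1_matrices(nodes)

# discrete solution operator S_h for -y'' = u with y(0) = y(1) = 0:
Ai = A[1:-1, 1:-1]
b = M[1:-1] @ np.ones(len(nodes))    # load vector for u ≡ 1
y = np.linalg.solve(Ai, b)
print(y[4])                          # midpoint value; exact solution gives 1/8
```

In the paper the same role is played by the FEniCS-assembled matrices on $\Omega_h$, and $S_h$ additionally involves the lower-order term $c_0$ and the full divergence-form operator.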

Numerical solution
Based on the Finite Element approximation from section 5 we now study an inexact Newton method to compute the discrete solution $(\bar y_{\gamma,\delta,h},\bar p_{\gamma,\delta,h},\bar u_{\gamma,\delta,h})$, and we embed it into a path-following method.

A preconditioned inexact Newton method for the discrete problems
We prove local convergence of an inexact Newton method when applied to a discretized version of (6) for fixed (γ, δ) ∈ R 2 >0 . To this end, let us introduce the discrete adjoint-to-control mapping u h . We recall that the constant h 0 > 0 is introduced at the beginning of section 5.1. The proof of the following result is similar to the continuous case in Theorems 7, 8 and 15, so we omit it.

Lemma 13
Let $h\in(0,h_0]$. For every $p\in L^2(\Omega_h)$ there exists a unique $u_h=u_h(p)\in V_h$ that satisfies the discrete version (11) of (2). The associated solution operator is $u_h:L^2(\Omega_h)\to V_h$. With $u_h$ at hand we can discretize (6), which yields the map $F_h$. The same $F_h$ is obtained if we consider the optimality conditions of (ROC$_{\gamma,\delta,h}$) and express them in terms of $(y,p)$. Moreover, $(\bar y_{\gamma,\delta,h},\bar p_{\gamma,\delta,h})$ is the unique root of $F_h$, and the properties of $F$ from Lemma 10 and Lemma 11 carry over to $F_h$.
Proof The regularity follows from Lemma 13. Since the underlying spaces are finite-dimensional, it is sufficient to show that $F_h'(y,p)$ is injective. This can be done exactly as in Lemma 11.

Similar to Theorem 10 we have the following result.
As a preconditioner for the fully discrete Newton system we use an operator built from $B=A^{-1}$, the inverse of the stiffness matrix $A$, and the mass matrix $M$. The preconditioner would agree with $F_h'(y,p)^{-1}$ if $u_h'(-p)$ were zero.
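Since the displayed formula for the preconditioner is not reproduced in this version, the following numpy sketch only illustrates the stated principle on a hypothetical block Jacobian: a block-triangular approximate inverse built from $B=A^{-1}$ and $M$ is the exact inverse when the coupling block (the analogue of $u_h'(-p)$) vanishes, and remains effective when the coupling is small.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# stand-ins for stiffness matrix A (SPD) and mass matrix M (symmetric)
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)
M = np.eye(n) + 0.1 * rng.standard_normal((n, n)); M = (M + M.T) / 2

def K(coupling):
    """Hypothetical block Jacobian [[A, coupling], [-M, A.T]]."""
    return np.block([[A, coupling], [-M, A.T]])

B = np.linalg.inv(A)
# block-triangular preconditioner: exact inverse of K(0)
P = np.block([[B, np.zeros((n, n))], [B.T @ M @ B, B.T]])

print(np.linalg.cond(K(np.zeros((n, n))) @ P))       # 1 up to round-off
C = 0.05 * rng.standard_normal((n, n))               # small coupling block
print(np.linalg.cond(K(C) @ P), np.linalg.cond(K(C)))
```

In a Krylov method such as gmres, applying $P$ amounts to two solves with the (factorized) stiffness matrix per iteration, which is exactly the role the preconditioner plays inside Algorithm 2.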

A practical path-following method
The following Algorithm 2 is a practical path-following inexact Newton method to solve (ROC$_{\gamma,\delta,h}$). We expect that its global convergence can be shown if $\rho(\gamma_i,\delta_i)$ and $\eta_k$ are chosen sufficiently small, but this topic is left for future research. For the choices detailed below, global convergence holds in practice. The inner loop in lines 3-11 of Algorithm 2 uses an inexact Newton method to compute an approximation $(\hat y_{i+1},\hat p_{i+1})$ of the root of $F_h$ for fixed $(\gamma_i,\delta_i)$. In the implementation we use $\rho(\gamma,\delta)=\max\{10^{-6},\gamma\}$, which may be viewed as inexact path-following. For the forcing term $\eta_k$ we use the two choices $\eta_k=\bar\eta_k:=10^{-6}$ and $\eta_k=\tilde\eta_k:=\max\{10^{-6},\min\{10^{-k-1},\dots\}\}$. The choice $\eta_k=\bar\eta_k$ is related to quadratic convergence, while $\eta_k=\tilde\eta_k$ corresponds to superlinear convergence, cf. Theorem 12. We also terminate gmres if the Euclidean norm of $F_h(y_k,p_k)+F_h'(y_k,p_k)(\delta y_k,\delta p_k)$ drops below $\eta_k$, since this seemed beneficial in the numerical experiments.

Algorithm 2: Inexact path-following inexact Newton method
    call Algorithm 3, input $w_k:=(y_k,p_k)$, $\delta w_k:=(\delta y_k,\delta p_k)$; output: $\lambda_k$
10  set $(y_{k+1},p_{k+1}):=(y_k,p_k)+\lambda_k(\delta y_k,\delta p_k)$
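The interplay of outer continuation loop, inner Newton loop, and the tolerance $\rho$ can be illustrated on a scalar toy problem, with a smoothed absolute value in place of $\psi_\delta$; all constants are illustrative, only $\delta$ is driven to zero, and the inner solver is a plain (exact) Newton iteration rather than the paper's inexact one.

```python
import numpy as np

beta, a = 0.3, 1.0

def grad(x, delta):
    # d/dx of 0.5*(x - a)^2 + beta*sqrt(delta + x^2)  (scalar smoothed-TV model)
    return (x - a) + beta * x / np.sqrt(delta + x**2)

def hess(x, delta):
    return 1.0 + beta * delta / (delta + x**2) ** 1.5

def path_following(x=0.0, delta=1.0, sigma=0.5, delta_min=1e-12):
    while True:
        rho = max(1e-10, delta)                 # loose tolerance along the path
        while abs(grad(x, delta)) > rho:        # inner Newton loop, warm-started
            x -= grad(x, delta) / hess(x, delta)
        if delta <= delta_min:
            return x
        delta = max(delta_min, sigma * delta)   # continuation step

x = path_following()
print(x)   # approaches the nonsmooth-limit solution a - beta = 0.7
```

The warm starts are the essential point: each subproblem is solved only to accuracy $\rho\sim\delta$, yet the final iterate resolves the limit problem accurately because the path is followed all the way down to $\delta_{\min}$.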
To compute the control $u_h(-p_k)$ that satisfies (11) we use a globalized Newton method that can be shown to converge q-quadratically for arbitrary starting points $u_0\in V_h$. In fact, since (11) is equivalent to $u_h(p)$ being the minimizer of a smooth and strongly convex problem, standard globalization techniques for Newton's method ensure these convergence properties (e.g., Newton's method combined with an Armijo line search [8, Section 9.5.3]). The method terminates when the Newton residual falls below a threshold that decreases with $(\gamma_i,\delta_i)$. The linear systems are solved using SciPy's sparse direct solver spsolve. As an alternative we tested a preconditioned conjugate gradient method (PCG). The results were mixed: the use of PCG diminished the total runtime of Algorithm 2 if all went well, but broke down on several occasions for smaller values of $(\gamma_i,\delta_i)$.
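The displayed subproblem is not reproduced in this version; a plausible form consistent with the text is $E(u)=\frac{\gamma}{2}\|u\|_{H^1}^2+\beta\psi_{\delta,h}(u)-(p,u)$, and the following one-dimensional finite-difference sketch (sizes and parameters illustrative, dense linear algebra instead of spsolve) shows Newton's method with an Armijo line search for such a smooth, strongly convex functional:

```python
import numpy as np

n = 100
h = 1.0 / n
gamma, beta, delta = 1e-3, 1e-1, 1e-4
D = (np.diag(np.ones(n - 1), 1) - np.eye(n))[:-1] / h   # forward differences

def E(u, p):
    w = D @ u
    return (0.5 * gamma * (u @ u + w @ w)
            + beta * np.sum(np.sqrt(delta + w**2)) - p @ u) * h

def gradE(u, p):
    w = D @ u
    return (gamma * (u + D.T @ w) + beta * D.T @ (w / np.sqrt(delta + w**2)) - p) * h

def hessE(u):
    w = D @ u
    s = np.sqrt(delta + w**2)
    W = np.diag(delta / s**3)             # Hessian of t -> sqrt(delta + t^2)
    return (gamma * (np.eye(n) + D.T @ D) + beta * D.T @ W @ D) * h

def newton_armijo(p, tol=1e-8):
    u = np.zeros(n)
    while np.linalg.norm(g := gradE(u, p)) > tol:
        d = np.linalg.solve(hessE(u), -g)                       # Newton direction
        t = 1.0
        while E(u + t * d, p) > E(u, p) + 1e-4 * t * (g @ d):   # Armijo condition
            t /= 2
        u = u + t * d
    return u

p = 0.05 * np.sin(2 * np.pi * np.linspace(0, 1, n))
u = newton_armijo(p)
print(np.linalg.norm(gradE(u, p)))
```

Strong convexity guarantees that the Armijo loop always terminates and that the method converges from any starting point, which is the property exploited when warm-starting across path-following iterations.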
In lines 12-14 it is decided whether to accept $(\hat y_{i+1},\hat p_{i+1})$ as a solution and terminate the algorithm; if not, then we continue the path-following by updating $(\gamma_i,\delta_i)$ with the factor $\sigma_i$. We select $\sigma_i$ based on the number of Newton steps that are needed to compute the implicit controls $\{u_h(-p_k)\}_k$ in outer iteration $i$. If this number surpasses a predefined $m\in\mathbb{N}$, then we choose $\sigma_i>\sigma_{i-1}$. If it belongs to $[0,0.75m]$, then we choose $\sigma_i<\sigma_{i-1}$. Otherwise, we let $\sigma_i=\sigma_{i-1}$. In addition, we enforce $\sigma_i\ge0.25$ for all $i$ since we found in the numerical experiments that choosing $\sigma_i$ too small can slow down or prevent convergence in some cases once $(\gamma_i,\delta_i)$ is very small, cf. Table 3 below. The weighting $1/\beta$ in the termination criterion is used since the amplitude of the adjoint state is roughly of order $\beta$ in comparison to the state. In all experiments we use $\kappa=10^{-3}$.
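The $\sigma$-update rule just described can be written compactly. The concrete increase/decrease factors below are illustrative choices (the text only prescribes the direction of the update and the lower bound $0.25$), and the cap below 1 is an added safeguard so that $(\gamma_i,\delta_i)$ keeps decreasing:

```python
def update_sigma(sigma_prev, newton_steps_u, m, inc=1.25, dec=0.8):
    """Adapt the continuation factor sigma_i from the number of Newton steps
    needed for the implicit controls in the last outer iteration."""
    if newton_steps_u > m:
        sigma = sigma_prev * inc      # subproblems got hard: shrink (gamma, delta) more slowly
    elif newton_steps_u <= 0.75 * m:
        sigma = sigma_prev * dec      # subproblems are easy: shrink faster
    else:
        sigma = sigma_prev
    # enforce sigma_i >= 0.25 as in the text; the upper cap is a safeguard
    return min(max(sigma, 0.25), 0.95)
```

For example, with $m=8$ an outer iteration that needed 10 Newton steps for the controls raises $\sigma$ from $0.5$ to $0.625$, while one that needed only 5 steps lowers it to $0.4$.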

Algorithm 3 augments the inexact Newton method in lines 3-11 of Algorithm 2 by a non-monotone line search globalization introduced in [42] for Broyden's method. The non-monotonicity allows us to always accept the inexact Newton step and yields potentially larger step sizes than descent-based strategies. The intention is to keep the number of trial step sizes low, since every trial step size requires an evaluation of $F_h$ and hence a recomputation of $u_h(-p_k)$. In the numerical experiments we use $\tau=10^{-4}$ and observe that in the vast majority of iterations full steps are taken, i.e., $\lambda_k=1$. To briefly discuss convergence properties of the globalized inexact Newton method, let us assume for simplicity that $u_h(-p_k)$ is determined exactly for each $k$. By following the arguments of [42] we can show that for sufficiently small $\eta_k$ the sequence $((y_k,p_k))_{k\in\mathbb{N}}$ obtained by ignoring lines 4-7 must either be unbounded or converge to the unique root of $F_h$; a key ingredient in the corresponding proof is that $F_h'(y,p)$ is invertible for all $(y,p)$, which we have demonstrated in Lemma 14. In particular, if $((y_k,p_k))_{k\in\mathbb{N}}$ is bounded, then the globalized inexact Newton method in lines 3-11 terminates finitely. While we always observed this termination in practice, the question whether $((y_k,p_k))_{k\in\mathbb{N}}$ can be unbounded remains open. Furthermore, as in [42] it follows that if $((y_k,p_k))_{k\in\mathbb{N}}$ converges, then eventually step size 1 is always accepted, in turn ensuring that the convergence rates of Theorem 12 apply.
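The acceptance test below is modeled on the derivative-free non-monotone rule of [42]; the paper's exact criterion is not reproduced in this version, so the inequality and the role of $\varepsilon_k$ (a summable forcing sequence that lets the full inexact Newton step pass despite a mild residual increase) are stated here as assumptions:

```python
import numpy as np

def nonmonotone_step(F, w, d, eps_k, tau=1e-4, lam_min=1e-12):
    """Non-monotone line search in the style of [42]: accept lambda once
    ||F(w + lam*d)|| <= (1 + eps_k)*||F(w)|| - tau*lam^2*||d||^2.
    The residual may increase by up to eps_k*||F(w)||, so full steps
    are accepted far more often than with a monotone descent test."""
    nF = np.linalg.norm(F(w))
    lam = 1.0
    while np.linalg.norm(F(w + lam * d)) > (1 + eps_k) * nF - tau * lam**2 * (d @ d):
        lam /= 2
        if lam < lam_min:
            raise RuntimeError("line search failed")
    return lam

# toy residual map with root at (1, 2)
F = lambda w: np.array([w[0] - 1.0, (w[1] - 2.0) ** 3])
w = np.array([0.0, 0.0])
d = np.array([1.0, 2.0])                      # exact step towards the root
print(nonmonotone_step(F, w, d, eps_k=0.1))   # full step accepted
```

A poor direction, in contrast, is cut back: for $d=(-10,0)$ the same routine returns $\lambda=0.25$ after two halvings. Each trial evaluates $F$ once, which is exactly why few trial step sizes matter when every $F_h$-evaluation requires recomputing $u_h(-p_k)$.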
All norms without index in Algorithm 2 and 3 are L 2 (Ω h ) norms.
Algorithm 3: Computation of step size

Numerical experiments
We provide numerical results for two examples. Our main goal is to illustrate that Algorithm 2 can robustly compute accurate solutions of (OC). The results are obtained from a Python implementation of Algorithm 2 using DOLFIN [46,47], which is part of FEniCS [3,45]. The code for the second example is available at https://arxiv.org/abs/2010.11628.

Example 1: An example with explicit solution
The first example has an explicit solution and satisfies the assumptions used in this work. We consider (OC) for an arbitrary $\beta>0$ with the non-convex $C^\infty$ domain $\Omega=B_{4\pi}(0)\setminus\overline{B_{2\pi}(0)}$ in $\mathbb{R}^2$, $A=-\Delta$ and $c_0\equiv0$. The desired state $y_\Omega$ is a radially symmetric function of $r(x,y)=\sqrt{x^2+y^2}$, and the optimal state $\bar y$ is radially symmetric as well, involving constants $A$, $B$, $C$ whose values are contained in appendix D. The optimal control is $\bar u(r)=1_{(2\pi,3\pi)}(r)$, i.e., $\bar u$ has value 1 on the annulus $B_{3\pi}(0)\setminus\overline{B_{2\pi}(0)}$ and value 0 on the annulus $B_{4\pi}(0)\setminus\overline{B_{3\pi}(0)}$. The optimal value is $j(\bar u)\approx24.85\beta^2+59.22\beta$. In appendix D we provide details on the construction of this example and verify that $(\bar y,\bar u)$ is indeed the optimal solution of (OC). If not stated otherwise, we use $\beta=10^{-3}$.
We use unstructured triangulations that approximate $\partial\Omega$ increasingly better as the meshes become finer, cf. (10). Figure 1 depicts the optimal control $\bar u_h$, the optimal state $\bar y_h$ and the negative optimal adjoint state $-\bar p_h$, which were computed by Algorithm 2 on a grid with 1553207 degrees of freedom (DOF).
We begin by studying convergence on several grids. We use the fixed ratio $(\gamma_i/\delta_i)\equiv10^2$ and apply Algorithm 2 with $(\gamma_0,\delta_0)=(1,0.01)$ and $(\hat y_0,\hat p_0)=(0,0)$. Table 1 shows #it, which represents the total number of inexact Newton steps for $(y,p)$, and #it$_u$, which is the total number of Newton steps used to compute the implicit function $u$. Table 1 also contains the errors, computed with respect to a reference grid $\Omega^*$ with DOF = 1553207. To evaluate the errors, $\hat u_{\mathrm{final}}$, $\hat y_{\mathrm{final}}$ and $\hat p_{\mathrm{final}}$ are extended to $\Omega^*$ using extrapolation. Table 2 provides details for the run from Table 1 with DOF = 97643 and $\eta_k=\tilde\eta_k$; it includes the quantity that appears in the termination criterion of Algorithm 2 and the tolerance $\tau^i_u$ used in the computation of the implicit controls. Table 1 indicates convergence of the computed solutions $(\hat u_{\mathrm{final}},\hat y_{\mathrm{final}},\hat p_{\mathrm{final}})$ to $(\bar u,\bar y,\bar p)$ and of the objective value $j_{\gamma_{\mathrm{final}},\delta_{\mathrm{final}},h}$ to $\bar j$. It also suggests that convergence takes place at certain rates with respect to $h$.
Moreover, the total number of Newton steps both for (y, p) and for u stays bounded as DOF increases, which suggests mesh independence. The choice η_k = η̂_k frequently yields lower numbers of Newton steps for (y, p) and for u, yet the runtime (not depicted) is consistently higher than for η_k = η̄_k since more iterations of gmres are required to compute the step for (y, p). Specifically, using η̄_k saves between 5% and 36% of runtime, with 36% being the saving on the finest grid. (Since the runtime depends on many factors, these numbers are intended as reference points rather than exact values.) In the vast majority of iterations, step size 1 is accepted for (y_k, p_k). For instance, all of the 52 iterations required for DOF = 97643 and η_k = η̄_k use full steps; for DOF = 6251 and η_k = η̂_k, 86 of the 87 iterations use step size 1. Table 3 displays the effect of fixing (σ_i) ≡ σ in Algorithm 2. The mesh uses DOF = 24443 and is the same as in Table 1.
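The interplay between the forcing term and the inner linear solver can be illustrated on a small model problem. The following is only a sketch, not Algorithm 2 itself: the nonlinear system F, the Richardson inner iteration and all parameter values are stand-ins. The one feature carried over is the inexact Newton principle: the inner solve for the step s is terminated as soon as the linear residual satisfies ‖F'(x)s + F(x)‖ ≤ η_k ‖F(x)‖, and the full step (step size 1) is then taken.

```python
import numpy as np

def make_problem(n=5):
    # Toy stand-in for the discretized optimality system:
    # F(x) = A x + 0.01*sin(x) - b with A symmetric positive definite.
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    b = np.ones(n)
    F = lambda x: A @ x + 0.01 * np.sin(x) - b
    J = lambda x: A + 0.01 * np.diag(np.cos(x))
    return F, J

def inexact_newton(F, J, x, eta=0.5, tol=1e-10, max_outer=100, max_inner=500):
    """Inexact Newton with full steps and a fixed forcing term eta."""
    for _ in range(max_outer):
        r = F(x)
        if np.linalg.norm(r) <= tol:
            break
        A = J(x)
        omega = 1.0 / np.linalg.norm(A, 2)     # Richardson step size
        s = np.zeros_like(x)
        for _ in range(max_inner):
            lin_res = A @ s + r
            # forcing condition: stop the inner solve early
            if np.linalg.norm(lin_res) <= eta * np.linalg.norm(r):
                break
            s -= omega * lin_res               # Richardson iteration (A is SPD here)
        x = x + s                              # full Newton step
    return x

F, J = make_problem()
x = inexact_newton(F, J, np.zeros(5))
print(np.linalg.norm(F(x)))                    # residual below the tolerance
```

A larger η makes each inner solve cheaper but increases the number of outer Newton steps, which is the trade-off observed above between the two forcing-term choices.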
For σ = 0.1 the iterates failed to converge for both forcing terms once γ_{12} = 10^{-12} is reached, because u_h(−p_k) could not be computed to sufficient accuracy within the 200 iterations that we allow for this process. Together with the case σ = 0.2 in Table 3, this shows that small values of σ_i can increase the number of steps required for u and even prevent convergence. We therefore enforce σ_i ≥ 0.25 for all i in all other experiments, although this diminishes the efficacy of Algorithm 2 in some cases.
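The role of the reduction factors σ_i can be sketched as a generic continuation loop. This is a hypothetical driver, not Algorithm 2: the scalar model objective 0.5(x−1)² + γ√(δ + x²), the inner solver `solve` and the parameter values are placeholders. It only mimics the outer structure, in which γ is multiplied by σ (clipped from below at 0.25) while the ratio γ/δ is held fixed, and each inner solve is warm-started at the previous solution.

```python
import numpy as np

def path_following(solve, x0, gamma0=1.0, ratio=100.0, gamma_min=1e-12,
                   sigma=0.5, sigma_min=0.25):
    """Generic continuation driver (outer loop only).

    solve(x, gamma, delta) -> x : inner solver, warm-started at x.
    gamma_i/delta_i is kept at the fixed value `ratio`; each homotopy step
    multiplies gamma by sigma, clipped from below to avoid the overly
    aggressive reductions that caused failures for sigma = 0.1.
    """
    x, gamma = x0, gamma0
    while gamma >= gamma_min:
        x = solve(x, gamma, gamma / ratio)     # delta = gamma / ratio
        gamma *= max(sigma, sigma_min)
    return x

def solve(x, gamma, delta):
    # Toy inner "solver": Newton steps for the strongly convex scalar model
    # f(x) = 0.5*(x - 1)**2 + gamma*sqrt(delta + x**2).
    for _ in range(20):
        g = (x - 1.0) + gamma * x / np.sqrt(delta + x**2)   # f'(x)
        H = 1.0 + gamma * delta / (delta + x**2) ** 1.5     # f''(x)
        x -= g / H
    return x

x = path_following(solve, 0.0)
print(x)   # approaches the limit solution x = 1 as gamma -> 0
```

As γ → 0 the smoothed penalty γ√(δ + x²) vanishes, so the iterates trace a path toward the minimizer of the unregularized term, mirroring how the auxiliary solutions approach the solution of (OC).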
We now turn to the robustness of Algorithm 2. We emphasize that in our numerical experience the robustness of algorithms for optimal control problems involving the TV seminorm in the objective is a delicate issue. Table 4 displays the iteration numbers required by Algorithm 2 for different values of β on the mesh with DOF = 24443, along with the error E_u for η_k = η̄_k, for the two choices (γ_i/δ_i) ≡ 10² and (γ_i/δ_i) ≡ 1. The omitted values for β = 10^{-3} and (γ_i/δ_i) ≡ 10² are identical to those from Table 1 for DOF = 24443 and η_k = η̄_k. Table 5 provides iteration numbers and errors for various fixed choices of (γ_i/δ_i) on the mesh with DOF = 24443 for β = 10^{-3}, η_k = η̄_k and (σ_i) ≡ 0.5. For the ratios 10^{-1} and 10^{-2} we increased κ from 10^{-3} to 5 · 10^{-3} to obtain convergence. Since our goal is to demonstrate robustness, no further changes are made, although further tuning would lower the iteration numbers. Tables 4 and 5 suggest that Algorithm 2 is able to handle a range of parameter values without modification of its internal parameters.

Example 2
From section 3 onward we have used that Ω is of class C^{1,1}. To show that Algorithm 2 can still solve (OC) if Ω is only Lipschitz, we consider an example from [23]. Figure 2 depicts the optimal control ū_h, optimal state ȳ_h and negative optimal adjoint state −p̄_h, which were computed with n = 1024. Apparently, ū_h is piecewise constant.
Throughout, we use the fixed ratio (γ_i/δ_i) ≡ 10^{-2} and apply Algorithm 2 with (γ_0, δ_0) = (0.01, 1) and (ŷ_0, p̂_0) = (0, 0). As in example 1, cf. Table 5, other ratios γ_i/δ_i can be employed as well. We only provide results for η_k = η̄_k since the forcing term η̂_k does not yield lower runtimes in this example; both forcing terms produce the same errors, though. Table 6 displays iteration numbers and errors for different grids, while Table 7 shows details for n = 256. Table 6 hints at possible mesh independence for (y, p), but indicates that the number of Newton steps for u increases with n. The depicted errors are computed by use of a reference solution that is obtained by Algorithm 2 with η_k = η̄_k on the mesh with n = 1024. As in the first example, it seems that convergence with respect to h takes place at certain rates. The majority of iterations use full Newton steps for (y, p). For instance, all but one of the 50 iterations for n = 256 use step length one. We also repeated the experiments from Table 6 with y_Ω rotated by 30 and 45 degrees, respectively. The omitted results are similar to those in Table 6, which further illustrates the robustness of our approach. Table 8 shows the outcome of Algorithm 2 if a sequence of nested grids is used, where the grids are refined once γ_i < 10^{-4}, γ_i < 10^{-6} and γ_i < 10^{-8}, respectively. In this example, nesting reduces the runtime by about 57% while providing the same accuracy as a run for n = 512, cf. the last line of Table 6. Table 9 addresses the robustness of Algorithm 2 with respect to β. The computations are carried out on nested grids and the displayed iteration numbers are those for the finest grid, which has n = 128. The reference solution is computed for n = 256. The final grid change happens once γ_i < 10^{-8}. Table 9 indicates that Algorithm 2 is robust with respect to β. As in example 1, it is possible to achieve lower iteration numbers through manipulation of the algorithmic parameters.
For instance, if the final grid change for β = 10^{-5} happens once γ_i < 10^{-9} instead of γ_i < 10^{-8}, then only (41, 638) iterations are needed on the final grid instead of (70, 958).

Summary
We have studied an optimal control problem with controls from BV in which the control costs are given by the TV seminorm, which favors piecewise constant controls. By smoothing the TV seminorm and adding an H¹ regularization we obtained a family of auxiliary problems whose solutions converge to the optimal solution of the original problem in appropriate function spaces. For fixed smoothing and regularization parameters we showed local convergence of an infinite-dimensional inexact Newton method applied to a reformulation of the optimality system that involves the control as an implicit function of the adjoint state. Based on a convergent Finite Element approximation, a practical path-following algorithm was derived, and it was demonstrated that this algorithm is able to robustly compute the optimal solution of the control problem with considerable accuracy. To verify this, a two-dimensional test problem with known solution was constructed.

A Differentiability of ψ δ
The following result follows by standard arguments, but we provide its proof for convenience.

Lemma 15
Let δ > 0, N ∈ N and let Ω ⊂ R^N be open and bounded. The functional ψ_δ : H^1(Ω) → R,

ψ_δ(u) := ∫_Ω √(δ + |∇u|²) dx,

is Lipschitz continuously Fréchet differentiable and twice Gâteaux differentiable. Its first derivative at u in direction v and its second derivative at u in directions v, w are given by

ψ_δ'(u)v = ∫_Ω (∇u, ∇v)/√(δ + |∇u|²) dx

and

ψ_δ''(u)[v, w] = ∫_Ω (∇v, ∇w)/√(δ + |∇u|²) − (∇u, ∇v)(∇u, ∇w)/(δ + |∇u|²)^{3/2} dx.

Proof First Gâteaux derivative: Let u, v ∈ H^1(Ω). Since y ↦ √(δ + |y|²) is Lipschitz continuous on R^N with constant 1, we obtain for all t ∈ [−1, 1], t ≠ 0,

|√(δ + |∇u + t∇v|²) − √(δ + |∇u|²)|/|t| ≤ |∇v| ∈ L^1(Ω).

Thus, we can apply the theorem of dominated convergence, which yields

lim_{t→0} [ψ_δ(u + tv) − ψ_δ(u)]/t = ∫_Ω (∇u, ∇v)/√(δ + |∇u|²) dx.

From |(∇u, ∇v)|/√(δ + |∇u|²) ≤ |∇v| we see that the functional v ↦ ψ_δ'(u)v is linear and continuous.
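The first-derivative formula can be sanity-checked numerically by comparing it against a finite difference of a simple one-dimensional discretization of ψ_δ. This is only a sketch: the grid size, quadrature and test functions below are arbitrary choices.

```python
import numpy as np

delta, n = 0.5, 200
x = np.linspace(0.0, 1.0, n + 1)
h = x[1] - x[0]

def psi(u):
    # discretization of psi_delta(u) = ∫_0^1 sqrt(delta + |u'|^2) dx
    du = np.diff(u) / h                    # cellwise derivative
    return np.sum(np.sqrt(delta + du**2)) * h

def dpsi(u, v):
    # analytic first derivative: ∫_0^1 (u' v') / sqrt(delta + |u'|^2) dx
    du, dv = np.diff(u) / h, np.diff(v) / h
    return np.sum(du * dv / np.sqrt(delta + du**2)) * h

u = np.sin(2.0 * np.pi * x)
v = np.cos(3.0 * np.pi * x)
t = 1e-5
fd = (psi(u + t * v) - psi(u - t * v)) / (2.0 * t)   # central difference
err = abs(fd - dpsi(u, v))
print(err)   # small (central-difference accuracy)
```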

Second Gâteaux derivative
Let u, v, w ∈ H^1(Ω). Since g : R^N → R, g(y) := (y, z)/√(δ + |y|²), with z ∈ R^N fixed, is Lipschitz continuous on R^N with constant (2/√δ)|z|, we obtain for all t ∈ R, t ≠ 0,

(1/|t|) |(∇u + t∇w, ∇v)/√(δ + |∇u + t∇w|²) − (∇u, ∇v)/√(δ + |∇u|²)| ≤ (2/√δ)|∇v||∇w|.

Dominated convergence yields

ψ_δ''(u)[v, w] = ∫_Ω (∇v, ∇w)/√(δ + |∇u|²) − (∇u, ∇v)(∇u, ∇w)/(δ + |∇u|²)^{3/2} dx,   (15)

where we used the directional derivative of g to derive the last equality. From (15) we deduce the boundedness of the bilinear mapping (v, w) ↦ ψ_δ''(u)[v, w].

Lipschitz continuous Fréchet differentiability: Denoting by ‖·‖_B the standard norm for bounded bilinear forms on H^1(Ω) × H^1(Ω), we infer from (16) that ψ_δ' is Lipschitz continuous.

B Hölder continuity for quasilinear partial differential equations

To prove results on the Hölder continuity of solutions to quasilinear elliptic PDEs, we first discuss linear elliptic PDEs.
We collect elementary estimates for Hölder continuous functions.
We can now establish the desired regularity and continuity result for (2).
Proof We did not find a proof of Theorem 15 in the literature, so we provide it here. To this end, let b_0 > 0 and let p_1, p_2 ∈ L^∞(Ω) with ‖p_1‖_{L^∞(Ω)}, ‖p_2‖_{L^∞(Ω)} < b_0. Part 1: Showing existence of u_1, u_2 ∈ H^1(Ω). For i = 1, 2 we define the functional F_i. As F_i is strongly convex, it has a unique minimizer u_i ∈ H^1(Ω). Since F_i is Fréchet differentiable by Lemma 15, we have F_i'(u_i) = 0 in H^1(Ω)^*, which is equivalent to (22).

C The original problem: Optimality conditions
The first-order optimality conditions of (ROC) can be obtained by use of [9]. The space W^q_0(div; Ω), q ∈ [1, ∞), that appears in the following is defined in [9, Definition 10]. Here, the first, second and third equation correspond to the absolutely continuous part, the jump part and the Cantor part of the vector measure ∇ū, respectively. Also, σ_{Cū} is the Radon-Nikodym density of ∇ū_c with respect to |∇ū_c|.

Proof The well-known optimality condition 0 ∈ ∂j(ū) from convex analysis can be expressed as −p̄/β ∈ ∂|ū|_{BV(Ω)}, so the claim follows from [9, Proposition 8].
Since {x : |h(x)| = β} typically has small Lebesgue measure (often measure 0), ū is usually constant a.e. in large parts (often all) of Ω; cf. also the example in appendix D.