A nonsmooth primal-dual method with interwoven PDE constraint solver

We introduce an efficient first-order primal-dual method for the solution of nonsmooth PDE-constrained optimization problems. We achieve this efficiency by not solving the PDE or its linearisation at each iteration of the optimization method. Instead, we run the method interwoven with a simple conventional linear system solver (Jacobi, Gauss-Seidel, conjugate gradients), taking only one step of the linear system solver for each step of the optimization method. The control parameter is updated on each iteration as determined by the optimization method. We prove linear convergence under a second-order growth condition, and numerically demonstrate the performance on a variety of PDEs related to inverse problems involving boundary measurements.


introduction
Our objective is to develop efficient first-order algorithms for the solution of PDE-constrained optimization problems of the type min ,  () +  () +  () subject to (, ; ) =  for all , where  is a linear operator and the functions  , , and  are convex but the first two possibly nonsmooth. The functionals  and  model a partial differential equation in weak form, parametrised by ; for example, (, ; ) = ⟨∇, ∇⟩.
Semismooth Newton methods [28,30] are conventionally used for such problems when a suitable reformulation exists [19,21,34,35,20]. Reformulations may not always be available, or may not yield effective algorithms. The solution of large linear systems may also pose scaling challenges. Therefore, first-order methods for PDE-constrained optimization have been proposed [8,6,27,7], based on the primal-dual proximal splitting (PDPS) of [5]. The original version applies to convex problems of the form (1.1) min   () +  ().
The primal-dual expansion permits efficient treatment of  •  for nonsmooth . In [8,6,27,7]  may be nonlinear, such as the solution operator of a nonlinear PDE. However, first-order methods generally require a very large number of iterations to exhibit convergence. If the iterations are cheap, they can, nevertheless, achieve good performance. If the iterations are expensive, such as when a PDE needs to be solved on each step, their performance can be poor. Therefore, especially in inverse problems research, Gauss-Newton-type approaches are common.
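To make the basic PDPS of [5] concrete, the following minimal Python sketch (our own naming; the paper's implementation is in Julia) applies the standard dual step, primal step, and over-relaxation to the toy instance min_x ½‖x − b‖² + λ‖x‖₁ with the identity as the linear operator, whose exact solution is soft thresholding of b. This is an illustration of the generic scheme only, not the interwoven algorithm of this paper.

```python
import numpy as np

def pdps(b, lam, tau=0.5, sigma=0.5, iters=2000):
    """Basic primal-dual proximal splitting for
    min_x 0.5*||x - b||^2 + lam*||x||_1 with K = Id.
    Requires tau * sigma * ||K||^2 < 1."""
    x = np.zeros_like(b)
    x_bar = x.copy()
    y = np.zeros_like(b)
    for _ in range(iters):
        # dual step: prox of the conjugate of lam*||.||_1 is the
        # projection onto the box [-lam, lam]
        y = np.clip(y + sigma * x_bar, -lam, lam)
        # primal step: prox of tau*F with F = 0.5*||. - b||^2
        x_new = (x - tau * y + tau * b) / (1 + tau)
        # over-relaxation with omega = 1
        x_bar = 2 * x_new - x
        x = x_new
    return x

b = np.array([3.0, -0.5, 1.2, 0.1])
x = pdps(b, lam=1.0)
# optimality for this toy problem is soft thresholding of b
```

Since the primal objective is strongly convex, the iterates converge linearly here; the step sizes satisfy τσ‖K‖² = 0.25 < 1.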

notation and basic results
Let  be a normed space. We write ⟨ • | • ⟩ for the dual product and, in a Hilbert space, ⟨ • , • ⟩ for the inner product. The order of the arguments in the dual product is not important when the action is obvious from context. For  a Hilbert space, we denote by In  :  ↩→  * the canonical injection, ⟨In   | x⟩ = ⟨, x⟩ for all , x ∈  .
These expressions hold in Hilbert spaces also with the inner product in place of the dual product. We write  ★ for the inner product adjoint of , and  * for the dual product adjoint. We write dom  for the effective domain, and  * for the Fenchel conjugate of  :  → ℝ := [−∞, ∞]. We write  ′ () ∈  * for the Fréchet derivative at  when it exists, and, if  is a Hilbert space, ∇ () ∈  for its Riesz representation. For convex  on a Hilbert space  , we write  () ⊂  for the subdifferential at  ∈  (or, more precisely, the corresponding set of Riesz representations; aside from a single proof in Appendix a, we will not need subderivatives as elements of  * ). We then define the proximal map prox  () := (Id + ) −1 () = arg min. We denote the {0, ∞}-valued indicator function of a set  by   . We occasionally apply operations on  ∈  to all elements of sets  ⊂  , writing ⟨ + |⟩ := {⟨ + |⟩ |  ∈ }. For  ⊂ ℝ, we write  ≥  if  ≥  for all  ∈ .

problem and proposed algorithm
We start by introducing in detail the type of problem we are trying to solve. We then rewrite in Section 2.1 its optimality conditions in a form suitable for developing our proposed method in Section 2.3. Before this we recall the structure and derivation of the basic PDPS in Section 2.2.
To ensure the coercivity of ( • , • ; ), and hence the existence of unique solutions to (2.2), we will further need to restrict  through dom  .
We require the sum and chain rules for convex subdifferentials to hold on  +  • . This is the case when (2.3) there exists an  ∈ dom( • ) ∩ dom  with  ∈ int(dom ).
We may now write (2.1) as  1 If the PDE (2.2) does not have a solution  for any  ∈ dom  ∩ dom( • ), the inner "max" will be infinite, not reached, and technically, therefore, a "sup". In this case also (2.1) has no solution. If (2.1) has a solution, there must exist some (, ) for which (any)  reaches the "max". Likewise,  reaching the corresponding "max" exists for any  ∈ dom( • ) by basic properties of Fenchel conjugates of convex, proper, lower semicontinuous functions. In terms of ( ū, w, x, ȳ) ∈  ×  ×  ×  , subject to a qualification condition, this problem has the necessary first-order optimality conditions (2.9). This is our principal form of optimality conditions for (2.1).
It is easy to see that (2.9) are necessary for ( ū, w, x, ȳ) to be a saddle point of (2.8). The next theorem shows, subject to qualification conditions, that (2.9) are also necessary for a solution of (2.8) (which may not be a saddle point in the non-convex-concave setting). Note that  ∈  is inconsequential in (2.8): if one choice forms a part of a solution of the problem, so does any other (or else the problem has no solution at all). However, w solving (2.9) is more precisely determined.
After an affine shift and restriction of  to a subspace, the condition int dom[ + •] ≠ ∅ can always be relaxed to the corresponding relative interior being non-empty. Since the proof of Theorem 2.3 is long and depends on techniques not needed in our main line of work, we relegate it to Appendix a. Example 2.4. If  =  , taking ℎ  = /∥ ∥ and ℎ  = 0, we see that the qualification conditions (2.10) hold when   ( • , • ; x) is coercive. Similarly, also when  ≠  , if the weak coercivity conditions of the Babuška–Lax–Milgram theorem hold for (, ℎ  ) ↦ →   (ℎ  , ; x), then so do (2.10).

algorithm derivation
The derivation of the PDPS and the optimality conditions (2.9) suggest solving (2.9) by iteratively solving (2.13). We have made the step length and over-relaxation parameters iteration-dependent for acceleration purposes. The indexing   and  +1 is off-by-one to maintain the symmetric update rules from [5].
The next assumption encodes our conditions on the PDE splittings. Assumption 3.3 (Splitting). Let Assumption 3.1 hold. For  ∈ ℕ, for which this assumption is to hold, we are given splitting operators Γ  , Υ  :  ×  ×  → ℝ and   = (  ,   ,   ,   ) ∈  ×  ×  ×  such that: (i) Γ  is linear in the second argument, and Υ  in the first.
(iii) For some   > 0 and   ,   ,   ≥ 0, we have. We verify the assumption for standard splittings in Section 4.2. The verification will introduce the assumption that  ′ be Lipschitz. The Lipschitz factor then appears in   , justifying the -subscript notation. Generally   and   model the -sensitivity of  and   . For linear PDEs, such as Example 2.1,   does not depend on . In that case most iterative solvers for the adjoint PDE would also be independent of  and have   = 0. The factor   relates to the contractivity of the iterative solver.
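For intuition on the standard splittings verified in Section 4.2: the classical Jacobi and Gauss-Seidel methods split a matrix A = M − N, with M the diagonal or the lower triangle respectively, and iterate u ← u + M⁻¹(b − Au). One such sweep per outer iteration is the kind of cheap update the interwoven scheme performs. A Python sketch under our own naming, on a small diagonally dominant test matrix where both splittings contract:

```python
import numpy as np

def splitting_step(A, b, u, kind="jacobi"):
    """One sweep u <- u + M^{-1}(b - A u) of a matrix splitting A = M - N."""
    if kind == "jacobi":
        M = np.diag(np.diag(A))
    elif kind == "gauss-seidel":
        M = np.tril(A)  # diagonal plus strictly lower triangle
    else:
        raise ValueError(kind)
    return u + np.linalg.solve(M, b - A @ u)

# strictly diagonally dominant SPD test matrix
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
results = {}
for kind in ("jacobi", "gauss-seidel"):
    u = np.zeros(3)
    for _ in range(200):
        u = splitting_step(A, b, u, kind)
    results[kind] = u
```

After enough sweeps both variants reach the exact solution of Au = b; the contractivity factor mentioned above is the spectral radius of I − M⁻¹A.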
The next, final, assumption introduces testing parameters that encode convergence rates and restrict the step length parameters in the standard primal-dual component of our method. It does not differ from the treatment of the PDPS in [37,9]. Depending on whether both, one, or none of γ > 0 and γ * > 0 hold, the parameters can be chosen to yield varying modes and rates of convergence. Assumption 3.4 (Primal-dual parameters). Let Assumption 3.1 hold. For all  ∈ ℕ, the testing parameters   ,  > 0, step length parameters   ,   > 0, and the over-relaxation parameter   ∈ (0, 1] satisfy for some γ ∈ [0,   ] and γ * ∈ [0,   * ], and  ∈ (0, 1) that

the testing operator
To complement the primal-dual testing parameters in Assumption 3.4, we introduce testing parameters   ,   > 0 corresponding to the PDE updates in our method, i.e., the first two lines of (2.19). We combine all of them into the testing operator. Recalling   and   from (2.18) and (3.1), thanks to Assumption 3.4, we have. Therefore, for skew-symmetric , Assumption 3.4 ensures     to be positive semi-definite. The proof is exactly as for the PDPS, see, e.g., [9], but we include it for completeness. Lemma 3.5. Let  ∈ ℕ and suppose Assumption 3.4 holds. Then. Proof. By Young's inequality, for any  = (, , , ), the claim follows. □

Since both
where all the terms are non-negative.
Proof. Lemma 3.6 gives the estimate. By

main results
We can now state our main convergence theorems. In terms of assumptions, the only fundamental difference between the accelerated  (1/ ) and the linear convergence result is that the latter requires  * to be strongly convex and the former does not. Both require sufficient second-order growth in terms of the respective technical conditions (3.16b) or (3.19b). The step length parameters differ. Theorem 3.10 (Accelerated convergence). Suppose Assumptions 3.1 and 3.3 hold with   > 0. Put γ * = 0 and pick  0 ,  0 , ,  > 0 and 0 < γ <   satisfying, where  0 is defined as part of the update rules  +1 :=     ,  +1 :=   /  , and. Let { +1 }  ∈ℕ be generated by Algorithm 2.1 for any  0 ∈  ×  ×  ×  . Then   → x in  ;   → ū in  ; and   → w in  , all strongly at the rate  (1/ ).
Proof. As shown in [37,9], Assumption 3.4 is satisfied for  0 = 1,  0 =  −1 ,  +1 :=   /  , and  +1 :=   /  . Moreover, both {  }  ∈ℕ and {  }  ∈ℕ grow exponentially and  +1 ≤  −1   . Thus (3.19) verifies (3.9) with  =  −1 so that Lemma 3.7 verifies (3.5). The rest follows as in the proof of Theorem 3.10. □ Theorems 3.10 and 3.11 show global convergence, but may require a very constricted dom  through the constant   in Assumption 3.1 (iv). In Appendix b we relax the constant by localizing the convergence. Remark 3.12 (Linear and sufficiently linear PDEs). For linear PDEs, i.e., when   does not depend on , we have   = 0 and ( w) = 0, as observed in Remark 3.2. Moreover, for typical solvers for the adjoint PDE, we would have   = 0, as   does not then depend on . In that case, by taking  → 0, (3.16b) (and likewise (3.19b)) reduces to   >  −1 0 . Practically this means that the convergence rate factor  −1 0 has to be bounded by the inverse contractivity factor   of the linear system solver. If   > 1, as we should have, this condition can be satisfied by suitable choices of γ ∈ (0,   ] and γ * . By extension, the conditions (3.16b) and (3.19b) are then satisfiable for small  when the PDE is "sufficiently linear". Remark 3.13 (Weak convergence). It is possible to prove weak convergence when  ≡ 1 and  ≡  0 ,  ≡  0 satisfy (3.16). The proof is based on an extension of Opial's lemma to the quantitative Fejér monotonicity (3.14). We have not included the proof since it is technical, and does not permit reducing the assumptions from those of Theorems 3.10 and 3.11. We refer to [6] for the corresponding proof for the NL-PDPS.

splittings and partial differential equations
We now prove Assumption 3.1 and derive explicit expressions for the operator ∇  from (2.6). We do this in Section 4.1 for some sample PDEs. Then in Section 4.2 we study the satisfaction of Assumption 3.3 for Gauss-Seidel and Jacobi splitting, as well as a simple infinite-dimensional example without splitting.
We briefly discuss a quasi-conjugate gradient splitting to illustrate the generality of our approach.We conclude with a discussion of the convergence theory and discretisation in Section 4.3.
< ∞, and ȳ ∈  for a Hilbert space  , then also: We include both to emphasise that the latter defines the Hilbert space structure and topology that we generally work with, while the former is a technical restriction that arises from our proofs. Under appropriate smoothness conditions on x, the boundary of Ω, as well as the boundary data, standard elliptic theory proves that ū ∈  1 (Ω) is a classical solution, hence Lipschitz and  1,∞ (Ω) on the whole domain; see, e.g., [11].
Proof. Assumption 3.3 (i) holds by construction, and (ii) by the assumed invertibility of   for  ∈ dom  . We only consider the second inequality of (iii) for Υ, the proof of the first inequality for Γ being analogous with − ′ () replaced by  −   . We thus need to prove (4.5). Using (2.15) with  * x w = − ′ ( ū) and  *   w =  *   w +  *   w, we expand. Expanding ∥ +1 − w ∥ 2  and applying the triangle inequality, and Young's inequality thrice, yields. Note that the first part of (4.3) and the second part of (4.4) hold also for the adjoints  *  and  *  in the corresponding spaces. Therefore, we establish. Taking   and   as stated, we therefore obtain (4.5). □ For our first, infinite-dimensional example of the satisfaction of the conditions of Theorem 4.6, and hence of Assumption 3.3, note that we have in general. Example 4.7 (No splitting of a weighted Laplacian in  1 ). Let  =  =  1 0 (Ω),  = ℝ, and   =   = ∇ * ∇ ∈ ( 1 0 (Ω);  −1 (Ω)) be the Laplacian weighted by  ∈ (0, ∞). Then.
Therefore, assuming inf dom  > 0, we can in (4.4) take   = inf  ∈dom   for  the infimum of the spectrum of the Laplacian as a bounded self-adjoint operator in  1 0 (Ω); see, e.g., [25]. Clearly also  = 0 due to   = 0. For (4.3), we get. Thus we can take   as the supremum of the spectrum of the Laplacian as a bounded self-adjoint operator in  1 0 (Ω). In the following examples, we take  =  = ℝ  with the standard Euclidean norm. Then (4.4) can be rewritten as the spectral radius bound and positivity condition     ) < 1. Now, for every large enough  > 0, for .
Since 0 ≤  < 1, the right hand side tends to zero as  → ∞. Since also 1/. For standard conjugate gradients,    ≡  permits a recursive residual update optimization that we are unable to perform. We have ⟨    +1 ,   ⟩ = 0 for all , although no "-conjugacy" relationship necessarily exists between  +1 and   for  < .
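For reference, standard conjugate gradients with a fixed matrix maintains the recursive residual update r ← r − αAp and the orthogonality of consecutive residuals mentioned above; it is this recursion that breaks when the matrix changes between outer iterations. A minimal sketch with our own naming:

```python
import numpy as np

def cg(A, b, iters=50):
    """Plain conjugate gradients for a symmetric positive definite A.
    Returns the iterate and the residual history."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    residuals = [r.copy()]
    for _ in range(iters):
        if np.linalg.norm(r) < 1e-12:
            break
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)
        x = x + alpha * p
        r_new = r - alpha * Ap          # recursive residual update
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p            # next A-conjugate search direction
        r = r_new
        residuals.append(r.copy())
    return x, residuals

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x, res = cg(A, b)
# consecutive residuals satisfy <r_{k+1}, r_k> = 0
```

On this 2×2 system the method terminates in two steps, as conjugate gradients does in exact arithmetic for an n×n system after at most n steps.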
The next lemma molds the updates (4.6) into our overall framework. Lemma 4.12. The update (4.6) corresponds to Line 3 of Algorithm 2.1 with, for. Proof. Indeed, expanding  +1 , the -update of (4.6) may be rewritten as. Applying the invertible matrix    and expanding   , this is, and, adding      −  on both sides, further. Since ( +1 , • ;   ) = ⟨    +1 , • ⟩, and ( • ) = ⟨, • ⟩, the claim follows. □ Unless   is independent of , a simple approach as in Theorem 4.6 can only verify Assumption 3.3 with   < 1. We hence leave the verification of convergence of Algorithm 2.1 with quasi-conjugate gradient updates to future research.

discussion
Before we embark on numerical experiments, it is time to make a few unifying observations about the disparate results above, with regard to the main conditions (3.16b) and (3.19b) of the convergence Theorems 3.10 and 3.11, and their connection to the fundamentally discrete viewpoint of Examples 4.9 and 4.10. As we have already noted in Remark 3.12, (i) the main conditions (3.16b) and (3.19b) are easily satisfied for linear PDEs, i.e., when   does not depend on . In Section 4.2, this corresponds to   =  (while   may still depend on ). The only condition given in Remark 3.12 was that   = 0, which is satisfied in Examples 4.8 to 4.10 due to   = 0.
For linear PDEs, ( w) = 0. Together with   = 0, this causes also ( ū) and   to disappear from the convergence conditions. All of these quantities might depend on the discretisation. As we have seen in Section 4.1, ( ū) and ( w) require the use of ∞-norm bounds on the solutions, even when the underlying space is   . Such bounds may not always hold in infinite dimensions (however, see Remark 4.2), although they do always hold in finite-dimensional subspaces. In our numerical experiments, we have, however, not observed any grid dependency of ( ū) and ( w) (calculated a posteriori, after a very large number of iterations).
On a more negative note, with  =  = ℝ  (ℎ, ) equipped with the standard Euclidean norm, consider   = −Δ ℎ for a scalar  with Δ ℎ a finite differences discretisation of the Laplacian on a -dimensional square grid of cell width ℎ and (ℎ, ) nodes. Then, for both Jacobi and Gauss-Seidel splitting, as well as the trivial splitting (gradient descent)   ∝ Id, the spectral radius  ( −1    ) → 1 as ℎ → 0; see, e.g., [26, Chapter 4.2.1]. By simple numerical experiments,  2  / 2  nevertheless stays roughly constant, so the result is that   ,   → ∞ as ℎ → 0. For "no splitting", i.e.,   =   , instead  2  / 2  → ∞ due to the worsening condition number of Δ ℎ . This latter negative result is, however, dependent on taking  =  = ℝ  (ℎ, ) with the standard Euclidean norm: in Example 4.7 we showed that "no splitting" is applicable to the same problem in  1 . It is, therefore, an interesting question for future research whether a change of norms would remove the grid dependency of Jacobi and Gauss-Seidel. Our guess is that it would not.
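The grid dependency of the spectral radius can be checked directly in the simplest setting: for the 1D finite-difference Dirichlet Laplacian, the Jacobi iteration matrix I − D⁻¹A_h has the known spectral radius cos(πh) with h = 1/(n+1), which approaches 1 under grid refinement. A small numerical check (1D rather than the paper's 2D setting, purely for illustration):

```python
import numpy as np

def jacobi_radius(n):
    """Spectral radius of the Jacobi iteration matrix I - D^{-1} A_h for the
    1D Dirichlet Laplacian A_h = tridiag(-1, 2, -1) on n interior nodes."""
    A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    G = np.eye(n) - np.diag(1.0 / np.diag(A)) @ A
    return max(abs(np.linalg.eigvals(G)))

coarse = jacobi_radius(10)    # h = 1/11
fine = jacobi_radius(100)     # h = 1/101
# closed form: rho = cos(pi * h), tending to 1 as h -> 0
```

Both values stay below 1 but the fine-grid radius is markedly closer to it, which is exactly the deterioration discussed above.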
The above indicates that, for nonlinear PDEs, whether our method even converges can depend on the level of discretisation. Nevertheless, help comes from the successive over-relaxation of Example 4.11, which shows that (ii) by letting the over-relaxation parameter  → ∞, we get   ,   ,   → 0, and therefore may be able to obtain convergence (with a comparable iteration count) for any magnitude of ( ū), ( w).
With over-relaxation   → 1 as  → ∞, so even then, to satisfy (3.16b) and (3.19b), it is necessary to have very small   . However, (iii) in Sections 3 and 4.1, we have bounded   through dom  , obtaining global convergence when (3.16b) and (3.19b) hold. With a more refined analysis, it is possible to make   arbitrarily small by sufficiently good initialisation, i.e., by being content with mere local convergence.
We include a sketch of this analysis in Appendix b. Finally, although convergence rates ( (1/ 2 ) or linear) are unaffected by the discretisation level, constant factors of convergence depend on   M through the bound (3.17). This operator, written out in (3.13), depends on the constants   and   . They inversely scale the magnitude of the testing parameters   and   as chosen in (3.11). By (3.10), the term   +     +     in (3.13) is, however, independent of   and   . Smaller   and   are, hence, better for the convergence of  and  (by weighing down the  and  initialisation errors on the right hand side of (3.17)), and higher   and   are better for the convergence of  and  (by weighing down  and  initialisation errors). Even for linear PDEs, therefore, (iv) convergence speed may depend on the level of discretisation through the -sensitivity factors   and   of the splitting method for the PDE. This is to be expected: the linear system solvers that Section 4.2 is based on are fundamentally discrete, and their convergence depends on the eigenvalues of  −1    and   . In "standard" optimisation methods, the dimension-dependent linear system solver is taken as a black box, and its computational cost is hidden from the estimates for the optimisation method. The estimates for our method, by contrast, include the solver.

numerical results
We now illustrate the numerical performance of Algorithm 2.1.We first describe our experimental setup, and then discuss the results.

experimental setup
The PDEs in our numerical experiments take one of the forms of Section 4.1 on the domain Ω = [0, 1] × [0, 1] with nonhomogeneous Dirichlet boundary conditions.We discretize the domain as a regular grid and the PDEs by backward differences.We use both a coarse and a fine grid.
The function  and the PDE vary by experiment, but in each one we take the regularization term for the control parameter  and the data fitting term as, for some , ,  > 0, where  is the average of the measurement data   . The norms here are in function spaces, but in the numerical experiments the variables are, of course, taken to be in a finite-dimensional (finite element) subspace.
The variables   correspond to multiple copies of the same PDE with different boundary conditions   =   on 𝜕Ω ( = 1, . . . , ), for the same control . Parametrizing 𝜕Ω by  : (0, 1) → 𝜕Ω, we take as boundary data. To produce the synthetic measurement   , we solve for û the PDE corresponding to the experiment with the ground truth control parameter x = ( Â, ĉ) and boundary data   . To this we add Gaussian noise of standard deviation 0.01∥ û ∥  2 (Ω) to get   .
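The data generation step can be sketched as follows. This is a hedged Python illustration with hypothetical names; in the actual experiments û comes from solving the discretized PDE with the ground truth control parameter, whereas here a stand-in vector is used.

```python
import numpy as np

def make_measurement(u_hat, noise_level=0.01, seed=0):
    """Synthetic measurement: add Gaussian noise with standard deviation
    noise_level * ||u_hat|| to a (discretized) PDE solution, mirroring the
    0.01 * ||u_hat||_{L^2} noise of the experimental setup."""
    rng = np.random.default_rng(seed)
    sigma = noise_level * np.linalg.norm(u_hat)
    return u_hat + rng.normal(0.0, sigma, size=u_hat.shape)

u_hat = np.ones(64)           # stand-in for a PDE solution on a grid
z = make_measurement(u_hat)
```

The norm here plays the role of the L² norm on the discretized grid; with proper quadrature weights the scaling would include the cell area.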
We next describe the PDEs for each of our experiments. Experiment 1 (Scalar coefficient). In our first numerical experiment, we aim to determine the scalar coefficient  ∈ ℝ for the PDEs, where  = 1, . . . , . For this problem we choose  () = 0. Thus the objective is (5.4) min subject to (5.3).
Our parameter choices can be found in Table 1. Then ∇ (, ) =  =1 ⟨  ,  ,Ω ⟩  2 (Ω) following Example 4.5. For data generation we take ĉ = 1.0. Since we are dealing with an ill-posed inverse problem, an optimal control parameter c for (5.4) does not in general equal ĉ. Therefore, to compare algorithm progress, we take as surrogate for the unknown c the iterate c :=  50,000 on the coarse grid and c :=  500,000 on the fine grid, each computed using Algorithm 2.1 without splitting.
Proof. The chosen  ,  and either  satisfy Assumption 3.1 (i). The boundary conditions   ∈  1/2 (Ω) along with the constraint  ∈ [,  −1 ] ensure the condition of Lemma 4.1 (ii ′ ). In the discretized setting, (iii ′ ) and (iv ′ ) also hold. In conclusion, Lemma   (5.7). Note that, although we take the total variation of , which is natural in the space of functions of bounded variation, we consider  to lie in (as per Example 2.2 a finite-dimensional subspace of)  2 (Ω).
Thus the total variation term has value +∞ in  2 (Ω) \ BV(Ω). Nevertheless, the term is weakly lower semicontinuous even in  2 due to Poincaré's inequalities (for example, [1, Theorem 3.44]), so the problem is well-defined. Subdifferentiation in  2 (Ω) is a slightly more delicate issue, but not a problem for optimality conditions of problems of the type (5.7), as discussed in [38, Remark 4.7]. Moreover, as said, in practice we work in a finite-dimensional subspace that corresponds to the backward differences discretisation of the gradient in the total variation term. The convergence of discretisations is discussed in [4].
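The backward-differences discretisation of the total variation term can be sketched as follows (isotropic variant, our own naming, grid spacing absorbed into the weights; the experiments' exact boundary handling may differ):

```python
import numpy as np

def tv_backward(u):
    """Isotropic total variation of a 2D array via backward differences,
    with zero differences at the first row and column."""
    dx = np.zeros_like(u)
    dy = np.zeros_like(u)
    dx[1:, :] = u[1:, :] - u[:-1, :]
    dy[:, 1:] = u[:, 1:] - u[:, :-1]
    return np.sum(np.sqrt(dx ** 2 + dy ** 2))

flat = np.ones((8, 8))                       # constant image: TV = 0
step = np.zeros((8, 8)); step[:, 4:] = 1.0   # unit jump along one column
```

A constant image has zero total variation, while a unit jump across a column contributes the jump height times the edge length, matching the perimeter-measuring behaviour of TV.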
As above for Experiment 1, the next theorem verifies the basic structural conditions of the convergence Theorems 3.10 and 3.11. The proof is analogous to that of Theorem 5.1. Likewise, the splitting Assumption 3.3 is verified as before through Example 4.9 (Jacobi), 4.10 (Gauss-Seidel), or 4.8 (no splitting), while Remark 5.2 applies for the remaining step length and growth conditions. Theorem 5.3. Let  be a finite-dimensional subspace of  2 (Ω) × ℝ,  a finite-dimensional subspace of  1 (Ω) and  a finite-dimensional subspace of  1 0 (Ω) ×  1/2 (Ω). Let  and  be given by (5.1) along with the PDE (5.6) with the boundary conditions   defined as in (5.2), and let  be ∥ • ∥ 1 . Then Assumption 3.1 holds.

algorithm parametrisation
We apply Algorithm 2.1 with no splitting (full inversion), with Jacobi and Gauss-Seidel splitting, and with quasi-conjugate gradients, as discussed in Section 4.2. We fix  = 1.0,  = 1.0,  = 0. For the initial iterate ( 0 ,  0 ,  0 ,  0 ) we make an experiment-specific choice of the control parameter  0 . Then we determine  0 by solving the PDE, and  0 by solving the adjoint PDE. We set  0 =  0 . For Experiment 1 we take the initial  0 = 4.0 and run the algorithm for 20,000 iterations on the coarse grid and 125,000 on the fine. For Experiment 2 we take the initial  0 ≡ 1.0, a constant function, and  0 = 2.0. The algorithm is run for 200,000 iterations on the coarse grid, and 500,000 on the fine.
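In matrix terms, this initialization amounts to one full PDE solve and one adjoint solve before the interwoven iterations begin. A hedged sketch with a hypothetical discretized PDE operator A and hypothetical right-hand sides, using our own names:

```python
import numpy as np

def initialize(A, pde_rhs, adjoint_rhs):
    """Initial iterate for the interwoven scheme: u0 solves the discretized
    PDE A u = pde_rhs, w0 solves the adjoint system A^T w = adjoint_rhs,
    and the remaining dual variable is initialized from w0."""
    u0 = np.linalg.solve(A, pde_rhs)
    w0 = np.linalg.solve(A.T, adjoint_rhs)
    return u0, w0, w0.copy()

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # stand-in for the PDE operator
u0, w0, y0 = initialize(A, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

After this single exact solve pair, the algorithm never again solves either system fully; it only takes single splitting steps.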
We implemented the algorithm in Julia.The implementation is available on Zenodo [23].The experiments were run on a ThinkPad laptop with Intel Core i5-8265U CPU at 1.60GHz ×4 and 15.3 GiB memory.

results
The results for Experiment 1 with the above algorithm parametrisations are in Figure 1 for the coarse grid and Figure 2 for the fine grid. In the figures we illustrate the evolution of the coefficient   as the algorithm iterates. We also show the evolution of the relative error of the coefficient and the functional value.
The results for Experiment 2 are available in Figures 3 and 5 for the coarse grid and Figures 4 and 6 for the fine grid. Figures 3 and 4 show the evolution of the relative error of the coefficient and the functional value. Figures 5 and 6 show the reconstructed coefficients   at the final iterates and, for comparison, the phantom used for the data generation.
The performance plots have time on the -axis rather than the number of iterations, as the main difference between the splittings is expected to be in the computational effort for linear system solution, i.e., Lines 3 and 4 of Algorithm 2.1.For fairness, we limited the number of threads used by Julia/OpenBLAS to one.
In all experiments the splittings outperform full matrix inversion: the best splittings require roughly half of the computational effort for an iterate of the same quality. No particular splitting completely dominates another; however, Jacobi appears to be more prone to overstepping and oscillatory patterns. On the other hand, quasi-CG currently has no convergence theory, and we have observed situations where it does not exhibit convergence while the Jacobi and Gauss-Seidel splittings do. Therefore, Gauss-Seidel is our recommended option.

appendix a optimality conditions
We prove here the necessity of (2.9) for solutions to (2.8), with  ( x) = ⟨ x, ȳ⟩  −  * ( ȳ). By the Fenchel–Young theorem, the latter is equivalent to the last line of (2.9). Clearly ( x, ū) ∈ , or else there is no solution. Therefore also the first line of (2.9) holds.
It follows from the linearity/affinity and continuity, hence continuous differentiability, of  that  is strictly differentiable. Since. Moreover, as a bounded linear operator,  ′ ( x, ū) is closed, i.e., has closed graph. Therefore, by [3, Theorem 2.20],  ′ ( x, ū) is surjective. With this, [29, Theorem 1.17] gives. Here we denote by   () =     () the limiting normal cone to a set  at . Since limiting subdifferentials agree with convex subdifferentials on convex functions, and we have assumed that int dom  0 ≠ ∅, we can easily calculate    0 . We will then use the sum rule [29, Theorem 3.36] to estimate   , which requires verifying that  0 is "sequentially normally epicompact" (SNEC), and that the "horizon subdifferentials", defined for  : . Indeed, convex functions whose domains have a non-empty interior, such as  0 , are SNEC by [. After appropriate Riesz representations, this inclusion expands as the middle two lines of (2.9). □

appendix b localization
Theorems 3.10 and 3.11 are global convergence results, but also depend on the global constant   in Assumption 3.1 (iv). To satisfy the conditions of the theorems, dom  may need to be small for   to be small. We now develop local convergence results that allow replacing   by a small initialization-dependent value without restricting dom  . We replace Assumption 3.1 with the following: Assumption b.1. We assume Assumption 3.1 to hold with (iv) replaced by (iv ′ ): for some C ≥ 0, for all (, ) ∈  ×  and  ∈ dom  we have the bound. This estimate uses the standard norm in  , which is a 2-norm in the examples of Sections 4.1 and 5. However, Section 4.1 gives estimates involving an ∞-norm for   . Therefore some finite-dimensionality of the parameters is required to satisfy Assumption b.1 (iv ′ ). This can take the form of a finite element discretisation of a function parameter , or the parameter being a scalar constant. In the latter case, the examples of Section 4.1 readily verify Assumption b.1.
Proof. We need to prove (b.1) for all  = 0, . . . ,  − 1. The rest follows as in the proof of Lemma 3.9. Assumption 3.3 (iii) with (3.13) and Lemma 3.5 establish for all  = 0, . . . ,  − 1 the a priori bounds, where we recall that  > 0 is a free balancing parameter. We now immediately obtain local versions of the main results. By initializing close enough to a solution, i.e., with small , we can possibly obtain convergence more often than from the global versions.

Figure 2: Performance of various splittings in the fine grid Experiment 1.

Figure 4: Performance of various splittings in the fine grid Experiment 2.
1, *   *  ) ≤  and  *    ≥  2  . The first example also works in general spaces, as seen in a special case in Example 4.7, but   and   depend on the norms chosen. Theorem 4.6 now shows that Assumption 3.3 holds. Example 4.8 (No splitting). If   =   ∈ ℝ × , (4.4) holds with  = 0 and   the minimal eigenvalue of   , assumed symmetric positive definite. Theorem 4.6 now shows that Assumption 3.3 holds, where for any   > 1 and  > 0, we can take
Quasi-conjugate gradients. With   = 0 for simplicity, motivated by the conjugate gradient method for solving    = , see, e.g., [12], we propose to perform on Line 3 of Algorithm 2.1, and analogously, ... 4.1 verifies Assumption 3.1. □ Remark 5.2. It remains to verify (3.16) or (3.19), depending on the convergence theorem used. The condition (3.16a) is readily verified by appropriate choice of the primal and dual step length parameters  0 ,  0 > 0. We also take γ = 0 (slightly violating the assumptions), so that   ≡ 1, and   ≡  0 and   ≡  0 . The condition (3.16b) (and likewise (3.19b) for linear convergence) is very difficult to verify a priori for nonlinear PDEs, as it depends on the knowledge of a solution to the optimisation problem through ( ū) and ( w). This is akin to the difficulty of verifying (a priori) a positive Hessian at a solution for standard nonconvex optimisation methods. Hence we do not attempt to verify (3.16b). Experiment 2 (Diffusion + scalar coefficient).

Table 1: Parameter choices for all examples.