Convergence and quasi-optimal cost of adaptive algorithms for nonlinear operators including iterative linearization and algebraic solver

We consider a second-order elliptic boundary value problem with strongly monotone and Lipschitz-continuous nonlinearity. We design and study its adaptive numerical approximation interconnecting a finite element discretization, the Banach–Picard linearization, and a contractive linear algebraic solver. In particular, we identify stopping criteria for the algebraic solver that on the one hand do not request an overly tight tolerance but on the other hand are sufficient for the inexact (perturbed) Banach–Picard linearization to remain contractive. Similarly, we identify suitable stopping criteria for the Banach–Picard iteration that leave an amount of linearization error that is not harmful for the residual a posteriori error estimate to reliably steer the adaptive mesh-refinement. For the resulting algorithm, we prove a contraction of the (doubly) inexact iterates after a certain number of steps of mesh-refinement/linearization/algebraic solver, leading to its linear convergence. Moreover, for usual mesh-refinement rules, we also prove that the overall error decays at the optimal rate with respect to the number of elements (degrees of freedom) added in comparison with the initial mesh. Finally, we prove that our fully adaptive algorithm drives the overall error down with the same optimal rate also with respect to the overall algorithmic cost expressed as the cumulated sum of the number of mesh elements over all mesh-refinement, linearization, and algebraic solver steps. Numerical experiments support these theoretical findings and illustrate the optimal overall algorithmic cost of the fully adaptive algorithm on several test cases.


Introduction
Let Ω ⊂ R^d with d ≥ 1 be a bounded Lipschitz domain with polytopal boundary. Given f ∈ L^2(Ω), we aim to numerically approximate the weak solution u ∈ H^1_0(Ω) of the nonlinear boundary value problem

−div A(∇u) = f in Ω,  u = 0 on ∂Ω. (1)

To this end, we propose an adaptive algorithm of the type

estimate total error and its components
↓
advance algebra / advance linearization / mark and refine mesh elements (2)

which monitors and adequately stops the iterative linearization and the linear algebraic solver, as well as steers the local mesh-refinement. The goal of this contribution is to perform a first rigorous mathematical analysis of this algorithm in terms of convergence and quasi-optimal computational cost.
1.1. Finite element approximation and Banach–Picard iteration. Suppose that the nonlinearity A in (1) is Lipschitz-continuous (with constant L > 0) and strongly monotone (with constant α > 0); see Section 2 for details. Then, the main theorem on monotone operators yields the existence and uniqueness of the weak solution u ∈ H^1_0(Ω); see, e.g., [Zei90, Theorem 25.B]. Given a triangulation T_H of Ω, the lowest-order finite element approximation to problem (1) reads as follows: Find u_H ∈ X_H := {v_H ∈ C(Ω̄) : v_H|_T is affine for all T ∈ T_H and v_H|_∂Ω = 0} such that

(A(∇u_H), ∇v_H)_{L^2(Ω)} = (f, v_H)_{L^2(Ω)} for all v_H ∈ X_H. (3)

The discrete solution u_H ∈ X_H again exists and is unique, but (3) corresponds to a nonlinear discrete system which can typically only be solved inexactly.
The most straightforward algorithm for the iterative linearization of (3) stems from the proof of the main theorem on monotone operators, which is constructive and relies on the Banach fixed point theorem: Define the (nonlinear) operator Φ_H : X_H → X_H by

(∇Φ_H(w_H), ∇v_H)_{L^2(Ω)} = (∇w_H, ∇v_H)_{L^2(Ω)} − (α/L^2) [(A(∇w_H), ∇v_H)_{L^2(Ω)} − (f, v_H)_{L^2(Ω)}] (4)

for all w_H, v_H ∈ X_H. Note that (4) corresponds to a discrete Poisson problem and hence Φ_H(w_H) ∈ X_H is well-defined. Then, it holds that

‖∇(u_H − Φ_H(w_H))‖_{L^2(Ω)} ≤ q_Pic ‖∇(u_H − w_H)‖_{L^2(Ω)} with q_Pic := (1 − α^2/L^2)^{1/2} < 1; (5)

see, e.g., [Zei90, Section 25.4]. Based on the contraction Φ_H, the Banach–Picard iteration starts from an arbitrary discrete initial guess and applies Φ_H inductively to generate a sequence of discrete functions which hence converges towards u_H. Note that the computation of Φ_H(w_H) by means of the discrete variational formulation (4) corresponds to the solution of a (generically large) linear discrete system with a symmetric and positive definite matrix that does not change during the iterations. In this work, we suppose that also (4) is solved inexactly by means of a contractive iterative algebraic solver (with contraction factor q_alg < 1), e.g., PCG with an optimal preconditioner; see, e.g., [OT14].
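In finite dimensions, the damped fixed-point map w ↦ w − (α/L^2)(A(w) − f) behind (4) can be illustrated directly. The following is a minimal sketch, not the paper's method: the nonlinearity A(x) = x + 0.3 sin(x) is hypothetical, strongly monotone with α = 0.7 and Lipschitz-continuous with L = 1.3 in the Euclidean norm.

```python
import numpy as np

# Hypothetical nonlinearity on R^n: A(x) = x + 0.3*sin(x) is strongly
# monotone with alpha = 0.7 and Lipschitz continuous with L = 1.3.
alpha, L = 0.7, 1.3

def A(x):
    return x + 0.3 * np.sin(x)

def banach_picard(f, x0, n_iter=50):
    """Damped fixed-point iteration x <- x - (alpha/L^2) * (A(x) - f),
    contractive with factor q_Pic = (1 - alpha^2/L^2)^(1/2) < 1."""
    delta = alpha / L**2
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_iter):
        x = x - delta * (A(x) - f)
    return x

rng = np.random.default_rng(0)
f = rng.standard_normal(5)
x = banach_picard(f, np.zeros(5))
q_pic = np.sqrt(1.0 - alpha**2 / L**2)   # ~0.84: guaranteed, but not fast
print(np.linalg.norm(A(x) - f))          # small residual after 50 iterations
```

The contraction factor q_Pic only depends on α and L, which is why the iteration is robust but, for L ≫ α, slow; this motivates stopping it adaptively rather than iterating to tight tolerance.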
1.2. Fully adaptive algorithm. In our approach, we compute a sequence of discrete approximations u_ℓ^{k,j} of u that carry an index ℓ for the mesh-refinement, an index k for the Banach–Picard linearization iteration, and an index j for the algebraic solver iteration.
First, we design a stopping criterion for the algebraic solver such that, at linearization step k − 1 ∈ N_0 on the mesh T_ℓ, we stop at some index j ∈ N. At the next linearization step k ∈ N, the arising linear system reads as follows: Find u_ℓ^{k,⋆} ∈ X_ℓ such that, for all v_ℓ ∈ X_ℓ,

(∇u_ℓ^{k,⋆}, ∇v_ℓ)_{L^2(Ω)} = (∇u_ℓ^{k−1,j}, ∇v_ℓ)_{L^2(Ω)} − (α/L^2) [(A(∇u_ℓ^{k−1,j}), ∇v_ℓ)_{L^2(Ω)} − (f, v_ℓ)_{L^2(Ω)}], (6)

with uniquely defined but not computed exact solution u_ℓ^{k,⋆} = Φ_ℓ(u_ℓ^{k−1,j}) and computed iterates u_ℓ^{k,j} that approximate u_ℓ^{k,⋆}. Note that (6) is a perturbed Banach–Picard iteration since it starts from the available u_ℓ^{k−1,j}, typically not equal to the unavailable u_ℓ^{k−1,⋆}.
Second, we design a stopping criterion for the perturbed Banach–Picard iteration at some index k, producing a discrete approximation u_ℓ^{k,j}.
Finally, we locally refine the triangulation T_ℓ on the basis of the Dörfler marking criterion for the local contributions of the residual error estimator η_ℓ(u_ℓ^{k,j}) and, to lower the computational effort, employ nested iteration in that the continuation on the new triangulation T_{ℓ+1} is started with the initial guess u_{ℓ+1}^{0,0} := u_ℓ^{k,j}.

1.3. Previous contributions.
Solving the linear and nonlinear discrete systems "exactly" is often not possible in practical situations due to the size of the considered systems; actually, performing inexact solves on purpose is a traditional and popular approach to speed up the simulations. Focusing on the inexact solve of the linear systems in particular gives rise to the "inexact Newton method"; see, e.g., [Deu91, EW94] and the references therein. Under appropriate conditions, such methods asymptotically preserve the convergence speed of the "exact" method. Note that these approaches only focus on the finite-dimensional system of nonlinear algebraic equations of the form (3) but do not take into account the continuous problem (1).
Taking into account the error from numerical discretization and distinguishing the linearization and discretization errors sets a new level of difficulty as, at this moment, one leaves the finite-dimensional world of (3) and the overall error is evaluated with respect to (1). For strongly monotone nonlinear model problems, this has been done in, e.g., [CS06,CS07]; see also the references therein. Later, reliable (actually guaranteed) and efficient (actually robust with respect to the size of the nonlinearity) a posteriori error estimates in such a framework were obtained in [EAEV11]. Therein, adaptive algorithms balancing the estimates of the linearization and discretization error components are proposed and their optimal performance is observed numerically, but no theoretical proofs of convergence and optimality of the arising approximate solutions are given. Similar ideas and achievements are presented in [BDMS15,BCL15,CW17], and in [HW18b], where an adaptive choice of the damping parameter in the Newton method is studied in the context of semilinear singularly-perturbed reaction-diffusion problems.
Recently, theoretical analyses of algorithms balancing linearization and discretization components have been undertaken. The works [GMZ11,HW19] prove convergence of the combined iterative linearization and finite element (Galerkin) discretization, where [HW19] builds on the unified framework of [HW18a] encompassing also Kačanov and (damped) Newton linearizations. Moreover, [GHPS18,GHPS19] prove linear convergence, optimal decay rate in terms of the number of degrees of freedom, and (almost) optimal decay rate in terms of the overall computational cost for a fixed-point (Banach-Picard) iterative scheme. These last references extend concepts from [Vee02,GMZ12,BDK12] in order to take into account inexact linearization solvers, whereas the linear algebraic solver is supposed exact.
Taking into account all algebraic, linearization, and discretization error components is at the heart of the "adaptive inexact Newton method"; see [EV13] and the references therein. Here, dedicated stopping criteria are used both for the outer linearization loop and the inner algebraic solver loop, in conjunction with adaptive mesh-refinement. Extensions to more complicated problems are presented in [CPV14,DPVY15,DPFVY14]; see also [Pol16] for regularizations on coarse meshes ensuring well-posedness of the discrete systems in Newton-like linearizations. Again, reliability (and efficiency) of the estimates is theoretically established and optimal performance of the fully adaptive algorithms is numerically observed, but no theoretical proofs of the latter are presented. Precisely this is our goal in the present work. We stress that such results have already been derived for adaptive wavelet discretizations [CDD03,Ste14], which provide inherent control of the residual error in terms of the wavelet coefficients, while the present analysis for standard finite element discretizations has to rely on the local information of appropriate a posteriori error estimators.
1.4. Main results: linear convergence, optimal decay rate, and optimal cost. The present contribution appears to be the first work that provides a thorough convergence analysis of fully adaptive strategies for nonlinear equations. To describe our results more precisely, note that the sequential nature of the fully adaptive algorithm of Section 1.2 gives rise to an index set

Q := {(ℓ, k, j) ∈ N_0^3 : the discrete approximation u_ℓ^{k,j} is computed by the algorithm},

together with an ordering |(ℓ, k, j)| < |(ℓ', k', j')| :⇔ u_ℓ^{k,j} is computed earlier than u_{ℓ'}^{k',j'}.
Then, our first main result, formulated in Theorem 4 below, proves that the proposed adaptive strategy is contractive after a certain number of steps and linearly convergent in the sense of (7), where C_lin ≥ 1 and 0 < q_lin < 1 are generic constants and ∆_ℓ^{k,j} is an appropriate quasi-error quantity involving the error ‖∇(u − u_ℓ^{k,j})‖_{L^2(Ω)} as well as the error estimator η_ℓ(u_ℓ^{k,j}). The estimate (7) appears to be the key argument to prove the optimal error decay rate with respect to the number of degrees of freedom added in comparison with the initial mesh, valid in particular whenever u is approximable at rate s; see Theorem 5 below for the details. Finally, our most important result is the optimal error decay rate with respect to the overall cost of the fully adaptive algorithm, which steers the mesh-refinement, the perturbed Banach–Picard linearization, and the algebraic solver; again, this holds whenever u is approximable at rate s; see Theorem 6 below for the details.
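In the notation just introduced, the three results can be summarized as follows (a sketch only; the precise statements, with all constants and index ranges, are Theorems 4-6 below):

```latex
% Linear convergence (Theorem 4):
\Delta_{\ell'}^{k',j'} \le C_{\mathrm{lin}}\, q_{\mathrm{lin}}^{\,|(\ell',k',j')|-|(\ell,k,j)|}\, \Delta_{\ell}^{k,j}
\qquad \text{for all } (\ell,k,j),\,(\ell',k',j') \in \mathcal{Q}
\text{ with } (\ell,k,j) \le (\ell',k',j').

% Optimal decay rate with respect to added degrees of freedom (Theorem 5):
\Delta_{\ell}^{k,j} \lesssim (\#\mathcal{T}_{\ell} - \#\mathcal{T}_0 + 1)^{-s}.

% Optimal decay rate with respect to the overall cost (Theorem 6):
\Delta_{\ell}^{k,j} \lesssim \Bigl(\sum_{(\ell',k',j') \le (\ell,k,j)} \#\mathcal{T}_{\ell'}\Bigr)^{-s}.
```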
1.5. Outline. The remainder of the paper is organised as follows. In Section 2, we introduce an abstract setting in which all our results will be formulated, define the exact weak and finite element solutions (neither of which is available in our setting), and introduce our requirements on mesh-refinement and error estimator. We also give here precise requirements on the algebraic solver, state our adaptive algorithm and stopping criteria in all details, and present our main results, including some discussions. The proofs of some auxiliary results and of Proposition 3 (reliability in Algorithm 1), Theorem 4 (linear convergence), Theorem 5 (decay rate wrt. degrees of freedom), and Theorem 6 (decay rate wrt. computational cost) are respectively given in Sections 3, 4, 5, and 6. Finally, numerical experiments in Section 7 underline the theoretical findings.
Throughout our work, we apply the following convention: In statements of theorems, lemmas, etc., we explicitly state all constants together with their dependencies. In proofs, however, we abbreviate A ≤ cB with a generic constant c > 0 by writing A ≲ B. Moreover, A ≃ B abbreviates A ≲ B ≲ A.

Adaptive algorithm and main results
In this section, we introduce an abstract setting in which all our results will be formulated, define the exact weak and finite element solutions, introduce our requirements on mesh-refinement, error estimator, and algebraic solver, state our adaptive algorithm, and present our main results, including some discussions.
2.1. Abstract setting. Let X be a Hilbert space over K ∈ {R, C} with scalar product (·, ·), corresponding norm |||·|||, and dual space X' (with canonical operator norm |||·|||'). Let P : X → K be Gâteaux-differentiable with derivative A := dP : X → X', i.e.,

⟨Aw, v⟩ = lim_{t→0, t∈R} (P(w + tv) − P(w))/t for all v, w ∈ X. (8)

We suppose that the operator A is strongly monotone and Lipschitz-continuous, i.e.,

α |||v − w|||^2 ≤ Re ⟨Av − Aw, v − w⟩ and |||Av − Aw|||' ≤ L |||v − w||| for all v, w ∈ X,

where 0 < α ≤ L are generic real constants. Given a linear and continuous functional F ∈ X', the main theorem on monotone operators [Zei90, Section 25.4] yields existence and uniqueness of the solution u ∈ X of

⟨Au, v⟩ = F(v) for all v ∈ X. (9)

The result actually holds true for any closed subspace X_H ⊆ X, which also gives rise to a unique u_H ∈ X_H such that

⟨Au_H, v_H⟩ = F(v_H) for all v_H ∈ X_H. (10)

Finally, with the energy functional E := Re (P − F), it holds that

(α/2) |||v − u|||^2 ≤ E(v) − E(u) ≤ (L/2) |||v − u|||^2 for all v ∈ X,

and analogously on X_H; see, e.g., [GHPS18, Lemma 5.1]. In particular, u ∈ X (resp. u_H ∈ X_H) is the unique minimizer of E on X (resp. X_H). As for linear elliptic problems, it follows from (8)-(10) that the present setting guarantees the Céa lemma (see, e.g., [Zei90, Section 25.4])

|||u − u_H||| ≤ (L/α) min_{v_H ∈ X_H} |||u − v_H|||. (13)

2.2. Mesh-refinement. Let T_H be a conforming simplicial mesh of Ω, i.e., a partition of Ω into closed simplices T such that ∪_{T ∈ T_H} T = Ω̄ and such that the intersection of two different simplices is either empty or their common vertex, edge, or face. We assume that refine(·) is a fixed mesh-refinement strategy, e.g., newest vertex bisection [Ste08]. We write T_h = refine(T_H, M_H) for the coarsest one-level refinement of T_H where all marked elements M_H ⊆ T_H have been refined, i.e., M_H ⊆ T_H \ T_h. We write T_h ∈ refine(T_H) if T_h can be obtained by finitely many steps of one-level refinement (with appropriate, yet arbitrary marked elements in each step). We define T := refine(T_0) as the set of all meshes which can be generated from the initial simplicial mesh T_0 of Ω by use of refine(·).
Finally, we associate to each T_H ∈ T a corresponding finite-dimensional subspace X_H ⊂ X, where we suppose that X_H ⊆ X_h whenever T_H, T_h ∈ T with T_h ∈ refine(T_H).
For our analysis, we only employ that the shape-regularity of all meshes T_H ∈ T is uniformly bounded in terms of that of T_0, together with the following structural properties (R1)-(R3), where C_son ≥ 2 and C_mesh > 0 are generic constants:
(R1) splitting property: Each refined element is split into finitely many sons, i.e., for all T_H ∈ T and all M_H ⊆ T_H, the mesh T_h = refine(T_H, M_H) satisfies #(T_H \ T_h) + #T_H ≤ #T_h ≤ C_son #T_H.
(R2) overlay estimate: For all meshes T ∈ T and T_H, T_h ∈ refine(T), there exists a common refinement T_H ⊕ T_h ∈ refine(T_H) ∩ refine(T_h) with #(T_H ⊕ T_h) ≤ #T_H + #T_h − #T.
(R3) mesh-closure estimate: For each sequence (T_ℓ)_{ℓ∈N_0} of successively refined meshes, i.e., T_{ℓ+1} := refine(T_ℓ, M_ℓ) with M_ℓ ⊆ T_ℓ for all ℓ ∈ N_0, it holds that #T_ℓ − #T_0 ≤ C_mesh Σ_{ℓ'=0}^{ℓ−1} #M_{ℓ'}.
For newest vertex bisection, we refer to [BDD04, Ste07, Ste08, CKNS08, KPP13, GSS14] for the validity of (R1)-(R3). For red-refinement with first-order hanging nodes, details are found in [BN10].
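The splitting property (R1) can be illustrated with a toy one-dimensional analogue of bisection refinement. This is a sketch only: the function `refine` below is a hypothetical stand-in for newest vertex bisection, with C_son = 2.

```python
def refine(mesh, marked):
    """One-level refinement of a 1D interval mesh: every marked element
    (a, b) is bisected into two sons, all other elements are kept."""
    fine = []
    for elem in mesh:
        if elem in marked:
            a, b = elem
            mid = 0.5 * (a + b)
            fine.extend([(a, mid), (mid, b)])  # the two sons
        else:
            fine.append(elem)
    return fine

mesh0 = [(0.0, 0.5), (0.5, 1.0)]
mesh1 = refine(mesh0, {(0.0, 0.5)})
# (R1): refined elements are split into at most C_son = 2 sons, hence
# #T_h <= C_son * #T_H, and marked elements do not survive (M_H in T_H \ T_h).
assert len(mesh1) <= 2 * len(mesh0)
assert (0.0, 0.5) not in mesh1 and (0.25, 0.5) in mesh1
```

In 1D no mesh-closure step is needed; for d ≥ 2, newest vertex bisection additionally refines further elements to restore conformity, which is exactly what (R3) quantifies.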

2.3. Error estimator.
For each mesh T_H ∈ T, suppose that we can compute refinement indicators η_H(T, v_H) ≥ 0 for all T ∈ T_H and all v_H ∈ X_H. We denote

η_H(U_H, v_H) := ( Σ_{T ∈ U_H} η_H(T, v_H)^2 )^{1/2} for all U_H ⊆ T_H

and abbreviate η_H(v_H) := η_H(T_H, v_H). As far as the estimator is concerned, we assume the following axioms of adaptivity from [CFPP14] for all T_H ∈ T and all T_h ∈ refine(T_H), where C_stab, C_rel > 0 and 0 < q_red < 1 are generic constants: (A1) stability on non-refined elements, (A2) reduction on refined elements, (A3) reliability, and (A4) discrete reliability. We stress that the exact discrete solutions u_H (resp. u_h) in (A3)-(A4) will never be computed but are only auxiliary quantities for the analysis.
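The four axioms are standard; in the form of [CFPP14] (reproduced here as a sketch, with the constants named as above), they read, for all v_H ∈ X_H and all v_h ∈ X_h:

```latex
\begin{itemize}
\item[(A1)] stability on non-refined elements:
  $\bigl| \eta_h(\mathcal{T}_h \cap \mathcal{T}_H, v_h)
        - \eta_H(\mathcal{T}_h \cap \mathcal{T}_H, v_H) \bigr|
   \le C_{\mathrm{stab}}\, |||v_h - v_H|||$;
\item[(A2)] reduction on refined elements:
  $\eta_h(\mathcal{T}_h \setminus \mathcal{T}_H, v_H)
   \le q_{\mathrm{red}}\, \eta_H(\mathcal{T}_H \setminus \mathcal{T}_h, v_H)$;
\item[(A3)] reliability:
  $|||u - u_H||| \le C_{\mathrm{rel}}\, \eta_H(u_H)$;
\item[(A4)] discrete reliability:
  $|||u_h - u_H||| \le C_{\mathrm{rel}}\, \eta_H(\mathcal{T}_H \setminus \mathcal{T}_h, u_H)$.
\end{itemize}
```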
We refer to Section 7.1 below for precise assumptions on the nonlinearity A(·) of problem (1) such that the standard residual error estimator satisfies (A1)-(A4) for lowest-order Courant finite elements; see also Section 7.2-7.3.
2.4. Algebraic solver. For a given linear and continuous functional G ∈ X', we consider linear systems of algebraic equations of the type

(u_H^⋆, v_H) = G(v_H) for all v_H ∈ X_H, (16)

with unique (but not computed) exact solution u_H^⋆ ∈ X_H. We suppose here that we have at hand a contractive iterative algebraic solver for problems of the form (16). More precisely, let u_H^0 ∈ X_H be an initial guess and let the solver produce a sequence u_H^j ∈ X_H, j ≥ 1. Then, we suppose that there exists a generic constant 0 < q_alg < 1 such that

|||u_H^⋆ − u_H^j||| ≤ q_alg |||u_H^⋆ − u_H^{j−1}||| for all j ≥ 1. (17)

Examples of such solvers are suitably preconditioned conjugate gradients or multigrid; see, e.g., Olshanskii and Tyrtyshnikov [OT14] and the references therein.
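Any linearly convergent stationary method fits assumption (17). As a sketch (Richardson iteration standing in for the PCG or multigrid solvers mentioned above; the matrix and data are synthetic), the contraction in the energy norm can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((20, 20))
M = B @ B.T + 20 * np.eye(20)           # SPD system matrix
b = rng.standard_normal(20)
u_exact = np.linalg.solve(M, b)

ew = np.linalg.eigvalsh(M)
omega = 2.0 / (ew[0] + ew[-1])          # optimal Richardson damping
q_alg = (ew[-1] - ew[0]) / (ew[-1] + ew[0])   # contraction factor < 1

def energy_norm(v):
    return np.sqrt(v @ M @ v)

u = np.zeros(20)
for _ in range(30):
    err_old = energy_norm(u_exact - u)
    u = u + omega * (b - M @ u)         # one Richardson step
    # contraction (17): |||u* - u^j||| <= q_alg * |||u* - u^{j-1}|||
    assert energy_norm(u_exact - u) <= q_alg * err_old + 1e-12
```

For Richardson, q_alg depends on the condition number of M; an optimal preconditioner keeps q_alg bounded away from 1 uniformly in the mesh size, which is what the cost analysis requires.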
2.5. Adaptive algorithm. The present work considers an adaptive algorithm for the numerical approximation of problem (9) which steers mesh-refinement with index ℓ, a (perturbed) contractive Banach–Picard iteration with index k, and a contractive algebraic solver with index j. At each step (ℓ, k, j), it yields an approximation u_ℓ^{k,j} ∈ X_ℓ to the unique but unavailable u_ℓ^⋆ ∈ X_ℓ on the mesh T_ℓ defined by

⟨Au_ℓ^⋆, v_ℓ⟩ = F(v_ℓ) for all v_ℓ ∈ X_ℓ. (18)

Referring to Table 1 for a summary of the notation, the algorithm reads as follows:

Algorithm 1. Input: Initial mesh T_0 and initial guess u_0^{0,0} = u_0^{0,j} ∈ X_0, parameters 0 < θ ≤ 1, 0 < λ_alg < 1, 0 < λ_Pic, and C_mark ≥ 1, counters ℓ = k = j = 0.
(i) Repeat the following steps (a)-(b) until

|||u_ℓ^{k,j} − u_ℓ^{k−1,j}||| ≤ λ_Pic η_ℓ(u_ℓ^{k,j}): (21)

(a) Update counter k := k + 1, set j := 0, and define u_ℓ^{k,0} := u_ℓ^{k−1,j}.
(b) Repeat the following steps (I)-(III) until

|||u_ℓ^{k,j} − u_ℓ^{k,j−1}||| ≤ λ_alg (η_ℓ(u_ℓ^{k,j}) + |||u_ℓ^{k,j} − u_ℓ^{k−1,j}|||): (20)

(I) Update counter j := j + 1.
(II) Consider the problem of finding u_ℓ^{k,⋆} ∈ X_ℓ such that, for all v_ℓ ∈ X_ℓ,

(u_ℓ^{k,⋆}, v_ℓ) = (u_ℓ^{k−1,j}, v_ℓ) − (α/L^2) [⟨Au_ℓ^{k−1,j}, v_ℓ⟩ − F(v_ℓ)], (19)

and do one step of the algebraic solver applied to (19), starting from u_ℓ^{k,j−1}, which yields u_ℓ^{k,j} (an approximation to u_ℓ^{k,⋆}).
(III) Compute the local indicators η_ℓ(T, u_ℓ^{k,j}) for all T ∈ T_ℓ.
(iii) If η_ℓ(u_ℓ^{k,j}) = 0, then terminate.
(iv) Determine a set M_ℓ ⊆ T_ℓ of, up to the multiplicative constant C_mark, minimal cardinality such that

θ η_ℓ(u_ℓ^{k,j})^2 ≤ η_ℓ(M_ℓ, u_ℓ^{k,j})^2. (22)

(v) Generate T_{ℓ+1} := refine(T_ℓ, M_ℓ) and define u_{ℓ+1}^{0,0} := u_{ℓ+1}^{0,j} := u_ℓ^{k,j}.
Output: Sequence of discrete solutions u_ℓ^{k,j} and corresponding error estimators η_ℓ(u_ℓ^{k,j}).
Some remarks are in order to explain the nature of Algorithm 1. The innermost loop (Algorithm 1(i.b)) steers the algebraic solver. Note here that the exact solution u_ℓ^{k,⋆} of (19) is not computed but only approximated by the computed iterates u_ℓ^{k,j}. For the linear system (19), the contraction assumption (17) reads as

|||u_ℓ^{k,⋆} − u_ℓ^{k,j}||| ≤ q_alg |||u_ℓ^{k,⋆} − u_ℓ^{k,j−1}||| for all j ≥ 1. (23)

Then, the triangle inequality implies that

|||u_ℓ^{k,⋆} − u_ℓ^{k,j}||| ≤ (q_alg/(1 − q_alg)) |||u_ℓ^{k,j} − u_ℓ^{k,j−1}|||. (24)

Hence, the term |||u_ℓ^{k,j} − u_ℓ^{k,j−1}||| provides a means to estimate the algebraic error |||u_ℓ^{k,⋆} − u_ℓ^{k,j}|||. Thus, the approximation u_ℓ^{k,j} is accepted and the algebraic solver is stopped if the algebraic error estimate |||u_ℓ^{k,j} − u_ℓ^{k,j−1}||| is, up to the threshold λ_alg, below the estimate η_ℓ(u_ℓ^{k,j}) + |||u_ℓ^{k,j} − u_ℓ^{k−1,j}||| of the sum of the discretization and linearization errors; see (20).
The middle loop (Algorithm 1(i)) steers the linearization by means of the (perturbed) Banach–Picard iteration. Lemma 7 below shows that the term |||u_ℓ^{k,j} − u_ℓ^{k−1,j}||| estimates the linearization error |||u_ℓ^⋆ − u_ℓ^{k,j}|||. Note here that, a priori, only the non-perturbed Banach–Picard iteration corresponding to the (unavailable) exact solve of (19) yielding u_ℓ^{k,⋆} would lead to the contraction

|||u_ℓ^⋆ − u_ℓ^{k,⋆}||| ≤ q_Pic |||u_ℓ^⋆ − u_ℓ^{k−1,j}|||, (25)

where 0 < q_Pic := (1 − α^2/L^2)^{1/2} < 1. The approximation u_ℓ^{k,j} is accepted and the linearization is stopped if the linearization error estimate |||u_ℓ^{k,j} − u_ℓ^{k−1,j}||| is, up to the threshold λ_Pic, below the discretization error estimate η_ℓ(u_ℓ^{k,j}); see (21).
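The interplay of the two criteria can be made concrete in a few lines. This is a sketch only; `eta`, `delta_alg`, and `delta_pic` denote the computable estimates η_ℓ(u_ℓ^{k,j}), |||u_ℓ^{k,j} − u_ℓ^{k,j−1}|||, and |||u_ℓ^{k,j} − u_ℓ^{k−1,j}|||, and the function names are hypothetical.

```python
def stop_algebraic(delta_alg, eta, delta_pic, lam_alg):
    """Criterion (20): stop the innermost j-loop once the algebraic error
    estimate drops below lam_alg times the sum of the discretization and
    linearization error estimates."""
    return delta_alg <= lam_alg * (eta + delta_pic)

def stop_linearization(delta_pic, eta, lam_pic):
    """Criterion (21): stop the middle k-loop once the linearization error
    estimate drops below lam_pic times the discretization error estimate."""
    return delta_pic <= lam_pic * eta

# No overly tight algebraic tolerance is requested: with eta = 1.0 and a
# linearization estimate of 0.5, an algebraic estimate of 0.1 already passes.
assert stop_algebraic(0.1, 1.0, 0.5, lam_alg=0.1)       # 0.1 <= 0.1 * 1.5
assert not stop_linearization(0.5, 1.0, lam_pic=0.1)    # 0.5 >  0.1 * 1.0
```

The key design point is that both tolerances are relative to the current discretization estimate, so the inner solvers are never asked for more accuracy than the current mesh can exploit.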
Finally, the outermost adaptive loop steers the local mesh-refinement. To this end, the Dörfler marking criterion (22) from [Dör96] is employed to mark elements T ∈ M_ℓ for refinement, unless η_ℓ(u_ℓ^{k,j}) = 0, in which case Proposition 3 below ensures that the approximation u_ℓ^{k,j} coincides with the exact solution u of (9).
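A greedy realization of the Dörfler criterion (22) can be sketched as follows (sorting the indicators gives a set of truly minimal cardinality, i.e., C_mark = 1, at log-linear cost; the function name is hypothetical):

```python
def doerfler_marking(indicators, theta):
    """Return a minimal-cardinality set M of element indices such that
    theta * sum_T eta(T)^2 <= sum_{T in M} eta(T)^2."""
    order = sorted(range(len(indicators)), key=lambda t: -indicators[t])
    total = sum(eta**2 for eta in indicators)
    marked, acc = [], 0.0
    for t in order:
        if acc >= theta * total:
            break
        marked.append(t)
        acc += indicators[t]**2
    return marked

# the single element with the dominant indicator already captures
# half of the total squared estimator
print(doerfler_marking([0.1, 0.8, 0.2, 0.05], theta=0.5))  # → [1]
```

In practice, binning-based variants achieve linear instead of log-linear cost at the price of C_mark = 2; both satisfy the minimal-cardinality requirement up to C_mark.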
Remark 2. In a practical implementation, Algorithm 1 has to be complemented by appropriate stopping criteria in all of the loops so that the computation is terminated if u_ℓ^{k,j} ∈ X_ℓ is a sufficiently accurate approximation of u. This can be done with the help of the reliable a posteriori error estimates summarized in Proposition 3 below.
2.6. Index set Q for the triple loop. To analyze Algorithm 1, define the index set

Q := {(ℓ, k, j) ∈ N_0^3 : the approximation u_ℓ^{k,j} is computed by Algorithm 1}. (26)

Since Algorithm 1 is sequential, the index set Q is naturally ordered. For indices (ℓ, k, j), (ℓ', k', j') ∈ Q, we write

(ℓ, k, j) < (ℓ', k', j') :⇔ (ℓ, k, j) appears earlier in Algorithm 1 than (ℓ', k', j'). (27)

With this order, we can define |(ℓ, k, j)| := #{(ℓ', k', j') ∈ Q : (ℓ', k', j') < (ℓ, k, j)}, which is the total step number of Algorithm 1. We make the following definitions, which are consistent with those of Algorithm 1, and additionally define j(ℓ, 0) := 0:

ℓ̲ := sup{ℓ ∈ N_0 : (ℓ, 0, 0) ∈ Q}, k(ℓ) := sup{k ∈ N_0 : (ℓ, k, 0) ∈ Q}, j(ℓ, k) := sup{j ∈ N_0 : (ℓ, k, j) ∈ Q}.

Generically, it holds that ℓ̲ = ∞, i.e., infinitely many steps of mesh-refinement take place. However, our analysis also covers the cases that either the k-loop (linearization) or the j-loop (algebraic solver) does not terminate, i.e., k(ℓ) = ∞ or j(ℓ, k) = ∞, or that the exact solution u is hit at step (iii) of Algorithm 1 (recall that η_ℓ(u_ℓ^{k,j}) = 0 implies u = u_ℓ^{k,j} by virtue of Proposition 3 below). To abbreviate notation, we make the following convention: If the mesh index ℓ ∈ N_0 is clear from the context, we simply write k := k(ℓ), e.g., u_ℓ^{k,j} := u_ℓ^{k(ℓ),j}. Similarly, we simply write j := j(ℓ, k), e.g., u_ℓ^{k,j} := u_ℓ^{k,j(ℓ,k)}.
Note that, in particular, u_{ℓ+1}^{0,0} = u_{ℓ+1}^{0,j} = u_ℓ^{k,j}, so that the final iterates on one mesh are taken over to the next. Hence, these approximate solutions are indexed three times. This is our notational choice that will not be harmful for what follows; alternatively, one could only index the approximate solutions that appear on step (i.b.II) of Algorithm 1.

2.7. Main results.
Our first proposition provides computable upper bounds for the energy error |||u − u_ℓ^{k,j}||| of the iterates u_ℓ^{k,j} of Algorithm 1 at any step (ℓ, k, j) ∈ Q. In particular, we note that the stopping criteria (20)-(21) ensure reliability of η_ℓ(u_ℓ^{k,j}) for the final perturbed Banach–Picard iterates u_ℓ^{k,j}. The proof is postponed to Section 3.3.
Proposition 3 (Reliability at various stages of Algorithm 1). Suppose (A1) and (A3). Then, for all (ℓ, k, j) ∈ Q, the reliability estimates (28) hold. The constant C'_rel > 0 depends only on C_rel, C_stab, q_alg, λ_alg, q_Pic, and λ_Pic.
The first main theorem states linear convergence in each step of the adaptive algorithm, i.e., algebraic solver or linearization or mesh-refinement. The proof is given in Section 4.
Note that ∆_{ℓ'}^{k',j'} = ∆_ℓ^{k,j} when (ℓ', k', j') = (ℓ, k, j), and then (30) holds with equality for C_lin = 1. There are other cases where u_{ℓ'}^{k',j'} = u_ℓ^{k,j}, and where u_{ℓ'}^{k',j'} = u_ℓ^{k,j} together with T_{ℓ'} = T_ℓ and consequently η_{ℓ'}(u_{ℓ'}^{k',j'}) = η_ℓ(u_ℓ^{k,j}); this is related to our notational choice for Q in (26), which also indexes nested iterates. The case with ℓ' = ℓ arises for instance when j = j(ℓ, k), j' = 0, and k' = k + 1; see step (i.a) of Algorithm 1. Note, however, that in such a situation, typically u_ℓ^{k',⋆} ≠ u_ℓ^{k,⋆}, and consequently ∆_{ℓ'}^{k',j'} ≠ ∆_ℓ^{k,j}. A situation where ∆_{ℓ'}^{k',j'} = ∆_ℓ^{k,j} for (ℓ', k', j') ≠ (ℓ, k, j) can nevertheless also appear and is covered in (30). For instance, in the above example, when j = j(ℓ, k), j' = 0, k' = k + 1, and ℓ' = ℓ, and where moreover u_ℓ^{k,j} = u_ℓ^{k,⋆} = u_ℓ^⋆ (so that u_ℓ^{k,j} = u_ℓ^{k,⋆} = u_ℓ^{k',⋆} = u_ℓ^{k',j'} = u_ℓ^⋆), Algorithm 1 only effectuates one step of the algebraic solver on the linearization step k', so that C_lin = 1/q_lin leads to equality in (30), where now |(ℓ', k', j')| − |(ℓ, k, j)| = 1.
The second main result states optimal decay rate of the quasi-error ∆_ℓ^{k,j} of (29) (and consequently of the total error |||u − u_ℓ^{k,j}|||) in terms of the number of degrees of freedom added in the space X_ℓ with respect to X_0. More precisely, the result states that if the unknown weak solution u of (9) can be approximated at algebraic decay rate s with respect to the number of mesh elements added in the refinement of T_0 (plus one) for a best-possible mesh, then Algorithm 1 achieves the same decay rate s with respect to the number of elements actually added in Algorithm 1, (#T_ℓ − #T_0 + 1), up to a generic multiplicative constant. The proof of the following Theorem 5 is given in Section 5.
Note that ∆_0^{0,0} can be arbitrarily bad for a bad initial guess u_0^{0,0}. However, ‖u‖_{A_s} as well as the constant C_opt are independent of the initial guess, so that the upper bound in (33) cannot avoid max{‖u‖_{A_s}, ∆_0^{0,0}} for the case ℓ = 0. Such a phenomenon does not appear at later stages, since the stopping criteria (20) and (21) ensure that, though u_ℓ^{k,j} does not in general coincide with u_ℓ^⋆, it is sufficiently accurate. If one restricts the indices to (ℓ, k, j) ∈ Q with ℓ ≥ 1, then the upper bound in (33) may omit ∆_0^{0,0}. Our last main result states that Algorithm 1 drives the quasi-error down at each possible rate s not only with respect to the number of degrees of freedom added in the space X_ℓ in comparison with X_0, but actually also with respect to the overall computational cost expressed as a cumulated sum of the number of degrees of freedom. This is an important improvement over Theorem 5. More precisely, under the same conditions as above, i.e., if the unknown weak solution u of (9) can be approximated at algebraic decay rate s with respect to the number of mesh elements added in the refinement of T_0 (plus one), then Algorithm 1 generates a sequence of triple-(ℓ, k, j)-indexed approximations (mesh, linearization, algebraic solver) such that the quasi-error decays at rate s with respect to the overall algorithmic cost expressed as the sum of the number of simplices #T_ℓ over all steps (ℓ, k, j) ∈ Q effectuated by Algorithm 1. The proof of the following Theorem 6 is given in Section 6.
Theorem 6 (optimal decay rate wrt. overall computational cost). Let the assumptions of Theorem 5 be verified. Then, the quasi-error satisfies the cost estimate (34). The maximum in the right inequality of (34) is only needed if ℓ = 0. If ℓ ≥ 1, the maximum max{‖u‖_{A_s}, ∆_0^{0,0}} can be replaced by ‖u‖_{A_s}. While c_opt > 0 is the constant of Theorem 5, C'_opt > 0 is a further generic constant. Analogously to the comments after Theorem 5, the upper estimate in (34) cannot avoid max{‖u‖_{A_s}, ∆_0^{0,0}} for the case ℓ' = ℓ = 0. As above, if one restricts the indices to (ℓ', k', j'), (ℓ, k, j) ∈ Q with ℓ, ℓ' ≥ 1, then the upper bound in (34) may omit ∆_0^{0,0}.
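In the notation of Section 2.6, the cost estimate (34) can be sketched as follows (a hedged reconstruction in the spirit of [GHPS18]; the precise constants are those named in the theorem):

```latex
\sup_{(\ell,k,j)\in\mathcal{Q}}
\Bigl( \sum_{\substack{(\ell',k',j')\in\mathcal{Q} \\ (\ell',k',j') \le (\ell,k,j)}}
       \#\mathcal{T}_{\ell'} \Bigr)^{s} \, \Delta_{\ell}^{k,j}
\;\le\; C_{\mathrm{opt}}' \, \max\bigl\{ \|u\|_{\mathbb{A}_s},\, \Delta_0^{0,0} \bigr\}.
```

The cumulated sum on the left is the natural measure of the overall cost, since each solver step on the mesh T_ℓ has cost proportional to #T_ℓ.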

Auxiliary results
3.1. Some observations on Algorithm 1. This section collects some elementary observations on Algorithm 1 in what concerns nested iteration and stopping criteria. The given initial value of Algorithm 1 reads u_0^{0,0} = u_0^{0,j} ∈ X_0 (35). If (ℓ, 0, 0) ∈ Q with ℓ ≥ 1, then

u_ℓ^{0,0} = u_ℓ^{0,j} = u_{ℓ−1}^{k,j}. (36)

If (ℓ, k, 0) ∈ Q, then the initial guess for the algebraic solver reads

u_ℓ^{k,0} = u_ℓ^{k−1,j}, (37)

i.e., the algebraic solver employs nested iteration. The stopping criterion (20) of Algorithm 1 guarantees that j(ℓ, k) ≥ 1 if k > 0 and, for all (ℓ, k, j) ∈ Q, it holds that

|||u_ℓ^{k,j} − u_ℓ^{k,j−1}||| > λ_alg (η_ℓ(u_ℓ^{k,j}) + |||u_ℓ^{k,j} − u_ℓ^{k−1,j}|||) for 0 < j < j(ℓ, k), (38)
|||u_ℓ^{k,j} − u_ℓ^{k,j−1}||| ≤ λ_alg (η_ℓ(u_ℓ^{k,j}) + |||u_ℓ^{k,j} − u_ℓ^{k−1,j}|||) for j = j(ℓ, k), (39)

i.e., the algebraic error estimate |||u_ℓ^{k,j} − u_ℓ^{k,j−1}||| only drops below the discretization plus linearization error estimate at the stopping iteration j = j(ℓ, k).
The final iterates u_ℓ^{k,j} of the algebraic solver are used to obtain the perturbed Banach–Picard iterates u_ℓ^{k+1,j} for k > 0; see (19). The stopping criterion (21) of Algorithm 1 guarantees that k(ℓ) ≥ 1 and, for all (ℓ, k, j) ∈ Q, it holds that

|||u_ℓ^{k,j} − u_ℓ^{k−1,j}||| > λ_Pic η_ℓ(u_ℓ^{k,j}) for 0 < k < k(ℓ),
|||u_ℓ^{k,j} − u_ℓ^{k−1,j}||| ≤ λ_Pic η_ℓ(u_ℓ^{k,j}) for k = k(ℓ),

i.e., the linearization error estimate |||u_ℓ^{k,j} − u_ℓ^{k−1,j}||| only drops below the discretization error estimate at the stopping iteration k = k(ℓ).
3.2. Contraction of the perturbed Banach–Picard iteration. Assumption (17) immediately implies the algebraic solver contraction (23) and the reliability (24) of the algebraic error estimate |||u_ℓ^{k,j} − u_ℓ^{k,j−1}|||. Similarly, one step of the non-perturbed Banach–Picard iteration (i.e., with an exact algebraic solve of problem (19) with the datum u_ℓ^{k−1,j}) leads to the contraction (25) and consequently to the reliability of the unavailable linearization error estimate |||u_ℓ^{k,⋆} − u_ℓ^{k−1,j}|||. As our first result, we now show that, for a sufficiently small stopping parameter λ_alg in (20), the perturbed Banach–Picard iteration is a contraction as well. Recall that u_ℓ^⋆ ∈ X_ℓ is the (unavailable) exact discrete solution given by (18), that u_ℓ^{k,⋆} ∈ X_ℓ is the (unavailable) exact linearization solution given by (19), and that u_ℓ^{k,j} ∈ X_ℓ is the computed solution for which the algebraic solver is stopped; see (20) (resp. (38)-(39)) for the stopping criterion.
3.3. Proof of Proposition 3 (reliable error control in Algorithm 1). We are now ready to prove the estimates (28).
As a result of Lemma 8 and Proposition 3, we get the following lemma. Please note that when ℓ̲ < ∞, the summation below only goes to ℓ̲ − 1, as the arguments rely on (51), which needs finite stopping indices k(ℓ) and j(ℓ, k) on each mesh T_ℓ.
In comparison with (29), ∆_ℓ^k omits the algebraic error term. With Proposition 3 and the linear convergence (51), the claim follows once we prove that ∆_ℓ^{k,j} ≲ ∆_ℓ^k. By definition (29), it only remains to show that |||u_ℓ^{k,⋆} − u_ℓ^{k,j}||| ≲ ∆_ℓ^k. To this end, note that the reliability (24) of the algebraic error estimate together with the stopping criterion (39) yields the claim. This proves (53) and concludes the proof.

Proof of Theorem 4 (linear convergence)
This section is dedicated to the proof of Theorem 4. At its core is the following lemma, which extends Lemma 9 to our setting with triple indices.
Step 1. We prove that A_ℓ^{k,j} ≃ ∆_ℓ^{k,j}. Note that A_ℓ^{k,j} and ∆_ℓ^{k,j} only differ in their first term, where the overall error is replaced by the (inexact) linearization error. According to the Céa lemma (13), it holds that |||u − u_ℓ^⋆||| ≤ (L/α) min_{v_ℓ ∈ X_ℓ} |||u − v_ℓ||| ≤ (L/α) |||u − u_ℓ^{k,j}|||. This implies that A_ℓ^{k,j} ≲ ∆_ℓ^{k,j}. To see the converse inequality, note that the triangle inequality yields |||u − u_ℓ^{k,j}||| ≤ |||u − u_ℓ^⋆||| + |||u_ℓ^⋆ − u_ℓ^{k,j}|||. This proves ∆_ℓ^{k,j} ≲ A_ℓ^{k,j} and concludes this step.

This proves (56).
Second, we consider the use of nested iteration when passing to the next perturbed Banach–Picard step. We prove the estimate (57). To this end, recall from (37) that u_ℓ^{k,0} = u_ℓ^{k−1,j}. This proves (57). Third, we prove the estimate (58), related to the algebraic error contraction. Note that k = 0 implies j = 0, so that (58) trivially holds for k = 0 in the form of an equality. Let now k ≥ 1. We first consider the last but one algebraic iteration step j = j(ℓ, k) − 1 ≥ 0, for which the failure of the stopping criterion (38) provides the required bound. This proves (58) for j = j(ℓ, k) − 1 ≥ 0. Note that this argument also applies when j = 1.
Since 1 ≤ k < k(ℓ), we obtain the claim, where we employ Lemma 7 and hence require λ_alg + λ_alg/λ_Pic to be sufficiently small. This proves (59). Fifth, we consider the use of nested iteration when refining the mesh. We prove the estimate (60). To this end, recall from (36) that u_ℓ^{0,0} = u_ℓ^{0,j} = u_{ℓ−1}^{k,j}. Hence, it follows from (A1) used on the non-refined mesh elements and (A2) used on the refined mesh elements that the estimator contracts up to the stated perturbation. This proves (60).
Sixth, we prove the estimate (62), related to the linearization error contraction. We first consider k = k(ℓ) − 1 ≥ 0, for which the triangle inequality leads to the claim. This proves (62) for k = k(ℓ) − 1. Note that the same argument also applies when k = 1.
also using that q_Pic ≤ 1. This concludes the proof of (62). Seventh, we consider the use of nested iteration when passing to the next perturbed Banach–Picard step. We prove the estimate (64). Using (57) and recalling the definition u_ℓ^{k,0} = u_ℓ^{k−1,j} from (37), we obtain the claim (64).
Step 3. This step collects auxiliary estimates following from the geometric series and the contraction properties of the linearization and the algebraic solver. First, the estimate (65) holds; this follows immediately from the algebraic solver contraction (23) and the geometric series. We note that (65) also holds for j(ℓ, k) = ∞ (with the convention that then j(ℓ, k) − 1 = ∞). Analogously, the contraction (44) of the perturbed Banach–Picard iteration leads to the estimate (66). We note that (66) also holds for k(ℓ) = ∞ (with the convention that then k(ℓ) − 1 = ∞).
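For instance, the contraction (23) combined with the geometric series yields bounds of the following type (a sketch in the notation of Section 2):

```latex
\sum_{j=0}^{j(\ell,k)-1} |||u_\ell^{k,\star} - u_\ell^{k,j}|||
\;\le\; \sum_{j=0}^{\infty} q_{\mathrm{alg}}^{\,j}\, |||u_\ell^{k,\star} - u_\ell^{k,0}|||
\;=\; \frac{1}{1-q_{\mathrm{alg}}}\, |||u_\ell^{k,\star} - u_\ell^{k,0}|||.
```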
With the analogous convention ℓ̲ − 1 = ∞ when ℓ̲ = ∞, we finally prove the estimate (67). This follows from Step 1 together with the preceding estimates. Step 4. From now on, let (ℓ', k', j') ∈ Q be arbitrary. Suppose first that ℓ̲ = ∞, i.e., both the algebraic and the linearization solver terminate at some finite values k(ℓ) for all ℓ ≥ 0 and j(ℓ, k) for all ℓ ≥ 0 and all k ≤ k(ℓ), whereas infinitely many steps of mesh-refinement take place. By the definition of our index set Q in (26) (which in particular features nested iterates), the sum over all (ℓ, k, j) > (ℓ', k', j') splits into the three contributions in (68), where we have employed the estimates (60) and (64) in order to start all the summations from k = 1 and j = 1.
We consider the three summands in (68) separately. For the first sum, we infer from the geometric series estimates of Step 3 that it is controlled by A_{ℓ'}^{k',j'}. If k' = k(ℓ'), the second sum in the bound (68) disappears. If k' < k(ℓ'), we infer that it is likewise controlled. If j' = j(ℓ', k'), the third sum in the bound (68) disappears. If j' < j(ℓ', k'), we infer the same bound. Summing up (68)-(71), we see that

Σ_{(ℓ,k,j) ∈ Q, (ℓ,k,j) > (ℓ',k',j')} A_ℓ^{k,j} ≲ A_{ℓ'}^{k',j'}, provided that ℓ̲ = ∞.
Step 5. Suppose that $\underline{\ell} < \infty$ and $\underline{k}(\underline{\ell}) = \infty$, i.e., for the mesh $\mathcal{T}_{\underline{\ell}}$, the linearization loop does not terminate, and, moreover, $\ell' < \underline{\ell}$. Then, the tail sum again splits as before, and we argue as before to bound the arising sums by $A_{\ell'}^{k',j'}$. Altogether, we hence obtain that
$$\sum_{\substack{(\ell,k,j) \in \mathcal{Q} \\ (\ell,k,j) > (\ell',k',j')}} A_\ell^{k,j} \;\lesssim\; A_{\ell'}^{k',j'},$$
provided that $\ell' < \underline{\ell} < \infty$ and $\underline{k}(\underline{\ell}) = \infty$.
Step 6. Suppose that $\underline{\ell} < \infty$ and $\underline{k}(\underline{\ell}) = \infty$, i.e., for the mesh $\mathcal{T}_{\underline{\ell}}$, the linearization loop does not terminate, and, moreover, $\ell' = \underline{\ell}$. Arguing as in (74) and (71), we obtain (75).

Step 7. Suppose that $\underline{\ell} < \infty$, where $\underline{k}(\underline{\ell}) < \infty$ and hence $\underline{j}(\underline{\ell}, \underline{k}(\underline{\ell})) = \infty$, i.e., the linear solver does not terminate for the linearization step $\underline{k}(\underline{\ell})$. Suppose moreover that $\ell' < \underline{\ell}$. Then, the tail sum splits as in (76), and we argue as before to see that the arising sums are bounded, up to a multiplicative constant, by $A_{\ell'}^{k',j'}$.

April 29, 2020
For the first sum in (76), we get the corresponding bound. Hence, it only remains to estimate the second sum in (76), which can be treated analogously to (74) in Step 5. This proves that the tail sum is again bounded, up to a multiplicative constant, by $A_{\ell'}^{k',j'}$.
Step 8. Suppose that $\underline{\ell} < \infty$, where $\underline{k}(\underline{\ell}) < \infty$ and hence $\underline{j}(\underline{\ell}, \underline{k}(\underline{\ell})) = \infty$, i.e., the linear solver does not terminate for the linearization step $\underline{k}(\underline{\ell})$. Suppose moreover that $\ell' = \underline{\ell}$ but $k' < \underline{k}(\underline{\ell})$. Then, the tail sum splits as before, and we argue as before to see that the arising sums are bounded, up to a multiplicative constant, by $A_{\ell'}^{k',j'}$.
Step 9. Suppose that $\underline{\ell} < \infty$, where $\underline{k}(\underline{\ell}) < \infty$ and hence $\underline{j}(\underline{\ell}, \underline{k}(\underline{\ell})) = \infty$, i.e., the linear solver does not terminate for the linearization step $\underline{k}(\underline{\ell})$. Suppose that $\ell' = \underline{\ell}$ and $k' = \underline{k}(\underline{\ell})$. Then, the corresponding bound holds as well.

Step 10. Suppose that $\underline{\ell}, \underline{k}(\underline{\ell}), \underline{j}(\underline{\ell}, \underline{k}(\underline{\ell})) < \infty$, so that Algorithm 1 finished in step (iii) when $\eta_{\underline{\ell}}(u_{\underline{\ell}}^{\underline{k},\underline{j}}) = 0$. From (28), we see that $\eta_{\underline{\ell}}(u_{\underline{\ell}}^{\underline{k},\underline{j}}) = 0$ implies $u = u_{\underline{\ell}}^{\underline{k},\underline{j}}$, i.e., the exact solution has been found. Moreover, through the stopping criteria (21) and (20), we see that $u_{\underline{\ell}}^{\underline{k}-1,\underline{j}} = u_{\underline{\ell}}^{\underline{k},\underline{j}-1} = u_{\underline{\ell}}^{\underline{k},\underline{j}}$, so that (45) gives $u_{\underline{\ell}} = u_{\underline{\ell}}^{\underline{k},\underline{j}}$, and finally (19) gives $u_{\underline{\ell}}^{\underline{k},\star} = u_{\underline{\ell}}^{\underline{k},\underline{j}}$. Thus, the last summand $A_{\underline{\ell}}^{\underline{k},\underline{j}}$ vanishes. Let $\ell' < \underline{\ell}$. Then, as in (72), the tail sum splits accordingly. Here, the last three terms are estimated as in (73), whereas for the first one, we can proceed as in (74), crucially noting that the last summand $A_{\underline{\ell}}^{\underline{k},\underline{j}}$ is zero.
If $\ell' = \underline{\ell}$, three cases are possible. The first case is $k' < \underline{k}$. Then, the tail sum is controlled as in (73). The second case is $k' = \underline{k}$ but $j' < \underline{j}$, where the bound follows directly, again using that $A_{\underline{\ell}}^{\underline{k},\underline{j}} = 0$. Finally, in the third case, $k' = \underline{k}$ and $j' = \underline{j}$, but then the sum is void. Altogether, the claimed estimate also holds in this case.
Step 11. Combining Steps 4–10, which cover all possible runs of Algorithm 1, with Step 1, we finally arrive at the claimed summability estimate. Inductively, it follows for all $N, m \in \mathbb{N}_0$ with $N + m < \min\{M + 1, \infty\}$ that the corresponding tail estimates hold. We thus conclude the desired geometric decay for all $N, m \in \mathbb{N}_0$ with $N + m < \min\{M + 1, \infty\}$.

Step 2. Since the index set $\mathcal{Q}$ is linearly ordered with respect to the total step counter $|(\cdot, \cdot, \cdot)|$, Lemma 10 and Step 1 imply the linear convergence estimate, where $C_{\mathrm{lin}} = 1 + C_{\mathrm{sum}}$ and $q_{\mathrm{lin}} = C_{\mathrm{sum}}/(C_{\mathrm{sum}} + 1)$. This concludes the proof.
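The passage from a summability estimate with constant $C_{\mathrm{sum}}$ to linear convergence with $C_{\mathrm{lin}} = 1 + C_{\mathrm{sum}}$ and $q_{\mathrm{lin}} = C_{\mathrm{sum}}/(C_{\mathrm{sum}} + 1)$ rests on a standard tail-summation argument, sketched here for a generic nonnegative sequence $(a_n)$:

```latex
\sum_{n=N+1}^{\infty} a_n \le C_{\mathrm{sum}}\, a_N \ \text{for all } N
\quad\Longrightarrow\quad
S_N := \sum_{n=N}^{\infty} a_n \le (1 + C_{\mathrm{sum}})\, a_N
\quad\text{and}\quad
S_{N+1} \le C_{\mathrm{sum}}\, a_N = C_{\mathrm{sum}}\, (S_N - S_{N+1}).
```

Rearranging the last inequality gives $S_{N+1} \le \frac{C_{\mathrm{sum}}}{1 + C_{\mathrm{sum}}}\, S_N = q_{\mathrm{lin}}\, S_N$, and hence $a_{N+m} \le S_{N+m} \le q_{\mathrm{lin}}^m\, S_N \le (1 + C_{\mathrm{sum}})\, q_{\mathrm{lin}}^m\, a_N = C_{\mathrm{lin}}\, q_{\mathrm{lin}}^m\, a_N$.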
5. Proof of Theorem 5 (optimal decay rate with respect to the degrees of freedom)

The first result of this section proves the left inequality in (33):

Lemma 11. Suppose (R1) as well as (A1), (A2), and (A4). Let $s > 0$ and assume $\|u\|_{\mathbb{A}_s} > 0$. Then, the stated lower bound holds, where the constant $c_{\mathrm{opt}} > 0$ depends only on $C_{\mathrm{Céa}} = L/\alpha$, $C_{\mathrm{stab}}$, $C_{\mathrm{rel}}$, $C_{\mathrm{son}}$, $\#\mathcal{T}_0$, $s$, and, if $\underline{\ell} < \infty$, additionally on $\underline{\ell}$.
Due to the uniqueness of the limit and the Céa lemma (13), we obtain that $u = u_{\underline{\ell}} = u_{\underline{\ell}}^{\underline{k},\star}$. From stability (A1), it follows that the corresponding estimator values coincide. Hence, we see that $\eta_{\underline{\ell}}(u_{\underline{\ell}}) = \eta_{\underline{\ell}}(u_{\underline{\ell}}^{\underline{k},\star}) = 0$.
Algorithm 1 then guarantees that $\#\mathcal{T}_\ell \to \infty$ as $\ell \to \infty$. Thus, we can argue analogously to the proof of [CFPP14, Theorem 4.1]: Let $N \in \mathbb{N}$. Choose the maximal $\ell \in \mathbb{N}_0$ such that $\#\mathcal{T}_\ell - \#\mathcal{T}_0 + 1 \le N$. Then, $\mathcal{T}_\ell \in \mathbb{T}(N)$. The choice of $N$ guarantees that $N + 1 \lesssim \#\mathcal{T}_\ell - \#\mathcal{T}_0 + 1$. This leads to the desired estimate, and we immediately see that it also holds for $N = 0$ with $\ell = 0$. Taking the supremum over all $N \in \mathbb{N}_0$, we conclude the corresponding bound. Step 3. With stability (A1) and the Céa lemma (13), we see for all $(\ell, 0, 0) \in \mathcal{Q}$ that the estimator is controlled accordingly. With (83) and (85), we thus obtain the desired upper bound for $\|u\|_{\mathbb{A}_s}$, i.e., the left inequality in (33). This concludes the proof.
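The comparability of $N$ and $\#\mathcal{T}_\ell - \#\mathcal{T}_0 + 1$ used in this counting argument can be made explicit; the following elementary computation is a sketch under the standard assumption that each refined element is split into at most $C_{\mathrm{son}}$ sons. By maximality of $\ell$,

```latex
N \;<\; \#\mathcal{T}_{\ell+1} - \#\mathcal{T}_0 + 1
\;\le\; \#\mathcal{T}_{\ell+1}
\;\le\; C_{\mathrm{son}}\, \#\mathcal{T}_\ell
\;\le\; C_{\mathrm{son}}\, \#\mathcal{T}_0\, \big(\#\mathcal{T}_\ell - \#\mathcal{T}_0 + 1\big),
```

where the last step uses the elementary inequality $\#\mathcal{T}_\ell \le \#\mathcal{T}_0\, (\#\mathcal{T}_\ell - \#\mathcal{T}_0 + 1)$, valid since $(\#\mathcal{T}_0 - 1)(\#\mathcal{T}_\ell - \#\mathcal{T}_0) \ge 0$.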
To prove the upper estimate in (33), we need the comparison lemma from [CFPP14, Lemma 4.14] for the error estimator of the exact discrete solution $u_\ell \in \mathcal{X}_\ell$.
Proof of Theorem 5. The proof is split into four steps. Without loss of generality, we may assume that u As < ∞.
As the next case, if $(\ell, k, j) \in \mathcal{Q}$ with $k = j = 1$, we can rely on the corresponding inequality. Thus, (91) holds in this case as well.
As the final case, if $(\ell, k, j) \in \mathcal{Q}$ with $k = j = 0$, we get the analogous bound with the linear convergence (30). Hence, (91) also holds for this case, and we conclude the proof of (33).

6. Proof of Theorem 6 (optimal decay rate with respect to the computational cost)

Proof of Theorem 6. Note that $\#\mathcal{T}_\ell - \#\mathcal{T}_0 + 1 = 1 \le \#\mathcal{T}_0$ for $\ell = 0$ and $\#\mathcal{T}_\ell - \#\mathcal{T}_0 + 1 \le \#\mathcal{T}_\ell$ for $\ell > 0$, so that the left inequality in (34) immediately follows from the left inequality in (33). In order to prove the right inequality in (34), let $(\ell', k', j') \in \mathcal{Q}$. Employing (91) from Step 5 of the proof of Theorem 5, the geometric series proves the corresponding estimate. Rearranging this estimate, we end up with the desired bound, where the hidden constant depends only on $C_{\mathrm{stab}}$, $C_{\mathrm{rel}}$, $C_{\mathrm{mark}}$, $1 - \lambda_{\mathrm{Pic}}/\lambda_{\mathrm{Pic}}^{\star}$, $C_{\mathrm{Céa}} = L/\alpha$, $C_{\mathrm{rel}}$, $C_{\mathrm{mesh}}$, $C_{\mathrm{lin}}$, $q_{\mathrm{lin}}$, $\#\mathcal{T}_0$, and $s$. This proves the right inequality in (34).
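The geometric-series step can be sketched as follows, assuming (91) in the generic form $\#\mathcal{T}_\ell \lesssim (A_\ell^{k,j})^{-1/s}$ together with the linear convergence (30) in the form $A_{\ell'}^{k',j'} \le C_{\mathrm{lin}}\, q_{\mathrm{lin}}^{\,|(\ell',k',j')| - |(\ell,k,j)|}\, A_\ell^{k,j}$ for $(\ell,k,j) \le (\ell',k',j')$:

```latex
\sum_{(\ell,k,j) \le (\ell',k',j')} \#\mathcal{T}_\ell
\;\lesssim\; \sum_{(\ell,k,j) \le (\ell',k',j')} \big(A_\ell^{k,j}\big)^{-1/s}
\;\le\; C_{\mathrm{lin}}^{1/s}\, \big(A_{\ell'}^{k',j'}\big)^{-1/s} \sum_{m=0}^{\infty} \big(q_{\mathrm{lin}}^{1/s}\big)^m
\;\lesssim\; \big(A_{\ell'}^{k',j'}\big)^{-1/s},
```

where the last step uses $0 < q_{\mathrm{lin}}^{1/s} < 1$ and the fact that the index set $\mathcal{Q}$ is linearly ordered, so that each value of $m = |(\ell',k',j')| - |(\ell,k,j)|$ occurs exactly once.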
As the algebraic solver for the linear problems arising from the Banach–Picard iteration, we use PCG with the multilevel additive Schwarz preconditioner from [Füh14, Section 7.4.1], which is an optimal preconditioner, i.e., the condition number of the preconditioned system is uniformly bounded; cf. also [GHPS19, Section 2.9].
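For illustration only, the following sketch shows the PCG iteration itself on a small symmetric positive definite model system. It uses a simple Jacobi (diagonal) preconditioner as a stand-in, not the multilevel additive Schwarz preconditioner of [Füh14]; the matrix, right-hand side, and tolerances are illustrative assumptions.

```python
import numpy as np

def pcg(A, b, precond, tol=1e-10, max_iter=500):
    """Preconditioned conjugate gradients for an SPD matrix A.
    `precond(r)` approximates A^{-1} r; returns (x, iteration count)."""
    x = np.zeros_like(b)
    r = b - A @ x                      # initial residual
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for it in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            return x, it
        z = precond(r)
        rz_new = r @ z
        beta = rz_new / rz             # Fletcher-Reeves-type update
        p = z + beta * p
        rz = rz_new
    return x, max_iter

# 1D Laplace stiffness matrix (tridiagonal) as a stand-in linear system
n = 100
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

diag = np.diag(A).copy()
x, iters = pcg(A, b, precond=lambda r: r / diag)
assert np.linalg.norm(A @ x - b) <= 1e-8 * np.linalg.norm(b)
```

With an optimal (uniformly bounded condition number) preconditioner, the number of PCG iterations needed to reduce the residual by a fixed factor is bounded independently of the mesh size, which is the property exploited by the adaptive algorithm.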
7.3. Discretization and a posteriori error estimator. Let $\mathcal{T}_0$ be a conforming initial triangulation of $\Omega$ into simplices $T \in \mathcal{T}_0$. For each $\mathcal{T}_H \in \mathbb{T}$, consider the lowest-order FEM space $\mathcal{X}_H$. As in [GMZ12, Section 3.2], we define, for all $T \in \mathcal{T}_H$ and all $v_H \in \mathcal{X}_H$, the corresponding weighted residual error indicators, where $[\cdot]$ denotes the usual jump of discrete functions across element interfaces and $n$ is the outer unit normal vector of the considered element. Due to (M3), the error estimator is well-posed, since the nonlinearity $\mu(x, t)$ is Lipschitz continuous in $x$. Reliability (A3) and discrete reliability (A4) are then proved as in the linear case; see, e.g., [CKNS08] for the linear case, or [GMZ12, Theorem 3.3] and [GMZ12, Theorem 3.4], respectively, for strongly monotone nonlinearities.
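For orientation, in this lowest-order setting the weighted residual error indicators of [GMZ12, Section 3.2] take the following standard form (sketched here up to the precise treatment of boundary faces; $h_T$ denotes the diameter of $T$):

```latex
\eta_H(T, v_H)^2
:= h_T^2\, \big\| f + \operatorname{div} A(\nabla v_H) \big\\|_{L^2(T)}^2
 + h_T\, \big\| [A(\nabla v_H) \cdot n] \big\\|_{L^2(\partial T \cap \Omega)}^2,
\qquad
\eta_H(v_H)^2 := \sum_{T \in \mathcal{T}_H} \eta_H(T, v_H)^2 .
```

The volume term measures the strong residual on each element, while the jump term measures the discontinuity of the discrete normal flux across interior faces.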
In Figure 7, we consider the total number of PCG iterations cumulated over all Picard steps on a given mesh. We observe that, independently of the choice of $\theta$, $\lambda_{\mathrm{alg}}$, and $\lambda_{\mathrm{Pic}}$, the total number of PCG iterations stays uniformly bounded. Additionally, we see that the total number of PCG iterations is smaller for larger values of $\lambda_{\mathrm{alg}}$ and $\lambda_{\mathrm{Pic}}$, as well as for smaller values of $\theta$.