A link between the steepest descent method and fixed-point iterations

We establish a link between the steepest descent method for an unconstrained minimisation problem and fixed-point iterations for its Euler–Lagrange equation. In this context, we rediscover the preconditioned algebraic conjugate gradient method for the discretised problem. The benefit of connecting these concepts is illustrated by a numerical experiment.


Introduction
Throughout this work, let X be a Hilbert space endowed with an inner product denoted by (·, ·)_X and induced norm ‖·‖_X. Furthermore, we consider a functional H : X → R and are interested in the optimisation problem

  arg min_{u ∈ X} H(u).  (1)

In general, there may exist several (local) minimisers or possibly none at all. In this work, we shall make the following assumptions on the functional H:

(H1) H is Gateaux-differentiable;
(H2) H is strictly convex;
(H3) H is weakly coercive, i.e., H(u) → ∞ as ‖u‖_X → ∞.

Those assumptions (H1)-(H3) imply that H has a unique minimiser u ∈ X; see, e.g., [57, Thm. 25.E]. A well-known procedure to approximate the minimiser of such a functional H is the steepest descent method, introduced by Augustin Cauchy in the work [7]. The main idea of this method is very intuitive: at each iteration step, we move in the direction of steepest descent. In particular, if u_n ∈ X is a given iterate, then we set

  u_{n+1} := u_n − δ_n ∇H(u_n),  (2)

where ∇H(u_n), to be specified in Sect. 2, is the gradient of H at u_n and δ_n > 0 is an appropriate step-size such that H(u_{n+1}) ≤ H(u_n). The optimal choice of the step-size is given by δ_n := arg min_{t ≥ 0} H(u_n − t∇H(u_n)), which requires the solution of a one-dimensional optimisation problem. In practice, we may often only approximate the optimal step-size; more comments on that issue will be provided in Remark 3.1 below. It is well known, see, e.g., [57, Thm. 25.F], that under the assumptions (H1)-(H3) the unconstrained minimisation problem (1) is equivalent to the operator equation

  F(u) = 0 in X′,  (3)

where F := H′ : X → X′ is the Gateaux-derivative of the functional H and X′ denotes the dual space of X, i.e., the set of all continuous linear functionals from X to R.
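For illustration, the descent update (2) with the optimal step-size admits a closed form for a quadratic energy H(u) = ½uᵀAu − bᵀu on X = R²; the following sketch (with made-up data A and b) iterates until the gradient vanishes:

```python
import numpy as np

# Steepest descent for the strictly convex, coercive quadratic
# H(u) = 1/2 u^T A u - b^T u on X = R^2, whose unique minimiser
# solves the Euler-Lagrange equation A u = b.  Data are illustrative.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, 1.0])

def grad_H(u):
    return A @ u - b                     # gradient of H at u

u = np.zeros(2)                          # initial guess u_0
for n in range(100):
    g = grad_H(u)
    if np.linalg.norm(g) < 1e-12:
        break
    # optimal step-size delta_n = argmin_{t>=0} H(u - t g); for a
    # quadratic this has the closed form (g^T g) / (g^T A g)
    delta = (g @ g) / (g @ (A @ g))
    u = u - delta * g                    # steepest descent update (2)

print(np.allclose(u, np.linalg.solve(A, b)))
```

For a quadratic energy the operator equation (3) reduces to the linear system Au = b, which the iterates indeed approach.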
There exists a wide variety of fixed-point iterations for the numerical solution of the problem (3) and, as was shown in [18,19], in many cases they can be interpreted as an iterative local linearisation procedure, which can be obtained by applying a suitable preconditioning operator to the original Eq. (3). In particular, for any given u ∈ X, let P[u] : X → X′ be a linear and invertible operator. Then, the operator Eq. (3) is equivalent to the fixed-point equation

  u = u − P[u]^{-1} F(u).  (4)

This, in turn, gives rise to the fixed-point iteration

  u_{n+1} := u_n − P[u_n]^{-1} F(u_n), n = 0, 1, 2, …,  (5)

where u_0 ∈ X is an initial guess. In practice, we seldom invert P[u_n] (in the finite-dimensional setting), but rather solve the linear problem

  P[u_n](u_{n+1} − u_n) = −F(u_n)  (6)

by applying an iterative linear solver. However, for simplicity, the algebraic error will be neglected in this work; i.e., we assume that the linear Eq. (6) is solved exactly. Moreover, upon introducing the bilinear form a(u; v, w) := ⟨P[u]v, w⟩, problem (6) can be stated equivalently as follows: find u_{n+1} ∈ X such that

  a(u_n; u_{n+1}, v) = a(u_n; u_n, v) − ⟨F(u_n), v⟩ for all v ∈ X,  (7)

where ⟨·, ·⟩ denotes the duality pairing on X′ × X. We note that some prominent iteration schemes, such as the Zarantonello, Kačanov, and Newton methods, can be cast into this unified framework, and we refer to [18]. In the following we assume that the bilinear form a(u; ·, ·) : X × X → R is

(A1) uniformly coercive, i.e., there exists a constant α > 0 such that a(u; v, v) ≥ α‖v‖_X² for all u, v ∈ X;
(A2) uniformly bounded, i.e., there exists a constant β > 0 such that |a(u; v, w)| ≤ β‖v‖_X‖w‖_X for all u, v, w ∈ X;
(A3) symmetric, i.e., a(u; v, w) = a(u; w, v) for all u, v, w ∈ X.
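A minimal sketch of the iteration (5) for a toy strongly monotone map on R² (the matrix A, the cubic perturbation, and the choice P := A are assumptions made purely for illustration):

```python
import numpy as np

# Preconditioned fixed-point iteration u_{n+1} = u_n - P^{-1} F(u_n)
# for the strongly monotone toy map F(u) = A u + 0.1 u^3 - b on R^2.
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # SPD "linear part" (illustrative)
b = np.array([0.5, 0.5])

def F(u):
    return A @ u + 0.1 * u**3 - b        # nonlinear operator F : R^2 -> R^2

u = np.zeros(2)                          # initial guess u_0
for n in range(50):
    # the constant preconditioner P := A plays the role of P[u_n];
    # as in (6), we solve a linear system rather than forming P^{-1}
    u = u - np.linalg.solve(A, F(u))     # fixed-point update (5)

print(np.linalg.norm(F(u)) < 1e-10)
```

Because the nonlinear perturbation is small relative to A, the map u ↦ u − P^{-1}F(u) is a contraction and the iterates converge rapidly to the solution of F(u) = 0.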
We note that those assumptions imply, thanks to the Lax-Milgram theorem, that the operator P[u] : X → X′ is invertible for any u ∈ X and that Eq. (7) has a unique solution for each n = 0, 1, 2, …. We further assume that the operator F : X → X′ is

(F1) strongly monotone, i.e., there exists a constant ν > 0 such that ⟨F(u) − F(v), u − v⟩ ≥ ν‖u − v‖_X² for all u, v ∈ X;
(F2) Lipschitz continuous, i.e., there exists a constant L_F > 0 such that ‖F(u) − F(v)‖_{X′} ≤ L_F‖u − v‖_X for all u, v ∈ X.

Under those assumptions, the theory of strongly monotone operator equations yields that Eq. (3) has a unique solution u ∈ X; see, e.g., [57, §25.4]. We recall that this solution is also the unique minimiser of H in X. We further note that the strong monotonicity (F1) of the operator F implies the strict convexity (H2) of its potential H.
Under suitable assumptions it can be shown that the potential H decreases along the sequence {u_n}_n generated by the unified iteration scheme (7) in the sense that there exists a constant C_H > 0 such that

  H(u_n) − H(u_{n+1}) ≥ C_H ‖u_{n+1} − u_n‖_X², n = 0, 1, 2, …;  (8)

see [19, §2.1] for a general discussion of the property (8) and [16, §2.4] for the assumptions guaranteeing (8) for the Zarantonello, Kačanov, and Newton methods. In particular, given the monotonicity property (8), the update −P[u_n]^{-1}F(u_n) from the fixed-point iteration (5) can be considered as a descent direction of the potential H at the given iterate u_n ∈ X. This indicates that there might be a link between the steepest descent method (2) for the optimisation problem (1) and the fixed-point iteration (5) for the solution of its Euler-Lagrange Eq. (3). Indeed, this relation between operator preconditioning in the context of fixed-point iterations and steepest descent methods for generalised Sobolev gradients is already known in the existing literature; here, we simply refer to the monograph [9], which discusses this link extensively, and we will provide further references at a later stage. This connection will be recovered in Sect. 2 of the present manuscript. Subsequently, in Sect. 3, in the context of second-order partial differential equations in divergence form, we will further relate this to the preconditioning of the algebraic gradient descent method for the discretised problem; this link, too, is already known in the literature on Sobolev gradient methods, see, e.g., [9, Ch. 8]. In view of this insight, we will briefly discuss the nonlinear conjugate gradient method in Sect. 3.4. Subsequently, a numerical experiment will be performed in Sect. 4 in order to demonstrate the usefulness of our theoretical observations. Finally, we will round off our work with some conclusions in Sect. 5.
As should be apparent from the discussion above, the present work contains hardly any new insights of its own; rather, its purpose is to synthesise the relations between different concepts in a condensed and accessible way, with an application focus on the unified iteration scheme (6) studied in the author's previous works [16,18,19]. Furthermore, we want to highlight the advantages of the different viewpoints and how they can be used to find efficient iteration schemes. Throughout our paper, we will embed our work in the large body of existing literature, and point to existing results on the different concepts that are connected herein.

Link between the steepest descent method and the unified iteration scheme
We will now make the link between the steepest descent method (2) and the unified iteration scheme (5) visible. For that purpose, let us first recall the steepest descent method (2), which involves the gradient ∇H(u_n) ∈ X of H at u_n ∈ X. By definition it holds that, for fixed u ∈ X,

  (∇H(u), v)_X = ⟨H′(u), v⟩ for all v ∈ X.

In particular, the gradient depends on the considered inner product, which shall be indicated by a subscript in the following; i.e., we write ∇_X H(u) for the gradient of H at u ∈ X with respect to the inner product (·, ·)_X. Indeed, this follows the Sobolev gradient approach as introduced in a series of papers by Neuberger, see, e.g., [37,38,40,41], further developed with a focus on weighted inner products by Mahavier [27-30], and by Nittka and Sauter [42] in the context of discrete algebraic equations, and applied in various situations by Raza, Sial, and co-authors [31,32,47-53]. Actually, since the pioneering work of Neuberger, Sobolev gradient methods have been widely studied and applied to numerous problems, meaning that the reference list above is far from complete. We further refer to the monograph [39] for a very nice summary of the works of Neuberger and Mahavier, and many more results on Sobolev gradient flows. If we denote by J_X : X → X′ the Riesz isometry with respect to the inner product (·, ·)_X, then ∇_X H(u) = J_X^{-1} F(u). In turn, the steepest descent method reads as

  u_{n+1} = u_n − δ_n J_X^{-1} F(u_n),

which, for δ_n ≡ 1, coincides with the fixed-point iteration (5) for the constant preconditioning operator P[u] := J_X. If a : X × X → R is a symmetric, coercive, and bounded bilinear form on X × X, then a(·, ·) can be considered as an inner product on X whose corresponding norm ‖·‖_a is equivalent to the norm ‖·‖_X; i.e., X endowed with the inner product a(·, ·) and norm ‖·‖_a is a Hilbert space as well.
We further note that, in turn, the bilinear form a : X × X → R induces a linear and invertible operator P : X → X′ defined by

  ⟨Pv, w⟩ := a(v, w) for all v, w ∈ X.  (11)

We may then consider the gradient of H with respect to the inner product a(·, ·); i.e., for given u ∈ X, the element ∇_a H(u) ∈ X is defined by

  a(∇_a H(u), v) = ⟨F(u), v⟩ for all v ∈ X.

In view of (11) we have that a(∇_a H(u), v) = ⟨P∇_a H(u), v⟩ for all v ∈ X, and therefore ∇_a H(u) = P^{-1} F(u). In this case, the steepest descent method (2) coincides, up to some damping parameter, with the unified iteration scheme (5) for the preconditioner from (11). In particular, we obtain a preconditioning (of the simple iteration) by a change of the inner product for the steepest descent method; this was already pointed out in [9, Ch. 7.3]. Finally, similarly as was done in [20] in the context of Sobolev gradient flows for the Gross-Pitaevskii equation, we may consider an inner product that changes with the iteration; we further refer, e.g., to [9, Ch. 7.3(b)] and [39, Ch. 29.5] for variable inner products in the context of Sobolev gradients, and see also the references at the end of the section. For fixed u ∈ X, let a_u = a(u; ·, ·) : X × X → R be a symmetric, uniformly coercive, and bounded bilinear form, cf. (A1)-(A3). Consequently, for any u ∈ X, the operator P[u] : X → X′ defined by

  ⟨P[u]v, w⟩ := a_u(v, w) for all v, w ∈ X  (12)

is linear and invertible. Then, we can define the gradient of H at a given element u ∈ X with respect to the inner product a_u(·, ·) by

  a_u(∇_{a_u} H(u), v) = ⟨F(u), v⟩ for all v ∈ X;  (13)

i.e., we have that ∇_{a_u} H(u) = P[u]^{-1} F(u). In turn, the steepest descent method is given by

  u_{n+1} = u_n − δ(u_n) ∇_{a_{u_n}} H(u_n),

which, for δ(u_n) ≡ 1, n = 0, 1, 2, …, matches our unified iteration scheme (5). In particular, we have shown the following result; in this context, we also want to refer to a very close discussion in [9, Ch. 5].

Proposition 2.1 The unified iteration scheme (5) coincides with the steepest descent method for H with constant step-size δ ≡ 1, where the gradient at each iterate u_n is taken with respect to the inner product a_{u_n}(·, ·) induced by the preconditioning operator P[u_n].
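The defining relation (13) can be checked numerically in finite dimensions: for a symmetric positive definite matrix P inducing the inner product a(v, w) = vᵀPw, the gradient with respect to a is obtained by solving one linear system with P. All data below are synthetic:

```python
import numpy as np

# Numerical check in R^3 of the defining relation
# a(grad_a H(u), v) = <F(u), v> for all v, where a(v, w) = v^T P w:
# it forces grad_a H(u) = P^{-1} F(u).
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
P = M @ M.T + 3 * np.eye(3)              # SPD matrix inducing the inner product a
Fu = rng.standard_normal(3)              # stand-in for the derivative F(u)

grad_a = np.linalg.solve(P, Fu)          # gradient w.r.t. the a-inner product

# a(grad_a, v) should equal Fu . v for every test vector v
for v in np.eye(3):
    assert np.isclose(grad_a @ (P @ v), Fu @ v)
print("ok")
```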

Remark 2.2
Some advantages may result from this relation between the steepest descent method and the unified fixed-point iteration.
(1) The step-size function of the (modified) steepest descent method in the context of Proposition 2.1 is simply given by δ ≡ 1, and thus we do not need to employ, e.g., a line search or trust-region method to determine δ_n, n = 0, 1, 2, …. We note that the preconditioning operator P[u] may implicitly include a damping parameter δ(u); however, in many cases, this damping parameter can be prescribed or can easily be chosen adaptively in such a way that the decay property (8) is satisfied in each iteration step.

(2) The convergence of fixed-point iterations is well studied in the literature, also in the context of the adaptive interplay with finite element discretisations; see, e.g., [11-14,16,19]. By identifying the unified iteration scheme (5) with the steepest descent method (2) (with constant step-size function δ ≡ 1), those results also apply to the latter.

(3) On the other hand, the steepest descent method serves as the basis of the superior (nonlinear) conjugate gradient method. Hence, it might be sensible to consider the nonlinear conjugate gradient method in the case that the gradient is taken with respect to a variable inner product (induced by a preconditioning operator P[u] : X → X′ from a fixed-point iteration), cf. (13). Indeed, we will show in the next section that this gives rise to the known preconditioned nonlinear conjugate gradient (PNCG) method in the discrete setting.

Nonlinear conjugate gradient method for variable inner-products
To examine the nonlinear conjugate gradient method in the context of variable inner products, we consider a model problem, which we now introduce.

Model problem
As our model problem, we consider the following quasilinear second-order elliptic partial differential equation:

  −∇·(μ(|∇u|²)∇u) = g in Ω, u = 0 on ∂Ω,  (14)

where Ω ⊂ R² is an open, bounded, and polygonal domain. More precisely, in (3) we set X := H¹₀(Ω), which is the Sobolev space of H¹-functions with zero trace along the boundary ∂Ω, and define the operator F by

  ⟨F(u), v⟩ := ∫_Ω μ(|∇u|²)∇u · ∇v dx − ∫_Ω g v dx, u, v ∈ X.  (15)

Here, for u, v ∈ X, the inner product and norm on X are defined by (u, v)_X := (∇u, ∇v)_{L²(Ω)} and ‖u‖_X := ‖∇u‖_{L²(Ω)}, respectively. Furthermore, g ∈ L²(Ω), considered as an element of the dual space H^{−1}(Ω) := H¹₀(Ω)′, is a given source function, and the diffusion coefficient μ ∈ C¹([0, ∞)) satisfies the monotonicity condition

  m_μ(t − s) ≤ μ(t²)t − μ(s²)s ≤ M_μ(t − s), t ≥ s ≥ 0,  (16)

for some constants M_μ ≥ m_μ > 0. Given those assumptions, the nonlinear operator F : X → X′ from (15) satisfies the conditions (F1) and (F2) with ν = m_μ and L_F = 3M_μ; see, e.g., [57, Prop. 25.26]. We note the weak form of our model problem (14): find u ∈ X such that

  ∫_Ω μ(|∇u|²)∇u · ∇v dx = ∫_Ω g v dx for all v ∈ X.  (17)

It is straightforward to verify that F is a potential operator with the potential given by

  H(u) := ∫_Ω ψ(|∇u|²) dx − ∫_Ω g u dx,

where ψ(s) := ½ ∫₀^s μ(t) dt for s ≥ 0. As F is strongly monotone, it immediately follows that H is strictly convex. Furthermore, the Cauchy-Schwarz inequality, the Poincaré-Friedrichs inequality (with constant denoted by C_P), and the assumption (16) imply that

  H(u) ≥ (m_μ/2)‖u‖_X² − C_P‖g‖_{L²(Ω)}‖u‖_X → ∞ as ‖u‖_X → ∞;

thus H is weakly coercive. In particular, (H1)-(H3) are satisfied. If we further assume that μ is monotonically decreasing, i.e., μ′(t) ≤ 0 for all t ≥ 0, then the Zarantonello iteration, the Kačanov scheme, and the damped Newton method all satisfy, for suitable damping functions, the assumptions (A1)-(A3) as well as (8); see [16,18,19].
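The monotonicity condition (16) can be verified numerically for a given coefficient by sampling difference quotients of the map t ↦ μ(t²)t. The decreasing coefficient μ(t) = 2 + 1/(1 + t) below is a made-up example (not the one used later in the experiments), for which the quotients stay between valid constants m_μ = 1.5 and M_μ = 3:

```python
import numpy as np

# Sanity check of the monotonicity condition (16) for a sample decreasing
# diffusion coefficient mu(t) = 2 + 1/(1 + t); chosen for illustration only.
def mu(t):
    return 2.0 + 1.0 / (1.0 + t)

t = np.linspace(0.0, 50.0, 2001)
phi = mu(t**2) * t                       # t -> mu(t^2) t, the monotone map in (16)
slopes = np.diff(phi) / np.diff(t)       # difference quotients

# analytically phi'(t) = 2 + (1 - t^2)/(1 + t^2)^2 lies in [1.875, 3],
# so (16) holds with, e.g., m_mu = 1.5 and M_mu = 3
print(slopes.min() >= 1.5, slopes.max() <= 3.0 + 1e-9)
```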

Discretisation of the model problem
Since X = H¹₀(Ω) is an infinite-dimensional space, we cannot compute the sequences generated by any of the iteration schemes presented before. In order to cast them into a computational framework, we will consider a discretisation by a conforming finite element method. In particular, for a given triangulation T_h of Ω, the corresponding finite element space is given by

  X_h := {v ∈ X : v|_K ∈ P_p(K) for all K ∈ T_h},  (21)

where, for fixed p ∈ N, P_p(K) signifies the space of all polynomials of total degree at most p on K ∈ T_h. Then, the discretisation of the weak problem (17) reads as follows: find u_h ∈ X_h such that

  ∫_Ω μ(|∇u_h|²)∇u_h · ∇v dx = ∫_Ω g v dx for all v ∈ X_h.  (22)

Furthermore, upon defining the bilinear form

  B_h(u; v, w) := ∫_Ω μ(|∇u|²)∇v · ∇w dx, u, v, w ∈ X_h,

the discrete weak problem (22) can be stated equivalently as follows: find u_h ∈ X_h such that

  B_h(u_h; u_h, v) = (g, v)_{L²(Ω)} for all v ∈ X_h.  (23)

We emphasise that, for any u ∈ X_h, B_h(u; ·, ·) : X_h × X_h → R is a symmetric, uniformly coercive, and bounded bilinear form. In particular, we have that

  m_μ‖v‖_X² ≤ B_h(u; v, v) and |B_h(u; v, w)| ≤ M_μ‖v‖_X‖w‖_X for all u, v, w ∈ X_h.

Consequently, thanks to the Lax-Milgram theorem, (23) has a unique solution. Moreover, if we define F_h : X_h → X_h′ by

  ⟨F_h(u), v⟩ := B_h(u; u, v) − (g, v)_{L²(Ω)}, u, v ∈ X_h,  (24)

then we can state (23) in the form of an operator equation: F_h(u_h) = 0 in X_h′. Let {φ_1, …, φ_{m_h}} be a basis of X_h, with m_h := dim X_h. Then the map Φ : R^{m_h} → X_h, defined by Φ(c) := Σ_{i=1}^{m_h} c_i φ_i for c = (c_1, …, c_{m_h})^T ∈ R^{m_h}, is one-to-one; here, (·)^T denotes the matrix transposition. By invoking this isomorphism we may consider the discrete weak Eq. (23) as a problem in R^{m_h}: find u ∈ R^{m_h} such that

  (⟨F_h(Φ(u)), φ_i⟩)_{i=1}^{m_h} = 0 in R^{m_h}.  (26)

Here and in the following, in the context of matrices and vectors in R^{m_h × m_h} and R^{m_h}, respectively, we denote by '·' the usual matrix product.
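As a minimal illustration of the passage from (22) to an algebraic problem of the form (26), the following sketch assembles and solves the one-dimensional linear analogue −u″ = 1 on (0, 1) with P1 hat functions. The paper's problem is two-dimensional and nonlinear; this only shows how the coefficient vector c ∈ R^{m_h} represents u_h through the basis:

```python
import numpy as np

# 1D P1 finite elements on a uniform mesh for -u'' = 1, u(0) = u(1) = 0.
m = 63                                   # number of interior nodes (m_h)
h = 1.0 / (m + 1)
# stiffness matrix of the hat-function basis: tridiag(-1, 2, -1) / h
A = (np.diag(2.0 * np.ones(m)) - np.diag(np.ones(m - 1), 1)
     - np.diag(np.ones(m - 1), -1)) / h
g = h * np.ones(m)                       # load vector: int_0^1 phi_i dx = h
c = np.linalg.solve(A, g)                # coefficients of u_h w.r.t. the basis

# exact solution u(x) = x(1 - x)/2; for this problem the P1 method on a
# uniform mesh reproduces the exact solution at the nodes
x = np.linspace(h, 1.0 - h, m)
print(np.allclose(c, 0.5 * x * (1.0 - x)))
```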

Algebraic gradient descent method and preconditioning
By invoking the isomorphism Φ : R^{m_h} → X_h, the operator F_h : X_h → X_h′ from (24) can be considered as an operator F_h : R^{m_h} → R^{m_h} given by

  (F_h(u))_i := ⟨F_h(Φ(u)), φ_i⟩, i = 1, …, m_h.

In particular, this is the algebraic gradient with respect to the Euclidean inner product on R^{m_h}, and the corresponding gradient descent method for (26) reads as

  u_{n+1} = u_n − δ(u_n) F_h(u_n), n = 0, 1, 2, …,

where u_0 ∈ R^{m_h} is an initial guess and δ(u_n) > 0, for n = 0, 1, 2, …, are suitable step-sizes. We emphasise that the algebraic gradient is completely detached from the original partial differential equation (arising as the mathematical model of, e.g., a physical problem). Therefore, we should rather consider the discrete (vector) version of the generalised gradient from (13). In particular, for given u ∈ R^{m_h}, let ∇_{a_u} H(u) ∈ R^{m_h} be such that

  P_h(u) · ∇_{a_u} H(u) = F_h(u), i.e., ∇_{a_u} H(u) = P_h(u)^{-1} · F_h(u),

where P_h(u) ∈ R^{m_h × m_h}, with entries (P_h(u))_{ij} := a(Φ(u); φ_j, φ_i), denotes the discretisation of the preconditioning operator P[Φ(u)]; in turn, the preconditioned gradient descent method reads as

  u_{n+1} = u_n − δ(u_n) P_h(u_n)^{-1} · F_h(u_n), n = 0, 1, 2, ….  (27)

Especially, if P_h(u) = P_h is independent of u ∈ R^{m_h}, then this procedure coincides with the algebraic gradient descent method for the preconditioned problem

  P_h^{-1} · F_h(u) = 0.

In particular, in the light of our observations from Sect. 2, the algebraic preconditioner arises as the discretisation of the Sobolev preconditioner; or, said differently, the preconditioned algebraic gradient method (27) matches, up to the damping parameter, the discretisation of the unified iteration scheme (6). We remark that this is not a new insight, but is already known in the literature: for linear problems, this and many more observations are well presented in [33], and in the context of nonlinear problems we refer to [24] and [9, Ch. 8].
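The effect of an algebraic preconditioner can be sketched on a small quadratic toy problem (synthetic data): with P_h equal to the system matrix, the preconditioned gradient descent converges in a single step, whereas the plain Euclidean gradient descent suffers from the condition number:

```python
import numpy as np

# Gradient descent u <- u - delta * P^{-1}(A u - b) with exact line search,
# for the ill-conditioned quadratic H(u) = 1/2 u^T A u - b^T u.
A = np.diag([1.0, 10.0, 100.0])          # condition number 100 (illustrative)
b = np.ones(3)

def run(P):
    """Return the number of iterations until the search direction vanishes."""
    u = np.zeros(3)
    for n in range(10_000):
        g = A @ u - b                    # Euclidean (algebraic) gradient
        d = np.linalg.solve(P, g)        # preconditioned direction P^{-1} g
        if np.linalg.norm(d) < 1e-12:
            return n
        delta = (g @ d) / (d @ (A @ d))  # exact step along -d
        u = u - delta * d
    return n

# with P = A (the discretised operator preconditioner) one step suffices;
# P = I recovers plain gradient descent, which needs far more iterations
print(run(A) < run(np.eye(3)))
```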
We note that the approach above, i.e., to consider the discretisation of an operator preconditioner as the algebraic preconditioner, has some major advantages: First of all, it provides a natural way to choose an algebraic preconditioner and does not require any prior knowledge about the structure of the matrices. Furthermore, in the context of the interplay with a mesh refinement procedure, we simply have to adapt the discretisation of the operator preconditioner, which is straightforward. Most importantly, since the convergence of the (fixed-point) iteration schemes is established, in general, in the continuous setting, those results (most often) apply in the discrete setting as well and are independent of the mesh size; for the latter we further refer to [21,22,24], where the mesh independence of the condition number in the context of the Sobolev (gradient) preconditioners is verified in a rather general setting.

Preconditioned nonlinear conjugate gradient method
As we have seen before, at least for our model problem, the gradient descent method with respect to an inner product a(·, ·), induced by a linear and invertible operator P : X → X′, simply leads to the preconditioned algebraic gradient descent method in the discrete setting, whereby the algebraic preconditioner is given by the discretisation of the operator P. Consequently, deriving the conjugate gradient method for the case that the gradient is taken with respect to some (variable) inner product a_u(·, ·), u ∈ X, on X simply leads to the known preconditioned nonlinear conjugate gradient method, see Algorithm 1. More details about (the derivation of) this method can be found, e.g., in the book [46], see also the article [6]. We further refer to [1] for a convergence analysis of the PNCG method.

Algorithm 1 (PNCG)
1: Input: initial guess u_0 ∈ R^{m_h}.
2: Compute g_0 := P_h(u_0)^{-1} · F_h(u_0) and set d_0 := −g_0.
3: for n = 0, 1, 2, … do
4:   Compute the step-length α(u_n) and set u_{n+1} = u_n + α(u_n)d_n.
5:   Compute g_{n+1} := P_h(u_{n+1})^{-1} · F_h(u_{n+1}).
6:   Choose the update parameter β_n and set d_{n+1} := −g_{n+1} + β_n d_n.
7: end for

Remark 3.1
Without going into too much detail, we shall provide some comments on Algorithm 1.
(1) Ideally, the step-size α(u_n) ≥ 0 is chosen so that

  α(u_n) = arg min_{α ≥ 0} H(Φ(u_n + αd_n)).  (29)

In practice, however, this minimiser will only be approximated by using, e.g., a line search or a trust-region method; we refer, e.g., to [26, §3.2]. Often, especially for convergence proofs, it is required that the choice of the step-size satisfies some version of the Wolfe conditions. The standard Wolfe conditions were introduced in [54,55]. Later on, several modified Wolfe conditions were presented; we refer to [15] and the references therein.
(2) Many different choices for the conjugate gradient update parameter β_n have been proposed in the literature, see, e.g., the extensive survey of Hager and Zhang [15] and the references therein. For the PNCG method, two of the most popular choices are the one proposed by Fletcher and Reeves [10],

  β_n^{FR} := (F_h(u_{n+1}) · g_{n+1}) / (F_h(u_n) · g_n),  (30)

and the one proposed by Polak and Ribière [43] and Polyak [44],

  β_n^{PR} := (F_h(u_{n+1}) · (g_{n+1} − g_n)) / (F_h(u_n) · g_n).

Later on, Powell proposed in the article [45] the following modified (and improved) version of the parameter β_n^{PR}:

  β_n^{PR+} := max{β_n^{PR}, 0}.  (31)
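The structure of Algorithm 1 with the Powell parameter can be sketched in code. The following toy implementation uses a fixed preconditioner P := A and an exact line search, which is possible here because the directional derivative of the toy energy is a cubic polynomial; all data are illustrative:

```python
import numpy as np

# PNCG sketch for H(u) = 1/2 u^T A u + 0.025 * sum(u^4) - b^T u on R^3,
# with fixed preconditioner P := A and Powell's update beta^{PR+}.
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 0.5],
              [0.0, 0.5, 4.0]])          # SPD matrix (illustrative)
b = np.array([1.0, -2.0, 0.5])
P = A                                    # stand-in for P_h

def grad_H(u):
    return A @ u + 0.1 * u**3 - b        # algebraic gradient F_h(u)

def exact_step(u, d):
    # phi'(a) = d . grad_H(u + a d) is a cubic in a with a unique real
    # root, since H is strictly convex along the direction d
    c3 = 0.1 * np.sum(d**4)
    c2 = 0.3 * np.sum(u * d**3)
    c1 = d @ (A @ d) + 0.3 * np.sum(u**2 * d**2)
    c0 = d @ grad_H(u)
    if c3 < 1e-14 * c1:                  # nearly quadratic: linear root suffices
        return -c0 / c1
    roots = np.roots([c3, c2, c1, c0])
    return float(roots[np.isreal(roots)].real[0])

u = np.zeros(3)
F = grad_H(u)
g = np.linalg.solve(P, F)                # preconditioned gradient (line 2)
d = -g
for n in range(50):
    u = u + exact_step(u, d) * d         # line 4 of Algorithm 1
    F_new = grad_H(u)
    g_new = np.linalg.solve(P, F_new)    # line 5
    # Powell's beta^{PR+} from (31), cf. the Polak-Ribiere formula
    beta = max(0.0, F_new @ (g_new - g) / (F @ g))
    d = -g_new + beta * d                # line 6
    F, g = F_new, g_new
    if np.linalg.norm(F) < 1e-12:
        break

print(np.linalg.norm(grad_H(u)) < 1e-10)
```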

Numerical experiment
In this section, we run a numerical experiment to compare the performance of the various iteration schemes introduced in Sect. 3.1 and their conjugated counterparts. To this end, we consider our model problem (17), where Ω := (−1, 1)² \ [0, 1] × [−1, 0] ⊂ R² is an L-shaped domain and the diffusion coefficient μ obeys the Carreau law, i.e.,

  μ(t) := μ_∞ + (μ_0 − μ_∞)(1 + λt)^{(r−2)/2}, t ≥ 0,

with μ_0 > μ_∞ > 0, λ > 0, and r ∈ (1, 2). It is straightforward to verify that this choice of the diffusion coefficient satisfies (16) with m_μ = μ_∞ and M_μ = μ_0. Moreover, since r ∈ (1, 2), the diffusion coefficient is decreasing. Therefore, the Zarantonello, Kačanov, and Newton methods converge for appropriate damping parameters. The source term g ∈ L²(Ω) is chosen in such a way that the unique solution of (17) is given by a prescribed smooth function of the Euclidean coordinates (x, y) ∈ R². Furthermore, for the discretisation of problem (17), we consider the conforming P1-finite element method, i.e., we set p = 1 in (21), whereby the mesh T_h consists of 196,608 triangles. In our experiment below, we choose the parameters μ_∞ = 1, μ_0 = 100, λ = 2, and (a) r = 1.4 or (b) r = 1.05, respectively. In order to approximate the corresponding solutions of the discretised problem (22) for the parameters from (a) and (b), respectively, we apply the Kačanov method with 1000 iteration steps. Subsequently, we examine how many iteration steps are required by the Zarantonello, Kačanov, and Newton methods, as well as their conjugated counterparts with the update parameters from (30) and (31), respectively, in order to reach an error tolerance of 10^{−6} with respect to the norm ‖·‖_X in X.

Table 1 Number of iteration steps required to reach the error tolerance 10^{−6} for the experiments (a) and (b). Here, 'FP' signifies the usual fixed-point iteration, and the columns (30) and (31) refer to the PNCG method with the respective update parameter. If the prescribed accuracy was not obtained within 100 steps, this is remarked by the symbol '-'.

              (a) FP  (a) (30)  (a) (31)  (b) FP  (b) (30)  (b) (31)
Zarantonello    61      15        15        -       36        36
Kačanov         25       9        10        90      19        24
Newton           5       7         6         7      15         8
In each case we choose the function u_0 ≡ 0 ∈ X_h as our initial guess. Moreover, the one-dimensional optimisation problem from line 4 of the PNCG Algorithm 1, cf. (29), is solved by the Matlab subroutine fmincon from the Optimization Toolbox with default options; this part of the algorithm could certainly be improved by a more sophisticated minimisation procedure.
In Table 1 we record the number of iteration steps that were performed by our nonlinear solvers to reach an error tolerance of 10^{−6}. If this accuracy was not achieved within 100 iteration steps, then the calculations were aborted, signified by '-' in the table. The damping parameters for the Zarantonello iteration (18) were chosen to be δ_Z = 0.01 in (a) and δ_Z = 0.02 in (b), respectively, as they appeared to be close to optimal. Moreover, in both cases we set δ_N ≡ 1 in (19), i.e., we considered the classical (undamped) Newton method. We emphasise that neither the algebraic gradient descent method nor the conjugate gradient method (without preconditioning) converged in a reasonable number of iteration steps; hence, they are not included in the table.
As we can see from Table 1, the Newton method outperformed the other iteration schemes for the specific problem considered. Indeed, the classical Newton method was even slightly superior to its conjugated counterparts; this was possibly caused by a suboptimal numerical solution of the one-dimensional optimisation problem (29) in our computations, whereas the classical Newton method exhibits quadratic convergence close to a solution. In contrast, for the Kačanov and Zarantonello schemes, we observe that the corresponding PNCG methods require significantly fewer iterations to reach the prescribed error tolerance, at least in the given experiment. Indeed, as can be seen in Fig. 1, the quotients of two successive errors, measured in the X-norm, are always smaller for the corresponding PNCG methods than for the Zarantonello and Kačanov schemes, which, in contrast, is not the case for the Newton method. However, we should be aware that the preconditioned conjugate gradient methods require the additional solution of a one-dimensional minimisation problem, which, in general, does not come for free.
It remains to experimentally highlight the mesh independence of our iterative solvers. For that purpose, we rerun experiment (a) from before on both a coarser and a finer mesh; we remark that the meshes are hierarchical, i.e., the finer meshes are obtained by (uniform) refinement of the coarser ones. As shown in Table 2, where 'noe' is used as an abbreviation for 'number of elements', the number of iteration steps required to reach an error tolerance of 10^{−6} is indeed independent of the mesh size, at least for the experiment considered herein. Finally, we note that for more complicated problems the domain of convergence of the Newton scheme can be rather small, so that other nonlinear solvers, such as the Kačanov and Zarantonello methods, have to be considered; see, e.g., [17, §5.1]. Hence, those iteration schemes are certainly worth studying. Moreover, as we have observed above, their conjugated counterparts are able to accelerate the convergence (in view of the number of iteration steps), at least for the model problem considered.

Conclusion
As we have seen, up to a damping function, the fixed-point iteration obtained by a preconditioning operator coincides with the steepest descent method using the variable inner product that is induced by this preconditioning operator. Moreover, in view of the corresponding discretised problem in R^{m_h}, the operator preconditioner acts as an algebraic preconditioner and, in turn, leads to the preconditioned algebraic gradient descent method. Our numerical experiment illustrated that the choice of a problem-related (operator) preconditioner may significantly improve the convergence of the nonlinear conjugate gradient method; furthermore, in that case, the convergence rate is independent of the mesh size.