Nonlinear Forward-Backward Splitting with Momentum Correction

The nonlinear, or warped, resolvent recently explored by Giselsson and Bùi-Combettes has been used to model a large set of existing and new monotone inclusion algorithms. To establish convergent algorithms based on these resolvents, both works rely on corrective projection steps. We present a different way of ensuring convergence, by means of a nonlinear momentum term, which in many cases leads to a cheaper per-iteration cost. The expressiveness of our method is demonstrated by deriving a wide range of special cases. These cases cover and expand on the forward-reflected-backward method of Malitsky-Tam, the primal-dual methods of Vũ-Condat and Chambolle-Pock, and the forward-reflected-Douglas-Rachford method of Ryu-Vũ. We also present a new primal-dual method that uses an extra resolvent step, as well as a general approach for adding momentum to any special case of our nonlinear forward-backward method, in particular all the algorithms listed above.


Introduction
Given a real Hilbert space H, we consider the problem of finding a zero x ∈ H of the sum of a maximally monotone operator A : H → 2^H and a cocoercive operator C : H → H, i.e.,

0 ∈ Ax + Cx. (1)

If the resolvent (Id + A)^{−1} of A is easily computable, this problem can be solved with the forward-backward splitting method [1,2]. Since this might not be the case, great effort has been devoted to constructing other splitting methods that can exploit any additional structure of A, sometimes further assuming C = 0 [3–9]. This work presents an alternative approach for analyzing and constructing such splitting methods by formulating them as different instances of a forward-backward method with a nonlinear resolvent (M + A)^{−1} ∘ M, where M : H → H is a (potentially) nonlinear kernel. Nonlinear resolvents, or warped resolvents in the terminology of [10], were recently explored in [10,11], with precursors available in [12,13]. These works are preceded by, or developed in parallel with, several other generalizations of the concept of a resolvent. Using a resolvent with a strongly positive self-adjoint bounded linear kernel P in the standard forward-backward method has long been known to converge. In fact, it is simply forward-backward splitting applied to the scaled problem 0 ∈ P^{−1}Ax + P^{−1}Cx, which is a monotone inclusion problem in the Hilbert space given by the inner product ⟨P(·), ·⟩. The conditions on the kernel have been further relaxed in [14], which allows for non-self-adjoint linear kernels. In the many works on Bregman-distance based resolvents, for instance [15–21], the linearity condition is dropped altogether by allowing the kernel to be the gradient of some differentiable convex function. These relaxations allow the resolvent to be adapted to a particular problem, either to improve the speed of convergence or to make an otherwise intractable resolvent evaluation tractable. However, this extra freedom may come at a cost. The algorithms of
[10,11,13,14] all need an extra corrective projection step to ensure that any nonlinearities and asymmetries of the kernel do not prevent convergence. The primary contribution of this paper is a different approach for correcting the update, removing the need to perform a potentially expensive projection. Convergence is instead ensured with a corrective momentum term that reuses information from previous iterations, making it possible to achieve lower per-iteration costs.
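For orientation, the classical forward-backward iteration x_{k+1} = (Id + γA)^{−1}(x_k − γCx_k) referred to above can be sketched on a lasso-type instance, with A = ∂(lam·‖·‖₁), whose resolvent is soft thresholding, and C the gradient of a least-squares term. The data M, b and the parameter lam below are illustrative choices, not from the paper.

```python
import numpy as np

def soft(v, t):
    """Resolvent (Id + t*d||.||_1)^(-1): componentwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(0)
M = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
lam = 0.1
ell = np.linalg.norm(M, 2) ** 2      # C = grad(0.5*||Mx-b||^2) is (1/ell)-cocoercive
gamma = 1.0 / ell                    # step-size well inside gamma < 2/ell

x = np.zeros(5)
for _ in range(5000):
    grad = M.T @ (M @ x - b)                  # forward (explicit) step on C
    x = soft(x - gamma * grad, gamma * lam)   # backward (resolvent) step on A

# fixed-point residual of the forward-backward map
residual = np.linalg.norm(x - soft(x - gamma * (M.T @ (M @ x - b)), gamma * lam))
```

A (numerically) vanishing fixed-point residual certifies that the limit satisfies 0 ∈ Ax + Cx.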
The strength of nonlinear resolvents lies in their substantial modeling power, which allows for a unified view of a large set of algorithms. Both [10,11] present numerous algorithms that can be interpreted as forward-backward methods with nonlinear resolvents. Our new nonlinear forward-backward method further expands on these modeling capabilities, and the second half of this paper is dedicated to deriving both new and existing algorithms as special cases.
Among already existing methods, we show that the forward-(half)-reflected-backward method in [22] is a special case of our method and highlight its connection to the similar forward-backward-(half)-forward method [23,24] via the nonlinear resolvent. We present two new four-operator primal-dual splitting methods, the first of which has, among others, Vũ-Condat [25,26] and Chambolle-Pock [27] as special cases. Vũ-Condat and Chambolle-Pock have been shown to be ordinary forward-backward methods [28] and to have Douglas-Rachford splitting [3] as a special case. Our first primal-dual method is an expansion of this to the nonlinear resolvent setting, giving us the forward-reflected-Douglas-Rachford method of [29] and the novel forward-half-reflected-Douglas-Rachford method as special cases. Our second primal-dual method solves the same problem as the first one but utilizes three resolvent steps, two of which are of the same operator. This method is, as far as we know, completely novel.
Different kinds of momentum have long been used to accelerate the convergence of first-order methods [30–37] and, due to the use of a momentum-like correction term, our nonlinear forward-backward method naturally lends itself to modeling momentum methods. Momentum can be incorporated directly into the design of a special case of our main algorithm, but we also present an approach to add momentum to any special case, regardless of whether it was initially designed with momentum or not. The approach is demonstrated on the forward-half-reflected-backward method of [22], which gives a novel momentum algorithm that extends the relaxed momentum algorithm in [22] to include a cocoercive term. Our convergence conditions compare favorably to previous work, with a larger range of possible choices of the momentum parameter, even in the more restrictive special case of ordinary forward-backward splitting with momentum.

Outline
We start by presenting basic notation and preliminary results, and by defining some operator properties. The proposed nonlinear forward-backward algorithm, along with all necessary assumptions on both the problem (1) and the different design parameters, is presented in Sect. 2. Section 3 contains the main convergence proof.
In the remainder of the paper, we present and discuss new or already existing special cases of our nonlinear forward-backward method. Section 4 presents a way of adding momentum to any special case of our main algorithm. Section 5 derives the forward-half-reflected-backward method of [22] as a special case and uses the previously presented approach to add momentum to it. Two new primal-dual methods are derived in Sect. 6. Section 6.1 contains an algorithm that expands on the methods of Vũ-Condat and Chambolle-Pock as well as the forward-reflected-Douglas-Rachford method of [29]. In Sect. 6.2, a, to the authors' knowledge, completely new primal-dual method that uses one additional resolvent evaluation per iteration is derived. We end the paper with a brief conclusion.

Notation and Preliminaries
Let R be the set of real numbers, N = {0, 1, . . .} be the set of natural numbers, N+ = {1, 2, . . .} be the set of non-zero natural numbers, and let H be a real Hilbert space. The set P(H) is the set of bounded linear operators S : H → H that are self-adjoint and strongly positive, i.e., there exists m > 0 such that ⟨Sx, x⟩ ≥ m‖x‖² for all x ∈ H. If S ∈ P(H), then S is invertible and S^{−1} ∈ P(H).
For the remainder of this section, we let S ∈ P(H). The scaled inner product is defined as ⟨·, ·⟩_S = ⟨S(·), ·⟩ and the scaled norm as ‖·‖_S = √⟨·, ·⟩_S. The unscaled and scaled norms are equivalent, i.e., there exist M, m > 0 such that M‖x‖ ≥ ‖x‖_S ≥ m‖x‖ for all x ∈ H. For all a, b, c, d ∈ H, we have the identity

2⟨a − b, c − d⟩_S = ‖a − d‖²_S − ‖a − c‖²_S + ‖b − c‖²_S − ‖b − d‖²_S.

A set-valued operator A : H → 2^H is monotone if ⟨u − v, x − y⟩ ≥ 0 for all (x, u), (y, v) ∈ gra A, where gra A = {(x, u) | u ∈ Ax} is the graph of A. An operator A is maximally monotone if it is monotone and its graph is not a proper subset of the graph of another monotone operator.
For μ > 0, a maximally monotone operator A : H → 2^H is μ-strongly monotone w.r.t. S if ⟨u − v, x − y⟩ ≥ μ‖x − y‖²_S for all (x, u), (y, v) ∈ gra A. This definition is equivalent to ordinary μ-strong monotonicity of S^{−1} ∘ A in the Hilbert space given by the scaled inner product ⟨·, ·⟩_S. The analogous equivalences hold for the two following definitions as well. For L ≥ 0, an operator B : H → H is L-Lipschitz continuous w.r.t. S if ‖Bx − By‖_{S^{−1}} ≤ L‖x − y‖_S for all x, y ∈ H. For ℓ > 0, an operator C : H → H is ℓ^{−1}-cocoercive w.r.t. S if ⟨Cx − Cy, x − y⟩ ≥ ℓ^{−1}‖Cx − Cy‖²_{S^{−1}} for all x, y ∈ H. An ℓ^{−1}-cocoercive operator w.r.t. S is ℓ-Lipschitz continuous w.r.t. S. For all operator properties, if no scaling S is explicitly stated, we mean S = Id. Let C be an ℓ^{−1}-cocoercive operator w.r.t. S. Then the following three-point inequality holds:

⟨Cx − Cz, x′ − z⟩ ≥ −(ℓ/4)‖x′ − x‖²_S (3)

for all x, x′, z ∈ H. This is shown by adding and subtracting x in the second argument of the inner product on the left-hand side and using cocoercivity on one term and Young's inequality, with parameter ε > 0, on the other. Selecting ε = 2ℓ^{−1} yields the desired inequality (3). If C = 0 or is constant, (3) holds with ℓ = 0.
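The three-point inequality can be sanity-checked numerically. The sketch below assumes the form ⟨Cx − Cz, x′ − z⟩ ≥ −(ℓ/4)‖x′ − x‖²_S with S = Id and takes C to be the gradient of a convex quadratic, an illustrative choice not from the paper.

```python
import numpy as np

# C = grad(0.5 x'Qx) is ell^{-1}-cocoercive with ell = ||Q||; check
#   <Cx - Cz, x' - z> + (ell/4)*||x' - x||^2  >=  0
# on many random triples (x, x', z).
rng = np.random.default_rng(1)
M = rng.standard_normal((8, 4))
Q = M.T @ M
ell = np.linalg.norm(Q, 2)

worst = np.inf
for _ in range(10000):
    x, xp, z = rng.standard_normal((3, 4))
    lhs = (Q @ x - Q @ z) @ (xp - z) + ell / 4 * np.sum((xp - x) ** 2)
    worst = min(worst, lhs)
# 'worst' stays nonnegative (up to rounding) if the inequality holds
```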

Problem and Algorithm
Apart from the general problem structure of (1), we further assume that the operators satisfy the following standard assumptions.

Assumption 2.1 Assume that: the operator A : H → 2^H is maximally monotone; the operator C : H → H is ℓ^{−1}-cocoercive w.r.t. S for some S ∈ P(H) and ℓ ≥ 0; and the solution set zer(A + C) = {x ∈ H | 0 ∈ Ax + Cx} is nonempty.
Algorithm 1 Nonlinear Forward-Backward with Momentum Correction Consider problem (1) and let S be such that Assumption 2.1 is satisfied. With x_0, u_0 ∈ H, for all k ∈ N iteratively perform

x_{k+1} = (M_k + A)^{−1}(M_k x_k − C x_k + γ_k^{−1} u_k),
u_{k+1} = γ_k (M_k x_{k+1} − M_k x_k) − S(x_{k+1} − x_k),

where M_k : H → H and γ_k > 0.
Since dom C = H, the sum A + C is maximally monotone and the problem could be reformulated as finding a zero of the single maximally monotone operator A + C. However, as in ordinary forward-backward splitting, separating the problem into a maximally monotone and a cocoercive term and utilizing this structure will prove beneficial. The fact that we assume cocoercivity w.r.t. S entails no real restriction on the problem since the scaled norm ‖·‖_S is equivalent to ‖·‖. A cocoercive operator w.r.t. S is therefore also cocoercive w.r.t. all other Ŝ ∈ P(H) and vice versa, but with different cocoercivity constants.
The cocoercivity scaling S is utilized directly in our algorithm. In the simplest setting, S acts as a form of preconditioning used to better adapt the algorithm to the specific geometry of the problem. It can also be used as a more general design parameter, with different choices of S yielding different instances of our algorithm; see the primal-dual methods in Sect. 6 for examples of this. Along with the scaling S, the algorithm has two additional iteration-dependent design parameters, a nonlinear kernel M_k : H → H and a positive momentum parameter γ_k > 0. Compared to [10,11], the elements of the sequence (x_k)_{k∈N} are given directly by a nonlinear forward-backward step and do not need an extra projection step. Convergence is instead ensured by the addition of the corrective term u_k to the forward step. The main benefit of this approach is in how the corrective term u_k is computed. Both Algorithm 1 and the corresponding algorithm with projection correction [11, Algorithm 3.1] will in general need to evaluate M_k at two points. For Algorithm 1, the two points are x_k and x_{k+1}, but this means that M_k and M_{k+1} are evaluated at the same point, i.e., x_{k+1}. The cost of one of these evaluations can then be reduced if M_k and M_{k+1} are similar, for instance if M_{k+1} x_{k+1} is a scalar multiple of M_k x_{k+1}. In order for [11, Algorithm 3.1] to also evaluate M_k at x_k and x_{k+1}, it is required that M_k = α_k^{−1} S with S ∈ P(H) and α_k > 0 for all k ∈ N. The only instance of [11, Algorithm 3.1] that satisfies this condition is ordinary forward-backward splitting in the scaled metric given by ‖·‖_S. This is in contrast to our work, where all but one of the special cases we cover (Algorithm 6 being the exception) have kernels that allow this reduction in computational cost.
The more similar M_k and γ_k^{−1}S are in Algorithm 1, the more similar the nonlinear resolvent is to an ordinary scaled resolvent (γ_k^{−1}S + A)^{−1} ∘ γ_k^{−1}S and the smaller the corrective term u_{k+1} will be. No correction, i.e., u_{k+1} = 0, is applied when M_k = γ_k^{−1}S, and Algorithm 1 then reduces to ordinary forward-backward splitting. We quantify the difference between M_k and γ_k^{−1}S in the following assumption on the design parameters of Algorithm 1.
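Since the display equations of Algorithm 1 are not reproduced above, the following sketch is based on our reading of the update, x_{k+1} = (M_k + A)^{−1}(M_k x_k − Cx_k + γ_k^{−1}u_k) with u_{k+1} = γ_k(M_k x_{k+1} − M_k x_k) − S(x_{k+1} − x_k); treat this exact form as an assumption, inferred from the special cases derived later. It illustrates the claim just made: with M_k = γ^{−1}Id the correction stays identically zero and the scheme collapses to forward-backward splitting. All problem data are illustrative.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def nfb_step(x, u, M, gamma, backward, C, S=lambda v: v):
    """One step of the assumed nonlinear forward-backward update:
    x+ = (M + A)^(-1)(M x - C x + u/gamma),
    u+ = gamma*(M x+ - M x) - S(x+ - x)."""
    x_new = backward(M(x) - C(x) + u / gamma)   # user-supplied (M + A)^(-1)
    u_new = gamma * (M(x_new) - M(x)) - (S(x_new) - S(x))
    return x_new, u_new

# Instance: A = d(lam*||.||_1), C(x) = x - b, kernel M = gamma^{-1} Id,
# so (M + A)^(-1)(v) = prox_{gamma*A}(gamma*v).
b, lam, gamma = np.array([1.0, -0.4, 0.1]), 0.3, 0.5
M_op = lambda v: v / gamma
backward = lambda v: soft(gamma * v, gamma * lam)
C_op = lambda v: v - b

x, u = np.zeros(3), np.zeros(3)
for _ in range(200):
    x, u = nfb_step(x, u, M_op, gamma, backward, C_op)
# with M = gamma^{-1} Id the correction term never activates
```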

Assumption 2.2 Assume that:
(i) The sequence (γ_k)_{k∈N} is positively lower bounded, i.e., for each k ∈ N, γ_k ≥ γ for some γ > 0. (ii) For each k ∈ N, the nonlinear kernel M_k : H → H is such that γ_k M_k − S is L_k-Lipschitz continuous w.r.t. S for some L_k ≥ 0.

These assumptions will form the basis of our convergence analysis. First, we will use them to infer a few useful properties of the nonlinear kernel M_k.

Proposition 2.1 Let Assumption 2.2 hold with L_k ∈ [0, 1) for all k ∈ N. Then M_k is 2γ^{−1}-Lipschitz continuous w.r.t. S, maximally monotone, and strongly monotone w.r.t. S for all k ∈ N.
where we have used that M_k is monotone and ρ_k-strongly monotone w.r.t. S. Maximality of M_k follows from its continuity and monotonicity [38, Corollary 20.28].

Convergence
The convergence of Algorithm 1 will be established by the convergence of a quantity V_k, defined in Lemma 3.1. The quantity V_k consists of the distance from the corrected iterate x_k + S^{−1}u_k to an arbitrary solution (measured in the scaled norm ‖·‖_S) and a residual term. Theorem 3.1 will then establish the main convergence result. Before that, we show that the algorithm generates a well-defined infinite sequence.
for all k ∈ N+ where

Proof By Proposition 3.1, we have that the sequences (x_k)_{k∈N} and (u_k)_{k∈N} are well-defined, which implies that all quantities of the lemma are well-defined. Let k ∈ N+ be arbitrary.
From Algorithm 1 we know that

Using the definition of (M_k + A)^{−1}, multiplying with γ_k, and rearranging yields

Since z ∈ zer(A + C), we have −Cz ∈ Az. Using monotonicity of γ_k A and multiplying by 2 gives

Applying (3) to the last term gives

where we have set

We can expand the second-to-last norm, assume L_{k−1} > 0, and use Young's inequality to get

By definition we have

We also note that this inequality holds when L_{k−1} = 0 since u_k = 0 in that case. Inserting this back into (5) and using Lipschitz continuity of γ_k M_k − S on the last term yields

Rearranging this expression gives the inequality of the lemma.
Theorem 3.1 Let Assumptions 2.1 and 2.2 hold. If there exists an ε > 0 such that

for all k ∈ N+, then Algorithm 1 satisfies the following as k → ∞: x_k ⇀ x for some x ∈ zer(A + C).
Proof Let z ∈ zer(A + C). Applying Lemma 3.1 and adding the inequality (4) gives

The second-to-last inequality holds since 0 ≤ L_k < 1 for all k ∈ N by the assumptions, and the condition (6) of the theorem ensures that the remaining factor is bounded below by ε > 0 for all k ∈ N+. Item (ii) follows from (i), the definition of u_k, and from the L_k-Lipschitz continuity of γ_k M_k − S, where L_k < 1 for all k ∈ N.
Let k ∈ N. For (iii), we first note from the nonlinear forward-backward step in Algorithm 1 that

which, by adding Cx_{k+1} to both sides, gives

The result then follows from (i) and (ii) since, for all k ∈ N, γ_k > γ and M_k and C are Lipschitz continuous w.r.t. S with constants 2γ^{−1} and ℓ respectively; see Proposition 2.1 and Assumption 2.1.
Since A + C is maximally monotone, (iii) implies that all weak sequential cluster points of (x_k)_{k∈N} belong to zer(A + C) due to the weak-strong sequential closedness of graphs of maximally monotone operators [38, Proposition 20.38]. To show the weak convergence result in (iv), in view of [38, Lemma 2.47], it is enough to show that (‖x_k − z‖_S)_{k∈N} converges for all z ∈ zer(A + C). The proof of [38, Lemma 2.47] actually only covers the case when (‖x_k − z‖)_{k∈N} converges, but the generalization is straightforward.
For any z ∈ zer(A + C), Lemma 3.1 and the condition (6) imply that (V_k)_{k∈N+} is a nonincreasing nonnegative sequence which therefore converges, say, V_k → ν. This convergence implies
due to (i) and 0 ≤ L_{k−1} < 1. The sequence (x_k + S^{−1}u_k − z)_{k∈N} is then bounded, which, together with (ii), yields

Explicit Iterate Momentum
Consider the following variant of Algorithm 1 that adds an additional scaled momentum term γ_k^{−1}θS(x_k − x_{k−1}). We will show in Corollary 4.2 that there always exists a θ ≠ 0, possibly negative, such that if Algorithm 1 converges, so does Algorithm 2. This shows that it is always possible to add this type of iterate momentum to an instance of Algorithm 1. We will use this in the next section to develop a new momentum variant of the forward-half-reflected-backward method. Although it might seem like Algorithm 2 has more degrees of freedom than Algorithm 1, this is not the case. In fact, Algorithm 2 is equivalent to Algorithm 1; we show and use this in the proofs below. Algorithm 2 is therefore first and foremost a tool for adding momentum to an already known instance of Algorithm 1, and its usefulness comes via the following corollary, which gives an explicit convergence condition.

Corollary 4.1 Let Assumptions 2.1 and 2.2 hold and let θ < 1. If there exists an ε > 0 such that

for all k ∈ N+, then Algorithm 2 satisfies the following as k → ∞: x_k ⇀ x for some x ∈ zer(A + C).
Proof By defining γ̂_k = γ_k/(1 − θ) and û_{k+1} = (1/(1 − θ))u_{k+1} + (θ/(1 − θ))S(x_{k+1} − x_k), the update of Algorithm 2 can equivalently be written as (8), which is the same as the update of Algorithm 1 but with γ̂_k and û_k instead of γ_k and u_k respectively. Algorithm 2 is therefore equivalent to Algorithm 1. Since, by Assumption 2.2, γ_k M_k − S is L_k-Lipschitz continuous w.r.t. S, we conclude that γ̂_k M_k − S is (L_k + |θ|)/(1 − θ)-Lipschitz continuous w.r.t. S. We further have that γ̂_k = γ_k/(1 − θ) ≥ γ/(1 − θ) > 0, and Assumption 2.2 is therefore satisfied for (8). The convergence condition (6) from Theorem 3.1 for the algorithm update (8) is then that there exists an ε > 0 such that

Multiplication of both sides by 1 − θ and noting that θ < 1 gives the equivalent condition that there exists an ε > 0 such that

The convergence results for Algorithm 2 then follow directly from Theorem 3.1.

Corollary 4.2
If the conditions of Theorem 3.1 hold, implying that Algorithm 1 converges to a solution of (1), then there exists a θ ≠ 0 with θ < 1 such that the conditions of Corollary 4.1 also hold and the momentum method in Algorithm 2 converges to a solution of (1).
Proof The assumptions on A, C, S, M_k, and γ_k of Theorem 3.1 and Corollary 4.1 are identical, so it is enough to show that there exists a θ ≠ 0 with θ < 1 such that the convergence condition (7) of Corollary 4.1 is implied by the conditions of Theorem 3.1. Since Theorem 3.1 holds, we know that

Since ε > 0, there exists a θ such that −ε/2 < θ < ε/6, θ ≠ 0, and θ < 1. Selecting such a θ yields ε/2 > θ + 2|θ| > 0 and

which is the convergence condition (7) for Algorithm 2.
Remark 4.1 From Corollary 4.2, we know that we can always add momentum to an instance of Algorithm 1 and still get a convergent algorithm. In most cases, the per-iteration computational cost of the momentum variant is similar to that of the basic method. However, it is possible for the momentum variant not to be tractable. More precisely, it might not be possible to cheaply evaluate the correction term u_k. We will show an example of this in Algorithm 6. For Algorithm 6, this problem can be handled by introducing a θ-dependent term in the nonlinear kernel.
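As an illustration of this type of iterate momentum, consider the instance M_k = γ^{−1}Id on a lasso-type problem; under our reading of Algorithm 2, the update then becomes x_{k+1} = (Id + γA)^{−1}(x_k − γCx_k + θ(x_k − x_{k−1})). The sketch below runs it with negative momentum θ = −0.3, a regime discussed in the next section; all problem data are illustrative.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Forward-backward with iterate momentum for 0 in d(lam*||x||_1) + (x - b):
#   x+ = prox_{gamma*lam*||.||_1}(x - gamma*(x - b) + theta*(x - x_prev))
b, lam = np.array([1.0, -0.4, 0.1]), 0.3
gamma, theta = 0.5, -0.3            # negative momentum

x = x_prev = np.zeros(3)
for _ in range(500):
    x, x_prev = soft(x - gamma * (x - b) + theta * (x - x_prev), gamma * lam), x
```

The limit agrees with the known minimizer soft(b, lam) of the underlying composite problem.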

Forward-Half-Reflected-Backward with Iterate Momentum
We present the forward-half-reflected-backward with iterate momentum algorithm in Algorithm 3 as a special case of Algorithm 2. It is a method for finding x ∈ H such that

0 ∈ Bx + Dx + Cx (9)

for which the following assumption holds.
By letting A = B + D, problem (9) can be seen as an instance of our standard problem formulation (1). By letting S = Id, Assumption 5.1 implies that Assumption 2.1 holds with ℓ = β. With these choices, Algorithm 3 is obtained from Algorithm 2 by choosing M_k = α_k^{−1} Id − D and γ_k = α_k for some step-size α_k > 0. The backward step of the algorithm becomes

(M_k + A)^{−1} = (α_k^{−1} Id − D + B + D)^{−1} = (α_k^{−1} Id + B)^{−1}.

Note that the backward step is independent of D and that the algorithm will, as we show next, only depend on D through the forward step. The operator γ_k M_k − S used in the correction term becomes

γ_k M_k − S = α_k(α_k^{−1} Id − D) − Id = −α_k D,

and the complete forward step with momentum correction is

Corollary 5.1 Let Assumption 5.1 hold. If

for all k ∈ N, then x_k ⇀ x where x is a solution to (9).
Proof After Assumption 5.1, we concluded that Assumption 2.1 holds for the reformulation of (9) into (1) via A = B + D. Assumption 2.2 also holds since γ_k M_k − S = −α_k D is L_k-Lipschitz continuous with L_k = α_k δ. Inserting γ_k, β, δ, and θ into (7) of Corollary 4.1 then directly gives the step-size condition, and the results follow from the corollary.
The forward-half-reflected-backward (FHRB) method and its special case, the forward-reflected-backward (FRB) method in [22], are special cases of Algorithm 3. They are obtained by setting θ = 0 (FHRB) and θ = 0 and C = 0 (FRB). Our analysis assumes that B and B + D are maximally monotone. In [22], they instead assume that B and D are both maximally monotone, which implies that B + D is maximally monotone since D is also Lipschitz continuous with full domain. Our assumptions are slightly more general since we can allow for non-monotone D as long as B can compensate for it.
Our step-size conditions are slightly relaxed compared to the ones in [22]. Our conditions match theirs when a constant step-size α_k = α is chosen. However, the original work only provides convergence conditions for non-constant step-sizes in the FRB case, i.e., C = 0. In that case, [22] proved convergence if ε ≤ 2α_k ≤ δ^{−1} − ε for some ε > 0 and all k ∈ N, which is slightly more restrictive than our condition.
When C = 0, Algorithm 3 is [22, Equation 4.1] without relaxation, and when D = 0 it is forward-backward splitting with momentum. Both of these special cases have been shown to converge under certain conditions, but our results expand these conditions in both settings. In the FRB with momentum case, Corollary 5.1 allows for step-sizes that depend on the iteration index k while [22, Theorem 4.3] only allows for a constant step-size, α_k = α for all k ∈ N. In the forward-backward with momentum case, Corollary 5.1 makes it possible to find a convergent step-size α_k for all θ ∈ (−1, 1/3), which is the only result we know of that allows for negative momentum. This is especially interesting considering that the magnitude of negative momentum is allowed to be larger than the magnitude of positive momentum. Our upper bound on the momentum matches other results in the literature for weak sequence convergence: [22] when C = 0, [31] when C = D = 0, and [32] when C ≠ 0 and D = 0. In the gradient-descent case, larger upper bounds on θ and α_k have been shown to work [39]. These results guarantee ergodic convergence of function values and are not applicable to general monotone inclusion problems.
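A minimal sketch of FHRB (θ = 0) on a small illustrative instance, with B = ∂(lam·‖·‖₁), a skew-symmetric (hence monotone and Lipschitz, but not cocoercive) D, and cocoercive C(x) = x − b. The step-size is chosen well inside a condition of the form 2αδ < 1; the data are illustrative, not from the paper.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# FHRB: x+ = (Id + alpha*B)^(-1)(x - alpha*(2*D x - D x_prev) - alpha*C x)
K = np.array([[0.0, 2.0], [-2.0, 0.0]])   # skew: delta = ||K|| = 2
b, lam = np.array([1.0, 1.0]), 0.1
alpha = 0.2                                # 2*alpha*delta = 0.8 < 1

x = x_prev = np.zeros(2)
for _ in range(2000):
    half_reflected = 2.0 * (K @ x) - K @ x_prev
    x, x_prev = soft(x - alpha * half_reflected - alpha * (x - b), alpha * lam), x

# forward-backward residual of 0 in Bx + Dx + Cx at the limit point
res = np.linalg.norm(x - soft(x - alpha * (K @ x + x - b), alpha * lam))
```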

Remark 5.1
The same nonlinear kernel that in this case generates FHRB and FRB yields the forward-backward-half-forward [24] and forward-backward-forward [23] methods when used in the nonlinear forward-backward scheme with projection correction [11]. The two sets of algorithms can therefore be seen to have the same nonlinear forward-backward step but with different correction methods to guarantee convergence. Due to the momentum correction's reuse of old information, FHRB and FRB have cheaper per-iteration costs compared to their projection correction counterparts.

FHRB was referred to as a three-operator splitting variant of FRB in the original work [22].

The work in [32] does not present an explicit convergence condition for a fixed choice of θ. Instead, they present a criterion for selecting an iteration-dependent θ_k adaptively. However, in a remark they mention results from [31] which, when combined with their results, yield a convergence criterion for a fixed choice of θ.

Two Novel Primal-Dual Methods
We will present two new primal-dual methods for solving the problem of finding y ∈ K such that

0 ∈ By + Ey + V*D(V y) + Fy, (10)

where the following assumptions hold.

Assumption 6.1 Let K and G be real Hilbert spaces. The operators of (10) satisfy: B : K → 2^K and D : G → 2^G are maximally monotone; E : K → K is monotone and δ-Lipschitz continuous; V : K → G is bounded and linear; and F : K → K is β^{−1}-cocoercive. If F = 0, we set β = 0.
By a primal-dual method, we mean a method that, instead of solving (10) directly, solves the equivalent primal-dual problem of finding y ∈ K and z ∈ G such that

0 ∈ By + Ey + Fy + V*z and 0 ∈ D^{−1}z − V y. (11)

The two primal-dual methods are derived by reformulating this primal-dual problem into our standard form (1) and then applying Algorithm 1 with different sets of design parameters.
There is no unique way of reformulating (11) into (1), but we set H = K × G and define, with some abuse of block matrix notation,

A = 𝐀 + E + V,  𝐀(y, z) = By × D^{−1}z,  E(y, z) = (Ey, 0),  V(y, z) = (V*z, −V y),  C(y, z) = (Fy, 0). (12)

Assuming A + C has at least one zero, these operators satisfy Assumption 2.1 since A = 𝐀 + E + V is the sum of a maximally monotone operator 𝐀 and two maximally monotone operators E and V with full domains. The properties of 𝐀, E, and V are results of the following: maximal monotonicity of B and D; monotonicity and Lipschitz continuity of E; and the skew-adjointness and linearity of V. The first assumption of Assumption 2.1 is then satisfied, and the second assumption regarding the cocoercivity of C is easily verified in the standard metric of K × G. However, the algorithms in Sects. 6.1 and 6.2 will use different scaling operators S, and we will therefore defer the derivation of more precise cocoercivity constants to the respective sections since the constants depend on S.

Primal-Dual Method with Block-Triangular Resolvent
To derive our first primal-dual algorithm, we decompose the iterates of Algorithm 1 as x_k = (y_k, z_k) with y_k ∈ K and z_k ∈ G for all k ∈ N. The algorithm is given by the following design parameters

where τ, σ > 0 are such that τσ‖V‖² < 1 and λ_k ∈ R for all k ∈ N. The assumption on τ and σ guarantees that S ∈ P(K × G). The forward step operator and the correction operator are

Inserting these operators into the complete forward step with correction,

What remains to compute is the backward step. The kernel M_k is designed to cancel out the E and V terms, making only the forward step depend on these operators,

This is the inverse of a lower block-triangular operator and it can therefore be computed with back substitution according to

Inserting the expressions for ŷ_k and ẑ_k results in the following algorithm.

Algorithm 4 Primal-Dual Method with Block-Triangular Resolvent Consider problem (10). With y_0, y_{−1} ∈ K, z_0 ∈ G and λ_{−1} ∈ R, for all k ∈ N iteratively perform

where τ, σ > 0 and λ_k ∈ R.

Due to the lower block-triangular structure of the operator in the backward step, the primal update of y_{k+1} is independent of the dual update of z_{k+1}, but the opposite statement does not hold in general. This dependency is controlled by λ_k and manifests itself as a correction v_{k+1} added to the primal iterate used in the dual update. When λ_k = λ_{k−1}, the correction v_{k+1} is an affine combination of an extrapolation step based either on the current or the previous primal update, see Fig. 1 (update of the corrected primal iterate y_k + v_{k+1} in Algorithm 4). When λ_k ≠ λ_{k−1}, the correction can be an arbitrary linear combination of the two different extrapolations. However, the choice of the sequence (λ_k)_{k∈N} will affect the range of allowed step-sizes. The more λ_k differs from 2, the smaller the upper bound on the step-sizes is in the following convergence result.
Corollary 6.1 Let Assumption 6.1 hold and consider problem (10) and Algorithm 4. If there exists an ε > 0 such that

for all k ∈ N, then y_k ⇀ y and z_k ⇀ z, where y is a solution to (10) and (y, z) is a solution to (11).
Before proceeding to the proof of Corollary 6.1, we present the following lemma, on which the proof relies.

Lemma 6.1 Let S ∈ P(K × G) be from (13). The inverse of S satisfies

The following inequalities hold for all y ∈ K and z ∈ G:

Proof The inverse is easily verified, and we note that, since τσ‖V‖² < 1 by assumption, Id − τσV*V ∈ P(K) and Id − τσV V* ∈ P(G), and hence they are invertible. Let y ∈ K, then

which proves the first inequality of the lemma. The last step holds since

Similarly,

which proves the second inequality of the lemma. Again, the last step holds since τσ‖V‖² < 1. Finally,

which proves the third inequality of the lemma.

Proof of Corollary 6.1
As previously stated, the choice of A and C in (12) satisfies Assumption 2.1 since we assume that a solution exists. What remains to verify of Assumption 2.1 is to derive a cocoercivity constant of C. The first inequality of Lemma 6.1 directly gives

The assumptions placed on the design parameters, Assumption 2.2, also need to hold. For item (i) of Assumption 2.2, we directly see that γ_k = τ > 0. We prove (ii) of Assumption 2.2, the Lipschitz continuity of γ_k M_k − S, by showing Lipschitz continuity of τE and of τ(M_k − V) − S separately. The Lipschitz continuity of γ_k M_k − S then follows from the Lipschitz continuity of a sum of Lipschitz continuous operators. Starting with τE and using the first and third inequalities from Lemma 6.1 and the Lipschitz continuity of E gives

For the remaining term, we can use the second inequality of Lemma 6.1:

Adding these two Lipschitz constants yields that

and Assumption 2.2 is satisfied. The result of the corollary now follows from Theorem 3.1 after inserting the expressions for ℓ and L_k into the convergence criterion (6).

Related Algorithms
From Algorithm 4, when E = 0 and λ_k = 2 for all k ∈ {−1, 0, . . .}, we obtain an instance of the Vũ-Condat algorithm [25,26]. If F = 0 as well, we get the method of Chambolle-Pock [27]. This is not surprising since both of these methods are special cases of ordinary forward-backward splitting and the kernel M_k, see (13), is linear, self-adjoint, and can be made strongly positive when E = 0 and λ_k = 2. Furthermore, we then have that γ_k M_k − S = 0, which implies that the momentum-correction term is zero and that Algorithm 1 has reduced to the ordinary forward-backward method. Both when F ≠ 0 and when F = 0, Corollary 6.1 regains the convergence criteria of Vũ-Condat and Chambolle-Pock respectively. When E = 0, Algorithm 4 shares similarities with the asymmetric-kernel primal-dual method of Latafat and Patrinos [14, Algorithm 3]. They use the same resolvent kernel, but [11] showed that the Latafat-Patrinos algorithm is a special case of nonlinear forward-backward splitting with projection correction instead of momentum correction. As discussed in Sect. 2 when comparing momentum and projection corrections, the main benefit of Algorithm 4 is that the momentum correction generally yields cheaper iterations. In Algorithm 4, the linear composition term V and its adjoint V* only need to be evaluated once each, while they need to be evaluated twice each for the Latafat-Patrinos method.
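For reference, the Chambolle-Pock special case just mentioned (E = 0, λ_k = 2, F = 0) can be sketched on a small problem min_y ½‖y − b‖² + lam·‖Vy‖₁ with a diagonal V, for which the minimizer is known in closed form. All data below are illustrative choices, not from the paper.

```python
import numpy as np

# Chambolle-Pock: dual ascent on z, proximal step on y, extrapolation y_bar.
b = np.array([1.0, 0.5, -2.0])
v = np.array([1.0, 2.0, 3.0])
V = np.diag(v)
lam = 0.3
tau = sigma = 0.1                  # tau*sigma*||V||^2 = 0.09 < 1

y = y_bar = np.zeros(3)
z = np.zeros(3)
for _ in range(20000):
    # dual: prox of sigma*f* with f = lam*||.||_1 is projection onto [-lam, lam]
    z = np.clip(z + sigma * (V @ y_bar), -lam, lam)
    # primal: prox of tau*g with g = 0.5*||. - b||^2
    y_new = (y - tau * (V.T @ z) + tau * b) / (1.0 + tau)
    y_bar = 2.0 * y_new - y        # extrapolation (lambda_k = 2)
    y = y_new

# closed-form minimizer: componentwise soft thresholding with level lam*v
expected = np.sign(b) * np.maximum(np.abs(b) - lam * v, 0.0)
```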
We can also relate Algorithm 4 to projective splitting methods [8,40]. It has been shown in [41,42] that these methods are nonlinear forward-backward methods with projection correction. In fact, the synchronous projective splitting considered in [41] uses the same kernel as in Algorithm 4 with E = 0 and λ_k = 0. We can therefore think of Algorithm 4 with E = F = 0 and λ_k = 0 for all k ∈ {−1, 0, . . .} as a projective splitting method with momentum correction instead of a projection correction. The benefit of projective splitting methods compared to Chambolle-Pock-like primal-dual methods is that the primal and dual updates do not depend on each other and can therefore be performed in parallel. The same holds for Algorithm 4 since the correction v_{k+1} does not depend on y_{k+1} when λ_k = 0. The reason for this becomes evident when examining the backward step (M_k + A)^{−1}, since both M_k and A are block-diagonal when λ_k = 0, see (12) and (13).

Algorithm 5 Forward-Half-Reflected-Douglas-Rachford Consider problem (10) with V = Id. With y_0, y_{−1} ∈ K and z_0 ∈ G, for all k ∈ N iteratively perform

where τ, ς > 0.
Algorithm 5 converges as per the following result.
Corollary 6.2 Let Assumption 6.1 hold and consider problem (10) with V = Id and Algorithm 5. If

then y_k ⇀ y and z_k ⇀ z, where y is a solution to (10) and (y, z) is a solution to (11).
Proof Follows directly from Corollary 6.1 with V = Id and λ k−1 = 2 for all k ∈ N.
These convergence conditions match those of [29] when F = 0. When E = F = 0, the standard Douglas-Rachford method is retrieved from Algorithm 5 if the step-sizes τ = ς are chosen and the variable change y_k − τz_k → z_k is made. However, this step-size choice makes the step-size condition of Corollary 6.2 impossible to satisfy. The reason for this is that the scaling S of the underlying nonlinear forward-backward method becomes singular, which violates Assumption 2.1. Dealing with this singularity is possible if it is explicitly assumed that E = F = 0, but this is beyond the scope of this article, where positive definiteness of S is assumed.
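For comparison, the standard Douglas-Rachford iteration that Algorithm 5 degenerates to can be sketched directly. The splitting f = lam·‖·‖₁, g = ½‖· − b‖² and all data below are illustrative choices, not from the paper.

```python
import numpy as np

def soft(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

# Douglas-Rachford for 0 in df(x) + dg(x):
#   x = prox_{t*g}(s),   s <- s + prox_{t*f}(2x - s) - x
b, lam, t = np.array([1.5, -0.2, 0.7]), 0.5, 1.0

s = np.zeros(3)
for _ in range(500):
    x = (s + t * b) / (1.0 + t)              # prox of t*g
    s = s + soft(2.0 * x - s, t * lam) - x   # reflected resolvent update

x = (s + t * b) / (1.0 + t)                  # shadow iterate
expected = soft(b, lam)                      # minimizer of f + g
```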
When E = 0, Algorithm 5 is applicable to the same class of problems as the Davis-Yin method in [4]. However, the algorithms are different, although they can both reduce to the Douglas-Rachford iterations when also F = 0.

Primal-Dual Method with Resolvent-Compensated Kernel
Our second method for solving (10) through the primal-dual problem (11) makes further use of the nonlinearity of the kernel by including resolvent evaluations in the kernel itself. As in the previous case, we reformulate the primal-dual problem as our standard problem (1) by defining H, A, C, Ā, E, and V as in (12). The iterates of Algorithm 1 are decomposed as x_k = (y_k, z_k) with y_k ∈ K and z_k ∈ G for all k ∈ N. The second primal-dual algorithm is then given by Algorithm 1 with the following design parameters, where τ, σ > 0 and T_a : G → G : z ↦ z − a is the translation by a ∈ G. Note that the current iterate z_k is used in the construction of M_k and that S ∈ P(K × G) for all τ, σ > 0. With these design parameters, the correction operator can be written out explicitly, and inserting it and the other operators into the forward step yields the updates of Algorithm 6.

Algorithm 6 Primal-Dual Method with Resolvent Corrected Kernel
Consider problem (10). With y_0, y_{−1} ∈ K and z_0, ν_0 ∈ G, for all k ∈ N iteratively perform the algorithm's updates, where τ, σ > 0.

Seeing that the backward step can be evaluated efficiently requires some extra attention. The operator M̄_k + Ā + V does not have the lower block-triangular structure of the algorithm in Sect. 6.1. We can therefore not evaluate its inverse with the same back-substitution approach as before, and computing it at a general point seems intractable. However, (M̄_k + Ā + V)^{−1} is only evaluated at (ŷ_k, ẑ_k), and the kernel has been specifically designed so that the backward step can be evaluated efficiently at this point. Writing out the inclusion problem explicitly, using that z_k = σ ẑ_k in the first row, solving for z_{k+1} in the second row, and then inserting the second row into the first and solving for y_{k+1} gives

y_{k+1} = (Id + τ B)^{−1}(τ ŷ_k),
z_{k+1} = (Id + σ D^{−1})^{−1}(σ ẑ_k + σ V y_{k+1}).
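In the common case where D = ∂g for a proper closed convex function g, the resolvent (Id + σ D^{−1})^{−1} appearing in the z-update is the proximal operator of σg*, which can be evaluated from the prox of g itself via the Moreau identity prox_{σg*}(v) = v − σ prox_{g/σ}(v/σ). A minimal sketch, using g = |·| as a hypothetical test case (its conjugate g* is the indicator of [−1, 1], so prox_{σg*} must equal the projection onto that interval):

```python
# Evaluating (Id + sigma * D^{-1})^{-1} = prox_{sigma * g*} via the Moreau
# identity, assuming D = dg for a convex function g. Test case: g = |.|,
# whose conjugate is the indicator of [-1, 1], so the resolvent must be the
# projection onto [-1, 1].

def prox_abs(v, t):
    """prox_{t |.|}(v): soft-thresholding with threshold t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def resolvent_of_inverse(v, sigma, prox_g):
    """(Id + sigma * D^{-1})^{-1} v = v - sigma * prox_{g/sigma}(v / sigma)."""
    return v - sigma * prox_g(v / sigma, 1.0 / sigma)

for v in (-3.0, -0.4, 0.0, 0.7, 2.5):
    moreau = resolvent_of_inverse(v, sigma=2.0, prox_g=prox_abs)
    projection = max(-1.0, min(1.0, v))   # direct formula, for comparison
    assert abs(moreau - projection) < 1e-12
```

This is why only resolvents of B and of D^{−1} (equivalently, of g and g*) are needed per iteration, without ever forming D^{−1} explicitly.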
Finally, inserting the expressions for ŷ_k and ẑ_k gives us the following algorithm. We see that, compared to our other primal-dual method, Algorithm 4, we require one extra evaluation of the resolvent of D^{−1} in each iteration. Apart from that, Algorithm 6 also requires only one evaluation of (Id + τ B)^{−1}, V, and V*, given that V y_{k+1} is stored for the next iteration. Still, the resulting per-iteration computational cost is higher than that of Algorithm 4 and most other primal-dual methods. Exactly how much more expensive this method is depends on the problem being solved, and in some cases the difference is negligible. The main reason for presenting Algorithm 6, apart from its novelty, is to further demonstrate the flexibility of the nonlinear kernel framework.

for all (y, z) ∈ K × G. We have previously established that Ā is maximally monotone and, since we assume a solution exists, Assumption 2.1 holds. For Assumption 2.2, we first note that γ_k = τ > 0 and, hence, that the first assumption is satisfied. For the Lipschitz continuity of γ_k M_k − S, we recall the definition of the operator in (15). The operator E is, by assumption, δ-Lipschitz continuous, and (Id + σ D^{−1})^{−1} ∘ T_{−z_k} is 1-Lipschitz since both the resolvent and the translation are 1-Lipschitz continuous. The operator

Remark 6.1
As stated in Remark 4.1, the approach for adding momentum presented in Sect. 4 and Algorithm 2 does not yield a tractable algorithm when applied directly to Algorithm 6. The kernel of Algorithm 6 was designed in such a way that the backward step is cheap to compute only at the point given by the forward step, and it is therefore not straightforward to apply the backward step to a forward step with momentum. However, this is easily fixed. We regain computability of the backward step by adding θ(z_k − z_{k−1}) to the kernel and using this modified kernel in Algorithm 2 instead. Since this operator differs from the one in (14) only by a translation, it does not modify any Lipschitz constants, and the convergence can be proved using the same approach as in Corollary 4.1.
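The key fact used in the remark, that translating an operator by a constant vector leaves all Lipschitz estimates unchanged since (T(x) + c) − (T(y) + c) = T(x) − T(y), can be checked numerically. The nonlinear map and the shift below are hypothetical; the shift plays the role of the constant momentum term θ(z_k − z_{k−1}) within one iteration.

```python
# A translated operator T + c has exactly the same Lipschitz behaviour as T,
# since (T(x) + c) - (T(y) + c) = T(x) - T(y). Numerical check on a toy,
# hypothetical nonlinear map.

import math
import random

def T(x):
    """A toy nonlinear map on the real line."""
    return math.sin(x) + 0.5 * x

shift = 3.7  # stands in for the constant momentum term added to the kernel

random.seed(0)
for _ in range(1000):
    x, y = random.uniform(-10, 10), random.uniform(-10, 10)
    d_plain = abs(T(x) - T(y))
    d_shift = abs((T(x) + shift) - (T(y) + shift))
    assert abs(d_plain - d_shift) < 1e-12
```

Since the per-pair increments coincide, every Lipschitz constant of the translated kernel equals that of the original one, which is exactly what lets the convergence proof of Corollary 4.1 carry over unchanged.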

Conclusion
We have presented a forward-backward method with a nonlinear resolvent and a novel momentum correction. The design freedom of the nonlinear resolvent allows us to interpret numerous methods as special cases of this forward-backward method. Existing special cases include the forward-(half)-reflected-backward method, the forward-reflected-Douglas-Rachford method, and the primal-dual methods of Vũ-Condat and Chambolle-Pock. New algorithms include momentum versions of the previously mentioned algorithms and new four-operator primal-dual splitting methods. Our convergence conditions either regain or improve on the already known conditions for the existing methods, establishing parity of our more general analysis with the more specialized approaches. We believe that this parity of analysis and the great amount of freedom in the parameter choices of our algorithm can prove useful for the understanding of existing algorithms and the development of new ones.

Corollary 6.2
Let V = Id and let Assumption 5.1 hold. Consider problem (10) and Algorithm 5. If the step-sizes satisfy

Algorithm 2 Nonlinear Forward-Backward with Momentum Correction and Additional Iterate Momentum
Consider problem (1) and let S be such that Assumption 2.1 is satisfied. With x_0, x_{−1}, u_0 ∈ H, for all k ∈ N iteratively perform