Convergence Rates of Forward–Douglas–Rachford Splitting Method

Over the past decades, operator splitting methods have become ubiquitous for non-smooth optimization owing to their simplicity and efficiency. In this paper, we consider the Forward–Douglas–Rachford splitting method and study both global and local convergence rates of this method. For the global rate, we establish a sublinear convergence rate in terms of a Bregman divergence suitably designed for the objective function. Moreover, when specializing to the Forward–Backward splitting, we prove a stronger convergence rate result for the objective function value. Then, locally, based on the assumption that the non-smooth part of the optimization problem is partly smooth, we establish local linear convergence of the method. More precisely, we show that the sequence generated by Forward–Douglas–Rachford first (i) identifies a smooth manifold in a finite number of iterations, and then (ii) enters a local linear convergence regime, which is characterized, for instance, in terms of the structure of the underlying active smooth manifold. To exemplify the usefulness of the obtained results, we consider several concrete numerical experiments arising from application fields including, for instance, signal/image processing, inverse problems and machine learning.


Problem statement
Splitting methods are iterative algorithms that solve inclusion problems and optimization problems by decoupling the original problem into subproblems that are fast and easy to solve. These schemes evaluate the individual operators, their resolvents, and the linear operators, all separately at various points in the course of the iteration, but never the resolvents of sums nor of compositions by a linear operator. Since the first operator splitting method, developed in the 1970s for solving structured monotone inclusion problems, the class of splitting methods has been regularly enriched with increasingly sophisticated algorithms as the structure of the problems to handle becomes more complex. We refer the reader to [4] and references therein for a thorough account of operator splitting methods.
Nowadays, operator splitting methods have become ubiquitous for solving problems arising from fields throughout science and engineering, such as inverse problems, signal/image processing and machine learning, among others. Numerous applications encountered in these fields may end up with solving certain types of optimization problems. In this paper, we consider the following one:

min_{x ∈ V} Φ(x) := R(x) + F(x),    (1.1)

where H is a real Hilbert space equipped with scalar product ⟨·, ·⟩ and norm ||·||. We suppose the following assumptions:
(A.1) The function R belongs to Γ_0(H), with Γ_0(H) being the set of proper, convex and lower semicontinuous functions from H to the extended real line ]−∞, +∞].
(A.2) F : H → R is convex and continuously differentiable with ∇F being (1/β)-Lipschitz continuous.
(A.3) The constraint set V is a closed vector subspace of H.
(A.4) The set of minimizers is non-empty.
Typical examples of (1.1) can be found in the numerical experiments section.

Forward-Douglas-Rachford splitting method
When V = H, problem (1.1) can be handled by the classical Forward–Backward (FB) splitting method [33], whose iteration, in its relaxed form, reads

x_{k+1} = x_k + λ_k ( prox_{γR}( x_k − γ∇F(x_k) ) − x_k ),    (1.2)

where γ ∈ ]0, 2β[ is the step-size and λ_k ∈ ]0, (4β − γ)/(2β)[ is the relaxation parameter. The term prox_{γR} is called the proximity operator of γR and is defined by

prox_{γR}(x) := argmin_{y ∈ H} (1/2)||y − x||² + γR(y).    (1.3)

When V is merely a subspace of H, in principle we can still apply the FB splitting method to solve (1.1). However, even if prox_{γR} is very easy to compute, the proximity operator of R + ι_V may in general be rather difficult to calculate. Therefore, new splitting algorithms are needed, and one possible choice is the Forward–Douglas–Rachford (FDR) splitting method [10], which is presented next. Let us first define P_V as the orthogonal projector onto the subspace V, the function G := F ∘ P_V, and the problem

min_{x ∈ V} Φ_V(x) := R(x) + G(x),    (1.4)

which shares its minimizers with (1.1) since G = F on V. Remark 1.1. From the assumption on F, we have that G is also convex and continuously differentiable, with ∇G = P_V ∘ ∇F ∘ P_V being (1/β_V)-Lipschitz continuous for some β_V ≥ β. The observation that using G instead of F achieves a better Lipschitz condition was first considered in [17].
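To make the FB iteration and the proximity operator concrete, here is a minimal NumPy sketch on a hypothetical LASSO instance (R = µ||·||_1, whose proximity operator is soft-thresholding; all problem data and names are illustrative assumptions, not from the paper):

```python
import numpy as np

def prox_l1(x, t):
    """Proximity operator of t*||.||_1, i.e. soft-thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def forward_backward(K, y, mu, gamma, lam, n_iter):
    """Relaxed FB iteration for min_x 0.5*||K x - y||^2 + mu*||x||_1."""
    x = np.zeros(K.shape[1])
    for _ in range(n_iter):
        grad = K.T @ (K @ x - y)              # gradient of the smooth part F
        x = x + lam * (prox_l1(x - gamma * grad, gamma * mu) - x)
    return x

rng = np.random.default_rng(0)
K = rng.standard_normal((20, 50))
y = K @ np.concatenate([rng.standard_normal(5), np.zeros(45)])
beta = 1.0 / np.linalg.norm(K, 2) ** 2        # beta = 1/Lipschitz constant of grad F
# gamma in ]0, 2*beta[ and lam in ]0, (4*beta - gamma)/(2*beta)[ = ]0, 1.25[
x_fb = forward_backward(K, y, mu=0.1, gamma=1.5 * beta, lam=1.0, n_iter=2000)
```

Note that soft-thresholding returns exact zeros, so the FB iterates are genuinely sparse vectors.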
The iteration of the FDR method for solving (1.4) reads

u_{k+1} = prox_{γR}( 2x_k − z_k − γ∇G(x_k) ),
z_{k+1} = z_k + λ_k ( u_{k+1} − x_k ),    (1.5)
x_{k+1} = P_V(z_{k+1}),

where γ is the step-size and λ_k is the relaxation parameter. Recall that, under the additional assumptions that γ ∈ ]0, 2β_V[ and λ_k ∈ ]0, (4β_V − γ)/(2β_V)[, the sequences {u_k}_{k∈N} and {x_k}_{k∈N} converge to a solution; see [10, Theorem 4.2].
In this paper, we consider a non-stationary version of (1.5), namely when the step-size γ may change along the iterations.The method is described below in Algorithm 1.

Algorithm 1: Non-stationary Forward-Douglas-Rachford splitting
Initial: k = 0, z_0 ∈ H, x_0 = P_V(z_0).
repeat:
    u_{k+1} = prox_{γ_k R}( 2x_k − z_k − γ_k ∇G(x_k) ),
    z_{k+1} = z_k + λ_k ( u_{k+1} − x_k ),    (1.6)
    x_{k+1} = P_V(z_{k+1}),
    k ← k + 1,
until convergence.

For global convergence, one can also consider an inexact version of (1.6), obtained by incorporating additive errors in the computation of u_k and x_k, though we do not elaborate more on this for the sake of the local convergence analysis. One can consult [29] for more details on this aspect.
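A minimal numerical sketch of Algorithm 1 follows, assuming the standard FDR update order (forward step on G, proximal step on R, projection onto V). The toy subspace, data and function names are hypothetical; note that for x ∈ V one has ∇G(x) = P_V(∇F(x)) since G = F ∘ P_V:

```python
import numpy as np

def prox_l1(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def fdr(grad_F, prox_R, P_V, z0, gammas, lam=1.0, n_iter=2000):
    """Sketch of non-stationary FDR for min_{x in V} F(x) + R(x)."""
    z = z0.copy()
    x = P_V(z)
    u = x
    for k in range(n_iter):
        g = gammas(k)                                  # step-size gamma_k -> gamma
        u = prox_R(2 * x - z - g * P_V(grad_F(x)), g)  # grad G(x) = P_V grad F(x) on V
        z = z + lam * (u - x)
        x = P_V(z)
    return x, u

# toy instance (hypothetical): V = vectors supported on the first 15 coordinates
rng = np.random.default_rng(1)
K = rng.standard_normal((15, 20))
y = rng.standard_normal(15)
mask = (np.arange(20) < 15).astype(float)
beta = 1.0 / np.linalg.norm(K, 2) ** 2
x_fdr, u_fdr = fdr(grad_F=lambda v: K.T @ (K @ v - y),
                   prox_R=lambda v, t: prox_l1(v, 0.1 * t),       # R = 0.1 * l1-norm
                   P_V=lambda v: v * mask,
                   z0=np.zeros(20),
                   gammas=lambda k: beta * (1.0 + 0.5 * 2.0 ** (-k)))  # summable deviations
```

The step-size schedule converges to β with geometrically summable deviations, which is one simple way to satisfy the non-stationarity assumption.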
In what follows, we suppose the following main assumption on the algorithm parameters: (A.5) the sequence of step-sizes {γ_k}_{k∈N} and the sequence of relaxation parameters {λ_k}_{k∈N} verify the stated conditions. Notice that, in the stationary case (i.e. for γ_k constant), assumption (A.5) is equivalent to the conditions required in [10, Theorem 4.2] for the convergence of iteration (1.5). Moreover, to satisfy (A.5) in the absence of relaxation (i.e. when the relaxation parameter is fixed to λ_k ≡ 1), the sequence of step-sizes only has to verify γ_k ∈ ]γ_, γ̄[ with Σ_{k∈N} |γ_k − γ| < +∞. On the other hand, in general, the summability assumption on {λ_k |γ_k − γ|}_{k∈N} in (A.5) is weaker than imposing summability of {|γ_k − γ|}_{k∈N} alone. Indeed, following the discussion in [13, Remark 5.7], take q ∈ ]0, 1], let θ = (4β_V − γ)/(4β_V) > 1/2, and define the sequences accordingly. Then it can be verified that {λ_k |γ_k − γ|}_{k∈N} is summable while Σ_{k∈N} |γ_k − γ| = +∞.
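The following small computation illustrates numerically why weighting by λ_k weakens the summability requirement. The sequences below are illustrative choices, not those of [13, Remark 5.7]: harmonic deviations are not summable, but weighting them by slowly vanishing factors makes the series converge.

```python
import numpy as np

# |gamma_k - gamma| = 1/k: divergent (harmonic) series;
# weight_k = 1/log(k)^2: lambda-like factors that restore summability.
k = np.arange(2, 100_001).astype(float)
dev = 1.0 / k
weight = 1.0 / np.log(k) ** 2

partial_dev = np.cumsum(dev)            # grows without bound (like log k)
partial_weighted = np.cumsum(weight * dev)  # converges (integral test: 1/(x log^2 x))
```

By the integral test, Σ 1/(k log²k) is finite while Σ 1/k is not, which the partial sums reflect.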
As revealed by its name, FDR recovers the Douglas–Rachford splitting method [18] when F = 0, and the FB splitting method [33] when V = H. The FDR method is also closely related to two other operator splitting methods: the generalized Forward–Backward splitting (GFB) [40] and the three-operator splitting (TOS) [17], which are discussed below.
Generalized Forward–Backward splitting. Let m > 0 be a positive integer. Now, for problem (1.1), let V = H and suppose we have m non-smooth functionals. The problem then becomes

min_{x ∈ H} F(x) + Σ_{i=1}^m R_i(x),    (1.7)

where, for each i = 1, ..., m, R_i ∈ Γ_0(H). Similarly to the situation of the FDR algorithm, even if the proximity operator of each R_i can be evaluated easily, the proximity operator of their sum can be intractable. In [40], the authors propose the GFB algorithm, which achieves a full splitting by evaluating the proximity operator of each R_i separately; we refer to [40] for more details on the GFB algorithm. Now define the product space H := H × ... × H (m copies), equipped with a proper inner product and norm, and the diagonal subspace V := {(x, ..., x) ∈ H : x ∈ H}. Then it can be shown that the GFB algorithm is equivalent to applying FDR to the corresponding reformulation of (1.7) posed on H. We refer to [10] for more connections between FDR and GFB.
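The product-space construction can be sketched as follows, assuming equal weights 1/m so that P_V averages the m copies and the prox of each block carries a factor m. This is a hedged illustration of GFB seen as FDR on H^m, with hypothetical data (R_1 = a scaled ℓ_1-norm, R_2 = the indicator of a box):

```python
import numpy as np

def prox_l1(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_box(x, t):
    """Prox of the indicator of [-1, 1]^n (the parameter t is irrelevant)."""
    return np.clip(x, -1.0, 1.0)

def gfb_via_fdr(grad_F, proxes, z0, gamma, n_iter=2000):
    """GFB seen as FDR on the product space H^m: V is the diagonal subspace,
    P_V averages the m copies, and R(z) = sum_i R_i(z_i) has a block-wise prox
    (each R_i rescaled by m, since equal weights 1/m are used)."""
    m = len(proxes)
    Z = np.tile(z0.astype(float), (m, 1))
    x = Z.mean(axis=0)
    for _ in range(n_iter):
        g = grad_F(x)
        for i, prox in enumerate(proxes):
            Z[i] += prox(2 * x - Z[i] - gamma * g, m * gamma) - x
        x = Z.mean(axis=0)
    return x

rng = np.random.default_rng(2)
K = rng.standard_normal((10, 30))
y = rng.standard_normal(10)
beta = 1.0 / np.linalg.norm(K, 2) ** 2
x_gfb = gfb_via_fdr(lambda v: K.T @ (K @ v - y),
                    [lambda v, t: prox_l1(v, 0.2 * t), prox_box],
                    np.zeros(30), gamma=beta)
```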
Three-operator splitting. When in problem (1.7) we choose m = 2, it becomes

min_{x ∈ H} F(x) + R_1(x) + R_2(x).    (1.9)

Notice that problem (1.9) can be handled by GFB, as it is just a special case of (1.7). In [17], the author proposed a splitting scheme which resembles FDR yet is different: given γ ∈ ]0, 2β[ and λ_k ∈ ]0, (4β − γ)/(2β)[, the iteration of the three-operator splitting (TOS) reads as follows:

u_{k+1} = prox_{γR_1}( 2x_k − z_k − γ∇F(x_k) ),
z_{k+1} = z_k + λ_k ( u_{k+1} − x_k ),    (1.10)
x_{k+1} = prox_{γR_2}(z_{k+1}).

It can be observed that the projection operator P_V of FDR is replaced by the proximity operator prox_{γR_2}. Though the difference lies only in the update of x_{k+1}, as we will see in Section 4.4, the fixed-point operators of the two methods are quite different.
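A sketch of the TOS iteration, assuming the update order described above (the FDR projection replaced by prox_{γR_2}); the instance and names are hypothetical (R_1 = a scaled ℓ_1-norm, R_2 = the indicator of a box):

```python
import numpy as np

def prox_l1(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def tos(grad_F, prox_R1, prox_R2, z0, gamma, lam=1.0, n_iter=2000):
    """Sketch of the TOS iteration: compared with FDR, the projection
    P_V is replaced by the proximity operator of the second non-smooth term."""
    z = z0.copy()
    x = prox_R2(z, gamma)
    for _ in range(n_iter):
        u = prox_R1(2 * x - z - gamma * grad_F(x), gamma)
        z = z + lam * (u - x)
        x = prox_R2(z, gamma)
    return x

rng = np.random.default_rng(3)
K = rng.standard_normal((10, 30))
y = rng.standard_normal(10)
beta = 1.0 / np.linalg.norm(K, 2) ** 2
x_tos = tos(lambda v: K.T @ (K @ v - y),
            lambda v, t: prox_l1(v, 0.2 * t),      # R_1 = 0.2 * l1-norm
            lambda v, t: np.clip(v, -1.0, 1.0),    # R_2 = indicator of [-1, 1]^n
            np.zeros(30), gamma=beta)
```

Since x is always the output of prox_{γR_2}, the returned iterate is exactly feasible for the box constraint.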

Previous works
Over the past years, the convergence rate analysis of operator splitting methods has been extensively studied, typically in terms of objective values or sequence residuals. Typical examples are the FB splitting method and its accelerated versions [37,38,6,11,2]. There are, however, many splitting algorithms for which studying the convergence rate of the objective value is difficult or even impossible, unless additional assumptions are imposed on the involved functions. This is typically the case for splitting algorithms that are not of descent type. Thus, designing an appropriate criterion (e.g. deriving from some Lyapunov function) is challenging (as we will see for the FDR algorithm in Section 3).
The convergence rate of the objective value for the FDR algorithm has been studied in [16]. There, the author presented some ergodic and pointwise convergence rates on the objective value under different (more or less stringent) assumptions imposed on the function R in the objective (1.1). Without any further assumptions beyond (A.1)-(A.5), the author proved a pointwise o(1/√k) convergence rate on a certain criterion [16, Theorem 3.5]. However, besides the fact that this rate is quite pessimistic (it suggests that FDR is as slow as subgradient descent), there is no non-negativity guarantee for such a criterion, and the obtained rate is thus of quite limited interest. Improving this rate on the objective value requires rather strong assumptions on R.
As far as local linear convergence of the sequence in the absence of strong convexity is concerned, it has received increasing attention in the past few years in the context of first-order proximal splitting methods.
The key idea here is to exploit the geometry of the underlying objective around its minimizers. This has been done, for instance, in [28,30,31,32] for the FB scheme, Douglas–Rachford splitting/ADMM and Primal–Dual splitting, under the umbrella of partial smoothness. The error bound property, as highlighted in the seminal work of [34,35], was used by several authors to study linear convergence of first-order descent-type algorithms, and in particular FB splitting; see e.g. [8,45,27,19]. However, to the best of our knowledge, there are no local linear convergence results for the FDR algorithm.

Main contributions
The main contributions of this paper can be summarized as follows.

Global convergence
Convergence of the non-stationary FDR In Section 3, we first prove the convergence of the newly proposed non-stationary FDR scheme (1.6).This is achieved by capturing non-stationarity as an error term.The proof exploits a general result on inexact and non-stationary Krasnosel'skiȋ-Mann fixed-point iteration developed in [29].

Convergence rate of a Bregman divergence
We design a Bregman divergence D_Φ^v(u_k) (see (3.4)) as a meaningful convergence criterion. Under the sole assumptions (A.1)-(A.5), we show pointwise o(1/(k+1)) and ergodic O(1/(k+1)) convergence rates for this criterion (Theorem 3.6). When specializing the result to the unrelaxed FB splitting method, we show that the convergence rate is actually Φ(x_k) − Φ(x⋆) = o(1/(k+1)), even for the larger interval ]0, 2β[ for the step-size γ_k.

Local convergence
We then turn to analyzing the local convergence behaviour, where H is now assumed finite-dimensional, i.e. H = R^n. Let x⋆ ∈ Argmin(Φ_V) be a global minimizer of (1.4).
Finite time activity identification. Under a non-degeneracy condition (see (4.2)), and provided that R is partly smooth relative to a C²-smooth manifold M_{x⋆}^R near x⋆ (see Definition 2.10), we show that the FDR scheme has a finite time identification property. More precisely, the FDR-generated sequence {u_k}_{k∈N}, which converges to x⋆, identifies the manifold M_{x⋆}^R in finite time (Theorem 4.1).

Local linear convergence
Capitalizing on this finite identification result, we first show that the globally non-linear iteration (1.6) locally linearizes along the identified manifold M_{x⋆}^R. Then, we show that the convergence becomes locally linear, with a rate of linear convergence characterized precisely in terms of the Riemannian structure of M_{x⋆}^R.
Paper organization. The rest of the paper is organized as follows. In Section 2, we recall some classical material on convex analysis and operator theory that is essential to our exposition. We then introduce the notion of partial smoothness. The global convergence analysis is presented in Section 3, followed by the finite identification and local convergence analysis in Section 4. Several numerical experiments are presented in Section 5. To keep the paper readable, very long proofs are collected in the appendix.

Preliminaries
Throughout the paper, H is a Hilbert space equipped with scalar product ⟨·, ·⟩ and associated norm ||·||.

Definition 2.2 (Bregman divergence).
Given a function R ∈ Γ_0(H) and two points x, y in its effective domain, the Bregman divergence is defined by

D_R^p(y, x) := R(y) − R(x) − ⟨p, y − x⟩,

where p ∈ ∂R(x) is a subgradient of R at x.
Notice that the Bregman divergence is not a distance in the usual sense, since it is in general not symmetric. However, it measures the discrepancy between two points, in the sense that D_R^p(x, x) = 0 and D_R^p(y, x) ≥ 0 for any x, y in the domain of R. Moreover, D_R^p(y, x) ≥ D_R^p(w, x) for all points w in the line segment between x and y.
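These properties are easy to check numerically; below is a small sketch with R = ||·||_1 and hand-picked points and subgradients (the specific values are illustrative):

```python
import numpy as np

def bregman_l1(y, x, p):
    """D_R^p(y, x) = R(y) - R(x) - <p, y - x> for R = ||.||_1, with p in dR(x)."""
    return np.sum(np.abs(y)) - np.sum(np.abs(x)) - p @ (y - x)

x = np.array([1.0, -2.0, 0.0])
p = np.array([1.0, -1.0, 0.3])   # valid subgradient of ||.||_1 at x (|p_i| <= 1 at zeros)
y = np.array([0.5, 1.0, -1.0])
q = np.sign(y)                   # valid subgradient at y (all entries non-zero)

d_yx = bregman_l1(y, x, p)       # >= 0
d_xx = bregman_l1(x, x, p)       # == 0
d_xy = bregman_l1(x, y, q)       # generally != d_yx: the divergence is not symmetric
```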
Operators. Given a set-valued mapping A : H ⇒ H, define its range ran(A) := {y ∈ H : ∃x ∈ H s.t. y ∈ A(x)} and its graph gph(A) := {(x, y) ∈ H × H : y ∈ A(x)}.

Definition 2.3 (Monotone operator).
A set-valued mapping A : H ⇒ H is said to be monotone if

⟨x − y, u − v⟩ ≥ 0 for all (x, u), (y, v) ∈ gph(A).

It is maximal monotone if gph(A) cannot be properly contained in the graph of any other monotone operator.
In particular, when α = 1/2, the operator is called firmly non-expansive. The class of α-averaged operators is closed under relaxation, convex combination and composition [4]. The next lemma concerns the composition of two averaged non-expansive operators.

Lemma 2.8 ([39, Theorem 3]).
Let F_1 : H → H be α_1-averaged and F_2 : H → H be α_2-averaged. Then the composite operator F_1 ∘ F_2 is α-averaged with α = (α_1 + α_2 − 2α_1α_2)/(1 − α_1α_2). The next lemma collects some properties of the set of fixed points of a non-expansive operator.
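A numerical sanity check of this composition rule: projections onto affine sets are 1/2-averaged (firmly non-expansive), so, assuming the averaged constant α = (α_1 + α_2 − 2α_1α_2)/(1 − α_1α_2) = 2/3 for two such projections, the averaged-operator inequality should hold for their composition (the instance is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)

def proj_affine(A, b):
    """Orthogonal projection onto {x : A x = b} (full row rank A);
    projections onto closed convex sets are 1/2-averaged."""
    pinv = np.linalg.pinv(A)
    return lambda x: x - pinv @ (A @ x - b)

P1 = proj_affine(rng.standard_normal((2, 6)), rng.standard_normal(2))
P2 = proj_affine(rng.standard_normal((3, 6)), rng.standard_normal(3))
T = lambda x: P1(P2(x))

# An alpha-averaged operator T satisfies, for all x, y:
#   ||Tx - Ty||^2 <= ||x - y||^2 - (1-alpha)/alpha * ||(x - Tx) - (y - Ty)||^2
alpha = 2.0 / 3.0
ok = True
for _ in range(100):
    x, y = rng.standard_normal(6), rng.standard_normal(6)
    lhs = np.sum((T(x) - T(y)) ** 2)
    slack = (1 - alpha) / alpha * np.sum(((x - T(x)) - (y - T(y))) ** 2)
    ok = ok and (lhs <= np.sum((x - y) ** 2) - slack + 1e-10)
```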

Partial smoothness
In this part, we assume that H = R^n, and we briefly introduce the concept of partial smoothness, on which our local convergence rate analysis relies. Partial smoothness was first introduced in [26]. This concept, as well as that of identifiable surfaces [44], captures the essential features of the geometry of non-smoothness when the latter can be localized along the so-called active/identifiable manifold.

Let M be a C²-smooth embedded submanifold of R^n around a point x. To lighten notation, henceforth we write C²-manifold instead of C²-smooth embedded submanifold of R^n. The natural embedding of a submanifold M into R^n permits us to define a Riemannian structure on M, and we simply say that M is a Riemannian manifold. T_M(x) denotes the tangent space to M at any point near x in M. More material on manifolds is given in Section B.1.

Below we present the definition of partly smooth functions in the Γ_0(R^n) setting.

Definition 2.10 (Partly smooth function).
Let R ∈ Γ_0(R^n) and let x be a point such that ∂R(x) ≠ ∅. R is said to be partly smooth at x relative to a set M containing x if:
(i) Smoothness: M is a C²-manifold around x, and R restricted to M is C² around x;
(ii) Sharpness: the tangent space T_M(x) coincides with T_x := par(∂R(x))^⊥;
(iii) Continuity: the set-valued mapping ∂R is continuous at x relative to M.
The class of partly smooth functions at x relative to M is denoted by PSF_x(M).
Under transversality assumptions, the class of partly smooth functions is closed under addition and pre-composition by a linear operator [26]. Moreover, an absolutely permutation-invariant, convex and partly smooth function of the singular values of a real matrix yields a convex and partly smooth spectral function of that matrix [15]. Popular examples of partly smooth functions are summarized in Section 5; their details can be found in [30].
The next lemma gives expressions of the Riemannian gradient and Hessian (see Section B.1 for definitions) of a partly smooth function.

In turn, for all x ∈ M near x⋆, the Riemannian gradient and Hessian can be expressed in terms of R̃, where R̃ is any smooth extension (representative) of R on M.

Global convergence
In this section, we study the global convergence behaviour of the non-stationary FDR algorithm (1.6) in a general real Hilbert space setting, including rate analysis.

Global convergence of the non-stationary FDR
Define the reflection operators of γR and ι_V respectively as rprox_{γR} := 2 prox_{γR} − Id and rproj_V := 2 P_V − Id. Moreover, define the operators F_γ and F_{γ,λ_k} accordingly. The next lemma establishes the properties of the fixed-point operator F_FDR of the FDR algorithm.
The averagedness of F_γ follows by combining [10, Proposition 4.1] and Lemma 2.8. For F_{γ,λ_k}, it is sufficient to apply the definition of averaged operators.
The (stationary) FDR iteration (1.5) can be written as a fixed-point iteration in terms of z_k [10, Theorem 4.2], namely z_{k+1} = F_{γ,λ_k}(z_k). As mentioned in the introduction, from [10, Theorem 4.2] we have the following convergence result: if γ ∈ ]0, 2β_V[ and λ_k ∈ ]0, (4β_V − γ)/(2β_V)[, then the sequence {z_k}_{k∈N} converges weakly to some z⋆ ∈ fix(F_γ), and the sequence {x_k}_{k∈N} converges weakly to x⋆ := P_V(z⋆), where P_V(z⋆) ∈ Argmin(Φ_V). On the other hand, the non-stationary FDR iteration (1.6) can be written as z_{k+1} = F_{γ_k,λ_k}(z_k). Then we have the following result about the global convergence of the algorithm.

Theorem 3.2 (Global convergence of non-stationary FDR). Consider the non-stationary FDR iteration (1.6). Suppose that Assumptions
Moreover, the sequence {z_k}_{k∈N} converges weakly to a point z⋆ ∈ fix(F_γ), and {x_k}_{k∈N} converges weakly to x⋆ := P_V(z⋆) ∈ Argmin(Φ_V). The main idea of the proof of the theorem (see Section A) is to treat the non-stationarity as a perturbation error of the stationary iteration.

Remark 3.3.
(i) As emphasized in the introduction, Theorem 3.2 remains true if the iteration is carried out approximately, i.e. if F_{γ_k}(z_k) is computed approximately, provided that the errors are summable; see [29, Section 6] for more details. (ii) With more assumptions on how fast {γ_k}_{k∈N} converges to γ, we can also derive the convergence rate of the sequence of residuals {||z_k − z_{k−1}||}_{k∈N}. However, since in Section 4 we study the local linear convergence behaviour of {z_k}_{k∈N}, we forgo this discussion here. Interested readers may consult [29] for more details on the rate of residuals.

Convergence rate of the Bregman divergence
As in Theorem 3.2, let z⋆ ∈ fix(F_γ) be a fixed point of (1.6) and x⋆ := P_V(z⋆). Then the following optimality condition holds, where v⋆ is the projection of y onto V^⊥, and we use the fact that ⟨v⋆, x⋆⟩ = 0. The motivation for choosing the above function to quantify the convergence rate of the FDR algorithm is that it measures both the discrepancy of the objective from the optimal value and the violation of the constraint on V. Lemma 3.4 hereafter provides a key estimate on D_Φ^v(u_k), which in turn is instrumental in deriving its convergence rate. Define also the following two auxiliary quantities ξ_k and ζ_k.

Lemma 3.4. Consider the non-stationary FDR iteration (1.6). Suppose that Assumptions (A.1)-(A.5) hold with λ_k ≡ 1. Then: (i) we have that D_Φ^v(x⋆) = 0, and the estimate (3.5) holds. The proof of the lemma can be found in Section A.
Remark 3.5. If we restrict γ_k to ]0, β_V], then the term ξ_k can be discarded in (3.5). If we further impose a monotonicity assumption on the sequence {γ_k}_{k∈N}, the term ζ_k also disappears.
With the above properties of D_Φ^v(u_k) at hand, we are able to present the main result on the convergence rate of the Bregman divergence.

Theorem 3.6. Consider the non-stationary FDR iteration (1.6). Suppose that Assumptions
See Section A for the proof.
(i) A typical situation ensuring boundedness of v⋆ is when ∂R(x⋆) is bounded. This requirement can be removed if we choose the element v⋆ more carefully: for instance, we can set v⋆ := (x⋆ − z⋆)/γ, which belongs to ∇F(x⋆) + ∂R(x⋆).
(ii) The main difficulty in establishing the convergence rate directly on D_Φ^v(u_k) (rather than on the best iterate) is that, for V ≠ H, we have no theoretical guarantee that {D_Φ^v(u_k)}_{k∈N} is decreasing, i.e. no descent property on D_Φ^v(u_k).

Application to FB splitting
Assume now that V = H, in which case problem (1.4) reduces to the unconstrained problem min_{x∈H} Φ(x) = R(x) + F(x). In this case, the FDR iteration (1.6) is nothing but the FB splitting scheme (1.2). Its unrelaxed, non-stationary version reads

x_{k+1} = prox_{γ_k R}( x_k − γ_k ∇F(x_k) ).

Specializing the Bregman divergence (3.4) to this case, we get D_Φ^v(y) = Φ(y) − Φ(x⋆), which is simply the objective value error. We have the following result.
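For the non-stationary FB scheme, the objective is in fact monotonically decreasing for step-sizes in ]0, 2β[ (this follows from the prox optimality condition combined with the descent lemma), which can be checked numerically. The LASSO instance and step-size schedule below are illustrative assumptions:

```python
import numpy as np

def prox_l1(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

rng = np.random.default_rng(5)
K = rng.standard_normal((20, 40))
y = rng.standard_normal(20)
mu = 0.5
beta = 1.0 / np.linalg.norm(K, 2) ** 2
obj = lambda v: 0.5 * np.linalg.norm(K @ v - y) ** 2 + mu * np.sum(np.abs(v))

# non-stationary steps gamma_k -> 1.5*beta, all inside ]0, 2*beta[, summable deviations
x = np.zeros(40)
vals = []
for k in range(4000):
    gamma = beta * (1.5 + 0.4 * 2.0 ** (-k))
    x = prox_l1(x - gamma * (K.T @ (K @ x - y)), gamma * mu)
    vals.append(obj(x))
```

The recorded objective values decrease at every iteration, consistently with the descent property that underlies the o(1/(k+1)) objective rate.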

Local linear convergence
From now on, we adopt a finite-dimensional setting, H = R^n, and investigate the local convergence behaviour of the FDR algorithm. In the sequel, we denote by z⋆ ∈ fix(F_γ) a fixed point of iteration (1.6) and x⋆ := P_V(z⋆).

Finite activity identification
We start with the finite activity identification result. Under the conditions of Theorem 3.2, we know that (u_k, x_k, z_k) → (x⋆, x⋆, z⋆). Moreover, the optimality conditions (4.1) hold. The condition needed for the identification result is built upon these monotone inclusions. Since x_k is the projection of z_k onto V, we have x_k ∈ V for all k ≥ 0. Therefore, we only need to discuss the identification property of u_k.
Theorem 4.1. Consider the non-stationary FDR iteration (1.6). Suppose that Assumptions (A.1)-(A.5) hold, so that (u_k, x_k, z_k) → (x⋆, x⋆, z⋆), where z⋆ ∈ fix(F_γ) and x⋆ = P_V(z⋆) ∈ Argmin(Φ_V). Moreover, assume that R ∈ PSF_{x⋆}(M_{x⋆}^R) and that the following non-degeneracy condition holds (4.2). Then: (i) there exists K ∈ N such that, for all k ≥ K, we have u_k ∈ M_{x⋆}^R. See Section B.2 for the proof.

Remark 4.2. As we mentioned before, for the global convergence of the sequence, approximation errors can be allowed, i.e. the proximity operators of R, J and the gradient of G can be computed approximately. However, for finite activity identification, we have no guarantees for (u_k, x_k) if such approximations are allowed. For example, if the update is contaminated by an additive error ε_k, then, unless ε_k ∈ V, we can no longer guarantee that x_k ∈ V.
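Finite identification is easy to observe numerically. The sketch below uses the FB special case (V = R^n) on a hypothetical LASSO instance: the support of the iterates, i.e. the active manifold of the polyhedral ℓ_1-norm, freezes after finitely many iterations.

```python
import numpy as np

def prox_l1(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# FB on a LASSO instance: R = mu*||.||_1 is partly smooth (even polyhedral), and
# the active manifold is the subspace of vectors sharing the solution's support.
rng = np.random.default_rng(6)
K = rng.standard_normal((30, 60))
x_true = np.zeros(60)
x_true[:4] = np.array([2.0, -1.5, 3.0, -2.5])
y = K @ x_true + 0.01 * rng.standard_normal(30)
mu = 0.5
beta = 1.0 / np.linalg.norm(K, 2) ** 2

x = np.zeros(60)
supports = []
for _ in range(3000):
    x = prox_l1(x - beta * (K.T @ (K @ x - y)), beta * mu)
    supports.append(tuple(np.flatnonzero(x)))

changes = [k for k in range(1, len(supports)) if supports[k] != supports[k - 1]]
last_change = changes[-1] if changes else 0   # support frozen after this iteration
```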

Locally linearized iteration
With the finite identification result at hand, we next show that the globally non-linear fixed-point iteration (3.3) can be locally linearized along the identified manifold M_{x⋆}^R. Define the corresponding function R on the manifold. We have the following key property of R.
Lemma 4.3. Let x⋆ ∈ Argmin(Φ_V), and suppose that R ∈ PSF_{x⋆}(M_{x⋆}^R). Then the Riemannian Hessian of R at x⋆ is symmetric positive semi-definite under either of the two circumstances: (i) the non-degeneracy condition (4.2) holds; (ii) M_{x⋆}^R is an affine subspace. In turn, the matrix W_R is symmetric positive definite with eigenvalues in ]0, 1]. Thus, by virtue of [4, Corollary 4.3(ii)], it is firmly non-expansive.
From now on, we assume that F (and hence G) is locally C²-smooth around x⋆. We have the following theorem for the linearized fixed-point formulation of (1.6).

Theorem 4.4 (Locally linearized iteration).
Consider the non-stationary FDR iteration (1.6), and suppose that Assumptions (A.1)-(A.5) hold. If, moreover, λ_k → λ ∈ ]0, (4β_V − γ)/(2β_V)[ and F is locally C² around x⋆, then, for all k large enough, the iteration can be written in the linearized form (4.6) in terms of a matrix M_{γ,λ}; see Section B.2 for the proof. Before presenting the local linear convergence result, we need to study the spectral properties of M_{γ,λ}, which is done in the lemma below.

Lemma 4.5 (Convergence properties of
Owing to the convergence property of M γ,λ , we can further simplify the linearized iteration (4.6).
Corollary 4.6. Consider the non-stationary FDR iteration (1.6) and suppose that it is run under the assumptions of Theorem 4.4. Then the following holds. See again Section B.2 for the proof.

Local linear convergence
We are now in a position to claim local linear convergence of the FDR iterates (under the assumptions of Theorem 4.4 and Lemma 4.5). Then the following holds: (ii) if, moreover, R is locally polyhedral around x⋆ and F is quadratic, the rate can be given explicitly; see Figure 2 in the experiments section for a numerical illustration. (iii) The above result can easily be extended to the GFB method; for the sake of simplicity, we skip the details here. Nevertheless, numerical illustrations will be provided in Section 5.

Extension to the three-operator splitting
So far, we have presented the global and local convergence analysis of the FDR method. As we recalled in the introduction, the FDR method is closely related to the three-operator splitting method (TOS) [17]. Therefore, it would be interesting to extend the obtained results to the TOS method. However, extending the global convergence results to TOS is far from straightforward. Hence, in the following, we mainly focus on the local linear convergence results.
Unlike F_γ (see (3.1)), T_γ cannot be simplified into a compact form. We have the following lemma for the convergence of the TOS method. Proof. See Proposition 2.1 and Theorem 2.1 of [17].
Similarly to (4.1), under Lemma 4.9 we have the following optimality condition at convergence. Following the footsteps of Sections 4.1-4.3, we next present the local linear convergence of the TOS method.

Finite activity identification
We start with the finite identification result. Since J is no longer the indicator function of a subspace, we obtain identification results for both u_k and x_k.
Corollary 4.10. Consider the TOS iteration (4.13). Suppose it is run under Assumptions (B.1)-(B.4), so that (u_k, x_k, z_k) → (x⋆, x⋆, z⋆), where z⋆ ∈ fix(T_γ), and that the following non-degeneracy condition holds (4.16). Then there exists K ∈ N such that, for all k ≥ K, both u_k and x_k lie on their respective active manifolds.

Locally linearized iteration. The next step is to linearize the TOS iteration. Define the corresponding functions. We start with the following result, a corollary of Lemma 4.3.
Corollary 4.11. Suppose that J ∈ PSF_{x⋆}(M_{x⋆}^J) and R ∈ PSF_{x⋆}(M_{x⋆}^R). Then their Riemannian Hessians at x⋆ are symmetric positive semi-definite under either of the following circumstances: (i) condition (4.16) holds; (ii) M_{x⋆}^J and M_{x⋆}^R are affine subspaces. In turn, the corresponding matrices are both firmly non-expansive. Now assume that F is locally C²-smooth around x⋆, and define the matrices of the linearized iteration (4.20). If, moreover, J and R are locally polyhedral around x⋆ and F is quadratic, further simplifications are possible. We could also specialize Corollary 4.6 to this context; we choose to skip it owing to its straightforwardness.
Local linear convergence. Finally, we are able to present the local linear convergence result for (4.13).
, and that F is locally C 2 around x .Then the following holds:

Numerical experiments
In this section, we illustrate our theoretical results on concrete examples arising from statistics, and signal/image processing applications.

Examples of partly smooth functions
Table 1 provides some examples of partly smooth functions that we will use throughout this section. More details about them can be found in [30, Section 5] and references therein.
Table 1: Examples of partly smooth functions. For x ∈ R^n and some subset of indices b ⊂ {1, . . ., n}, x_b is the restriction of x to the entries indexed by b. D_DIF stands for the finite-differences operator.

Function | Expression | Partly smooth manifold

The ℓ_1- and ℓ_∞-norms and the anisotropic TV semi-norm are all polyhedral functions; hence their Riemannian Hessians are simply 0. The ℓ_{1,2}-norm is not polyhedral, yet it is partly smooth relative to a subspace; the nuclear norm is partly smooth relative to the manifold of fixed-rank matrices, which is no longer flat. The Riemannian Hessians of these two functions are non-trivial and can be computed following [43].
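To illustrate how these partly smooth regularizers act, here are the proximity operators of the ℓ_1- and ℓ_{1,2}-norms (standard formulas; the input vector and groups are hypothetical). The ℓ_1 prox zeroes individual entries, so its active manifold is a coordinate subspace; the ℓ_{1,2} prox zeroes whole groups, so its active manifold is a group-support subspace.

```python
import numpy as np

def prox_l1(x, t):
    """Soft-thresholding: prox of t*||.||_1, a polyhedral partly smooth function."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def prox_l1_2(x, t, groups):
    """Group soft-thresholding: prox of t * sum_b ||x_b||_2 (the l1,2-norm)."""
    out = np.zeros_like(x)
    for b in groups:
        nrm = np.linalg.norm(x[b])
        if nrm > t:
            out[b] = (1.0 - t / nrm) * x[b]
    return out

x = np.array([3.0, -0.5, 0.2, 4.0, 0.1, -2.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

u_st = prox_l1(x, 1.0)            # zeroes small entries individually
v_gr = prox_l1_2(x, 2.5, groups)  # zeroes the last group entirely (its norm is ~2.0 < 2.5)
```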

Global convergence rate of the Bregman distance
We first demonstrate, numerically, the global o(1/k) convergence rate of the Bregman divergence of Section 3. Towards this goal, we consider the linearly constrained LASSO problem [22,20,23], namely

min_{x ∈ V} µ||x||_1 + (1/2)||Kx − y||².    (5.1)

In the latter, µ > 0 is a weight parameter, K : R^n → R^m is a linear operator, and V ⊂ R^n is a subspace. It is immediate to see that problem (5.1) falls within the scope of the FDR algorithm. We set λ_k ≡ 1 and the step-size γ_k ≡ 1/(7||K||²). The convergence profile of min_{0≤i≤k} D_Φ^v(u_i) is shown in Figure 1(a). The plot is in log-log scale, where the red line corresponds to the sublinear O(1/k) rate and the black line is min_{0≤i≤k} D_Φ^v(u_i). One can thus confirm numerically the prediction of Theorem 3.6. However, it can be observed that, beyond some iteration, typically k > 10³, the convergence regime changes and becomes linear. We argue in the following section that this is likely due to finite activity identification: the ℓ_1-norm is partly smooth (in fact even polyhedral), and, for all k large enough, FDR enters a local linear convergence regime.

Local linear convergence of FDR
Following the above discussion, in Figure 1(b) we present the local linear convergence of the FDR algorithm in terms of ||z_k − z⋆||, as we are within the scope of Theorem 4.7(ii). We use the same parameter settings as in Figure 1(a). The red line stands for the estimated rate (see Theorem 4.7), while the black line is the numerical observation. The starting point of the red line is the iteration at which u_k identifies the manifold. We now investigate numerically the convergence behaviour of the non-stationary version of FDR and compare it to the stationary one. We fix λ_k ≡ 1, i.e. the iteration is unrelaxed. The stationary FDR algorithm is run with γ = β. For the non-stationary versions, four choices of γ_k are considered (Cases 1-4). Obviously, we have γ_k → γ = β and Σ_{k∈N} |γ_k − γ| < +∞ in all four cases. Problem (5.1) is considered again. The comparison results are displayed in Figure 2(a). For the stationary iteration, the local convergence rate of FDR is ρ = 0.9955. We can make the following observations from the comparison:
• In agreement with our analysis, the local convergence behaviour of the non-stationary iteration is no better than that of the stationary one. This contrasts with the global behaviour, where non-stationarity can be beneficial (see the last comment hereafter);
• As argued in Remark 4.8(ii), the convergence rate is eventually controlled by the error |γ_k − γ|, except for "Case 4". Indeed, 0.5 is strictly smaller than the local linear rate of the stationary version;
• The non-stationary FDR seems to lead to faster identification, typically in "Case 3". This is the effect of larger step-sizes at the early stage of the algorithm.

Local linear convergence of GFB and TOS
To conclude the numerical experiments, we demonstrate the local convergence behaviour of the TOS method. Consider the following problem, which consists of recovering a vector that is group-sparse and piecewise constant inside each non-zero group (5.4). The parameters µ_1, µ_2 > 0 are tradeoff weights. This problem is a special case of (4.12), and hence it can be solved by the TOS scheme (4.13) and also by the GFB algorithm (1.8).
Figure 2(b) shows the local convergence behaviour of the TOS method solving problem (5.4). The red line stands for the theoretical rate (estimated by Corollary 4.14), while the black line is the numerical observation. Similarly to the observation for the FDR method in Figure 1(b), local linear convergence of TOS occurs, and our theoretical rate estimate is rather tight. We also plot the convergence profile of the GFB method (blue line). It can be observed that the performance of GFB and TOS is almost the same.

Conclusion
In this paper, we studied global and local convergence properties of the Forward–Douglas–Rachford method. Globally, we established an o(1/k) convergence rate for the best iterate and an O(1/k) ergodic rate in terms of a Bregman divergence criterion designed for the method. We also specialized the result to the case of the Forward–Backward splitting method, for which we showed that the objective function converges at an o(1/k) rate. Then, locally, we proved linear convergence of the sequence when the involved functions are moreover partly smooth. In particular, we demonstrated that the method identifies the active manifolds in finite time, and that it then converges locally linearly at a rate that we characterized precisely. We also extended the local linear convergence result to the case of the three-operator splitting method. Our numerical experiments supported the theoretical findings.
(i) Recall that Argmin(Φ_V) is nothing but P_V(fix(F_γ)). Then, as (A.4) assumes that the set of minimizers of (1.4) is non-empty, so is the set fix(F_γ). (ii) Owing to Lemma 3.1, we have that F_{γ_k,λ_k} is (α_k λ_k)-averaged non-expansive. (iii) Owing to the averagedness of F_γ and F_{γ_k}, we have the claimed estimate. Let ρ > 0 be a positive number; then, for all z ∈ H such that ||z|| ≤ ρ, we have the bound (A.2). Given γ ∈ ]0, 2β_V[, define the two operators appearing in (A.3). For the first term of (A.3), ∇G(0) is obviously bounded. Now consider the second term of (A.3). Denote z_V = P_V(z), w_k = prox_{γ_k R}(y) and w = prox_{γR}(y). Using the classical resolvent equation [9] and the non-expansiveness of the proximity operator, we obtain (A.4). This, together with the firm non-expansiveness of Id − prox_{γR}, yields (A.5). Using the triangle inequality, the Pythagorean theorem and the non-expansiveness of β_V ∇G, we obtain (A.6). Then, putting together (A.2), (A.4), (A.5) and (A.6), we get the desired estimate for all ρ ∈ [0, +∞[. As all the conditions listed in [29, Theorem 4] are thus met for the non-stationary FDR iteration, weak convergence of the sequence {z_k}_{k∈N} follows. In turn, since P_V is linear, weak convergence of {x_k}_{k∈N} is obtained.
For the sequence {u_k}_{k∈N}, observe from the second equation in (1.6) that u_{k+1} = (z_{k+1} − z_k)/λ_k + x_k, where the first term converges strongly to 0. We thus obtain weak convergence of {u_k}_{k∈N} to the same weak cluster point as {x_k}_{k∈N}.
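To make the roles of x_k, u_{k+1} and z_{k+1} concrete, here is a minimal NumPy sketch of the FDR update as it can be read off (1.6) and the relations used in this appendix (x_k = P_V(z_k), u_{k+1} = prox_{γ_k R}(2x_k − z_k − γ_k ∇G(x_k)), z_{k+1} = z_k + λ_k(u_{k+1} − x_k)), applied to a small subspace-constrained LASSO instance. The data, the choice of V and the stationary parameters are all assumptions made for the sake of the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 12, 20
K = rng.standard_normal((m, n))
f = rng.standard_normal(m)
mu = 2.0                                      # l1 weight (illustrative)

# V: coordinates beyond the first 15 vanish (an illustrative subspace)
mask = np.zeros(n); mask[:15] = 1.0
P_V = lambda z: mask * z                      # orthogonal projector onto V

grad_G = lambda x: K.T @ (K @ x - f)          # G(x) = 0.5 * ||Kx - f||^2
prox_R = lambda y, g: np.sign(y) * np.maximum(np.abs(y) - g * mu, 0.0)

gamma = 1.0 / np.linalg.norm(K, 2) ** 2       # safe step, well inside ]0, 2*beta_V[
lam = 1.0                                     # stationary relaxation lambda_k = 1

z = np.zeros(n)
for k in range(5000):
    x = P_V(z)
    u = prox_R(2 * x - z - gamma * grad_G(x), gamma)
    z_new = z + lam * (u - x)
    # identity used above: u_{k+1} = (z_{k+1} - z_k)/lambda_k + x_k
    assert np.allclose(u, (z_new - z) / lam + x)
    z = z_new

x = P_V(z)                                    # x_k converges to a minimizer over V
```

At convergence the fixed-point residual u − x vanishes, consistently with u_k and x_k sharing their weak cluster points.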
If H is finite-dimensional, we use the optimality conditions (4.1) and argue as for (A.5) to obtain the claim.

We present below the proof of the main energy estimate of Section 3.
Proof of Lemma 3.4. The non-negativity of D_Φ^v(u_k) is obvious, owing to the convexity of Φ. So, in what follows, we focus on the second claim of the lemma. From the update of u_k in (1.6) and the definition of the proximity operator, we get

By the convexity of R, we obtain, for every y ∈ H,

Notice that u_{k+1} = x_k + z_{k+1} − z_k. Then the first inner product in the last line of (A.7) can be rewritten as

In the latter, we applied to y − z_k, z_k − z_{k+1} the usual Pythagoras relation. Combining (A.8) with (A.7), we obtain

Since G is convex, for any x_k and y ∈ H, we have

Summing up (A.9) and (A.10) and rearranging the terms, we get

Since G has Lipschitz continuous gradient, applying Lemma 2.1 yields

Now sum up the above two inequalities, recall the definition of ξ, and note that we applied again the same Pythagoras relation. Then, from (A.11), we have (A.12). Recall that x_k ∈ V, hence P_{V⊥}(x_k) = 0. Then, using (A.12), we obtain the following estimate for the Bregman divergence (defined in (3.4)):

where y_V := P_V(y) and z_{V,k} := P_V(z_k) are the projections of y and z_k onto V respectively. From the above, we deduce the following result. In particular, taking y = x* ∈ V in the last inequality and using the fact that P_{V⊥}(x*) = 0 leads to the desired result.
The following lemma is very classical; see e.g. [24, Theorem 3.3.1]. Summing the inequality (3.5) up to some k ∈ N yields

Lastly, as {z_k}_{k∈N} is bounded, so is {||z_k − x*||}_{k∈N}. Recall that, by assumption, {γ_k}_{k∈N} converges to some γ ∈ ]0, 2β_V[ with {|γ_k − γ|}_{k∈N} summable. Then we have

Putting the above results together, we get that (k+1)θ_k ≤ C < +∞ holds for all k ∈ N, which means

Now, owing to the definition of θ_k, the sequence {θ_k}_{k∈N} is non-increasing. Invoking Lemma A.1 on {θ_k}_{k∈N} concludes the proof.
For the ergodic rate, we start again from (3.5) and apply Jensen's inequality to the convex function D_Φ^v, which gives

where the right-hand side is bounded by arguing as above.
Proof of Corollary 3.8. First, weak convergence of the non-stationary FB iterates follows from Theorem 3.2.
On the one hand, specializing inequality (3.5) to the case of FB splitting, we get

which means that

On the other hand, by virtue of inequality (A.11) in the proof of Lemma 3.4, for any y ∈ H we have

Choosing y = x_k, we obtain

where δ = 1/γ − 1/(2β) > 0 since γ < 2β. This implies that the sequence {Φ(x_k) − Φ(x*)}_{k∈N} is positive and non-increasing.
Summing up both sides of the above inequality and applying Lemma A.1 leads to the claimed result.
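The monotone decrease just established is easy to observe numerically. Below is a minimal NumPy sketch of the FB (proximal gradient) specialization on an illustrative unconstrained LASSO instance; the data and the conservative step choice γ = β (well inside ]0, 2β[) are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 10, 15
K = rng.standard_normal((m, n)); f = rng.standard_normal(m)
mu = 0.3                                       # l1 weight (illustrative)

G      = lambda x: 0.5 * np.sum((K @ x - f) ** 2)
grad_G = lambda x: K.T @ (K @ x - f)
Phi    = lambda x: G(x) + mu * np.sum(np.abs(x))
prox_R = lambda y, g: np.sign(y) * np.maximum(np.abs(y) - g * mu, 0.0)

beta = 1.0 / np.linalg.norm(K, 2) ** 2         # cocoercivity modulus of grad G
gamma = beta                                   # conservative step in ]0, 2*beta[

x = np.zeros(n)
vals = []
for k in range(300):
    vals.append(Phi(x))
    # FB update: x_{k+1} = prox_{gamma R}(x_k - gamma * grad G(x_k))
    x = prox_R(x - gamma * grad_G(x), gamma)

# {Phi(x_k)} is nonincreasing along the FB iterates for this step size
assert all(vals[k + 1] <= vals[k] + 1e-12 for k in range(len(vals) - 1))
```

With γ = β the objective decreases monotonically, matching the positivity and monotonicity of {Φ(x_k) − Φ(x*)} used in the proof.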

B Proofs of Section 4

B.1 Riemannian Geometry
Let M be a C²-smooth embedded submanifold of R^n around a point x. With some abuse of terminology, we shall say C²-manifold instead of C²-smooth embedded submanifold of R^n. The natural embedding of a submanifold M into R^n permits us to define a Riemannian structure and to introduce geodesics on M, and we simply say that M is a Riemannian manifold. We denote by T_M(x') and N_M(x') the tangent and normal spaces of M at a point x' near x in M.

Exponential map
Geodesics generalize the concept of straight lines of R^n, preserving the zero-acceleration characteristic, to manifolds. Roughly speaking, a geodesic is locally the shortest path between two points on M. We denote by g(t; x, h) the value at t ∈ R of the geodesic starting at g(0; x, h) = x ∈ M with velocity ġ(t; x, h) = dg/dt(t; x, h) = h ∈ T_M(x) (which is uniquely defined). For every h ∈ T_M(x), there exists an interval I around 0 and a unique geodesic g(t; x, h) : I → M such that g(0; x, h) = x and ġ(0; x, h) = h. The mapping Exp_x : T_M(x) → M, h ↦ g(1; x, h), is then called the exponential map.

Parallel translation

Given two points x, x' ∈ M, let T_M(x), T_M(x') be their corresponding tangent spaces. Define τ : T_M(x) → T_M(x'), the parallel translation along the unique geodesic joining x to x', which is an isomorphism and an isometry w.r.t. the Riemannian metric.

Riemannian gradient and Hessian
For a vector v ∈ N_M(x), the Weingarten map of M at x is the operator W_x(·, v) : T_M(x) → T_M(x) defined by

where V is any local extension of v to a normal vector field on M. The definition is independent of the choice of the extension V, and W_x(·, v) is a symmetric linear operator closely tied to the second fundamental form of M; see [12, Proposition II.2.1].
Let J be a real-valued function which is C² along M around x. The covariant gradient of J at x' ∈ M is the vector ∇_M J(x') ∈ T_M(x') defined by

where P_M is the projection operator onto M. The covariant Hessian of J at x' is the symmetric linear mapping ∇²_M J(x') from T_M(x') to itself defined by

This definition agrees with the usual definitions using geodesics or connections [36]. Now assume that M is a Riemannian embedded submanifold of R^n, and that a function J has a C²-smooth restriction on M. This can be characterized by the existence of a C²-smooth extension (representative) of J, i.e. a C²-smooth function J̃ on R^n such that J̃ agrees with J on M. Then the Riemannian gradient ∇_M J(x') is also given by

and, for every h ∈ T_M(x'), the Riemannian Hessian reads

where the last equality comes from [1, Theorem 1]. When M is an affine or linear subspace of R^n, then obviously M = x + T_M(x) and W_x(h, P_{N_M(x')} ∇J̃(x')) = 0, hence (B.3) reduces to ∇²_M J(x') = P_{T_M(x')} ∇²J̃(x') P_{T_M(x')}. See [25, 12] for more material on differential and Riemannian manifolds.
We have the following lemma characterizing the parallel translation and the Riemannian Hessian at two close points in M.

Lemma B.1. Let x, x' be two close points in M, let T_M(x), T_M(x') be the tangent spaces of M at x and x' respectively, and let τ : T_M(x') → T_M(x) be the parallel translation along the unique geodesic joining x to x'. Then, for any bounded vector v ∈ R^n, the parallel translation satisfies

Moreover, the Riemannian Taylor expansion of J ∈ C²(M) at x for the point x' reads

B.2 Proofs of main theorems
Proof of Theorem 4.1. From the update of u_{k+1} and the definition of the proximity operator, we have

At convergence, we have

Theorem 3.2 then allows us to infer that the right-hand side of the inequality converges to 0. In addition, since

Proof of Theorem 4.4. From (1.6), since V is a subspace, we have, for x_k,

Projecting onto V leads to

Upon projecting onto the corresponding tangent spaces and applying the parallel translation τ_{k+1} from u_{k+1} to x*, we get

Subtracting both equations, we obtain

Proof of Theorem 4.7.
(i) Let K ∈ N be sufficiently large so that (4.8) holds. We then have, from Corollary 4.6(i),

By assumption, χ_j = C η^j for some constant C ≥ 0 and η < ρ, and we have

Theorem 4.7. Consider the non-stationary FDR iteration (1.6) and suppose it is run under the conditions of Theorem 4.4. Let ρ

Lemma 4.12. The matrix L_γ is 2β/(4β − γ)-averaged non-expansive.

Proof. See [17, Proposition 2.1].

The above lemma entails that L_γ and L_{γ,λ} are convergent, hence the spectral properties established in Lemma 4.5 apply to them. Denote L_γ^∞ := lim_{k→+∞} L_{γ,λ}^k. We are now in position to present the result on the local linearization of iteration (4.13) along the identified manifolds.

Figure 1: (a) convergence rate of the Bregman distance inf_{0≤i≤k} D_Φ^v(u_i) of the FDR algorithm for solving the subspace-constrained LASSO problem (5.1); (b) convergence rate of ||z_k − z*|| for the same problem.
M_x^R. As shown in the figure, we indeed observe local linear convergence of ||z_k − z*||. Moreover, since F = (1/2)||Kx − f||² is quadratic and R = ||x||_1 is polyhedral, our theoretical rate estimate is tight, i.e. the red line has the same slope as the black line.
sub-differentially continuous at every point of its domain [42, Example 13.30], and in particular at x*. It then follows that R(u_k) → R(x*). Altogether, this shows that the conditions of [21, Theorem 5.3] are fulfilled for R, and the finite identification claim follows.

(a) In this case, M_x^R is an affine subspace, i.e. M_x^R = x* + T_x^R. Since R is partly smooth at x* relative to M_x^R, the sharpness property holds at all nearby points in M_x^R [26, Proposition 2.10]. Thus, for k large enough, i.e. u_k sufficiently close to x* on M_x^R, we indeed have T_{u_k}(M_x^R) = T_x^R = T_{u_k}^R as claimed.

(b) It is immediate to verify that a function which is locally polyhedral around x* is partly smooth at x* relative to the affine subspace x* + T_x^R, and thus the first claim follows from (ii)(a). For the rest, it suffices to observe that, by polyhedrality, ∂R(x) = ∂R(x*) for any x ∈ M_x^R near x*. Therefore, combining local normal sharpness [26, Proposition 2.10] with Lemma B.2 yields the second conclusion.
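Finite identification is easy to observe numerically for R = ||·||_1, whose active manifold at x* is determined by the support. The sketch below runs the FB specialization (for brevity, rather than full FDR) on an illustrative LASSO instance and checks that the support of the iterates freezes after finitely many steps; the data and parameters are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 8, 12
K = rng.standard_normal((m, n)); f = rng.standard_normal(m)
mu = 1.0                                       # l1 weight (illustrative)

grad = lambda x: K.T @ (K @ x - f)             # gradient of 0.5*||Kx - f||^2
prox = lambda y, t: np.sign(y) * np.maximum(np.abs(y) - t * mu, 0.0)
gamma = 1.0 / np.linalg.norm(K, 2) ** 2        # safe step size

x = np.zeros(n)
supports = []
for k in range(3000):
    x = prox(x - gamma * grad(x), gamma)       # FB update
    supports.append(tuple(np.abs(x) > 0))      # active manifold marker

# after identification, the support (hence the manifold x* + T_x^R) is constant
tail = supports[2500:]
assert all(s == tail[0] for s in tail)
```

Generically (under the non-degeneracy assumed by the identification theorem), the support stabilizes well before the iterates converge, which is exactly the two-regime behavior described above.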

x_k − x* = P_V(z_k − z*).  (B.8)

Under the assumptions of Theorem 4.1, there exists K ∈ N large enough such that, for all k ≥ K, u_k ∈ M_x^R. Denote by T_{u_k}^R and T_x^R the tangent spaces corresponding to u_k and x* ∈ M_x^R, and by τ_k^R : T_{u_k}^R → T_x^R the parallel translation along the unique geodesic on M_x^R joining u_k to x*. Similarly to (B.8), owing to [28, Lemma 5.1], we have for u_k after identification that

u_k − x* = P_{T_x^R}(u_k − x*) + o(||u_k − x*||).  (B.9)

The update of u_{k+1} in (1.6) and its convergence are respectively equivalent to

2x_k − z_k − u_{k+1} − γ_k ∇G(x_k) ∈ γ_k ∂R(u_{k+1})  and  2x* − z* − x* − γ∇G(x*) ∈ γ ∂R(x*).

Lemma 2.1 (Descent lemma [7]). Suppose G : H → R is differentiable with (1/β)-Lipschitz continuous gradient. Then, for every x, y ∈ H, G(y) ≤ G(x) + ⟨∇G(x), y − x⟩ + (1/(2β))||y − x||².
⟨·, ·⟩ and norm ||·||. Id denotes the identity operator on H. Denote by ι_C the indicator function of C, N_C the associated normal cone operator, and P_C the orthogonal projection operator onto C.

Lemma 2.6. Let F : H → H. The following statements are equivalent:
For the first case of Theorem 4.7, if M_γ^∞ = 0, then we obtain the convergence rate directly on ||z_k − z*||. Moreover, we can further derive the convergence rates of ||x_k − x*|| and ||u_k − x*||.

(ii) The condition on λ_k|γ_k − γ| in Theorem 4.7(i) amounts to saying that {γ_k}_{k∈N} should converge fast enough to γ. Otherwise, the local convergence rate would be dominated by that of λ_k|γ_k − γ|. In particular, if λ_k|γ_k − γ| converges sub-linearly to 0, then the local convergence rate eventually becomes sub-linear. See Figure 2.

Lemma A.1 ([24, Theorem 3.3.1]). Let the sequence {a_k}_{k∈N} be nonnegative, non-increasing and summable. Then a_k = o(k^{-1}).
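Lemma A.1 is easy to illustrate numerically: monotonicity together with summability forces k·a_k → 0 (summability alone would not suffice, as a summable sequence with sparse spikes shows). The concrete sequence below is an illustrative choice, summable by the integral test.

```python
import numpy as np

# a_k = 1/((k+2) * log(k+2)^2) is nonnegative, non-increasing and summable,
# so Lemma A.1 predicts k * a_k -> 0 (slowly, through the log factor).
k = np.arange(1, 10**6, dtype=float)
a = 1.0 / ((k + 2.0) * np.log(k + 2.0) ** 2)

assert np.all(a >= 0)                # nonnegative
assert np.all(np.diff(a) <= 0)       # non-increasing
ka = k * a
assert ka[-1] < ka[1000] < ka[10]    # k * a_k keeps decaying toward 0
```

Here k·a_k behaves like 1/log²k, which tends to 0 while a_k itself is not O(k^{-1-ε}) for any ε > 0, showing the o(k^{-1}) rate of the lemma is essentially sharp.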