Convergence Rates of Forward–Douglas–Rachford Splitting Method
Abstract
Over the past decades, operator splitting methods have become ubiquitous for nonsmooth optimization owing to their simplicity and efficiency. In this paper, we consider the Forward–Douglas–Rachford splitting method and study both global and local convergence rates of this method. For the global rate, we establish a sublinear convergence rate in terms of a Bregman divergence suitably designed for the objective function. Moreover, when specializing to the Forward–Backward splitting, we prove a stronger convergence rate result for the objective function value. Then locally, based on the assumption that the nonsmooth part of the optimization problem is partly smooth, we establish local linear convergence of the method. More precisely, we show that the sequence generated by Forward–Douglas–Rachford first (i) identifies a smooth manifold in a finite number of iterations and then (ii) enters a local linear convergence regime, which is characterized, for instance, in terms of the structure of the underlying active smooth manifold. To exemplify the usefulness of the obtained result, we consider several concrete numerical experiments arising from applied fields including, for instance, signal/image processing, inverse problems and machine learning.
Keywords
Forward–Douglas–Rachford Forward–Backward Bregman distance Partial smoothness Finite identification Local linear convergence
Mathematics Subject Classification
49J52 65K05 65K10 90C25
1 Introduction
Operator splitting methods are iterative schemes to solve inclusion and optimization problems by decoupling the original problem into subproblems that are easy to solve. These schemes evaluate the individual operators, their resolvents and the linear operators, all separately at various points in the course of iteration, but never the resolvents of sums nor of compositions with a linear operator. Since the first operator splitting methods were developed in the 1970s for solving structured monotone inclusion problems, the class of splitting methods has been regularly enriched with increasingly sophisticated algorithms as the structure of the problems to handle becomes more complex. We refer the reader to [1] and references therein for a thorough account of operator splitting methods.
In this paper, we consider a subspace-constrained optimization problem, where the objective function is the sum of a proper convex and lower semicontinuous function and a convex differentiable function with Lipschitz gradient. To efficiently handle the constraint, a provably convergent algorithm is the Forward–Douglas–Rachford splitting algorithm (FDR) [2], which is a hybridization of the Douglas–Rachford splitting algorithm (DR) [3] and the Forward–Backward splitting algorithm (FB) [4]. FDR is also closely related to the generalized Forward–Backward splitting algorithm (GFB) [5, 6] and the three-operator splitting method (TOS) [7].
Global sublinear convergence rates for the asymptotic regularity of the sequence generated by FDR (hence of all the above-mentioned algorithms) have recently been established in the literature, from the perspective of Krasnosel’skiĭ–Mann fixed-point iteration; see, for instance, [8] and the references therein. This makes it possible to derive convergence rates for the distance from 0 to the objective subdifferential evaluated at the iterate. However, very limited results have been reported in the literature on the convergence rate of the objective function value for FDR, except for certain specific cases. For instance, the objective convergence rates of Forward–Backward splitting and its accelerated versions are now well understood [9, 10, 11, 12, 13, 14]. These results rely essentially on some monotonicity property of a properly designed Lyapunov function. Given that FDR is a fixed-point algorithm, it is much more difficult, or even impossible, to study the convergence rate of the objective function value. Indeed, such algorithms generate several different points along the course of iteration, making it rather challenging to design a proper Lyapunov function (as we shall see for the FDR algorithm in Sect. 4).
Local linear convergence of operator splitting algorithms for optimization has recently attracted a lot of attention; see [15] for Forward–Backward-type methods, [16] for Douglas–Rachford splitting, and [17] for Primal–Dual splitting algorithms. This line of work exploits the underlying geometric structure of the optimization problems, achieving local linear convergence results without assuming conditions like strong convexity, unlike what is proved in [8, 18]. In practice, local linear convergence of the FDR algorithm is also observed. However, to our knowledge, there is no theoretical explanation available for this local behaviour.

In Sect. 4, we first prove the convergence of the newly proposed nonstationary FDR scheme (6). This is achieved by capturing nonstationarity as an error term. The proof exploits a general result on inexact and nonstationary Krasnosel’skiĭ–Mann fixed-point iteration developed in [8].

We design a Bregman divergence as a meaningful convergence criterion. Under the standard assumptions, we show pointwise and ergodic convergence rates of this criterion (Theorem 4.2). When specializing the result to Forward–Backward splitting, we obtain a stronger claim for the objective convergence rate of the method. The allowed range of step sizes for the latter rate to hold is twice as large as the one known in the literature.

Finite Time Activity Identification Under the assumption that the nonsmooth component of the optimization problem is partly smooth around a global minimizer relative to its smooth submanifold (see Definition 2.4) and under a nondegeneracy condition (see (31)), we show in Sect. 5 (Theorem 5.1) that the sequence generated by the nonstationary FDR identifies in finite time the solution submanifold. In plain words, this means that, after a finite number of iterations, the sequence enters the submanifold and never leaves it. We also provide a bound on the number of iterations to achieve identification.

Local Linear Convergence Exploiting the finite identification property, we then show that the sequence generated by nonstationary FDR converges locally linearly. We characterize the convergence rate precisely based on the properties of the identified partial smoothness submanifolds.

Three-operator Splitting Given the close relation between the three-operator splitting method and FDR, in Sect. 5.4, we extend the above local linear convergence result to the case of the three-operator splitting algorithm.
Local linear convergence of the sequence in the absence of strong convexity has received increasing attention in the past few years in the context of first-order proximal splitting methods. The key idea here is to exploit the geometry of the underlying objective around its minimizers. This has been done for instance in [15, 16, 17, 19] for the FB scheme, Douglas–Rachford splitting/ADMM and Primal–Dual splitting, under the umbrella of partial smoothness. The error bound property,^{1} as highlighted in the seminal work of [22, 23], is used by several authors to study linear convergence of first-order descent-type algorithms, and in particular FB splitting; see, for example, [20, 21, 24, 25]. However, to the best of our knowledge, no local linear convergence results are available for the FDR algorithm.
Paper Organization The rest of the paper is organized as follows. In Sect. 2, we recall some classical material on convex analysis and operator theory that is essential to our exposition. We then introduce the notion of partial smoothness. The problem statement and the FDR algorithm are presented in Sect. 3. The global convergence analysis is presented in Sect. 4, followed by finite identification and local convergence analysis in Sect. 5. Several numerical experiments are presented in Sect. 6. Some introductory material on smooth Riemannian manifolds is gathered in “Appendix”.
2 Preliminaries
Throughout the paper, \(\mathcal {H}\) is a Hilbert space equipped with scalar product \(\langle \cdot ,\,\cdot \rangle \) and norm \({\Vert } \cdot {\Vert }\). \(\text {Id}\) denotes the identity operator on \(\mathcal {H}\). \(\Gamma _0(\mathcal {H})\) denotes the set of proper convex and lower semicontinuous functions on \(\mathcal {H}\).
Sets For a nonempty convex set \(C \subset \mathcal {H}\), \(\text {par}(C) :=\mathbb {R}(C-C)\) denotes the smallest subspace parallel to C. Denote \(\iota _{C}\) the indicator function of C, \({\mathcal {N}}_{C}\) the associated normal cone operator and \(\text {P}_{C}\) the orthogonal projection on C. The strong relative interior of C is \(\text {sri}(C)\).
Functions Given \(R \in \Gamma _0(\mathcal {H})\), its subdifferential is a set-valued operator defined by \(\partial R : \mathcal {H}\rightrightarrows \mathcal {H},\, x\mapsto \left\{ v \in \mathcal {H}: R(x') \ge R(x) + \langle v,\,x'-x \rangle , \forall x' \in \mathcal {H}\right\} \).
Lemma 2.1
Definition 2.1
Notice that the Bregman divergence is not a distance in the usual sense, as it is in general not symmetric.^{2} However, it measures the distance of two points in the sense that \(\mathcal {D}_{R}^{v} (x, x) = 0\) and \(\mathcal {D}_{R}^{v} (y,x) \ge 0\) for any x, y in \(\text {dom}(R)\). Moreover, \(\mathcal {D}_{R}^{v} (y,x) \ge \mathcal {D}_{R}^{v} (w,x)\) for all w in the line segment between x and y.
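The properties listed above can be checked numerically on a scalar example. The sketch below assumes the standard form \(\mathcal{D}_{R}^{v}(y,x) = R(y) - R(x) - \langle v,\,y-x \rangle\) with \(v \in \partial R(x)\); the function \(R(t)=t^4\) and the helper name `bregman` are illustrative choices, not taken from the paper.

```python
def bregman(R, v, y, x):
    """Bregman divergence D_R^v(y, x) = R(y) - R(x) - v * (y - x) for scalars.

    v is assumed to be a subgradient of R at x (assumed standard definition).
    """
    return R(y) - R(x) - v * (y - x)


# Illustrative smooth convex function and its derivative (hypothetical choice).
R = lambda t: t ** 4
dR = lambda t: 4 * t ** 3

d_yx = bregman(R, dR(1.0), 2.0, 1.0)    # divergence from x = 1 to y = 2
d_xy = bregman(R, dR(2.0), 1.0, 2.0)    # roles of the two points swapped
d_mid = bregman(R, dR(1.0), 1.5, 1.0)   # w = 1.5 lies on the segment [x, y]
d_xx = bregman(R, dR(1.0), 1.0, 1.0)    # same point twice

print(d_yx, d_xy, d_mid, d_xx)
```

The two swapped evaluations differ, which illustrates the lack of symmetry, while the value at the intermediate point does not exceed the value at the endpoint.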
Operators Given a set-valued mapping \(A: \mathcal {H}\rightrightarrows \mathcal {H}\), define its graph as \(\text {gph}\,(A) :=\{ (x,u)\in \mathcal {H}\times \mathcal {H}: \ u \in A(x) \}\), and its set of zeros as \(\text {zer}(A) = \{ x \in \mathcal {H}: 0 \in A(x) \}\). Denote \((\text {Id}+ A)^{-1}\) the resolvent of A.
Definition 2.2
(Cocoercive Operator) Let \(\beta > 0\) and \(B:\mathcal {H}\rightarrow \mathcal {H}\), then B is \(\beta \)-cocoercive, if \(\langle B(x_1)-B(x_2),\,x_1-x_2 \rangle \ge \beta {\Vert } B(x_1)-B(x_2) {\Vert }^2 ,\, \forall x_1, x_2 \in \mathcal {H}\).
If an operator is \(\beta \)-cocoercive, then it is \({\beta }^{-1}\)-Lipschitz continuous.
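The Lipschitz claim follows in one line from the Cauchy–Schwarz inequality; the short derivation below is a standard argument added for convenience:

$$\begin{aligned} \beta {\Vert } B(x_1)-B(x_2) {\Vert }^2 \le \langle B(x_1)-B(x_2),\,x_1-x_2 \rangle \le {\Vert } B(x_1)-B(x_2) {\Vert }\,{\Vert } x_1-x_2 {\Vert } , \end{aligned}$$

and dividing by \(\beta {\Vert } B(x_1)-B(x_2) {\Vert }\) (when it is nonzero) gives \({\Vert } B(x_1)-B(x_2) {\Vert } \le \beta ^{-1} {\Vert } x_1-x_2 {\Vert }\).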
Definition 2.3
(Nonexpansive Operator) An operator \(\mathscr {F}: \mathcal {H}\rightarrow \mathcal {H}\) is nonexpansive, if \({\Vert } \mathscr {F}(x) - \mathscr {F}(y) {\Vert } \le {\Vert } x-y {\Vert },\,\forall x, y \in \mathcal {H}\). For any \(\alpha \in ]0,1[\), \(\mathscr {F}\) is called \(\alpha \)-averaged, if there exists a nonexpansive operator \(\mathscr {F}'\) such that \(\mathscr {F}= \alpha \mathscr {F}' + (1-\alpha )\text {Id}\).
In particular, when \(\alpha = \frac{1}{2}\), \(\mathscr {F}\) is called firmly nonexpansive. Several properties of firmly nonexpansive operators are collected in the following lemma.
Lemma 2.2
 (i)
\(\mathscr {F}\) is firmly nonexpansive;
 (ii)
\(2\mathscr {F}-\text {Id}\) is nonexpansive;
 (iii)
\(\mathscr {F}\) is the resolvent of a maximal monotone operator \(A: \mathcal {H}\rightrightarrows \mathcal {H}\).
Proof
\((\mathrm{i})\Leftrightarrow (\mathrm{ii})\) follows [1, Proposition 4.2, Corollary 4.29], and \((\mathrm{i})\Leftrightarrow (\mathrm{iii})\) is [1, Corollary 23.8]. \(\square \)
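As a numerical illustration of the equivalence between (i) and (ii), the sketch below tests both inequalities on random pairs of points for componentwise soft-thresholding, which is the proximity operator of a multiple of the \(\ell _1\)-norm and hence a resolvent. The choice of operator, the dimensions and the helper names are assumptions made for the example only.

```python
import random


def soft_threshold(x, tau):
    """Proximity operator of tau * ||.||_1, applied componentwise."""
    return [max(abs(t) - tau, 0.0) * (1.0 if t > 0 else -1.0) for t in x]


def dot(a, b):
    return sum(s * t for s, t in zip(a, b))


def worst_violations(num_trials=1000, dim=5, tau=0.7, seed=0):
    """Largest observed violation of the two inequalities (<= 0 means none)."""
    rng = random.Random(seed)
    worst_firm = worst_reflection = float("-inf")
    for _ in range(num_trials):
        x = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
        y = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
        d = [s - t for s, t in zip(x, y)]
        d_f = [s - t for s, t in
               zip(soft_threshold(x, tau), soft_threshold(y, tau))]
        # Firm nonexpansiveness: ||F(x)-F(y)||^2 <= <F(x)-F(y), x-y>.
        worst_firm = max(worst_firm, dot(d_f, d_f) - dot(d_f, d))
        # Nonexpansiveness of the reflection 2F - Id.
        refl = [2.0 * s - t for s, t in zip(d_f, d)]
        worst_reflection = max(worst_reflection, dot(refl, refl) - dot(d, d))
    return worst_firm, worst_reflection


firm_gap, reflection_gap = worst_violations()
print(firm_gap, reflection_gap)
```

Both reported gaps should be nonpositive up to rounding, in line with the lemma.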
Lemma 2.3
[1, Proposition 4.33] Let \(F: \mathcal {H}\rightarrow \mathbb {R}\) be a convex differentiable function, with \(\frac{1}{\beta }\)-Lipschitz continuous gradient, \(\beta \in ]0,+\infty [\), then \(\text {Id}- \gamma \nabla F\) is \(\frac{\gamma }{2\beta }\)-averaged for \(\gamma \in ]0,2\beta [\).
The next lemma shows the composition of two averaged operators.
Lemma 2.4
[27, Theorem 3] Let \(\mathscr {F}_1, \mathscr {F}_2: \mathcal {H}\rightarrow \mathcal {H}\) be \(\alpha _1\)- and \(\alpha _2\)-averaged, respectively, then \(\mathscr {F}_1 \circ \mathscr {F}_2\) is \(\alpha \)-averaged with \(\alpha = \frac{\alpha _1+\alpha _2-2\alpha _1\alpha _2}{1-\alpha _1\alpha _2} \in ]0,1[\).
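The formula is easy to evaluate. The sketch below (the function name and the numerical values of \(\beta _{{V}}\) and \(\gamma \) are illustrative assumptions) composes a firmly nonexpansive operator, \(\alpha _1 = \frac{1}{2}\), with a gradient step of constant \(\alpha _2 = \frac{\gamma }{2\beta _{{V}}}\) from Lemma 2.3, and compares the result with the closed form \(\frac{2\beta _{{V}}}{4\beta _{{V}}-\gamma }\).

```python
def composed_averaging_constant(alpha1, alpha2):
    """Averaging constant of F1 o F2 when Fi is alpha_i-averaged (Lemma 2.4)."""
    if not (0.0 < alpha1 < 1.0 and 0.0 < alpha2 < 1.0):
        raise ValueError("both constants must lie in ]0, 1[")
    return (alpha1 + alpha2 - 2.0 * alpha1 * alpha2) / (1.0 - alpha1 * alpha2)


# Hypothetical values: cocoercivity constant and a step size in ]0, 2*beta_V[.
beta_V, gamma = 1.0, 1.2

alpha = composed_averaging_constant(0.5, gamma / (2.0 * beta_V))
closed_form = 2.0 * beta_V / (4.0 * beta_V - gamma)
print(alpha, closed_form)
```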
Sequence The following lemma is very classical, see e.g. [28, Theorem 3.3.1].
Lemma 2.5
Let the nonnegative sequence \(\{a_k\}_{k\in \mathbb {N}}\) be nonincreasing and summable. Then \(a_{k}=o(k^{-1})\).
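For completeness, a short argument for this classical fact can be sketched as follows (a standard proof, not reproduced from [28]). Since the sequence is nonincreasing, for every \(k \ge 1\),

$$\begin{aligned} \tfrac{k}{2}\, a_{k} \le \left( k - \lfloor k/2 \rfloor \right) a_{k} \le \sum _{i=\lfloor k/2 \rfloor +1}^{k} a_{i} , \end{aligned}$$

and the right-hand side is a portion of the tail of a convergent series, hence it tends to 0 as \(k \rightarrow +\infty \). Therefore \(k a_{k} \rightarrow 0\).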
Partial Smoothness In this part, let \(\mathcal {H}=\mathbb {R}^n\). We briefly introduce the concept of partial smoothness, which was introduced in [29] and lays the foundation of our local convergence analysis.
Let \(\mathcal {M}\) be a \(C^2\)-smooth manifold of \(\mathbb {R}^n\) around a point x. Denote \(\mathcal {T}_{\mathcal {M}}(x')\) the tangent space to \(\mathcal {M}\) at any point \(x'\) near x in \(\mathcal {M}\); see Sect. 8 for more material. Below we present the definition of partly smooth functions in the \(\Gamma _0(\mathbb {R}^n)\) setting.
Definition 2.4
 (i)
Smoothness \(\mathcal {M}\) is a \(C^2\)-manifold around x, \(R_{\mathcal {M}}\) is \(C^2\) around x;
 (ii)
Sharpness The tangent space \(\mathcal {T}_{\mathcal {M}}(x)\) coincides with \(T_x :=\text {par}(\partial R(x))^\perp \);
 (iii)
Continuity The set-valued mapping \(\partial R\) is continuous at x relative to \(\mathcal {M}\).
Popular examples of partly smooth functions are summarized in Sect. 6 whose details can be found in [15].
3 Problem and Algorithms
 (A.1)
R belongs to \(\Gamma _0(\mathcal {H})\).
 (A.2)
\(F: \mathcal {H}\rightarrow \mathbb {R}\) is convex continuously differentiable with \(\nabla F\) being \((1/\beta )\)Lipschitz continuous.
 (A.3)
The constraint set V is a closed vector subspace of \(\mathcal {H}\).
 (A.4)
\(\text {Argmin}_V(F+R)\) is nonempty and \(0 \in \text {sri}(\text {dom}(R)-V)\).
Remark 3.1
From the assumption on F, we have that G is also convex and continuously differentiable with \(\nabla G = \text {P}_{V} \circ \nabla F \circ \text {P}_{V}\) being \((1/\beta _{{V}})\)-Lipschitz continuous (notice that \(\beta _{{V}}\ge \beta \)). The observation of using G instead of F to achieve a better Lipschitz condition was first considered in [7].
Remark 3.2
For global convergence, one can also consider an inexact version of (6) by incorporating additive errors in the computation of \(u_{k}\) and \(x_{k}\), though we do not elaborate more on this for the sake of local convergence analysis. One can consult [8] for more details on this aspect.
 (A.5) The sequence of the step sizes \(\{\gamma _{k}\}_{k\in \mathbb {N}}\) and the one of the relaxation parameters \(\{\lambda _{k}\}_{k\in \mathbb {N}}\) verify:

\(0< \underline{\gamma }\le \gamma _{k}\le \overline{\gamma }< 2\beta _{{V}}\) and \(\gamma _{k}\rightarrow \gamma \) for some \(\gamma \in [\underline{\gamma },\overline{\gamma }]\);

\(\lambda _{k}\in ]0, \frac{4\beta _{{V}}- \gamma _{k}}{2\beta _{{V}}}[\) such that \(\sum _{k\in \mathbb {N}}\lambda _{k}(\frac{4\beta _{{V}}- \gamma _{k}}{2\beta _{{V}}}-\lambda _{k}) = +\infty \);

\(\sum _{k\in \mathbb {N}} \lambda _{k}|\gamma _{k}-\gamma | < +\infty \).

4 Global Convergence
In this section, we deliver the global convergence analysis of the nonstationary FDR (6) in a general real Hilbert space setting, including convergence rate.
4.1 Global Convergence of the Nonstationary FDR
Lemma 4.1
For the FDR algorithm (6), let \(\gamma \in ]0, 2\beta _{{V}}[\) and \(\lambda _{k}\in ]0, \frac{4\beta _{{V}}-\gamma }{2\beta _{{V}}}[\). Then, we have that \(\mathscr {F}_{\gamma }\) is \(\frac{2\beta _{{V}}}{4\beta _{{V}}-\gamma }\)-averaged and \(\mathscr {F}_{\gamma ,\lambda _k}\) is \(\frac{2\beta _{{V}}\lambda _{k}}{4\beta _{{V}}-\gamma }\)-averaged.
Proof
The property of \(\mathscr {F}_{\gamma }\) is a combination of Lemma 2.4 and [2, Proposition 4.1]. For \(\mathscr {F}_{\gamma ,\lambda _k}\), it is sufficient to apply the definition of averaged operators. \(\square \)
Theorem 4.1
Consider the nonstationary FDR iteration (6). Suppose that Assumptions (A.1)–(A.5) hold. Then, \(\sum _{k\in \mathbb {N}}{\Vert } {z}_{k}-{z}_{k-1} {\Vert }^2 < +\infty \). Moreover, \(\{{z}_{k}\}_{k\in \mathbb {N}}\) converges weakly to a point \({z}^\star \in \text {fix}(\mathscr {F}_{\gamma })\), and \(\{x_{k}\}_{k\in \mathbb {N}}\) converges weakly to \(x^{\star }:=\text {P}_{V}({z}^\star ) \in \text {Argmin}_V(F+R)\). If, in addition, either \(\inf _{k \in \mathbb {N}} \lambda _k > 0\) or \(\mathcal {H}\) is finite-dimensional, then \(\{u_{k}\}_{k\in \mathbb {N}}\) converges weakly to \(x^{\star }\).
The main idea of the proof of the theorem (see below) is to treat the nonstationarity as a perturbation error of the stationary iteration.
Remark 4.1

As mentioned in the introduction, Theorem 4.1 remains true if the iteration is carried out inexactly, i.e. if \(\mathscr {F}_{\gamma _k}({z}_{k})\) is computed approximately, provided that the errors are summable; see [8, Sect. 6] for more details.

With more assumptions on how fast \(\{\gamma _{k}\}_{k\in \mathbb {N}}\) converges to \(\gamma \), we can also derive the convergence rate of the residuals \(\{{\Vert } {z}_{k}-{z}_{k-1} {\Vert }\}_{k\in \mathbb {N}}\). However, as we will study in Sect. 5 the local linear convergence behaviour of \(\{{z}_{k}\}_{k\in \mathbb {N}}\), we shall forgo the discussion here. Interested readers can consult [8] for more details about the rate of residuals.
Proof
 (1)
The set of fixed points \(\text {fix}(\mathscr {F}_{\gamma })\) is nonempty;
 (2)
\(\forall k\in \mathbb {N},\,\mathscr {F}_{\gamma _k}\) is 1-Lipschitz, i.e. nonexpansive;
 (3)
\(\lambda _k\in ]0, \frac{4\beta _V-\gamma _k}{2\beta _V}[\) such that \(\inf _{k\in \mathbb {N}} \lambda _k (\frac{4\beta _V-\gamma _k}{2\beta _V}-\lambda _k) > 0\);
 (4)
\(\forall \rho \in [0,+\infty [\) and \(\Delta _{k,\rho } := \sup _{{\Vert } z {\Vert } \le \rho } {\Vert } \mathscr {R}_{\gamma _{k}}(z) - \mathscr {R}_{\gamma }(z) {\Vert }\) with \(\mathscr {R}_{\gamma _{k}}, \mathscr {R}_{\gamma }\) being some nonexpansive operators, there holds \(\sum _{k} {\lambda _k\Delta _{k,\rho }} < +\infty \).

We have \(\text {Argmin}_V(F+R)=\text {zer}(\nabla G + \partial R + {\mathcal {N}}_{V})=\text {P}_{V}(\text {fix}(\mathscr {F}_{\gamma }))\) from the discussion of Assumptions (A.1)–(A.4). It then follows that \(\text {fix}(\mathscr {F}_{\gamma }) \ne \emptyset \).

Owing to Lemma 4.1, we have \(\mathscr {F}_{\gamma _k,\lambda _k}\) is \((\alpha _{k}\lambda _{k})\)-averaged nonexpansive.
 Owing to the averagedness of \(\mathscr {F}_{\gamma }\) and \(\mathscr {F}_{\gamma _k}\), we have
$$\begin{aligned} \mathscr {R}_{\gamma }= \text {Id}+ \tfrac{1}{\alpha } (\mathscr {F}_{\gamma }- \text {Id}) \quad \text { and }\quad \mathscr {R}_{\gamma _{k}}= \text {Id}+ \tfrac{1}{\alpha _{k}} (\mathscr {F}_{\gamma _k}- \text {Id}) . \end{aligned}$$
Let \(\rho > 0\) be a positive number. Then, \(\forall z \in \mathcal {H}\) such that \({\Vert } z {\Vert }\le \rho \),
$$\begin{aligned} {\Vert } \mathscr {R}_{\gamma _{k}}(z) - \mathscr {R}_{\gamma }(z) {\Vert }&= {\Vert } \tfrac{1}{\alpha _{k}} (\mathscr {F}_{\gamma _k}- \text {Id}) (z) - \tfrac{1}{\alpha } (\mathscr {F}_{\gamma }- \text {Id}) (z) {\Vert } \\&\le \tfrac{|\gamma _{k}-\gamma |}{2\beta _{{V}}}\left( 2\rho + {\Vert } \mathscr {F}_{\gamma }(0) {\Vert }\right) + \tfrac{1}{\alpha _{k}} {\Vert } \mathscr {F}_{\gamma _k}(z) - \mathscr {F}_{\gamma }(z) {\Vert } . \end{aligned}$$(15)
Given \(\gamma \in ]0, 2\beta _{{V}}[\), define the two operators \(\mathscr {F}_{1,\gamma }= \tfrac{1}{2}\big (\text {Id}+ (2\text {prox}_{\gamma R}- \text {Id})(2\text {P}_{V}- \text {Id})\big )\) and \(\mathscr {F}_{2,\gamma }= \text {Id}- \gamma \nabla G \), so that \(\mathscr {F}_{\gamma }= \mathscr {F}_{1,\gamma }\mathscr {F}_{2,\gamma }\). Then \(\mathscr {F}_{1,\gamma }\) is firmly nonexpansive (Lemma 2.2) and \(\mathscr {F}_{2,\gamma }\) is \(\frac{\gamma }{2\beta _{{V}}}\)-averaged (Lemma 2.3). Now we have
$$\begin{aligned} {\Vert } \mathscr {F}_{\gamma _k}(z) - \mathscr {F}_{\gamma }(z) {\Vert }\le {\Vert } \mathscr {F}_{2,\gamma _{k}}(z) - \mathscr {F}_{2,\gamma }(z) {\Vert } +{\Vert } \mathscr {F}_{1,\gamma _{k}}\mathscr {F}_{2,\gamma }(z) - \mathscr {F}_{1,\gamma }\mathscr {F}_{2,\gamma }(z) {\Vert } . \end{aligned}$$(16)
For the first term of (16),
$$\begin{aligned} {\Vert } \mathscr {F}_{2,\gamma _{k}}(z) - \mathscr {F}_{2,\gamma }(z) {\Vert }&= |\gamma _k-\gamma |\,{\Vert } \nabla G (z) {\Vert } \\ \text {(Triangle inequality and }\nabla G\text { is }\beta _{{V}}^{-1}\text {-Lip.)}&\le |\gamma _k-\gamma | \left( \beta _{{V}}^{-1}\rho + {\Vert } \nabla G(0) {\Vert }\right) , \end{aligned}$$(17)
where \(\nabla G(0)\) is obviously bounded.
Now for the second term of (16), denote \(z^{V}= \text {P}_{V}(z)\) and \(z^{V^\bot }= z - z^{V}\), it can be derived that
$$\begin{aligned} v = \mathscr {F}_{1,\gamma }\mathscr {F}_{2,\gamma }(z) \Longleftrightarrow v = z^{V^\bot }+ \text {prox}_{\gamma R}(z^{V}- z^{V^\bot }- \gamma \nabla G(z^{V})) . \end{aligned}$$
Denote \(y = z^{V}- z^{V^\bot }- \gamma \nabla G (z^{V})\). Then we have
$$\begin{aligned} \mathscr {F}_{1,\gamma _{k}}\mathscr {F}_{2,\gamma }(z) - \mathscr {F}_{1,\gamma }\mathscr {F}_{2,\gamma }(z) = \text {prox}_{\gamma _k R}(y) - \text {prox}_{\gamma R}(y) . \end{aligned}$$
Denote \(w_k=\text {prox}_{\gamma _k R}(y)\) and \(w=\text {prox}_{\gamma R}(y)\). Using the resolvent equation [31] and firm nonexpansiveness of the proximity operator yields
$$\begin{aligned} {\Vert } w_k-w {\Vert }&= {\Vert } \text {prox}_{\gamma _k R}\left( \tfrac{\gamma _k}{\gamma } y + \left( 1-\tfrac{\gamma _k}{\gamma }\right) w\right) - \text {prox}_{\gamma _k R}(y) {\Vert } \\&\le {\Vert } \left( 1-\tfrac{\gamma _k}{\gamma }\right) (y - w) {\Vert } \le \tfrac{|\gamma _k-\gamma |}{\underline{\gamma }}{\Vert } y - w {\Vert } \\&= \tfrac{|\gamma _k-\gamma |}{\underline{\gamma }}{\Vert } (\text {Id}- \text {prox}_{\gamma R})y {\Vert } \le \tfrac{|\gamma _k-\gamma |}{\underline{\gamma }} \left( {\Vert } y {\Vert }+{\Vert } \text {prox}_{\gamma R}(0) {\Vert }\right) . \end{aligned}$$(18)
Using the triangle inequality and nonexpansiveness of \(\beta _{{V}}\nabla G\), we obtain
$$\begin{aligned} {\Vert } y {\Vert }&\le {\Vert } z^{V}- z^{V^\bot } {\Vert } + \gamma {\Vert } \nabla G(z^{V}) {\Vert } \le \rho + \gamma {\Vert } \nabla G(z^{V}) - \nabla G(0) {\Vert } + \gamma {\Vert } \nabla G(0) {\Vert } \\&\le \rho + \gamma \beta _{{V}}^{-1} {\Vert } z {\Vert } + \gamma {\Vert } \nabla G(0) {\Vert } \le \rho + \overline{\gamma }\beta _{{V}}^{-1} \rho + \overline{\gamma }{\Vert } \nabla G(0) {\Vert } . \end{aligned}$$(19)
Define \(\Delta _{k,\rho } :=\sup _{{\Vert } z {\Vert } \le \rho } {\Vert } \mathscr {R}_{\gamma _{k}}(z) - \mathscr {R}_{\gamma }(z) {\Vert }\). Then, putting together (15), (17), (18) and (19), we get that \(\forall \rho \in [0,+\infty [\)
$$\begin{aligned} \sum _{k\in \mathbb {N}} \lambda _{k}\Delta _{k,\rho } \le C \sum _{k\in \mathbb {N}} \lambda _{k}|\gamma _{k}-\gamma | < +\infty , \end{aligned}$$
where \(C=\tfrac{2\rho +{\Vert } \mathscr {F}_{\gamma }(0) {\Vert }}{4\beta _{{V}}-\overline{\gamma }} + \tfrac{\rho }{\beta _{{V}}}(1 + \tfrac{\beta _{{V}}}{\underline{\gamma }} + \tfrac{\overline{\gamma }}{\underline{\gamma }} ) + (1 + \tfrac{\overline{\gamma }}{\underline{\gamma }}) {\Vert } \nabla G(0) {\Vert } + \tfrac{1}{\underline{\gamma }} {\Vert } \text {prox}_{\gamma R}(0) {\Vert }\) is finite valued.
For the sequence \(\{u_{k}\}_{k\in \mathbb {N}}\), observe from the second equation in (6) that \(u_{k+1}= ({z}_{k+1}-{z}_{k})/{\lambda _k} + x_{k}\), hence \({\Vert } u_{k+1}-x_{k} {\Vert } \le {{\Vert } {z}_{k+1}-{z}_{k} {\Vert }}/{\lambda _k}\). It follows from \({\Vert } {z}_{k+1}-{z}_{k} {\Vert } \rightarrow 0\) and the condition \(\inf _{k \in \mathbb {N}} \lambda _k > 0\) that \(u_{k+1}-x_{k}\) converges strongly to 0. We thus obtain weak convergence of \(u_{k}\). If \(\mathcal {H}\) is finite-dimensional, using (30) and the same argument as for inequality (18), we get \( {\Vert } u_{k+1}-x^{\star } {\Vert } \le \tfrac{|\gamma _k-\gamma |}{\underline{\gamma }} ((2+\overline{\gamma }\beta _{{V}}){\Vert } x_{k}-x^{\star } {\Vert }+{\Vert } {z}_{k}-{z}^\star {\Vert }+{\Vert } \text {prox}_{\gamma R}(0) {\Vert }) \rightarrow 0 \), which concludes the proof. \(\square \)
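To make the objects in this proof concrete, here is a minimal sketch of a nonstationary FDR loop on a toy problem. The update rule is an assumed reading of scheme (6), inferred from the relations used above (\(x_{k}=\text {P}_{V}({z}_{k})\), \(u_{k+1}=\text {prox}_{\gamma _k R}(2x_{k}-{z}_{k}-\gamma _k\nabla G(x_{k}))\), \({z}_{k+1}={z}_{k}+\lambda _k(u_{k+1}-x_{k})\)). The problem data, the step-size schedule and all function names are hypothetical: \(F=\frac{1}{2}{\Vert } \cdot - b {\Vert }^2\), \(R=\mu {\Vert } \cdot {\Vert }_1\) and \(V=\{x\in \mathbb {R}^2: x_1=x_2\}\), for which \(\beta _{{V}}=1\).

```python
def project_V(z):
    """Orthogonal projection onto V = {x in R^2 : x[0] == x[1]}."""
    mean = 0.5 * (z[0] + z[1])
    return [mean, mean]


def soft_threshold(y, tau):
    """Proximity operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return [max(abs(t) - tau, 0.0) * (1.0 if t > 0 else -1.0) for t in y]


def fdr(z0, b, mu, gamma, num_iters=100, lam=1.0):
    """Nonstationary FDR sketch for min_{x in V} 0.5*||x - b||^2 + mu*||x||_1."""
    z = list(z0)
    u = list(z0)
    for k in range(num_iters):
        # Step sizes converge to gamma with summable deviations (assumed schedule).
        gamma_k = gamma + 0.5 / (k + 1) ** 2
        x = project_V(z)
        # grad G(x) = P_V(grad F(P_V x)) with grad F(x) = x - b.
        grad = project_V([xi - bi for xi, bi in zip(x, b)])
        y = [2.0 * xi - zi - gamma_k * gi for xi, zi, gi in zip(x, z, grad)]
        u = soft_threshold(y, gamma_k * mu)
        z = [zi + lam * (ui - xi) for zi, ui, xi in zip(z, u, x)]
    return project_V(z), u, z


x_final, u_final, z_final = fdr(z0=[1.0, -2.0], b=[3.0, 1.0], mu=0.5, gamma=1.0)
print(x_final, u_final)
```

On \(V\) the toy objective reduces to a one-dimensional problem in \(t=x_1=x_2\) whose minimizer is \(t=1.5\), so both returned points should be close to \((1.5, 1.5)\).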
4.2 Convergence Rate of the Bregman Divergence
In this part, we discuss the convergence rate of a specifically designed Bregman divergence associated to the objective value. As we have seen from the FDR iteration (5), there are three different points \({z}_{k}\) and \(u_{k},x_{k}\) generated along the iteration, which makes it very difficult to establish a convergence rate on the objective value directly, unless the constraint subspace V is the whole space. For instance, in [18] the author obtained an \(o(1/\sqrt{k})\) convergence rate on \((R(u_{k})+G(x_{k}))-(R(x^{\star })+G(x^{\star }))\), which in general is not a nonnegative quantity. Moreover, the functions R and G in the criterion are not evaluated at the same point. So the latter convergence rate is not only pessimistic (when specialized to \(V=\mathcal {H}\) it gives a convergence rate as slow as subgradient descent), but is also of limited interest given the lack of nonnegativity. Our result in this part avoids such drawbacks.
The motivation for choosing the above function to quantify the convergence rate of the FDR algorithm is that it measures both the discrepancy of the objective to the optimal value and the violation of the constraint on V.
Lemma 4.2 hereafter will provide us a key estimate on \(\mathcal {D}_{{\Phi }}^{v^{\star }}\!(u_{k})\) which will be used to derive the convergence rate of \(\{\mathcal {D}_{{\Phi }}^{v^{\star }}\!(u_{k})\}_{k\in \mathbb {N}}\). Denote \(z_{k}^{V^\bot }:=\text {P}_{V^\bot }({z}_{k})\) the projection of \({z}_{k}\) onto \(V^\bot \), \(\phi _{k}:=\frac{1}{2\gamma _{k}} ({\Vert } z_{k}^{V^\bot }+ \gamma _{k}v^{\star } {\Vert }^2 + {\Vert } x_{k}-x^{\star } {\Vert }^2)\) and two auxiliary quantities \(\xi _{k}:=\tfrac{|\underline{\gamma }-\beta _{{V}}|}{2\underline{\gamma }\beta _{{V}}} {\Vert } {z}_{k}-{z}_{k-1} {\Vert }^2,\, \zeta _{k}:=\tfrac{|\gamma _{k}-\gamma _{k-1}|}{2\underline{\gamma }^2}{\Vert } {z}_{k}-x^{\star } {\Vert }^2\).
Lemma 4.2
 (i)
We have that \(\mathcal {D}_{{\Phi }}^{v^{\star }}\!(y)\ge 0\) for every y in \(\mathcal {H}\). Moreover, if y is a solution then \(\mathcal {D}_{{\Phi }}^{v^{\star }}\!(y)=0\) (in particular, \(\mathcal {D}_{{\Phi }}^{v^{\star }}\!(x^{\star })=0\)). On the other hand, if y is feasible (\(y\in V\)) and \(\mathcal {D}_{{\Phi }}^{v^{\star }}\!(y)=0\), then y is a solution.
 (ii) For the sequence \(\{u_{k}\}_{k\in \mathbb {N}}\), if \(v^{\star }\) is bounded we have$$\begin{aligned} \mathcal {D}_{{\Phi }}^{v^{\star }}\!(u_{k+1}) + \phi _{k+1}\le \phi _{k}+ {\Vert } v^{\star } {\Vert }^2 + \xi _{k+1}+ \zeta _{k+1}< +\infty . \end{aligned}$$(21)
Remark 4.2
If we restrict \(\gamma _{k}\in ]0, \beta _{{V}}]\), then the term \(\xi _{k}\) in (21) can be discarded. If we assume \(\{\gamma _{k}\}_{k\in \mathbb {N}}\) is monotonic, then the term \(\zeta _{k}\) also disappears.
Proof
The nonnegativity of \(\mathcal {D}_{{\Phi }}^{v^{\star }}\!(u_{k})\) is rather obvious, as \(\Phi \) is convex. Therefore, next we focus on the second claim. Define \(y^{V^\bot }:=\text {P}_{V^\bot }(y)\), \(u_{k}^{V^\bot }:=\text {P}_{V^\bot }(u_{k})\), \(z_{k}^{V^\bot }:=\text {P}_{V^\bot }({z}_{k})\) the projections of \(y, u_{k}, {z}_{k}\) onto \(V^\bot \), respectively.
With the above property of \(\mathcal {D}_{{\Phi }}^{v^{\star }}\!(u_{k})\), we are able to present the main result on the convergence rate of the Bregman divergence.
Theorem 4.2
Remark 4.3

A typical situation that ensures the boundedness of \(v^{\star }\) is when \(\partial R(x^{\star })\) is bounded. Such a requirement can be removed if we choose the element \(v^{\star }\) more carefully. For instance, one can easily show from Theorem 4.1 that the subgradient \(v_k :=(x_{k}-{z}_{k})/\gamma _k = -\text {P}_{V^\bot }(z_k)/\gamma _k\) converges weakly to \(v^{\star }:=(x^{\star }- {z}^\star )/\gamma \in V^\bot \cap (\nabla G(x^{\star }) + \partial R(x^{\star }))\).

The main difficulty in establishing the convergence rate directly on \(\mathcal {D}_{{\Phi }}^{v^{\star }}\!(u_{k})\) (rather than on the best iterate) is that, for \(V \subsetneq \mathcal {H}\), we have no theoretical guarantee that \({\mathcal {D}_{{\Phi }}^{v^{\star }}\!(u_{k})}\) is decreasing, i.e. no descent property on \(\mathcal {D}_{{\Phi }}^{v^{\star }}\!(u_{k})\).
Proof
4.3 Application to Forward–Backward Splitting
Corollary 4.1
Remark 4.4

The o(1 / k) convergence rate for the large choice \(\gamma _{k} \in ]0, 2\beta [\) appears to be new for the FB splitting algorithm. The rate O(1 / k) is known in the literature for several choices of the step size; see, for example, [12, Theorem 3.1] for \(\gamma _k \in ]0,\beta ]\) or with backtracking, and [11, Proposition 2] for \(\gamma _{k} \in ]0, 2\beta [\).

For the global convergence of the sequence \(\{x_{k}\}_{k\in \mathbb {N}}\) generated by the nonstationary FB iteration, neither convergence of \(\gamma _{k}\) to \(\gamma \) nor summability of \(\{|\gamma _{k}-\gamma |\}_{k\in \mathbb {N}}\) is required. See [32, Theorem 3.4].
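The large step-size regime \(\gamma \in ]0, 2\beta [\) discussed in this remark can be tried out on a small example. The sketch below runs a plain Forward–Backward iteration \(x_{k+1}=\text {prox}_{\gamma R}(x_{k}-\gamma \nabla F(x_{k}))\) with \(\gamma = 1.5\beta \) and records the objective values. The separable quadratic \(F(x)=\frac{1}{2}\sum _i a_i(x_i-b_i)^2\), the choice \(R=\mu {\Vert } \cdot {\Vert }_1\) and all names are assumptions made for the illustration; the example only exhibits the behaviour on one instance and does not verify the rate of Corollary 4.1.

```python
def soft(t, tau):
    """Scalar soft-thresholding, the proximity operator of tau * |.|."""
    return max(abs(t) - tau, 0.0) * (1.0 if t > 0 else -1.0)


def forward_backward(x0, a, b, mu, gamma, num_iters):
    """FB iteration for F(x) = 0.5*sum(a_i*(x_i-b_i)^2) and R = mu*||x||_1."""
    def objective(x):
        return sum(0.5 * ai * (xi - bi) ** 2 + mu * abs(xi)
                   for xi, ai, bi in zip(x, a, b))

    x = list(x0)
    values = [objective(x)]
    for _ in range(num_iters):
        # Forward (gradient) step on F followed by backward (prox) step on R.
        x = [soft(xi - gamma * ai * (xi - bi), gamma * mu)
             for xi, ai, bi in zip(x, a, b)]
        values.append(objective(x))
    return x, values


a, b, mu = [1.0, 4.0], [2.0, 0.1], 1.0
beta = 1.0 / max(a)        # grad F is (1/beta)-Lipschitz
gamma = 1.5 * beta         # beyond the classical range ]0, beta]
x_fb, values = forward_backward([5.0, 5.0], a, b, mu, gamma, 200)

# The problem is separable: x_i* = soft(b_i, mu / a_i), i.e. x* = (1, 0).
optimal_value = 1.52
print(x_fb, values[-1] - optimal_value)
```

On this instance the recorded objective values are nonincreasing and the gap to the optimal value becomes negligible after a few dozen iterations.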
Proof
5 Local Linear Convergence
From now on, we turn to the local convergence analysis of FDR. Given that partial smoothness is so far available only in finite dimension, in this section, we consider a finite-dimensional setting, i.e. \(\mathcal {H}=\mathbb {R}^{n}\). In the sequel, we denote \({z}^\star \in \text {fix}(\mathscr {F}_{\gamma })\) a fixed point of iteration (6) and \(x^{\star }= \text {P}_{V}({z}^\star ) \in \text {Argmin}(\Phi _{V})\) a global minimizer of problem (4). For simplicity, we also fix \(\lambda _k \equiv 1\).
5.1 Finite Activity Identification
Theorem 5.1
 (i)
There exists \( K \in \mathbb {N}\) such that, for all \(k \ge K\), we have \(u_{k}\in \mathcal {M}_{x^{\star }}^{R}\).
 (ii)Moreover, for every \(k \ge K\),
 (a)
If \(\mathcal {M}_{x^{\star }}^{R}= x^{\star }+T_{x^{\star }}^R\), then \(T_{u_{k}}^R=T_{x^{\star }}^R\).
 (b)
if R is locally polyhedral around \(x^{\star }\), then \(x_{k}\in \mathcal {M}_{x^{\star }}^{R}=x^{\star }+T_{x^{\star }}^R\), \(T_{u_{k}}^R=T_{x^{\star }}^R\), \(\nabla _{\mathcal {M}_{x^{\star }}^{R}} R(u_{k})=\nabla _{\mathcal {M}_{x^{\star }}^{R}} R(x^{\star })\), and \(\nabla ^2_{\mathcal {M}_{x^{\star }}^{R}} R(u_{k})=0\).
Remark 5.1
As we mentioned before, for global convergence, approximation errors can be allowed, i.e. \(\text {prox}_{\gamma R}\) and \(\nabla G\) can be computed approximately. However, for finite activity identification, we have no identification guarantees for \((u_{k}, x_{k})\) if such an approximation is allowed. For example, suppose that \(x_{k}= \text {P}_{V}({z}_{k}) + \varepsilon _{k}\), where \(\varepsilon _{k}\in \mathbb {R}^{n}\) is the error of approximating \(\text {P}_{V}({z}_{k})\). Then, unless \(\varepsilon _{k} \in V\), we can no longer guarantee that \(x_{k}\in V\).
Proof
 (a)
In this case, \(\mathcal {M}_{x^{\star }}^{R}\) is an affine subspace, i.e. \(\mathcal {M}_{x^{\star }}^{R}= x^{\star }+T_{x^{\star }}^R\). Since R is partly smooth at \(x^{\star }\) relative to \(\mathcal {M}_{x^{\star }}^{R}\), the sharpness property holds at all nearby points in \(\mathcal {M}_{x^{\star }}^{R}\) [29, Proposition 2.10]. Thus, for k large enough, i.e. \(u_{k}\) sufficiently close to \(x^{\star }\) on \(\mathcal {M}_{x^{\star }}^{R}\), we have \(\mathcal {T}_{u_{k}}(\mathcal {M}_{x^{\star }}^{R})=T_{x^{\star }}^R=T_{u_{k}}^R\).
 (b)
It is immediate to verify that a locally polyhedral function around \(x^{\star }\) is indeed partly smooth relative to the affine subspace \(x^{\star }+T_{x^{\star }}^R\). Thus, the first claim follows from (ii)(a). For the rest, it is sufficient to observe that by polyhedrality, for any \(x \in \mathcal {M}_{x^{\star }}^{R}\) near \(x^{\star }\), \(\partial R(x) = \partial R(x^{\star })\). Therefore, combining local normal sharpness [29, Proposition 2.10] and [15, Lemma 4.3] yields the second conclusion.
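Finite identification is easy to observe for the polyhedral case \(R=\mu {\Vert } \cdot {\Vert }_1\), where the manifold is determined by the support of the iterate. The sketch below runs a stationary FDR loop with \(\lambda _k \equiv 1\) (the same assumed reading of scheme (6) as in the relations of Sect. 4) on a hypothetical problem in \(\mathbb {R}^3\) with \(F=\frac{1}{2}{\Vert } \cdot - b {\Vert }^2\) and \(V=\{x: x_1=x_2\}\), and reports the first iteration after which the support of \(u_{k}\) no longer changes. Data and names are illustrative; in this instance the third coordinate of the solution is zero with \(|b_3| < \mu \), so that the zero is stable under small perturbations.

```python
def project_V(z):
    """Orthogonal projection onto V = {x in R^3 : x[0] == x[1]}."""
    mean = 0.5 * (z[0] + z[1])
    return [mean, mean, z[2]]


def soft_threshold(y, tau):
    """Proximity operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return [max(abs(t) - tau, 0.0) * (1.0 if t > 0 else -1.0) for t in y]


def fdr_supports(z0, b, mu, gamma, num_iters):
    """Run FDR with lambda_k = 1 and record the support of every u_k."""
    z = list(z0)
    u = list(z0)
    supports = []
    for _ in range(num_iters):
        x = project_V(z)
        grad = project_V([xi - bi for xi, bi in zip(x, b)])  # grad G(x)
        y = [2.0 * xi - zi - gamma * gi for xi, zi, gi in zip(x, z, grad)]
        u = soft_threshold(y, gamma * mu)
        z = [zi + ui - xi for zi, ui, xi in zip(z, u, x)]
        supports.append(tuple(c != 0.0 for c in u))
    return u, supports


def identification_index(supports):
    """Smallest K such that supports[k] equals the final support for k >= K."""
    K = len(supports) - 1
    while K > 0 and supports[K - 1] == supports[-1]:
        K -= 1
    return K


u_last, supports = fdr_supports(z0=[1.0, -2.0, 2.0], b=[3.0, 1.0, 0.2],
                                mu=0.5, gamma=0.5, num_iters=100)
K = identification_index(supports)
print(K, supports[-1], u_last)
```

After the reported index the third coordinate of \(u_{k}\) stays exactly at zero, while the first two coordinates keep converging to their limits; this mirrors the two-phase behaviour described in Theorem 5.1.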
A Bound on the Number of Iterations to Identification In Theorem 5.1, we only assert the existence of some \(K \ge 0\) beyond which finite identification occurs. There are situations where a bound on K can be established.
Proposition 5.1
Suppose that the assumptions of Theorem 5.1 hold. If the iterates are such that \(\partial R(u_{k}) \subset \text {rbd}(\partial R(x^{\star }))\) whenever \(u_{k}\notin \mathcal {M}_{x^\star }\), then we have \(u_{k}\in \mathcal {M}_{x^\star }\) for some k obeying \(k \ge \frac{{\Vert } z_0 - {z}^\star {\Vert }^2 + O(\sum _{k\in \mathbb {N}}|\gamma _k-\gamma |)}{\underline{\gamma }^{2} \text {dist}( -\nabla G (x^{\star }), V^\bot + \text {rbd}(\partial R(x^{\star })))^2} \).
Remark 5.2
When \(V = \mathbb {R}^n\), we recover the result of [15, Proposition 3.6(i)] established for the Forward–Backward splitting method. For \(F=0\), our result also encompasses that of Douglas–Rachford splitting [16, Proposition 5.1].
Proof
5.2 Locally Linearized Iteration
With the finite identification result at hand, we next show that the globally nonlinear fixed-point iteration (13) can be locally linearized along the identified manifold \(\mathcal{M}_{x^{\star}}^{R}\). Define the function \(\overline{R}(u) := \gamma R(u) - \langle u,\, x^{\star} - {z}^\star - \gamma \nabla G(x^{\star}) \rangle\). We have the following key property of \(\overline{R}\).
Lemma 5.1
 (i)
Condition (31) holds.
 (ii)
\(\mathcal {M}_{x^{\star }}^{R}\) is an affine subspace.
Theorem 5.2
Proof
Before presenting the local linear convergence result, we need to study the spectral properties of \(\mathscr{M}_{\gamma,\lambda}\), which are presented in the lemma below.
Lemma 5.2
 (i) \(\mathscr{M}_{\gamma,\lambda}\) converges to some matrix \(\mathscr{M}_{\gamma}^{\infty}\) and, $$\begin{aligned} \mathscr{M}_{\gamma,\lambda}^k - \mathscr{M}_{\gamma}^{\infty} = (\mathscr{M}_{\gamma,\lambda} - \mathscr{M}_{\gamma}^{\infty})^k \ \text{ and }\ \rho(\mathscr{M}_{\gamma,\lambda} - \mathscr{M}_{\gamma}^{\infty}) < 1 . \end{aligned}$$
 (ii)
Given any \(\rho \in ]\rho(\mathscr{M}_{\gamma,\lambda} - \mathscr{M}_{\gamma}^{\infty}), 1[\), \(\Vert \mathscr{M}_{\gamma,\lambda}^k - \mathscr{M}_{\gamma}^{\infty} \Vert = O(\rho^k)\).
Proof
Since \({W}_{\overline{R}}\) is firmly nonexpansive by Lemma 5.1, it follows from [1, Example 4.7] that \({M}_{\overline{R}}\) is firmly nonexpansive and hence \(2{M}_{\overline{R}} - \text{Id}\) is nonexpansive. Similarly, as \(\text{P}_{V}\) is firmly nonexpansive, \(2\text{P}_{V} - \text{Id}\) is nonexpansive. As a result, the average of \(\text{Id}\) and the composition of these two reflected operators is firmly nonexpansive [1, Proposition 4.21(i)–(ii)]. Then, given \(\gamma \in ]0, 2\beta_{V}[\), \(\text{Id} - \gamma {H}_{G}\) is \(\frac{2\beta_{V}}{4\beta_{V} - \gamma}\)-averaged nonexpansive. Therefore, owing to Lemma 4.1, \(\mathscr{M}_{\gamma}\) and \(\mathscr{M}_{\gamma,\lambda}\) are averaged. We deduce from [1, Proposition 5.15] that \(\mathscr{M}_{\gamma}\) and \(\mathscr{M}_{\gamma,\lambda}\) are convergent, i.e. the limit of \(\mathscr{M}_{\gamma,\lambda}^k\) exists as k approaches \(+\infty\). It is denoted by \(\mathscr{M}_{\gamma}^{\infty}\). Moreover, \(\mathscr{M}_{\gamma,\lambda}^k - \mathscr{M}_{\gamma}^{\infty} = (\mathscr{M}_{\gamma,\lambda} - \mathscr{M}_{\gamma}^{\infty})^k\), \(\forall k \in \mathbb{N}\), and \(\rho(\mathscr{M}_{\gamma,\lambda} - \mathscr{M}_{\gamma}^{\infty}) < 1\) by [36, Theorem 2.12]. The second claim of the lemma is classical and follows from the spectral radius formula; see e.g. [36, Theorem 2.12(i)]. \(\square\)
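The two identities of Lemma 5.2 can be checked numerically on a small example. The sketch below uses a hypothetical \(2\times 2\) symmetric matrix with eigenvalues 1 and 0.5 (it is half the sum of the identity and a nonexpansive matrix, hence averaged); this matrix is an assumption chosen for illustration and is not the FDR matrix \(\mathscr{M}_{\gamma,\lambda}\) of any experiment.

```python
def matmul(A, B):
    """Product of two small dense matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matpow(A, k):
    """A raised to the integer power k >= 0 by repeated multiplication."""
    n = len(A)
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        P = matmul(P, A)
    return P


def sub(A, B):
    """Entrywise difference A - B."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def max_abs(A):
    """Largest absolute entry, used here as a simple matrix norm."""
    return max(abs(a) for row in A for a in row)


M = [[0.75, 0.25], [0.25, 0.75]]      # toy averaged matrix, eigenvalues 1 and 0.5
M_inf = [[0.5, 0.5], [0.5, 0.5]]      # its limit: projector onto the eigenvalue-1 eigenspace

for k in (1, 5, 20):
    lhs = sub(matpow(M, k), M_inf)            # M^k - M^inf
    rhs = matpow(sub(M, M_inf), k)            # (M - M^inf)^k
    print(k, max_abs(lhs), max_abs(sub(lhs, rhs)))
```

For this toy matrix the distance to the limit decays like \(0.5^k\), in line with \(\rho(\mathscr{M} - \mathscr{M}^{\infty}) = 0.5 < 1\).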
Owing to Lemma 5.2, we can further simplify the linearized iteration (34).
Corollary 5.1
 (i) Iteration (34) is equivalent to $$\begin{aligned}&(\text{Id} - \mathscr{M}_{\gamma}^{\infty})({z}_{k+1} - {z}^\star) \\&\quad = (\mathscr{M}_{\gamma,\lambda} - \mathscr{M}_{\gamma}^{\infty})(\text{Id} - \mathscr{M}_{\gamma}^{\infty})({z}_{k} - {z}^\star) + (\text{Id} - \mathscr{M}_{\gamma}^{\infty})\psi_{k} + \chi_{k}. \end{aligned}$$ (37)
 (ii)
If moreover R is locally polyhedral around \(x^{\star}\) and F is quadratic, then \({z}_{k+1} - {z}^\star = (\mathscr{M}_{\gamma,\lambda} - \mathscr{M}_{\gamma}^{\infty})({z}_{k} - {z}^\star)\).
Proof
Under polyhedrality and constant parameters, we have from Theorem 5.2 that \(o(\Vert {z}_{k} - {z}^\star \Vert)\) and \(O(\lambda_{k} |\gamma_{k} - \gamma|)\) vanish, and the result follows. \(\square\)
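For the first claim of Corollary 5.1, a short derivation may help. It is only a sketch under an assumption: the linearized iteration (34), which is not reproduced here, is taken to have the form \({z}_{k+1} - {z}^\star = \mathscr{M}_{\gamma,\lambda}({z}_{k} - {z}^\star) + \psi_{k} + \chi_{k}\), with \(\psi_{k} = o(\Vert {z}_{k} - {z}^\star \Vert)\) and \(\chi_{k} = O(\lambda_{k}|\gamma_{k} - \gamma|)\). Since \(\mathscr{M}_{\gamma}^{\infty} = \lim_{k\rightarrow +\infty} \mathscr{M}_{\gamma,\lambda}^k\), we have
$$\begin{aligned} \mathscr{M}_{\gamma,\lambda}\mathscr{M}_{\gamma}^{\infty} = \mathscr{M}_{\gamma}^{\infty}\mathscr{M}_{\gamma,\lambda} = \mathscr{M}_{\gamma}^{\infty}, \quad (\mathscr{M}_{\gamma}^{\infty})^2 = \mathscr{M}_{\gamma}^{\infty}, \end{aligned}$$
and therefore
$$\begin{aligned} (\text{Id} - \mathscr{M}_{\gamma}^{\infty})\mathscr{M}_{\gamma,\lambda} = \mathscr{M}_{\gamma,\lambda} - \mathscr{M}_{\gamma}^{\infty} = (\mathscr{M}_{\gamma,\lambda} - \mathscr{M}_{\gamma}^{\infty})(\text{Id} - \mathscr{M}_{\gamma}^{\infty}). \end{aligned}$$
Applying \(\text{Id} - \mathscr{M}_{\gamma}^{\infty}\) to both sides of the assumed form of (34) then gives
$$\begin{aligned} (\text{Id} - \mathscr{M}_{\gamma}^{\infty})({z}_{k+1} - {z}^\star) = (\mathscr{M}_{\gamma,\lambda} - \mathscr{M}_{\gamma}^{\infty})(\text{Id} - \mathscr{M}_{\gamma}^{\infty})({z}_{k} - {z}^\star) + (\text{Id} - \mathscr{M}_{\gamma}^{\infty})\psi_{k} + (\text{Id} - \mathscr{M}_{\gamma}^{\infty})\chi_{k}. \end{aligned}$$
The last term is of the same order \(O(\lambda_{k}|\gamma_{k} - \gamma|)\) as \(\chi_{k}\), which presumably explains why it is still denoted \(\chi_{k}\) in (37), up to relabelling.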
5.3 Local Linear Convergence
We are now in position to claim local linear convergence of the FDR iterates.
Theorem 5.3
 (i) If there exists \(\eta \in ]0, \rho[\) such that \(\lambda_{k} |\gamma_{k} - \gamma| = O(\eta^{k-K})\), then $$\begin{aligned} \Vert (\text{Id} - \mathscr{M}_{\gamma}^{\infty})({z}_{k} - {z}^\star) \Vert = O(\rho^{k-K}) . \end{aligned}$$ (40)
 (ii) If moreover R is locally polyhedral around \(x^{\star}\), F is quadratic, and \((\gamma_{k}, \lambda_{k}) \equiv (\gamma, \lambda) \in ]0, 2\beta_{V}[ \times ]0, \frac{4\beta_{V} - \gamma}{2\beta_{V}}[\), then we have $$\begin{aligned} \Vert {z}_{k} - {z}^\star \Vert \le \rho^{k-K} \Vert z_{K} - {z}^\star \Vert . \end{aligned}$$ (41)
Remark 5.3

(i) For the first case of Theorem 5.3, if \(\mathscr{M}_{\gamma}^{\infty} = 0\), then we obtain the convergence rate directly on \(\Vert {z}_{k} - {z}^\star \Vert\). Moreover, we can further derive the convergence rate of \(\Vert x_{k} - x^{\star} \Vert\) and \(\Vert u_{k} - x^{\star} \Vert\).

(ii) The condition on \(\lambda_{k} |\gamma_{k} - \gamma|\) in Theorem 5.3(i) implies that \(\{\gamma_k\}_{k\in\mathbb{N}}\) should converge fast enough to \(\gamma\). Otherwise, the local convergence rate would be dominated by that of \(\lambda_{k} |\gamma_{k} - \gamma|\). In particular, if \(\lambda_{k} |\gamma_{k} - \gamma|\) converges sublinearly to 0, then the local convergence rate will eventually become sublinear. See Fig. 2 in the experiments section for a numerical illustration.

(iii) The above result can be easily extended to the GFB method; for the sake of simplicity, we skip the details here. Nevertheless, numerical illustrations will be provided in Sect. 6.
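The role of the perturbation term in Remark 5.3 can be seen on a scalar toy recursion. The sketch below is an illustration under assumptions, not the FDR iteration itself: the model \(e_{k+1} = \rho e_{k} + \delta_{k}\), the function name `run`, and the values \(\rho = 0.8\), \(\eta = 0.5\) are hypothetical, with \(\delta_{k}\) standing in for a term of order \(\lambda_{k}|\gamma_{k} - \gamma|\).

```python
def run(rho, perturbation, n_iter, e0=1.0):
    """Iterate e_{k+1} = rho * e_k + perturbation(k) and return the whole sequence."""
    errors = [e0]
    for k in range(n_iter):
        errors.append(rho * errors[-1] + perturbation(k))
    return errors


rho, eta = 0.8, 0.5

# Perturbation decaying linearly at rate eta < rho: the rate rho is preserved.
fast = run(rho, lambda k: eta ** k, 200)

# Perturbation decaying only sublinearly (like 1/k^2): it eventually dominates.
slow = run(rho, lambda k: 1.0 / (k + 1) ** 2, 200)

# Ratio of the last two terms: close to rho in the first case, close to 1 in the second.
ratio_fast = fast[-1] / fast[-2]
ratio_slow = slow[-1] / slow[-2]
print(ratio_fast, ratio_slow)
```

In the second run the ratio of successive errors approaches 1, i.e. the sequence no longer contracts at a fixed linear rate, which mirrors the eventual sublinear behaviour described in item (ii).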
Proof
5.4 Extension to Three-Operator Splitting
So far, we have presented the global and local convergence analysis of the FDR algorithm. As we recalled in the introduction, FDR is closely related to the three-operator splitting method (TOS) [7]. Therefore, it would be interesting to extend the obtained results to TOS. However, extending the global convergence result to TOS is far from straightforward. Hence, in the following, we mainly focus on the local aspect.
 (B.1)
\(J, R \in \Gamma _0(\mathbb {R}^{n})\).
 (B.2)
\(F: \mathbb{R}^n \rightarrow \mathbb{R}\) is convex and continuously differentiable with \(\nabla F\) being \((1/\beta)\)-Lipschitz continuous.
 (B.3)
\(\text {Argmin}(\Psi ) \ne \emptyset \), i.e. the set of minimizers is not empty.
 (B.4)
The (constant) step size verifies \(\gamma \in ]0, 2\beta[\) and the sequence of relaxation parameters \(\{\lambda_{k}\}_{k\in\mathbb{N}}\) is such that \(\sum_{k\in\mathbb{N}} \lambda_{k}(\frac{4\beta - \gamma}{2\beta} - \lambda_{k}) = +\infty\).
Lemma 5.3
 (i)
the operator \(\mathscr{T}_{\gamma}\) is \(\frac{2\beta}{4\beta - \gamma}\)-averaged nonexpansive.
 (ii)
\(\{{z}_{k}\}_{k\in \mathbb {N}}\) converges to some \({z}^\star \) in \(\text {fix}(\mathscr {T}_{\gamma })\); moreover, both \(\{u_{k}\}_{k\in \mathbb {N}}\) and \(\{x_{k}\}_{k\in \mathbb {N}}\) converge to \(x^{\star }{:=} \text {prox}_{\gamma J}({z}^\star )\), which is a global minimizer of \(\Psi \).
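To make the roles of \({z}_{k}\), \(x_{k}\) and \(u_{k}\) concrete, here is a minimal sketch of a three-operator splitting loop on a toy problem. Several things are assumptions: iteration (43) is not reproduced in this section, so the update below follows the usual form of the scheme of [7], consistent with \(x^{\star} = \text{prox}_{\gamma J}({z}^\star)\); the toy problem \(\min_x \frac{1}{2}\Vert x - b \Vert^2 + \mu \Vert x \Vert_1 + \iota_{\{x \ge 0\}}(x)\), the function names and the parameter values are hypothetical. For this problem \(\beta = 1\), and the minimizer is \(\max(b - \mu, 0)\) componentwise.

```python
def soft_threshold(v, t):
    """Componentwise proximal operator of t * |.| (soft-thresholding)."""
    return [max(abs(a) - t, 0.0) * (1.0 if a > 0 else -1.0) for a in v]


def project_nonneg(v):
    """Projection onto the nonnegative orthant, i.e. prox of its indicator."""
    return [max(a, 0.0) for a in v]


def tos(b, mu, gamma=1.2, lam=1.0, n_iter=200):
    """Sketch of three-operator splitting with J = indicator of {x >= 0},
    R = mu * ||.||_1 and F = 0.5 * ||. - b||^2 (so beta = 1)."""
    z = [0.0] * len(b)
    u = list(z)
    for _ in range(n_iter):
        x = project_nonneg(z)                                   # x_k = prox_{gamma J}(z_k)
        grad = [xi - bi for xi, bi in zip(x, b)]                # gradient of F at x_k
        v = [2 * xi - zi - gamma * gi for xi, zi, gi in zip(x, z, grad)]
        u = soft_threshold(v, gamma * mu)                       # u_{k+1} = prox_{gamma R}(v)
        z = [zi + lam * (ui - xi) for zi, ui, xi in zip(z, u, x)]
    return project_nonneg(z), u


b, mu = [2.0, -1.0, 0.3], 0.5
x_final, u_final = tos(b, mu)
expected = [max(bi - mu, 0.0) for bi in b]
print(x_final, u_final, expected)
```

The chosen \(\gamma = 1.2\) and \(\lambda = 1\) satisfy \(\gamma \in ]0, 2\beta[\) and \(\lambda \in ]0, \frac{4\beta - \gamma}{2\beta}[\) for \(\beta = 1\). On this example both \(x_{k}\) and \(u_{k}\) approach the same point, and the zero entries of the solution are reached after a few iterations.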
Finite Activity Identification. We start with the finite identification result, which now concerns both \(u_{k}\) and \(x_{k}\), as J is no longer the indicator function of a subspace.
Corollary 5.2
Local Linearized Iteration. Define \(\widetilde{R}(u) := \gamma R(u) - \langle u,\, x^{\star} - {z}^\star - \gamma \nabla F(x^{\star}) \rangle\) and \(\widetilde{J}(x) := \gamma J(x) - \langle x,\, {z}^\star - x^{\star} \rangle\). We have the following corollary from Lemma 5.1.
Corollary 5.3
 (i)
Condition (45) holds.
 (ii)
\(\mathcal {M}_{x^{\star }}^{J}\) and \(\mathcal {M}_{x^{\star }}^{R}\) are affine subspaces.
Lemma 5.4
[7, Proposition 2.1] \(\mathscr{L}_{\gamma}\) is \(\frac{2\beta}{4\beta - \gamma}\)-averaged nonexpansive.
The above lemma entails that \(\mathscr{L}_{\gamma}\) and \(\mathscr{L}_{\gamma,\lambda}\) are convergent; hence, the spectral properties in Lemma 5.2 apply to them. Denote \(\mathscr{L}_{\gamma}^{\infty} := \lim_{k\rightarrow +\infty} \mathscr{L}_{\gamma,\lambda}^k\).
Corollary 5.4
Consider the TOS iteration (43). Suppose it is run under Assumptions (B.1)–(B.4), that \(\lambda_{k} \rightarrow \lambda \in ]0, \frac{4\beta - \gamma}{2\beta}[\), and that F is locally \(C^2\) around \(x^{\star}\). Then we have \({z}_{k+1} - {z}^\star = \mathscr{L}_{\gamma,\lambda}({z}_{k} - {z}^\star) + o(\Vert {z}_{k} - {z}^\star \Vert)\). If moreover J, R are locally polyhedral around \(x^{\star}\), F is quadratic and \(\lambda_{k} \equiv \lambda \in ]0, \frac{4\beta - \gamma}{2\beta}[\) is chosen constant, then the term \(o(\Vert {z}_{k} - {z}^\star \Vert)\) vanishes.
We can also specialize Corollary 5.1 to this context; however, we skip it since the adaptation is straightforward.
Local Linear Convergence. Finally, we are able to present the local linear convergence for (43).
Corollary 5.5
 (i)
Given any \(\rho \in ]\rho(\mathscr{L}_{\gamma,\lambda} - \mathscr{L}_{\gamma}^{\infty}), 1[\), there exists \(K \in \mathbb{N}\) large enough such that \(\Vert (\text{Id} - \mathscr{L}_{\gamma}^{\infty})({z}_{k} - {z}^\star) \Vert = O(\rho^{k-K}) \,\, \forall k \ge K\).
 (ii)
If moreover J, R are locally polyhedral around \(x^{\star}\), F is quadratic and \(\lambda_{k} \equiv \lambda \in ]0, \frac{4\beta - \gamma}{2\beta}[\) is chosen constant, then there exists \(K \in \mathbb{N}\) such that \(\Vert {z}_{k} - {z}^\star \Vert \le \rho^{k-K} \Vert z_{K} - {z}^\star \Vert \,\, \forall k \ge K\).
6 Numerical Experiments
In this section, we illustrate our theoretical results on problems arising from statistics, and signal/image processing applications.^{3}
6.1 Examples of Partly Smooth Functions
Examples of partly smooth functions
Function | Expression | Partial smooth manifold
\(\ell_1\)-norm | \(\Vert x \Vert_1 = \sum_{i=1}^n |x_i|\) | \(\mathcal{M} = \big\{ z \in \mathbb{R}^n:\; I_z \subseteq I_x \big\}, \; I_x = \big\{ i:\; x_{i} \ne 0 \big\}\)
\(\ell_{1,2}\)-norm | |
\(\ell_\infty\)-norm | \(\max_{i=\{1,\ldots,n\}} |x_i|\) | \(\mathcal{M} = \big\{ z \in \mathbb{R}^n:\; z_{I_x} \in \mathbb{R}\,\text{sign}(x_{I_x}) \big\}\)
TV semi-norm | \(\Vert x \Vert_{\mathrm{TV}} = \Vert D_{\mathrm{DIF}} x \Vert_1\) | \(\mathcal{M} = \big\{ z \in \mathbb{R}^n:\; I_{D_{\mathrm{DIF}} z} \subseteq I_{D_{\mathrm{DIF}} x} \big\}\)
Nuclear norm | \(\Vert x \Vert_* = \sum_{i=1}^r \sigma_i(x)\) | \(\mathcal{M} = \big\{ z \in \mathbb{R}^{n_1 \times n_2}:\; \text{rank}(z) = \text{rank}(x) = r \big\}\)
The \(\ell_1\)- and \(\ell_\infty\)-norms and the anisotropic TV semi-norm are all polyhedral functions; hence, the corresponding Riemannian Hessians are simply 0. The \(\ell_{1,2}\)-norm is not polyhedral, yet it is partly smooth relative to a subspace; the nuclear norm is partly smooth relative to the manifold of fixed-rank matrices, which is no longer a subspace. The Riemannian Hessians of these two functions are nontrivial and can be computed following [38].
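For the \(\ell_1\)-norm, membership in the manifold of the table reduces to a comparison of supports. The sketch below illustrates this; the helper names `support` and `in_l1_manifold` and the sample vectors are hypothetical and serve only as an example.

```python
def support(x, tol=0.0):
    """Index set I_x = {i : x_i != 0}, with an optional numerical tolerance."""
    return {i for i, xi in enumerate(x) if abs(xi) > tol}


def in_l1_manifold(z, x, tol=0.0):
    """Check whether z lies in M = {z : I_z is a subset of I_x} for the l1-norm."""
    return support(z, tol) <= support(x, tol)


x = [1.5, 0.0, -2.0, 0.0]

same_support = in_l1_manifold([0.3, 0.0, 4.0, 0.0], x)      # same support as x
smaller_support = in_l1_manifold([0.0, 0.0, 4.0, 0.0], x)   # support contained in I_x
extra_entry = in_l1_manifold([0.3, 1e-3, 4.0, 0.0], x)      # one entry outside I_x
print(same_support, smaller_support, extra_entry)
```

In a numerical experiment, a check of this kind applied to \(u_{k}\) and \(x^{\star}\) gives a simple way to detect the iteration at which the support, and hence the manifold, has been identified.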
6.2 Numerical Experiments
The convergence profile of \(\min_{0 \le i \le k} \mathcal{D}_{\Phi}^{v^{\star}}(u_{i})\) is shown in Fig. 1a. The plot is in log-log scale, where the red line corresponds to the sublinear O(1/k) rate and the black line is \(\min_{0 \le i \le k} \mathcal{D}_{\Phi}^{v^{\star}}(u_{i})\). One can then confirm numerically the prediction of Theorem 4.2.
However, it can be observed that beyond some iteration, e.g. \(10^2\) for the considered example, the convergence rate changes to linear. We argue below that this is likely due to finite activity identification, since the \(\ell_{1}\)-norm and total variation are partly smooth (in fact even polyhedral), and that, for all k large enough, GFB enters a local linear convergence regime.
Local Linear Convergence of GFB/FDR. Following the above discussion, in Fig. 1b we present the local linear convergence of FDR in terms of \(\Vert {z}_{k} - {z}^\star \Vert\), as we are in the scope of Theorem 5.3(ii). We use the same parameter setting as in Fig. 1a. The red line stands for the estimated rate (see Theorem 5.3), while the black line is the numerical observation. The starting point of the red line is the iteration at which \(u_{k}\) identifies the manifolds. As shown in the figure, we indeed observe local linear convergence of \(\Vert {z}_{k} - {z}^\star \Vert\). Moreover, since \(F = \frac{1}{2}\Vert \mathcal{K}x - f \Vert^2\) is quadratic and the \(\ell_{1}\)-norm and total variation are polyhedral, our theoretical rate estimation is tight, i.e. the red line has the same slope as the black line.
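Comparing the slope of an observed error curve with a predicted rate can be done with a simple log-linear fit. The sketch below is a generic illustration under assumptions: the error sequence is synthetic (sublinear up to a hypothetical identification iteration `K`, then linear at rate 0.9), and `estimate_linear_rate` is a hypothetical helper, not part of the code accompanying the paper.

```python
import math


def estimate_linear_rate(errors, start):
    """Least-squares slope of log(errors[k]) against k for k >= start,
    returned as the rate rho = exp(slope)."""
    ks = list(range(start, len(errors)))
    logs = [math.log(errors[k]) for k in ks]
    k_mean = sum(ks) / len(ks)
    log_mean = sum(logs) / len(logs)
    num = sum((k - k_mean) * (l - log_mean) for k, l in zip(ks, logs))
    den = sum((k - k_mean) ** 2 for k in ks)
    return math.exp(num / den)


# Synthetic profile: O(1/k) decay before iteration K, linear decay afterwards.
K, rho_true = 100, 0.9
errors = [1.0 / (k + 1) for k in range(K)]
errors += [errors[K - 1] * rho_true ** (k - K + 1) for k in range(K, 200)]

rho_hat = estimate_linear_rate(errors, start=K)
print(rho_hat)
```

Starting the fit at the identification iteration matters: including the earlier sublinear phase would bias the estimated slope.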

In agreement with our analysis, the local convergence behaviour of the non-stationary iteration is no better than that of the stationary one. This contrasts with the global behaviour, where non-stationarity could be beneficial (see the last comment hereafter);

As argued in Remark 5.3(ii), the convergence rate is eventually controlled by the error \(|\gamma_k - \gamma|\), except for “Case 4”. Indeed, 0.5 is strictly smaller than the local linear rate of the stationary version (i.e. \(|\gamma_k - \gamma| = o(\Vert {z}_{k} - {z}^\star \Vert)\));

The non-stationary FDR seems to lead to faster identification, typically for “Case 3”. This is the effect of a bigger step size at the early stage.
In the test, we consider \(x \in \mathbb{R}^{50\times 50}\) and \(\mathcal{K}\) is the subsampling operator (we did not consider a larger problem size, as computing the theoretical rate is very time- and memory-consuming). Figure 2b shows the convergence profiles of GFB/TOS. Similarly to the observation made in Fig. 1b, both GFB (magenta line) and TOS (black line) converge sublinearly from the beginning and eventually enter a linear convergence regime. The red line is our theoretical linear rate estimation for TOS. Moreover, for this example, the performances of the two algorithms are very close, especially in the global sublinear regime.
7 Perspectives and Open Problems
In this paper, we address the convergence properties of the FDR algorithm from both global and local perspectives. The obtained results allow us to better understand the optimization problem (1) and the FDR algorithm, and moreover lay the foundation for our future research regarding several open problems.
The first open problem is the acceleration of FDR/GFB/TOS, or more generally acceleration schemes for non-descent-type methods. In recent years, owing to the success of Nesterov’s optimal scheme [9] and FISTA [12], the inertial technique has been widely adopted to speed up other non-descent-type operator splitting methods [40]. However, unlike the results in [9, 12], the acceleration effect of the inertial technique for these non-descent-type methods is rather limited, and the inertial version can even be slower than the original method [40, Chapter 4]. As a consequence, a proper acceleration scheme for non-descent methods, including FDR/GFB/TOS, with guaranteed acceleration remains an open problem.
Another direction for acceleration is the incremental version of these algorithms, particularly for GFB, as the separable structure of \(\sum_i R_i(x)\) in (7) is ideal for designing incremental schemes. Moreover, if F also has a finite sum structure, e.g. \(F(x) = \sum_{i=1}^{m} f_i(x)\), then similarly to [41], we can consider incremental schemes for both the smooth and nonsmooth components of the problem.
The third perspective would be extending the obtained results to the non-Euclidean setting. More precisely, the proximal mapping of (3) is defined based on the Euclidean distance between u and x. By replacing the Euclidean distance with a Bregman distance, we obtain Bregman-type splitting algorithms, which are much more general. Generalizing the obtained results to the Bregman-type splitting setting would be important and challenging.
For the local convergence analysis of the FDR algorithm, we have to restrict ourselves to finite-dimensional Euclidean space, because partial smoothness is only available in finite dimension. However, it has recently been reported that finite identification also occurs for problems in infinite dimension, such as off-the-grid compressive sensing [42]. As a result, a proper extension of partial smoothness to the infinite-dimensional setting is required to explain these phenomena.
8 Conclusions
In this paper, we studied global and local convergence properties of the Forward–Douglas–Rachford method. Globally, we established an o(1/k) convergence rate of the best iterate and an O(1/k) ergodic rate in terms of a Bregman divergence criterion designed for the method. We also specialized the result to the case of the Forward–Backward splitting method, for which we showed that the objective function value converges at an o(1/k) rate. Then, locally, we proved the linear convergence of the sequence when the involved functions are moreover partly smooth. In particular, we demonstrated that the method identifies the active manifolds in finite time and then converges locally linearly at a rate that we characterized precisely. We also extended the local linear convergence result to the case of the three-operator splitting method. Our numerical experiments supported the theoretical findings.
Footnotes
 1.
 2.
It is symmetric, if and only if R is a nondegenerate convex quadratic form.
 3.
MATLAB source for reproducing the numerical result can be found at https://github.com/jliang993/RateFDR.
Notes
Acknowledgements
Cesare Molinari was supported by CONICYT scholarship CONICYT-PCHA/Doctorado Nacional/2016. Jingwei Liang was supported by the European Research Council (ERC project SIGMA-Vision), the Leverhulme Trust project “Breaking the non-convexity barrier”, the EPSRC Grant “EP/M00483X/1”, the EPSRC centre “EP/N014588/1”, the Cantab Capital Institute for the Mathematics of Information, and the Global Alliance project “Statistical and Mathematical Theory of Imaging”. Jalal Fadili was partly supported by Institut Universitaire de France. We would like to thank the anonymous reviewers whose comments have greatly improved the quality of this paper.
References
 1. Bauschke, H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, Berlin (2011)
 2. Briceño-Arias, L.M.: Forward–Douglas–Rachford splitting and forward-partial inverse method for solving monotone inclusions. Optimization 64(5), 1239–1261 (2015)
 3. Douglas, J., Rachford, H.H.: On the numerical solution of heat conduction problems in two and three space variables. Trans. Am. Math. Soc. 82(2), 421–439 (1956)
 4. Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979)
 5. Raguet, H.: A note on the forward-Douglas-Rachford splitting for monotone inclusion and convex optimization. Optim. Lett. (2018). https://doi.org/10.1007/s11590-018-1272-8
 6. Raguet, H., Fadili, M.J., Peyré, G.: Generalized forward–backward splitting. SIAM J. Imaging Sci. 6(3), 1199–1226 (2013)
 7. Davis, D., Yin, W.: A three-operator splitting scheme and its optimization applications. Set-Valued Var. Anal. 25(4), 829–858 (2017)
 8. Liang, J., Fadili, J., Peyré, G.: Convergence rates with inexact non-expansive operators. Math. Program. Ser. A 159(1–2), 403–434 (2016)
 9. Nesterov, Y.: A method for solving the convex programming problem with convergence rate \(O(1/k^2)\). Dokl. Akad. Nauk SSSR 269(3), 543–547 (1983)
 10. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2004)
 11. Bredies, K., Lorenz, D.A.: Linear convergence of iterative soft-thresholding. J. Fourier Anal. Appl. 14(5–6), 813–837 (2008)
 12. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)
 13. Chambolle, A., Dossal, C.: On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. J. Optim. Theory Appl. 166(3), 968–982 (2015)
 14. Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov’s accelerated forward–backward method is actually faster than \(1/k^2\). SIAM J. Optim. 26(3), 1824–1834 (2016)
 15. Liang, J., Fadili, J., Peyré, G.: Activity identification and local linear convergence of forward–backward-type methods. SIAM J. Optim. 27(1), 408–437 (2017)
 16. Liang, J., Fadili, J., Peyré, G.: Local convergence properties of Douglas–Rachford and alternating direction method of multipliers. J. Optim. Theory Appl. 172(3), 874–913 (2017)
 17. Liang, J., Fadili, J., Peyré, G.: Local linear convergence analysis of Primal–Dual splitting methods. Optimization 67(6), 821–853 (2018)
 18. Davis, D.: Convergence rate analysis of the Forward–Douglas–Rachford splitting scheme. SIAM J. Optim. 25(3), 1760–1786 (2015)
 19. Liang, J., Fadili, J., Peyré, G.: Local linear convergence of forward–backward under partial smoothness. In: Advances in Neural Information Processing Systems, pp. 1970–1978 (2014)
 20. Bolte, J., Nguyen, T.P., Peypouquet, J., Suter, B.W.: From error bounds to the complexity of first-order descent methods for convex functions. Math. Program. 165(2), 471–507 (2017). https://doi.org/10.1007/s10107-016-1091-6
 21. Drusvyatskiy, D., Lewis, A.S.: Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43(3), 919–948 (2018)
 22. Luo, Z.Q., Tseng, P.: On the linear convergence of descent methods for convex essentially smooth minimization. SIAM J. Control Optim. 30(2), 408–425 (1992). https://doi.org/10.1137/0330025
 23. Luo, Z.Q., Tseng, P.: Error bounds and convergence analysis of feasible descent methods: a general approach. Ann. Oper. Res. 46(1), 157–178 (1993). https://doi.org/10.1007/BF02096261
 24. Zhou, Z., So, A.M.C.: A unified approach to error bounds for structured convex optimization problems. Math. Program. 165(2), 689–728 (2017). https://doi.org/10.1007/s10107-016-1100-9
 25. Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka–Łojasiewicz inequality and its applications to linear convergence of first-order methods. Found. Comput. Math. (2017). https://doi.org/10.1007/s10208-017-9366-8
 26. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1999)
 27. Ogura, N., Yamada, I.: Non-strictly convex minimization over the fixed point set of an asymptotically shrinking nonexpansive mapping. Numer. Funct. Anal. Optim. 23(1–2), 113–137 (2002)
 28. Knopp, K.: Theory and Application of Infinite Series. Courier Corporation, North Chelmsford (2013)
 29. Lewis, A.S.: Active sets, nonsmoothness, and sensitivity. SIAM J. Optim. 13(3), 702–725 (2003)
 30. Combettes, P.L.: Quasi–Fejérian analysis of some optimization algorithms. Stud. Comput. Math. 8, 115–152 (2001)
 31. Brézis, H.: Opérateurs Maximaux Monotones et Semi-Groupes de Contractions dans les Espaces de Hilbert. North-Holland/Elsevier, New York (1973)
 32. Combettes, P.L., Wajs, V.R.: Signal recovery by proximal forward-backward splitting. Multiscale Model. Simul. 4(4), 1168–1200 (2005)
 33. Rockafellar, R.T., Wets, R.: Variational Analysis, vol. 317. Springer, Berlin (1998)
 34. Hare, W.L., Lewis, A.S.: Identifying active constraints via partial smoothness and prox-regularity. J. Convex Anal. 11(2), 251–266 (2004)
 35. Combettes, P.L.: Solving monotone inclusions via compositions of nonexpansive averaged operators. Optimization 53(5–6), 475–504 (2004)
 36. Bauschke, H.H., Bello Cruz, J., Nghia, T., Phan, H.M., Wang, X.: Optimal rates of convergence of matrices with applications. Numer. Algorithms (2016). arxiv:1407.0671
 37. Condat, L.: A direct algorithm for 1D total variation denoising. IEEE Signal Process. Lett. 20(11), 1054–1057 (2013)
 38. Vaiter, S., Deledalle, C., Fadili, J., Peyré, G., Dossal, C.: The degrees of freedom of partly smooth regularizers. Ann. Inst. Stat. Math. 69(4), 791–832 (2017)
 39. Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K.: Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(1), 91–108 (2005)
 40. Liang, J.: Convergence Rates of First-Order Operator Splitting Methods. Ph.D. Thesis, Normandie Université; GREYC CNRS UMR 6072 (2016)
 41. Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optim. Mach. Learn. 2010(1–38), 3 (2011)
 42. Poon, C., Peyré, G.: Multidimensional sparse super-resolution. SIAM J. Math. Anal. 51(1), 1–44 (2019)
 43. Chavel, I.: Riemannian Geometry: A Modern Introduction, vol. 98. Cambridge University Press, Cambridge (2006)
 44. Miller, S.A., Malick, J.: Newton methods for nonsmooth convex minimization: connections among \(\mathcal{U}\)-Lagrangian, Riemannian Newton and SQP methods. Math. Program. 104(2–3), 609–633 (2005)
 45. Absil, P.A., Mahony, R., Trumpf, J.: An extrinsic look at the Riemannian Hessian. In: Geometric Science of Information, pp. 361–368. Springer (2013)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.