A Further Applications of the Results of Section 3
1.1 o(1∕(k + 1)²) FPR of FBS and PPA
In problem (4.0), let g be a C¹ function with Lipschitz derivative. The forward-backward splitting (FBS) algorithm is the iteration:
$$\displaystyle\begin{array}{rcl} z^{k+1} = \mathbf{prox}_{\gamma f}(z^{k} -\gamma \nabla g(z^{k})),\quad k = 0,1,\ldots.& &{}\end{array}$$
(4.27)
The FBS algorithm generalizes several classical methods (see below) and admits the following subgradient representation:
$$\displaystyle\begin{array}{rcl} z^{k+1} = z^{k} -\gamma \widetilde{\nabla }f(z^{k+1}) -\gamma \nabla g(z^{k})& &{}\end{array}$$
(4.28)
where \(\widetilde{\nabla }f(z^{k+1}):= (1/\gamma )(z^{k} - z^{k+1} -\gamma \nabla g(z^{k})) \in \partial f(z^{k+1})\), and \(z^{k+1}\) and \(\widetilde{\nabla }f(z^{k+1})\) are unique given \(z^{k}\) and γ > 0.
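For illustration, the following minimal sketch (not part of the analysis) runs iteration (4.27) on a hypothetical ℓ1-regularized least-squares instance, for which \(\mathbf{prox}_{\gamma f}\) is soft-thresholding and ∇g(z) = Mᵀ(Mz − b); the data M, b and the parameters μ, γ below are placeholder choices.

```python
import numpy as np

# Minimal sketch of FBS, iteration (4.27), for the illustrative choices
# f(z) = mu*||z||_1 and g(z) = 0.5*||M z - b||^2, so that prox_{gamma f} is
# soft-thresholding and grad g(z) = M^T (M z - b).
rng = np.random.default_rng(0)
M = rng.standard_normal((40, 100))
b = rng.standard_normal(40)
mu = 0.1
beta = 1.0 / np.linalg.norm(M, 2) ** 2    # grad g is (1/beta)-Lipschitz
gamma = beta                              # any gamma < 2*beta is admissible

def prox_f(x, gamma):
    # soft-thresholding: prox of gamma*mu*||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - gamma * mu, 0.0)

z = np.zeros(100)
for k in range(500):
    z = prox_f(z - gamma * (M.T @ (M @ z - b)), gamma)   # FBS step (4.27)
```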
In this section, we analyze the convergence rate of the FBS algorithm given in Equations (4.27) and (4.28). If g = 0, FBS reduces to the proximal point algorithm (PPA) and β = ∞. If f = 0, FBS reduces to gradient descent. The FBS algorithm can be written in the following operator form:
$$\displaystyle\begin{array}{rcl} T_{\mathrm{FBS}}:= \mathbf{prox}_{\gamma f} \circ (I -\gamma \nabla g).& & {}\\ \end{array}$$
Because \(\mathbf{prox}_{\gamma f}\) is (1∕2)-averaged and I −γ∇g is γ∕(2β)-averaged [46, Theorem 3(b)], it follows that \(T_{\mathrm{FBS}}\) is \(\alpha _{\mathrm{FBS}}\)-averaged for
$$\displaystyle\begin{array}{rcl} \alpha _{\mathrm{FBS}}:= \frac{2\beta } {4\beta -\gamma } \in (1/2,1)& & {}\\ \end{array}$$
whenever γ < 2β [2, Proposition 4.32]. Thus, we have \(T_{\mathrm{FBS}} = (1 -\alpha _{\mathrm{FBS}})I +\alpha _{\mathrm{FBS}}T\) for a certain nonexpansive operator T, and \(T_{\mathrm{FBS}}(z^{k}) - z^{k} =\alpha _{\mathrm{FBS}}(Tz^{k} - z^{k})\). In particular, for all γ < 2β the following sum is finite:
$$\displaystyle\begin{array}{rcl} \sum _{k=0}^{\infty }\|T_{\mathrm{ FBS}}(z^{k}) - z^{k}\|^{2}\stackrel{(<InternalRef RefID="Equ17">4.10</InternalRef>)}{\leq }\frac{\alpha _{\mathrm{FBS}}\|z^{0} - z^{{\ast}}\|^{2}} {(1 -\alpha _{\mathrm{FBS}})}.& & {}\\ \end{array}$$
To analyze the FBS algorithm we need to derive a joint subgradient inequality for f + g. First, we recall the following sufficient descent property for Lipschitz differentiable functions.
Theorem 11 (Descent Theorem [2, Theorem 18.15(iii)]).
If g is differentiable and ∇g is (1∕β)-Lipschitz, then for all \(x,y \in \mathcal{H}\) we have the upper bound
$$\displaystyle\begin{array}{rcl} g(x) \leq g(y) + \langle x - y,\nabla g(y)\rangle + \frac{1} {2\beta }\|x - y\|^{2}.& & {}\\ \end{array}$$
Corollary 4 (Joint Descent Theorem).
If g is differentiable and ∇g is (1∕β)-Lipschitz, then for all points x,y ∈ dom (f) and \(z \in \mathcal{H}\), and subgradients \(\widetilde{\nabla }f(x) \in \partial f(x)\), we have
$$\displaystyle\begin{array}{rcl} f(x) + g(x) \leq f(y) + g(y) + \langle x - y,\nabla g(z) +\widetilde{ \nabla }f(x)\rangle + \frac{1} {2\beta }\|z - x\|^{2}.& &{}\end{array}$$
(4.29)
Proof.
Inequality (4.29) follows from adding the upper bound (the first inequality below is Theorem 11 applied at the pair (x,z); the second uses the convexity of g)
$$\displaystyle\begin{array}{rcl} g(x) - g(y) \leq g(z) - g(y) + \langle x - z,\nabla g(z)\rangle + \frac{1} {2\beta }\|z - x\|^{2} \leq \langle x - y,\nabla g(z)\rangle + \frac{1} {2\beta }\|z - x\|^{2}& & {}\\ \end{array}$$
with the subgradient inequality: \(f(x) \leq f(y) + \langle x - y,\widetilde{\nabla }f(x)\rangle\). □
We now improve the O(1∕(k + 1)²) FPR rate for PPA in [12, Théorème 9] by showing that the FPR rate of FBS is actually o(1∕(k + 1)²).
Theorem 12 (Objective and FPR Convergence of FBS).
Let \(z^{0} \in \mathrm{dom}\,(f) \cap \mathrm{dom}\,(g)\) and let \(x^{\ast}\) be a minimizer of f + g. Suppose that \((z^{j})_{j\geq 0}\) is generated by FBS (iteration (4.27)), where ∇g is (1∕β)-Lipschitz and γ < 2β. Then for all k ≥ 0,
$$\displaystyle\begin{array}{rcl} h(z^{k+1},z^{k+1}) \leq \frac{\|z^{0} - x^{{\ast}}\|^{2}} {k + 1} \times \left \{\begin{array}{@{}l@{\quad }l@{}} \frac{1} {2\gamma } \quad &\mbox{if $\gamma \leq \beta$ }; \\ \left (\frac{1} {2\gamma } + \left (\frac{1} {2\beta } -\frac{1} {2\gamma }\right ) \frac{\alpha _{\mathrm{FBS}}} {(1-\alpha _{\mathrm{FBS}})}\right )\quad &\text{otherwise,} \end{array} \right.& & {}\\ \end{array}$$
and
$$\displaystyle{h(z^{k+1},z^{k+1}) = o(1/(k + 1)),}$$
where the objective-error function h is defined in (4.19). In addition, for all k ≥ 0, we have \(\|T_{\mathrm{FBS}}z^{k+1} - z^{k+1}\|^{2} = o(1/(k + 1)^{2})\) and
$$\displaystyle\begin{array}{rcl} \|T_{\mathrm{FBS}}z^{k+1} - z^{k+1}\|^{2}& \leq & \frac{\|z^{0} - x^{{\ast}}\|^{2}} {\big(\frac{1} {\gamma } -\frac{1} {2\beta }\big)(k + 1)^{2}} \times \left \{\begin{array}{@{}l@{\quad }l@{}} \frac{1} {2\gamma } \quad &\mbox{if $\gamma \leq \beta$ }; \\ \left (\frac{1} {2\gamma } +\big (\frac{1} {2\beta } -\frac{1} {2\gamma }\big) \frac{\alpha _{\mathrm{FBS}}} {(1-\alpha _{\mathrm{FBS}})}\right )\quad &\text{otherwise}. \end{array} \right.{}\\ \end{array}$$
Proof.
Recall that \(z^{k} - z^{k+1} =\gamma \widetilde{ \nabla }f(z^{k+1}) +\gamma \nabla g(z^{k}) \in \gamma \partial f(z^{k+1}) +\gamma \nabla g(z^{k})\) for all k ≥ 0. Thus, the joint descent theorem shows that for all x ∈ dom(f), we have
$$\displaystyle\begin{array}{rcl} & & f(z^{k+1}) + g(z^{k+1}) - f(x) - g(x)\stackrel{(<InternalRef RefID="Equ117">4.29</InternalRef>)}{\leq }\frac{1} {\gamma } \langle z^{k+1} - x,z^{k} - z^{k+1}\rangle + \frac{1} {2\beta }\|z^{k} - z^{k+1}\|^{2} \\ & & \phantom{f(z^{k+1}) + g(z^{k+1})-} = \frac{1} {2\gamma }\left (\|z^{k}\! -\! x\|^{2}\! -\!\| z^{k+1}\! -\! x\|^{2}\right )\! +\! \left (\frac{1} {2\beta } -\frac{1} {2\gamma }\right )\|z^{k+1}\! -\! z^{k}\|^{2}. {}\end{array}$$
(4.30)
If we set \(x = x^{\ast}\) in Equation (4.30), we see that \((h(z^{j+1},z^{j+1}))_{j\geq 0}\) is positive, summable, and
$$\displaystyle\begin{array}{rcl} \sum _{j=0}^{\infty }h(z^{j+1},z^{j+1}) \leq \left \{\begin{array}{@{}l@{\quad }l@{}} \frac{1} {2\gamma }\|z^{0} - x^{{\ast}}\|^{2} \quad &\mbox{if $\gamma \leq \beta$ }; \\ \left (\frac{1} {2\gamma } + \left (\frac{1} {2\beta } -\frac{1} {2\gamma }\right ) \frac{\alpha _{\mathrm{FBS}}} {(1-\alpha _{\mathrm{FBS}})}\right )\|z^{0} - x^{{\ast}}\|^{2}\quad &\text{otherwise}. \end{array} \right.& &{}\end{array}$$
(4.31)
In addition, if we set \(x = z^{k}\) in Equation (4.30), then we see that \((h(z^{j+1},z^{j+1}))_{j\geq 0}\) is decreasing:
$$\displaystyle\begin{array}{rcl} \left (\frac{1} {\gamma } -\frac{1} {2\beta }\right )\|z^{k+1} - z^{k}\|^{2} \leq h(z^{k},z^{k}) - h(z^{k+1},z^{k+1}).& & {}\\ \end{array}$$
Therefore, the rates for \(h(z^{k+1},z^{k+1})\) follow by Lemma 1 Part (a), with \(a_{k} = h(z^{k+1},z^{k+1})\) and \(\lambda _{k} \equiv 1\).
Now we prove the rates for \(\|T_{\mathrm{FBS}}z^{k+1} - z^{k+1}\|^{2}\). We apply Part 3 of Lemma 1 with \(a_{k} = \left (1/\gamma - 1/(2\beta )\right )\|z^{k+2} - z^{k+1}\|^{2}\), \(\lambda _{k} \equiv 1\), \(e_{k} = 0\), and \(b_{k} = h(z^{k+1},z^{k+1})\) for all k ≥ 0, to show that \(\sum _{i=0}^{\infty }(i + 1)a_{i}\) is less than the sum in Equation (4.31). Part 2 of Theorem 1 shows that \((a_{j})_{j\geq 0}\) is monotonically nonincreasing. Therefore, the convergence rate of \((a_{j})_{j\geq 0}\) follows from Part (b) of Lemma 1. □
When f = 0, the objective error upper bound in Theorem 12 is strictly better than the bound provided in [45, Corollary 2.1.2]. In FBS, the objective error rate is the same as the one derived in [4, Theorem 3.1], when γ ∈ (0, β], and is the same as the one given in [11] in the case that γ ∈ (0, 2β). The little-o FPR rate is new in all cases except for the special case of PPA (g ≡ 0) under the condition that the sequence \((z^{j})_{j\geq 0}\) strongly converges to a minimizer [33].
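As an informal numerical check (illustrative only, on a hypothetical lasso instance with placeholder data), one can track the scaled squared FPR \((k+1)^{2}\|z^{k+1} - z^{k}\|^{2}\) along the FBS iterates; Theorem 12 predicts that this quantity tends to zero.

```python
import numpy as np

# Track (k+1)^2 * ||z^{k+1} - z^k||^2 for FBS on an illustrative lasso
# instance; Theorem 12 predicts that this scaled squared FPR tends to zero.
rng = np.random.default_rng(1)
M = rng.standard_normal((30, 60))
b = rng.standard_normal(30)
mu = 0.1
beta = 1.0 / np.linalg.norm(M, 2) ** 2
gamma = beta
z = np.zeros(60)
for k in range(3000):
    w = z - gamma * (M.T @ (M @ z - b))                            # forward step
    z_new = np.sign(w) * np.maximum(np.abs(w) - gamma * mu, 0.0)   # backward step
    if (k + 1) % 500 == 0:
        print(k + 1, (k + 1) ** 2 * np.sum((z_new - z) ** 2))
    z = z_new
```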
1.2 o(1∕(k + 1)²) FPR of One Dimensional DRS
Whenever the operator \((T_{\mathrm{PRS}})_{1/2}\) is applied in R, the convergence rate of the FPR improves to o(1∕(k + 1)²).
Theorem 13.
Suppose that \(\mathcal{H} = \mathbf{R}\), and suppose that \((z^{j})_{j\geq 0}\) is generated by the DRS algorithm, i.e., Algorithm 1 with \(\lambda _{k} \equiv 1/2\). Then for all k ≥ 0,
$$\displaystyle\begin{array}{rcl} \vert (T_{\mathrm{PRS}})_{1/2}z^{k+1} - z^{k+1}\vert ^{2} \leq \frac{\vert z^{0} - z^{{\ast}}\vert ^{2}} {2(k + 1)^{2}} \ \ \mathrm{and}\ \ \vert (T_{\mathrm{PRS}})_{1/2}z^{k+1} - z^{k+1}\vert ^{2} = o\left ( \frac{1} {(k + 1)^{2}}\right ).& & {}\\ \end{array}$$
Proof.
Note that (T
PRS)1∕2 is (1∕2)-averaged, and, hence, it is the resolvent of some maximal monotone operator on R [2, Corollary 23.8]. Furthermore, every maximal monotone operator on R is the subdifferential operator of a closed, proper, and convex function [2, Corollary 22.19]. Therefore, DRS is equivalent to the proximal point algorithm applied to a certain convex function on R. Thus, the result follows by Theorem 12 applied to this function. □
B Further Lower Complexity Results
2.1 Ergodic Convergence of Feasibility Problems
Proposition 8.
The ergodic feasibility convergence rate in Equation (4.20) is optimal up to a factor of two.
Proof.
Algorithm 1 with \(\lambda _{k} = 1\) for all k ≥ 0 (i.e., PRS) is applied to the functions \(f =\iota _{\{(x_{1},x_{2})\in \mathbf{R}^{2}\vert x_{1}=0\}}\) and \(g =\iota _{\{(x_{1},x_{2})\in \mathbf{R}^{2}\vert x_{2}=0\}}\) with the initial iterate \(z^{0} = (1,1) \in \mathbf{R}^{2}\). Because \(T_{\mathrm{PRS}} = -I_{\mathcal{H}}\), it is easy to see that the only fixed point of \(T_{\mathrm{PRS}}\) is \(z^{\ast} = (0,0)\). In addition, the following identities are satisfied:
$$\displaystyle\begin{array}{rcl} x_{g}^{k} = \left \{\begin{array}{@{}l@{\quad }l@{}} (1,0) \quad &\text{even}\ k; \\ (-1,0)\quad &\text{odd}\ k. \end{array} \right.\quad z^{k} = \left \{\begin{array}{@{}l@{\quad }l@{}} (1,1) \quad &\text{even}\ k; \\ (-1,-1)\quad &\text{odd}\ k. \end{array} \right.\quad x_{f}^{k} = \left \{\begin{array}{@{}l@{\quad }l@{}} (0,-1)\quad &\text{even}\ k; \\ (0,1) \quad &\text{odd}\ k. \end{array} \right.& & {}\\ \end{array}$$
Thus, the PRS algorithm oscillates around the solution \(x^{\ast} = (0,0)\). However, note that the averaged iterates satisfy:
$$\displaystyle\begin{array}{rcl} \overline{x}_{g}^{k} = \left \{\begin{array}{@{}l@{\quad }l@{}} ( \frac{1} {k+1},0)\quad &\text{even}\ k; \\ (0,0) \quad &\text{odd}\ k. \end{array} \right.\quad \mathrm{and}\quad \overline{x}_{f}^{k} = \left \{\begin{array}{@{}l@{\quad }l@{}} (0, \frac{-1} {k+1})\quad &\text{even}\ k; \\ (0,0) \quad &\text{odd}\ k. \end{array} \right.& & {}\\ \end{array}$$
It follows that \(\|\overline{x}_{g}^{k} -\overline{x}_{f}^{k}\| = (1/(k + 1))\|(1,-1)\| = (1/(k + 1))\|z^{0} - z^{{\ast}}\|\), \(\forall k \geq 0\). □
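The oscillation and the averaged iterates above are easy to reproduce numerically; the following sketch (illustrative only) runs PRS with \(\lambda _{k} \equiv 1\) on the two coordinate axes in \(\mathbf{R}^{2}\), starting from \(z^{0} = (1,1)\).

```python
import numpy as np

# PRS (lambda_k = 1) for the two coordinate axes in R^2, starting at (1, 1):
# the iterates oscillate, while the ergodic averages decay like 1/(k+1).
def P_g(z):               # projection onto {x_2 = 0}
    return np.array([z[0], 0.0])

def P_f(z):               # projection onto {x_1 = 0}
    return np.array([0.0, z[1]])

z = np.array([1.0, 1.0])
sum_xg = np.zeros(2)
sum_xf = np.zeros(2)
for k in range(6):
    x_g = P_g(z)
    x_f = P_f(2 * x_g - z)
    z = z + 2 * (x_f - x_g)          # relaxed PRS update with lambda_k = 1
    sum_xg += x_g
    sum_xf += x_f
    print(k, x_g, x_f, sum_xg / (k + 1), sum_xf / (k + 1))
```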
2.2 Optimal Objective and FPR Rates with Lipschitz Derivative
The following examples show that the objective and FPR rates derived in Theorem 12 are essentially optimal. The setup of the following counterexample already appeared in [12, Remarque 6] but the objective function lower bounds were not shown.
Theorem 14 (Lower Complexity of PPA).
There exists a Hilbert space \(\mathcal{H}\), and a closed, proper, and convex function f such that for all α > 1∕2, there exists \(z^{0} \in \mathcal{H}\) such that if \((z^{j})_{j\geq 0}\) is generated by PPA, then
$$\displaystyle\begin{array}{rcl} \|\mathbf{prox}_{\gamma f}(z^{k}) - z^{k}\|^{2}& \geq & \frac{\gamma ^{2}} {(1 + 2\alpha )e^{2\gamma }(k+\gamma )^{1+2\alpha }}, {}\\ f(z^{k+1}) - f(x^{{\ast}})& \geq & \frac{1} {4\alpha e^{2\gamma }(k + 1+\gamma )^{2\alpha }}. {}\\ \end{array}$$
Proof.
Let \(\mathcal{H}\ =\ell _{2}(\mathbf{R})\), and define a linear map \(A: \mathcal{H}\ \rightarrow \mathcal{H}\):
$$\displaystyle\begin{array}{rcl} A\left (z_{1},z_{2},\cdots \,,z_{n},\cdots \,\right )& =& \left (z_{1}, \frac{z_{2}} {2},\cdots \,, \frac{z_{n}} {n},\cdots \,\right ). {}\\ \end{array}$$
For all \(z \in \mathcal{H}\), define \(f(z) = (1/2)\langle Az,z\rangle\). Then we have the following proximal identities for f:
$$\displaystyle\begin{array}{rcl} \mathbf{prox}_{\gamma f}(z)& =& (I +\gamma A)^{-1}(z) = \left ( \frac{j} {j+\gamma }z_{j}\right )_{j\geq 1}\quad \mathrm{and}\quad (I -\mathbf{prox}_{\gamma f})(z) = \left ( \frac{\gamma } {j+\gamma }z_{j}\right )_{j\geq 1}. {}\\ \end{array}$$
Now let \(z^{0} = (1/(j+\gamma )^{\alpha })_{j\geq 1} \in \mathcal{H}\), and set \(T = \mathbf{prox}_{\gamma f}\). Then we get the following FPR lower bound:
$$\displaystyle\begin{array}{rcl} \|z^{k+1} - z^{k}\|^{2} =\| T^{k}(T - I)z^{0}\|^{2}& =& \sum _{ i=1}^{\infty }\left ( \frac{i} {i+\gamma }\right )^{2k} \frac{\gamma ^{2}} {(i+\gamma )^{2+2\alpha }} {}\\ & \geq & \sum _{i=k}^{\infty }\left ( \frac{i} {i+\gamma }\right )^{2k} \frac{\gamma ^{2}} {(i+\gamma )^{2+2\alpha }} {}\\ & \geq & \frac{\gamma ^{2}} {(1 + 2\alpha )e^{2\gamma }(k+\gamma )^{1+2\alpha }}. {}\\ \end{array}$$
Furthermore, the objective lower bound holds
$$\displaystyle\begin{array}{rcl} f(z^{k+1}) - f(x^{{\ast}}) = \frac{1} {2}\langle Az^{k+1},z^{k+1}\rangle & =& \frac{1} {2}\sum _{i=1}^{\infty }\frac{1} {i} \left ( \frac{i} {i+\gamma }\right )^{2(k+1)} \frac{1} {(i+\gamma )^{2\alpha }} {}\\ & \geq & \frac{1} {2}\sum _{i=k+1}^{\infty }\left ( \frac{i} {i+\gamma }\right )^{2(k+1)} \frac{1} {(i+\gamma )^{1+2\alpha }} {}\\ & \geq & \frac{1} {4\alpha e^{2\gamma }(k + 1+\gamma )^{2\alpha }}. {}\\ \end{array}$$
□
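A finite-dimensional truncation of this construction can be explored numerically. The sketch below only approximates the \(\ell_{2}(\mathbf{R})\) example (the sequence is truncated to N coordinates, and α, γ, N are placeholder choices), but it lets one compare the squared FPR with the stated lower bound for moderate k.

```python
import numpy as np

# Truncated illustration of Theorem 14: on ell_2, PPA for this f acts
# componentwise as z^{k+1}_j = (j/(j+gamma)) * z^k_j.  Truncation to N
# coordinates makes the comparison indicative rather than exact.
N, gamma, alpha = 200_000, 1.0, 0.75      # alpha > 1/2 so z^0 lies in ell_2
j = np.arange(1, N + 1, dtype=float)
z = 1.0 / (j + gamma) ** alpha            # z^0
shrink = j / (j + gamma)                  # componentwise action of prox_{gamma f}
for k in range(50):
    z_new = shrink * z
    fpr = np.sum((z_new - z) ** 2)        # ||prox_{gamma f}(z^k) - z^k||^2
    lower = gamma ** 2 / ((1 + 2 * alpha) * np.exp(2 * gamma)
                          * (k + gamma) ** (1 + 2 * alpha))
    if k % 10 == 0:
        print(k, fpr, lower)              # fpr should stay above lower
    z = z_new
```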
C ADMM Convergence Rate Proofs
Given an initial vector \(z^{0} \in \mathcal{G}\), Lemma 2 shows that at each iteration relaxed PRS performs the following computations:
$$\displaystyle\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} w_{d_{g}}^{k} = \mathbf{prox}_{\gamma d_{g}}(z^{k}); \quad \\ w_{d_{f}}^{k} = \mathbf{prox}_{\gamma d_{f}}(2w_{d_{g}}^{k} - z^{k}); \quad \\ z^{k+1} = z^{k} + 2\lambda _{k}(w_{d_{f}}^{k} - w_{d_{g}}^{k}).\quad \end{array} \right.& & {}\\ \end{array}$$
In order to apply the relaxed PRS algorithm, we need to compute the proximal operators of the dual functions \(d_{f}\) and \(d_{g}\).
Lemma 9 (Proximity Operators on the Dual).
Let \(w,v \in \mathcal{G}\). Then the update formulas \(w^{+} = \mathbf{prox}_{\gamma d_{f}}(w)\) and \(v^{+} = \mathbf{prox}_{\gamma d_{g}}(v)\) are equivalent to the following computations
$$\displaystyle\begin{array}{rcl} & & \left \{\begin{array}{@{}l@{\quad }l@{}} x^{+} =\mathop{ \mathrm{arg\,min}}\limits _{ x\in \mathcal{H}\ _{1}}f(x) -\langle w,Ax\rangle + \frac{\gamma } {2}\|Ax\|^{2};\quad \\ w^{+} = w -\gamma Ax^{+}; \quad \end{array} \right. \\ & & \left \{\begin{array}{@{}l@{\quad }l@{}} y^{+} =\mathop{ \mathrm{arg\,min}}\limits _{y\in \mathcal{H}\ _{2}}g(y) -\langle v,By - b\rangle + \frac{\gamma } {2}\|By - b\|^{2};\quad \\ v^{+} = v -\gamma (By^{+} - b); \quad \end{array} \right.{}\end{array}$$
(4.32)
respectively. In addition, the subgradient inclusions hold: \(A^{\ast }w^{+} \in \partial f(x^{+})\) and \(B^{\ast }v^{+} \in \partial g(y^{+})\). Finally, \(w^{+}\) and \(v^{+}\) are independent of the choice of \(x^{+}\) and \(y^{+}\), respectively, even if they are not unique solutions to the minimization subproblems.
We can use Lemma 9 to derive the relaxed form of ADMM in Algorithm 2. Note that this form of ADMM eliminates the “hidden variable” sequence \((z^{j})_{j\geq 0}\) in Equation (4.32). The following derivation is not new, but is included for the sake of completeness. See [31] for the original derivation.
Proposition 9 (Relaxed ADMM).
Let \(z^{0} \in \mathcal{G}\), and let \((z^{j})_{j\geq 0}\) be generated by the relaxed PRS algorithm applied to the dual formulation in Equation (4.27). Choose initial points \(w_{d_{g}}^{-1} = z^{0},x^{-1} = 0\) and \(y^{-1} = 0\) and initial relaxation \(\lambda _{-1} = 1/2\). Then we have the following identities starting from k = −1:
$$\displaystyle\begin{array}{rcl} y^{k+1}& =& \mathop{\mathrm{arg\,min}}\limits _{ y\in \mathcal{H}\ _{2}}g(y) -\langle w_{d_{g}}^{k},Ax^{k} + By - b\rangle + {}\\ & & \qquad \quad \frac{\gamma } {2}\|Ax^{k} + By - b + (2\lambda _{ k} - 1)(Ax^{k} + By^{k} - b)\|^{2} {}\\ w_{d_{g}}^{k+1}& =& w_{ d_{g}}^{k} -\gamma (Ax^{k} + By^{k+1} - b) -\gamma (2\lambda _{ k} - 1)(Ax^{k} + By^{k} - b) {}\\ x^{k+1}& =& \mathop{\mathrm{arg\,min}}\limits _{ x\in \mathcal{H}\ _{1}}f(x) -\langle w_{d_{g}}^{k+1},Ax + By^{k+1} - b\rangle + \frac{\gamma } {2}\|Ax + By^{k+1} - b\|^{2} {}\\ w_{d_{f}}^{k+1}& =& w_{ d_{g}}^{k+1} -\gamma (Ax^{k+1} + By^{k+1} - b) {}\\ \end{array}$$
Proof.
By Equation (4.32) and Lemma 9, we get the following formulation for the k-th iteration: given \(z^{0} \in \mathcal{H}\),
$$\displaystyle\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} y^{k} \quad &=\mathop{ \mathrm{arg\,min}}\limits _{y\in \mathcal{H}\ _{2}}g(y) -\langle z^{k},By - b\rangle + \frac{\gamma } {2}\|By - b\|^{2}; \\ w_{d_{g}}^{k}\quad &= z^{k} -\gamma (By^{k} - b); \\ x^{k} \quad &=\mathop{ \mathrm{arg\,min}}\limits _{x\in \mathcal{H}\ _{1}}f(x) -\langle 2w_{d_{g}}^{k} - z^{k},Ax\rangle + \frac{\gamma } {2}\|Ax\|^{2}; \\ w_{d_{f}}^{k}\quad &= 2w_{d_{g}}^{k} - z^{k} -\gamma Ax^{k}; \\ z^{k+1} \quad &= z^{k} + 2\lambda _{k}(w_{d_{f}}^{k} - w_{d_{g}}^{k}).\end{array} \right.& & {}\\ \end{array}$$
We will use this form to get to the claimed iteration. First,
$$\displaystyle\begin{array}{rcl} 2w_{d_{g}}^{k} - z^{k} = w_{ d_{g}}^{k} -\gamma (By^{k} - b)\quad \mathrm{and}\quad w_{ d_{f}}^{k} = w_{ d_{g}}^{k} -\gamma (Ax^{k} + By^{k} - b).& &{}\end{array}$$
(4.33)
Furthermore, we can simplify the definition of \(x^{k}\):
$$\displaystyle\begin{array}{rcl} x^{k}& = & \mathop{\mathrm{arg\,min}}\limits _{ x\in \mathcal{H}\ _{1}}f(x) -\langle 2w_{d_{g}}^{k} - z^{k},Ax\rangle + \frac{\gamma } {2}\|Ax\|^{2} {}\\ & \stackrel{\mathrm{(<InternalRef RefID="Equ136">4.33</InternalRef>)}}{=}& \mathop{\mathrm{arg\,min}}\limits _{x\in \mathcal{H}\ _{1}}f(x) -\langle w_{d_{g}}^{k} -\gamma (By^{k} - b),Ax\rangle + \frac{\gamma } {2}\|Ax\|^{2} {}\\ & = & \mathop{\mathrm{arg\,min}}\limits _{x\in \mathcal{H}\ _{1}}f(x) -\langle w_{d_{g}}^{k},Ax + By^{k} - b\rangle + \frac{\gamma } {2}\|Ax + By^{k} - b\|^{2}. {}\\ \end{array}$$
Note that the last two lines of the display above differ by terms independent of x.
We now eliminate the \(z^{k}\) variable from the \(y^{k}\) subproblem: because \(w_{d_{f}}^{k} + z^{k} = 2w_{d_{g}}^{k} -\gamma Ax^{k}\), we have
$$\displaystyle\begin{array}{rcl} z^{k+1}& = & z^{k} + 2\lambda _{k}(w_{d_{f}}^{k} - w_{d_{g}}^{k}) {}\\ & \stackrel{\mathrm{(<InternalRef RefID="Equ136">4.33</InternalRef>)}}{=}& z^{k} + w_{d_{f}}^{k} - w_{d_{g}}^{k} -\gamma (2\lambda _{k} - 1)(Ax^{k} + By^{k} - b) {}\\ & = & w_{d_{g}}^{k} -\gamma Ax^{k} -\gamma (2\lambda _{k} - 1)(Ax^{k} + By^{k} - b). {}\\ \end{array}$$
We can simplify the definition of \(y^{k+1}\) by applying the expression for \(z^{k+1}\) just derived:
$$\displaystyle\begin{array}{rcl} & & y^{k+1} =\mathop{ \mathrm{arg\,min}}\limits _{ y\in \mathcal{H}\ _{2}}g(y) -\langle z^{k+1},By - b\rangle + \frac{\gamma } {2}\|By - b\|^{2} {}\\ & & \stackrel{\mathrm{(C)}}{=}\mathop{\mathrm{arg\,min}}\limits _{y\in \mathcal{H}\ _{2}}g(y) -\langle w_{d_{g}}^{k} -\gamma Ax^{k} -\gamma (2\lambda _{ k} - 1)(Ax^{k} + By^{k} - b),By - b\rangle + \frac{\gamma } {2}\|By - b\|^{2} {}\\ & & =\mathop{ \mathrm{arg\,min}}\limits _{y\in \mathcal{H}\ _{2}}g(y) -\langle w_{d_{g}}^{k},Ax^{k} + By - b\rangle + \frac{\gamma } {2}\|Ax^{k} + By - b + (2\lambda _{ k} - 1)(Ax^{k} + By^{k} - b)\|^{2}.{}\\ \end{array}$$
The result then follows from Equation (4.33) and the identities derived above, combined with the initial conditions listed in the statement of the proposition. In particular, note that the updates of \(x,y,w_{d_{f}},\) and \(w_{d_{g}}\) do not explicitly depend on z. □
Remark 2.
Proposition 9 proves that \(w_{d_{f}}^{k+1} = w_{d_{g}}^{k+1} -\gamma (Ax^{k+1} + By^{k+1} - b)\). Recall that by Equation (4.32), \(z^{k+1} - z^{k} = 2\lambda _{k}(w_{d_{f}}^{k} - w_{d_{g}}^{k})\). Therefore, it follows that
$$\displaystyle\begin{array}{rcl} z^{k+1} - z^{k}& =& -2\gamma \lambda _{ k}(Ax^{k} + By^{k} - b).{}\end{array}$$
(4.34)
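For concreteness, the following sketch (illustrative only) runs the iteration of Proposition 9 on a hypothetical instance in which f(x) = ½∥x − p∥² and g(y) = ½∥y − q∥², so that both argmin steps reduce to small linear solves; the data A, B, b, p, q and the parameters γ, λ are placeholder choices.

```python
import numpy as np

# Relaxed ADMM (Proposition 9) on an illustrative quadratic instance:
#   f(x) = 0.5*||x - p||^2,  g(y) = 0.5*||y - q||^2,  subject to A x + B y = b.
# Quadratic f and g make both argmin steps closed-form linear solves.
rng = np.random.default_rng(2)
n1, n2, m = 5, 4, 3
A = rng.standard_normal((m, n1))
B = rng.standard_normal((m, n2))
b = rng.standard_normal(m)
p = rng.standard_normal(n1)
q = rng.standard_normal(n2)
gamma, lam = 1.0, 0.5                      # lam = 1/2 recovers standard ADMM

x, y, w = np.zeros(n1), np.zeros(n2), np.zeros(m)   # w plays the role of w_{d_g}
for k in range(200):
    c = (2 * lam - 1) * (A @ x + B @ y - b)
    # y-update: argmin_y g(y) - <w, A x + B y - b> + (gamma/2)||A x + B y - b + c||^2
    y = np.linalg.solve(np.eye(n2) + gamma * B.T @ B,
                        q + B.T @ w - gamma * B.T @ (A @ x - b + c))
    # dual update for w_{d_g}
    w = w - gamma * (A @ x + B @ y - b) - gamma * c
    # x-update: argmin_x f(x) - <w, A x + B y - b> + (gamma/2)||A x + B y - b||^2
    x = np.linalg.solve(np.eye(n1) + gamma * A.T @ A,
                        p + A.T @ w - gamma * A.T @ (B @ y - b))
    w_f = w - gamma * (A @ x + B @ y - b)             # w_{d_f}^{k+1}
print(np.linalg.norm(A @ x + B @ y - b))              # primal feasibility residual
```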
3.1 Dual Feasibility Convergence Rates
We can apply the results of Section 5 to deduce convergence rates for the dual objective functions. Instead of restating those theorems, we just list the following bounds on the feasibility of the primal iterates.
Theorem 15.
Suppose that \((z^{j})_{j\geq 0}\) is generated by Algorithm 2, and let \((\lambda _{j})_{j\geq 0} \subseteq (0,1]\). Then the following convergence rates hold:
1. Ergodic convergence: The feasibility convergence rate holds:
$$\displaystyle\begin{array}{rcl} \|A\overline{x}^{k} + B\overline{y}^{k} - b\|^{2} \leq \frac{4\|z^{0} - z^{{\ast}}\|^{2}} {\gamma \varLambda _{k}^{2}}.& & {}\\ \end{array}$$
2. Nonergodic convergence: Suppose that \(\underline{\tau } =\inf _{j\geq 0}\lambda _{j}(1 -\lambda _{j}) > 0\). Then
$$\displaystyle\begin{array}{rcl} \|Ax^{k} + By^{k} - b\|^{2} \leq \frac{\|z^{0} - z^{{\ast}}\|^{2}} {4\gamma ^{2}\underline{\tau }(k + 1)}\quad \mathrm{and}\quad \|Ax^{k} + By^{k} - b\|^{2} = o\left ( \frac{1} {k + 1}\right ).& & {}\\ \end{array}$$
Proof.
Parts 1 and 2 are straightforward applications of Corollary 1 and the FPR identity: \(z^{k} - z^{k+1}\stackrel{\mbox{ (<InternalRef RefID="Equ140">4.34</InternalRef>)}}{=}2\gamma \lambda _{k}(Ax^{k} + By^{k} - b).\) □
3.2 Converting Dual Inequalities to Primal Inequalities
The ADMM algorithm generates the following five sequences of iterates:
$$\displaystyle\begin{array}{rcl} (z^{j})_{ j\geq 0},(w_{d_{f}}^{j})_{ j\geq 0},\ \text{and}\ (w_{d_{g}}^{j})_{ j\geq 0} \subseteq \mathcal{G}\quad \text{and}\quad (x^{j})_{ j\geq 0} \in \mathcal{H}\ _{1},(y^{j})_{ j\geq 0} \in \mathcal{H}\ _{2}.& & {}\\ \end{array}$$
The dual variables do not necessarily have a meaningful interpretation, so it is desirable to derive convergence rates involving the primal variables. In this section we will apply the Fenchel-Young inequality [2, Proposition 16.9] to convert the dual objective into a primal expression.
The following proposition will help us derive primal fundamental inequalities akin to Propositions 2 and 3.
Proposition 10.
Suppose that \((z^{j})_{j\geq 0}\) is generated by Algorithm 2. Let \(z^{\ast}\) be a fixed point of \(T_{\mathrm{PRS}}\) and let \(w^{{\ast}} = \mathbf{prox}_{\gamma d_{f}}(z^{{\ast}})\). Then the following identity holds:
$$\displaystyle\begin{array}{rcl} & & 4\gamma \lambda _{k}(h(x^{k},y^{k})) = -4\gamma \lambda _{ k}(d_{f}(w_{d_{f}}^{k}) + d_{ g}(w_{d_{g}}^{k}) - d_{ f}(w^{{\ast}}) - d_{ g}(w^{{\ast}})) \\ & & \phantom{4\gamma \lambda _{k}(h(x^{k},}+\left (2\left (1 - \frac{1} {2\lambda _{k}}\right )\|z^{k} - z^{k+1}\|^{2} + 2\langle z^{k} - z^{k+1},z^{k+1}\rangle \right ). {}\end{array}$$
(4.35)
Proof.
We have the following subgradient inclusions from Proposition 9: \(A^{{\ast}}w_{d_{f}}^{k} \in \partial f(x^{k})\) and \(B^{{\ast}}w_{d_{g}}^{k} \in \partial g(y^{k}).\) From the Fenchel-Young inequality [2, Proposition 16.9], we have the following expressions for \(d_{f}\) and \(d_{g}\):
$$\displaystyle\begin{array}{rcl} d_{f}(w_{d_{f}}^{k}) = \langle A^{{\ast}}w_{ d_{f}}^{k},x^{k}\rangle - f(x^{k})\quad \mathrm{and}\quad d_{ g}(w_{d_{g}}^{k}) = \langle B^{{\ast}}w_{ d_{g}}^{k},y^{k}\rangle - g(y^{k}) -\langle w_{ d_{g}}^{k},b\rangle.& & {}\\ \end{array}$$
Therefore,
$$\displaystyle\begin{array}{rcl} -d_{f}(w_{d_{f}}^{k}) - d_{ g}(w_{d_{g}}^{k}) = f(x^{k}) + g(y^{k}) -\langle Ax^{k} + By^{k} - b,w_{ d_{f}}^{k}\rangle -\langle w_{ d_{g}}^{k} - w_{ d_{f}}^{k},By^{k} - b\rangle.& & {}\\ \end{array}$$
Let us simplify this expression with an identity from Proposition 9: from \(w_{d_{f}}^{k} - w_{d_{g}}^{k} = -\gamma (Ax^{k} + By^{k} - b),\) it follows that
$$\displaystyle\begin{array}{rcl} - d_{f}(w_{d_{f}}^{k}) - d_{ g}(w_{d_{g}}^{k})& =& f(x^{k}) + g(y^{k}) + \frac{1} {\gamma } \langle w_{d_{f}}^{k} - w_{ d_{g}}^{k},w_{ d_{f}}^{k} +\gamma (By^{k} - b)\rangle.\qquad {}\end{array}$$
(4.36)
Recall that \(\gamma (By^{k} - b) = z^{k} - w_{d_{g}}^{k}\). Therefore
$$\displaystyle{w_{d_{f}}^{k}+\gamma (By^{k}-b) = z^{k}+(w_{ d_{f}}^{k}-w_{ d_{g}}^{k}) = z^{k}+ \frac{1} {2\lambda _{k}}(z^{k+1}-z^{k}) = \frac{1} {2\lambda _{k}}(2\lambda _{k}-1)(z^{k}-z^{k+1})+z^{k+1},}$$
and the inner product term can be simplified as follows:
$$\displaystyle\begin{array}{rcl} \frac{1} {\gamma } \langle w_{d_{f}}^{k} - w_{ d_{g}}^{k},w_{ d_{f}}^{k} +\gamma (By^{k} - b)\rangle & =& \frac{1} {\gamma } \langle \frac{1} {2\lambda _{k}}(z^{k+1} - z^{k}), \frac{1} {2\lambda _{k}}(2\lambda _{k} - 1)(z^{k} - z^{k+1})\rangle \\ & +& \frac{1} {\gamma } \langle \frac{1} {2\lambda _{k}}(z^{k+1} - z^{k}),z^{k+1}\rangle \\ & =& -\frac{1} {2\gamma \lambda _{k}}\left (1 - \frac{1} {2\lambda _{k}}\right )\|z^{k+1} - z^{k}\|^{2} \\ & -& \frac{1} {2\gamma \lambda _{k}}\langle z^{k} - z^{k+1},z^{k+1}\rangle. {}\end{array}$$
(4.37)
Now we derive an expression for the dual objective at a dual optimal \(w^{\ast}\). First, if \(z^{\ast}\) is a fixed point of \(T_{\mathrm{PRS}}\), then \(0 = T_{\mathrm{PRS}}(z^{{\ast}}) - z^{{\ast}} = 2(w_{d_{f}}^{{\ast}}- w_{d_{g}}^{{\ast}}) = -2\gamma (Ax^{{\ast}} + By^{{\ast}}- b)\). Thus, from Equation (4.36) with k replaced by ∗, we get
$$\displaystyle\begin{array}{rcl} - d_{f}(w^{{\ast}}) - d_{ g}(w^{{\ast}})& = f(x^{{\ast}}) + g(y^{{\ast}}) + \langle Ax^{{\ast}} + By^{{\ast}}- b,w^{{\ast}}\rangle = f(x^{{\ast}}) + g(y^{{\ast}}).\qquad \ &{}\end{array}$$
(4.38)
Therefore, Equation (4.35) follows by subtracting (4.38) from Equation (4.36), rearranging and using the identity in Equation (4.37). □
The following two propositions prove two fundamental inequalities that bound the primal objective.
Proposition 11 (ADMM Primal Upper Fundamental Inequality).
Let \(z^{\ast}\) be a fixed point of \(T_{\mathrm{PRS}}\) and let \(w^{{\ast}} = \mathbf{prox}_{\gamma d_{g}}(z^{{\ast}})\). Then for all k ≥ 0, we have the bound:
$$\displaystyle\begin{array}{rcl} 4\gamma \lambda _{k}h(x^{k},y^{k}) \leq \| z^{k} - (z^{{\ast}}- w^{{\ast}})\|^{2} -\| z^{k+1} - (z^{{\ast}}- w^{{\ast}})\|^{2} + \left (1 -\frac{1} {\lambda _{k}}\right )\|z^{k} - z^{k+1}\|^{2},& &{}\end{array}$$
(4.39)
where the objective-error function h is defined in (4.19) .
Proof.
The lower inequality in Proposition 3 applied to \(d_{f} + d_{g}\) shows that
$$\displaystyle\begin{array}{rcl} -4\gamma \lambda _{k}(d_{f}(w_{d_{f}}^{k}) + d_{ g}(w_{d_{g}}^{k}) - d_{ f}(w^{{\ast}}) - d_{ g}(w^{{\ast}}))& \leq 2\langle z^{k+1} - z^{k},z^{{\ast}}- w^{{\ast}}\rangle.& {}\\ \end{array}$$
The proof then follows from Proposition 10, and the simplification:
$$\displaystyle\begin{array}{rcl} & & 2\langle z^{k} - z^{k+1},z^{k+1} - (z^{{\ast}}- w^{{\ast}})\rangle + 2\left (1 - \frac{1} {2\lambda _{k}}\right )\|z^{k} - z^{k+1}\|^{2} {}\\ & & =\| z^{k} - (z^{{\ast}}- w^{{\ast}})\|^{2} -\| z^{k+1} - (z^{{\ast}}- w^{{\ast}})\|^{2} + \left (1 -\frac{1} {\lambda _{k}}\right )\|z^{k} - z^{k+1}\|^{2}. {}\\ \end{array}$$
□
Remark 3.
Note that Equation (4.39) is nearly identical to the upper inequality in Proposition 2, except that \(z^{\ast}- w^{\ast}\) appears in the former where \(x^{\ast}\) appears in the latter.
Proposition 12 (ADMM Primal Lower Fundamental Inequality).
Let \(z^{\ast}\) be a fixed point of \(T_{\mathrm{PRS}}\) and let \(w^{{\ast}} = \mathbf{prox}_{\gamma d_{g}}(z^{{\ast}})\). Then for all \(x \in \mathcal{H}_{1}\) and \(y \in \mathcal{H}_{2}\) we have the bound:
$$\displaystyle\begin{array}{rcl} h(x,y) \geq \langle Ax + By - b,w^{{\ast}}\rangle,& &{}\end{array}$$
(4.40)
where the objective-error function h is defined in (4.19) .
Proof.
The lower bound follows from the subgradient inequalities:
$$\displaystyle\begin{array}{rcl} f(x) - f(x^{{\ast}}) \geq \langle x - x^{{\ast}},A^{{\ast}}w^{{\ast}}\rangle \quad \text{and}\quad g(y) - g(y^{{\ast}}) \geq \langle y - y^{{\ast}},B^{{\ast}}w^{{\ast}}\rangle.& & {}\\ \end{array}$$
We sum these inequalities and use \(Ax^{\ast} + By^{\ast} = b\) to get Equation (4.40). □
Remark 4.
We use Inequality (4.40) in two special cases:
$$\displaystyle\begin{array}{rcl} h(x^{k},y^{k})& \geq & \frac{1} {\gamma } \langle w_{d_{g}}^{k} - w_{ d_{f}}^{k},w^{{\ast}}\rangle {}\\ h(\overline{x}^{k},\overline{y}^{k})& \geq & \frac{1} {\gamma } \langle \overline{w}_{d_{g}}^{k} -\overline{w}_{ d_{f}}^{k},w^{{\ast}}\rangle. {}\\ \end{array}$$
These bounds are nearly identical to the fundamental lower inequality in Proposition 3, except that \(w^{\ast}\) appears in the former where \(z^{\ast}- x^{\ast}\) appeared in the latter.
3.3 Converting Dual Convergence Rates to Primal Convergence Rates
We can use the inequalities deduced in Section 3.2 to derive convergence rates for the primal objective values. The structure of the proofs of Theorems 9 and 10 is exactly the same as in the primal convergence case in Section 5, except that we use the upper and lower inequalities derived in Section 3.2 instead of the fundamental upper and lower inequalities in Propositions 2 and 3. This amounts to replacing the terms \(z^{\ast}- x^{\ast}\) and \(x^{\ast}\) by \(w^{\ast}\) and \(z^{\ast}- w^{\ast}\), respectively, in all of the inequalities from Section 5. Thus, we omit the proofs.
D Examples
In this section, we apply relaxed PRS and relaxed ADMM to concrete problems and explicitly bound the associated objectives and FPR terms with the convergence rates we derived in the previous sections.
4.1 Feasibility Problems
Suppose that \(C_{f}\) and \(C_{g}\) are closed convex subsets of \(\mathcal{H}\), with nonempty intersection. The goal of the feasibility problem is to find a point in the intersection of \(C_{f}\) and \(C_{g}\). In this section, we present one way to model this problem using convex optimization and apply the relaxed PRS algorithm to reach the minimum.
In general, we cannot expect linear convergence of the relaxed PRS algorithm for the feasibility problem. We showed this in Theorem 6 by constructing an example for which the DRS iteration converges in norm but does so arbitrarily slowly. A similar result holds for the alternating projection (AP) algorithm [3]. Thus, in this section we focus on the convergence rate of the FPR.
Let \(\iota _{C_{f}}\) and \(\iota _{C_{g}}\) be the indicator functions of \(C_{f}\) and \(C_{g}\). Then \(x \in C_{f} \cap C_{g}\) if, and only if, \(\iota _{C_{f}}(x) +\iota _{C_{g}}(x) = 0\), and the sum is infinite otherwise. Thus, a point is in the intersection of \(C_{f}\) and \(C_{g}\) if, and only if, it is a minimizer of the following problem:
$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{x\in \mathcal{H}\ }\iota _{C_{f}}(x) +\iota _{C_{g}}(x).& & {}\\ \end{array}$$
The relaxed PRS algorithm applied to this problem, with \(f =\iota _{C_{f}}\) and \(g =\iota _{C_{g}}\), has the following form: Given \(z^{0} \in \mathcal{H}\), for all k ≥ 0, let
$$\displaystyle\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} x_{g}^{k} = P_{C_{g}}(z^{k}); \quad \\ x_{f}^{k} = P_{C_{f}}(2x_{g}^{k} - z^{k}); \quad \\ z^{k+1} = z^{k} + 2\lambda _{k}(x_{f}^{k} - x_{g}^{k}).\quad \end{array} \right.& & {}\\ \end{array}$$
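For concreteness, the following sketch (illustrative only) runs the iteration above with \(\lambda _{k} \equiv 1/2\) (i.e., DRS) for two simple sets in \(\mathbf{R}^{2}\) with closed-form projections; the choice of sets is a placeholder.

```python
import numpy as np

# Relaxed PRS (lambda_k = 1/2, i.e., DRS) for a feasibility problem in R^2
# with the illustrative sets C_g = {x : x_2 = 0} and C_f = closed ball of
# radius 1 centered at (3, 0); both projections are closed form.
center, radius = np.array([3.0, 0.0]), 1.0

def P_g(z):
    return np.array([z[0], 0.0])

def P_f(z):
    d = z - center
    nrm = np.linalg.norm(d)
    return z.copy() if nrm <= radius else center + d * (radius / nrm)

z = np.array([-2.0, 5.0])
lam = 0.5
for k in range(100):
    x_g = P_g(z)
    x_f = P_f(2 * x_g - z)
    z = z + 2 * lam * (x_f - x_g)
    if (k + 1) % 25 == 0:
        print(k + 1, np.sum((x_f - x_g) ** 2))   # squared FPR; see the bound below
```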
Because \(f =\iota _{C_{f}}\) and \(g =\iota _{C_{g}}\) only take on the values 0 and ∞, the objective value convergence rates derived earlier do not provide meaningful information, other than \(x_{f}^{k} \in C_{f}\) and \(x_{g}^{k} \in C_{g}\). However, from the FPR identity \(x_{f}^{k} - x_{g}^{k} = (1/(2\lambda _{k}))(z^{k+1} - z^{k})\), we find that after k iterations, Corollary 1 produces the bound
$$\displaystyle\begin{array}{rcl} \max \{d_{C_{g}}^{2}(x_{ f}^{k}),d_{ C_{f}}^{2}(x_{ g}^{k})\} \leq \| x_{ f}^{k} - x_{ g}^{k}\|^{2} = o\left ( \frac{1} {k + 1}\right )& & {}\\ \end{array}$$
whenever \((\lambda _{j})_{j\geq 0}\) is bounded away from 0 and 1. Theorem 5 showed that this rate is optimal. Furthermore, if we average the iterates over all k, Theorem 3 gives the improved bound
$$\displaystyle\begin{array}{rcl} \max \{d_{C_{g}}^{2}(\overline{x}_{ f}^{k}),d_{ C_{f}}^{2}(\overline{x}_{ g}^{k})\} \leq \|\overline{x}_{ f}^{k} -\overline{x}_{ g}^{k}\|^{2} = O\left ( \frac{1} {\varLambda _{k}^{2}}\right ),& & {}\\ \end{array}$$
which is optimal by Proposition 8. Note that the averaged iterates satisfy \(\overline{x}_{f}^{k} = (1/\varLambda _{k})\sum _{i=0}^{k}\lambda _{i}x_{f}^{i} \in C_{f}\) and \(\overline{x}_{g}^{k} = (1/\varLambda _{k})\sum _{i=0}^{k}\lambda _{i}x_{g}^{i} \in C_{g}\), because \(C_{f}\) and \(C_{g}\) are convex. Thus, we can state the following proposition:
Proposition 13.
After k iterations, the relaxed PRS algorithm produces a point in each set such that the two points are within distance \(O(1/\varLambda _{k})\) of each other.
4.2 Parallelized Model Fitting and Classification
The following general scenario appears in [10, Chapter 8]. Consider the following general convex model fitting problem: Let \(M: \mathbf{R}^{n} \rightarrow \mathbf{R}^{m}\) be a feature matrix, let \(b \in \mathbf{R}^{m}\) be the output vector, let \(l: \mathbf{R}^{m} \rightarrow (-\infty ,\infty ]\) be a loss function, and let \(r: \mathbf{R}^{n} \rightarrow (-\infty ,\infty ]\) be a regularization function. The model fitting problem is formulated as the following minimization:
$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{x\in \mathbf{R}^{n}}\;l(Mx - b) + r(x).& &{}\end{array}$$
(4.41)
The function l is used to enforce the constraint Mx = b +ν up to some noise ν in the measurement, while r enforces the regularity of x by incorporating prior knowledge of the form of the solution. The function r can also be used to enforce the uniqueness of the solution of Mx = b in ill-posed problems.
We can solve Equation (4.41) by a direct application of relaxed PRS and obtain \(O(1/\varLambda _{k})\) ergodic convergence and \(o\left (1/\sqrt{k + 1}\right )\) nonergodic convergence rates. Note that these rates do not require differentiability of f or g. In contrast, the FBS algorithm requires differentiability of one of the objective functions and a knowledge of the Lipschitz constant of its gradient. The advantage of FBS is the o(1∕(k + 1)) convergence rate shown in Theorem 12. However, we do not necessarily assume that l is differentiable, so we may need to compute \(\mathbf{prox}_{\gamma l(M(\cdot )-b)}\), which can be significantly more difficult than computing \(\mathbf{prox}_{\gamma l}\). Thus, in this section we separate M from l by rephrasing Equation (4.41) in the form of Problem (4.2).
In this section, we present several different ways to split Equation (4.41). Each splitting gives rise to a different algorithm and can be applied to general convex functions l and r. Our results predict convergence rates that hold for primal objectives, dual objectives, and the primal feasibility. Note that in parallelized model fitting, it is not always desirable to take the time average of all of the iterates. Indeed, when r enforces sparsity, averaging the current r-iterate with old iterates, all of which are sparse, can produce a non-sparse iterate. This will slow down vector additions and prolong convergence.
4.2.1 Auxiliary Variable
We can split Equation (4.41) by defining an auxiliary variable for My − b:
$$\displaystyle\begin{array}{rcl} & & \mathop{\mathrm{minimize}}\limits _{x\in \mathbf{R}^{m},y\in \mathbf{R}^{n}}\;l\left (x\right ) + r(y) \\ & & \text{subject to }\;My - x = b. {}\end{array}$$
(4.42)
The constraint in Equation (4.42) reduces to Ax + By = b where B = M and \(A = -I_{\mathbf{R}^{m}}\). If we set f = l and g = r and apply ADMM, the analysis of Section 3.3 shows that
$$\displaystyle\begin{array}{rcl} \vert l(x^{k}) + r(y^{k}) - l(My^{{\ast}}- b) - r(y^{{\ast}})\vert & =& o\left ( \frac{1} {\sqrt{k + 1}}\right ) {}\\ \|My^{k} - b - x^{k}\|^{2}& =& o\left ( \frac{1} {k + 1}\right ). {}\\ \end{array}$$
In particular, if l is Lipschitz, \(\vert l(x^{k}) - l(My^{k} - b)\vert = o\left (1/\sqrt{k + 1}\right )\). Thus, we have
$$\displaystyle\begin{array}{rcl} \vert l(My^{k} - b) + r(y^{k}) - l(My^{{\ast}}- b) - r(y^{{\ast}})\vert = o\left ( \frac{1} {\sqrt{k + 1}}\right ).& & {}\\ \end{array}$$
A similar analysis shows that
$$\displaystyle\begin{array}{rcl} \vert l(M\overline{y}^{k} - b) + r(\overline{y}^{k}) - l(My^{{\ast}}- b) - r(y^{{\ast}})\vert & =& O\left (\frac{1} {\varLambda _{k}}\right ) {}\\ \|M\overline{y}^{k} - b -\overline{x}^{k}\|^{2}& =& O\left ( \frac{1} {\varLambda _{k}^{2}}\right ). {}\\ \end{array}$$
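As an illustration of this splitting (not a prescription), the sketch below applies ADMM with \(\lambda _{k} \equiv 1/2\) to a hypothetical instance of (4.42) in which l(x) = ∥x∥₁ (a robust loss on the residual) and r(y) = (μ∕2)∥y∥², so that the y-step is a linear solve and the x-step is a soft-threshold; M, b, μ, γ are placeholder choices.

```python
import numpy as np

# Illustrative ADMM (lambda_k = 1/2) for the auxiliary-variable splitting (4.42)
# with the placeholder choices l(x) = ||x||_1 and r(y) = (mu/2)*||y||^2, so the
# y-step is a linear solve and the x-step is a soft-threshold.
rng = np.random.default_rng(3)
m, n = 50, 20
M = rng.standard_normal((m, n))
b = M @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)
mu, gamma = 1.0, 1.0

x, y, w = np.zeros(m), np.zeros(n), np.zeros(m)
for k in range(300):
    # y-step: argmin_y r(y) - <w, M y - x - b> + (gamma/2)||M y - x - b||^2
    y = np.linalg.solve(mu * np.eye(n) + gamma * M.T @ M,
                        M.T @ (w + gamma * (x + b)))
    # dual step for the constraint M y - x = b
    w = w - gamma * (M @ y - x - b)
    # x-step: prox of ||.||_1 at M y - b - w/gamma (threshold 1/gamma)
    v = M @ y - b - w / gamma
    x = np.sign(v) * np.maximum(np.abs(v) - 1.0 / gamma, 0.0)
print(np.linalg.norm(M @ y - x - b))      # primal feasibility residual
```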
In the next two splittings, we leave the derivation of convergence rates to the reader.
4.2.2 Splitting Across Examples
We assume that l is block separable: we have \(l(Mx - b) =\sum _{i=1}^{R}l_{i}(M_{i}x - b_{i})\) where
$$\displaystyle\begin{array}{rcl} M = \left [\begin{array}{*{10}c} M_{1}\\ \vdots \\ M_{R} \end{array} \right ]\quad \text{and}\quad b = \left [\begin{array}{*{10}c} b_{1}\\ \vdots \\ b_{R} \end{array} \right ].& & {}\\ \end{array}$$
Each \(M_{i} \in \mathbf{R}^{m_{i}\times n}\) is a submatrix of M, each \(b_{i} \in \mathbf{R}^{m_{i}}\) is a subvector of b, and \(\sum _{i=1}^{R}m_{i} = m\). Therefore, an equivalent form of Equation (4.41) is given by
$$\displaystyle\begin{array}{rcl} & & \mathop{\mathrm{minimize}}\limits _{x_{1},\cdots \,,x_{R},y\in \mathbf{R}^{n}}\;\sum _{i=1}^{R}l_{ i}(M_{i}x_{i} - b_{i}) + r(y) \\ & & \text{subject to }\;x_{r} - y = 0,\quad r = 1,\cdots \,,R. {}\end{array}$$
(4.43)
We say that Equation (4.43) is split across examples. Thus, to apply ADMM to this problem, we simply stack the vectors \(x_{i}\), i = 1,⋯ ,R, into a vector \(x = (x_{1},\cdots \,,x_{R})^{T} \in \mathbf{R}^{nR}\). Then the constraints in Equation (4.43) reduce to Ax + By = 0 where \(A = I_{\mathbf{R}^{nR}}\) and \(By = (-y,\cdots \,,-y)^{T}\).
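The resulting ADMM iteration is naturally parallel in the blocks. The sketch below (illustrative only; the blockwise least-squares losses l_i(u) = ½∥u∥² and the regularizer r(y) = μ∥y∥₁ are placeholder choices) shows the structure: each x_i-update is a local solve that could run on a separate worker, and the y-update averages the blocks.

```python
import numpy as np

# Illustrative ADMM (lambda_k = 1/2) for splitting across examples, (4.43),
# with placeholder choices l_i(u) = 0.5*||u||^2 and r(y) = mu*||y||_1.
# Each x_i-update is a local linear solve (parallelizable across data blocks);
# the y-update soft-thresholds an average of the blocks.
rng = np.random.default_rng(4)
R, n, m_i = 4, 15, 10
Ms = [rng.standard_normal((m_i, n)) for _ in range(R)]
y_true = rng.standard_normal(n) * (rng.random(n) < 0.3)
bs = [Mi @ y_true + 0.05 * rng.standard_normal(m_i) for Mi in Ms]
mu, gamma = 0.1, 1.0

xs = [np.zeros(n) for _ in range(R)]
ws = [np.zeros(n) for _ in range(R)]
y = np.zeros(n)
for k in range(300):
    # y-step: prox of r/(gamma*R) at the average of x_i - w_i/gamma
    v = np.mean([xi - wi / gamma for xi, wi in zip(xs, ws)], axis=0)
    y = np.sign(v) * np.maximum(np.abs(v) - mu / (gamma * R), 0.0)
    # dual steps for the constraints x_i - y = 0
    ws = [wi - gamma * (xi - y) for xi, wi in zip(xs, ws)]
    # local x_i-steps (independent, hence parallelizable)
    xs = [np.linalg.solve(Mi.T @ Mi + gamma * np.eye(n), Mi.T @ bi + gamma * y + wi)
          for Mi, bi, wi in zip(Ms, bs, ws)]
print(max(np.linalg.norm(xi - y) for xi in xs))   # consensus residual
```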
4.2.3 Splitting Across Features
We can also split Equation (4.41) across features, whenever r is block separable in x, in the sense that there exists C > 0 such that \(r =\sum _{i=1}^{C}r_{i}(x_{i})\), and \(x_{i} \in \mathbf{R}^{n_{i}}\) where \(\sum _{i=1}^{C}n_{i} = n\). This splitting corresponds to partitioning the columns of M, i.e., \(M = \left [\begin{array}{*{10}c} M_{1},\cdots \,,M_{C} \end{array} \right ],\) and \(M_{i} \in \mathbf{R}^{m\times n_{i}}\), for all i = 1,⋯ ,C. For all \(y \in \mathbf{R}^{n}\), \(My =\sum _{i=1}^{C}M_{i}y_{i}\). With this notation, we can derive an equivalent form of Equation (4.41) given by
$$\displaystyle\begin{array}{rcl} & & \mathop{\mathrm{minimize}}\limits _{x,y\in \mathbf{R}^{n}}\;l\left (\sum _{i=1}^{C}x_{ i} - b\right ) +\sum _{ i=1}^{C}r_{ i}(y_{i}) \\ & & \text{subject to }\;x_{i} - M_{i}y_{i} = 0,\quad i = 1,\cdots \,,C. {}\end{array}$$
(4.44)
The constraint in Equation (4.44) reduces to Ax + By = 0 where \(A = I_{\mathbf{R}^{mC}}\) and \(By = -(M_{1}y_{1},\cdots \,,M_{C}y_{C})^{T} \in \mathbf{R}^{mC}\).
4.3 Distributed ADMM
In this section our goal is to use Algorithm 2 for
$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{x\in \mathcal{H}\ }\sum _{i=1}^{m}f_{ i}(x)& & {}\\ \end{array}$$
by using the splitting in [49]. Note that we could minimize this function by reformulating it in the product space \(\mathcal{H}\ ^{m}\) as follows:
$$\displaystyle\begin{array}{rcl} \mathop{\mathrm{minimize}}\limits _{\mathbf{x}\in \mathcal{H}\,^{m}}\sum _{i=1}^{m}f_{ i}(x_{i}) +\iota _{D}(\mathbf{x}),& & {}\\ \end{array}$$
where \(D =\{ (x,\cdots \,,x) \in \mathcal{H}\ ^{m}\mid x \in \mathcal{H}\ \}\) is the diagonal set. Applying relaxed PRS to this problem results in a parallel algorithm where each function performs a local minimization step and then communicates its local variable to a central processor. In this section, we assign each function a local variable but we never communicate it to a central processor. Instead, each function only communicates with neighbors.
Formally, we assume there is a simple, connected, undirected graph G = (V,E) on |V| = m vertices with edges E that describe a connection among the different functions. We introduce a variable \(x_{i} \in \mathcal{H}\) for each function \(f_{i}\), and, hence, we set \(\mathcal{H}_{1} = \mathcal{H}^{m}\) (see Section 8). We can encode the constraint that each node communicates with neighbors by introducing an auxiliary variable for each edge in the graph:
$$\displaystyle\begin{array}{rcl} & & \mathop{\mathrm{minimize}}\limits _{\mathbf{x}\in \mathcal{H}\ ^{m},\mathbf{y}\in \mathcal{H}\ ^{\vert E\vert }}\;\sum _{i=1}^{m}f_{ i}(x_{i}) \\ & & \text{subject to }\;x_{i} = y_{ij},x_{j} = y_{ij},\text{ for all }(i,j) \in E.{}\end{array}$$
(4.45)
The linear constraints in Equation (4.45) can be written in the form \(A\mathbf{x} + B\mathbf{y} = 0\) for appropriate linear operators A and B. Thus, we reformulate Equation (4.45) as
$$\displaystyle\begin{array}{rcl} & & \mathop{\mathrm{minimize}}\limits _{\mathbf{x}\in \mathcal{H}\ ^{m},\mathbf{y}\in \mathcal{H}\ ^{\vert E\vert }}\;\sum _{i=1}^{m}f_{ i}(x_{i}) + g(\mathbf{y}) \\ & & \text{subject to }\;A\mathbf{x} + B\mathbf{y} = 0, {}\end{array}$$
(4.46)
where \(g: \mathcal{H}\ ^{\vert E\vert }\rightarrow \mathbf{R}\) is the zero map.
Because we only care about finding the value of the variable \(\mathbf{x} \in \mathcal{H}^{m}\), the following simplification can be made to the sequences generated by ADMM applied to Equation (4.46) with \(\lambda _{k} = 1/2\) for all k ≥ 1 [51]: Let \(\mathcal{N}_{i}\) denote the set of neighbors of i ∈ V and set \(x_{i}^{0} =\alpha _{i}^{0} = 0\) for all i ∈ V. Then for all k ≥ 0,
$$\displaystyle\begin{array}{rcl} \left \{\begin{array}{@{}l@{\quad }l@{}} x_{i}^{k+1} =\mathop{ \mathrm{arg\,min}}\limits _{ x_{i}\in \mathcal{H}}f_{i}(x_{i}) + \frac{\gamma \vert \mathcal{N}_{i}\vert } {2} \|x_{i} - x_{i}^{k} - \frac{1} {\vert \mathcal{N}_{i}\vert }\sum _{j\in \mathcal{N}_{i}}x_{j}^{k} + \frac{1} {\gamma \vert \mathcal{N}_{i}\vert }\alpha _{i}^{k}\|^{2} + \frac{\gamma \vert \mathcal{N}_{i}\vert } {2} \|x_{i}\|^{2}\quad \\ \alpha _{i}^{k+1} =\alpha _{ i}^{k} +\gamma \left (\vert \mathcal{N}_{i}\vert x_{i}^{k+1} -\sum _{j\in \mathcal{N}_{i}}x_{j}^{k+1}\right ). \quad \end{array} \right.& & {}\\ \end{array}$$
The above iteration is truly distributed because each node i ∈ V only requires information from its local neighbors at each iteration.
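To illustrate the update, the following sketch (a toy instance with hypothetical data) runs the displayed iteration on a 5-node cycle graph with scalar quadratics \(f_{i}(x) = (1/2)(x -\theta _{i})^{2}\), for which each local argmin is closed form; the iterates should approach consensus at the average of the θ_i.

```python
import numpy as np

# Toy run of the displayed decentralized ADMM update on a 5-node cycle graph
# with the placeholder quadratics f_i(x) = 0.5*(x - theta_i)^2, so each local
# argmin is available in closed form.
m = 5
neighbors = {i: [(i - 1) % m, (i + 1) % m] for i in range(m)}   # cycle graph
theta = np.array([1.0, 2.0, -1.0, 4.0, 0.5])
gamma = 0.5
x = np.zeros(m)
alpha = np.zeros(m)
for k in range(300):
    x_new = np.empty(m)
    for i in range(m):
        d = len(neighbors[i])
        c = x[i] + sum(x[j] for j in neighbors[i]) / d - alpha[i] / (gamma * d)
        # argmin of 0.5*(x - theta_i)^2 + (gamma*d/2)*(x - c)^2 + (gamma*d/2)*x^2
        x_new[i] = (theta[i] + gamma * d * c) / (1.0 + 2.0 * gamma * d)
    alpha = alpha + gamma * np.array(
        [len(neighbors[i]) * x_new[i] - sum(x_new[j] for j in neighbors[i])
         for i in range(m)])
    x = x_new
print(x, theta.mean())   # the x_i should be close to the average of theta
```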
In [51], linear convergence is shown for this algorithm provided that \(f_{i}\) are strongly convex and \(\nabla f_{i}\) are Lipschitz. For general convex functions, we can deduce the nonergodic rates from Theorem 10
$$\displaystyle\begin{array}{rcl} \left \vert \sum _{i=1}^{m}f_{ i}(x_{i}^{k}) - f(x^{{\ast}})\right \vert & =& o\left ( \frac{1} {\sqrt{k + 1}}\right ) {}\\ \sum _{\begin{array}{c}i\in V \\ j\in N_{i}\end{array}}\|x_{i}^{k} - z_{ ij}^{k}\|^{2} +\sum _{ \begin{array}{c}i\in V\\ i\in N_{j}\end{array}}\|x_{j}^{k} - z_{ ij}^{k}\|^{2}& =& o\left ( \frac{1} {k + 1}\right ), {}\\ \end{array}$$
and the ergodic rates from Theorem 9
$$\displaystyle\begin{array}{rcl} \left \vert \sum _{i=1}^{m}f_{ i}(\overline{x}_{i}^{k}) - f(x^{{\ast}})\right \vert & =& O\left ( \frac{1} {k + 1}\right ) {}\\ \sum _{\begin{array}{c}i\in V \\ j\in N_{i}\end{array}}\|\overline{x}_{i}^{k} -\overline{z}_{ ij}^{k}\|^{2} +\sum _{ \begin{array}{c}i\in V\\ i\in N_{j}\end{array}}\|\overline{x}_{j}^{k} -\overline{z}_{ ij}^{k}\|^{2}& =& O\left ( \frac{1} {(k + 1)^{2}}\right ). {}\\ \end{array}$$
These convergence rates are new and complement the linear convergence results in [51]. In addition, they complement the similar ergodic rate derived in [54] for a different distributed splitting.