On the Global and Linear Convergence of the Generalized Alternating Direction Method of Multipliers
Keywords: Alternating direction method of multipliers · Global convergence · Linear convergence · Strong convexity · Distributed computing

1 Introduction
The constraints \(x\in {\mathcal {X}}\) and \(y\in {\mathcal {Y}}\), where \({\mathcal {X}}\subseteq \mathbb {R}^n\) and \({\mathcal {Y}}\subseteq \mathbb {R}^m\) are closed convex sets, can be included as the (extended-value) indicator functions \(I_{{\mathcal {X}}}(x)\) and \(I_{{\mathcal {Y}}}(y)\) in the objective functions f and g. Here the indicator function of a convex set \({\mathcal {C}}\) returns 0 if the input lies in \({\mathcal {C}}\) and \(\infty \) otherwise.
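To make the indicator-function device concrete, here is a minimal Python sketch (the function name and the box constraint are our illustrative choices, not from the paper): adding such a function to an objective enforces the constraint without changing values on the feasible set.

```python
import numpy as np

def indicator_box(x, lo, hi):
    """Extended-value indicator of the box {x : lo <= x <= hi}.

    Returns 0.0 if x lies in the set and +inf otherwise, so adding it to an
    objective rules out infeasible points without altering feasible ones.
    """
    return 0.0 if np.all((x >= lo) & (x <= hi)) else np.inf

print(indicator_box(np.array([0.5, 0.2]), 0.0, 1.0))  # 0.0
print(indicator_box(np.array([1.5, 0.2]), 0.0, 1.0))  # inf
```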
The main goal of this paper is to show that the ADM applied to (1.1) has global linear convergence under a variety of different conditions, in particular, when one of the two objective functions is strongly convex,^{1} along with other regularity and rank conditions. The convergence analysis is performed under a general framework that allows the ADM subproblems to be solved inexactly, and hence faster.
Compared to Algorithm 1, Algorithm 2 adds \(\frac{1}{2}\Vert y-y^k\Vert _{Q}^2\) and \(\frac{1}{2}\Vert x-x^k\Vert _{P}^2\) to the y- and x-subproblems, respectively, and assigns \(\gamma \) as the step size for the update of \({\uplambda }\). We abuse the notation \(\Vert x\Vert _M^2:=x^T M x\), as we allow M to be any symmetric, possibly indefinite, matrix. Different choices of P and Q are reviewed in the next subsection. They can make steps 4 and 5 of Algorithm 2 much easier.
We do not fix \(\gamma =1\) as in most of the ADM literature since \(\gamma \) plays a key role. For example, when \(P=\mathbf {0}\) and \(Q=\mathbf {0}\), any \(\gamma \in (0,(\sqrt{5}+1)/2)\) guarantees the convergence of Algorithm 2 [15], and \(\gamma =1.618\) tends to make the algorithm faster than \(\gamma =1\). The range of \(\gamma \) depends on P and Q, as well as \(\beta \). When P is indefinite (which can give much simpler subproblems), \(\gamma \) must be smaller than 1, or the iteration may diverge.
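The generalized iteration can be sketched numerically. Below is a minimal Python illustration for the case \(P=Q=0\) with quadratic f and g, using a Gauss-Seidel update order (y-subproblem first, then x, then the multiplier with step size \(\gamma \beta \)); the problem data and all parameter values are our own illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 4, 3, 2
A = rng.standard_normal((p, n))
B = rng.standard_normal((p, m))
b = rng.standard_normal(p)
pvec = rng.standard_normal(n)   # f(x) = 0.5 * ||x - pvec||^2  (strongly convex)
qvec = rng.standard_normal(m)   # g(y) = 0.5 * ||y - qvec||^2  (strongly convex)

beta, gamma = 1.0, 1.5          # gamma < (1 + sqrt(5))/2, as required when P = Q = 0
x, y, lam = np.zeros(n), np.zeros(m), np.zeros(p)
for _ in range(5000):
    # y-subproblem (Q = 0): minimize g(y) - <lam, By> + (beta/2)||A x + B y - b||^2
    y = np.linalg.solve(np.eye(m) + beta * B.T @ B,
                        qvec + B.T @ lam + beta * B.T @ (b - A @ x))
    # x-subproblem (P = 0), using the fresh y
    x = np.linalg.solve(np.eye(n) + beta * A.T @ A,
                        pvec + A.T @ lam + beta * A.T @ (b - B @ y))
    # multiplier update with step size gamma * beta
    lam = lam - gamma * beta * (A @ x + B @ y - b)

print(np.linalg.norm(A @ x + B @ y - b))  # primal residual, ~0 at convergence
```

Since both objectives are strongly convex quadratics, the subproblems are linear solves and the iterates converge linearly to a feasible KKT point.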
Let us review two works related to Algorithm 2. The work [23] considers (1.2), where the quadratic penalty term is generalized to \(\Vert Ax+By-b\Vert _{H_k}^2\) for a sequence of bounded positive definite matrices \(\{H_k\}\), and proves the convergence of Algorithm 2 restricted to \(\gamma =1\) and differentiable functions f and g. The work [37] replaces \(\gamma \) by a general positive definite matrix C and establishes convergence assuming that \(A=I\) and the smallest eigenvalue of C is no greater than 1, which corresponds to \(\gamma \le 1\) when \(C=\gamma I\). Neither work gives a rate of convergence.
1.1 Generalized ADM with Simplified Subproblems
By “simplified subproblems”, we mean that the ADM subproblems in Algorithm 1 are replaced by subproblems that are easier to solve or have closed-form solutions. The modified ADM shall still converge to the exact solution.
Let us give a few examples of matrix P in step 5 of Algorithm 2. These examples also apply to Q in step 4, which can be different from P.
Let us describe an image processing application that will benefit from properly applying (1.9). Consider the total variation regularization problem: \(\min _u \Vert {\nabla }u \Vert _1 + \frac{1}{2}\Vert Tu - b\Vert ^2,\) where u is an image and T is a sensing operator that is either a blurring operator or a downsampled Fourier operator. The latter operator arises, for example, in MRI. The problem can be reformulated to the form of (1.1) as \(\min _{u,v} \Vert v\Vert _1 + \frac{1}{2}\Vert Tu-b\Vert ^2,\quad \text{ subject } \text{ to }~ v-\nabla u = 0.\) The v-subproblem is closed-form soft-thresholding. The u-subproblem has the form: \(\min _u~\frac{1}{2}\Vert Tu - b\Vert ^2 + \frac{1}{2}\Vert {\nabla }u - \text{ constant }\Vert ^2\), which is quadratic. If the operator \(\nabla \) satisfies the periodic boundary condition, then the normal equation of this quadratic subproblem can be diagonalized by the Fourier transform and thus has a closed-form solution; see [34]. However, since few images have a periodic boundary, \(\nabla \) generally does not satisfy the periodic boundary condition. A resolution is to introduce the operator \(\nabla _{\text {periodic}}\) that satisfies the condition and the indefinite matrix \(P=\beta (\nabla _{\text {periodic}}^T \nabla _{\text {periodic}}-\nabla ^T \nabla )\). Then, the u-subproblem will have a closed-form solution. Note that in this approach, the ADM still converges to the exact solution.
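The FFT diagonalization mentioned above can be sketched as follows. For simplicity we work in 1-D with a periodic forward-difference operator and take \(T=I\) (our simplification; the paper's T is a blurring or downsampled Fourier operator), and we verify the FFT solve of the normal equations against a dense solve.

```python
import numpy as np

n = 64
rng = np.random.default_rng(1)
b = rng.standard_normal(n)        # data term (T = I for this illustration)
c = rng.standard_normal(n)        # the "constant" coming from the v-update

# Periodic forward-difference operator D (a circulant matrix), built
# explicitly only so we can check the FFT solve against a dense solve.
D = -np.eye(n) + np.roll(np.eye(n), -1, axis=1)   # (Du)_i = u_{i+1 mod n} - u_i

# Normal equations of the u-subproblem: (I + D^T D) u = b + D^T c
rhs = b + D.T @ c
u_dense = np.linalg.solve(np.eye(n) + D.T @ D, rhs)

# A circulant matrix is diagonalized by the DFT: its eigenvalues are the
# FFT of its first column, so (I + D^T D) has eigenvalues 1 + |d_hat|^2.
d_hat = np.fft.fft(D[:, 0])
u_fft = np.real(np.fft.ifft(np.fft.fft(rhs) / (1.0 + np.abs(d_hat) ** 2)))

print(np.allclose(u_dense, u_fft))  # True
```

In practice the dense matrix is never formed; the FFT solve costs \(O(n\log n)\) instead of \(O(n^3)\).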
Goals of P and Q The general goal is to wisely choose P and Q so that the subproblems of Algorithm 2 become much easier to carry out and the entire algorithm runs in less time. In the ADM, the two subproblems can be solved in either order (but fixed throughout the iterations; see [35] for a counterexample). However, when one subproblem is solved less exactly than the other, Algorithm 2 tends to run faster if the less exact subproblem is solved later—assigned as step 5 of Algorithm 2—because at each iteration, the ADM updates the variables in the Gauss-Seidel fashion. If the less exact subproblem is solved first, its relatively inaccurate solution will then affect the more exact subproblem, making its solution also inaccurate. Since the less exact subproblem should be assigned as the later step 5, more choices of P are needed than of Q, which is the case in this paper.
1.2 Summary of Results
The conclusions in Table 2 are the quantities that converge either Q-linearly or R-linearly.^{2} The Q-linearly convergent quantities are joint vectors of multiple variables, whereas the R-linearly convergent quantities are the individual variables \(x^k\), \(y^k\), and \({\uplambda }^k\).
Table 1 Four scenarios leading to linear convergence
Scenario | Strongly convex | Lipschitz continuous | Full row rank | Additional assumptions
---|---|---|---|---
1 | f | \(\nabla f\) | A | If \(Q\succ 0\), B has full column rank
2 | f, g | \(\nabla f\) | A | |
3 | f | \(\nabla f,\nabla g\) | – | B has full column rank
4 | f, g | \(\nabla f,\nabla g\) | – | |
Table 2 Summary of linear convergence results
Case | \(P,\hat{P}\) | Q | Q-linear convergence
---|---|---|---
1 | \(P=0\) | \(Q=0\) | \((Ax^k,{\uplambda }^k)\)
2 | \(\hat{P}\succ 0\) | \(Q=0\) | \((x^k,{\uplambda }^k)\)
3 | \(P=0\) | \(Q\succ 0\) | \((Ax^k,y^k,{\uplambda }^k)\)
4 | \(\hat{P}\succ 0\) | \(Q\succ 0\) | \((x^k,y^k,{\uplambda }^k)\)

In all four cases, under any scenario 1–4, the quantities \(x^k\), (\(y^k\) or \(By^k\))\(^*\), and \({\uplambda }^k\) converge R-linearly.
Scenario 2 adds the strong convexity assumption on g. As a result, the remark in case 1 regarding the full column rank of B is no longer needed.
Both scenarios 3 and 4 assume that g is differentiable and \(\nabla g\) is Lipschitz continuous. As a result, the error of \({\uplambda }^k\) can be controlled by taking advantage of the Lipschitz continuity of both \(\nabla f\) and \(\nabla g\), and the full row rank assumption on A is no longer needed. On the other hand, scenarios 3 and 4 exclude problems with non-differentiable g. Compared to scenario 3, scenario 4 adds the strong convexity assumption on g and drops the remark on the full column rank of B.
Under scenario 1 with \(Q\succ 0\) and scenario 3, the remarks in Table 1 are needed essentially because \(y^k\) gets coupled with \(x^k\) and \({\uplambda }^k\) in certain inequalities in our convergence analysis. The full column rank of B helps bound the error of \(y^k\) by those of \(x^k\) and \({\uplambda }^k\).
Four cases When \(P=0\) (which corresponds to exactly solving the ADM x-subproblem), we have \(\hat{P}\succeq 0\) and only obtain linear convergence in Ax. However, when \(\hat{P}\succ 0\), linear convergence in x is obtained. When \(Q=0\) (which corresponds to exactly solving the ADM y-subproblem), y is not part of the Q-linearly convergent joint variable; but when \(Q\succ 0\), y becomes part of it.
1.3 Existing Rate-of-Convergence Results
Although there is extensive literature on the ADM and its applications, until recently there were very few results on its rate of convergence. The work [17] shows that, for a Jacobi version of the ADM applied to smooth functions with Lipschitz continuous gradients, the objective value descends at the rate \(O(1/k)\), and that of an accelerated version descends at \(O(1/k^2)\). Then, the work [18] establishes the same rates for a Gauss-Seidel version and requires only one of the two objective functions to be smooth with Lipschitz continuous gradient. These two works only consider the model with the linear constraint coefficient matrices A and B being the identity or negative identity matrix. Later, He and Yuan [25] show an \(O(1/k)\) rate in an ergodic sense based on a variational inequality characterization. The work [24] shows that \(\Vert u^k-u^{k+1}\Vert ^2\), where \(u^k:=(x^k,y^k,{\uplambda }^k)\), of the ADM converges at \(O(1/k)\). The work [21] proves that the dual objective value of a modification to the ADM descends at \(O(1/k^2)\) under the assumption that the objective functions are strongly convex (one of them being quadratic) and both subproblems are solved exactly. The recent works [6, 7] obtain sublinear and linear rates of the Douglas-Rachford splitting method (DRSM) in a variety of senses, including the fixed-point residual and the objective error, and extend their rates to the ADM. In this paper, we show the linear rate of convergence \(O(1/c^k)\) for some \(c>1\) under a variety of scenarios in which at least one of the two objective functions is strongly convex and has Lipschitz continuous gradient. This rate is stronger than sublinear rates such as \(O(1/k)\) and \(O(1/k^2)\), and it is given in terms of the solution error, which is stronger than rates given in terms of the objective error. On the other hand, [6, 17, 18, 24, 25] do not require any strong convexity.
The fact that a wide range of applications give rise to model (1.1) with at least one strongly convex function has motivated this work.
There are many regularization problems, such as the LASSO model \(\min \Vert x\Vert _1+\frac{\mu }{2}\Vert Ax-b\Vert ^2\) and the total variation model \(\min \Vert \nabla x\Vert _1+\frac{\mu }{2}\Vert Ax-b\Vert ^2\), where neither objective function is strongly convex unless the matrix A has full column rank. However, one can combine our results with those in the recent papers [28, 35] to establish eventual linear convergence on the optimal manifold. Specifically, [35] shows that applying the ADM is equivalent to applying the DRSM to the same problem. Furthermore, [28] establishes that the DRSM identifies the optimal manifold in a finite number of iterations if the objective functions are partially smooth along the optimal manifold near the solution. The partial-smoothness condition holds for regularization functions such as \(\ell _1\), \(\ell _2\), and total variation, and it also holds trivially for smooth functions. Once the optimal manifold is identified, many problems become strongly convex on it. We do not pursue this direction in this paper.
The recent work [26] proves the linear convergence of the ADM via a different approach. The linear convergence in [26] requires that the objective function take a certain form involving a strongly convex function and that the step size for updating the multipliers be sufficiently small (which is impractical), while no explicit linear rate is given. Its recent update additionally assumes a bounded sequence. On the other hand, it allows more than two blocks of separable variables, and it does not require strict convexity; instead, it requires the objective function to include f(Ex), where f is strongly convex and E is a possibly rank-deficient matrix.
It is worth mentioning that the ADM applied to linear programming is known to converge at a global linear rate [10]. For quadratic programming, work [1] presents an analysis leading to a conjecture that the ADM should converge linearly near the optimal solution. Our analysis in this paper is different from those in [1, 10].
The linear convergence of the ADM was also established in the context of the DRSM [9] and the proximal point algorithm (PPA) [32] under certain conditions. It has been shown that the ADM is a special case of the DRSM applied to the Lagrange dual [13] and also of the DRSM applied to the original problem [35]. Further, the DRSM is a special case of the PPA [11]. Therefore, the linear convergence of the ADM can be obtained from the existing linear convergence results of the DRSM and PPA [29, 32] under the conditions therein. However, it is unclear whether those results apply to the generalizations of the ADM in Sect. 1.1. In addition, in Sect. 3.3 below, we review the result in [29] and show that our analysis covers significantly more cases and, in the overlapping case, yields a better linear rate.
1.4 The Penalty Parameter \(\beta \)
It is well known that the penalty parameter \(\beta \) can significantly affect the speed of the ADM. Since the rate of convergence developed in this paper is a function of \(\beta \), the rate can be optimized over \(\beta \). We give some examples in Sect. 3.2 below, which show that the rate of convergence is positively related to the strong convexity constants of f and g, and negatively related to the Lipschitz constants of \(\nabla f\) and \(\nabla g\) as well as the condition numbers of A, B, and [A, B]. More analysis and numerical simulations are left as future research.
1.5 Preliminary, Notation, and Assumptions
We let \(\langle \cdot ,\cdot \rangle \) denote the standard inner product, and let \(\Vert \cdot \Vert \) denote the \(\ell _2\)-norm \(\Vert \cdot \Vert _2\) (the Euclidean norm of a vector or the spectral norm of a matrix). In addition, we use \({\uplambda }_{\min }(M)\) and \({\uplambda }_{\max }(M)\) for the smallest and largest eigenvalues of a symmetric matrix M, respectively.
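As a small numerical sketch of this notation (the matrix is our toy example): for a symmetric matrix, \({\uplambda }_{\min }\) and \({\uplambda }_{\max }\) can be read off an eigenvalue decomposition, and the spectral norm equals the largest absolute eigenvalue.

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])                 # a symmetric matrix
eigvals = np.linalg.eigvalsh(M)            # eigenvalues in ascending order
lam_min, lam_max = eigvals[0], eigvals[-1] # here lam_min = 1, lam_max = 3

# For a symmetric matrix, the spectral norm ||M|| (the matrix 2-norm)
# equals the largest absolute eigenvalue.
print(np.isclose(np.linalg.norm(M, 2), max(abs(lam_min), abs(lam_max))))  # True
```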
Assumption 1
When Assumption 1 fails to hold, the ADM has subproblems that are either unsolvable or unbounded, or the sequence \({\uplambda }^k\) diverges.
Assumption 2
Functions f and g are convex.
1.6 Organization
The rest of the paper is organized as follows. Section 2 shows the global convergence of the generalized ADM. Then Sect. 3, under the assumptions in Table 1, further proves the global linear convergence. Section 4 discusses several interesting applications that are covered by our linear convergence theory. In Sect. 5, we present some preliminary numerical results to demonstrate the linear convergence behavior of ADM. Finally, Sect. 6 concludes the paper.
2 Global Convergence
In this section, we show the global convergence of Algorithm 2. The proof steps are similar to the existing ADM convergence theory in [23, 37] but are adapted to Algorithm 2. Several inequalities in the section are used in the linear convergence analysis in the next section.
2.1 Convergence Analysis
Lemma 2.1
- i)
  $$A^T\hat{{\uplambda }}+P\left( x^k-x^{k+1}\right) \in \partial f(x^{k+1}), \quad (2.6)$$
  $$B^T\left( \hat{{\uplambda }}-\beta A(x^{k}-x^{k+1})\right) +Q(y^k-y^{k+1})\in \partial g(y^{k+1}). \quad (2.7)$$
- ii)
  $$\langle x^{k+1}-x^*, ~A^T(\hat{{\uplambda }}-{\uplambda }^*)+P\left( x^k-x^{k+1}\right) \rangle \ge \nu _f\Vert x^{k+1}-x^*\Vert ^2, \quad (2.8)$$
  $$\langle y^{k+1}-y^{*},~B^T\left( \hat{{\uplambda }}-{\uplambda }^*-\beta A(x^{k}-x^{k+1})\right) +Q\left( y^k-y^{k+1}\right) \rangle \ge \nu _g\Vert y^{k+1}-y^*\Vert ^2. \quad (2.9)$$
- iii)
  $$A\left( x^{k+1}-x^*\right) +B\left( y^{k+1}-y^*\right) =\frac{1}{\beta }\left( {\uplambda }^k-\hat{{\uplambda }}\right) . \quad (2.10)$$
- iv)
  $$\Vert u^k-u^*\Vert _G^2-\Vert u^{k+1}-u^*\Vert _G^2\ge h(u^k-\hat{u})+2\nu _f\Vert x^{k+1}-x^{*}\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2, \quad (2.11)$$
  where
  $$h(u^k-\hat{u}):=\Vert x^k-x^{k+1}\Vert _{\hat{P}}^2+\Vert y^k-y^{k+1}\Vert _Q^2+\frac{2-\gamma }{\beta }\Vert {\uplambda }^k-\hat{{\uplambda }}\Vert ^2+2\left( {\uplambda }^k-\hat{{\uplambda }}\right) ^TA\left( x^k-x^{k+1}\right) . \quad (2.12)$$
The proof of this lemma is given in “Appendix 1”. In the next theorem, we show that \(\Vert u^k-u^*\Vert _G\) has sufficient descent. Technically, this is done by bounding \(h(u^k-\hat{u})\) from below, applying the Cauchy–Schwarz inequality to its cross term \(2({\uplambda }^k-\hat{{\uplambda }})^TA(x^k-x^{k+1})\). If \(P=\mathbf {0}\), a more refined bound is obtained, giving \(\gamma \) a wider range of convergence. See “Appendix 2” for the details of the proof.
Theorem 2.1
- i) When \(P\not =\mathbf {0}\), if \(\gamma \) obeys
  $$(2-\gamma )P\succ (\gamma -1)\beta A^TA \quad (2.13)$$
  (see Remark 2.1 below for simplification), then there exists \(\eta >0\) such that
  $$\Vert u^k-u^*\Vert _G^2-\Vert u^{k+1}-u^*\Vert _G^2\ge \eta \Vert u^k-u^{k+1}\Vert _G^2+2\nu _f\Vert x^{k+1}-x^{*}\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2. \quad (2.14)$$
- ii) When \(P=\mathbf {0}\), if
  $$\gamma \in \left( 0,\frac{1+\sqrt{5}}{2}\right) , \quad (2.15)$$
  then there exists \(\eta >0\) such that
  $$\left( \Vert u^k-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert r^k\Vert ^2\right) - \left( \Vert u^{k+1}-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert r^{k+1}\Vert ^2\right) \ge \eta \Vert u^k-u^{k+1}\Vert _G^2+2\nu _f\Vert x^k-x^{k+1}\Vert ^2+2\nu _f\Vert x^{k+1}-x^*\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2, \quad (2.16)$$
  where \(r^k\) is the residual at iteration k:
  $$r^k:=Ax^k+By^k - b.$$
  If we set \(\gamma =1\), then we have
  $$\Vert u^k-u^*\Vert _G^2 - \Vert u^{k+1}-u^*\Vert _G^2\ge \Vert u^k-u^{k+1}\Vert _G^2+2\nu _f\Vert x^k-x^{k+1}\Vert ^2+2\nu _f\Vert x^{k+1}-x^*\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2. \quad (2.17)$$
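Condition (2.13) is a matrix inequality that can be checked numerically: it holds if and only if the smallest eigenvalue of \((2-\gamma )P-(\gamma -1)\beta A^TA\) is positive. A small sketch (the matrices and parameter values are our own illustrative choices):

```python
import numpy as np

def condition_2_13_holds(P, A, beta, gamma):
    """Check (2 - gamma) P > (gamma - 1) beta A^T A in the positive definite sense."""
    M = (2.0 - gamma) * P - (gamma - 1.0) * beta * A.T @ A
    return np.linalg.eigvalsh(M)[0] > 0.0   # smallest eigenvalue must be positive

A = np.array([[1.0, 0.0], [0.0, 2.0]])
P = np.eye(2)

# With P > 0 and gamma <= 1 the right-hand side is negative semidefinite,
# so the condition holds (cf. Remark 2.1 below).
print(condition_2_13_holds(P, A, beta=1.0, gamma=1.0))   # True
# For gamma close to 2 the A^T A term dominates and the condition fails.
print(condition_2_13_holds(P, A, beta=1.0, gamma=1.9))   # False
```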
Now the sufficient descent of \(\Vert u^k-u^*\Vert _G\) in Theorem 2.1 is used to yield the global convergence of Algorithm 2.
Theorem 2.2
- (a)
\({\uplambda }^k\rightarrow {{\uplambda }^*}\), regardless of the choice of P and Q;
- (b)
when \(P\not =\mathbf {0}\), \(x^k\rightarrow {x^*}\); otherwise, \(Ax^k\rightarrow A{x^*}\);
- (c)
when \(Q\succ \mathbf {0}\), \(y^k\rightarrow {y^*}\); when \(Q=\mathbf {0}\), \(By^k\rightarrow B{y^*}\).
Proof
Being bounded, \(\{u^k\}\) has a converging subsequence \(\{u^{k_j}\}\). Let \(\bar{u}=\lim _{j\rightarrow \infty }u^{k_j}\). Next, we will show \(\bar{u}\) is a KKT point. Let \(u^*\) denote an arbitrary KKT point.
when \(P=\mathbf {0}\), \(A(x^{k}-x^{k+1})\rightarrow \mathbf {0}\);
when \(P\not =\mathbf {0}\), the condition (2.13) guarantees \(\hat{P}\succ \mathbf {0}\) and thus \(x^k-x^{k+1}\rightarrow 0\);
since \(Q\succeq \mathbf {0}\), we obtain \(Q(y^k-y^{k+1})\rightarrow 0\).
Since \(\bar{u}\) is a KKT point, we can now let \(u^*=\bar{u}\). From \(u^{k_j}\rightarrow \bar{u}\) in j and the convergence of \(\Vert u^k-u^*\Vert ^2_G\) it follows \(\Vert u^k-u^*\Vert ^2_G\rightarrow 0\) in k.
- (a)
\({\uplambda }^k\rightarrow {{\uplambda }^*}\), regardless of the choice of P and Q;
- (b)
when \(P\not =\mathbf {0}\), condition (2.13) guarantees \(\hat{P}\succ \mathbf {0}\) and thus \(x^k\rightarrow {x^*}\); when \(P=\mathbf {0}\), \(Ax^k\rightarrow A{x^*}\);
- (c)
when \(Q\succ \mathbf {0}\), \(y^k\rightarrow {y^*}\); when \(Q=\mathbf {0}\), \(By^k\rightarrow B{y^*}\) following from (2.18) and (2.19).
Remark 2.1
Let us discuss the conditions on \(\gamma \). If \(P\succ 0\), the condition (2.13) is always satisfied for \(0<\gamma \le 1\). Moreover, in this case, \(\gamma \) can be greater than 1, which often leads to faster convergence in practice. If \(P\not \succ 0\), the condition (2.13) requires \(\gamma \) to lie in \((0,\bar{\gamma })\), where \(0<\bar{\gamma }<1\) depends on \(\beta \), P, and \(A^TA\). A larger \(\beta \) allows a larger \(\bar{\gamma }\).
Remark 2.2
- (i)
matrix A has full column rank whenever \(P=\mathbf {0}\); and
- (ii)
matrix B has full column rank whenever \(Q=\mathbf {0}\).
3 Global Linear Convergence
3.1 Linear Convergence in G-(Semi)Norm
Lemma 3.1
Proof
Lemma 3.2
Proof
Lemma 3.3
If the matrix [A, B] has full row rank, \(\bar{c}:={\uplambda }_{\min }^{-1}([A,B][A,B]^T)>0\).
- Otherwise, \(\text{ rank }([A,B])=r<p\). Without loss of generality, assuming the first r rows of [A, B] (denoted by \([A_r,B_r]\)) are linearly independent, we have
  $$[A,B]= \begin{bmatrix} I\\ L \end{bmatrix}[A_r,B_r],$$
  where \(I\in \mathbb {R}^{r\times r}\) is the identity matrix and \(L\in \mathbb {R}^{(p-r)\times r}\). Let \(E:=(I+L^TL)[A_r,B_r]\), and \(\bar{c}:={\uplambda }_{\min }^{-1}(EE^T)\Vert I+L^TL\Vert >0\).
Proof
With the above lemmas, we now prove the following main theorem of this subsection.
Theorem 3.1
(Q-linear convergence of \(\Vert u^k-u^*\Vert _G\)) Under the same assumptions of Theorem 2.2 and \(\gamma =1\), for all scenarios in Table 1, there exists \(\delta >0\) such that (3.1) holds.
Proof
Consider the case of \(P=0\) and the corresponding inequality (2.17). In this case \(\hat{P}=\beta A^T A\succeq 0\). Let C denote the right-hand side of (2.17).
For scenario 1 with \(Q=0\), \(\Vert u^{k+1}-u^*\Vert ^2_G=\Vert x^{k+1}-x^*\Vert _{\hat{P}}^2+\frac{1}{\beta \gamma }\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\). Since \(\Vert x^{k+1}-x^*\Vert ^2 \ge {\uplambda }_{\max }(\hat{P})^{-1}\Vert x^{k+1}-x^*\Vert _{\hat{P}}^2\), (3.2) follows from (3.11) with \(\delta =\min \{c_{9}{\uplambda }^{-1}_{\max }(\hat{P}),c_{11}\beta \gamma \}>0\).
Scenario 2 (recall it is scenario 1 plus that g is strongly convex). We have \(c_{10}=2\nu _g>0\) in (3.11), which gives (3.1) with \(\delta =\min \{c_{9}{\uplambda }^{-1}_{\max }(\hat{P}),c_{10}{\uplambda }^{-1}_{\max }(Q),c_{11}\beta \gamma \}>0\). Note that we have used the convention that if \(Q=0\), then \({\uplambda }^{-1}_{\max }(Q)=\infty \).
Scenario 4 (recall it is scenario 3 plus that g is strongly convex). Since \(c_{11}=2\nu _g>0\) in (3.11), we can directly apply Lemma 3.3 to get (3.1) with \(\delta >0\) in a way similar to scenario 3.
Now consider the case of \(P\not =0\) and the corresponding inequality (7.5). Inequalities (7.5) and (2.17) are similar, except that (2.17) has the extra term \(\Vert x^k-x^{k+1}\Vert ^2\), with a strictly positive coefficient, on its right-hand side. This term is needed when Lemma 3.2 is applied. However, the assumptions of the theorem ensure \(\hat{P}\succ 0\) whenever \(P\not =0\). Therefore, in (7.5), the term \(\Vert u^{k}-u^{k+1}\Vert _G^2\), which contains \(\Vert x^k-x^{k+1}\Vert _{\hat{P}}^2\), can contribute a term \(c_{19}\Vert x^k-x^{k+1}\Vert ^2\) with \(c_{19}>0\). Therefore, following the same arguments as for the case of \(P=0\), we get (3.1) with a certain \(\delta >0\).
Now we extend the result in Theorem 3.1 (which is under \(\gamma =1\)) to \(\gamma \ne 1\) in the following theorem.
Theorem 3.2
- i)
if \(P\not =0\), there exists \(\delta >0\) such that (3.1) holds;
- ii)if \(P=0\), there exists \(\delta > 0\) such that$$\begin{aligned} \Vert u^k-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert r^k\Vert ^2 \ge (1+\delta )\left( \Vert u^{k+1}-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert r^{k+1}\Vert ^2 \right) . \end{aligned}$$(3.14)
Proof
When \(\gamma \not =1\), we have \({\uplambda }^{k+1}\not =\hat{{\uplambda }}\). We need to bound \(\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\), but Lemmas 3.2 and 3.3 only give bounds on \(\Vert \hat{{\uplambda }}-{\uplambda }^*\Vert ^2\). Noticing that \((\hat{{\uplambda }}-{\uplambda }^*)-({\uplambda }^{k+1}-{\uplambda }^*)=\hat{{\uplambda }}-{\uplambda }^{k+1}=(\gamma -1)r^{k+1}\) and that C contains a strictly positive term in \(\Vert {\uplambda }^k -{\uplambda }^{k+1}\Vert ^2 = \gamma ^2\Vert r^{k+1}\Vert ^2\), we can bound \(\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\) by a positively weighted sum of \(\Vert \hat{{\uplambda }}-{\uplambda }^*\Vert ^2\) and \(\Vert {\uplambda }^k -{\uplambda }^{k+1}\Vert ^2\).
If \(P\not =0\), the rest of the proof follows from that of Theorem 3.1.
If \(P=0\), \(\gamma \not =1\) leads to (2.16), which extends \(\Vert u^i-u^*\Vert _G^2\) in (2.17) to \(\Vert u^i-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert r^i\Vert ^2\), for \(i=k,k+1\). Since C contains \(\Vert {\uplambda }^k -{\uplambda }^{k+1}\Vert ^2 = \gamma ^2\Vert r^{k+1}\Vert ^2\) with a strictly positive coefficient, one obtains (3.14) by using this term and following the proof of Theorem 3.1.
3.2 Explicit Formula of Convergence Rate
To keep the proof of Theorem 3.1 easy to follow, we have avoided giving the explicit formulas of \(c_i\)’s and thus also those of \(\delta \). To give the reader an idea what quantities affect \(\delta \), we now provide an explicit formula of \(\delta \) for the classic ADM (i.e., case 1 with \(\gamma =1\)) under scenario 1.
Corollary 3.1
Proof
Not surprisingly, the convergence rate under scenario 1 is negatively affected by the condition numbers of A and f. For other scenarios, the formulas of \(\delta \) can also be similarly obtained by deriving the specific values of the \(c_i\)’s in our analysis. However, they appear to be more complicated than the nice formula (3.16) for scenario 1. A close look at these formulas of the \(c_i\)’s reveals that the convergence rate is negatively affected by the condition numbers of the constraint matrices A, B, and [A, B], as well as the condition numbers of the objective functions f and g. Due to the page limit, we leave other scenarios/cases and further analysis to future research.
3.3 Comparison with Lions and Mercier’s Linear Rate of DRSM
It is known that applying the classic ADM to problem (1.1) is equivalent to applying the Douglas–Rachford splitting method (DRSM) to the dual of (1.1). (However, it is unclear to which splitting methods the various ADM generalizations correspond.) In this subsection, we review the classic linear convergence result [29] of the DRSM. In comparison, we show that our linear rate for the classic ADM is considerably better than the one in [29].
Theorem 3.3
3.4 Q-Linear Convergent Quantities
From the definition of G, which depends on P and Q, it is easy to see that the Q-linear convergence of \(u^k=(x^k;y^k;{\uplambda }^k)\) translates to the Q-linear convergence results in Table 2. For example, in case 1 (\(P=0\) and \(Q=0\)), \(\Vert u^{k+1}-u^*\Vert ^2_G=\Vert x^{k+1}-x^*\Vert _{\hat{P}}^2+\frac{1}{\beta \gamma }\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\), where \(\hat{P}=P+\beta A^TA=\beta A^TA\). Hence, \((Ax^k,{\uplambda }^k)\) converges Q-linearly. Examining \(\Vert u^{k+1}-u^*\Vert ^2_G\) gives the results for cases 2, 3, 4.
3.5 R-Linear Convergent Quantities
By the definition of R-linear convergence, any part of a Q-linear convergent quantity converges R-linearly. For example, in case 1 (\(P=0\) and \(Q=0\)), the Q-linear convergence of \((Ax^k,{\uplambda }^k)\) in Table 2 gives the R-linear convergence of \(Ax^k\) and \({\uplambda }^k\). Therefore, to establish Table 2, it remains to show the R-linear convergence of \(x^k\) in cases 1 and 3 and that of \(y^k\) in cases 1 and 2. Our approach is to bound their errors by existing R-linear convergent quantities.
Theorem 3.4
- i)
In cases 1 and 3, if \({\uplambda }^k\) converges R-linearly, then \(x^k\) converges R-linearly.
- ii)
In cases 1 and 2, scenario 1, if \({\uplambda }^k\) and \(x^k\) both converge R-linearly, then \(By^k\) converges R-linearly. In addition, if B has full column rank, then \(y^k\) converges R-linearly.
- iii)
In cases 1 and 2, scenarios 2–4, if \({\uplambda }^k\) and \(x^k\) both converge R-linearly, then \(y^k\) converges R-linearly.
Proof
- i) By (2.8) and \(\hat{P}=\beta A^TA\), we have \(\nu _f\Vert x^{k+1}-x^{*}\Vert ^2 \le \Vert A\Vert \Vert x^{k+1}-x^{*}\Vert \Vert {\uplambda }^{k+1}-{\uplambda }^{*}\Vert \), which implies
  $$\Vert x^{k+1}-x^*\Vert ^2 \le \frac{\Vert A\Vert ^2}{\nu _f^2}\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2. \quad (3.38)$$
- ii)
The result follows from (2.10).
- iii) Scenario 3 assumes the full column rank of B, so the result follows from (2.10). In scenarios 2 and 4, g is strongly convex. Recall (2.9) with \(\hat{{\uplambda }}={\uplambda }^{k+1}\):
  $$\left\langle y^{k+1}-y^{*},~B^T\left( {\uplambda }^{k+1}-{\uplambda }^*-\beta A(x^{k}-x^{k+1})\right) +Q(y^k-y^{k+1})\right\rangle \ge \nu _g\Vert y^{k+1}-y^*\Vert ^2. \quad (3.39)$$
  By the Cauchy–Schwarz inequality and \(Q=\mathbf {0}\), we have
  $$\nu _g\Vert y^{k+1}-y^*\Vert \le \Vert B\Vert \Vert {\uplambda }^{k+1}-{\uplambda }^*-\beta A(x^{k}-x^{k+1})\Vert . \quad (3.40)$$
  Therefore, the result follows from the R-linear convergence of \(x^k\) and \({\uplambda }^k\).
4 Applications
This section describes some well-known optimization models on which Algorithm 2 not only enjoys global linear convergence but also often has easy-to-solve subproblems. In general, at least one of the two objective functions needs to be strictly convex. This is the case with Tikhonov regularization, which has numerous applications such as ridge regression and the support vector machine (SVM) in statistics and machine learning, elastic net regularization (see Sect. 4.2 below), and entropy maximization. In some applications, the conditions for linear convergence hold, not initially, but after the iterates enter an optimal active set. Then we obtain eventual linear convergence rather than global linear convergence.
4.1 Convex Regularization
4.2 Sparse Optimization
In recent years, the problem of recovering sparse vectors and low-rank matrices has received tremendous attention from researchers and engineers, particularly those in the areas of compressive sensing, machine learning, and statistics.
4.3 Consensus and Sharing Optimization
Algorithm 2 applied to the problems (4.8), (4.9) and (4.10) converges linearly if each function \(f_i\) is strongly convex and has Lipschitz continuous gradient. The resulting ADM is particularly suitable for distributed implementation, since the x-subproblem can be decomposed into N independent \(x_{i}\)-subproblems, and the update to the multiplier \({\uplambda }\) can also be done at each node i.
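The distributed structure can be sketched as follows for a consensus problem, with our own toy choice \(f_i(x)=\frac{1}{2}\Vert x-a_i\Vert ^2\), which is strongly convex with Lipschitz continuous gradient, so linear convergence applies; the per-node \(x_i\)-updates and multiplier updates are fully local.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n = 5, 3
a = rng.standard_normal((N, n))      # node-local data; f_i(x) = 0.5||x - a_i||^2

# Consensus ADM for  min sum_i f_i(x_i)  s.t.  x_i = z  for all i.
beta = 1.0
x = np.zeros((N, n)); z = np.zeros(n); lam = np.zeros((N, n))
for _ in range(200):
    # The x_i-subproblems decouple across the N nodes; for these quadratic
    # f_i, minimizing f_i(x) - <lam_i, x - z> + (beta/2)||x - z||^2 gives:
    x = (a + lam + beta * z) / (1.0 + beta)
    # The z-update gathers the node results (the averaging/consensus step).
    z = np.mean(x - lam / beta, axis=0)
    # Each multiplier update is local to its node.
    lam = lam - beta * (x - z)

# For f_i(x) = 0.5||x - a_i||^2, the consensus solution is the mean of the a_i.
print(np.allclose(z, a.mean(axis=0)))  # True
```

In an actual distributed implementation, only the averaging step requires communication; everything else runs in parallel at the nodes.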
5 Numerical Demonstration
We present the results of some simple numerical tests to demonstrate the linear convergence of Algorithm 2. The numerical performance is not the focus of this paper and will be investigated more thoroughly in future research.
5.1 Elastic Net
We apply Algorithm 2 with \(P=0\) and \(Q=0\) to a small elastic net problem (4.4), where the feature matrix A has \(m=250\) examples and \(n=1000\) features. We first generated the matrix A from the standard Gaussian distribution \({\mathcal {N}}(0,1)\) and then orthonormalized its rows. A sparse vector \(x^0\in \mathbb {R}^n\) was generated with 25 nonzero entries, each sampled from the standard Gaussian distribution. The observation vector \(b\in \mathbb {R}^m\) was then computed by \(b=Ax^0+\epsilon \), where \(\epsilon \sim {\mathcal {N}}(0,10^{-3}I)\). We chose the model parameters \(\alpha =0.1\) and \(\mu =10^{-2}\), which we found to yield reasonable accuracy for recovering the sparse solution. We initialized all the variables at zero and set the algorithm parameters \(\beta =100\) and \(\gamma =1\). We ran the algorithm for 200 iterations and recorded the errors at each iteration with respect to a precomputed reference solution \(u^*\).
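A downsized sketch of this experiment follows. The dimensions are reduced for a quick run, and the precise elastic net objective is our inference: we take \(f(x)=\alpha \Vert x\Vert ^2+\frac{1}{2\mu }\Vert Ax-b\Vert ^2\) and \(g=\Vert \cdot \Vert _1\) with the splitting constraint \(x-y=0\), which is consistent with the constants \(\nu _f=2\alpha \) and \(L_f=2\alpha +1/\mu \) used in the next paragraph.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 20, 50                        # downsized from the paper's 250 x 1000
A = rng.standard_normal((m, n))
A = np.linalg.qr(A.T)[0].T           # orthonormalize the rows, as in the text
x0 = np.zeros(n)
x0[rng.choice(n, 5, replace=False)] = rng.standard_normal(5)
b = A @ x0 + rng.normal(0.0, np.sqrt(1e-3), size=m)

alpha, mu = 0.1, 1e-2                # model parameters as in the text
beta, gamma = 100.0, 1.0             # algorithm parameters as in the text

# ADM on  min ||y||_1 + alpha||x||^2 + (1/(2 mu))||Ax - b||^2  s.t. x - y = 0
M = (2.0 * alpha + beta) * np.eye(n) + A.T @ A / mu   # x-subproblem matrix
x, y, lam = np.zeros(n), np.zeros(n), np.zeros(n)
for _ in range(5000):
    t = x - lam / beta               # y-subproblem: soft-thresholding at 1/beta
    y = np.sign(t) * np.maximum(np.abs(t) - 1.0 / beta, 0.0)
    x = np.linalg.solve(M, A.T @ b / mu + lam + beta * y)
    lam = lam - gamma * beta * (x - y)

print(np.linalg.norm(x - y))         # feasibility residual; decays linearly
```

Plotting \(\Vert x^k-y^k\Vert \) (or the error to a reference solution) on a log scale reveals the two-stage linear decay discussed next.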
Here, the strong convexity constant of f is \(\nu _f=2\alpha +{\uplambda }_{\min }(A^TA)/\mu =2\alpha \) and the Lipschitz constant of \(\nabla f\) is \(L_f=2\alpha +{\uplambda }_{\max }(A^TA)/\mu =2\alpha +1/\mu \). By (3.15), our bound for the global linear rate is \((1+\delta )^{-1}=0.996\), which roughly matches the early-stage rate shown in the figure. However, our theoretical bound is rather conservative, since it is a global worst-case bound and it does not take into account the properties of the \(\ell _1\) norm and the solution. In fact, the optimal solution \(x^*\) is very sparse and \(x^k\) will also become sparse after a number of iterations. Let \({\mathcal {S}}\) be an index set of the nonzero support of \((x^k-x^*)\), and \(A_{{\mathcal {S}}}\) be a submatrix composed of those columns of A indexed by \({\mathcal {S}}\). Then, the constants \(\nu _f\) and \(L_f\) in our bound can be effectively replaced by \(\bar{\nu }_f=2\alpha +{\uplambda }_{\min }(A_{{\mathcal {S}}}^TA_{{\mathcal {S}}})/\mu \) and \(\bar{L}_f=2\alpha +{\uplambda }_{\max }(A_{{\mathcal {S}}}^TA_{{\mathcal {S}}})/\mu \), thereby accounting for the faster convergence rate in the later stage. For example, letting \({\mathcal {S}}\) be the nonzero support of the optimal solution \(x^*\), we obtain an estimate of the (asymptotic) linear rate \((1+\delta )^{-1}=0.817\), which well matches the later-stage rate.
5.2 Distributed Lasso
We apply Algorithm 2 with \(P=0\) and \(Q=0\) to a small distributed Lasso problem (5.1) with \(N=5\), where each \(A_i\) has \(m=600\) examples and \(n=500\) features. Each \(A_i\) is a tall matrix and has full column rank, yielding a strongly convex objective function in \(x_i\). Therefore, Algorithm 2 is guaranteed to converge linearly.
We generated the data similarly to the elastic net test. We generated each \(A_i\) from the standard Gaussian distribution \({\mathcal {N}}(0,1)\) and then scaled its columns to unit length. We generated a sparse vector \(x^0\in \mathbb {R}^n\) with 250 nonzero entries, each sampled from \({\mathcal {N}}(0,1)\). Each \(b_i\in \mathbb {R}^m\) was then computed by \(b_i=A_ix^0+\epsilon _i\), where \(\epsilon _i\sim {\mathcal {N}}(0,10^{-3}I)\). We chose the model parameter \(\mu =0.1\), which we found to yield reasonably good recovery quality. Starting from the zero initial point, we ran the algorithm with parameters \(\beta =10\) and \(\gamma =1\) for 50 iterations and computed the errors at each iteration.
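The distributed data generation can be sketched as follows; again this is an illustrative reconstruction with our own seed, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, n = 5, 600, 500

# Tall Gaussian blocks with columns scaled to unit length;
# each block has full column rank with probability one.
A = [rng.standard_normal((m, n)) for _ in range(N)]
A = [Ai / np.linalg.norm(Ai, axis=0) for Ai in A]

# Shared sparse ground truth with 250 nonzero Gaussian entries.
x0 = np.zeros(n)
support = rng.choice(n, size=250, replace=False)
x0[support] = rng.standard_normal(250)

# Local noisy measurements b_i = A_i x0 + eps_i, eps_i ~ N(0, 1e-3 I).
b = [Ai @ x0 + np.sqrt(1e-3) * rng.standard_normal(m) for Ai in A]
```

Because each \(A_i\) has full column rank, each local least-squares term is strongly convex, which is the condition invoked above for linear convergence.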
6 Conclusions
In this paper, we provide sufficient conditions for the global linear convergence of a general class of ADMs that solve their subproblems either exactly or approximately in a certain manner. Among these conditions is the requirement that one of the two objective functions be strongly convex with a Lipschitz continuous gradient. These sufficient conditions cover a wide range of applications. We also extend the existing convergence theory to allow more generality in the step size \(\gamma \) for updating the multipliers.
In practice, choosing the penalty parameter \(\beta \) is always an important issue. Our convergence rate analysis provides insight into how \(\beta \) affects the convergence speed, thereby offering some theoretical guidance for its choice.
Footnotes
- 1.
The results continue to hold in many cases when strong convexity is relaxed to strict convexity (e.g., \(-\log (x)\) is strictly convex but not strongly convex over \(x>0\)). The ADM always generates a bounded sequence \(\{(Ax^k, By^k, {\uplambda }^k)\}\), where the bound depends only on the starting point and the solution, even when the feasible set is unbounded. When restricted to a compact set, a strictly convex function is strongly convex.
- 2. Suppose a sequence \(\{u^k\}\) converges to \(u^*\). We say the convergence is (in some norm \(\Vert \cdot \Vert \))
Q-linear, if there exists \(\mu \in (0,1)\) such that \(\frac{\Vert u^{k+1}-u^*\Vert }{\Vert u^{k}-u^*\Vert }\le \mu \) for all k;
R-linear, if there exists a nonnegative sequence \(\{\sigma ^k\}\) such that \(\Vert u^{k}-u^*\Vert \le \sigma ^k\) for all k and \(\sigma ^k\rightarrow 0\) Q-linearly.
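To see that the two notions differ, here is a standard counterexample (our own illustration, not from the original footnote) of a sequence that converges R-linearly but not Q-linearly:

```latex
\[
  u^k = \begin{cases} 2^{-k}, & k \text{ even},\\ 3^{-k}, & k \text{ odd},\end{cases}
  \qquad u^* = 0.
\]
% Over odd k, the ratio u^{k+1}/u^k = 3^k / 2^{k+1} is unbounded, so no
% single \mu \in (0,1) works and the convergence is not Q-linear.
% Yet u^k \le \sigma^k := 2^{-k}, and \sigma^k \to 0 Q-linearly with
% \mu = 1/2, so the convergence is R-linear.
```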
Notes
Acknowledgments
The authors’ work is supported in part by ARL MURI Grant W911NF-09-1-0383 and NSF Grant DMS-1317602.
References
- 1. Boley, D.: Linear convergence of ADMM on a model problem. TR 12-009, Department of Computer Science and Engineering, University of Minnesota (2012)
- 2. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2010)
- 3. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 26(9), 1124–1137 (2004)
- 4. Cai, J., Osher, S., Shen, Z.: Split Bregman methods and frame based image restoration. Multiscale Model. Simul. 8(2), 337 (2009)
- 5. Chen, G., Teboulle, M.: A proximal-based decomposition method for convex minimization problems. Math. Program. 64(1), 81–101 (1994)
- 6. Davis, D., Yin, W.: Convergence rate analysis of several splitting schemes. arXiv preprint arXiv:1406.4834 (2014)
- 7. Davis, D., Yin, W.: Faster convergence rates of relaxed Peaceman–Rachford and ADMM under regularity assumptions. arXiv preprint arXiv:1407.5210 (2014)
- 8. Deng, W., Yin, W., Zhang, Y.: Group sparse optimization by alternating direction method. In: SPIE Optical Engineering + Applications, 88580R (2013)
- 9. Douglas, J., Rachford, H.: On the numerical solution of heat conduction problems in two and three space variables. Trans. Am. Math. Soc. 82(2), 421–439 (1956)
- 10. Eckstein, J., Bertsekas, D.: An alternating direction method for linear programming. Division of Research, Harvard Business School; Laboratory for Information and Decision Systems, M.I.T. (1990)
- 11. Eckstein, J., Bertsekas, D.P.: On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55(1–3), 293–318 (1992)
- 12. Esser, E.: Applications of Lagrangian-based alternating direction methods and connections to split Bregman. CAM Report 09-31, UCLA (2009)
- 13. Gabay, D.: Chapter IX: Applications of the method of multipliers to variational inequalities. Stud. Math. Appl. 15, 299–331 (1983)
- 14. Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 2(1), 17–40 (1976)
- 15. Glowinski, R.: Numerical Methods for Nonlinear Variational Problems. Springer Series in Computational Physics. Springer, Berlin (1984)
- 16. Glowinski, R., Marrocco, A.: Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité, d’une classe de problèmes de Dirichlet non linéaires. Laboria (1975)
- 17. Goldfarb, D., Ma, S.: Fast multiple splitting algorithms for convex optimization. SIAM J. Optim. 22(2), 533–556 (2012)
- 18. Goldfarb, D., Ma, S., Scheinberg, K.: Fast alternating linearization methods for minimizing the sum of two convex functions. Math. Program. 141(1–2), 349–382 (2013)
- 19. Goldfarb, D., Yin, W.: Parametric maximum flow algorithms for fast total variation minimization. SIAM J. Sci. Comput. 31(5), 3712–3743 (2009)
- 20. Goldstein, T., Bresson, X., Osher, S.: Geometric applications of the split Bregman method: segmentation and surface reconstruction. J. Sci. Comput. 45(1), 272–293 (2010)
- 21. Goldstein, T., O’Donoghue, B., Setzer, S., Baraniuk, R.: Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3), 1588–1623 (2014)
- 22. Goldstein, T., Osher, S.: The split Bregman method for L1 regularized problems. SIAM J. Imaging Sci. 2(2), 323–343 (2009)
- 23. He, B., Liao, L., Han, D., Yang, H.: A new inexact alternating directions method for monotone variational inequalities. Math. Program. 92(1), 103–118 (2002)
- 24. He, B., Yuan, X.: On non-ergodic convergence rate of Douglas–Rachford alternating direction method of multipliers. Numer. Math. 130(3), 567–577 (2014)
- 25. He, B., Yuan, X.: On the \(O(1/n)\) convergence rate of the Douglas–Rachford alternating direction method. SIAM J. Numer. Anal. 50(2), 700–709 (2012)
- 26. Hong, M., Luo, Z.: On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922v3 (2013)
- 27. Jiang, H., Deng, W., Shen, Z.: Surveillance video processing using compressive sensing. Inverse Probl. Imaging 6(2), 201–214 (2012)
- 28. Liang, J., Fadili, J., Peyre, G., Luke, R.: Activity identification and local linear convergence of Douglas–Rachford/ADMM under partial smoothness. arXiv preprint arXiv:1412.6858v5 (2015)
- 29. Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979)
- 30. Mateos, G., Bazerque, J., Giannakis, G.: Distributed sparse linear regression. IEEE Trans. Signal Process. 58(10), 5262–5276 (2010)
- 31. Mendel, J., Burrus, C.: Maximum-Likelihood Deconvolution: A Journey into Model-Based Signal Processing. Springer, New York (1990)
- 32. Rockafellar, R.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)
- 33. Rockafellar, R.: Convex Analysis, vol. 28. Princeton University Press, Princeton (1997)
- 34. Wang, Y., Yang, J., Yin, W., Zhang, Y.: A new alternating minimization algorithm for total variation image reconstruction. SIAM J. Imaging Sci. 1(3), 248–272 (2008)
- 35. Yan, M., Yin, W.: Self equivalence of the alternating direction method of multipliers. arXiv preprint arXiv:1407.7400 (2014)
- 36. Yang, J., Zhang, Y.: Alternating direction algorithms for \(\ell _1\)-problems in compressive sensing. SIAM J. Sci. Comput. 33(1–2), 250–278 (2011)
- 37. Zhang, X., Burger, M., Osher, S.: A unified primal–dual algorithm framework based on Bregman iteration. J. Sci. Comput. 46(1), 20–46 (2011)
- 38. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)