Journal of Scientific Computing, Volume 66, Issue 3, pp. 889–916

On the Global and Linear Convergence of the Generalized Alternating Direction Method of Multipliers

Abstract

The formulation
$$\begin{aligned} \min _{x,y} ~f(x)+g(y),\quad \text{ subject } \text{ to } Ax+By=b, \end{aligned}$$
where f and g are extended-value convex functions, arises in many application areas such as signal processing, imaging and image processing, statistics, and machine learning either naturally or after variable splitting. In many common problems, one of the two objective functions is strictly convex and has Lipschitz continuous gradient. On this kind of problem, a very effective approach is the alternating direction method of multipliers (ADM or ADMM), which solves a sequence of f/g-decoupled subproblems. However, its effectiveness has not been matched by a provably fast rate of convergence; only sublinear rates such as \(O(1/k)\) and \(O(1/k^2)\) were recently established in the literature, though the \(O(1/k)\) rates do not require strong convexity. This paper shows that global linear convergence can be guaranteed under the assumptions of strong convexity and Lipschitz gradient on one of the two functions, along with certain rank assumptions on A and B. The result applies to various generalizations of ADM that allow the subproblems to be solved faster and less exactly in certain manners. The derived rate of convergence also provides some theoretical guidance for optimizing the ADM parameters. In addition, this paper makes meaningful extensions to the existing global convergence theory of ADM generalizations.

Keywords

Alternating direction method of multipliers · Global convergence · Linear convergence · Strong convexity · Distributed computing

1 Introduction

The alternating direction method of multipliers (ADM or ADMM) is very effective at solving many practical optimization problems and has wide applications in areas such as signal and image processing, machine learning, statistics, compressive sensing, and operations research. We refer to [2, 4, 8, 12, 20, 22, 27, 31, 34, 36] for a few examples of applications. The ADM is applied to constrained convex optimization problems with separable objective functions in the following form
$$\begin{aligned} \begin{aligned} \min _{x,y} ~&\quad f(x)+g(y)\\ {\text {s.t.}}~&\quad Ax+By=b, \end{aligned} \end{aligned}$$
(1.1)
where \(x\in \mathbb {R}^n\) and \(y\in \mathbb {R}^m\) are unknown variables, \(A\in \mathbb {R}^{p\times n}\) and \(B\in \mathbb {R}^{p\times m}\) are given matrices, and \(f:\mathbb {R}^n\rightarrow \mathbb {R}\cup \{+\infty \}\) and \(g:\mathbb {R}^m\rightarrow \mathbb {R}\cup \{+\infty \}\) are closed proper convex functions. Some problems are not originally in the form of (1.1), but they can be brought into this form by introducing variables and constraints. For example, introducing \(y=Ax\), the problem \(\min _x f(x) + g(Ax)\) is transformed into (1.1) with \(B=-I\) and \(b=\mathbf {0}\).

The constraints \(x\in {\mathcal {X}}\) and \(y\in {\mathcal {Y}}\), where \({\mathcal {X}}\subseteq \mathbb {R}^n\) and \({\mathcal {Y}}\subseteq \mathbb {R}^m\) are closed convex sets, can be included as the (extended-value) indicator functions \(I_{{\mathcal {X}}}(x)\) and \(I_{{\mathcal {Y}}}(y)\) in the objective functions f and g. Here the indicator function of a convex set \({\mathcal {C}}\) returns 0 if the input lies in \({\mathcal {C}}\) and \(\infty \) otherwise.

The main goal of this paper is to show that the ADM applied to (1.1) has global linear convergence under a variety of different conditions, in particular, when one of the two objective functions is strongly convex, along with other regularity and rank conditions. The convergence analysis is performed under a general framework that allows the ADM subproblems to be solved inexactly and faster.

The classic ADM was first introduced in [14, 16]. Consider the augmented Lagrangian of (1.1):
$$\begin{aligned} {\mathcal {L}}_{\mathcal {A}}(x,y,{\uplambda })=f(x)+g(y)-{\uplambda }^T(Ax+By-b)+\frac{\beta }{2}\Vert Ax+By-b\Vert _2^2, \end{aligned}$$
(1.2)
where \({\uplambda }\in \mathbb {R}^p\) is the Lagrangian multiplier and \(\beta >0\) is a penalty parameter. The classic augmented Lagrangian method (ALM) minimizes \({\mathcal {L}}_{\mathcal {A}}(x,y,{\uplambda })\) over x and y jointly and then updates \({\uplambda }\). However, the ADM replaces the joint minimization by minimization over x and y, one after another, as described in Algorithm 1. Compared to the ALM, though the ADM may take more iterations, it often runs faster due to the easier subproblems.
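
For concreteness, a minimal Python sketch of this alternating scheme follows (an illustration only, not the authors' code: the y-subproblem is solved before the x-subproblem, matching the step ordering of Algorithm 2 described below, and solve_y, solve_x are hypothetical user-supplied solvers of the two subproblems of (1.2)):

```python
import numpy as np

def adm_classic(A, B, b, solve_x, solve_y, beta=1.0, iters=500):
    """Sketch of the classic ADM: alternate the two subproblems, then update the multiplier.

    solve_y(x, lam): returns argmin_y L_A(x, y, lam) for the augmented Lagrangian (1.2).
    solve_x(y, lam): returns argmin_x L_A(x, y, lam).
    Both solvers are hypothetical and must be supplied by the user.
    """
    x = np.zeros(A.shape[1])
    y = np.zeros(B.shape[1])
    lam = np.zeros(b.shape[0])
    for _ in range(iters):
        y = solve_y(x, lam)                      # y^{k+1} = argmin_y L_A(x^k, y, lam^k)
        x = solve_x(y, lam)                      # x^{k+1} = argmin_x L_A(x, y^{k+1}, lam^k)
        lam = lam - beta * (A @ x + B @ y - b)   # multiplier update
    return x, y, lam
```
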
Variants of Algorithm 1 that allow modified versions of \(\mathcal {L}_{\mathcal {A}}\) to be quickly minimized over x or y are very important for applications in which it is expensive to exactly solve the x-subproblem, the y-subproblem, or both in Algorithm 1. For this reason, we present Algorithm 2 below, which is more general than Algorithm 1. Our convergence results are established for Algorithm 2.

Compared to Algorithm 1, Algorithm 2 adds \(\frac{1}{2}\Vert y-y^k\Vert _{Q}^2\) and \(\frac{1}{2}\Vert x-x^k\Vert _{P}^2\) to the y- and x-subproblems, respectively, and uses \(\gamma \) as the step size for the update of \({\uplambda }\). We abuse the notation \(\Vert x\Vert _M^2:=x^T M x\), as we allow any symmetric, possibly indefinite, matrix M. Different choices of P and Q are reviewed in the next subsection. They can make steps 4 and 5 of Algorithm 2 much easier.
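
Spelled out, one iteration of Algorithm 2 performs the y-subproblem, then the x-subproblem, and then the multiplier update with step size \(\gamma \); paraphrasing steps 4 and 5 and the \({\uplambda }\)-update (cf. (2.1)–(2.2) and the optimality conditions (2.6)–(2.7) below),
$$\begin{aligned} y^{k+1}&=\mathop {\arg \min }_y~{\mathcal {L}}_{\mathcal {A}}(x^k,y,{\uplambda }^k)+\frac{1}{2}\Vert y-y^k\Vert _{Q}^2,\\ x^{k+1}&=\mathop {\arg \min }_x~{\mathcal {L}}_{\mathcal {A}}(x,y^{k+1},{\uplambda }^k)+\frac{1}{2}\Vert x-x^k\Vert _{P}^2,\\ {\uplambda }^{k+1}&={\uplambda }^k-\gamma \beta \left( Ax^{k+1}+By^{k+1}-b\right) . \end{aligned}$$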

We do not fix \(\gamma =1\) as in most of the ADM literature, since \(\gamma \) plays a key role. For example, when \(P=\mathbf {0}\) and \(Q=\mathbf {0}\), any \(\gamma \in (0,(\sqrt{5}+1)/2)\) guarantees the convergence of Algorithm 2 [15], but \(\gamma =1.618\) tends to make the algorithm faster than \(\gamma =1\). The range of \(\gamma \) depends on P and Q, as well as \(\beta \). When P is indefinite (and gives rise to much simpler subproblems), \(\gamma \) must be smaller than 1, or the iteration may diverge.

Let us review two works related to Algorithm 2. The work [23] considers (1.2) where the quadratic penalty term is generalized to \(\Vert Ax+By-b\Vert _{H_k}^2\) for a sequence of bounded positive definite matrices \(\{H_k\}\), and it proves the convergence of Algorithm 2 restricted to \(\gamma =1\) and differentiable functions f and g. The work [37] replaces \(\gamma \) by a general positive definite matrix C and establishes convergence assuming that \(A=I\) and the smallest eigenvalue of C is no greater than 1, which corresponds to \(\gamma \le 1\) when \(C=\gamma I\). In these works, no rate of convergence is given.

1.1 Generalized ADM with Simplified Subproblems

By “simplified subproblems”, we mean that the ADM subproblems in Algorithm 1 are replaced by subproblems that are easier to solve or have closed-form solutions. The modified ADM still converges to the exact solution.

Let us give a few examples of matrix P in step 5 of Algorithm 2. These examples also apply to Q in step 4, which can be different from P.

Prox-linear ADM This approach was introduced in [5]. Setting
$$\begin{aligned} P=\frac{\beta }{\tau }I-\beta A^TA \end{aligned}$$
(1.3)
gives rise to a prox-linear problem at step 5 of Algorithm 2:
$$\begin{aligned} \min _{x}~f(x)+\beta \left( (p^k)^T(x-x^k)+\frac{1}{2\tau }\Vert x-x^k\Vert _2^2\right) , \end{aligned}$$
(1.4)
where \(\tau >0\) is a proximal parameter and \(p^k:=A^T(Ax^k+By^{k+1}-b-{\uplambda }^k/\beta )\) is the gradient of the last two terms of \({\mathcal {L}}_A(x,y^{k+1},{\uplambda }^k)\) (1.2) at \(x=x^k\). In addition, by letting
$$\begin{aligned} Q=\frac{\beta }{\tau '}I-\beta B^TB, \end{aligned}$$
(1.5)
either alone or together with (1.4), step 4 of Algorithm 2 can also be implemented as a prox-linear subproblem:
$$\begin{aligned} \min _{y}~g(y)+\beta \left( (q^k)^T(y-y^k)+\frac{1}{2\tau '}\Vert y-y^k\Vert _2^2\right) , \end{aligned}$$
(1.6)
where \(\tau '>0\) and \(q^k:=B^T(Ax^k+By^{k}-b-{\uplambda }^k/\beta )\). Prox-linear subproblems are easier to compute in various applications. For example, if f is a separable function, problem (1.4) reduces to a set of independent one-dimensional problems. In particular, if f is the \(\ell _1\) norm, the solution is given in closed form by soft-thresholding. If f is the matrix nuclear norm, then singular-value soft-thresholding is used. If \(f(x)=\Vert \Phi x\Vert _1\), where \(\Phi \) is an orthogonal operator or a tight frame, (1.4) also has a closed-form solution. If f is total variation, (1.4) can be solved by graph-cut [3, 19]. There are a large number of such examples in signal processing, imaging, statistics, machine learning, etc.
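
For instance, with \(f=\Vert \cdot \Vert _1\), the prox-linear x-update (1.4) reduces to a single soft-thresholding; a minimal sketch (function names are ours, for illustration only):

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise soft-thresholding: argmin_x ||x||_1 + (1/(2t))||x - v||^2."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_linear_x_update(A, B, b, x, y_next, lam, beta, tau):
    """Prox-linear x-step (1.4) with f the l1 norm, i.e. P = (beta/tau) I - beta A'A."""
    p = A.T @ (A @ x + B @ y_next - b - lam / beta)   # p^k: gradient of the smooth part at x^k
    # (1.4) equals min_x ||x||_1 + (beta/(2*tau)) ||x - (x^k - tau*p)||^2 up to a constant,
    # whose minimizer is soft-thresholding with threshold tau/beta.
    return soft_threshold(x - tau * p, tau / beta)
```
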
Gradient-descent ADM When function f is quadratic, letting
$$\begin{aligned} P=\frac{1}{\alpha }I-H_f-\beta A^TA, \quad \text{ where }~H_f:=\nabla ^2 f(x)\succeq 0, \end{aligned}$$
(1.7)
gives rise to a gradient descent step for step 5 of Algorithm 2:
$$\begin{aligned} \min _{x} ~(g^k)^T(x-x^k)+\frac{1}{2\alpha }\Vert x-x^k\Vert _2^2, \end{aligned}$$
(1.8)
where \( g^k:= \nabla f(x^k)+\beta A^T(Ax^k+By^{k+1}-b-{\uplambda }^k/\beta ) \) is the gradient of \(\mathcal {L}_{\mathcal {A}}(x,y^{k+1},{\uplambda }^k)\) at \(x=x^k\). The solution of this subproblem is simply \(x^{k+1}= x^k-\alpha g^k\), where \(\alpha >0\) is the gradient-descent step size. As before, step 4 can also be realized as gradient descent by using a similar Q. When the subproblems of Algorithm 1 require solving a large, nontrivial linear system, taking just one gradient step has a clear speed advantage.
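
The corresponding gradient-descent x-update (1.8) is a single explicit step; a minimal sketch under the same illustrative conventions as above:

```python
def gradient_descent_x_update(grad_f, A, B, b, x, y_next, lam, beta, alpha):
    """Gradient-descent x-step (1.8): one explicit gradient step on L_A(., y^{k+1}, lam^k)."""
    g = grad_f(x) + beta * (A.T @ (A @ x + B @ y_next - b - lam / beta))   # g^k
    return x - alpha * g                                                   # x^{k+1} = x^k - alpha * g^k
```
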
ADM with fast approximation of A and/or B The term \(\frac{\beta }{2}\Vert Ax+By-b\Vert ^2_2\) in \({\mathcal {L}}_A(x,y,{\uplambda })\) contains the second-order term \(\frac{\beta }{2}x^TA^TAx\). Replacing \(A^T A\) by a certain symmetric matrix \(D\approx A^T A\) can make step 5 easier to compute or even have a closed-form solution. To this end, one can let
$$\begin{aligned} P=\beta (D-A^T A). \end{aligned}$$
(1.9)
The choice of P effectively turns \(\frac{\beta }{2}x^TA^TAx\) into \(\frac{\beta }{2}x^TDx\) since we have
$$\begin{aligned}&\frac{\beta }{2}\Vert Ax+By-b\Vert _2^2+ \frac{1}{2}\Vert x-x^k\Vert ^2_{P}\\&\quad = \frac{\beta }{2}x^T Dx + [\text {terms linear in}~ x] +[\text {terms independent of { x}}].\end{aligned}$$
This approach is useful when \(A^T A\) is nearly diagonal (set D as the diagonal matrix), or is nearly an orthogonal matrix (set D as the orthogonal matrix), as well as when an off-the-grid operator A can be approximated by its on-the-grid counterpart that has very fast implementations (e.g., the discrete Fourier transforms and FFT).
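
The identity above is easy to check numerically; the following sketch (with a random A and with D chosen as the diagonal part of \(A^TA\), purely for illustration) verifies that, with \(P=\beta (D-A^TA)\), the second-order term in x of \(\frac{\beta }{2}\Vert Ax+By-b\Vert _2^2+\frac{1}{2}\Vert x-x^k\Vert _P^2\) is indeed \(\frac{\beta }{2}x^TDx\):

```python
import numpy as np

rng = np.random.default_rng(0)
p_dim, n = 20, 30
A = rng.standard_normal((p_dim, n))
beta = 1.5
D = np.diag(np.diag(A.T @ A))      # fast approximation of A'A: its diagonal part
P = beta * (D - A.T @ A)           # choice (1.9); typically indefinite

# Second-order (Hessian) parts in x of the two sides of the identity:
H_left = beta * (A.T @ A) + P      # from (beta/2)||Ax + const||^2 + (1/2)||x - x^k||_P^2
H_right = beta * D                 # from (beta/2) x' D x
print(np.allclose(H_left, H_right))   # True: the quadratic terms agree
```
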

Let us describe an image processing application that benefits from properly applying (1.9). Consider the total variation regularization problem: \(\min _u \Vert {\nabla }u \Vert _1 + \frac{1}{2}\Vert Tu - b\Vert ^2,\) where u is an image and T is a sensing operator that is either a blurring operator or a downsampled Fourier operator. The latter operator finds its applications in MRI. The problem can be reformulated in the form of (1.1) as \(\min _{u,v} \Vert v\Vert _1 + \frac{1}{2}\Vert Tu-b\Vert ^2,\quad \text{ subject } \text{ to }~ v-\nabla u = 0.\) The v-subproblem is closed-form soft-thresholding. The u-subproblem has the form \(\min _u~\frac{1}{2}\Vert Tu - b\Vert ^2 + \frac{1}{2}\Vert {\nabla }u - \text{ constant }\Vert ^2\), which is quadratic. If the operator \(\nabla \) satisfies the periodic boundary condition, then the normal equation of this quadratic subproblem can be diagonalized by the Fourier transform and has a closed-form solution; see [34]. However, since few images have a periodic boundary, \(\nabla \) generally does not satisfy the periodic boundary condition. A remedy is to introduce the operator \(\nabla _{\text {periodic}}\) that satisfies the condition and the indefinite matrix \(P=\beta (\nabla _{\text {periodic}}^T \nabla _{\text {periodic}}-\nabla ^T \nabla )\). Then the u-subproblem has a closed-form solution. Note that with this approach, the ADM still converges to the exact solution.

Goals of P and Q The general goal is to choose P and Q wisely so that the subproblems of Algorithm 2 become much easier to carry out and the entire algorithm runs in less time. In the ADM, the two subproblems can be solved in either order (but the order should be fixed throughout the iterations; see [35] for a counterexample). However, when one subproblem is solved less exactly than the other, Algorithm 2 tends to run faster if the less exact subproblem is solved later (assigned as step 5 of Algorithm 2), because at each iteration the ADM updates the variables in the Gauss-Seidel fashion. If the less exact subproblem is solved first, its relatively inaccurate solution affects the more exact subproblem, making its solution also inaccurate. Since the less exact subproblem should be assigned as the later step 5, more choices of P are needed than of Q, which is the case in this paper.

1.2 Summary of Results

Table 1 summarizes the four scenarios under which we study the linear convergence of Algorithm 2, and Table 2 specifies the linear convergent quantities for different types of matrices \(\hat{P}\), P, and Q, where
$$\begin{aligned} \hat{P}:=P+\beta A^TA \end{aligned}$$
is defined for the convenience of convergence analysis. \(P=0\) and \(Q=0\) correspond to exactly solving the x- and y-subproblems, respectively. Although \(P=0\) and \(\hat{P}\succ 0\) are different cases in Table 2, they may happen at the same time if A has full column rank; if so, apply the result under \(\hat{P}\succ 0\), which is stronger.

The conclusions in Table 2 are the quantities that converge either Q-linearly or R-linearly. The Q-linearly convergent quantities are joint quantities formed by multiple variables taken together, whereas the R-linearly convergent quantities are the individual variables \(x^k\), \(y^k\), and \({\uplambda }^k\).

Four scenarios of global linear convergence In scenario 1, only the function f needs to be strongly convex and have a Lipschitz continuous gradient; there is no assumption on g besides convexity. On the other hand, the matrix A must have full row rank. Roughly speaking, the full row rank of A ensures that the error of \({\uplambda }^k\) can be bounded from the x-side alone by applying the Lipschitz continuity of \(\nabla f\). One cannot remove this condition or relax it to the full row rank of [A, B] without additional assumptions. Consider the example of \(A = [1;0]\) and \(B= [0;1]\), where \([A, B]=I\) has full rank. Since \({\uplambda }^k_2\), the second entry of \({\uplambda }^k\), is not affected by f or \(\{x^k\}\) at all, there is no way to take advantage of the Lipschitz continuity of \(\nabla f\) to bound the error of \({\uplambda }^k_2\). In general, without the full row rank of A, a part of \({\uplambda }^k\) needs to be controlled from the y-side using properties of g.
Table 1

Four scenarios leading to linear convergence

Scenario | Strongly convex | Lipschitz continuous | Full row rank | Additional assumptions
1 | f | \(\nabla f\) | A | If \(Q\succ 0\), B has full column rank
2 | f, g | \(\nabla f\) | A | –
3 | f | \(\nabla f,\nabla g\) | – | B has full column rank
4 | f, g | \(\nabla f,\nabla g\) | – | –

Table 2

Summary of linear convergence results

Case | \(P,\hat{P}\) | Q | Q-linear convergence (any scenario 1–4)
1 | \(P=0\) | \(=0\) | \((Ax^k,{\uplambda }^k)\)
2 | \(\hat{P}\succ 0\) | \(=0\) | \((x^k,{\uplambda }^k)\)
3 | \(P= 0\) | \(\succ 0\) | \((Ax^k,y^k,{\uplambda }^k)\)
4 | \(\hat{P}\succ 0\) | \(\succ 0\) | \((x^k,y^k,{\uplambda }^k)\)

R-linear convergence (any scenario 1–4, all cases): \(x^k\), (\(y^k\) or \(By^k\))\(^*\), \({\uplambda }^k\)

\(^*\) In cases 1 and 2, scenario 1, R-linear convergence of \(y^k\) requires full column rank of B; otherwise, only \(By^k\) has R-linear convergence

Scenario 2 adds the strong convexity assumption on g. As a result, the remark in case 1 regarding the full column rank of B is no longer needed.

Both scenarios 3 and 4 assume that g is differentiable and \(\nabla g\) is Lipschitz continuous. As a result, the error of \({\uplambda }^k\) can be controlled by taking advantage of the Lipschitz continuity of both \(\nabla f\) and \(\nabla g\), and the full row rank assumption on A is no longer needed. On the other hand, scenarios 3 and 4 exclude problems with non-differentiable g. Compared to scenario 3, scenario 4 adds the strong convexity assumption on g and drops the remark on the full column rank of B.

Under scenario 1 with \(Q\succ 0\) and scenario 3, the remarks in Table 1 are needed essentially because \(y^k\) gets coupled with \(x^k\) and \({\uplambda }^k\) in certain inequalities in our convergence analysis. The full column rank of B helps bound the error of \(y^k\) by those of \(x^k\) and \({\uplambda }^k\).

Four cases When \(P=0\) (which corresponds to exactly solving the ADM x-subproblem), we have \(\hat{P}\succeq 0\) and only obtain linear convergence in Ax. However, when \(\hat{P}\succ 0\), linear convergence in x is obtained. When \(Q=0\) (which corresponds to exactly solving the ADM y-subproblem), y is not part of the Q-linearly convergent joint variable; but when \(Q\succ 0\), y becomes part of it.

1.3 Existing Rate-of-Convergence Results

Although there is extensive literature on the ADM and its applications, there were very few results on its rate of convergence until recently. The work [17] shows that, for a Jacobi version of the ADM applied to smooth functions with Lipschitz continuous gradients, the objective value descends at the rate of \(O(1/k)\), and that of an accelerated version descends at \(O(1/k^2)\). Then, the work [18] establishes the same rates for a Gauss-Seidel version and requires only one of the two objective functions to be smooth with a Lipschitz continuous gradient. These two works only consider the model in which the linear constraint coefficient matrices A and B are the identity matrix or its negative. Later, He and Yuan [25] show an \(O(1/k)\) rate in an ergodic sense for a quantity based on a variational inequality characterization. The work [24] shows that \(\Vert u^k-u^{k+1}\Vert ^2\), where \(u^k:=(x^k,y^k,{\uplambda }^k)\), of the ADM converges at \(O(1/k)\). The work [21] proves that the dual objective value of a modification to the ADM descends at \(O(1/k^2)\) under the assumption that the objective functions are strongly convex (one of them being quadratic) and both subproblems are solved exactly. The recent works [6, 7] obtain sublinear and linear rates of the Douglas-Rachford splitting method (DRSM) in a variety of senses, including the fixed-point residual and the objective error, and extend their rates to the ADM. In this paper, we show a linear rate of convergence \(O(1/c^k)\), for some \(c>1\), under a variety of scenarios in which at least one of the two objective functions is strongly convex and has a Lipschitz continuous gradient. This rate is stronger than sublinear rates such as \(O(1/k)\) and \(O(1/k^2)\), and it is given in terms of the solution error, which is stronger than rates given in terms of the objective error. On the other hand, [6, 17, 18, 24, 25] do not require any strong convexity. The fact that a wide range of applications give rise to model (1.1) with at least one strongly convex function has motivated this work.

There are many regularization problems, such as the LASSO model \(\min \Vert x\Vert _1+\frac{\mu }{2}\Vert Ax-b\Vert ^2\) and the total variation model \(\min \Vert \nabla x\Vert _1+\frac{\mu }{2}\Vert Ax-b\Vert ^2\), where neither objective function is strongly convex unless the matrix A has full column rank. However, one can combine our results with those in the recent papers [28, 35] to establish eventual linear convergence on the optimal manifold. Specifically, [35] shows that applying the ADM is equivalent to applying the DRSM to the same problem. Furthermore, [28] establishes that the DRSM identifies the optimal manifold in a finite number of iterations if the objective functions are partially smooth along the optimal manifold near the solution. The partial-smoothness condition holds for regularization functions such as \(\ell _1\), \(\ell _2\), and total variation, and it also holds trivially for smooth functions. Once the optimal manifold is identified, many problems become strongly convex on it. We do not pursue this direction in this paper.

The recent work [26] proves the linear convergence of ADM using a different approach. The linear convergence in [26] requires that the objective function take a certain form involving a strongly convex function and that the step size for updating the multipliers be sufficiently small (which is impractical), while no explicit linear rate is given. Its recent update additionally assumes a bounded sequence. On the other hand, it allows more than two blocks of separable variables and does not require strict convexity; instead, it requires the objective function to include f(Ex), where f is strongly convex and E is a possibly rank-deficient matrix.

It is worth mentioning that the ADM applied to linear programming is known to converge at a global linear rate [10]. For quadratic programming, work [1] presents an analysis leading to a conjecture that the ADM should converge linearly near the optimal solution. Our analysis in this paper is different from those in [1, 10].

The linear convergence of ADM was also established in the context of the DRSM [9] and the proximal point algorithm (PPA) [32] under certain conditions. It has been shown that the ADM is a special case of the DRSM applied to the Lagrange dual [13] and also of the DRSM applied to the original problem [35]. Further, the DRSM is a special case of the PPA [11]. Therefore, the linear convergence of ADM can be obtained from the existing linear convergence results of the DRSM and PPA [29, 32] under the conditions therein. However, it is unclear whether those results apply to the generalizations of ADM in Sect. 1.1. In addition, in Sect. 3.3 below, we review the result in [29] and show that our analysis covers significantly more cases and, in the overlapping case, yields a better linear rate.

1.4 The Penalty Parameter \(\beta \)

It is well known that the penalty parameter \(\beta \) can significantly affect the speed of the ADM. Since the rate of convergence developed in this paper is a function of \(\beta \), the rate can be optimized over \(\beta \). We give some examples in Sect. 3.2 below, which show that the rate of convergence is positively related to the strong convexity constants of f and g, and negatively related to the Lipschitz constants of \(\nabla f\) and \(\nabla g\) as well as the condition numbers of A, B, and [A, B]. More analysis and numerical simulations are left as future research.

1.5 Preliminary, Notation, and Assumptions

We let \(\langle \cdot ,\cdot \rangle \) denote the standard inner product, and let \(\Vert \cdot \Vert \) denote the \(\ell _2\)-norm \(\Vert \cdot \Vert _2\) (the Euclidean norm of a vector or the spectral norm of a matrix). In addition, we use \({\uplambda }_{\min }(M)\) and \({\uplambda }_{\max }(M)\) for the smallest and largest eigenvalues of a symmetric matrix M, respectively.

A function f is strongly convex with constant \(\nu >0\) if for all \(x_1,x_2\in \mathbb {R}^n\) and all \(t\in [0,1]\),
$$\begin{aligned} f(tx_1+(1-t)x_2)\le tf(x_1)+(1-t)f(x_2)-\frac{1}{2}\nu t(1-t)\Vert x_1-x_2\Vert ^2. \end{aligned}$$
(1.10)
For a differentiable function f, the gradient \(\nabla f\) is Lipschitz continuous with constant \(L_f>0\) if
$$\begin{aligned} \Vert \nabla f(x_1)-\nabla f(x_2)\Vert \le L_f\Vert x_1-x_2\Vert , \quad \forall x_1,x_2\in \mathbb {R}^n. \end{aligned}$$
(1.11)
If a convex function f has a Lipschitz continuous gradient with constant \(L_f\), then it satisfies:
$$\begin{aligned} \left\langle x_1-x_2,~\nabla f(x_1)-\nabla f(x_2)\right\rangle \ge \frac{1}{L_f}\Vert \nabla f(x_1)-\nabla f(x_2)\Vert ^2,\quad \forall x_1,x_2\in \mathbb {R}^n. \end{aligned}$$
(1.12)
Throughout the paper, we make the following standard assumptions.

Assumption 1

There exists a saddle point \(u^*:=(x^*,y^*,{\uplambda }^*)\) to problem (1.1), namely, \(x^*\), \(y^*\), and \({\uplambda }^*\) satisfy the KKT conditions:
$$\begin{aligned} A^T{\uplambda }^{*} \in \partial f(x^{*}), \end{aligned}$$
(1.13)
$$\begin{aligned} B^T{\uplambda }^{*}\in \partial g(y^{*}), \end{aligned}$$
(1.14)
$$\begin{aligned} Ax^{*}+By^{*}-b = 0. \end{aligned}$$
(1.15)
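
When f and g happen to be differentiable, the inclusions (1.13)–(1.14) become equalities, and the KKT conditions can be monitored numerically as a stopping criterion; a small sketch with hypothetical gradient callables:

```python
import numpy as np

def kkt_residuals(grad_f, grad_g, A, B, b, x, y, lam):
    """Norms of the KKT residuals (1.13)-(1.15), assuming differentiable f and g."""
    r_dual_x = np.linalg.norm(A.T @ lam - grad_f(x))   # residual of (1.13)
    r_dual_y = np.linalg.norm(B.T @ lam - grad_g(y))   # residual of (1.14)
    r_primal = np.linalg.norm(A @ x + B @ y - b)       # residual of (1.15)
    return r_dual_x, r_dual_y, r_primal
```
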

When Assumption 1 fails to hold, the ADM has either unsolvable or unbounded subproblems or a divergent sequence \(\{{\uplambda }^k\}\).

Assumption 2

Functions f and g are convex.

We define the scalars \(\nu _f\) and \(\nu _g\) as the convexity moduli of f and g, respectively. Following from (1.10),
$$\begin{aligned} \langle s_1 -s_2,x_1-x_2\rangle&\ge \nu _f\Vert x_1-x_2\Vert ^2, \quad \forall x_1,~x_2,~s_1\in \partial f(x_1), \quad s_2\in \partial f(x_2), \end{aligned}$$
(1.16)
$$\begin{aligned} \langle t_1 -t_2,y_1-y_2\rangle&\ge \nu _g\,\Vert y_1-y_2\Vert ^2, \quad \forall y_1,~y_2,~\;t_1\in \partial g(y_1),\quad t_2\in \partial g(y_2). \end{aligned}$$
(1.17)
From the convexity of f and g, it follows that \(\nu _f,\nu _g\ge 0\); these constants are used throughout Sect. 2. They are strictly positive if the functions are strongly convex. To show linear convergence, Sect. 3 uses \(\nu _f>0\) and, for scenarios 3 and 4, \(\nu _g>0\) as well. Indeed, we only use the properties of f and g over compact sets containing \(\{x^k\}\) and \(\{y^k\}\), not globally.

1.6 Organization

The rest of the paper is organized as follows. Section 2 shows the global convergence of the generalized ADM. Then Sect. 3, under the assumptions in Table 1, further proves the global linear convergence. Section 4 discusses several interesting applications that are covered by our linear convergence theory. In Sect. 5, we present some preliminary numerical results to demonstrate the linear convergence behavior of ADM. Finally, Sect. 6 concludes the paper.

2 Global Convergence

In this section, we show the global convergence of Algorithm 2. The proof steps are similar to the existing ADM convergence theory in [23, 37] but are adapted to Algorithm 2. Several inequalities in the section are used in the linear convergence analysis in the next section.

2.1 Convergence Analysis

For notation simplicity, we introduce
$$\begin{aligned} \hat{{\uplambda }}:={\uplambda }^k-\beta (Ax^{k+1}+By^{k+1}-b). \end{aligned}$$
(2.1)
If \(\gamma =1\), then \(\hat{{\uplambda }}={\uplambda }^{k+1}\); otherwise,
$$\begin{aligned} \hat{{\uplambda }}-{\uplambda }^{k+1}=(\gamma -1)\beta (Ax^{k+1}+By^{k+1}-b)=\left( 1-\frac{1}{\gamma }\right) ({\uplambda }^{k}-{\uplambda }^{k+1}). \end{aligned}$$
(2.2)
This relation between \(\hat{{\uplambda }}\) and \({\uplambda }^{k+1}\) is used frequently in our analysis. Let
$$\begin{aligned} u^*:=\begin{pmatrix} x^*\\ y^*\\ {\uplambda }^* \end{pmatrix}, \quad u^k:=\begin{pmatrix} x^k\\ y^k\\ {\uplambda }^k \end{pmatrix}, \quad \hat{u}:=\begin{pmatrix} x^{k+1}\\ y^{k+1}\\ \hat{{\uplambda }} \end{pmatrix}, \quad \text{ for } k=0,1,\ldots , \end{aligned}$$
(2.3)
where \(u^*\) is a KKT point, \(u^k\) is the current point, and \(\hat{u}\) is the next point as if \(\gamma =1\), and
$$\begin{aligned} G_0:=\begin{pmatrix} I_n &{}\quad &{}\quad \\ \quad &{}I_m &{}\quad \\ \quad &{}\quad &{}\gamma I_p \end{pmatrix}, \quad G_1:=\begin{pmatrix} \hat{P} &{}\quad &{}\quad \\ \quad &{}Q &{}\quad \\ \quad &{}\quad &{}\frac{1}{\beta }I_p \end{pmatrix}, \quad G:=G_0^{-1}G_1=\begin{pmatrix} \hat{P} &{}\quad &{}\quad \\ \quad &{}Q &{}\quad \\ \quad &{}\quad &{}\frac{1}{\beta \gamma }I_p \end{pmatrix}, \end{aligned}$$
(2.4)
where we recall \(\hat{P}=P+\beta A^TA\). From these definitions it follows
$$\begin{aligned} u^{k+1}=u^k-G_0(u^k-\hat{u}). \end{aligned}$$
(2.5)
We choose P, Q, and \(\beta \) such that \(\hat{P}\succeq 0\) and \(Q\succeq 0\). Hence \(G\succeq 0\) and \(\Vert \cdot \Vert _G\) is a (semi-)norm. The definitions of the matrix G and the G-norm are similar to those in the work [24]. The analysis is based on bounding the error \(\Vert u^k-u^*\Vert ^2_G\) and estimating its decrease.
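
For concreteness, the squared G-(semi)norm error used below can be evaluated directly from (2.4); a small sketch with illustrative names:

```python
import numpy as np

def G_error_sq(x, y, lam, x_star, y_star, lam_star, P_hat, Q, beta, gamma):
    """Squared G-(semi)norm error ||u - u*||_G^2 with G = diag(P_hat, Q, I/(beta*gamma)), cf. (2.4)."""
    dx, dy, dl = x - x_star, y - y_star, lam - lam_star
    return dx @ (P_hat @ dx) + dy @ (Q @ dy) + dl @ dl / (beta * gamma)
```
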

Lemma 2.1

Under Assumptions 1 and 2, the sequence \(\{u^k\}\) of Algorithm 2 obeys
  1. i)
    $$\begin{aligned}&A^T\hat{{\uplambda }}+P\left( x^k-x^{k+1}\right) \in \partial f(x^{k+1}),\end{aligned}$$
    (2.6)
    $$\begin{aligned}&B^T\left( \hat{{\uplambda }}-\beta A(x^{k}-x^{k+1})\right) +Q(y^k-y^{k+1})\in \partial g(y^{k+1}). \end{aligned}$$
    (2.7)
     
  2. ii)
    $$\begin{aligned}&\langle x^{k+1}-x^*, ~A^T(\hat{{\uplambda }}-{\uplambda }^*)+P\left( x^k-x^{k+1}\right) \rangle \ge \nu _f\Vert x^{k+1}-x^*\Vert ^2, \end{aligned}$$
    (2.8)
    $$\begin{aligned}&\langle y^{k+1}-y^{*},~B^T\left( \hat{{\uplambda }}-{\uplambda }^*-\beta A(x^{k}-x^{k+1})\right) +Q\left( y^k-y^{k+1}\right) \rangle \ge \nu _g\Vert y^{k+1}-y^*\Vert ^2. \end{aligned}$$
    (2.9)
     
  3. iii)
    $$\begin{aligned} A\left( x^{k+1}-x^*\right) +B\left( y^{k+1}-y^*\right) =\frac{1}{\beta }\left( {\uplambda }^k-\hat{{\uplambda }}\right) . \end{aligned}$$
    (2.10)
     
  4. iv)
    $$\begin{aligned} \Vert u^k-u^*\Vert _G^2-\Vert u^{k+1}-u^*\Vert _G^2\ge h(u^k-\hat{u})+2\nu _f\Vert x^{k+1}-x^{*}\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2, \end{aligned}$$
    (2.11)
    where
    $$\begin{aligned} h(u^k-\hat{u}):= & {} \Vert x^k-x^{k+1}\Vert _{\hat{P}}^2+\Vert y^k-y^{k+1}\Vert _Q^2+\frac{2-\gamma }{\beta }\Vert {\uplambda }^k-\hat{{\uplambda }}\Vert ^2\nonumber \\&+\,2\left( {\uplambda }^k-\hat{{\uplambda }}\right) ^TA\left( x^k-x^{k+1}\right) . \end{aligned}$$
    (2.12)
     

The proof of this lemma is given in “Appendix 1”. In the next theorem, we show that \(\Vert u^k-u^*\Vert _G\) has sufficient descent. Technically, this is done by bounding \(h(u^k-\hat{u})\) from below, applying the Cauchy–Schwarz inequality to its cross term \(2({\uplambda }^k-\hat{{\uplambda }})^TA(x^k-x^{k+1})\). If \(P=\mathbf {0}\), a more refined bound is obtained, which gives \(\gamma \) a wider range of convergence. See “Appendix 2” for the details of the proof.

Theorem 2.1

(Sufficient descent of \(\Vert u^k-u^*\Vert _G\)). Suppose that Assumptions 1 and 2 hold.
  1. i)
    When \(P\not =\mathbf {0}\), if \(\gamma \) obeys
    $$\begin{aligned} (2-\gamma )P\succ (\gamma -1)\beta A^TA \end{aligned}$$
    (2.13)
    (see Remark 2.1 below for simplification), then there exists \(\eta >0\) such that
    $$\begin{aligned} \Vert u^k-u^*\Vert _G^2-\Vert u^{k+1}-u^*\Vert _G^2\ge & {} \eta \Vert u^k-u^{k+1}\Vert _G^2+2\nu _f\Vert x^{k+1}-x^{*}\Vert ^2\nonumber \\&+\,2\nu _g\Vert y^{k+1}-y^*\Vert ^2. \end{aligned}$$
    (2.14)
     
  2. ii)
    When \(P=\mathbf {0}\), if
    $$\begin{aligned} \gamma \in \left( 0,\frac{1+\sqrt{5}}{2}\right) , \end{aligned}$$
    (2.15)
then there exists \(\eta >0\) such that
    $$\begin{aligned}&\left( \Vert u^k-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert r^k\Vert ^2\right) - \left( \Vert u^{k+1}-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert r^{k+1}\Vert ^2\right) \nonumber \\&\quad {\ge ~\eta \Vert u^k-u^{k+1}\Vert _G^2+2\nu _f\Vert x^k-x^{k+1}\Vert ^2+2\nu _f\Vert x^{k+1}-x^*\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2,} \end{aligned}$$
    (2.16)
    where \(r^k\) is the residual at iteration k:
    $$\begin{aligned} r^k:=Ax^k+By^k - b. \end{aligned}$$
    If we set \(\gamma =1\), then we have
    $$\begin{aligned}&\Vert u^k-u^*\Vert _G^2 - \Vert u^{k+1}-u^*\Vert _G^2\ge \Vert u^k-u^{k+1}\Vert _G^2+2\nu _f\Vert x^k-x^{k+1}\Vert ^2\nonumber \\&\quad +\,2\nu _f\Vert x^{k+1}-x^*\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2. \end{aligned}$$
    (2.17)
     

Now the sufficient descent of \(\Vert u^k-u^*\Vert _G\) in Theorem 2.1 is used to yield the global convergence of Algorithm 2.

Theorem 2.2

(Global convergence of Algorithm 2) Suppose that Assumptions 1 and 2 hold and that the sequence \(\{u^k\}\) of Algorithm 2 is bounded (see Remark 2.2 below). For any \(\gamma \) satisfying the conditions given in Theorem 2.1, \(\{u^k\}\) converges to a KKT point \(u^*\) of (1.1) in the G-norm, namely,
$$\begin{aligned}\Vert u^k-u^*\Vert _G\rightarrow 0.\end{aligned}$$
It further follows that
  1. (a)

    \({\uplambda }^k\rightarrow {{\uplambda }^*}\), regardless of the choice of P and Q;

     
  2. (b)

    when \(P\not =\mathbf {0}\), \(x^k\rightarrow {x^*}\); otherwise, \(Ax^k\rightarrow A{x^*}\);

     
  3. (c)

    when \(Q\succ \mathbf {0}\), \(y^k\rightarrow {y^*}\); when \(Q=\mathbf {0}\), \(By^k\rightarrow B{y^*}\).

     

Proof

Being bounded, \(\{u^k\}\) has a converging subsequence \(\{u^{k_j}\}\). Let \(\bar{u}=\lim _{j\rightarrow \infty }u^{k_j}\). Next, we will show \(\bar{u}\) is a KKT point. Let \(u^*\) denote an arbitrary KKT point.

Consider \(P\not =\mathbf {0}\) first. From (2.14) we conclude that \(\Vert u^k-u^*\Vert ^2_G\) is monotonically nonincreasing and thus converging, and due to \(\eta >0\), \(\Vert u^k-u^{k+1}\Vert ^2_G\rightarrow 0\). In light of (2.4) where \(\hat{P}\succeq 0\) and \(Q\succeq 0\), we obtain \({\uplambda }^k-{\uplambda }^{k+1}\rightarrow 0\) or equivalently,
$$\begin{aligned} Ax^{k+1}+By^{k+1}-b\rightarrow 0,\quad \text {as}~k\rightarrow \infty . \end{aligned}$$
(2.18)
Now consider \(P=\mathbf {0}\). From (2.16) we conclude that \(\Vert u^k-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert r^k\Vert ^2\) is monotonically nonincreasing and thus converging. Due to \(\eta >0\), \(\Vert u^k-u^{k+1}\Vert ^2_G\rightarrow 0\), so \({\uplambda }^k-{\uplambda }^{k+1}\rightarrow 0\) and (2.18) holds as well. Consequently, \(\Vert u^k-u^*\Vert _G^2\) also converges.
Therefore, passing to the limit in (2.18) over the subsequence, we have, whether \(P=\mathbf {0}\) or not:
$$\begin{aligned} A\bar{x}+B\bar{y}-b=0. \end{aligned}$$
(2.19)
Recall the optimality conditions (2.7) and (2.6):
$$\begin{aligned} B^T\hat{{\uplambda }}-\beta B^TA\left( x^{k}-x^{k+1}\right) +Q\left( y^k-y^{k+1}\right) \in \partial g\left( y^{k+1}\right) ,\\ A^T\hat{{\uplambda }}+P\left( x^k-x^{k+1}\right) \in \partial f\left( x^{k+1}\right) . \end{aligned}$$
Since \(\Vert u^k-u^{k+1}\Vert ^2_G\rightarrow 0\), in light of the definition of G (2.4), we have the following:
  • when \(P=\mathbf {0}\), \(A(x^{k}-x^{k+1})\rightarrow \mathbf {0}\);

  • when \(P\not =\mathbf {0}\), the condition (2.13) guarantees \(\hat{P}\succ \mathbf {0}\) and thus \(x^k-x^{k+1}\rightarrow 0\);

  • since \(Q\succeq \mathbf {0}\), we obtain \(Q(y^k-y^{k+1})\rightarrow 0\).

In summary, \(\beta B^TA(x^{k}-x^{k+1})\), \(Q(y^k-y^{k+1})\), and \(P(x^k-x^{k+1})\) are either 0 or converging to 0 in k, whether \(P=\mathbf {0}\) or not.
Now, taking limits on both sides of (2.6) and (2.7) over the subsequence and applying Theorem 24.4 of [33], we obtain:
$$\begin{aligned}&B^T\bar{{\uplambda }}\in \partial g(\bar{y}), \end{aligned}$$
(2.20)
$$\begin{aligned}&A^T\bar{{\uplambda }}\in \partial f(\bar{x}). \end{aligned}$$
(2.21)
Therefore, together with (2.19), \(\bar{u}\) satisfies the KKT condition of (1.1).

Since \(\bar{u}\) is a KKT point, we can now let \(u^*=\bar{u}\). From \(u^{k_j}\rightarrow \bar{u}\) in j and the convergence of \(\Vert u^k-u^*\Vert ^2_G\) it follows \(\Vert u^k-u^*\Vert ^2_G\rightarrow 0\) in k.

By the definition of G, \(\Vert u^k-{u^*}\Vert ^2_G\rightarrow 0\) implies the following:
  1. (a)

    \({\uplambda }^k\rightarrow {{\uplambda }^*}\), regardless of the choice of P and Q;

     
  2. (b)

    when \(P\not =\mathbf {0}\), condition (2.13) guarantees \(\hat{P}\succ \mathbf {0}\) and thus \(x^k\rightarrow {x^*}\); when \(P=\mathbf {0}\), \(Ax^k\rightarrow A{x^*}\);

     
  3. (c)

    when \(Q\succ \mathbf {0}\), \(y^k\rightarrow {y^*}\); when \(Q=\mathbf {0}\), \(By^k\rightarrow B{y^*}\) following from (2.18) and (2.19).

     

Remark 2.1

Let us discuss the conditions on \(\gamma \). If \(P\succ 0\), the condition (2.13) is always satisfied for \(0<\gamma \le 1\). Moreover, in this case, \(\gamma \) can be greater than 1, which often leads to faster convergence in practice. If \(P\not \succ 0\), the condition (2.13) requires \(\gamma \) to lie in \((0,\bar{\gamma })\), where \(0<\bar{\gamma }<1\) depends on \(\beta \), P, and \(A^TA\). A larger \(\beta \) would allow a larger \(\bar{\gamma }\).

In particular, in the prox-linear ADM, where the x-subproblem is solved by (1.4), condition (2.13) is guaranteed by
$$\begin{aligned} \tau \Vert A\Vert ^2+\gamma <2. \end{aligned}$$
(2.22)
In Gradient-descent ADM where the x-subproblem has the form of (1.8), a sufficient condition for (2.13) is given by
$$\begin{aligned} \frac{\beta \Vert A\Vert ^2}{\frac{1}{\alpha }-\Vert H_f\Vert }+\gamma <2. \end{aligned}$$
(2.23)

Remark 2.2

The assumption on the boundedness of the sequence \(\{u^k\}\) can be guaranteed by various conditions. Since (2.14) and (2.16) imply that \(\Vert u^k-u^*\Vert ^2_G\) is bounded, \(\{u^k\}\) must be bounded if \(\hat{P}\succ 0\) and \(Q\succ 0\). Furthermore, if \(P=\mathbf {0}\) and \(Q=\mathbf {0}\), we have the boundedness of \(\{(Ax^k,{\uplambda }^k)\}\) (since \(\Vert u^k-u^*\Vert ^2_G\) is bounded) and that of \(\{By^k\}\) by (2.10), so in this case, \(\{u^k\}\) is bounded if
  1. (i)

    matrix A has full column rank whenever \(P=\mathbf {0}\); and

     
  2. (ii)

    matrix B has full column rank whenever \(Q=\mathbf {0}\).

     
In addition, the boundedness of \(\{u^k\}\) is guaranteed if the objective functions are coercive.

3 Global Linear Convergence

In this section, we establish the global linear convergence results for Algorithm 2 that are described in Tables 1 and 2. We take three steps. First, using (2.14) for \(P\not =0\) and (2.16) for \(P=0\), as well as the assumptions in Table 1, we show that there exists \(\delta >0\) such that
$$\begin{aligned} \Vert u^k-u^*\Vert _G^2\ge (1+\delta )\Vert u^{k+1}-u^*\Vert _G^2, \end{aligned}$$
(3.1)
where \(u^*=\lim _{k\rightarrow \infty }u^k\) is given by Theorem 2.2. We call (3.1) the Q-linear convergence of \(\{u^k\}\) in G-(semi)norm. Next, using (3.1) and the definition of G, we obtain the Q-linear convergent quantities in Table 2. Finally, the R-linear convergence in Table 2 is established.

3.1 Linear Convergence in G-(Semi)Norm

We first assume \(\gamma =1\), which allows us to simplify the proof presentation. At the end of this subsection, we explain why the results for \(\gamma =1\) can be extended to \(\gamma \not =1\) that satisfies the conditions of Theorem 2.1. Note that for \(\gamma =1\), we have (2.17) instead of (2.16). Hence, no matter \(P=0\) or \(P\not =0\), both inequalities (2.14) and (2.17) have the form
$$\begin{aligned}\Vert u^k-u^*\Vert _G^2 - \Vert u^{k+1}-u^*\Vert _G^2\ge C,\end{aligned}$$
where C stands for their right-hand sides. To show (3.1), it is sufficient to establish
$$\begin{aligned} C\ge \delta \Vert u^{k+1}-u^*\Vert _G^2. \end{aligned}$$
(3.2)
The challenge is that \(\Vert u^{k+1}-u^*\Vert _G^2\) is the sum of \(\Vert x^{k+1}-x^*\Vert _{\hat{P}}^2\), \(\Vert y^{k+1}-y^*\Vert _{Q}^2\), and \(\frac{1}{\beta \gamma }\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\), but C does not contain terms like \(\Vert y^{k+1}-y^*\Vert ^2\) and \(\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\). Therefore, we shall bound \(\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\) and \(\Vert y^{k+1}-y^*\Vert _{Q}^2\) from the existing terms in C or using the strong convexity assumptions. This is done in a series of lemmas below.

Lemma 3.1

(For scenario 1, cases 3 and 4, and scenario 3) Suppose that B has full column rank. For any \(\mu _1>0\), we have
$$\begin{aligned} \Vert y^{k+1}-y^*\Vert ^2\le c_1\Vert x^{k+1}-x^*\Vert ^2+c_2\Vert {\uplambda }^k-{\uplambda }^{k+1}\Vert ^2, \end{aligned}$$
(3.3)
where \(c_1:= (1+\frac{1}{\mu _1})\Vert A\Vert ^2\cdot {\uplambda }^{-1}_{\min }(B^TB)>0\) and \(c_2:=(1+\mu _1)(\beta \gamma )^{-2}\cdot {\uplambda }^{-1}_{\min }(B^TB)>0\).

Proof

By (2.10), we have \(\Vert B(y^{k+1}-y^*)\Vert ^2=\Vert A(x^{k+1}-x^*)-\frac{1}{\beta \gamma }({\uplambda }^k-{\uplambda }^{k+1})\Vert ^2\). Then apply the following inequality (or the Cauchy–Schwarz inequality):
$$\begin{aligned} \Vert u+v\Vert ^2\le \left( 1+\frac{1}{\mu _1}\right) \Vert u\Vert ^2+(1+\mu _1)\Vert v\Vert ^2,\quad \forall \mu _1>0, \end{aligned}$$
(3.4)
to its right-hand side.

Lemma 3.2

(For scenarios 1 and 2) Suppose that \(\nabla f\) is Lipschitz continuous with constant \(L_f\) and A has full row rank. For any \(\mu _2>1\), we have
$$\begin{aligned} \Vert \hat{{\uplambda }}-{\uplambda }^*\Vert ^2\le c_3\Vert x^{k+1}-x^*\Vert ^2+c_4\Vert x^k-x^{k+1}\Vert ^2, \end{aligned}$$
(3.5)
where \(c_3:=L_f^2(1-\frac{1}{\mu _2})^{-1}{\uplambda }^{-1}_{\min }(AA^T)>0\) and \(c_4:=\mu _2\Vert P\Vert ^2{\uplambda }^{-1}_{\min }(AA^T)>0\).

Proof

By the optimality conditions (1.13) and (2.6) together with the Lipschitz continuity of \(\nabla f\), we have
$$\begin{aligned} \Vert A^T\left( \hat{{\uplambda }}-{\uplambda }^*\right) +P\left( x^k-x^{k+1}\right) \Vert ^2=\Vert \nabla f(x^{k+1})-\nabla f(x^*)\Vert ^2\le L_f^2\Vert x^{k+1}-x^*\Vert ^2. \end{aligned}$$
(3.6)
Then apply the following basic inequality:
$$\begin{aligned} \Vert u+v\Vert ^2\ge \left( 1-\frac{1}{\mu _2}\right) \Vert u\Vert ^2+(1-\mu _2)\Vert v\Vert ^2,\quad \forall \mu _2>0, \end{aligned}$$
(3.7)
to the left hand side of (3.6). We require \(\mu _2>1\) so that \((1-\frac{1}{\mu _2})>0\).

Lemma 3.3

(For scenarios 3 and 4) Suppose \(\nabla f\) and \(\nabla g\) are Lipschitz continuous, and the initial multiplier \({\uplambda }^0\) is in the range space of [A, B] (letting \({\uplambda }^0=0\) suffices). For any \(\mu _3>1\) and \(\mu _4>0\), we have
$$\begin{aligned} \Vert \hat{{\uplambda }}-{\uplambda }^*\Vert ^2\le c_5\Vert x^k-x^{k+1}\Vert ^2 +c_6\Vert y^k-y^{k+1}\Vert _Q^2+c_7\Vert x^{k+1}-x^*\Vert ^2+c_8\Vert y^{k+1}-y^*\Vert ^2, \end{aligned}$$
(3.8)
where \(c_5=\mu _3(1+\frac{1}{\mu _4})\Vert [P^T,-\beta A^TB]\Vert ^2\bar{c}>0\), \(c_6=\mu _3(1+\mu _4)\Vert Q\Vert ^2\bar{c}\ge 0\), \(c_7=(1-\frac{1}{\mu _3})^{-1}L_f^2\bar{c}>0\), \(c_8=(1-\frac{1}{\mu _3})^{-1}L_g^2\bar{c}>0\), and \(\bar{c}>0\) is as follows:
  • If the matrix [A, B] has full row rank, \(\bar{c}:={\uplambda }_{\min }^{-1}([A,B][A,B]^T)>0\).

  • Otherwise, \(\text{ rank }([A,B])=r<p\). Without loss of generality, assuming the first r rows of [A, B] (denoted by \([A_r,B_r]\)) are linearly independent, we have
    $$\begin{aligned}{}[A,B]= \begin{bmatrix} I\\ L \end{bmatrix}[A_r,B_r], \end{aligned}$$
    where \(I\in \mathbb {R}^{r\times r}\) is the identity matrix and \(L\in \mathbb {R}^{(p-r)\times r}\). Let \(E:=(I+L^TL)[A_r,B_r]\), and \(\bar{c}:={\uplambda }_{\min }^{-1}(EE^T)\Vert I+L^TL\Vert >0\).

Proof

We first show that
$$\begin{aligned} \Vert \hat{{\uplambda }}-{\uplambda }^*\Vert ^2\le \bar{c}\cdot \left\| \begin{bmatrix} A^T\\B^T \end{bmatrix}(\hat{{\uplambda }}-{\uplambda }^*)\right\| ^2, \end{aligned}$$
(3.9)
where \(\bar{c}>0\) is defined above. If [A, B] has full row rank, it is trivial. Now, suppose [A, B] is rank deficient, i.e., \(\text{ rank }([A,B])=r<p\). Without loss of generality, we assume the first r rows of [A, B] (denoted by \([A_r,B_r]\)) are linearly independent, and thus
$$\begin{aligned}{}[A,B]= \begin{bmatrix} I\\L\end{bmatrix}[A_r,B_r]. \end{aligned}$$
By the update formula, if the initial multiplier \({\uplambda }^0\) is in the range space of [A, B], then every \({\uplambda }^k\), \(k=1,2,\ldots \), stays in the range space of [A, B], and so do \(\hat{{\uplambda }}\) and \({\uplambda }^*\). It follows that
$$\begin{aligned} {\uplambda }^k = \begin{bmatrix} I\\L\end{bmatrix}{\uplambda }_r^k,\quad \hat{{\uplambda }} = \begin{bmatrix} I\\L\end{bmatrix}\hat{{\uplambda }}_r,\quad {\uplambda }^* = \begin{bmatrix} I\\L\end{bmatrix}{\uplambda }_r^*. \end{aligned}$$
and thus
$$\begin{aligned} \begin{bmatrix} A^T\\ B^T \end{bmatrix}(\hat{{\uplambda }}-{\uplambda }^*)=\begin{bmatrix} A_r^T\\ B_r^T \end{bmatrix}(I+L^TL)(\hat{{\uplambda }}_r-{\uplambda }_r^*). \end{aligned}$$
Since \(E:=(I+L^TL)[A_r,B_r]\) has full row rank, we have \(\bar{c}:={\uplambda }_{\min }^{-1}(EE^T)\Vert I+L^TL\Vert >0\) and (3.9) follows immediately.
Combining the optimality conditions (1.14), (1.13), (2.6), and (2.7) together with the Lipschitz continuity of \(\nabla f\) and \(\nabla g\), we have
$$\begin{aligned}&\left\| \begin{bmatrix} A^T\\B^T \end{bmatrix}(\hat{{\uplambda }}-{\uplambda }^*)+ \begin{bmatrix} P\\ - \beta B^TA \end{bmatrix}(x^k-x^{k+1})+ \begin{bmatrix} \mathbf {0}\\Q \end{bmatrix}(y^k-y^{k+1})\right\| ^2\nonumber \\&\quad =\Vert \nabla f(x^{k+1})-\nabla f(x^*)\Vert ^2+\Vert \nabla g(y^{k+1})-\nabla g(y^*)\Vert ^2\nonumber \\&\quad \le L^2_f\Vert x^{k+1}-x^*\Vert ^2+L^2_g\Vert y^{k+1}-y^*\Vert ^2. \end{aligned}$$
(3.10)
Similarly, we apply the basic inequalities (3.4) and (3.7) to its left hand side and use (3.9).

With the above lemmas, we now prove the following main theorem of this subsection.

Theorem 3.1

(Q-linear convergence of \(\Vert u^k-u^*\Vert _G\)) Under the same assumptions as Theorem 2.2 and with \(\gamma =1\), for all scenarios in Table 1, there exists \(\delta >0\) such that (3.1) holds.

Proof

Consider the case of \(P=0\) and the corresponding inequality (2.17). In this case \(\hat{P}=\beta A^T A\succeq 0\). Let C denote the right-hand side of (2.17).

Scenarios 1 and 2 (recall in both scenarios, f is strongly convex, \(\nabla f\) is Lipschitz continuous, and A has full row rank). Note that C contains the terms on the right side of (3.5) with strictly positive coefficients. Hence, applying Lemma 3.2 to C, we can obtain
$$\begin{aligned} C\ge & {} (c_{9}\Vert x^{k+1}-x^*\Vert ^2 +c_{10}\Vert y^{k+1}-y^*\Vert ^2+ c_{11}\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2)+(c_{12}\Vert y^{k}-y^{k+1}\Vert _Q^2\nonumber \\&+\,c_{13}\Vert {\uplambda }^k-{\uplambda }^{k+1}\Vert ^2) \end{aligned}$$
(3.11)
with \(c_9,c_{11}>0\), \(c_{10}=2\nu _g\ge 0\), \(c_{12}=\eta >0\), and \(c_{13}=\eta /(\beta \gamma )> 0\). We have \(c_{9}>0\) because only a fraction of \(2\nu _f\Vert x^{k+1}-x^*\Vert ^2\) is used with Lemma 3.2; the remaining fraction \(c_{9}\Vert x^{k+1}-x^*\Vert ^2\) stays. The same principle is applied below to obtain strictly positive coefficients, and we do not restate it. For brevity, we do not always specify the values of the \(c_i\).

For scenario 1 with \(Q=0\), \(\Vert u^{k+1}-u^*\Vert ^2_G=\Vert x^{k+1}-x^*\Vert _{\hat{P}}^2+\frac{1}{\beta \gamma }\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\). Since \(\Vert x^{k+1}-x^*\Vert ^2 \ge {\uplambda }_{\max }(\hat{P})^{-1}\Vert x^{k+1}-x^*\Vert _{\hat{P}}^2\), (3.2) follows from (3.11) with \(\delta =\min \{c_{9}{\uplambda }^{-1}_{\max }(\hat{P}),c_{11}\beta \gamma \}>0\).

For scenario 1 with \(Q\succ 0\), \(\Vert u^{k+1}-u^*\Vert ^2_G=\Vert x^{k+1}-x^*\Vert _{\hat{P}}^2+\Vert y^{k+1}-y^*\Vert _{Q}^2+\frac{1}{\beta \gamma }\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\). Since \(c_{10}\) is not necessarily strictly positive, we shall apply Lemma 3.1 to (3.11) and obtain
$$\begin{aligned} C\ge (c_{14}\Vert x^{k+1}-x^*\Vert ^2 +c_{15}\Vert y^{k+1}-y^{*}\Vert ^2+c_{11}\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2)+c_{12}\Vert y^{k}-y^{k+1}\Vert ^2_Q \end{aligned}$$
(3.12)
where \(c_{14},c_{15},c_{11},c_{12}>0\). So, it leads to (3.1) with \(\delta =\min \{c_{14}{\uplambda }^{-1}_{\max }(\hat{P}),c_{15}{\uplambda }^{-1}_{\max }(Q),c_{11}\beta \gamma \}>0\).

Scenario 2 (recall it is scenario 1 plus that g is strongly convex). We have \(c_{10}=2\nu _g>0\) in (3.11), which gives (3.1) with \(\delta =\min \{c_{9}{\uplambda }^{-1}_{\max }(\hat{P}),c_{10}{\uplambda }^{-1}_{\max }(Q),c_{11}\beta \gamma \}>0\). Note that we have used the convention that if \(Q=0\), then \({\uplambda }^{-1}_{\max }(Q)=\infty \).

Scenario 3 (recall f is strongly convex, and both \(\nabla f\) and \(\nabla g\) are Lipschitz continuous). We apply Lemma 3.1 to bound \(\Vert y^{k+1}-y^*\Vert ^2\) and then apply Lemma 3.3 to obtain
$$\begin{aligned} C\ge c_{16}\Vert x^{k+1}-x^*\Vert ^2 +c_{17}\Vert y^{k+1}-y^{*}\Vert ^2+c_{18}\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2, \end{aligned}$$
(3.13)
where \(c_{16},c_{17},c_{18}>0\) and the terms \(\Vert x^{k}-x^{k+1}\Vert ^2 \), \(\Vert y^{k}-y^{k+1}\Vert ^2 \), and \(\Vert {\uplambda }^{k}-{\uplambda }^{k+1}\Vert ^2\) with nonnegative coefficients have been dropped from the right-hand side of (3.13). From (3.13), we obtain (3.1) with \(\delta =\min \{c_{16}{\uplambda }^{-1}_{\max }(\hat{P}),c_{17}{\uplambda }^{-1}_{\max }(Q),c_{18}\beta \gamma \}>0\).

Scenario 4 (recall it is scenario 3 plus that g is strongly convex). Since \(c_{10}=2\nu _g>0\) in (3.11), we can directly apply Lemma 3.3 to get (3.1) with \(\delta >0\) in a way similar to scenario 3.

Now consider the case of \(P\not =0\) and the corresponding inequality (7.5). Inequalities (7.5) and (2.17) are similar, except that (2.17) has the extra term \(\Vert x^k-x^{k+1}\Vert ^2\) with a strictly positive coefficient on its right-hand side. This term is needed when Lemma 3.2 is applied. However, the assumptions of the theorem ensure \(\hat{P}\succ 0\) whenever \(P\not =0\). Therefore, in (7.5), the term \(\Vert u^{k}-u^{k+1}\Vert _G^2\), which contains \(\Vert x^k-x^{k+1}\Vert _{\hat{P}}^2\), can spare a term \(c_{19}\Vert x^k-x^{k+1}\Vert ^2\) with \(c_{19}>0\). Therefore, following the same arguments as in the case of \(P=0\), we get (3.1) with a certain \(\delta >0\).

Now we extend the result in Theorem 3.1 (which is under \(\gamma =1\)) to \(\gamma \ne 1\) in the following theorem.

Theorem 3.2

Under the same assumptions as Theorem 2.2 and with \(\gamma \not =1\), for all scenarios in Table 1,
  1. i)

    if \(P\not =0\), there exists \(\delta >0\) such that (3.1) holds;

     
  2. ii)
    if \(P=0\), there exists \(\delta > 0\) such that
    $$\begin{aligned} \Vert u^k-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert r^k\Vert ^2 \ge (1+\delta )\left( \Vert u^{k+1}-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert r^{k+1}\Vert ^2 \right) . \end{aligned}$$
    (3.14)
     

Proof

When \(\gamma \not =1\), we have \({\uplambda }^{k+1}\not =\hat{{\uplambda }}\). We need to bound \(\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\), but Lemmas 3.2 and 3.3 only give bounds on \(\Vert \hat{{\uplambda }}-{\uplambda }^*\Vert ^2\). Noticing that \((\hat{{\uplambda }}-{\uplambda }^*)-({\uplambda }^{k+1}-{\uplambda }^*)=\hat{{\uplambda }}-{\uplambda }^{k+1}=(\gamma -1)\beta r^{k+1}\) and that C contains a strictly positive term in \(\Vert {\uplambda }^k -{\uplambda }^{k+1}\Vert ^2 = \gamma ^2\beta ^2\Vert r^{k+1}\Vert ^2\), we can bound \(\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\) by a positively weighted sum of \(\Vert \hat{{\uplambda }}-{\uplambda }^*\Vert ^2\) and \(\Vert {\uplambda }^k -{\uplambda }^{k+1}\Vert ^2\).

If \(P\not =0\), the rest of the proof follows from that of Theorem 3.1.

If \(P=0\), \(\gamma \not =1\) leads to (2.16), which extends \(\Vert u^i-u^*\Vert _G^2\) in (2.17) to \(\Vert u^i-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert r^i\Vert ^2\), for \(i=k,k+1\). Since C contains \(\Vert {\uplambda }^k -{\uplambda }^{k+1}\Vert ^2 = \gamma ^2\beta ^2\Vert r^{k+1}\Vert ^2\) with a strictly positive coefficient, one obtains (3.14) by using this term and following the proof of Theorem 3.1.

3.2 Explicit Formula of Convergence Rate

To keep the proof of Theorem 3.1 easy to follow, we have avoided giving the explicit formulas of the \(c_i\)'s and hence of \(\delta \). To give the reader an idea of which quantities affect \(\delta \), we now provide an explicit formula of \(\delta \) for the classic ADM (i.e., case 1 with \(\gamma =1\)) under scenario 1.

Corollary 3.1

(Convergence rate of classic ADM under scenario 1) Under Assumptions 1 and 2, for scenario 1 in Table 1, the sequence \(\{u^k\}\) of Algorithm 1 satisfies (3.1) with
$$\begin{aligned} \delta =2\left( \frac{\beta \Vert A\Vert ^2}{\nu _f}+\frac{L_f}{\beta {\uplambda }_{\min }(AA^T)}\right) ^{-1}. \end{aligned}$$
(3.15)
In particular, choosing \(\beta = \sqrt{\frac{L_f\nu _f}{\Vert A\Vert ^2{\uplambda }_{\min }(AA^T)}}\) yields the largest \(\delta \):
$$\begin{aligned} \delta _{\max } = \frac{1}{\kappa _A\sqrt{\kappa _f}}, \end{aligned}$$
(3.16)
where \(\kappa _A:=\sqrt{{\uplambda }_{\max }(AA^T)/{\uplambda }_{\min }(AA^T)}\) is the condition number of matrix A, and \(\kappa _f=L_f/\nu _f\) is the condition number of function f.

Proof

Recall the important inequality (2.17) in Theorem 2.1:
$$\begin{aligned} \Vert u^k-u^*\Vert _G^2 - \Vert u^{k+1}-u^*\Vert _G^2\ge & {} 2\nu _f\Vert x^{k+1}-x^*\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2\nonumber \\&+\,\Vert u^k-u^{k+1}\Vert _G^2+2\nu _f\Vert x^k-x^{k+1}\Vert ^2. \end{aligned}$$
(3.17)
Note that the term \(\nu _f\Vert x^{k+1}-x^*\Vert ^2\) on the right-hand side comes from (2.8):
$$\begin{aligned} \langle x^{k+1}-x^*, ~A^T ({\uplambda }^{k+1}-{\uplambda }^*)\rangle \ge \nu _f\Vert x^{k+1}-x^*\Vert ^2, \end{aligned}$$
(3.18)
due to the strong convexity of f and the optimality conditions:
$$\begin{aligned} A^T{\uplambda }^{k+1}=\nabla f(x^{k+1}),~A^T{\uplambda }^*=\nabla f(x^*). \end{aligned}$$
On the other hand, since \(\nabla f\) is Lipschitz continuous and A has full row rank, using (1.12) yields
$$\begin{aligned} \langle x^{k+1}-x^*, ~A^T ({\uplambda }^{k+1}-{\uplambda }^*)\rangle \ge \frac{1}{L_f}\Vert A^T({\uplambda }^{k+1}-{\uplambda }^*)\Vert ^2 \ge \frac{{\uplambda }_{\min }(AA^T)}{L_f}\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2. \end{aligned}$$
(3.19)
By combining (3.18) and (3.19), it follows that for any \(t\in [0,1]\),
$$\begin{aligned} \langle x^{k+1}-x^*, ~A^T ({\uplambda }^{k+1}-{\uplambda }^*)\rangle \ge t\cdot \nu _f\Vert x^{k+1}-x^*\Vert ^2+(1-t)\frac{{\uplambda }_{\min }(AA^T)}{L_f}\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2. \end{aligned}$$
(3.20)
Now, using (3.20) to replace (3.18) in our analysis in Sect. 2, the inequality (3.17) can be further refined as
$$\begin{aligned}&\Vert u^k-u^*\Vert _G^2 - \Vert u^{k+1}-u^*\Vert _G^2\nonumber \\&\quad \ge ~ 2t\cdot \nu _f\Vert x^{k+1}-x^*\Vert ^2+2(1-t)\frac{{\uplambda }_{\min }(AA^T)}{L_f}\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\nonumber \\&\quad \quad +\,2\nu _g\Vert y^{k+1}-y^*\Vert ^2 +\Vert u^k-u^{k+1}\Vert _G^2+2\nu _f\Vert x^k-x^{k+1}\Vert ^2,\quad \forall t\in [0,1]. \end{aligned}$$
(3.21)
In particular, letting
$$\begin{aligned} t = \left( 1+\frac{L_f \nu _f}{\beta ^2\Vert A\Vert ^2{\uplambda }_{\min }(AA^T)}\right) ^{-1}, \end{aligned}$$
(3.22)
we have
$$\begin{aligned}&2t\cdot \nu _f\Vert x^{k+1}-x^*\Vert ^2+2(1-t)\frac{{\uplambda }_{\min }(AA^T)}{L_f}\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\nonumber \\&\quad \ge \delta \left( \beta \Vert A\Vert ^2\Vert x^{k+1}-x^*\Vert ^2+\frac{1}{\beta }\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\right) \ge \delta \Vert u^{k+1}-u^*\Vert _G^2, \end{aligned}$$
(3.23)
where \(\delta >0\) is given by (3.15). Then (3.1) follows from (3.21) and (3.23) immediately.

Not surprisingly, the convergence rate under scenario 1 is negatively affected by the condition numbers of A and f. For other scenarios, the formulas of \(\delta \) can also be obtained similarly by deriving the specific values of the \(c_i\)'s in our analysis. However, they appear to be more complicated than the nice formula (3.16) for scenario 1. A close look at these formulas of the \(c_i\)'s reveals that the convergence rate is negatively affected by the condition numbers of the constraint matrices A, B, and [A, B], as well as the condition numbers of the objective functions f and g. Due to the page limit, we leave other scenarios/cases and further analysis to future research.
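
As a quick numerical illustration of (3.15) and (3.16), the following sketch computes the optimal \(\beta \) and the resulting \(\delta \) for a given A, \(\nu _f\), and \(L_f\) (the values used are illustrative only):

```python
import numpy as np

def optimal_beta_delta(A, nu_f, L_f):
    """beta maximizing delta in (3.15) and the resulting delta_max = 1/(kappa_A * sqrt(kappa_f))."""
    normA2 = np.linalg.norm(A, 2) ** 2              # ||A||^2 = lambda_max(A A')
    lam_min = np.linalg.eigvalsh(A @ A.T).min()     # lambda_min(A A'); A must have full row rank
    beta_opt = np.sqrt(L_f * nu_f / (normA2 * lam_min))
    delta = 2.0 / (beta_opt * normA2 / nu_f + L_f / (beta_opt * lam_min))   # formula (3.15)
    return beta_opt, delta

A = np.random.default_rng(1).standard_normal((5, 12))   # full row rank with probability 1
beta_opt, delta_max = optimal_beta_delta(A, nu_f=1.0, L_f=10.0)
print(beta_opt, delta_max)   # delta_max equals 1/(kappa_A * sqrt(kappa_f)), cf. (3.16)
```
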

3.3 Comparison with Lions and Mercier’s Linear Rate of DRSM

It is known that applying the classic ADM to problem (1.1) is equivalent to applying the Douglas–Rachford splitting method (DRSM) to the dual of (1.1). (However, it is unclear to which splitting methods the various ADM generalizations correspond.) In this subsection, we review the classic linear convergence result [29] of the DRSM. In comparison, we show that our linear rate for the classic ADM is considerably better than the one in [29].

The dual of (1.1) is given by
$$\begin{aligned} \min _{{\uplambda }}~\left\{ - \min _{x,y}~f(x)+g(y)-{\uplambda }^\top (Ax+By-b) \right\} =\min _{{\uplambda }}~f^*(A^\top {\uplambda })+g^*(B^\top {\uplambda })-b^\top {\uplambda }, \end{aligned}$$
(3.24)
where \(f^*\) and \(g^*\) are the convex conjugate functions of f and g, respectively. Define the maximal monotone operators \({\mathcal {A}}\) and \({\mathcal {B}}\) as follows:
$$\begin{aligned} {\mathcal {A}}({\uplambda }):=\partial [g^*(B^\top {\uplambda })]-b,\quad {\mathcal {B}}({\uplambda }):=\partial [f^*(A^\top {\uplambda })]. \end{aligned}$$
(3.25)
Then (3.24) is equivalent to finding a zero of the sum of two maximal monotone operators:
$$\begin{aligned} 0\in {\mathcal {A}}({\uplambda })+{\mathcal {B}}({\uplambda }). \end{aligned}$$
(3.26)
Applying DRSM to the above problem yields the following algorithm:
$$\begin{aligned} v^{k+1}&= J^{\beta }_{{\mathcal {A}}}(2J^{\beta }_{{\mathcal {B}}}-I)v^k+(I-J^{\beta }_{{\mathcal {B}}})v^k,\end{aligned}$$
(3.27)
$$\begin{aligned} {\uplambda }^{k+1}&= J^{\beta }_{{\mathcal {B}}} v^{k+1}, \end{aligned}$$
(3.28)
where \(J^{\beta }_{{\mathcal {A}}}=(I+\beta {\mathcal {A}})^{-1}\) and \(J^{\beta }_{{\mathcal {B}}}=(I+\beta {\mathcal {B}})^{-1}\) are the resolvent operators. After some calculation, it can be shown that this algorithm is equivalent to the classic ADM (Algorithm 1) [13]. Here, the variable v corresponds to
$$\begin{aligned} v^k =\beta Ax^k +{\uplambda }^k. \end{aligned}$$
(3.29)
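For illustration, a minimal sketch of the iteration (3.27)–(3.28) is given below, assuming the two resolvents are available as callables; the function name drsm, the arguments resolvent_A and resolvent_B, and the stopping test are our own illustrative choices, not notation from the paper.

import numpy as np

def drsm(resolvent_A, resolvent_B, v0, max_iter=500, tol=1e-10):
    """Douglas-Rachford splitting (3.27)-(3.28) for 0 in A(lam) + B(lam).

    resolvent_A(v) and resolvent_B(v) evaluate J_A^beta(v) = (I + beta*A)^{-1}(v)
    and J_B^beta(v); the parameter beta is assumed to be baked into the callables.
    """
    v = np.asarray(v0, dtype=float).copy()
    for _ in range(max_iter):
        jb = resolvent_B(v)
        v_next = resolvent_A(2.0 * jb - v) + (v - jb)   # update (3.27)
        if np.linalg.norm(v_next - v) <= tol * max(1.0, np.linalg.norm(v)):
            v = v_next
            break
        v = v_next
    lam = resolvent_B(v)                                # update (3.28)
    return lam, v

For the dual (3.24), the two resolvents reduce to proximal-type subproblems involving \(f^*\circ A^\top \) and \(g^*\circ B^\top \), which is how the equivalence with the classic ADM arises.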
The linear convergence of DRSM was established by Lions and Mercier [29]. We summarize their result in the following theorem.

Theorem 3.3

(Lions and Mercier [29]) Assume the operator \({\mathcal {B}}\) is both coercive and Lipschitz; namely, there exist \(\alpha >0\) and \(M>0\) such that
$$\begin{aligned} \langle {\mathcal {B}}({\uplambda }_1)-{\mathcal {B}}({\uplambda }_2),~{\uplambda }_1-{\uplambda }_2\rangle&\ge \alpha \Vert {\uplambda }_1-{\uplambda }_2\Vert ^2,\end{aligned}$$
(3.30)
$$\begin{aligned} \Vert {\mathcal {B}}({\uplambda }_1)-{\mathcal {B}}({\uplambda }_2)\Vert&\le M \Vert {\uplambda }_1-{\uplambda }_2\Vert . \end{aligned}$$
(3.31)
Then, there exists a constant \(C>0\) such that
$$\begin{aligned} \Vert {\uplambda }^{k}-{\uplambda }^*\Vert ^2 \le C\cdot \theta ^k, ~\Vert v^{k+1}-v^*\Vert ^2 \le \theta \cdot \Vert v^k-v^*\Vert ^2, \end{aligned}$$
(3.32)
where
$$\begin{aligned} \theta = 1-\frac{2\beta \alpha }{(1+\beta M)^2}. \end{aligned}$$
(3.33)
The smallest \(\theta \) is given by
$$\begin{aligned} \theta _{\min } = 1-\frac{\alpha }{2M}, \end{aligned}$$
(3.34)
which corresponds to \(\beta =1/M\).
Under the assumptions of scenario 1 of Table 1, f is strongly convex with constant \(\nu _f\), its gradient \(\nabla f\) is Lipschitz with constant \(L_f\), and the matrix A has full row rank. Then the operator \({\mathcal {B}}:=\partial [f^*\circ A^\top ]\) is coercive and Lipschitz with the constants
$$\begin{aligned} \alpha ={\uplambda }_{\min }(AA^\top )/L_f,~M = \Vert A\Vert ^2/\nu _f. \end{aligned}$$
(3.35)
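These constants follow from standard conjugacy facts; we include a brief verification for completeness. Since f is \(\nu _f\)-strongly convex, \(f^*\) is differentiable and \(\nabla f^*\) is \((1/\nu _f)\)-Lipschitz; since \(\nabla f\) is \(L_f\)-Lipschitz, \(f^*\) is \((1/L_f)\)-strongly convex, so \(\nabla f^*\) is \((1/L_f)\)-strongly monotone. Writing \({\mathcal {B}}({\uplambda })=A\nabla f^*(A^\top {\uplambda })\), we obtain
$$\begin{aligned} \langle {\mathcal {B}}({\uplambda }_1)-{\mathcal {B}}({\uplambda }_2),~{\uplambda }_1-{\uplambda }_2\rangle&=\langle \nabla f^*(A^\top {\uplambda }_1)-\nabla f^*(A^\top {\uplambda }_2),~A^\top ({\uplambda }_1-{\uplambda }_2)\rangle \ge \frac{{\uplambda }_{\min }(AA^\top )}{L_f}\Vert {\uplambda }_1-{\uplambda }_2\Vert ^2,\\ \Vert {\mathcal {B}}({\uplambda }_1)-{\mathcal {B}}({\uplambda }_2)\Vert&\le \Vert A\Vert \,\Vert \nabla f^*(A^\top {\uplambda }_1)-\nabla f^*(A^\top {\uplambda }_2)\Vert \le \frac{\Vert A\Vert ^2}{\nu _f}\Vert {\uplambda }_1-{\uplambda }_2\Vert . \end{aligned}$$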
Hence, the linear convergence of ADM follows from Theorem 3.3, and the optimal linear rate is given by
$$\begin{aligned} \theta _{\min } = 1-\frac{{\uplambda }_{\min }(AA^\top )\nu _f}{2\Vert A\Vert ^2 L_f} = 1- \frac{1}{2\kappa _A^2\kappa _f}. \end{aligned}$$
(3.36)
In contrast, our linear rate (3.16) is given by
$$\begin{aligned} \frac{1}{1+\delta _{\max }} = 1-\delta _{\max }+O(\delta _{\max }^2) = 1-\frac{1}{\kappa _A\sqrt{\kappa _f}}+O\left( \frac{1}{\kappa ^2_A {\kappa _f}}\right) , \end{aligned}$$
(3.37)
which is better than (3.36). A careful inspection also shows that our linear rate (3.15) considerably improves upon the classic rate (3.33) of [29].
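To make the improvement concrete, consider for illustration (our own arithmetic) \(\kappa _A=1\) and \(\kappa _f=100\):
$$\begin{aligned} \text {(3.36): }~1-\frac{1}{2\cdot 1\cdot 100}=0.995, \qquad \text {(3.37): }~\approx 1-\frac{1}{1\cdot \sqrt{100}}=0.9, \end{aligned}$$
so reaching a fixed accuracy requires roughly 20 times fewer iterations under our rate, since \(\log 0.9/\log 0.995\approx 21\).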

3.4 Q-Linear Convergent Quantities

From the definition of G, which depends on P and Q, it is easy to see that the Q-linear convergence of \(u^k=(x^k;y^k;{\uplambda }^k)\) translates to the Q-linear convergence results in Table 2. For example, in case 1 (\(P=0\) and \(Q=0\)), \(\Vert u^{k+1}-u^*\Vert ^2_G=\Vert x^{k+1}-x^*\Vert _{\hat{P}}^2+\frac{1}{\beta \gamma }\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2\), where \(\hat{P}=P+\beta A^TA=\beta A^TA\). Hence, \((Ax^k,{\uplambda }^k)\) converges Q-linearly. Examining \(\Vert u^{k+1}-u^*\Vert ^2_G\) similarly gives the results for cases 2, 3, and 4.

3.5 R-Linear Convergent Quantities

By the definition of R-linear convergence, any part of a Q-linearly convergent quantity converges R-linearly. For example, in case 1 (\(P=0\) and \(Q=0\)), the Q-linear convergence of \((Ax^k,{\uplambda }^k)\) in Table 2 gives the R-linear convergence of \(Ax^k\) and \({\uplambda }^k\). Therefore, to establish Table 2, it remains to show the R-linear convergence of \(x^k\) in cases 1 and 3 and that of \(y^k\) in cases 1 and 2. Our approach is to bound their errors by existing R-linearly convergent quantities.

Theorem 3.4

(R-linear convergence) The following statements hold:

  i) In cases 1 and 3, if \({\uplambda }^k\) converges R-linearly, then \(x^k\) converges R-linearly.

  ii) In cases 1 and 2, scenario 1, if \({\uplambda }^k\) and \(x^k\) both converge R-linearly, then \(By^k\) converges R-linearly. In addition, if B has full column rank, then \(y^k\) converges R-linearly.

  iii) In cases 1 and 2, scenarios 2–4, if \({\uplambda }^k\) and \(x^k\) both converge R-linearly, then \(y^k\) converges R-linearly.

Proof

We only show the result for \(\gamma = 1\) (thus \(\hat{{\uplambda }}={\uplambda }^{k+1}\)); for \(\gamma \not =1\) (thus \(\hat{{\uplambda }}\not ={\uplambda }^{k+1}\)), the results follow from those for \(\gamma =1\) and the R-linear convergence of \(\Vert \hat{{\uplambda }}-{\uplambda }^{k+1}\Vert ^2\), which itself follows from (2.2) and the R-linear convergence of \({\uplambda }^k\) (thus that of \({\uplambda }^k-{\uplambda }^{k+1}\)).
  i) By (2.8) and \(\hat{P}=\beta A^TA\), we have \(\nu _f\Vert x^{k+1}-x^{*}\Vert ^2 \le \Vert A\Vert \Vert x^{k+1}-x^{*}\Vert \Vert {\uplambda }^{k+1}-{\uplambda }^{*}\Vert \), which implies
$$\begin{aligned} \Vert x^{k+1}-x^*\Vert ^2 \le \frac{\Vert A\Vert ^2}{\nu _f^2}\Vert {\uplambda }^{k+1}-{\uplambda }^*\Vert ^2. \end{aligned}$$
(3.38)

  ii) The result follows from (2.10).

  iii) Scenario 3 assumes the full column rank of B, so the result follows from (2.10). In scenarios 2 and 4, g is strongly convex. Recall (2.9) with \(\hat{{\uplambda }}={\uplambda }^{k+1}\):
$$\begin{aligned}&\left\langle y^{k+1}-y^{*},~B^T\left( {\uplambda }^{k+1}-{\uplambda }^*-\beta A(x^{k}-x^{k+1})\right) +Q(y^k-y^{k+1})\right\rangle \nonumber \\&\quad \ge \nu _g\Vert y^{k+1}-y^*\Vert ^2. \end{aligned}$$
(3.39)
By the Cauchy–Schwarz inequality and \(Q=\mathbf {0}\), we have
$$\begin{aligned} \nu _g\Vert y^{k+1}-y^*\Vert \le \Vert B\Vert \Vert {\uplambda }^{k+1}-{\uplambda }^*-\beta A(x^{k}-x^{k+1})\Vert . \end{aligned}$$
(3.40)
Therefore, the result follows from the R-linear convergence of \(x^k\) and \({\uplambda }^k\).

4 Applications

This section describes some well-known optimization models on which Algorithm 2 not only enjoys global linear convergence but also often has easy-to-solve subproblems. In general, at least one of the two objective functions needs to be strictly convex. This is the case with Tikhonov regularization, which has numerous applications such as ridge regression and the support vector machine (SVM) in statistics and machine learning, elastic net regularization (see Sect. 4.2 below), and entropy maximization. In some applications, the conditions for linear convergence hold not initially but only after the iterates enter an optimal active set; we then obtain eventual, rather than global, linear convergence.

4.1 Convex Regularization

The following convex regularization model has been widely used in various applications:
$$\begin{aligned} \min _y f(By-b)+g(y) \end{aligned}$$
(4.1)
where f is often a strongly convex function with a Lipschitz continuous gradient, and g is a convex function that varies widely across applications; in particular, g can be nonsmooth (e.g., the \(\ell _1\)-norm or the indicator function of a convex set). Here, f and g are often referred to as the loss (or data fidelity) function and the regularization function, respectively. Model (4.1) can be reformulated as
$$\begin{aligned} \min _{x,y} f(x)+g(y),\quad \text {s.t.}~x+By = b \end{aligned}$$
(4.2)
and be solved by Algorithm 2. With many popular choices of f and g and also with proper P and Q, the x- and y-subproblems are easy to solve. If B has full column rank or g is strongly convex, then Algorithm 2 converges at a global linear rate.

4.2 Sparse Optimization

In recent years, the problem of recovering sparse vectors and low-rank matrices has received tremendous attention from researchers and engineers, particularly those in the areas of compressive sensing, machine learning, and statistics.

Elastic net (augmented \(\ell _1\)) model. To recover a sparse vector \(y^0\in \mathbb {R}^n\) from linear measurements \(b=Ay^0\in \mathbb {R}^m\), the elastic net model solves
$$\begin{aligned} \min _y ~\Vert y\Vert _1+\alpha \Vert y\Vert ^2+\frac{1}{2\mu }\Vert Ay-b\Vert ^2, \end{aligned}$$
(4.3)
where \(A\in \mathbb {R}^{m\times n}\), \(\alpha >0\) and \(\mu >0\) are parameters, and the \(\ell _1\) norm \(\Vert y\Vert _1:=\sum _{i=1}^{n}|y_i|\) is known to promote sparsity in the solution. It has been shown that the elastic net model can effectively recover sparse vectors and outperforms the Lasso (\(\alpha =0\)) on the real-world regression problems reported in [38]. With the constraint \(x=y\), (4.3) can be reformulated as:
$$\begin{aligned} \begin{aligned} \min _{x,y}&\quad ~\Vert y\Vert _1+\alpha \Vert x\Vert ^2+\frac{1}{2\mu }\Vert Ax-b\Vert ^2\\ {\text {s.t.}}&\quad ~x-y=0. \end{aligned} \end{aligned}$$
(4.4)
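As an illustration of how easily the two subproblems decouple, below is a minimal sketch (our own code, not from the paper) of the classic ADM, i.e., Algorithm 2 with \(P=0\), \(Q=0\), and \(\gamma =1\), applied to (4.4). The function names are ours, and the sign convention of the multiplier update follows the standard unscaled form, which may differ from the paper's.

import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau*||.||_1 (componentwise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def admm_elastic_net(A, b, alpha, mu, beta, num_iters=200):
    """Classic ADM for (4.4):
        min_{x,y}  alpha*||x||^2 + (1/(2*mu))*||A x - b||^2 + ||y||_1
        s.t.       x - y = 0,
    using the augmented Lagrangian f(x) + g(y) + lam'(x - y) + (beta/2)*||x - y||^2.
    """
    n = A.shape[1]
    x = np.zeros(n)
    y = np.zeros(n)
    lam = np.zeros(n)
    # The x-subproblem is an unchanging linear system; factor its matrix once.
    H = (2.0 * alpha + beta) * np.eye(n) + (A.T @ A) / mu
    L = np.linalg.cholesky(H)
    Atb = A.T @ b / mu
    for _ in range(num_iters):
        # x-update: (2*alpha*I + A'A/mu + beta*I) x = A'b/mu - lam + beta*y
        rhs = Atb - lam + beta * y
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))
        # y-update: prox of ||.||_1 with parameter 1/beta
        y = soft_threshold(x + lam / beta, 1.0 / beta)
        # multiplier update with step size gamma = 1
        lam = lam + beta * (x - y)
    return x, y, lam

For instance, admm_elastic_net(A, b, 0.1, 1e-2, 100.0) corresponds to the parameter setting used in the test of Sect. 5.1.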
Augmented nuclear-norm model. Similarly, the elastic net model can be extended to the recovery of low-rank matrices. To recover a low-rank matrix \(Y^0\in \mathbb {R}^{n_1\times n_2}\) from linear measurements \(b={\mathcal {A}}(Y^0)\in \mathbb {R}^m\), the augmented nuclear-norm model solves
$$\begin{aligned} \min _Y ~\Vert Y\Vert _*+\alpha \Vert Y\Vert _F^2+\frac{1}{2\mu }\Vert {\mathcal {A}}(Y)-b\Vert ^2, \end{aligned}$$
(4.5)
where \(\alpha >0\) and \(\mu >0\) are parameters, \({\mathcal {A}}:\mathbb {R}^{n_1\times n_2}\rightarrow \mathbb {R}^m\) is a linear operator, \(\Vert \cdot \Vert _F\) denotes the Frobenius norm, and the nuclear norm \(\Vert Y\Vert _*\), the sum of the singular values of Y, is known to promote low-rankness in the solution. By the variable splitting \(X=Y\), (4.5) can be reformulated as:
$$\begin{aligned} \begin{aligned} \min _{X,Y}&\quad ~\Vert Y\Vert _*+\alpha \Vert X\Vert _F^2+\frac{1}{2\mu }\Vert {\mathcal {A}}(X)-b\Vert ^2\\ {\text {s.t.}}&\quad ~X-Y=0. \end{aligned} \end{aligned}$$
(4.6)
In (4.4) and (4.6), the functions \(f(x)=\alpha \Vert x\Vert ^2+\frac{1}{2\mu }\Vert Ax-b\Vert ^2\) and \(f(X)=\alpha \Vert X\Vert _F^2+\frac{1}{2\mu }\Vert {\mathcal {A}}(X)-b\Vert ^2\) are strongly convex and have Lipschitz continuous gradients; the functions \(g(y):=\Vert y\Vert _1\) and \(g(Y):=\Vert Y\Vert _*\) are convex and nonsmooth. In fact, \(\Vert \cdot \Vert ^2\) and \(\Vert \cdot \Vert _F^2\) can be replaced by many other functions that are strongly convex and have Lipschitz continuous gradients, or that become so when restricted to a bounded set. Note that if \(\alpha =0\), then f may fail to be strongly convex when the matrix A or the linear operator \({\mathcal {A}}\) does not have full column rank. In many applications this is indeed the case, since the number of observations is usually smaller than the dimension of y or Y (i.e., \(m<n\) and \(m<n_1 n_2\)). However, the parameter \(\alpha >0\) guarantees the strong convexity of f and hence the global linear convergence of Algorithm 2 when applied to (4.4) and (4.6).

4.3 Consensus and Sharing Optimization

Consider, in a network of N nodes, the problem of minimizing the sum of N functions, one from each node, over a common variable x. This problem can be written as
$$\begin{aligned} \min _{x\in \mathbb {R}^n} \sum _{i=1}^N f_{i}(x). \end{aligned}$$
(4.7)
Let each node i keep a vector \(x_{i}\in \mathbb {R}^n\) as its local copy of x. To reach consensus among the \(x_i\), \(i=1,\ldots ,N\), a common approach is to introduce a global variable y and solve
$$\begin{aligned} \min _{\{x_{i}\},y}\sum _{i=1}^N f_{i}(x_{i}),\quad \text {s.t.}~x_{i}-y=0,\quad i=1,\ldots ,N. \end{aligned}$$
(4.8)
This is the well-known global consensus problem; see [2] for a review. With an objective function g on the global variable y, we have the global variable consensus problem with regularization:
$$\begin{aligned} \min _{\{x_{i}\},y}\sum _{i=1}^N f_{i}(x_{i})+g(y),\quad \text {s.t.}~x_{i}-y=0, \quad i=1,\ldots ,N, \end{aligned}$$
(4.9)
where g(y) is a convex function.
The following sharing problem is also nicely reviewed in [2]:
$$\begin{aligned} \min _{\{x_{i}\},y}\sum _{i=1}^N f_{i}(x_{i})+g\left( \sum _{i=1}^N y_i\right) ,\quad \text {s.t.}~x_{i}-{y}_i=0,\quad i=1,\ldots ,N, \end{aligned}$$
(4.10)
where the \(f_i\)’s are local cost functions and g is the cost function shared by all nodes.

Algorithm 2 applied to problems (4.8), (4.9) and (4.10) converges linearly if each function \(f_i\) is strongly convex and has a Lipschitz continuous gradient. The resulting ADM is particularly suitable for distributed implementation, since the x-subproblem decomposes into N independent \(x_{i}\)-subproblems and the multiplier \({\uplambda }\) can also be updated locally at each node i, as sketched below.
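The following is a minimal sketch of one iteration of the classic ADM (\(\gamma =1\)) for problem (4.9); the proximal-operator interface and the function names are our own illustrative assumptions, not the paper's notation.

def consensus_admm_step(prox_f_list, prox_g, lam_list, y, beta):
    """One classic-ADM iteration (gamma = 1) for problem (4.9):
        min  sum_i f_i(x_i) + g(y)   s.t.  x_i - y = 0,  i = 1,...,N.
    prox_f_list[i](v, t) returns argmin_x f_i(x) + (1/(2t))*||x - v||^2;
    prox_g(v, t) is the analogous proximal operator of g.
    """
    N = len(prox_f_list)
    # x_i-updates: fully independent, so each can run at node i.
    x_list = [prox_f_list[i](y - lam_list[i] / beta, 1.0 / beta) for i in range(N)]
    # y-update: needs only the averages of the x_i's and lambda_i's.
    x_bar = sum(x_list) / N
    lam_bar = sum(lam_list) / N
    y = prox_g(x_bar + lam_bar / beta, 1.0 / (N * beta))
    # lambda_i-updates: again local to node i.
    lam_list = [lam_list[i] + beta * (x_list[i] - y) for i in range(N)]
    return x_list, lam_list, y

Only the averages \(\bar{x}\) and \(\bar{{\uplambda }}\) need to be gathered at each iteration, which is what makes the method attractive for distributed computing.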

5 Numerical Demonstration

We present the results of some simple numerical tests to demonstrate the linear convergence of Algorithm 2. The numerical performance is not the focus of this paper and will be investigated more thoroughly in future research.

5.1 Elastic Net

We apply Algorithm 2 with \(P=0\) and \(Q=0\) to a small elastic net problem (4.4), where the feature matrix A has \(m=250\) examples and \(n=1000\) features. We first generated the matrix A from the standard Gaussian distribution \({\mathcal {N}}(0,1)\) and then orthonormalized its rows. A sparse vector \(x^0\in \mathbb {R}^n\) was generated with 25 nonzero entries, each sampled from the standard Gaussian distribution. The observation vector \(b\in \mathbb {R}^m\) was then computed by \(b=Ax^0+\epsilon \), where \(\epsilon \sim {\mathcal {N}}(0,10^{-3}I)\). We chose the model parameters \(\alpha =0.1\) and \(\mu =10^{-2}\), which we found to yield reasonable accuracy for recovering the sparse solution. We initialized all the variables at zero and set the algorithm parameters \(\beta =100\) and \(\gamma =1\). We ran the algorithm for 200 iterations and recorded the errors at each iteration with respect to a precomputed reference solution \(u^*\).
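For reproducibility, the test data can be generated along the following lines (a sketch mirroring the description above; the random seed and the QR-based row orthonormalization are our own choices):

import numpy as np

rng = np.random.default_rng(0)   # the seed is our own choice; the paper does not specify one
m, n, k = 250, 1000, 25

# Standard Gaussian matrix whose rows are then orthonormalized.
G = rng.standard_normal((m, n))
Q, _ = np.linalg.qr(G.T)         # Q is n-by-m with orthonormal columns
A = Q.T                          # so the rows of A are orthonormal

# 25-sparse ground truth and noisy observations b = A x0 + eps, eps ~ N(0, 1e-3 I).
x0 = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x0[support] = rng.standard_normal(k)
b = A @ x0 + np.sqrt(1e-3) * rng.standard_normal(m)

# Model and algorithm parameters used in this test.
alpha, mu = 0.1, 1e-2
beta, gamma = 100.0, 1.0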

Figure 1a shows the decrease of \(\Vert u^k-u^*\Vert _G^2\) \((:=\beta \Vert x^k-x^*\Vert ^2+\Vert {\uplambda }^k-{\uplambda }^*\Vert ^2/\beta )\) as the algorithm progresses. Since the variable y is not contained in the G-norm, we also plot the convergence curve of \(\Vert y^k-y^*\Vert ^2\) in Fig. 1b. We observe that both \(u^k\) and \(y^k\) converge at similar linear rates. In addition, the convergence appears to have different stages: the later stage exhibits a faster rate than the earlier one. This can be seen clearly in Fig. 2, which depicts the Q-linear rate \(\Vert u^{k+1}-u^*\Vert _G^2/\Vert u^k-u^*\Vert _G^2\).
Fig. 1 Convergence curves of ADM for the elastic net problem: (a) \(\Vert u^k-u^*\Vert _G^2\) versus iteration; (b) \(\Vert y^k-y^*\Vert ^2\) versus iteration

Fig. 2 Q-linear convergence rate of ADM for the elastic net problem

Here, the strong convexity constant of f is \(\nu _f=2\alpha +{\uplambda }_{\min }(A^TA)/\mu =2\alpha \), and the Lipschitz constant of \(\nabla f\) is \(L_f=2\alpha +{\uplambda }_{\max }(A^TA)/\mu =2\alpha +1/\mu \). By (3.15), our bound on the global linear rate is \((1+\delta )^{-1}=0.996\), which roughly matches the early-stage rate shown in the figure. However, our theoretical bound is rather conservative, since it is a global worst-case bound and does not take into account the properties of the \(\ell _1\) norm and the solution. In fact, the optimal solution \(x^*\) is very sparse, and \(x^k\) also becomes sparse after a number of iterations. Let \({\mathcal {S}}\) be the index set of the nonzero support of \(x^k-x^*\), and let \(A_{{\mathcal {S}}}\) be the submatrix formed by the columns of A indexed by \({\mathcal {S}}\). Then the constants \(\nu _f\) and \(L_f\) in our bound can be effectively replaced by \(\bar{\nu }_f=2\alpha +{\uplambda }_{\min }(A_{{\mathcal {S}}}^TA_{{\mathcal {S}}})/\mu \) and \(\bar{L}_f=2\alpha +{\uplambda }_{\max }(A_{{\mathcal {S}}}^TA_{{\mathcal {S}}})/\mu \), which accounts for the faster convergence in the later stage. For example, letting \({\mathcal {S}}\) be the nonzero support of the optimal solution \(x^*\), we obtain an estimate of the (asymptotic) linear rate \((1+\delta )^{-1}=0.817\), which matches the later-stage rate well.

5.2 Distributed Lasso

We consider solving the Lasso problem in a distributed way [30]:
$$\begin{aligned} \begin{aligned} \min _{\{x_{i}\},y}&\quad \sum _{i=1}^N \frac{1}{2\mu }\Vert A_ix_{i}-b_i\Vert ^2+\Vert y\Vert _1\\ {\text {s.t.}}&\quad x_{i}-y=0,~i=1,\ldots ,N, \end{aligned} \end{aligned}$$
(5.1)
which is an instance of the global consensus problem with regularization (4.9).

We apply Algorithm 2 with \(P=0\) and \(Q=0\) to a small distributed Lasso problem (5.1) with \(N=5\), where each \(A_i\) has \(m=600\) examples and \(n=500\) features. Each \(A_i\) is a tall matrix and has full column rank, yielding a strongly convex objective function in \(x_i\). Therefore, Algorithm 2 is guaranteed to converge linearly.

We generated the data similarly to the elastic net test. We randomly generated each \(A_i\) from the standard Gaussian distribution \({\mathcal {N}}(0,1)\) and then scaled its columns to unit length. We generated a sparse vector \(x^0\in \mathbb {R}^n\) with 250 nonzero entries, each sampled from the \({\mathcal {N}}(0,1)\) distribution. Each \(b_i\in \mathbb {R}^m\) was then computed by \(b_i=A_ix^0+\epsilon _i\), where \(\epsilon _i\sim {\mathcal {N}}(0,10^{-3}I)\). We chose the model parameter \(\mu =0.1\), which we found to yield reasonably good recovery quality. Starting from the zero initial point, we ran the algorithm with parameters \(\beta =10\) and \(\gamma =1\) for 50 iterations and computed the iterative errors.

Figure 3 demonstrates the clear linear convergence of \(\Vert u^k-u^*\Vert _G^2\) and \(\Vert y^k-y^*\Vert ^2\). Figure 4 depicts the Q-linear convergence rate of \(\Vert u^k-u^*\Vert _G^2\). For this problem, the strong convexity constant is \(\nu _f=\min _i\{{\uplambda }_{\min }(A_i^TA_i)/\mu \}\) and the Lipschitz constant is \(L_f=\max _i\{{\uplambda }_{\max }(A_i^TA_i)/\mu \}\). However, the condition number \(L_f/\nu _f\) in this test is relatively large, and hence the theoretical linear rate specified by (3.16) is not a very tight bound on the observed fast rate. Note that all the \(x_i\)’s tend to become equal and sparse after a number of iterations. Similarly to the discussion in Sect. 5.1, we can estimate the asymptotic linear rate by letting \(\bar{\nu }_f={\uplambda }_{\min }(A_{{\mathcal {S}}}^TA_{{\mathcal {S}}})/(\mu N)\) and \(\bar{L}_f={\uplambda }_{\max }(A_{{\mathcal {S}}}^TA_{{\mathcal {S}}})/(\mu N)\), where \(A\in \mathbb {R}^{Nm\times n}\) is formed by stacking all the matrices \(A_i~(i=1,\ldots ,N)\) and \({\mathcal {S}}\) is the index set of the nonzero support of \(x^*\). This yields an asymptotic linear rate of \((1+\delta )^{-1}=0.779\), which appears to be a much tighter bound.
Fig. 3 Convergence curves of ADM for the distributed Lasso problem: (a) \(\Vert u^k-u^*\Vert _G^2\) versus iteration; (b) \(\Vert y^k-y^*\Vert ^2\) versus iteration

Fig. 4 Q-linear convergence rate of ADM for the distributed Lasso problem

6 Conclusions

In this paper, we provide sufficient conditions for the global linear convergence of a general class of ADMs that solve the subproblems either exactly or approximately in a certain manner. Among these conditions is the requirement that one of the objective functions be strongly convex and have a Lipschitz continuous gradient. These sufficient conditions cover a wide range of applications. We also extend the existing convergence theory to allow more generality in the step size \(\gamma \) for updating the multipliers.

In practice, how to choose the penalty parameter \(\beta \) is always an important issue. Our convergence rate analysis provides insight into how \(\beta \) affects the convergence speed, thereby offering some theoretical guidance for choosing \(\beta \).

Footnotes

  1. The results continue to hold in many cases when strong convexity is relaxed to strict convexity (e.g., \(-\log (x)\) is strictly convex but not strongly convex over \(x>0\)). ADM always generates a bounded sequence \((Ax^k, By^k, {\uplambda }^k)\), where the bound depends only on the starting point and the solution, even when the feasible set is unbounded. When restricted to a compact set, a strictly convex function is strongly convex.

  2. Suppose a sequence \(\{u^k\}\) converges to \(u^*\). We say the convergence is (in some norm \(\Vert \cdot \Vert \))
     • Q-linear, if there exists \(\mu \in (0,1)\) such that \(\frac{\Vert u^{k+1}-u^*\Vert }{\Vert u^{k}-u^*\Vert }\le \mu \);
     • R-linear, if there exists a sequence \(\{\sigma ^k\}\) such that \(\Vert u^{k}-u^*\Vert \le \sigma ^k\) and \(\sigma ^k\rightarrow 0\) Q-linearly.


Acknowledgments

The authors’ work is supported in part by ARL MURI Grant W911NF-09-1-0383 and NSF Grant DMS-1317602.

References

  1. Boley, D.: Linear convergence of ADMM on a model problem. TR 12-009, Department of Computer Science and Engineering, University of Minnesota (2012)
  2. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2010)
  3. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Trans. Pattern Anal. Mach. Intell. 26(9), 1124–1137 (2004)
  4. Cai, J., Osher, S., Shen, Z.: Split Bregman methods and frame based image restoration. Multiscale Model. Simul. 8(2), 337 (2009)
  5. Chen, G., Teboulle, M.: A proximal-based decomposition method for convex minimization problems. Math. Program. 64(1), 81–101 (1994)
  6. Davis, D., Yin, W.: Convergence rate analysis of several splitting schemes. arXiv preprint arXiv:1406.4834 (2014)
  7. Davis, D., Yin, W.: Faster convergence rates of relaxed Peaceman–Rachford and ADMM under regularity assumptions. arXiv preprint arXiv:1407.5210 (2014)
  8. Deng, W., Yin, W., Zhang, Y.: Group sparse optimization by alternating direction method. In: SPIE Optical Engineering+Applications, pp. 88580R–88580R (2013)
  9. Douglas, J., Rachford, H.: On the numerical solution of heat conduction problems in two and three space variables. Trans. Am. Math. Soc. 82(2), 421–439 (1956)
  10. Eckstein, J., Bertsekas, D.: An alternating direction method for linear programming. Division of Research, Harvard Business School, and Laboratory for Information and Decision Systems, M.I.T. (1990)
  11. Eckstein, J., Bertsekas, D.P.: On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55(1–3), 293–318 (1992)
  12. Esser, E.: Applications of Lagrangian-based alternating direction methods and connections to split Bregman. CAM Report 09-31, UCLA (2009)
  13. Gabay, D.: Chapter IX: Applications of the method of multipliers to variational inequalities. Stud. Math. Appl. 15, 299–331 (1983)
  14. Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 2(1), 17–40 (1976)
  15. Glowinski, R.: Numerical Methods for Nonlinear Variational Problems. Springer Series in Computational Physics. Springer, Berlin (1984)
  16. Glowinski, R., Marrocco, A.: Sur l'approximation, par éléments finis d'ordre un, et la résolution, par pénalisation-dualité, d'une classe de problèmes de Dirichlet non linéaires. Laboria (1975)
  17. Goldfarb, D., Ma, S.: Fast multiple splitting algorithms for convex optimization. SIAM J. Optim. 22(2), 533–556 (2012)
  18. Goldfarb, D., Ma, S., Scheinberg, K.: Fast alternating linearization methods for minimizing the sum of two convex functions. Math. Program. 141(1–2), 349–382 (2013)
  19. Goldfarb, D., Yin, W.: Parametric maximum flow algorithms for fast total variation minimization. SIAM J. Sci. Comput. 31(5), 3712–3743 (2009)
  20. Goldstein, T., Bresson, X., Osher, S.: Geometric applications of the split Bregman method: segmentation and surface reconstruction. J. Sci. Comput. 45(1), 272–293 (2010)
  21. Goldstein, T., O'Donoghue, B., Setzer, S., Baraniuk, R.: Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3), 1588–1623 (2014)
  22. Goldstein, T., Osher, S.: The split Bregman method for L1 regularized problems. SIAM J. Imaging Sci. 2(2), 323–343 (2009)
  23. He, B., Liao, L., Han, D., Yang, H.: A new inexact alternating directions method for monotone variational inequalities. Math. Program. 92(1), 103–118 (2002)
  24. He, B., Yuan, X.: On non-ergodic convergence rate of Douglas–Rachford alternating direction method of multipliers. Numer. Math. 130(3), 567–577 (2014)
  25. He, B., Yuan, X.: On the \(O(1/n)\) convergence rate of the Douglas–Rachford alternating direction method. SIAM J. Numer. Anal. 50(2), 700–709 (2012)
  26. Hong, M., Luo, Z.: On the linear convergence of the alternating direction method of multipliers. arXiv preprint arXiv:1208.3922v3 (2013)
  27. Jiang, H., Deng, W., Shen, Z.: Surveillance video processing using compressive sensing. Inverse Probl. Imaging 6(2), 201–214 (2012)
  28. Liang, J., Fadili, J., Peyre, G., Luke, R.: Activity identification and local linear convergence of Douglas–Rachford/ADMM under partial smoothness. arXiv preprint arXiv:1412.6858v5 (2015)
  29. Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979)
  30. Mateos, G., Bazerque, J., Giannakis, G.: Distributed sparse linear regression. IEEE Trans. Signal Process. 58(10), 5262–5276 (2010)
  31. Mendel, J., Burrus, C.: Maximum-Likelihood Deconvolution: A Journey into Model-Based Signal Processing. Springer, New York (1990)
  32. Rockafellar, R.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)
  33. Rockafellar, R.: Convex Analysis, vol. 28. Princeton University Press, Princeton (1997)
  34. Wang, Y., Yang, J., Yin, W., Zhang, Y.: A new alternating minimization algorithm for total variation image reconstruction. SIAM J. Imaging Sci. 1(3), 248–272 (2008)
  35. Yan, M., Yin, W.: Self equivalence of the alternating direction method of multipliers. arXiv preprint arXiv:1407.7400 (2014)
  36. Yang, J., Zhang, Y.: Alternating direction algorithms for \(\ell _1\)-problems in compressive sensing. SIAM J. Sci. Comput. 33(1–2), 250–278 (2011)
  37. Zhang, X., Burger, M., Osher, S.: A unified primal–dual algorithm framework based on Bregman iteration. J. Sci. Comput. 46(1), 20–46 (2011)
  38. Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005)

Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  1. Department of Computational and Applied Mathematics, Rice University, Houston, USA
  2. Department of Mathematics, University of California, Los Angeles, USA
