A Survey on Some Recent Developments of Alternating Direction Method of Multipliers

Recently, the alternating direction method of multipliers (ADMM) has attracted much attention from various fields, and many variants have been tailored for different models. Moreover, its theoretical study, such as the rate of convergence and extensions to nonconvex problems, has also progressed substantially. In this paper, we give a survey on some recent developments of ADMM and its variants.


Introduction
In this paper, we survey developments of the alternating direction method of multipliers (ADMM) and its variants for solving minimization problems with linear constraints and a separable objective function, i.e., a sum of individual functions without coupled variables:

min { ∑_{i=1}^m θ_i(x_i) : ∑_{i=1}^m A_i x_i = b, x_i ∈ X_i, i = 1, …, m },   (1)

where θ_i : R^{n_i} → R ∪ {+∞} are closed proper functions; A_i ∈ R^{l×n_i}; X_i ⊆ R^{n_i} are nonempty closed convex sets; b ∈ R^l; and ∑_{i=1}^m n_i = n. As a linearly constrained optimization problem, the model (1), though special, is rich enough to characterize many optimization problems arising from various application fields, e.g., the image alignment problem in [1], the robust principal component analysis model with noisy and incomplete data in [2], the latent variable Gaussian graphical model selection in [3], the quadratic discriminant analysis model in [4], and quadratic conic programming in [5], to list just a few.
We now give some concrete application models.
ℓ1-norm minimization: In applications such as statistics, machine learning, and signal processing, one often wants to find a 'sparse' solution from the given data. Let b ∈ R^l denote the observed signal, and suppose it comes from a linear transformation A ∈ R^{l×n} with l ≪ n. The task is to find the sparsest solution, i.e., the vector that contains as many zero elements as possible and satisfies the equation Ax = b. Let ‖y‖_0 denote the number of nonzero elements of the vector y; then we can formulate the problem as

min_{x∈R^n} (1/2)‖Ax − b‖² + μ‖x‖_0,   (2)

where μ > 0 is a scalar regularization parameter that is usually chosen by cross-validation. Introducing a new variable y, we can reformulate (2) as

min_{x,y∈R^n} { (1/2)‖Ax − b‖² + μ‖y‖_0 : x − y = 0 },   (3)

which is a special case of (1) with m = 2.
Since the zero norm is discontinuous and nonconvex, researchers usually replace it with its convex hull, the ℓ1-norm. Then, (2) and (3) can be relaxed to

min_{x∈R^n} (1/2)‖Ax − b‖² + μ‖x‖_1   (4)

and

min_{x,y∈R^n} { (1/2)‖Ax − b‖² + μ‖y‖_1 : x − y = 0 },   (5)

respectively. The model (4) is just the well-known lasso [6]. A generalization of the above model arises when it is not the solution itself but a linear transformation of it that is required to be sparse, and the optimization model is

min_{x∈R^n} (1/2)‖Ax − b‖² + μ‖Fx‖_0,

where F is an arbitrary linear transformation. Again, after introducing an auxiliary variable, we get a special case of (1) with m = 2,

min_{x,y} { (1/2)‖Ax − b‖² + μ‖y‖_0 : Fx − y = 0 },   (6)

and replacing the zero norm with its convex hull, the ℓ1-norm, we obtain the relaxed model

min_{x,y} { (1/2)‖Ax − b‖² + μ‖y‖_1 : Fx − y = 0 },   (7)

which is called the generalized lasso. When F is the difference matrix, ‖Fx‖_1 is the total variation of x [7], which finds wide applications in image processing. Other ℓ1-norm minimization models that can be reformulated into (1) include basis pursuit [8], Huber function fitting [9], group lasso [10], etc.
Matrix completion: In applications such as movie ratings in the Netflix problem, part of the data (elements of a matrix) is inaccessible, and the task is to fill in the missing entries of a partially observed matrix. That is, given a ratings matrix in which each entry (i, j) represents the rating of movie j by customer i if customer i has watched movie j and is otherwise missing, we would like to predict the remaining entries. A property that helps to accomplish the task is that the preferred matrix is of low rank, or its rank is known a priori; otherwise the hidden entries could be assigned arbitrary values.
Let M be the matrix to be recovered and let Ω be the set of locations corresponding to the observed entries ((i, j) ∈ Ω if M_{ij} is observed). The optimization model is [11]

min_{x∈R^{l×n}} { rank(x) : x_{ij} = M_{ij}, (i, j) ∈ Ω }.   (8)
Just as (5) relaxes (2), we can also relax and reformulate (8) into the convex separable problem

min_{x,y∈R^{l×n}} { ‖x‖_* : y_{ij} = M_{ij} for (i, j) ∈ Ω, x − y = 0 },   (9)

where ‖x‖_* denotes the nuclear norm of the matrix x, defined as the sum of its singular values. Then, we obtain a special case of (1) with m = 2 and with matrix variables.
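In ADMM applied to such nuclear norm models, the x-subproblem is the proximal operator of the nuclear norm, which is given by singular value thresholding. A minimal sketch in Python (the function name `svt` and the toy matrix are illustrative, not from the cited works):

```python
import numpy as np

def svt(X, tau):
    # Proximal operator of tau * (nuclear norm) at X:
    # shrink each singular value by tau and truncate at zero.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

# Toy example: for a diagonal matrix the singular values are the
# diagonal entries, so thresholding acts on them directly.
B = svt(np.diag([3.0, 1.0, 0.2]), 0.5)
```

Singular values below the threshold are set to zero, which is how the subproblem produces low-rank iterates.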
Robust principal component analysis: Given part of the elements of a data matrix that is the superposition of a low-rank matrix and a sparse matrix, robust principal component analysis (RPCA) aims to recover each component individually [12]. Moreover, the given data may be corrupted by noise. As in the matrix completion example, let Ω be the set of locations corresponding to the given entries, and let P_Ω : R^{l×n} → R^{l×n} be the orthogonal projection onto the span of matrices vanishing outside of Ω, i.e., the ij-th entry of P_Ω(x) is x_{ij} if (i, j) ∈ Ω and zero otherwise. The optimization model for the robust principal component analysis problem is

min_{x,y,z∈R^{l×n}} { rank(x) + τ_1‖y‖_0 + τ_2‖P_Ω(z)‖²_F : x + y + z = M },   (10)

where M is the given data and ‖·‖_F denotes the Frobenius norm of a matrix. Relaxing the rank and the zero norm with their convex hulls, we obtain the relaxation [2]

min_{x,y,z∈R^{l×n}} { ‖x‖_* + τ_1‖y‖_1 + τ_2‖P_Ω(z)‖²_F : x + y + z = M }.   (11)

Both (10) and (11) fall into the framework of (1) with m = 3. Note that the original application models (2), (6), (8) and (10) contain discrete terms such as the ℓ0-norm and the rank function, and they are nonconvex optimization problems. Solving these optimization problems is usually NP-hard (nondeterministic polynomial-time hard). Researchers thus heuristically turn to solving the relaxation problems (5), (7), (9) and (11). Fortunately, under suitable conditions, the relaxed problem and the original one share the same solutions [11,13].
The relaxation problems (5), (7), (9) and (11) are convex optimization problems, and there are many state-of-the-art solvers for them. In particular, the problems can be further reformulated as linear programs (LP) or semidefinite programs (SDP), which interior point algorithms and Newton-type methods can solve. Nevertheless, the rapid increase in dimension poses a great challenge to such solvers [14]. The hardness induced by high dimensionality, however, can be alleviated by exploiting the problems' structure. In all these models, the objective function is the sum of several individual functions, and the constraints are linear equalities. We call these optimization models, and the unified form (1), separable optimization problems with linear constraints.
Among the methods that take advantage of the separable structure of the model (1), the alternating direction method of multipliers (ADMM) has attracted much attention. For these modern application problems arising from big data and artificial intelligence, ADMM performs reasonably well. Though the individual component functions can be nonsmooth, the subproblems in ADMM are very easy to solve, and they often even possess closed-form solutions. This makes the method relatively simple to apply. Moreover, in these applications extremely high accuracy is not required; as a consequence, the slow 'tail convergence' of ADMM is not a serious drawback.
With the rapid development of ADMM, there have been several survey papers. In particular: • In [15], the authors gave a thorough survey of ADMM, mainly from the viewpoint of applications in statistics and machine learning. Essentially, it is these applications from big data and artificial intelligence that have driven the renaissance of ADMM.
• In [16], the author delineated the origin of ADMM from a historical point of view. ADMM originates from numerical methods for partial differential equations; the works [17-19] developed it and extended it to variational inequality problems. • In [20], the authors pointed out that while ADMM can be regarded as an inexact application of the well-known augmented Lagrangian method (ALM), this view is helpful to the convergence analysis. They then presented an accessible account of the 'operator splitting' version of the ADMM convergence proof.
Though these papers survey ADMM from different points of view, a large part of the recent developments is still not covered, e.g., the extension from the classical two-block case to the multi-block case, the rate of convergence, and the extension to the nonconvex case. In this paper, we give a thorough survey on these aspects of the development of ADMM. ADMM is an augmented Lagrangian-based method. Introducing a Lagrange multiplier λ for the linear constraint, we can write down the augmented Lagrangian function associated with (1):

L_β(x_1, …, x_m, λ) = ∑_{i=1}^m θ_i(x_i) + λ^T(∑_{i=1}^m A_i x_i − b) + (β/2)‖∑_{i=1}^m A_i x_i − b‖²,

where β > 0 is a parameter. Throughout, the solution set of (1) is assumed to be nonempty. Moreover, we use the notation X = X_1 × X_2 × ⋯ × X_m and x = (x_1, x_2, ⋯, x_m) ∈ R^n, and denote by x_{−i} the subvector of x excluding only x_i, i.e., x_{−i} = (x_1, …, x_{i−1}, x_{i+1}, …, x_m). Similarly, denote by X_{−i} the subset of X excluding only X_i, i.e., X_{−i} = X_1 × ⋯ × X_{i−1} × X_{i+1} × ⋯ × X_m. In the alternating direction method of multipliers for solving (1), the variable x_1 plays only an intermediate role and is not involved in the execution of the main recursion. Therefore, the input for executing the next iterate of ADMM is only x^k_{−1} and λ^k, and in the convergence analysis, we usually use the sequence {(x^k_{−1}, λ^k)}. For notational convenience, let v = (x_{−1}, λ) and w = (x, λ). The variables with superscripts, such as v^*, v^k, w^* and w^k, are defined similarly.
For any vector x, min{0, x} is a vector with the same dimension as x, whose i-th element is x_i if x_i < 0 and 0 otherwise. The Euclidean norm is denoted by ‖·‖. For any convex function θ : X → (−∞, +∞], we use dom θ to denote its effective domain, i.e., dom θ := {x ∈ X : θ(x) < ∞}; epi θ to denote its epigraph, i.e., epi θ := {(x, t) ∈ X × R : θ(x) ≤ t}; and θ^* : X → (−∞, +∞] represents its Fenchel conjugate, i.e., θ^*(y) = sup_x {y^T x − θ(x)}. The multivalued mapping ∂θ : R^n ⇒ R^n is called the subdifferential of θ, and an element ξ ∈ ∂θ(x) is a subgradient of θ at x, i.e., θ(y) ≥ θ(x) + ξ^T(y − x) for all y. For a set C in R^n, ri(C) denotes the relative interior of C; and P_C denotes the Euclidean projector onto C, which maps a vector in R^n onto the nearest point of C if it is closed.
The rest of the paper is organized as follows. In the next section, we give a brief review of the augmented Lagrangian method, since one of the main ways of understanding ADMM is as an approximate application of ALM, using one sweep of block coordinate minimization to approximately minimize the augmented Lagrangian per iteration. We then review the classical ADMM in Sect. 3, which is divided into four subsections focusing on four different factors. The first is deriving ADMM from the Douglas-Rachford splitting method (DRSM), which helps in understanding the convergence analysis; the other three factors impact the efficiency of the implementation of ADMM, i.e., selection of the penalty parameter, easier subproblems obtained from splitting, and approximate solution of the subproblems. In Sect. 4, we review results on the rate of convergence of ADMM, including the sublinear rate and the linear rate. In Sect. 5, we review extensions and variants of ADMM. We first present the counterexample from [21], which shows that for the multi-block case, although the direct extension of ADMM performs well in many applications, it does not always converge for the convex separable problem of the form (1) when m ≥ 3. We then review two development directions for the multi-block case, i.e., conditions that guarantee convergence, and simple variants of the algorithm such as adding a correction step. In Sect. 6, we review the recent developments of ADMM for solving (1) where some of the component functions θ_i are nonconvex. It is usually thought that when there is a nonconvex component, the minimization problem (1) is harder than its convex counterpart. We show that in some special cases, e.g., when m = 2 and the model is 'strongly+weakly' convex, the situation is the same as in the fully convex case.
For the general model, when there is no strongly convex component, we can still obtain some results with the aid of the Kurdyka-Lojasiewicz inequality [22,23] and/or error bound conditions. In Sect. 7, we list some future research questions and conclude the paper.

The Augmented Lagrangian Method
The classical augmented Lagrangian method (ALM) [24,25] is for solving the linearly constrained optimization problem

min { θ(x) : Ax = b },   (12)

where θ : R^n → R ∪ {∞} is a closed proper convex function, A ∈ R^{l×n}, and b ∈ R^l. The iterative scheme of ALM is

x^{k+1} = arg min_x L_β(x, λ^k),  λ^{k+1} = λ^k + τβ(Ax^{k+1} − b),   (13)

where τ ∈ (0, 2) is a parameter and

L_β(x, λ) = θ(x) + λ^T(Ax − b) + (β/2)‖Ax − b‖²

is the augmented Lagrangian function, λ ∈ R^l is the Lagrange multiplier associated with the linear equality constraint, and β > 0 is a penalty parameter. Let

d_β(λ) = min_x L_β(x, λ)

denote the augmented dual function of (12). Then, d_β is a differentiable function and

∇d_β(λ) = Ax(λ) − b,

where x(λ) is an optimal solution of the following problem (with parameter λ):

min_x L_β(x, λ).   (14)

Here and throughout the paper, we use 'arg min' to denote the solution set of the minimization problem. Moreover, ∇d_β is Lipschitz continuous with constant 1/β. Hence, the augmented Lagrangian method can be viewed as a gradient method for the augmented dual problem

max_{λ∈R^l} d_β(λ).

The differentiability of d_β is based on the fact that, although the solution x(λ) is usually not unique, Ax is constant over X(λ), where X(λ) denotes the set of optimal solutions of (14). The following two lemmas are Lemmas 2.1-2.2 in [26], respectively. The proofs are mainly based on Danskin's Theorem [27, Prop. 4.5.1].

Lemma 1
For any λ ∈ R^l, Ax is constant over X(λ). Thus, the dual function d_β(·) is differentiable everywhere and

∇d_β(λ) = Ax(λ) − b,

where x(λ) ∈ X(λ) is any minimizer of (14).

D. Han
Lemma 2 ∇d_β is Lipschitz continuous with constant 1/β. That is,

‖∇d_β(λ_1) − ∇d_β(λ_2)‖ ≤ (1/β)‖λ_1 − λ_2‖  for all λ_1, λ_2 ∈ R^l.

The optimality conditions for (12) are the primal feasibility and dual feasibility

Ax^* = b  and  0 ∈ ∂θ(x^*) + A^T λ^*.

Assuming that strong duality holds, the optimal values of the primal and the dual problems are the same. Once we get the dual optimal solution λ^*, the original solution can be obtained by solving the optimization problem min_x L(x, λ^*). A predecessor of the augmented Lagrangian method is the dual ascent method. The Lagrangian function for (12) is

L(x, λ) = θ(x) + λ^T(Ax − b),

and the dual function is

d(λ) = min_x L(x, λ) = −θ^*(−A^T λ) − b^T λ,

where θ^* is the conjugate of θ [28]. The iterative scheme of the dual ascent method is

x^{k+1} = arg min_x L(x, λ^k),  λ^{k+1} = λ^k + α_k (Ax^{k+1} − b),

where α_k > 0 is a step size.
The augmented Lagrangian method is usually more efficient (less CPU time to find an approximate solution) and more robust (its performance does not depend heavily on parameters such as the initial point and the step size) than the dual ascent method. However, the dual ascent method has the advantage that it can lead to decentralized primal subproblems when, e.g., the objective function is the sum of component functions. Nevertheless, it is not an easy task to achieve a 'fully' decentralized manner in both primal and dual variables; the interested reader is referred to [29].
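To make the ALM recursion concrete, here is a minimal sketch (with τ = 1) for the toy strongly convex problem min ½‖x‖² s.t. Ax = b, whose x-subproblem has the closed form (I + βAᵀA)x = Aᵀ(βb − λ); the function name and the data are illustrative assumptions, not from the cited works:

```python
import numpy as np

def alm_quadratic(A, b, beta=1.0, iters=100):
    """ALM sketch for min (1/2)||x||^2 s.t. Ax = b.
    x-step: solve (I + beta*A^T A) x = A^T (beta*b - lam);
    lam-step: dual gradient ascent lam <- lam + beta*(Ax - b)."""
    l, n = A.shape
    lam = np.zeros(l)
    M = np.eye(n) + beta * A.T @ A
    for _ in range(iters):
        x = np.linalg.solve(M, A.T @ (beta * b - lam))
        lam = lam + beta * (A @ x - b)
    return x, lam

# Toy data: minimize ||x||^2/2 subject to x1 + x2 = 2.
A = np.array([[1.0, 1.0]])
b = np.array([2.0])
x, lam = alm_quadratic(A, b)
```

Here the multiplier update is exactly a gradient ascent step of length β on the augmented dual function d_β, as described above.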

The Alternating Direction Method of Multipliers
The classical alternating direction method of multipliers (ADMM) is for solving the linearly constrained convex optimization problem with two blocks of variables and functions

min { θ_1(x_1) + θ_2(x_2) : A_1 x_1 + A_2 x_2 = b },   (15)

where θ_i : R^{n_i} → R ∪ {∞} are closed proper convex functions, A_i ∈ R^{l×n_i}, i = 1, 2, and b ∈ R^l.

Definition 1 A KKT point for problem (15) is some (x_1^*, x_2^*, λ^*) such that

0 ∈ ∂θ_1(x_1^*) + A_1^T λ^*,  0 ∈ ∂θ_2(x_2^*) + A_2^T λ^*,  A_1 x_1^* + A_2 x_2^* = b.
Let λ ∈ R^l be the Lagrange multiplier, and let β > 0 be a penalty parameter. The augmented Lagrangian function of (15) is

L_β(x_1, x_2, λ) = θ_1(x_1) + θ_2(x_2) + λ^T(A_1 x_1 + A_2 x_2 − b) + (β/2)‖A_1 x_1 + A_2 x_2 − b‖²,   (16)

and the iterative scheme of ADMM is

x_1^{k+1} = arg min_{x_1} L_β(x_1, x_2^k, λ^k),
x_2^{k+1} = arg min_{x_2} L_β(x_1^{k+1}, x_2, λ^k),   (17)
λ^{k+1} = λ^k + τβ(A_1 x_1^{k+1} + A_2 x_2^{k+1} − b),

where τ ∈ (0, (1+√5)/2). In most of the paper, we omit the parameter τ for succinctness of the discussion (τ is fixed to be 1), though in applications a larger τ is advisable. ADMM was originally proposed by Glowinski and Marrocco [17], and Gabay and Mercier [18]. Recently, due to its great success in solving the optimization problems arising from machine learning, statistics, and artificial intelligence, ADMM has received more and more attention, and there have been several survey papers from different points of view [15,16,20].
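As a self-contained illustration of the recursion (17) (with τ = 1 and the scaled multiplier u = λ/β), consider the toy consensus problem min ½‖x₁ − c₁‖² + ½‖x₂ − c₂‖² s.t. x₁ − x₂ = 0, whose two subproblems have closed forms; the function name and data are illustrative assumptions:

```python
import numpy as np

def admm_consensus(c1, c2, beta=1.0, iters=200):
    """Two-block ADMM sketch for min 1/2||x1-c1||^2 + 1/2||x2-c2||^2
    s.t. x1 - x2 = 0 (A1 = I, A2 = -I, b = 0); u = lam/beta is scaled."""
    x2 = np.zeros_like(c1)
    u = np.zeros_like(c1)
    for _ in range(iters):
        x1 = (c1 + beta * (x2 - u)) / (1.0 + beta)  # x1-subproblem, closed form
        x2 = (c2 + beta * (x1 + u)) / (1.0 + beta)  # x2-subproblem, closed form
        u = u + (x1 - x2)                           # multiplier update
    return x1, x2, beta * u

c1 = np.array([0.0, 4.0]); c2 = np.array([2.0, 0.0])
x1, x2, lam = admm_consensus(c1, c2)
```

At convergence both blocks agree on the average (c₁ + c₂)/2, and the multiplier equals c₁ minus that average, as the KKT conditions require.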
One can certainly group the two blocks of variables and functions into a single variable and a single function, and then use the augmented Lagrangian method, with the following iterative scheme:

(x_1^{k+1}, x_2^{k+1}) = arg min_{x_1, x_2} L_β(x_1, x_2, λ^k),
λ^{k+1} = λ^k + β(A_1 x_1^{k+1} + A_2 x_2^{k+1} − b).   (18)

However, such a scheme does not exploit the separable structure of the problem (15), and usually the (x_1, x_2)-minimization problem has to be solved iteratively. Obviously, the scheme (17) is capable of exploiting the properties of θ_1 and θ_2 individually, making the subproblems much easier and sometimes easy enough to have closed-form solutions.
Comparing the (x_1, x_2)-minimization problems in (18) and (17), we can understand ADMM as a Gauss-Seidel implementation that approximately solves the subproblem in (18) with a single sweep of block minimization [16]. On the other hand, it can also be understood as the Douglas-Rachford splitting method applied to the dual problem of (15) [30].

Deriving ADMM from Douglas-Rachford Splitting
There are several ways of deriving ADMM from the Douglas-Rachford splitting method, and here we adopt the one in [31]. Consider the problem (15), where θ_1 and θ_2 are proper closed convex functions. Its dual function is

d(λ) = min_{x_1, x_2} { θ_1(x_1) + θ_2(x_2) + λ^T(A_1 x_1 + A_2 x_2 − b) } = −θ_1^*(−A_1^T λ) − θ_2^*(−A_2^T λ) − b^T λ,

and the dual problem is

max_{λ∈R^l} d(λ),   (19)

where λ ∈ R^l is the dual variable. By Fermat's rule, if λ^* is an optimal solution of the dual problem (19), then 0 ∈ ∂(−d)(λ^*).
Let 'ri,' 'dom,' and '∘' denote the relative interior of a set, the effective domain of a function, and the composition of two operators, respectively. Assuming a suitable constraint qualification (a relative-interior condition on the domains of the conjugates), the subdifferential of −d splits as the sum of two maximal monotone operators, and the dual problem becomes the zero-finding problem

find λ ∈ R^l such that 0 ∈ A(λ) + B(λ),   (20)

where A and B collect the two separable parts of ∂(−d).   (21)

For any α > 0, we have [30]

0 ∈ A(λ) + B(λ)  ⟺  R_{αA}(R_{αB}(z)) = z and λ = J_{αB}(z),

where for a maximal monotone operator T : R^n ⇒ R^n, J_{αT} = (I + αT)^{−1} denotes the resolvent of T, and R_{αT} = 2J_{αT} − I denotes its Cayley operator (reflection operator).
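The fixed-point recursion can be run directly once the two resolvents are available. The sketch below applies Douglas-Rachford splitting not to the dual as above, but to a toy primal problem min ‖x‖₁ + ½‖x − c‖², where both resolvents (proximal operators) are explicit; the names and data are illustrative assumptions:

```python
import numpy as np

def prox_l1(z, alpha):
    # Resolvent of alpha * subdifferential of the l1-norm (soft threshold).
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def prox_quad(z, alpha, c):
    # Resolvent of alpha * gradient of (1/2)||x - c||^2.
    return (z + alpha * c) / (1.0 + alpha)

def douglas_rachford(c, alpha=1.0, iters=100):
    """DRS sketch for min ||x||_1 + 1/2||x - c||^2:
    z <- z + J_aB(2 J_aA(z) - z) - J_aA(z); the solution is J_aA(z)."""
    z = np.zeros_like(c)
    for _ in range(iters):
        x = prox_l1(z, alpha)               # J_{alpha A}(z)
        y = prox_quad(2 * x - z, alpha, c)  # J_{alpha B} at the reflected point
        z = z + y - x
    return prox_l1(z, alpha)

c = np.array([3.0, 0.5, -2.0])
x = douglas_rachford(c)
```

For this toy problem the exact minimizer is the soft-thresholded vector S₁(c), which the recursion reaches up to machine precision.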

Finally, we swap the order of the updates to get the correct dependency and substitute x̄_1^k = x_1^{k+1} to obtain the alternating direction method of multipliers (17). From the above derivation, we can see that the classical alternating direction method of multipliers for solving (15) can be viewed as an application of the Douglas-Rachford splitting method to the optimality condition of the dual problem (19). As a consequence, the global convergence of ADMM can be obtained directly from the convergence result of the Douglas-Rachford splitting method [30, Thm. 8]. Here, we summarize the convergence of ADMM in a way similar to [20, Prop. 4.6].

Theorem 1 Consider the problem (15), where θ_1 and θ_2 are proper closed convex functions. Let

d_1(λ) = min_{x_1} { θ_1(x_1) + λ^T A_1 x_1 },  d_2(λ) = min_{x_2} { θ_2(x_2) + λ^T A_2 x_2 }.   (25)

Suppose that all subgradients of the functions d_1 and d_2 at each point λ ∈ R^l take the form A_1 x̄_1 and A_2 x̄_2, respectively, where x̄_i, i = 1, 2, attain the stated minima over x_i in (25). Let the constant β > 0 be given and suppose there exists a KKT point for problem (15). Then, the sequences {x_1^k} ⊂ R^{n_1}, {x_2^k} ⊂ R^{n_2}, and {λ^k} ⊂ R^l conforming to the recursions (17) converge, and the limit is some KKT point for problem (15).

Selection of the Penalty Parameter
The penalty parameter β in the augmented Lagrangian function plays an important role for both methods, ALM and ADMM. Experience with applications has shown that if the fixed penalty β is chosen too small or too large, the efficiency of the methods can degrade significantly. In fact, there is no 'optimal' fixed parameter. For ALM, Rockafellar proposed that it should be varied along the iterations [33,34]. For ADMM, He and Yang [35] suggested choosing β_k either in an increasing manner or in a decreasing manner; [36] took a decreasing sequence of symmetric positive-definite penalty matrices; and in [37,38], the authors designed a self-adaptive strategy for choosing the parameter, i.e., the parameter can be increased or decreased according to some rules. The adjusting rule in [37,38] is:

β_{k+1} = β_k (1 + ν_1)  if ‖r^{k+1}‖ > μ‖s^{k+1}‖,
β_{k+1} = β_k / (1 + ν_2)  if ‖s^{k+1}‖ > μ‖r^{k+1}‖,   (26)
β_{k+1} = β_k  otherwise,

where μ, ν_1, ν_2 are positive parameters, and r^{k+1} and s^{k+1} denote the primal error and the dual error associated with the iteration, respectively. The philosophy behind this 'simple scheme that often works well' [15] is as follows. The optimality conditions for (15) are primal feasibility and dual feasibility:

A_1 x_1^* + A_2 x_2^* = b,   (27)
0 ∈ ∂θ_1(x_1^*) + A_1^T λ^*,  0 ∈ ∂θ_2(x_2^*) + A_2^T λ^*.   (28)

From the ADMM iteration scheme (17) we can see that x_2^{k+1} and λ^{k+1} always satisfy the dual feasibility condition for the x_2-block, while the condition for the x_1-block holds only up to the term β A_1^T A_2 (x_2^k − x_2^{k+1}). Hence, the primal residual and the dual residual of the iterative scheme (17) with respect to the optimality conditions (27)-(28) are

r^{k+1} = A_1 x_1^{k+1} + A_2 x_2^{k+1} − b  and  s^{k+1} = β A_1^T A_2 (x_2^k − x_2^{k+1}).

Since ADMM is a primal-dual algorithm, it is desirable that the primal residual and the dual residual behave in a coherent way. The adjusting rule (26) does exactly this: when the primal residual is too large compared to the dual residual, it increases the penalty parameter; when the primal residual is too small compared to the dual residual, it decreases the penalty parameter; otherwise, no adjustment is needed.
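The rule (26) takes only a few lines of code; the specific constants below (μ = 10 and a factor of 2 for increase and decrease) are common illustrative choices, not values prescribed by [37,38]:

```python
def update_beta(beta, r_norm, s_norm, mu=10.0, factor=2.0):
    # Residual balancing: grow beta when the primal residual dominates,
    # shrink it when the dual residual dominates, else leave it unchanged.
    if r_norm > mu * s_norm:
        return beta * factor
    if s_norm > mu * r_norm:
        return beta / factor
    return beta
```

In practice the update is often disabled after a fixed number of iterations so that the convergence theory for a fixed β applies.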
We have the following convergence result on the alternating direction method of multipliers with the self-adaptive parameter selection strategy [37, Theorem 4.1]: the sequence {(x_1^k, x_2^k, λ^k)} generated by the self-adaptive ADMM applied to (15) converges to a KKT point of (15).

Easier Subproblems
One main reason for the renaissance of ADMM is that, when applied to modern application models, the subproblems are easy to solve and in many cases possess closed-form solutions. For example, for solving the ℓ1-norm minimization problem (5), the iterative scheme is

x^{k+1} = (A^T A + βI)^{−1}(A^T b + βy^k − λ^k),
y^{k+1} = S_{μ/β}(x^{k+1} + λ^k/β),   (30)
λ^{k+1} = λ^k + β(x^{k+1} − y^{k+1}),

where for α > 0, S_α is the soft-thresholding operator, defined componentwise as

S_α(t) = sign(t) max{|t| − α, 0}.

The x-minimization is a system of linear equations, and the coefficient matrix is symmetric and positive definite. Hence, it can be solved easily; in particular, when A has some circulant structure, as in signal/image processing, the system can be solved via the fast Fourier transform and the computational cost is very low. However, when we have to solve the subproblems via iterative algorithms, we need to balance the cost of solving the subproblems against that of the outer iteration. That is, we need to design variants of the classical ADMM with easier subproblems.
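The scheme above is only a few lines of code. The sketch below uses the scaled multiplier u = λ/β; the function names and the toy data (with A = I, so the minimizer is simply S_μ(b)) are illustrative assumptions:

```python
import numpy as np

def soft(z, alpha):
    # Soft-thresholding operator S_alpha(z) = sign(z) * max(|z| - alpha, 0).
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def admm_lasso(A, b, mu, beta=1.0, iters=300):
    """ADMM sketch for min 1/2||Ax-b||^2 + mu||y||_1 s.t. x - y = 0
    (the standard lasso splitting; u is the scaled multiplier)."""
    n = A.shape[1]
    y = np.zeros(n); u = np.zeros(n)
    M = A.T @ A + beta * np.eye(n)  # SPD coefficient matrix of the x-step
    Atb = A.T @ b
    for _ in range(iters):
        x = np.linalg.solve(M, Atb + beta * (y - u))  # linear system
        y = soft(x + u, mu / beta)                    # closed-form l1 prox
        u = u + x - y
    return y

A = np.eye(3)
b = np.array([3.0, 0.2, -1.5])
x = admm_lasso(A, b, mu=1.0)
```

In a serious implementation the matrix M would be factored once (e.g., by Cholesky) outside the loop and reused at every iteration.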
In applications, there are usually simple constraints on the variables; e.g., in image processing, the pixel values should be bounded in [0, 255] or [0, 1], and we have the following optimization model [39]:

min { θ_1(x_1) + θ_2(x_2) : A_1 x_1 + A_2 x_2 = b, x_1 ∈ X_1, x_2 ∈ X_2 },

where X_i are closed convex subsets of R^{n_i}, i = 1, 2. Certainly, we can move the constraints into the objective function via the indicator function ι_{X_i} of a given closed convex set X_i. Nevertheless, the corresponding subproblems are still constrained optimization problems, which usually excludes the possibility of closed-form solutions, even when both the function θ_i and the matrix A_i are simple enough, e.g., when θ_i is the ℓ1-norm and A_i is the identity. In [40], the authors designed a new ADMM-type method, in which the main cost per iteration is to solve subproblems of the form

min_{x_i} θ_i(x_i) + (1/(2α_k))‖x_i − s_i^k‖²,   (31)

where s_i^k is a computed point and α_k is a step size computed according to some formula to ensure convergence, which can also be chosen smaller than a threshold. Comparing the iterative scheme (31) with the ADMM scheme (17), we find the following. 1. Regardless of whether there is a constraint X_i in the i-th minimization problem, and regardless of what the constraint matrix A_i is, the i-th minimization problem in (31) is always the evaluation of the proximal operator of the component function θ_i at a given point s_i^k. This is particularly advantageous when both the function θ_i and the set X_i are simple in the sense that evaluating their individual proximal operators (projection onto the set) is easy, while evaluating the proximal operator jointly is difficult. 2. The iterative scheme (31) essentially comes from applying the Douglas-Rachford splitting method [32] to the optimality condition of (15), followed by a simple decoupling technique. Recall that ADMM is the application of the Douglas-Rachford splitting method [32] to the dual of (15); thus (31) provides an alternative for solving (15).
The numerical results reported in [40] show the advantage of (31) over ADMM, especially when the subproblems of ADMM do not possess closed-form solutions while those of (31) do.
The following result shows that the alternating direction method of multipliers with easier subproblems is convergent [40, Thm. 4.1].
Remark 1 Besides the numerical advantages listed above, the customized decomposition algorithm also has some advantages from the theoretical point of view. In particular, as stated in Theorem 3, the 'whole' sequence generated by the method converges, rather than merely a subsequence.

Approximate Solutions of the Subproblems
The slow 'tail convergence' of ADMM used to be criticized by some researchers. However, in recent applications, large-scale optimization problems are not required to be solved to extremely high accuracy. Hence, there is no need to obtain highly accurate solutions per iteration. At the same time, when the subproblems are solved numerically, we must design stopping criteria to exit the inner iterations; only then is ADMM a practical algorithm. Although adopting approximate solutions of subproblems has a long history [41], the first inexact ADMM comes from [38]. In [38], besides selecting the penalty parameter in a self-adaptive way, the authors designed an accuracy criterion under which the algorithm still converges.
To get an inexact ADMM, the authors of [38] modified the subproblems in (16) by adding a quadratic term, as done in [42], so that the x_i-subproblem becomes

x_i^{k+1} ≈ arg min_{x_i} L_β(⋯, x_i, ⋯, λ^k) + (1/2)‖x_i − x_i^k‖²_{H_i^k}.   (32)

The accuracy criterion is

‖x_i^{k+1} − x̄_i^{k+1}‖ ≤ ν_{i,k},  i = 1, 2.   (33)

The scalar sequences {ν_{i,k}} are nonnegative and satisfy ∑_{k=1}^∞ ν_{i,k} < +∞, H_i^k ⪰ αI with α > 0, and x̄_i^{k+1} are the exact solutions of the corresponding subproblems.
At first glance, the accuracy criterion (33) seems impractical, since it involves the exact solution x̄_i^{k+1}. However, since the objective functions in (32) are strongly convex with modulus r_k > α > 0, according to Lemma 1 in [38], the distance ‖x_1^{k+1} − x̄_1^{k+1}‖ to the exact solution can be bounded in terms of a subgradient ξ^{k+1} of the objective of the x_1-minimization problem at x_1^{k+1}; this is the error bound inequality (34). One can then take μ ≥ α^{−1} and find an x_1^{k+1} whose subgradient residual is small enough; this guarantees that x_1^{k+1} satisfies (33), yielding inequality (37). Inequality (37) thus provides a practical and achievable condition for satisfying (33). Following the same discussion, a similar condition of satisfying (33) can be established for the x_2-minimization problem.
The inequality (34) is an error bound inequality [43] in the sense that the left-hand side is the distance between a point x_1^{k+1} and a solution point x̄_1^{k+1}, and the right-hand side is a computable function of x_1^{k+1}. This is due to the quadratic terms added to the objective functions, which make the objective functions strongly convex. Besides this property, another important effect is to improve the condition number of the problem. Recall that the convergence properties of numerical algorithms for optimization, from the steepest descent method and the conjugate gradient method to Newton-type methods, all depend on the condition number. From this point of view, the quadratic term makes more state-of-the-art solvers applicable.
Note that we also have freedom in choosing the matrices H_i^k, and in [38] it is suggested to choose them in a self-adaptive way as in (26). Note, however, that in some applications such as those listed in the introduction, the original objective function θ_i is easy enough in the sense that it possesses a separable structure, e.g., the ℓ1-norm, which is the sum of the absolute values of the entries, while the coefficient matrix A_i may not possess any structure. In other words, the difficulty in solving the ADMM subproblems is caused by the augmented quadratic term. In this case, we can select H_i^k = μ_{i,k} I − β A_i^T A_i, which cancels the quadratic coupling and leads to proximal-operator-evaluation subproblems:

x_i^{k+1} = arg min_{x_i} θ_i(x_i) + (μ_{i,k}/2)‖x_i − s_i^k‖²,   (38)

where s_i^k collects the known quantities from x^k and λ̄^k. The iterative scheme (38) is called the linearized alternating direction method of multipliers [44-46], whose main computational task is the evaluation of proximal operators, and it has shown its advantage in solving modern optimization problems arising from statistics [47]. The technique of linearizing the quadratic term of the augmented Lagrangian function to reduce the computational cost of solving the subproblems can be dated back to the work in [48], where the authors proposed a closely related scheme (39). Comparing the two augmented Lagrangian-based splitting schemes (38) and (39), we find that the only difference between them is that (38) solves the two minimization problems in a Gauss-Seidel manner, while (39) solves them in a Jacobian manner. Some variants of (39) can be found in [49].
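A minimal sketch of a linearized x-update: for the splitting min μ‖x‖₁ + ½‖y − b‖² s.t. Ax − y = 0, choosing H = (1/τ)I − βAᵀA with τβ‖AᵀA‖ < 1 turns the x-subproblem into a single gradient step followed by soft thresholding. The setup and data below are illustrative assumptions (with A = I the minimizer is S_μ(b)):

```python
import numpy as np

def soft(z, alpha):
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def linearized_admm(A, b, mu, beta=1.0, tau=0.9, iters=500):
    """Linearized ADMM sketch for min mu*||x||_1 + 1/2||y-b||^2
    s.t. Ax - y = 0; requires tau * beta * ||A^T A|| < 1."""
    l, n = A.shape
    x = np.zeros(n); y = np.zeros(l); lam = np.zeros(l)
    for _ in range(iters):
        # x-step: gradient of the augmented term, then the l1 prox
        g = A.T @ (lam + beta * (A @ x - y))
        x = soft(x - tau * g, tau * mu)
        # y-step: exact, closed form
        y = (b + beta * A @ x + lam) / (1.0 + beta)
        lam = lam + beta * (A @ x - y)
    return x

A = np.eye(3)
b = np.array([3.0, 0.2, -1.5])
x = linearized_admm(A, b, mu=1.0)
```

No linear system is solved anywhere: the only operations involving A are matrix-vector products, which is exactly the point of the linearization.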
In some applications, each component function θ_i is itself the sum of a smooth convex function and a simple nonsmooth function, i.e., θ_i = ϑ_i + ι_i, where ϑ_i is a smooth convex function and ι_i is a nonsmooth convex function. There are then two smooth parts in the x_i-minimization problem in (17), ϑ_i and the quadratic term. One can choose to linearize one of them, or linearize both, depending on the structure of the problem data. If one linearizes both, the x_i-minimization problem of the resulting linearized ADMM is simply the evaluation of the proximal operator of the nonsmooth convex function ι_i; moreover, the minimization subproblems can then be solved in a parallel manner. The inequality (33) and the summability requirement on the error parameter sequences {ν_{i,k}} (or square summability [50]) make this approach not always convenient in applications. Actually, one would prefer relative error criteria. The relative criterion was introduced in [51] for finding the zero of a maximal monotone operator, which includes convex optimization and monotone variational inequality problems as special cases. Using an approximate solution under a relative accuracy criterion, the next iterate is generated by a projection onto a hyperplane. Numerical algorithms with relative accuracy criteria are further studied in [52-54]. For the problem (1) with one block, i.e., m = 1, [55] proposed a practical relative error criterion for the augmented Lagrangian method (13). Exploiting the relationship between ADMM and both the proximal point algorithm and the Douglas-Rachford splitting method for maximal monotone operators, [56,57] recently proposed two new inexact ADMMs. In [56], a summable criterion and a relative criterion were presented.
The relative criterion presented in [56,57] is, however, restrictive in the sense that it allows only one of the two subproblems to be minimized approximately, though this covers commonly encountered special cases such as the lasso. Most recently, [58] proposed two relative criteria for ADMM. In the first, it is also required that only one subproblem be solved approximately, but to improve the numerical performance, the parameter τ can be larger. In the second, both subproblems are allowed to be solved approximately, while the dual step size parameter τ is restricted to be less than 1.

Rate of Convergence
With the popularity of research on numerical variants and applications of the alternating direction method of multipliers, theoretical analysis of its convergence behavior beyond global convergence, namely the rate of convergence, has attracted more and more attention.

Sublinear Rate
For first-order methods, a measure of the convergence speed is the iteration complexity, O(1/t), O(1/t²), and so on, where t is the iteration counter [59]. A worst-case O(1/t) convergence rate means that the solution accuracy under certain criteria is of the order O(1/t) after t iterations of the iterative scheme, or equivalently, that at most O(1/ε) iterations are required to find a solution of accuracy ε. For the convex optimization problem (15), ADMM's convergence rate appears to be at most sublinear, and [60,61] showed that it has a worst-case iteration complexity of O(1/t) in an ergodic sense.
There are several different measure functions in the literature; here we introduce two of them. The first one, used in [60], arises from characterizing the optimality condition of (15) as a variational inequality (VI): finding w^* = (x_1^*, x_2^*, λ^*) ∈ Ω such that

θ(u) − θ(u^*) + (w − w^*)^T F(w^*) ≥ 0  for all w ∈ Ω,   (45)

where u = (x_1, x_2), θ(u) = θ_1(x_1) + θ_2(x_2), and F is the associated monotone operator; the corresponding measure function g, defined with a scalar δ > 0 restricting w to a ball around the candidate point, is given in (43). Let {w^k} be the sequence generated by the ADMM (17), and let w̃_t denote the ergodic average of suitable auxiliary iterates. Then [60] proved that the VI inequality is violated by at most O(1/t) over the ball of radius δ; consequently, for a given accuracy tolerance ε > 0, an ε-approximate solution in the sense of (47) is obtained after at most O(1/ε) iterations. There are two main differences between (47) and (45). The first is that the inequality has been relaxed from zero to a nonnegative parameter ε; the second is that there is an artificial restriction ‖w − w̃_t‖ ≤ δ. Although (47) reduces to (45) when ε = 0, the added restriction means that they are vastly different when ε > 0. In fact, without the restriction ‖w − w̃_t‖ ≤ δ, the function g defined by (43) is nonnegative, and if g(w) = 0, then w is a solution of (45). Hence, g can be used as a merit function for the variational inequality characterization [43]. The measure (43), due to the restriction ‖w − w̃_t‖ ≤ δ, may be too loose [62].
We now introduce the second measure. Since (15) is a constrained optimization problem, a natural way of measuring an approximate solution is to consider both the objective function value approximation and the constraint violation [62,63].

Definition 3 Consider the constrained optimization problem
We say that ũ is an ε-optimal solution if where the first inequality represents the objective function approximation and the second one represents the constraint violation.
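Definition 3 can be checked mechanically. The helper below is a hypothetical sketch (the function name, toy problem, and data are all ours) that tests a candidate pair (x, y) against both the objective-gap and constraint-violation tolerances, assuming the optimal value f* is known, as it is in convergence analyses though rarely in practice.

```python
import numpy as np

def is_eps_optimal(theta1, theta2, A, B, b, x, y, f_star, eps):
    """Check the two conditions of Definition 3 for min theta1(x) + theta2(y)
    s.t. Ax + By = b: objective gap and constraint violation both <= eps."""
    obj_gap = abs(theta1(x) + theta2(y) - f_star)
    violation = np.linalg.norm(A @ x + B @ y - b)
    return obj_gap <= eps and violation <= eps

# Toy check: min x^2 + y^2 s.t. x + y = 2 has optimum x = y = 1, f* = 2.
A = np.array([[1.0]]); B = np.array([[1.0]]); b = np.array([2.0])
print(is_eps_optimal(lambda v: v[0] ** 2, lambda v: v[0] ** 2,
                     A, B, b, np.array([1.0]), np.array([1.0]), 2.0, 1e-8))
```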

D. Han
Cai and Han [62] considered the sublinear rate of convergence of the generalized ADMM (G-ADMM) proposed in [30]: where the parameter α ∈ (0, 2) is an acceleration factor; it is usually suggested to take α ∈ (1, 2). Note that the original ADMM scheme (17) is a special case of the G-ADMM scheme (49a)-(49c) with α = 1. Using these measures, it was proved that the G-ADMM needs at most ω/ε iterations to obtain an ε-optimal solution, where ω is Compared with the results in [60], the number of iterations required above depends on the distance between the initial point w 0 and the solution set, which is meaningful since the number of iterations should usually be larger when the initial point is farther away from the solution set and smaller when it is nearer.
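To illustrate the role of the relaxation factor α, here is a sketch of a generalized ADMM in the sense of (49a)-(49c) on the same kind of ℓ1 toy instance; the data and parameter choices are our own, α = 1.6 is a typical over-relaxation value, and α = 1 recovers the plain ADMM.

```python
import numpy as np

rng = np.random.default_rng(1)
l, n, mu, beta, alpha = 20, 50, 0.1, 1.0, 1.6   # alpha in (1, 2): over-relaxation
A = rng.standard_normal((l, n))
b = rng.standard_normal(l)

# G-ADMM for  min (1/2)||Ax - b||^2 + mu*||z||_1  s.t.  x - z = 0  (scaled dual u).
x = z = u = np.zeros(n)
K = np.linalg.inv(A.T @ A + beta * np.eye(n))
Atb = A.T @ b
for k in range(2000):
    x = K @ (Atb + beta * (z - u))
    x_hat = alpha * x + (1 - alpha) * z      # relaxed combination; alpha = 1 gives plain ADMM
    z = np.sign(x_hat + u) * np.maximum(np.abs(x_hat + u) - mu / beta, 0.0)
    u = u + x_hat - z

print(np.linalg.norm(x - z))   # primal residual; should be near zero
```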
The results in [60,62] are in the ergodic sense, since they use information from the whole iterative sequence. If we can establish the monotonic decrease of the measure sequence, the same results can be obtained in the non-ergodic sense, i.e., with the measure using only the current iterate's information. Moreover, the complexity is then essentially o(1/t) instead of O(1/t) [64]. For more results on the sublinear convergence of ADMM, we refer to [65][66][67].

Linear Rate
Compared with the large body of literature devoted to applications of the ADMM, there is a much smaller number of papers targeting the linear rate, in particular the Q-linear rate, of convergence. Recall that the classical ADMM for the two-block model (15) can be viewed as the Douglas-Rachford splitting method applied to the dual problem of (15), and that the Douglas-Rachford splitting method can in turn be viewed as a special application of the proximal point algorithm (PPA) to a certain maximal monotone operator. As a consequence, one can derive the corresponding R-linear convergence rate of the ADMM from [68] on the Douglas-Rachford splitting method with a globally Lipschitz continuous and strongly monotone operator, and from [33,41,69] on the convergence rates of PPAs under various error bound conditions imposed on certain splitting operators. Using these connections, an early work [70] proved the global R-linear convergence rate of the ADMM when applied to linear programming.
Recently, more and more interesting developments on the linear convergence rate of the ADMM were established.
For convex quadratic programming, [71] provided a local linear convergence result under the conditions of uniqueness of the optimal solutions to both the primal and dual problems and strict complementarity. The quadratic programming problem considered in [71] is and to use the ADMM, it was reformulated into which is a special case of the separable quadratic programming problem. A direct application of the classical ADMM yields the procedure 1. Set 2. Set z k+1 = arg min 3. Set The first step (53) is equivalent to a well-conditioned saddle-point problem, which is well studied in the numerical linear algebra community. The second step (54) is essentially the orthogonal projection onto the nonnegative orthant. These special structures in the algorithm enable the use of tools from numerical linear algebra to establish the local linear convergence of the ADMM under the conditions mentioned above [71].
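The two special structures just described, a saddle-point linear system followed by a projection onto the nonnegative orthant, can be seen in a small sketch. The QP data below are our own toy example, not the formulation of [71].

```python
import numpy as np

# Illustrative QP:  min (1/2) x^T Q x + q^T x  s.t.  Ax = b, x >= 0  (toy data, ours)
Q = np.eye(2)
q = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
beta = 1.0

# Split x >= 0 onto a copy z:  min ... + indicator(z >= 0)  s.t.  x - z = 0.
x = z = lam = np.zeros(2)
# The x-step solves a fixed saddle-point (KKT) system; it could be factored once.
KKT = np.block([[Q + beta * np.eye(2), A.T],
                [A, np.zeros((1, 1))]])
for k in range(2000):
    rhs = np.concatenate([beta * z - lam - q, b])
    x = np.linalg.solve(KKT, rhs)[:2]        # saddle-point step
    z = np.maximum(x + lam / beta, 0.0)      # projection onto the nonnegative orthant
    lam = lam + beta * (x - z)               # multiplier update

print(np.round(x, 4))  # approaches [0, 1], the solution of this toy QP
```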
In [72], the authors established the local linear rate of convergence of the generalized ADMM in the sense of [30], without assuming restrictive conditions such as uniqueness of the solution and strict complementarity. In the following, we introduce their results for the classical ADMM. The quadratic programming problem considered there is where Q and R are symmetric positive semidefinite matrices in R n×n and R m×m , respectively; A ∈ R l×n and B ∈ R l×m are two given matrices; and q ∈ R n , r ∈ R m , and b ∈ R l are given vectors. X = {x|C x = c, x ∈ R n (R n + )} and Y = {y|Dy = d, y ∈ R m (R m + )} are two polyhedral sets. (55) includes (52) as a special case with X = {x|C x = c, x ∈ R n }; Y = {y|y ∈ R m + }; R and r being the zero matrix and zero vector, respectively; and A = −B = I . The first key result established in [72] is the following lemma.

Lemma 3
The sequence {w k := (x k , y k , λ k )} generated by the ADMM satisfies where Based on (56), the main task is to bound the first term or the second term by the third one, i.e., to bound To this end, recall the residual function of the first-order optimality condition for (55) and the following error bound result [43,73].

Lemma 4
Let be a polyhedral set. Then, there exist scalars ε > 0 and τ > 0 such that d(w, * ) ≤ τ e(w, 1) for all w ∈ with e(w, β) ≤ ε, where * is the set of primal-dual solutions of the first-order optimality condition for (55) and d(w, * ) is the distance from w to the set * .
Taking γ = 1 is only for simplicity. Based on this lemma, we have the following result.
Lemma 5 Let {w k := (x k , y k , λ k )} be the sequence generated by the ADMM. Then, where Using (58) and (59), one can easily bound the distance between the iterates and the solution set by the distance between two consecutive iterates. Hence, the local linear rate of convergence of the ADMM for solving quadratic programming is established [72, Thm. 3.2].

Theorem 4
Let {w k } be the sequence generated by the ADMM scheme and denote {v k = (y k , λ k )}. When the iterate w k is close enough to * such that e(w k , 1) ≤ ε is satisfied, we have where Recently, [75] showed that the local linear rate result in [72] can be globalized under a slightly more general setting for the ADMM and a linearized ADMM. The key is that, instead of the local error bound result, a new 'global' error bound for piecewise linear multifunctions can be used.
A multi-valued mapping F is piecewise linear if its graph gph F := {(x, y)|y ∈ F(x)} is the union of finitely many polyhedral sets. One important class of piecewise linear multi-valued mappings consists of the subdifferentials of convex piecewise linear-quadratic functions. A closed proper convex function θ is said to be piecewise linear-quadratic if dom θ is the union of finitely many polyhedral sets and, on each of these polyhedral sets, θ is either an affine or a quadratic function. It was shown in [77] that a closed proper convex function θ is piecewise linear-quadratic if and only if the graph of ∂θ is piecewise polyhedral; see [78] for a complete proof and its extensions.
Most recently, [79] considered the model (32) and, under only a calmness condition, provided a global Q-linear rate convergence analysis for the ADMM. Here, the definition of calmness is taken from [80, Sec. 3.8(3H)], which says that a multifunction F : X ⇒ Y is calm at (x 0 , y 0 ) ∈ gph F with modulus κ 0 ≥ 0 if there exist a neighborhood V of x 0 and a neighborhood W of y 0 such that where B Y denotes the unit ball in the Euclidean space Y.
Furthermore, it is well known, e.g., [80, Thm. 3H.3], that for any (x 0 , y 0 ) ∈ gph F, the mapping F is calm at x 0 for y 0 if and only if F −1 , the inverse mapping of F, is metrically subregular at y 0 for x 0 , i.e., there exist a constant κ 0 ≥ 0, a neighborhood W of y 0 , and a neighborhood V of x 0 such that In [65], the authors provided a number of scenarios for the linear rate convergence of the ADMM and of the ADMM with quadratic terms (similar to (32) but allowing H k i to be positive semidefinite, denoted by sPADMM), under the assumptions that either θ 1 or θ 2 is strongly convex with a Lipschitz continuous gradient, together with a boundedness condition on the generated iteration sequence and other conditions. [65] also made a detailed comparison between their linear rate convergence result and that of [68]. Other interesting results can be found in [81,82], etc.

Extensions and Variants
As shown in the introduction, modern applications give rise to many problems that naturally (or after a simple reformulation) possess the structure of (1), where each component function θ i represents one natural property of the system. The most important intrinsic character of the model is that each component function θ i is simple enough, in the sense that its proximal operator is easy to evaluate or even possesses a closed-form solution, while the composition of any two (or more) of them is difficult. Hence, in a numerical algorithm, dealing with them one by one, in a Gauss-Seidel or a Jacobi manner, is a fundamental choice. When there are just one or two component functions, the classical augmented Lagrangian method and the alternating direction method of multipliers not only possess beautiful theoretical convergence results, but also exhibit good performance in many applications. For the model (1) with m ≥ 3, however, the situation is totally different.
One can certainly treat the well-structured problem (1) with the ALM scheme (13), or regroup the variables and component functions into two blocks and then treat the resulting model with the ADMM scheme (17). However, these two schemes treat the problem generically and ignore completely, or at least partially, the favorable separable structure of (1). Thus, this straightforward application of the ALM or the ADMM to (1) is not recommended. On the other hand, (13) and (17) provide us the possibility of developing customized algorithms that take the specific structure of (1) into consideration. Taking a close look at the minimization problems in (13) and (17), we find that the minimization tasks over the variables x i are coupled only through the quadratic term of the augmented Lagrangian function, or the quadratic term of a partial regrouping of them.
Therefore, we can split the minimization subproblem in (13) and (17) into m easier and smaller subproblems by applying a Gauss-Seidel or Jacobi decomposition to this quadratic term. With the alternating direction method of multipliers (17) in mind, one is naturally inclined to extend the scheme to the general case of (1) with m ≥ 3, obtaining the scheme (62). Though the numerical efficiency of (62) was verified empirically [1,2], its theoretical convergence was only partially understood. During the last decade, research mostly focused on the following three topics about decomposition methods for (1): 1. Does the scheme (62), the direct extension of the ADMM to m ≥ 3, converge in the convex case? 2. If the answer is 'no', then under what additional conditions does it converge? 3. Can the iterates generated by (62) be slightly twisted such that convergence can be guaranteed?

A Counter Example
A simple example against the convergence of the heuristic extension of the ADMM (62) was reported in [21], where the authors considered the special case where θ i ≡ 0 for i = 1, 2, 3, so that problem (1) reduces to solving the linear homogeneous equation

where A i ∈ R 3 and the matrix [A 1 , A 2 , A 3 ] is nonsingular and x i ∈ R. With these settings, the problem (63) has the unique solution x 1 = x 2 = x 3 = 0. When applying the iterative scheme (62) to (63), the recursion is where M = L −1 R and After obtaining the reformulation (64), the task of constructing a counterexample reduces to choosing concrete data A i for i = 1, 2, 3 such that the spectral radius of M, denoted by ρ(M), is larger than 1.
The concrete example constructed in [21] is A concrete example against the convergence of the heuristic extension of the alternating direction method of multipliers to m ≥ 3 is thus obtained.
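The divergence claim can be checked numerically. The sketch below rebuilds the linear iteration map of the direct 3-block extension (62) for problem (63) and computes ρ(M). The matrix entries are our recollection of the example in [21] (A 1 = (1,1,1)ᵀ, A 2 = (1,1,2)ᵀ, A 3 = (1,2,2)ᵀ) and should be checked against the original paper; β = 1 is our choice.

```python
import numpy as np

# Columns are A1, A2, A3 of the counterexample as recalled here (an assumption).
A = np.array([[1.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [1.0, 2.0, 2.0]])
beta = 1.0

def admm_step(s):
    """One Gauss-Seidel sweep of the direct 3-block extension on A1 x1 + A2 x2 + A3 x3 = 0.
    The state is s = (x1, x2, x3, lambda) in R^6."""
    x = s[:3].copy()
    lam = s[3:].copy()
    for i in range(3):
        Ai = A[:, i]
        others = A @ x - Ai * x[i]   # sum of the other blocks, with blocks < i already updated
        x[i] = -(Ai @ lam + beta * Ai @ others) / (beta * Ai @ Ai)
    lam = lam + beta * (A @ x)       # multiplier update with the new blocks
    return np.concatenate([x, lam])

# The map is linear with no constant term (b = 0), so its 6x6 matrix M can be
# recovered column by column from the standard basis vectors.
M = np.column_stack([admm_step(e) for e in np.eye(6)])
rho = max(abs(np.linalg.eigvals(M)))
print(rho)   # larger than 1: generic starting points diverge
```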

Conditions Guaranteeing Convergence
For the general convex case, researchers did not manage to prove the convergence of the direct extension scheme (62) for m ≥ 3 (in fact, the example in the last subsection shows its divergence). Prior to the work [21], efforts were made to prove the convergence of (62) by imposing additional conditions on the problem data.
In [21], the authors also gave some conditions on the constraint matrices that guarantee the convergence of (62). For example, when A 1 ᵀ A 2 = 0, or A 1 ᵀ A 3 = 0 and A 2 ᵀ A 3 = 0, they proved that the sequence generated by (62) converges and the worst-case convergence complexity is O(1/t). Since in these cases the model is essentially a two-block model and (62) is in some sense equivalent to (17), the convergence is long established.
Most work on conditions guaranteeing the convergence of the ADMM for (1) considers properties of the objective functions θ i . A strictly stronger condition than convexity is strong convexity. For the general case m ≥ 3, the first sufficient convergence condition for (62) was given in [83], which states:

Theorem 5
Suppose that for all i = 1, · · · , m, θ i is strongly convex with constant μ i > 0 and β < min Then, the sequence {x k } generated by (62) converges to a solution of (1) and {λ k } converges to an optimal multiplier.
for which the iteration scheme of the direct extension of the ADMM (62) can also be reformulated as the recursion (64). For β = 1, the spectral radius of the involved matrix is 1.008 7, indicating that the direct extension recursion does not always converge even for strongly convex optimization problems and justifying the necessity of choosing a suitable penalty parameter β, as in (65). The requirement that all the component functions θ i , i = 1, · · · , m, be strongly convex is too restrictive, excluding many potential applications of the ADMM (62). Some studies then tried to weaken this restriction. In [84,85], the condition was relaxed so that only m − 1 functions in the objective are required to be strongly convex to ensure the convergence of (62). Note that in the ADMM, even in the classical case m = 2, the first component does not appear in the convergence analysis, and it is usually regarded as an auxiliary part of the algorithm. From this point of view, it is not surprising that the strong convexity requirement can be relaxed from all components to m − 1 components. However, assuming strong convexity for m − 1 functions still excludes most of the applications that can be efficiently solved by the scheme (62). Thus, these conditions are only of theoretical interest and are usually too strict to be satisfied by the aforementioned applications.
Understanding the gap between the empirical efficiency of (62) (less CPU time than other algorithms) and the lack of theoretical conditions that can both ensure the convergence of (62) and be satisfied by applications of the abstract model (1) motivated further results. For notational simplicity, the results are stated for the case m = 3; they can be easily extended to the general case m > 3. In [5,86], the authors independently showed that the convergence of (62) can be ensured when one function in the objective of (1) is strongly convex, the penalty parameter β is appropriately restricted, and some assumptions on the linear operators A i hold; these conditions are satisfied by some concrete applications of (1). The assumption on the penalty parameter β in [5] is determined by checking the positive definiteness of certain operators, because that work targets a setting more general than (62), with larger step sizes for updating λ and semiproximal terms regularizing the x i -subproblems. The work [86] also established convergence rates for the scheme (62), including the worst-case convergence rate measured by the iteration complexity and, under some additional assumptions, an asymptotic globally linear convergence rate. See also [26,84,85,87] for some convergence rate analyses for (62).

Correction Step for Convergence
Besides imposing additional conditions to ensure the convergence of the direct extension of the ADMM for the general case m ≥ 3, a parallel line of research slightly twists the iterates. Given the high numerical efficiency of (62) in practice, the slighter the twist, the more desirable for keeping the nice properties of the algorithm.
This line of study is motivated by the observation that, though the point generated by the direct extension of the ADMM (62) is not qualified to serve as the next iterate, it provides useful information which, utilized judiciously, can be used to construct convergent and efficient algorithms.
Recall that one way to prove the convergence of the classical alternating direction method of multipliers is based on the contraction property of the iterates, or the Fejér monotonicity of the iterates with respect to the solution set. The Fejér monotonicity [88, Def. 5.1] of a sequence with respect to a closed convex set says that for two consecutive points v k+1 and v k , we have where α k is a step size, which can be a constant (set down prior to the start of the iterative process according to some rule) or computed along the iterations. Such a correction step can not only guarantee that the whole generated sequence converges to a solution of the problem, but also provide new freedom in the manner of generating the auxiliary point x̃ k . In [89], an ADMM with Gaussian back substitution was proposed. Let α ∈ (0, 1), and let Then, define Suppose that for all i = 2, · · · , m, A i ᵀ A i is nonsingular. Consequently, M is nonsingular and H is symmetric and positive definite. With the given iterate w k , and letting w̃ k be generated by the ADMM scheme (62), the new iterate w k+1 is generated as and The construction of the matrices M and Q guarantees that the generated sequence converges to a solution of the problem, based on the fact that for any v * ∈ V * , Based on this fact, the step size can be chosen dynamically [90] and the Gaussian back substitution procedure can be modified accordingly into the following form: where γ ∈ (0, 2) is a relaxation constant. From the implementation point of view, note that the matrix M is lower triangular and H is diagonal. As a consequence, the matrix M −T H in (68) is upper triangular. The procedure can be performed in reverse order by updating λ, x m , x m−1 , · · · , x 2 , x 1 sequentially, and this is where the name 'Gaussian back substitution' comes from.
The main cost in the back substitution step is the computation of the inverses of the matrices A i ᵀ A i . Hence, the algorithm has potential advantages when A i ᵀ A i is easy to invert, especially when A i ᵀ A i = I , as for the robust principal component analysis problem (11).
Many efforts were devoted to simplifying the correction step. For the case m = 3, [91] proposed a new correction scheme. Since numerical experiments on various applications indicate that the direct extension of the ADMM is efficient, one should twist the component variables as little as possible. The prediction step and the correction step of the method in [91] are S1. Prediction step: (69d) and S2. Correction step: Generate the new iterate (x k+1 , y k+1 , ζ k+1 , λ k+1 ) by respectively.
Note that the subproblems (69a)-(69c) in the prediction step differ slightly from the ADMM scheme (62) in the quadratic terms of the augmented Lagrangian function. In (69a)-(69c), an auxiliary variable ζ is introduced, which takes the place of Cz. Hence, the computational cost of solving the subproblems is the same as for (62). However, the reformulation provides great benefit in the correction step: the x part and the y part are kept, and only the z part and the dual variable are twisted. Moreover, the cost of the correction is very low, involving no additional matrix-vector product. The rationale behind the introduction of the auxiliary variable ζ is simple: in the x- and y-subproblems, what we essentially need is the information of By k and Cz k , not y k and z k themselves. Hence, in the procedure, we provide the information of By k and Cz k via the variables y k and ζ k , respectively.
For the general separable model (1), [26] proposed the following algorithm: where τ > 0 is sufficiently small. Denoting the point generated by (62) by w̃ k , the new iterate w k+1 generated by (71) can be viewed as one generated by and, as a consequence, the algorithm can be viewed as a prediction-correction method that only twists the dual variable λ. Assuming an error bound condition and some other conditions, [26] provided a linear rate of convergence for (71). From the theoretical point of view, this constitutes important progress in understanding the convergence and the linear rate of convergence of the ADMM; from the computational point of view, it is far from satisfactory, as in practical implementations one always prefers a larger step length for numerical efficiency.
The correction step can not only guarantee the convergence of the ADMM, but also provide freedom in generating the predictor. The predictor in the ADMM scheme (62) is generated in a Gauss-Seidel manner; an alternative is to generate the blocks simultaneously, leading to Jacobi-type splitting methods. In [92], a parallel splitting augmented Lagrangian method was proposed, whose prediction step and correction step are S1. Prediction step: Generate the predictor w̃ k = (x̃ k , λ̃ k ) by solving the following convex programs for i = 1, · · · , m (possibly in parallel): x̃ k i = arg min S2. Convex combination step to generate the new iterate The benefit introduced by the parallelization is great [92]; further results also indicated that when there are many blocks of variables, the advantage of Jacobi-type splitting methods becomes more obvious [93][94][95][96]. Computing the step size α k involves only matrix-vector products, which is simple. Nevertheless, in some applications the prediction step is also of low cost; in this case, we need to further simplify the correction step, e.g., by adopting a constant step size. Essentially, it was proved in [92, Lem. 3.6] that the computed step size α k is uniformly bounded away from zero; i.e., there is α min := 1/(3m + 1) such that α k ≥ α min for all k ≥ 0. Consequently, a constant step size α ∈ (0, α min ) is possible.
Extending the customized Douglas-Rachford splitting method (31) to the multiblock case, [97] proposed the following distributed splitting method: Combining the ideas of Gauss-Seidel splitting and Jacobi splitting, several partially parallel splitting methods were proposed. For simplicity, let m = 3. [98] proposed to solve the first subproblem first and then the second and the third in a parallel manner; [99] presented a partially parallel splitting method and adopted a relaxation step with low computational cost, in which the variables were updated in a certain order. [91] proposed to correct only the last primal variable x 3 and the multiplier so as to keep the efficiency of the splitting method, and such an idea was exploited in [100] for image processing.

Solving Nonconvex Problems
Consider again the two-block model (1) with m = 2. When both components of the objective function are convex, the convergence of the ADMM is well understood, in terms of global convergence, sublinear convergence measured by iteration complexity, and linear convergence. When one or both of the component objective functions are nonconvex, the convergence analysis for the ADMM is much more challenging and is only partially understood. On the other hand, as stated in the introduction, applications often involve models (1) in which at least one component function is nonconvex. This inspires the recent interest in studying the convergence of the ADMM for the nonconvex model (1).

Convergence under KL Properties
A very important technique for proving convergence in nonconvex optimization relies on the assumption that the objective functions satisfy the Kurdyka-Łojasiewicz (KL) inequality. The importance of the KL inequality is due to the fact that many functions satisfy it. In particular, when the function belongs to certain functional classes, e.g., semi-algebraic functions (such as ‖·‖ p p with rational p ∈ [0, 1]), real subanalytic functions, or log-exp functions (see [101][102][103][104] and references therein), it is often elementary to check that the inequality holds. The inequality was established in the pioneering and fundamental works [22,23], and it was recently extended to nonsmooth functions in [103,104].
Definition 5 Let f : R n → R ∪ {+∞} be a proper lower semicontinuous function.
(i) The Fréchet subdifferential, or regular subdifferential, of f at x ∈ dom f , written as ∂̂ f (x), is the set of vectors x * ∈ R n that satisfy (ii) The limiting subdifferential, or simply the subdifferential, of f at x ∈ dom f , written as ∂ f (x), is defined as follows: Using the Kurdyka-Łojasiewicz property, convergence of several optimization methods has been established; some recent references are gradient-related methods [106], proximal point algorithms [107][108][109], the nonsmooth subgradient descent method [110], etc. (see also [111,112]). In the field of ADMM studies, relying on the assumption that the Kurdyka-Łojasiewicz inequality holds for certain functions, [113,114] proved the convergence of variant ADMMs for (15) with some special structures. [113] considered a special case of problem (15) with A 2 = −I and b = 0, i.e., the problem and proposed a variant of the ADMM obtained by regularizing the second subproblem. The recursion is where the function φ is strictly convex and D φ is the Bregman distance with respect to φ, When the Bregman term vanishes, the algorithm (73) reduces to the classical ADMM (17). Assuming that the penalty parameter is chosen sufficiently large and that one of the component functions, θ 1 , is twice continuously differentiable with uniformly bounded Hessian, Li and Pong proved that if the sequence generated by the algorithm (73) has a cluster point, then it gives a stationary point of the nonconvex problem (72), i.e., a point ( Reference [114] proposed to regularize both subproblems, resulting in the following variant of the ADMM Under the assumptions that A 1 has full row rank, that either L β (x 1 , x 2 , λ) with respect to x 1 or φ is μ 1 -strongly convex with μ 1 > 0, and some other mild conditions, they proved that the sequence generated by (74) converges to a stationary point. Reference [115] considered solving the model (72) with the classical ADMM (17), under the following assumption: Assumption 1 Consider the classical ADMM (17) for solving the model (72).
Suppose that θ 1 : R n → R ∪ {∞} is a proper lower semicontinuous function, that θ 2 : R m → R is continuously differentiable with ∇θ 2 Lipschitz continuous with modulus L > 0, and that the augmented Lagrangian function associated with the model (72) is a KL function. Moreover, suppose that the following two conditions on the parameters hold: Note that the conditions in Assumption 1 are much weaker than those in [113,114].
Measuring the iterates {w k = (x k 1 , x k 2 , λ k )} by the augmented Lagrangian function L β associated with the model (72), it was first proved that Note that if we can establish that then we obtain the 'sufficient' decrease of the iterates {w k } measured by L β . Thanks to the Lipschitz continuity of ∇θ 2 and the optimality condition for the x 2 -subproblem, this can be achieved by a judicious choice of the penalty parameter ensuring that Then, with the aid of the KL property, the convergence and the rate of convergence of the ADMM can be established; we summarize them in the following two theorems.

Theorem 6
Suppose that Assumption 1 holds. Let {w k = (x k , y k , λ k )} be the sequence generated by the ADMM procedure (17), which is assumed to be bounded. Suppose that θ 1 and θ 2 are semi-algebraic functions. Then {w k } has finite length, that is and, as a consequence, {w k } converges to a critical point of L β (·).
Moreover, the concavity of ϕ means A simple further argument establishes (76), and the reader is referred to [113][114][115] for more details.
Convergence of the ADMM for nonconvex optimization problems without using the KL property, but with similar properties, has also been established recently, e.g.:
• Wang et al. [117] proposed the concept of restricted prox-regularity (see Definition 2 in [117]): for a lower semicontinuous function f , let M ∈ R + , f : R n → R ∪ {∞}, and let Then, f is called restricted prox-regular if, for any M > 0 and bounded set T ⊆ dom( f ), there exists γ > 0 such that Under the conditions that the objective function is restricted prox-regular and coercive on the feasible set, it was proved that the generated sequence has a cluster point which is a solution of (72).
• Jia et al. [118] proved that if a certain error bound condition like (58) holds, the sequence generated by the ADMM for solving the nonconvex optimization problem (72) converges locally at a linear rate.
• Jiang et al. [119] considered a proximal ADMM for solving linearly constrained nonconvex optimization models. Under the assumption that all block variables except the last are updated with the proximal ADMM, while the last block is updated either with a gradient step or a majorization-minimization step, they showed that the iterative sequence converges to a stationary point.
• Zhang et al. [120] proposed a proximal ADMM for linearly constrained nonconvex differentiable optimization problems, in which the authors introduced a 'smoothed' sequence of primal iterates and added an extra quadratic term to the augmented Lagrangian function. Under additional conditions such as strict complementarity and Lipschitz continuity of the gradient of the objective function, they proved that the iterative sequence converges to a stationary point and established its rate of convergence.
Sufficient conditions guaranteeing the boundedness of the generated iterate sequence {w k } are also available; see Lemma 3.5 in [115].
In [121], the authors considered the general separable optimization problem with linear equality constraints (1) where one or more components are nonconvex, i.e., where θ 1 : R n 1 → R ∪ {∞} is a proper lower semicontinuous function, θ i : R n i → R, i = 2, · · · , m − 1, and θ m : R s → R are continuously differentiable functions with ∇θ i Lipschitz continuous with modulus L i > 0, A i ∈ R s×n i , i = 1, · · · , m − 1, are given matrices, and b ∈ R s is a given vector. Set L := max 2≤i≤m L i and assume the following two conditions hold: Then, results similar to those for the two-block case can be established.
Guo et al. [122] then gave another extension of the result in [115], where they considered the nonconvex optimization problem where θ 1 : R n 1 → R ∪ {∞} is a proper lower semicontinuous function, θ 2 : R n 2 → R is a continuously differentiable function whose gradient ∇θ 2 is Lipschitz continuous with constant L 2 > 0, H : R n 1 × R n 2 → R is a smooth function, A 1 ∈ R n 2 ×n 1 is a given matrix, and b ∈ R n 2 is a given vector. Let β > 0 be a given parameter; the augmented Lagrangian function for problem (77) is where λ is the Lagrangian multiplier associated with the linear constraints. Based on alternately optimizing the augmented Lagrangian function L β (·) over one variable with the others fixed, the alternating direction method of multipliers generates the iterative sequence via the following recursion: Assume that θ 1 : R n → R ∪ {∞} is a weakly convex function with constant ω > 0 and assume the following items hold: (i) All the component functions are bounded below, i.e., inf (x,y)∈R n 1 ×R n 2 (ii) For any fixed x 1 , the partial gradient For any fixed x 2 , the partial gradient (iii) There exist L 2 , L 3 > 0 such that (iv) ∇ H is Lipschitz continuous on bounded subsets of R n 1 × R n 2 . In other words, for each bounded subset (v) A ᵀ A ⪰ μI for some μ > 0; (vi) The penalty parameter β satisfies β > max {β 1 , β 2 }, where and The convergence result is as follows: and, as a consequence, {w k } converges to a critical point of L β (·).
Remark 2 Note that most of the algorithms mentioned above cannot solve the general nonconvex constrained optimization problem (1), even for m = 2, i.e., (15). They rely on the assumption that there exists only one nonsmooth component, while the others are smooth with Lipschitz continuous gradients. Another common assumption is that A 1 has full row rank or Image(A 2 ) ⊆ Image(A 1 ), which is not satisfied in many practical problems. Therefore, the convergence of the ADMM for many nonconvex problems is still unknown.

Convergence with Special Structures
Studies on special classes of nonconvex optimization problems have been performed recently; e.g., [123] proved the convergence of ADMM for consensus problems. In this subsection, we focus on problems having the following two-block structure: min_{x ∈ R^n} f(x) + g(x). (79) Recall that ADMM is the application of the Douglas-Rachford splitting method (DRSM) to the dual of (15), which has the form of (79), and its convergence has been well studied [30] when both component functions are convex. When one component function is nonconvex, the study is in its infancy. Recently, the convergence of the DRSM for solving (79) was established in [124-126] under the assumption that f : R^n → R ∪ {∞} and g : R^n → R ∪ {∞} are proper and lower semicontinuous, f is strongly convex with constant μ > 0, and g is semiconvex with constant ω > 0. The model (79) arises frequently in the big data and artificial intelligence fields, where f(x) represents a data-fidelity term and g(x) a sparsity-driven penalty term. Weak convexity can often reduce the bias in nonzero estimates (a serious problem for various convex penalty terms); see, e.g., [127-129] for applications in sparse signal recovery. In particular, it was proved that several popular penalty functions are weakly convex [130].
1. The smoothly clipped absolute deviation (SCAD) penalty [131]: for a > 2 and λ > 0, the scalar SCAD function is g_λ(θ) = λ|θ| for |θ| ≤ λ; g_λ(θ) = (2aλ|θ| − θ² − λ²)/(2(a − 1)) for λ < |θ| ≤ aλ; and g_λ(θ) = (a + 1)λ²/2 for |θ| > aλ. The SCAD penalty is then given by Σ_{i=1}^n g_λ(|x_i|). 2. The minimax concave penalty (MCP): with γ > 0 and λ > 0, for x ∈ R^n the MCP is given by P_γ(x) = Σ_{i=1}^n p_γ(|x_i|), where p_γ(t) = λt − t²/(2γ) for 0 ≤ t ≤ γλ and p_γ(t) = γλ²/2 for t > γλ. (81) It was proved that P_γ defined by (81) is weakly convex. The classical convergence results [32,68] were built on the Krasnosel'skiĭ-Mann theorem [88, Thm. 5.14], which states that if D is a nonempty closed convex subset of R^n and T : D → D is a nonexpansive operator such that Fix(T) ≠ ∅, then, starting with z^0 ∈ D, the iterative scheme z^{k+1} := (1 − α)z^k + αT(z^k) for 0 < α < 1 generates {z^k} converging to a point in Fix(T). Recall that the DRSM (24) for solving (79) takes the form z^{k+1} := (1 − α)z^k + α R_{μf}(R_{μg}(z^k)) (82) ((24) is the special case with α = 1/2), where R_{μf} := 2prox_{μf} − I and R_{μg} := 2prox_{μg} − I are the reflection operators of f and g, and prox_{μf} and prox_{μg} are their proximal operators, respectively, where for a proper lower semicontinuous function f and a scalar μ > 0, prox_{μf}(x) := argmin_y { f(y) + ‖y − x‖²/(2μ) }. When f is convex, it follows from [88, Prop. 12.27] that prox_{μf} is a single-valued mapping which is also firmly nonexpansive 3. As a consequence, the reflection operator R_{μf} := 2prox_{μf} − I is nonexpansive [88, Prop. 4.2]. Then, according to [88, Prop. 4.21], if both f and g are convex, R_{μf}R_{μg} is nonexpansive. The convergence of (82) is then a direct consequence of the Krasnosel'skiĭ-Mann theorem.
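To make the two penalties concrete, the following Python sketch implements the standard scalar SCAD and MCP functions in their usual piecewise forms; the particular parameter values in the check at the end are arbitrary illustrations.

```python
import numpy as np

def scad(t, lam, a):
    """Scalar SCAD penalty g_lam(|t|), with a > 2 and lam > 0."""
    t = np.abs(t)
    return np.where(
        t <= lam,
        lam * t,                                                # linear near the origin
        np.where(
            t <= a * lam,
            (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1)),  # quadratic middle piece
            (a + 1) * lam**2 / 2,                               # constant tail
        ),
    )

def mcp(t, lam, gamma):
    """Scalar minimax concave penalty p_gamma(|t|), with gamma, lam > 0."""
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t**2 / (2 * gamma),  # concave quadratic piece
                    gamma * lam**2 / 2)            # constant tail

# Both penalties agree with lam*|t| near zero and flatten out for large |t|,
# which is what reduces estimation bias relative to the l1 penalty.
lam, a, gamma = 1.0, 3.7, 2.0
vals_scad = scad(np.array([0.0, lam, a * lam, 10.0]), lam, a)
vals_mcp = mcp(np.array([0.0, gamma * lam, 10.0]), lam, gamma)
```

A quick continuity check at the breakpoints (|θ| = λ and |θ| = aλ for SCAD, t = γλ for MCP) confirms the pieces glue together as expected.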
If either f or g is nonconvex, the situation changes completely: the above argument fails because the reflection operator of a nonconvex mapping may be expansive. To ensure the nonexpansiveness of R_{μf}R_{μg}, so that the Krasnosel'skiĭ-Mann theorem still applies, one needs further conditions on the data.
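The Douglas-Rachford scheme itself is easy to state in code. Below is a minimal Python sketch of the iteration with α = 1/2 for the convex toy instance f(x) = (1/2)‖x − b‖² and g(x) = λ‖x‖_1, where both proximal operators are available in closed form; the data and step size are illustrative.

```python
import numpy as np

def prox_f(x, b, mu):
    """prox_{mu*f} with f(y) = 0.5*||y - b||^2."""
    return (x + mu * b) / (1.0 + mu)

def prox_g(x, lam, mu):
    """prox_{mu*g} with g(y) = lam*||y||_1 (soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - mu * lam, 0.0)

def drsm(b, lam, mu=1.0, iters=300):
    """z^{k+1} = (1/2) z^k + (1/2) R_{mu f}(R_{mu g}(z^k)),
    with reflections R = 2*prox - I; the minimizer is prox_{mu g}(z*)."""
    z = np.zeros_like(b)
    for _ in range(iters):
        p = prox_g(z, lam, mu)
        rg = 2.0 * p - z                     # R_{mu g}(z)
        rf = 2.0 * prox_f(rg, b, mu) - rg    # R_{mu f}(R_{mu g}(z))
        z = 0.5 * z + 0.5 * rf
    return prox_g(z, lam, mu)

b = np.array([3.0, 0.2, -1.5])
x = drsm(b, lam=1.0)
# Minimizer of 0.5*||x - b||^2 + ||x||_1 is soft-thresholding of b: [2, 0, -0.5].
```

Note that the governing sequence is {z^k}; the approximate solution is read off through prox_{μg}.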
In [124], the convergence of the DRSM was established under the further assumptions that f is strongly convex with constant ν > 0, g is semiconvex with constant ω > 0, and ν = ω. Moreover, f is assumed to be twice differentiable with ∇f being Lipschitz continuous with constant σ > 0. One may argue that, since the whole model is convex, we can shift the quadratic term between the two components (e.g., f̃ := f − (ω/2)‖·‖² and g̃ := g + (ω/2)‖·‖², both convex), obtaining the equivalent convex model (83), and the DRSM (82) is then transformed into (84), whose convergence is well understood. Unfortunately, although model (83) is theoretically equivalent to (79), the numerical behavior of the DRSM applied to them differs; results on a simple example reported in [124] indicated that in some cases one prefers the original scheme (82) to the reformulation (84).
In [133], the DRSM is applied to a feasibility reformulation of the minimization of the sum of a proper lower semicontinuous function (not necessarily convex) and a weakly convex function whose gradient is required to be Lipschitz continuous, and both functions are required to satisfy the Kurdyka-Lojasiewicz inequality. It should be mentioned that strong convexity implies the Kurdyka-Lojasiewicz inequality, see [101,116], while there is no differentiability requirement on g in [124].
The differentiability of f required in [124] was removed in [125]; instead, it was assumed that f is strongly convex with constant ν, g is semiconvex with constant ω, and ν > ω. The relationship ν > ω means that the whole objective function in (79) is strongly convex and thus satisfies the Kurdyka-Lojasiewicz inequality. However, the assumption on g is weaker than that of [133] because no differentiability is imposed on g. The relevance of this setting is illustrated by the joint image denoising and sharpening problem in [134], whose optimization model is min_{x ∈ R^n} (1/2)‖x − b‖²_2 + ‖∇x‖_{2,1} − (ν/2)‖∇x‖²_{2,2} + ι_{[0,1]}(x). (85) In (85), x ∈ R^n is the vector representation of a digital image to be recovered and b ∈ R^n is an observed image; ∇ := (∇_1, ∇_2) : R^n → R^n × R^n denotes a discrete gradient operator, where ∇_1 : R^n → R^n and ∇_2 : R^n → R^n are the standard finite differences with periodic boundary conditions in the horizontal and vertical directions, respectively; ‖∇x‖_{2,1} is the total variation regularization term [7] that preserves sharp edges; −(ν/2)‖∇x‖²_{2,2} is a sharpening/edge-enhancement term aimed at removing a blur, assuming the blur follows a diffusion process; and ι_{[0,1]}(x) := 0 if 0 ≤ x ≤ 1, and ∞ otherwise.
The main results of [125] are based on another important result, the Fejér monotone convergence theorem [88, Thm. 5.5], which states that, in R^n, if a sequence {x^k} is Fejér monotone with respect to a nonempty subset D and every cluster point of {x^k} belongs to D, then {x^k} converges to a point in D. To prove the Fejér monotonicity of the sequence generated by (82), it was first established in [125] that for a proper lower semicontinuous strongly convex function f with constant ν > 0 and for any x, y ∈ R^n and μ > 0, ‖R_{μf}(x) − R_{μf}(y)‖² ≤ ‖x − y‖² − 4νμ‖prox_{μf}(x) − prox_{μf}(y)‖². At the same time, for a proper lower semicontinuous function g : R^n → R ∪ {∞} which is weakly convex with constant ω > 0, we have that for any x, y ∈ R^n and 0 < μ < 1/ω, ‖R_{μg}(x) − R_{μg}(y)‖² ≤ ‖x − y‖² + 4ωμ‖prox_{μg}(x) − prox_{μg}(y)‖².
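The weakly convex reflection inequality above can be sanity-checked numerically. The sketch below uses the 1-D weakly convex function g(x) = |x| − (ω/2)x² (so g + (ω/2)(·)² is convex, and for 0 < μ < 1/ω its proximal operator is soft-thresholding scaled by 1/(1 − μω)); the choice of g, ω, and μ is illustrative, not taken from [125].

```python
import numpy as np

omega, mu = 0.5, 1.0  # weak-convexity modulus and prox parameter, with mu < 1/omega

def prox_g(v):
    """prox_{mu*g} for g(x) = |x| - (omega/2)*x^2 (closed form when mu < 1/omega)."""
    return np.sign(v) * np.maximum(np.abs(v) - mu, 0.0) / (1.0 - mu * omega)

def refl_g(v):
    """Reflection operator R_{mu g} = 2*prox_{mu g} - I."""
    return 2.0 * prox_g(v) - v

# Check ||R(x) - R(y)||^2 <= ||x - y||^2 + 4*omega*mu*||prox(x) - prox(y)||^2
rng = np.random.default_rng(0)
violations = 0
for _ in range(1000):
    x, y = rng.uniform(-5.0, 5.0, size=2)
    lhs = (refl_g(x) - refl_g(y)) ** 2
    rhs = (x - y) ** 2 + 4.0 * omega * mu * (prox_g(x) - prox_g(y)) ** 2
    if lhs > rhs + 1e-9:
        violations += 1
```

On quadratic pieces the inequality holds with equality, so the bound is tight; the random sampling simply confirms it is never violated for this instance.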
Besides the global convergence of the DRSM (82) for the 'strongly + weakly' convex model (79), [125] also establishes sublinear and linear rates of convergence under additional mild conditions, which we summarize below. Assume that e(·, μ) := (I − R_{μf}R_{μg})(·) is metrically subregular at z* for 0, with a neighborhood N(z*) of z* and modulus κ > 0. Take r > 0 sufficiently small such that B(z*, r) ⊆ N(z*). Then, for any starting point z^0 ∈ B(z*, r) and any k ≥ 0, the distance from z^k to the fixed-point set contracts by a constant factor per iteration; that is, the DRSM (82) converges to Fix(T_{DR}) linearly.
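The linear rate is easy to observe on a toy instance where f is strongly convex and g is piecewise linear. In the Python sketch below, assuming f(x) = (1/2)(x − 3)² and g(x) = |x| in one dimension with μ = 1 (an illustrative instance, not one from [125]), the Douglas-Rachford map has fixed point z* = 3, and the distance to it contracts by exactly 1/2 per step once the iterate stays in the region z > 1.

```python
def prox_f(x, mu=1.0):
    """prox of mu*f with f(x) = 0.5*(x - 3)^2."""
    return (x + 3.0 * mu) / (1.0 + mu)

def prox_g(x, mu=1.0):
    """prox of mu*g with g(x) = |x| (scalar soft-thresholding)."""
    return max(abs(x) - mu, 0.0) * (1.0 if x >= 0 else -1.0)

def dr_map(z):
    """T_DR(z) = (1/2) z + (1/2) R_{mu f}(R_{mu g}(z))."""
    rg = 2.0 * prox_g(z) - z
    rf = 2.0 * prox_f(rg) - rg
    return 0.5 * z + 0.5 * rf

z, z_star = 10.0, 3.0
ratios = []
for _ in range(30):
    z_next = dr_map(z)
    ratios.append(abs(z_next - z_star) / abs(z - z_star))
    z = z_next
# Each ratio equals 0.5: geometric (linear) convergence of the DR iterates.
```

Measuring the per-step contraction ratio like this is a simple empirical proxy for the modulus appearing in the metric subregularity bound.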
[126] then further weakened the requirement ν > ω of [125] to ν = ω; thus, the objective function is only convex, not strongly convex. The cost is that f is additionally assumed to be continuously differentiable with ∇f Lipschitz continuous with constant L > 0. Nonetheless, the assumption in [126] is still weaker than that in [124]. Besides, [126] also allows the reflection (proximal) operators to be evaluated approximately, which is important from a numerical point of view.

Conclusions
In this paper, we gave a survey of a popular numerical algorithm, the alternating direction method of multipliers, for solving large-scale structured optimization problems, i.e., optimization problems whose objective function is the sum of several component functions, each depending only on its own variable, with all variables coupled through a linear equality constraint. The classical ADMM addresses the special case with only two blocks of variables; it dates back to the Douglas-Rachford splitting method, and interest in it rose again in the last decade, along with the need to solve the general multi-block case. Though there has been explosive growth in the results, as we have surveyed, some tasks remain.
1. Improvements of the complexity. Several sublinear rates of convergence of ADMM, measured by certain merit functions, were established to be O(1/k). The existing works on deriving a worst-case O(1/k^2) convergence rate for ADMM either require stronger assumptions or apply only to some variants of the original ADMM scheme [135,136].
2. Linear rate of convergence under weaker conditions. As with works on the O(1/k^2) convergence rate for ADMM, works on the linear convergence rate of ADMM also depend on additional assumptions that are satisfied only by special cases such as linear programming and piecewise linear-quadratic programming.
3. More efficient results for the multi-block case. Currently, there are many works on solving the general case (1) with m ≥ 3. However, due to its complications, there is still great demand for further development of efficient 'customized' methods.
4. More results for the nonconvex case. The development of ADMM for solving (1) with nonconvex component functions is still in its infancy, and the current works are very limited. For example, the result in [115] is for the special case in which θ_2 is smooth and A_2 = −I, while in applications θ_2 may be nonsmooth and A_2 need not even be a square matrix.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.