On the Global and Linear Convergence of the Generalized Alternating Direction Method of Multipliers


Abstract

The formulation

$$\begin{aligned} \min _{x,y} ~f(x)+g(y),\quad \text{ subject } \text{ to } Ax+By=b, \end{aligned}$$

where f and g are extended-value convex functions, arises in many application areas, such as signal processing, imaging and image processing, statistics, and machine learning, either naturally or after variable splitting. In many common problems, one of the two objective functions is strictly convex and has a Lipschitz continuous gradient. For this kind of problem, a very effective approach is the alternating direction method of multipliers (ADM or ADMM), which solves a sequence of f/g-decoupled subproblems. However, its effectiveness has not been matched by a provably fast rate of convergence; only sublinear rates such as O(1 / k) and \(O(1/k^2)\) were recently established in the literature, though the O(1 / k) rates do not require strong convexity. This paper shows that global linear convergence can be guaranteed under the assumptions of strong convexity and Lipschitz gradient on one of the two functions, along with certain rank assumptions on A and B. The result applies to various generalizations of ADM that allow the subproblems to be solved faster and less exactly in certain manners. The derived rate of convergence also provides some theoretical guidance for optimizing the ADM parameters. In addition, this paper makes meaningful extensions to the existing global convergence theory of ADM generalizations.
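
To fix ideas, the classic ADMM iteration for this formulation can be sketched as follows (a standard form, stated here in the multiplier sign convention used in the appendices; the generalized variants analyzed in this paper further introduce a relaxation factor \(\gamma \) and proximal terms \(P\) and \(Q\)):

$$\begin{aligned} x^{k+1}&=\mathop {{\mathrm{argmin}}}_{x} ~f(x)+\frac{\beta }{2}\left\| Ax+By^k-b-{\uplambda }^k/\beta \right\| ^2,\nonumber \\ y^{k+1}&=\mathop {{\mathrm{argmin}}}_{y} ~g(y)+\frac{\beta }{2}\left\| Ax^{k+1}+By-b-{\uplambda }^k/\beta \right\| ^2,\nonumber \\ {\uplambda }^{k+1}&={\uplambda }^k-\beta \left( Ax^{k+1}+By^{k+1}-b\right) , \end{aligned}$$

where \(\beta >0\) is a penalty parameter.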




Notes

  1.

    The results continue to hold in many cases when strong convexity is relaxed to strict convexity (e.g., \(-\log (x)\) is strictly convex but not strongly convex over \(x>0\)). ADM always generates a bounded sequence \(\{(Ax^k, By^k, {\uplambda }^k)\}\), where the bound depends only on the starting point and the solution, even when the feasible set is unbounded. When restricted to a compact set, a strictly convex function is strongly convex.

  2.

    Suppose a sequence \(\{u^k\}\) converges to \(u^*\). We say the convergence is (in some norm \(\Vert \cdot \Vert \))

    • Q-linear, if there exists \(\mu \in (0,1)\) such that \(\frac{\Vert u^{k+1}-u^*\Vert }{\Vert u^{k}-u^*\Vert }\le \mu \) for all \(k\);

    • R-linear, if there exists a sequence \(\{\sigma ^k\}\) such that \(\Vert u^{k}-u^*\Vert \le \sigma ^k\) for all \(k\) and \(\sigma ^k\rightarrow 0\) Q-linearly; an example separating the two notions follows.
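
    For instance, \(\Vert u^k-u^*\Vert =2^{-k}\) converges Q-linearly with \(\mu =\frac{1}{2}\), whereas

    $$\begin{aligned} \Vert u^k-u^*\Vert = \left\{ \begin{array}{ll} 2^{-k}, &{} k \text{ even},\\ 2^{-(k-1)}, &{} k \text{ odd}, \end{array} \right. \end{aligned}$$

    converges only R-linearly (take \(\sigma ^k=2^{-(k-1)}\)): the ratio \(\Vert u^{k+1}-u^*\Vert /\Vert u^{k}-u^*\Vert \) equals 1 for every even \(k\), so no single \(\mu \in (0,1)\) works.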




Acknowledgements

The authors’ work is supported in part by ARL MURI Grant W911NF-09-1-0383 and NSF Grant DMS-1317602.

Author information



Corresponding author

Correspondence to Wotao Yin.


Appendix 1: Proof of Lemma 2.1


  i)

    The optimality conditions of the subproblems of Algorithm 2 yield (2.6) and (2.7).

  ii)

    By the convexity of f (1.16), combining the optimality conditions (1.13) and (2.6) yields (2.8). Similarly, (2.9) follows from the convexity of g (1.17) and the optimality conditions (1.14) and (2.7).

  iii)

    Equation (2.10) follows directly from (1.15) and (2.1).

  iv)

    By adding (2.8) and (2.9) and using (2.10), we have

    $$\begin{aligned}&\frac{1}{\beta }\langle {\uplambda }^{k}-\hat{{\uplambda }},~\hat{{\uplambda }}-{\uplambda }^*-\beta A(x^k-x^{k+1})\rangle + \langle x^{k+1}-x^*, ~\hat{P}(x^k-x^{k+1})\rangle \nonumber \\&\qquad + \langle y^{k+1}-y^*, ~Q(y^k-y^{k+1})\rangle \nonumber \\&\quad \ge \nu _f\Vert x^{k+1}-x^{*}\Vert ^2+\nu _g\Vert y^{k+1}-y^*\Vert ^2, \end{aligned}$$

    which can be simplified to

    $$\begin{aligned} (\hat{u}-u^*)^TG_1(u^k-\hat{u})\ge & {} \langle A(x^k-x^{k+1}),~{\uplambda }^k-\hat{{\uplambda }}\rangle \nonumber \\&+\,\nu _f\Vert x^{k+1}-x^{*}\Vert ^2+\nu _g\Vert y^{k+1}-y^*\Vert ^2. \end{aligned}$$

    By rearranging the terms, we have

    $$\begin{aligned} (u^k-u^*)^TG_1(u^k-\hat{u})\ge & {} \Vert u^k-\hat{u}\Vert _{G_1}^2+\langle A(x^k-x^{k+1}),{\uplambda }^k-\hat{{\uplambda }}\rangle \nonumber \\&+\,\nu _f\Vert x^{k+1}-x^{*}\Vert ^2+\nu _g\Vert y^{k+1}-y^*\Vert ^2. \end{aligned}$$

    Then, from the identity \(\Vert a-c\Vert _G^2-\Vert b-c\Vert _G^2=2(a-c)^TG(a-b)-\Vert a-b\Vert _G^2\) (verified at the end of this appendix) and (2.5), it follows that

    $$\begin{aligned} \Vert u^k-u^*\Vert _G^2-\Vert u^{k+1}-u^*\Vert _G^2=2(u^k-u^*)^TG_1(u^k-\hat{u})-\Vert G_0(u^k-\hat{u})\Vert ^2_G. \end{aligned}$$

    Substituting (7.3) into the right-hand side of the above equation, we have

    $$\begin{aligned} \Vert u^k-u^*\Vert _G^2-\Vert u^{k+1}-u^*\Vert _G^2\ge & {} 2\Vert u^k-\hat{u}\Vert _{G_1}^2-\Vert u^k-\hat{u}\Vert ^2_{G_1G_0}\nonumber \\&+\,2\langle A(x^k-x^{k+1}),{\uplambda }^k-\hat{{\uplambda }}\rangle \nonumber \\&+\,2\nu _f\Vert x^{k+1}-x^{*}\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2, \end{aligned}$$

    and thus (2.11) follows immediately.
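
    For completeness, the identity invoked above (for symmetric \(G\)) follows by direct expansion:

    $$\begin{aligned} 2(a-c)^TG(a-b)-\Vert a-b\Vert _G^2&=2\left[ (a-b)+(b-c)\right] ^TG(a-b)-\Vert a-b\Vert _G^2\nonumber \\&=\Vert a-b\Vert _G^2+2(b-c)^TG(a-b)\nonumber \\&=\Vert (a-b)+(b-c)\Vert _G^2-\Vert b-c\Vert _G^2\nonumber \\&=\Vert a-c\Vert _G^2-\Vert b-c\Vert _G^2. \end{aligned}$$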

Appendix 2: Proof of Theorem 2.1


  i)

    By the Cauchy–Schwarz inequality, we have

    $$\begin{aligned} 2({\uplambda }^k-\hat{{\uplambda }})^TA(x^k-x^{k+1})\ge -\frac{1}{\rho }\Vert A(x^k-x^{k+1})\Vert ^2-\rho \Vert {\uplambda }^k-\hat{{\uplambda }}\Vert ^2,~\forall \rho >0. \end{aligned}$$

    Substituting (8.1) into (2.11) and using \(\frac{1}{\gamma }({\uplambda }^k-{\uplambda }^{k+1})={\uplambda }^k-\hat{{\uplambda }}\), we have

    $$\begin{aligned}&\Vert u^k-u^*\Vert _G^2-\Vert u^{k+1}-u^*\Vert _G^2\nonumber \\&\quad \ge ~\Vert x^k-x^{k+1}\Vert _{\hat{P}-\frac{1}{\rho }A^TA}^2+\Vert y^k-y^{k+1}\Vert _Q^2+\left( \frac{2-\gamma }{\beta }-\rho \right) \frac{1}{\gamma ^2}\Vert {\uplambda }^k-{\uplambda }^{k+1}\Vert ^2\nonumber \\&\qquad +\,2\nu _f\Vert x^{k+1}-x^{*}\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2,\quad \forall \rho >0. \end{aligned}$$

    To show that (2.14) holds for some \(\eta >0\), we only need \(\hat{P}-\frac{1}{\rho }A^TA\succ 0\) and \(\frac{2-\gamma }{\beta }-\rho >0\) for some \(\rho >0\), which is true if and only if \(\hat{P}\succ \frac{\beta }{2-\gamma }A^T A\) or, equivalently, (2.13).

  ii)

    For \(P=\mathbf {0}\), we first derive a lower bound for the cross term \(({\uplambda }^k-\hat{{\uplambda }})^TA(x^k-x^{k+1})\). Applying (2.6) at two consecutive iterations with \(P=\mathbf {0}\) and in light of the definition of \(\hat{{\uplambda }}\), we have

    $$\begin{aligned} \left\{ \begin{array}{l} A^T[{\uplambda }^{k-1}-\beta (Ax^{k}+By^{k}-b)] \in \partial f(x^{k}),\\ A^T\hat{{\uplambda }} \in \partial f(x^{k+1}). \end{array} \right. \end{aligned}$$

    The difference of the two terms on the left in (8.3) is

    $$\begin{aligned}&A^T\left[ {\uplambda }^{k-1}-\beta (Ax^{k}+By^{k}-b)-\hat{{\uplambda }}\right] \nonumber \\&\quad =A^T\left( {\uplambda }^k-\hat{{\uplambda }}\right) -(1-\gamma )\beta A^T\left( Ax^{k}+By^{k}-b\right) . \end{aligned}$$

    By (8.3), (8.4) and (1.16), we get

    $$\begin{aligned}&\langle A^T({\uplambda }^k-\hat{{\uplambda }}),x^k-x^{k+1}\rangle - \langle (1-\gamma )\beta A^T(Ax^{k}+By^{k}-b),x^k-x^{k+1}\rangle \nonumber \\&\quad \ge \nu _f\Vert x^k-x^{k+1}\Vert ^2, \end{aligned}$$

    to which applying the Cauchy–Schwarz inequality gives

    $$\begin{aligned}&({\uplambda }^k-\hat{{\uplambda }})^T A(x^k-x^{k+1})\nonumber \\&\quad \ge \langle \sqrt{\beta } (Ax^{k}+By^{k}-b), (1-\gamma )\sqrt{\beta }A(x^k-x^{k+1})\rangle +\nu _f\Vert x^k-x^{k+1}\Vert ^2\nonumber \\&\quad \ge -\frac{\beta }{2\rho }\Vert Ax^{k}+By^{k}-b\Vert ^2-\frac{(1-\gamma )^2\beta \rho }{2}\Vert A(x^k-x^{k+1})\Vert ^2\nonumber \\&\qquad +\nu _f\Vert x^k-x^{k+1}\Vert ^2,\quad \forall \rho >0. \end{aligned}$$

    Substituting (8.6) into (2.11) and using \(\hat{P}=P+\beta A^T A=\beta A^T A\) and the definition of \(\hat{{\uplambda }}\), we have

    $$\begin{aligned}&\Vert u^k-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert Ax^{k}+By^{k}-b\Vert ^2 \nonumber \\&\quad \ge ~\Vert u^{k+1}-u^*\Vert _G^2+\frac{\beta }{\rho }\Vert Ax^{k+1}+By^{k+1}-b\Vert ^2\nonumber \\&\qquad +\,\beta \left( 2-\gamma -\frac{1}{\rho }\right) \Vert Ax^{k+1} +By^{k+1}-b\Vert ^2 +\beta \left( 1-(1-\gamma )^2\rho \right) \Vert A(x^k-x^{k+1})\Vert ^2\nonumber \\&\qquad +\,\Vert y^k-y^{k+1}\Vert ^2_Q+2\nu _f\Vert x^k-x^{k+1}\Vert ^2+2\nu _f\Vert x^{k+1}-x^*\Vert ^2+2\nu _g\Vert y^{k+1}-y^*\Vert ^2. \end{aligned}$$

    To prove that such an \(\eta >0\) exists for (2.16), we only need the existence of \(\rho >0\) such that \(2-\gamma -\frac{1}{\rho }>0\) and \(1-(1-\gamma )^2\rho >0\), which holds if and only if \(2-\gamma >(1-\gamma )^2\) or, equivalently, \(\gamma \in (0,\frac{1+\sqrt{5}}{2})\); see the computation following this proof. In the case of \(P=\mathbf {0}\), if we set \(\gamma =1\), (8.6) reduces to \(({\uplambda }^k-\hat{{\uplambda }})^T A(x^k-x^{k+1})\ge \nu _f\Vert x^k-x^{k+1}\Vert ^2\), which, when substituted into (2.11), gives (2.17) with \(\eta = 1\).
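
    For completeness, the last equivalence is elementary: a \(\rho >0\) with \(2-\gamma -\frac{1}{\rho }>0\) and \(1-(1-\gamma )^2\rho >0\) exists if and only if \((1-\gamma )^2<2-\gamma \), and

    $$\begin{aligned} 2-\gamma >(1-\gamma )^2 \iff \gamma ^2-\gamma -1<0 \iff \gamma \in \left( \frac{1-\sqrt{5}}{2},\ \frac{1+\sqrt{5}}{2}\right) . \end{aligned}$$

    Intersecting with \(\gamma >0\) gives \(\gamma \in (0,\frac{1+\sqrt{5}}{2})\); since \(\frac{1+\sqrt{5}}{2}<2\), the implicit restriction \(\gamma <2\) holds automatically on this interval.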


Cite this article

Deng, W., Yin, W. On the Global and Linear Convergence of the Generalized Alternating Direction Method of Multipliers. J Sci Comput 66, 889–916 (2016). https://doi.org/10.1007/s10915-015-0048-x



Keywords

  • Alternating direction method of multipliers
  • Global convergence
  • Linear convergence
  • Strong convexity
  • Distributed computing