Abstract
Recently, the alternating direction method of multipliers (ADMM) has received intensive attention from a broad spectrum of areas. The generalized ADMM (GADMM) proposed by Eckstein and Bertsekas is an efficient and simple acceleration scheme for ADMM. In this paper, we take a deeper look at the linearized version of GADMM, in which one of its subproblems is approximated by a linearization strategy. This linearized version is particularly efficient for a number of applications arising from different areas. Theoretically, we show a worst-case \({\mathcal {O}}(1/k)\) convergence rate measured by the iteration complexity (\(k\) represents the iteration counter) in both the ergodic sense and a nonergodic sense for the linearized version of GADMM. Numerically, we demonstrate the efficiency of this linearized version of GADMM on several recent and important applications in statistical learning. Code packages in Matlab for these applications are also developed.
References
Anderson, T.W.: An introduction to multivariate statistical analysis, 3rd edn. Wiley (2003)
Bertsekas, D.P.: Constrained optimization and Lagrange multiplier methods. Academic Press, New York (1982)
Bickel, P.J., Levina, E.: Some theory for Fisher’s linear discriminant function, naive Bayes’, and some alternatives when there are many more variables than observations. Bernoulli 10, 989–1010 (2004)
Blum, E., Oettli, W.: Mathematische Optimierung. Grundlagen und Verfahren. Ökonometrie und Unternehmensforschung. Springer, Berlin (1975)
Boley, D.: Local linear convergence of ADMM on quadratic or linear programs. SIAM J. Optim. 23(4), 2183–2207 (2013)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3, 1–122 (2011)
Cai, T.T., Liu, W.: A direct estimation approach to sparse linear discriminant analysis. J. Am. Stat. Assoc. 106, 1566–1577 (2011)
Cai, X., Gu, G., He, B., Yuan, X.: A proximal point algorithm revisit on alternating direction method of multipliers. Sci. China Math. 56(10), 2179–2186 (2013)
Candès, E.J., Tao, T.: The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). Ann. Stat. 35, 2313–2351 (2007)
Clemmensen, L., Hastie, T., Witten, D., Ersbøll, B.: Sparse discriminant analysis. Technometrics 53, 406–413 (2011)
Deng, W., Yin, W.: On the global and linear convergence of the generalized alternating direction method of multipliers. Manuscript (2012)
Eckstein, J.: Parallel alternating direction multiplier decomposition of convex programs. J. Optim. Theory Appl. 80(1), 39–62 (1994)
Eckstein, J., Yao, W.: Augmented Lagrangian and alternating direction methods for convex optimization: a tutorial and some illustrative computational results. RUTCOR Research Report RRR 32–2012 (2012)
Eckstein, J., Bertsekas, D.: On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Math. Program. 55, 293–318 (1992)
Fan, J., Fan, Y.: High dimensional classification using features annealed independence rules. Ann. Stat. 36, 2605–2637 (2008)
Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96, 1348–1360 (2001)
Fan, J., Feng, Y., Tong, X.: A road to classification in high dimensional space: the regularized optimal affine discriminant. J. R. Stat. Soc. Series B Stat. Methodol. 74, 745–771 (2012)
Fan, J., Zhang, J., Yu, K.: Vast portfolio selection with gross-exposure constraints. J. Am. Stat. Assoc. 107, 592–606 (2012)
Fazel, M., Hindi, H., Boyd, S.: A rank minimization heuristic with application to minimum order system approximation. In: Proceedings of the American Control Conference (2001)
Fortin, M., Glowinski, R.: Augmented Lagrangian methods: applications to the numerical solution of boundary-value problems. Stud. Math. Appl. 15. North-Holland, Amsterdam (1983)
Gabay, D.: Applications of the method of multipliers to variational inequalities. In: Fortin, M., Glowinski, R. (eds.) Augmented Lagrangian Methods: Applications to the Solution of Boundary-Value Problems, pp. 299–331. North-Holland, Amsterdam (1983)
Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Comput. Math. Appl. 2, 17–40 (1976)
Glowinski, R.: On alternating direction methods of multipliers: a historical perspective. Springer Proceedings of a Conference Dedicated to J. Periaux (to appear)
Glowinski, R., Marrocco, A.: Approximation par éléments finis d’ordre un et résolution par pénalisation-dualité d’une classe de problèmes non linéaires. R.A.I.R.O., R2, pp. 41–76 (1975)
Gol’shtein, E.G., Tret’yakov, N.V.: Modified Lagrangians in convex programming and their generalizations. Math. Program. Study 10, 86–97 (1979)
Grier, H.E., Krailo, M.D., Tarbell, N.J., Link, M.P., Fryer, C.J., Pritchard, D.J., Gebhardt, M.C., Dickman, P.S., Perlman, E.J., Meyers, P.A.: Addition of ifosfamide and etoposide to standard chemotherapy for Ewing’s sarcoma and primitive neuroectodermal tumor of bone. New Eng. J. Med. 348, 694–701 (2003)
Han, D., Yuan, X.: Local linear convergence of the alternating direction method of multipliers for quadratic programs. SIAM J. Numer. Anal. 51(6), 3446–3457 (2013)
Hans, C.P., Weisenburger, D.D., Greiner, T.C., Gascoyne, R.D., Delabie, J., Ott, G., Müller-Hermelink, H., Campo, E., Braziel, R., Elaine, S.: Confirmation of the molecular classification of diffuse large B-cell lymphoma by immunohistochemistry using a tissue microarray. Blood 103, 275–282 (2004)
He, B., Liao, L.-Z., Han, D.R., Yang, H.: A new inexact alternating directions method for monotone variational inequalities. Math. Program. 92, 103–118 (2002)
He, B., Yang, H.: Some convergence properties of a method of multipliers for linearly constrained monotone variational inequalities. Oper. Res. Lett. 23, 151–161 (1998)
He, B., Yuan, X.: On the \(O(1/n)\) convergence rate of Douglas-Rachford alternating direction method. SIAM J. Numer. Anal. 50, 700–709 (2012)
He, B., Yuan, X.: On non-ergodic convergence rate of Douglas–Rachford alternating direction method of multipliers. Numerische Mathematik (to appear)
He, B., Yuan, X.: On convergence rate of the Douglas–Rachford operator splitting method. Math. Program. (to appear)
Hestenes, M.R.: Multiplier and gradient methods. J. Optim. Theory Appl. 4, 302–320 (1969)
James, G.M., Paulson, C., Rusmevichientong, P.: The constrained LASSO. Manuscript (2012)
Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16, 964–979 (1979)
Martinet, B.: Régularisation d’inéquations variationnelles par approximations successives. Rev. Française d’Inform. Recherche Opér. 4, 154–159 (1970)
McCall, M.N., Bolstad, B.M., Irizarry, R.A.: Frozen robust multiarray analysis (fRMA). Biostatistics 11, 242–253 (2010)
Nemirovsky, A.S., Yudin, D.B.: Problem complexity and method efficiency in optimization. Wiley-Interscience series in discrete mathematics. Wiley, New York (1983)
Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate \(O(1/{k^2})\). Dokl. Akad. Nauk SSSR 269, 543–547 (1983)
Ng, M.K., Wang, F., Yuan, X.: Inexact alternating direction methods for image recovery. SIAM J. Sci. Comput. 33(4), 1643–1668 (2011)
Powell, M.J.D.: A method for nonlinear constraints in minimization problems. In: Fletcher, R. (ed.) Optimization. Academic Press (1969)
Shao, J., Wang, Y., Deng, X., Wang, S.: Sparse linear discriminant analysis by thresholding for high dimensional data. Ann. Stat. 39, 1241–1265 (2011)
Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B Stat. Methodol. 58, 267–288 (1996)
Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K.: Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Series B Stat. Methodol. 67, 91–108 (2005)
Tibshirani, R.J., Taylor, J.: The solution path of the generalized lasso. Ann. Stat. 39, 1335–1371 (2011)
Wang, L., Zhu, J., Zou, H.: Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics 24, 412–419 (2008)
Wang, X., Yuan, X.: The linearized alternating direction method of multipliers for Dantzig Selector. SIAM J. Sci. Comput. 34, 2782–2811 (2012)
Witten, D.M., Tibshirani, R.: Penalized classification using Fisher’s linear discriminant. J. R. Stat. Soc. Series B Stat. Methodol. 73, 753–772 (2011)
Yang, J., Yuan, X.: Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization. Math. Comput. 82, 301–329 (2013)
Zhang, C.-H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38, 894–942 (2010)
Zhang, X.Q., Burger, M., Osher, S.: A unified primal-dual algorithm framework based on Bregman iteration. J. Sci. Comput. 46, 20–46 (2011)
Zhang, X.Q., Burger, M., Bresson, X., Osher, S.: Bregmanized nonlocal regularization for deconvolution and sparse reconstruction. SIAM J. Imaging Sci. 3(3), 253–276 (2010)
Zou, H.: The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 101, 1418–1429 (2006)
Additional information
Xiaoming Yuan: This author was supported by the Faculty Research Grant from HKBU: FRG2/13-14/061 and the General Research Fund from Hong Kong Research Grants Council: 203613.
Bingsheng He: This author was supported by the NSFC Grant 11471156.
Appendices
We show that our analysis in Sects. 3 and 4 can be extended to the case where both the \(\mathbf {x}\)- and \(\mathbf {y}\)-subproblems in (3) are linearized. The resulting scheme, called the doubly linearized version of the GADMM (“DL-GADMM” for short), reads as
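(the display of (79) did not survive extraction; the following is a reconstruction offered as a sketch, assuming the standard GADMM setting \(\min \{\theta _1(\mathbf {x})+\theta _2(\mathbf {y}) \,|\, \mathbf {A}\mathbf {x}+\mathbf {B}\mathbf {y}=\mathbf {b}\}\) with penalty parameter \(\beta >0\) and relaxation factor \(\alpha \in (0,2)\); the symbols \(\theta _1\), \(\theta _2\), \(\mathbf {A}\), \(\mathbf {B}\), \(\mathbf {b}\) and \(\beta \) are assumed from that setting rather than reproduced from the paper)

$$\begin{aligned} \left\{ \begin{array}{l} \mathbf {x}^{t+1} = \arg \min \limits _{\mathbf {x}} \Big \{\theta _1(\mathbf {x}) - (\varvec{\lambda }^t)^T\mathbf {A}\mathbf {x} + \frac{\beta }{2}\Vert \mathbf {A}\mathbf {x}+\mathbf {B}\mathbf {y}^t-\mathbf {b}\Vert ^2 + \frac{1}{2}\Vert \mathbf {x}-\mathbf {x}^t\Vert _{\mathbf {G}_1}^2\Big \},\\ \mathbf {y}^{t+1} = \arg \min \limits _{\mathbf {y}} \Big \{\theta _2(\mathbf {y}) - (\varvec{\lambda }^t)^T\mathbf {B}\mathbf {y} + \frac{\beta }{2}\Vert \alpha \mathbf {A}\mathbf {x}^{t+1}+(1-\alpha )(\mathbf {b}-\mathbf {B}\mathbf {y}^t)+\mathbf {B}\mathbf {y}-\mathbf {b}\Vert ^2 + \frac{1}{2}\Vert \mathbf {y}-\mathbf {y}^t\Vert _{\mathbf {G}_2}^2\Big \},\\ \varvec{\lambda }^{t+1} = \varvec{\lambda }^t - \beta \big (\alpha \mathbf {A}\mathbf {x}^{t+1}+(1-\alpha )(\mathbf {b}-\mathbf {B}\mathbf {y}^t)+\mathbf {B}\mathbf {y}^{t+1}-\mathbf {b}\big ), \end{array}\right. \end{aligned}$$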
where the matrices \(\mathbf {G}_1\in {\mathbb {R}}^{n_1\times n_1}\) and \(\mathbf {G}_2\in {\mathbb {R}}^{n_2\times n_2}\) are both symmetric and positive definite.
For further analysis, we define two matrices, which are analogous to \(\mathbf {H}\) and \(\mathbf {Q}\) in (11), respectively, as
Obviously, we have
where \(\mathbf {M}\) is defined in (10). Note that the equalities (8) and (9) still hold.
1.1 A worst-case \({\mathcal {O}}(1/k)\) convergence rate in the ergodic sense for (79)
We first establish a worst-case \({\mathcal {O}}(1/k)\) convergence rate in the ergodic sense for the DL-GADMM (79). Indeed, thanks to the relationship (81), the proof is nearly identical to that in Sect. 3 for the L-GADMM (4). We thus only list two lemmas (analogous to Lemmas 1 and 2) and one theorem (analogous to Theorem 2) that establish a worst-case \({\mathcal {O}}(1/k)\) convergence rate in the ergodic sense for (79), and omit the proofs.
Lemma 7
Let the sequence \(\{\mathbf {w}^t\}\) be generated by the DL-GADMM (79) with \(\alpha \in (0,2)\) and the associated sequence \(\{\widetilde{\mathbf {w}}^t\}\) be defined in (7). Then we have
where \(\mathbf {Q}_2\) is defined in (80).
Lemma 8
Let the sequence \(\{\mathbf {w}^t\}\) be generated by the DL-GADMM (79) with \(\alpha \in (0,2)\) and the associated sequence \(\{\widetilde{\mathbf {w}}^t\}\) be defined in (7). Then for any \(\mathbf {w}\in \varOmega \), we have
Theorem 7
Let \(\mathbf {H}_2\) be given by (80) and \(\{\mathbf {w}^t\}\) be the sequence generated by the DL-GADMM (79) with \(\alpha \in (0,2)\). For any integer \(k>0\), let \(\widehat{\mathbf {w}}_k\) be defined by
where \(\widetilde{\mathbf {w}}^t\) is defined in (7). Then, \(\widehat{\mathbf {w}}_k\in \varOmega \) and
1.2 A worst-case \({\mathcal {O}}(1/k)\) convergence rate in a nonergodic sense for (79)
Next, we prove a worst-case \({\mathcal {O}}(1/k)\) convergence rate in a nonergodic sense for the DL-GADMM (79). Note that Lemma 4 still holds with \(\mathbf {H}\) replaced by \(\mathbf {H}_2\). That is, if \(\Vert \mathbf {w}^t-\mathbf {w}^{t+1}\Vert _{\mathbf {H}_2}^2 = 0\), then \(\widetilde{\mathbf {w}}^t\) defined in (7) is an optimal solution point of (5). Thus, for the sequence \(\{\mathbf {w}^t\}\) generated by the DL-GADMM (79), it is reasonable to measure the accuracy of an iterate by \(\Vert \mathbf {w}^t-\mathbf {w}^{t+1}\Vert _{\mathbf {H}_2}^2\).
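In implementations, this quantity can serve directly as a stopping criterion. The following is a minimal Matlab sketch (not taken from the paper's code packages); the names `H2`, `w_t`, `w_tp1` and `tol` are hypothetical placeholders for the matrix in (80), two consecutive stacked iterates, and a user tolerance.

```matlab
% Minimal sketch: nonergodic accuracy measure as a stopping rule.
n     = 5;
H2    = eye(n);             % placeholder for the matrix assembled per (80)
w_t   = randn(n, 1);        % current iterate, stacked as (x; y; lambda)
w_tp1 = 0.999 * w_t;        % next iterate
tol   = 1e-6;
d   = w_t - w_tp1;
res = d' * (H2 * d);        % equals ||w^t - w^{t+1}||_{H2}^2
if res < tol^2
    disp('stop: w_tilde_t from (7) is an acceptable approximate solution');
end
```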
Proofs of the following two lemmas are analogous to those of Lemmas 5 and 6, respectively. We thus omit them.
Lemma 9
Let the sequence \(\{\mathbf {w}^t\}\) be generated by the DL-GADMM (79) with \(\alpha \in (0,2)\) and the associated \(\{ \widetilde{\mathbf {w}}^t\}\) be defined in (7); the matrix \(\mathbf {Q}_2\) be defined in (80). Then, we have
Lemma 10
Let the sequence \(\{\mathbf {w}^t\}\) be generated by the DL-GADMM (79) with \(\alpha \in (0,2)\) and the associated \(\{ \widetilde{\mathbf {w}}^t\}\) be defined in (7); the matrices \(\mathbf {M}\), \(\mathbf {H}_2\), \(\mathbf {Q}_2\) be defined in (10) and (80). Then, we have
Based on the above two lemmas, we see that the sequence \(\{\Vert \mathbf {w}^t-\mathbf {w}^{t+1}\Vert _{\mathbf {H}_2}\}\) is monotonically non-increasing. More precisely, we have the following theorem.
Theorem 8
Let the sequence \(\{\mathbf {w}^t\}\) be generated by the DL-GADMM (79) and the matrix \(\mathbf {H}_2\) be defined in (80). Then, we have
Note that for the DL-GADMM (79), the \(\mathbf {y}\)-subproblem is also proximally regularized, so we cannot extend the inequality (31) to this new case. This is indeed the main difficulty in proving a worst-case \({\mathcal {O}}(1/k)\) convergence rate in a nonergodic sense for the DL-GADMM (79); a more elaborate analysis is needed. We first prove a lemma that bounds the left-hand side of (31).
Lemma 11
Let \(\{\mathbf {y}^t\}\) be the sequence generated by the DL-GADMM (79) with \(\alpha \in (0,2)\). Then, we have
Proof
It follows from the optimality condition of the \(\mathbf {y}\)-subproblem in (79) that
Similarly, we also have,
Setting \(\mathbf {y}= \mathbf {y}^{t}\) in (86) and \(\mathbf {y}=\mathbf {y}^{t+1}\) in (87), and summing them up, we have
where the second inequality holds by the fact that \(\mathbf {a}^T\mathbf {b}\ge - \frac{1}{2}(\Vert \mathbf {a}\Vert ^2 + \Vert \mathbf {b}\Vert ^2)\). The assertion (85) is proved. \(\square \)
Two more lemmas should be proved in order to establish a worst-case \({\mathcal {O}}(1/k)\) convergence rate in a nonergodic sense for the DL-GADMM (79).
Lemma 12
Let the sequence \(\{\mathbf {w}^t\}\) be generated by the DL-GADMM (79) with \(\alpha \in (0,2)\) and the associated \(\{ \widetilde{\mathbf {w}}^t\}\) be defined in (7). Then we have
where \(c_\alpha \) is defined in (37).
Proof
By the definitions of \(\mathbf {Q}_2\), \(\mathbf {M}\) and \(\mathbf {H}_2\), we have
which implies the assertion (88) immediately. \(\square \)
In the next lemma, we refine the bound of \((\mathbf {w}-\widetilde{\mathbf {w}}^t)^T\mathbf {Q}_2(\mathbf {w}^{t}-\widetilde{\mathbf {w}}^t)\) in (82). The refined bound involves the term \(\Vert \mathbf {w}-\mathbf {w}^{t+1}\Vert _{\mathbf {H}_2}^2\) recursively, which is favorable for establishing a worst-case \({\mathcal {O}}(1/k)\) convergence rate in a nonergodic sense for the DL-GADMM (79).
Lemma 13
Let \(\{\mathbf {w}^t\}\) be the sequence generated by the DL-GADMM (79) with \(\alpha \in (0,2)\). Then, \(\widetilde{\mathbf {w}}^t\in \varOmega \) and
where \(\mathbf {M}\) is defined in (10), and \(\mathbf {H}_2\) and \(\mathbf {Q}_2\) are defined in (80).
Proof
By the identity \(\mathbf {Q}_2(\mathbf {w}^t - \widetilde{\mathbf {w}}^t) = \mathbf {H}_2(\mathbf {w}^t - \mathbf {w}^{t+1})\), it holds that
Setting \(\mathbf {a}=\mathbf {w}\), \(\mathbf {b}=\widetilde{\mathbf {w}}^t\), \(\mathbf {c}=\mathbf {w}^t\) and \(\mathbf {d}= \mathbf {w}^{t+1}\) in the identity
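(the display of this identity did not survive extraction; presumably it is the standard polarization identity for a symmetric matrix, stated here for \(\mathbf {H}_2\) and verifiable by direct expansion)

$$(\mathbf {a}-\mathbf {b})^T\mathbf {H}_2(\mathbf {c}-\mathbf {d}) = \frac{1}{2}\left( \Vert \mathbf {a}-\mathbf {d}\Vert _{\mathbf {H}_2}^2-\Vert \mathbf {a}-\mathbf {c}\Vert _{\mathbf {H}_2}^2\right) +\frac{1}{2}\left( \Vert \mathbf {c}-\mathbf {b}\Vert _{\mathbf {H}_2}^2-\Vert \mathbf {d}-\mathbf {b}\Vert _{\mathbf {H}_2}^2\right) ,$$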
we have
Meanwhile, we have
where the last equality comes from the identity \(\mathbf {Q}_2 = \mathbf {H}_2\mathbf {M}\).
Substituting the above identity into (90), we have, for all \(\mathbf {w}\in \varOmega \),
Plugging this identity into (82), our claim follows immediately. \(\square \)
Then, we show the boundedness of the sequence \(\{\mathbf {w}^t\}\) generated by the DL-GADMM (79), which essentially implies the convergence of \(\{\mathbf {w}^t\}\).
Theorem 9
Let \(\{\mathbf {w}^t\}\) be the sequence generated by the DL-GADMM (79) with \(\alpha \in (0,2)\). Then, it holds that
where \(\mathbf {H}_2\) is defined in (80).
Proof
Setting \(\mathbf {w}= \mathbf {w}^*\) in (89), we have
Then, recalling (5), we have
It is easy to see that \(\mathbf {Q}^T_2+\mathbf {Q}_2-\mathbf {M}^T\mathbf {H}_2\mathbf {M}\succeq {\varvec{0}}\). Thus, it holds that
which completes the proof. \(\square \)
Finally, we establish a worst-case \({\mathcal {O}}(1/k)\) convergence rate in a nonergodic sense for the DL-GADMM (79).
Theorem 10
Let the sequence \(\{\mathbf {w}^t\}\) be generated by the DL-GADMM scheme (79) with \(\alpha \in (0,2)\). It holds that
Proof
By the definition of \(\mathbf {H}_2\) in (80), we have
Using (85), (88), (91) and (93), we obtain
By Theorem 8, the sequence \(\{\Vert \mathbf {w}^t-\mathbf {w}^{t+1}\Vert _{\mathbf {H}_2}^2\}\) is non-increasing. Thus, we have
and the assertion (92) is proved. \(\square \)
Recall that for the sequence \(\{\mathbf {w}^t\}\) generated by the DL-GADMM (79), it is reasonable to measure the accuracy of an iterate by \(\Vert \mathbf {w}^t-\mathbf {w}^{t+1}\Vert _{\mathbf {H}_2}^2\). Thus, Theorem 10 demonstrates a worst-case \({\mathcal {O}}(1/k)\) convergence rate in a nonergodic sense for the DL-GADMM (79).
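To make the scheme concrete, we close with a minimal Matlab sketch of the DL-GADMM on a toy instance of the model \(\min \{\theta _1(\mathbf {x})+\theta _2(\mathbf {y})\,|\,\mathbf {A}\mathbf {x}+\mathbf {B}\mathbf {y}=\mathbf {b}\}\): \(\theta _1(\mathbf {x})=\Vert \mathbf {x}\Vert _1\), \(\theta _2(\mathbf {y})=\frac{1}{2}\Vert \mathbf {y}-\mathbf {c}\Vert ^2\), \(\mathbf {A}=\mathbf {I}\), \(\mathbf {B}=-\mathbf {I}\), \(\mathbf {b}=\mathbf {0}\). The instance, the parameter values, and the choices \(\mathbf {G}_i=(\tau _i-\beta )\mathbf {I}\) with \(\tau _i>\beta \) are our assumptions for illustration; this sketch is not the paper's code package.

```matlab
% DL-GADMM sketch on: min ||x||_1 + 0.5*||y - c||^2  s.t.  x - y = 0.
% With A = I, B = -I, b = 0 and G_i = (tau_i - beta)*I (tau_i > beta),
% both subproblems of (79) have closed-form solutions.
rng(0);
n    = 200;
c    = randn(n, 1);
beta = 1;  alpha = 1.5;            % relaxation factor alpha in (0,2)
tau1 = 1.5*beta;  tau2 = 1.5*beta; % so G_1, G_2 are positive definite
x = zeros(n,1);  y = zeros(n,1);  lam = zeros(n,1);
soft = @(v, t) sign(v) .* max(abs(v) - t, 0);   % soft-thresholding
for k = 1:1000
    % x-subproblem: prox of ||.||_1 after the proximal linearization
    x_new = soft((lam + beta*y + (tau1 - beta)*x) / tau1, 1/tau1);
    % relaxed point alpha*A*x^{t+1} + (1-alpha)*(b - B*y^t)
    r = alpha*x_new + (1 - alpha)*y;
    % y-subproblem: quadratic, solved exactly
    y_new = (c - lam + beta*r + (tau2 - beta)*y) / (1 + tau2);
    % multiplier update
    lam = lam - beta*(r - y_new);
    if norm(x_new - x) + norm(y_new - y) < 1e-10
        x = x_new;  y = y_new;  break;
    end
    x = x_new;  y = y_new;
end
% For this instance the minimizer is x = y = soft(c, 1).
fprintf('error vs. closed form: %.2e\n', norm(x - soft(c, 1)));
```

At the fixed point, the multiplier satisfies \(\varvec{\lambda }=\mathbf {c}-\mathbf {y}\), which lies in \(\partial \Vert \mathbf {x}\Vert _1\) at \(\mathbf {x}=\mathbf {y}\); this is a quick consistency check on the sketch.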
Cite this article
Fang, E.X., He, B., Liu, H., Yuan, X.: Generalized alternating direction method of multipliers: new theoretical insights and applications. Math. Prog. Comp. 7, 149–187 (2015). https://doi.org/10.1007/s12532-015-0078-2
Keywords
- Convex optimization
- Alternating direction method of multipliers
- Convergence rate
- Variable selection
- Discriminant analysis
- Statistical learning