Abstract
This paper introduces a parallel and distributed algorithm for solving the following minimization problem with linear constraints:
$$\begin{aligned} \min _{\mathbf{x}_1,\ldots ,\mathbf{x}_N}~\sum _{i=1}^N f_i(\mathbf{x}_i)\quad \text {subject to}\quad \sum _{i=1}^N A_i\mathbf{x}_i=c,\quad \mathbf{x}_i\in {\mathcal {X}}_i,~i=1,\ldots ,N, \end{aligned}$$
where \(N \ge 2\), \(f_i\) are convex functions, \(A_i\) are matrices, and \({\mathcal {X}}_i\) are feasible sets for the variables \(\mathbf{x}_i\). Our algorithm extends the alternating direction method of multipliers (ADMM): it decomposes the original problem into N smaller subproblems and solves them in parallel at each iteration. This paper shows that the classic ADMM can be extended to the N-block Jacobi fashion and preserve convergence in the following two cases: (i) the matrices \(A_i\) are mutually near-orthogonal and have full column rank, or (ii) proximal terms are added to the N subproblems (but without any assumption on the matrices \(A_i\)). In the latter case, certain proximal terms allow the subproblems to be solved in more flexible and efficient ways. We show that \(\Vert {\mathbf {x}}^{k+1} - {\mathbf {x}}^k\Vert _M^2\) converges at a rate of o(1/k), where M is a symmetric positive semi-definite matrix. Since the parameters used in the convergence analysis are conservative, we introduce a strategy for automatically tuning the parameters, which substantially accelerates our algorithm in practice. We implemented our algorithm (for case ii above) on Amazon EC2 and tested it on basis pursuit problems with more than 300 GB of distributed data. This is the first reported instance of successfully solving a compressive sensing problem at such a large scale.
References
Awanou, G., Lai, M.J., Wenston, P.: The multivariate spline method for numerical solution of partial differential equations and scattered data interpolation. In: Chen, G., Lai, M.J. (eds.) Wavelets and Splines, pp. 24–74. Nashboro Press, Nashville (2006)
Bertsekas, D., Tsitsiklis, J.: Parallel and Distributed Computation: Numerical Methods, 2nd edn. Athena Scientific, Belmont (1997)
Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
Chandrasekaran, V., Parrilo, P.A., Willsky, A.S.: Latent variable graphical model selection via convex optimization. Ann. Stat. 40(4), 1935–1967 (2012)
Chen, C., He, B.S., Ye, Y.Y., Yuan, X.M.: The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Math. Program. 155(1), 57–79 (2016)
Chen, C., Shen, Y., You, Y.: On the convergence analysis of the alternating direction method of multipliers with three blocks. Abstr. Appl. Anal. 2013, 183961 (2013)
Chen, G., Teboulle, M.: A proximal-based decomposition method for convex minimization problems. Math. Program. 64(1), 81–101 (1994)
Corman, E., Yuan, X.M.: A generalized proximal point algorithm and its convergence rate. SIAM J. Optim. 24(4), 1614–1638 (2014)
Davis, D., Yin, W.: Convergence rate analysis of several splitting schemes. UCLA CAM Report 14-51 (2014)
Davis, D., Yin, W.: Convergence rates of relaxed Peaceman–Rachford and ADMM under regularity assumptions. UCLA CAM Report 14-58 (2014)
Davis, D., Yin, W.: A three-operator splitting scheme and its optimization applications. UCLA CAM Report 15-13 (2015)
Deng, W., Yin, W.: On the global and linear convergence of the generalized alternating direction method of multipliers. J. Sci. Comput. 66(3), 889–916 (2016)
Everett, H.: Generalized Lagrange multiplier method for solving problems of optimum allocation of resources. Oper. Res. 11(3), 399–417 (1963)
Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 2(1), 17–40 (1976)
Glowinski, R.: Numerical methods for nonlinear variational problems. Springer Series in Computational Physics. Springer, Berlin (1984)
Glowinski, R., Marrocco, A.: Sur l’approximation, par éléments finis d’ordre un, et la résolution, par pénalisation-dualité, d’une classe de problèmes de Dirichlet non linéaires. Laboria (1975)
Goldstein, T., O’Donoghue, B., Setzer, S., Baraniuk, R.: Fast alternating direction optimization methods. SIAM J. Imaging Sci. 7(3), 1588–1623 (2014)
Han, D., Yuan, X.: A note on the alternating direction method of multipliers. J. Optim. Theory Appl. 155(1), 227–238 (2012)
He, B.S.: A class of projection and contraction methods for monotone variational inequalities. Appl. Math. Optim. 35(1), 69–76 (1997)
He, B.S.: Parallel splitting augmented Lagrangian methods for monotone structured variational inequalities. Comput. Optim. Appl. 42(2), 195–212 (2009)
He, B.S., Hou, L.S., Yuan, X.M.: On full Jacobian decomposition of the augmented Lagrangian method for separable convex programming. SIAM J. Optim. 25, 2274–2312 (2015)
He, B.S., Tao, M., Yuan, X.M.: Alternating direction method with Gaussian back substitution for separable convex programming. SIAM J. Optim. 22(2), 313–340 (2012)
He, B.S., Yuan, X.M.: On the \(O(1/n)\) convergence rate of the Douglas-Rachford alternating direction method. SIAM J. Numer. Anal. 50(2), 700–709 (2012)
He, B.S., Yuan, X.M.: On non-ergodic convergence rate of Douglas-Rachford alternating direction method of multipliers. Numer. Math. 130(3), 567–577 (2015)
Hong, M., Luo, Z.Q.: On the Linear Convergence of the Alternating Direction Method of Multipliers. arXiv:1208.3922 (2012)
Li, M., Sun, D., Toh, K.C.: A Convergent 3-Block Semi-Proximal ADMM for Convex Minimization Problems with One Strongly Convex Block. arXiv:1410.7933 [math] (2014)
Lin, T., Ma, S., Zhang, S.: On the Convergence Rate of Multi-Block ADMM. arXiv:1408.4265 [math] (2014)
Lions, P.L., Mercier, B.: Splitting algorithms for the sum of two nonlinear operators. SIAM J. Numer. Anal. 16(6), 964–979 (1979)
Mota, J.F., Xavier, J.M., Aguiar, P.M., Puschel, M.: D-ADMM: a communication-efficient distributed algorithm for separable optimization. IEEE Trans. Signal Process. 61, 2718–2723 (2013)
Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005)
Peng, Y., Ganesh, A., Wright, J., Xu, W., Ma, Y.: RASL: robust alignment by sparse and low-rank decomposition for linearly correlated images. IEEE Trans. Pattern Anal. Mach. Intell. 34, 2233–2246 (2012)
Peng, Z., Yan, M., Yin, W.: Parallel and distributed sparse optimization. In: IEEE Asilomar Conference on Signals Systems and Computers (2013)
Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1997)
Shor, N.Z., Kiwiel, K.C., Ruszczyński, A.: Minimization Methods for Non-differentiable Functions. Springer, New York (1985)
Tao, M., Yuan, X.M.: Recovering low-rank and sparse components of matrices from incomplete and noisy observations. SIAM J. Optim. 21(1), 57–81 (2011)
Wang, X.F., Hong, M.Y., Ma, S.Q., Luo, Z.Q.: Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. Pac. J. Optim. 11(4), 57–81 (2015)
Yang, J.F., Zhang, Y.: Alternating direction algorithms for \(\ell _1\)-problems in compressive sensing. SIAM J. Sci. Comput. 33(1), 250–278 (2011)
Zhang, X., Burger, M., Osher, S.: A unified primal-dual algorithm framework based on Bregman iteration. J. Sci. Comput. 46(1), 20–46 (2011)
Acknowledgements
Wei Deng is partially supported by NSF grant ECCS-1028790. Ming-Jun Lai is partially supported by a Simons collaboration grant for 2013–2018 and by NSF grant DMS-1521537. Zhimin Peng and Wotao Yin are partially supported by NSF grants DMS-0748839 and DMS-1317602, and ARO/ARL MURI grant FA9550-10-1-0567. The authors would like to thank the anonymous reviewers for their careful reading and suggestions.
Appendix: On o(1 / k) Convergence Rate of ADMM
The convergence of the standard two-block ADMM has long been established in the literature [14, 16]. Its convergence rate has been actively studied; see [9, 10, 12, 17, 23–25, 28] and the references therein. In the following, we briefly review the convergence analysis for ADMM (\(N=2\)) and then slightly improve the O(1/k) convergence rate established in [24] to o(1/k), using the same technique as in Sect. 2.2.
As suggested in [24], the quantity \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert _H^2\) can be used to measure the optimality of the iterations of ADMM, where
$$\begin{aligned} {\mathbf {w}}=\begin{pmatrix}\mathbf{x}_2\\ \lambda \end{pmatrix},\qquad H=\begin{pmatrix}\rho A_2^\top A_2&0\\ 0&\frac{1}{\rho }{\mathbf {I}}\end{pmatrix}, \end{aligned}$$
and \({\mathbf {I}}\) is the identity matrix of size \(m\times m\). Note that \(\mathbf{x}_1\) is not part of \({\mathbf {w}}\) because \(\mathbf{x}_1\) can be regarded as an intermediate variable in the iterations of ADMM, whereas \((\mathbf{x}_2,\lambda )\) are the essential variables [3]. In fact, if \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert _H^2=0\), then \({\mathbf {w}}^{k+1}\) is optimal. The reasons are as follows. Recall the subproblems of ADMM:
$$\begin{aligned} \mathbf{x}_1^{k+1}&=\mathop {\mathrm{arg\,min}}_{\mathbf{x}_1\in {\mathcal {X}}_1}~f_1(\mathbf{x}_1)-\langle \lambda ^k,A_1\mathbf{x}_1\rangle +\frac{\rho }{2}\Vert A_1\mathbf{x}_1+A_2\mathbf{x}_2^k-c\Vert ^2,\\ \mathbf{x}_2^{k+1}&=\mathop {\mathrm{arg\,min}}_{\mathbf{x}_2\in {\mathcal {X}}_2}~f_2(\mathbf{x}_2)-\langle \lambda ^k,A_2\mathbf{x}_2\rangle +\frac{\rho }{2}\Vert A_1\mathbf{x}_1^{k+1}+A_2\mathbf{x}_2-c\Vert ^2,\\ \lambda ^{k+1}&=\lambda ^k-\rho \left(A_1\mathbf{x}_1^{k+1}+A_2\mathbf{x}_2^{k+1}-c\right). \end{aligned}$$
By the formula for \(\lambda ^{k+1}\), their optimality conditions can be written as:
$$\begin{aligned} \langle \mathbf{x}_1-\mathbf{x}_1^{k+1},~g_1^{k+1}-A_1^\top \lambda ^{k+1}+\rho A_1^\top A_2\left(\mathbf{x}_2^k-\mathbf{x}_2^{k+1}\right)\rangle \ge 0,\quad \forall \mathbf{x}_1\in {\mathcal {X}}_1, \end{aligned}$$(6.3)
$$\begin{aligned} \langle \mathbf{x}_2-\mathbf{x}_2^{k+1},~g_2^{k+1}-A_2^\top \lambda ^{k+1}\rangle \ge 0,\quad \forall \mathbf{x}_2\in {\mathcal {X}}_2, \end{aligned}$$(6.4)
where \(g_i^{k+1}\in \partial f_i(\mathbf{x}_i^{k+1})\) denotes a subgradient.
In comparison with the KKT conditions (1.14a) and (1.14b), we can see that \(\mathbf{u}^{k+1}=\left( \mathbf{x}^{k+1}_1,\mathbf{x}^{k+1}_2,\lambda ^{k+1}\right) \) is a solution of (1.1) if and only if the following holds:
$$\begin{aligned} \rho A_1^\top A_2\left(\mathbf{x}_2^k-\mathbf{x}_2^{k+1}\right)=0 \end{aligned}$$(6.5)
and
$$\begin{aligned} {\mathbf {r}}_p:=A_1\mathbf{x}_1^{k+1}+A_2\mathbf{x}_2^{k+1}-c=0. \end{aligned}$$(6.6)
By the update formula for \(\lambda ^{k+1}\), we can write \({\mathbf {r}}_p\) equivalently as
$$\begin{aligned} {\mathbf {r}}_p=\frac{1}{\rho }\left(\lambda ^k-\lambda ^{k+1}\right). \end{aligned}$$(6.7)
Clearly, if \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert _H^2=0\), then the optimality conditions (6.5) and (6.6) are satisfied, so \({\mathbf {w}}^{k+1}\) is a solution. On the other hand, if \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert _H^2\) is large, then \({\mathbf {w}}^{k+1}\) is likely to be far from a solution. Therefore, the quantity \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert _H^2\) can be viewed as a measure of the distance between the iterate \({\mathbf {w}}^{k+1}\) and the solution set. Furthermore, based on the variational inequality (1.15) and the variational characterization of the iterations of ADMM, it is reasonable to use the quadratic term \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert _H^2\) rather than \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert _H\) to measure the convergence rate of ADMM (see [24] for more details).
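To make this residual concrete, the following small numerical sketch (not from the paper) runs standard two-block ADMM on a toy quadratic instance and records \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert _H^2\); it assumes the dual update \(\lambda ^{k+1}=\lambda ^k-\rho (A_1\mathbf{x}_1^{k+1}+A_2\mathbf{x}_2^{k+1}-c)\) and \(H={\text {blkdiag}}(\rho A_2^\top A_2,\,\rho ^{-1}{\mathbf {I}})\), and the matrices, sizes, and \(\rho \) are arbitrary illustrative choices.

```python
import numpy as np

# Toy two-block instance: min 0.5||x1||^2 + 0.5||x2||^2  s.t.  A1 x1 + A2 x2 = c.
rng = np.random.default_rng(0)
m, n1, n2, rho = 8, 5, 5, 1.0
A1 = rng.standard_normal((m, n1))
A2 = rng.standard_normal((m, n2))
c = rng.standard_normal(m)

def H_norm_sq(dx2, dlam):
    # ||(dx2, dlam)||_H^2 with H = blkdiag(rho * A2^T A2, (1/rho) * I)
    return rho * np.dot(A2 @ dx2, A2 @ dx2) + np.dot(dlam, dlam) / rho

x2, lam = np.zeros(n2), np.zeros(m)
res = []
for k in range(200):
    # x1-subproblem: (I + rho A1^T A1) x1 = A1^T (lam - rho (A2 x2 - c))
    x1 = np.linalg.solve(np.eye(n1) + rho * A1.T @ A1,
                         A1.T @ (lam - rho * (A2 @ x2 - c)))
    # x2-subproblem uses the fresh x1 (Gauss-Seidel order)
    x2_new = np.linalg.solve(np.eye(n2) + rho * A2.T @ A2,
                             A2.T @ (lam - rho * (A1 @ x1 - c)))
    lam_new = lam - rho * (A1 @ x1 + A2 @ x2_new - c)
    res.append(H_norm_sq(x2 - x2_new, lam - lam_new))  # ||w^k - w^{k+1}||_H^2
    x2, lam = x2_new, lam_new

# The recorded residuals should be monotonically non-increasing, cf. (6.9).
assert all(res[k + 1] <= res[k] + 1e-12 for k in range(len(res) - 1))
```

On this strongly convex toy problem the residual decays geometrically; the monotone decrease observed here is exactly the property (6.9) established below.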
The work [24] proves that \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert _H^2\) converges to zero at a rate of O(1 / k). The key steps of the proof are to establish the following properties:
- the sequence \(\{{\mathbf {w}}^k\}\) is contractive:
$$\begin{aligned} \Vert {\mathbf {w}}^k-{\mathbf {w}}^*\Vert ^2_H-\Vert {\mathbf {w}}^{k+1}-{\mathbf {w}}^*\Vert ^2_H\ge \Vert {\mathbf {w}}^{k}-{\mathbf {w}}^{k+1}\Vert ^2_H, \end{aligned}$$(6.8)
- the sequence \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert _H^2\) is monotonically non-increasing:
$$\begin{aligned} \Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert ^2_H\le \Vert {\mathbf {w}}^{k-1}-{\mathbf {w}}^k\Vert ^2_H. \end{aligned}$$(6.9)
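For completeness, these two properties together yield the O(1/k) rate of [24] directly: summing (6.8) over the first \(k+1\) iterations and using the monotonicity of the residual,
$$\begin{aligned} (k+1)\,\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert ^2_H\le \sum _{j=0}^{k}\Vert {\mathbf {w}}^j-{\mathbf {w}}^{j+1}\Vert ^2_H\le \Vert {\mathbf {w}}^0-{\mathbf {w}}^*\Vert ^2_H, \end{aligned}$$
so \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert ^2_H\le \Vert {\mathbf {w}}^0-{\mathbf {w}}^*\Vert ^2_H/(k+1)\).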
The contraction property (6.8) has long been established; its proof dates back to [14, 16]. Inspired by [24], we provide a shorter proof of (6.9) than the one in [24].
Proof (of (6.9))
Let \(\Delta \mathbf{x}_i^{k+1}=\mathbf{x}_i^k-\mathbf{x}_i^{k+1}\) and \(\Delta \lambda ^{k+1}=\lambda ^k-\lambda ^{k+1}\). By Lemma 1.1, i.e., (1.17), the optimality condition (6.3) at the k-th and \((k+1)\)-th iterations yields: \(\langle \Delta \mathbf{x}_1^{k+1},~A_1^\top \Delta \lambda ^{k+1}-\rho A_1^\top A_2(\Delta \mathbf{x}_2^k-\Delta \mathbf{x}_2^{k+1})\rangle \ge 0. \) Similarly for (6.4), we obtain \( \langle \Delta \mathbf{x}_2^{k+1},~A_2^\top \Delta \lambda ^{k+1}\rangle \ge 0. \) Adding the above two inequalities together, we have
$$\begin{aligned} \left(A_1\Delta \mathbf{x}_1^{k+1}+A_2\Delta \mathbf{x}_2^{k+1}\right)^\top \Delta \lambda ^{k+1}-\rho \left(A_1\Delta \mathbf{x}_1^{k+1}\right)^\top A_2\left(\Delta \mathbf{x}_2^k-\Delta \mathbf{x}_2^{k+1}\right)\ge 0. \end{aligned}$$(6.10)
Using the following equality, which holds by (6.7):
$$\begin{aligned} A_1\Delta \mathbf{x}_1^{k+1}=\frac{1}{\rho }\left(\Delta \lambda ^k-\Delta \lambda ^{k+1}\right)-A_2\Delta \mathbf{x}_2^{k+1}, \end{aligned}$$(6.11)
(6.10) becomes
$$\begin{aligned} \frac{1}{\rho }\left(\Delta \lambda ^k-\Delta \lambda ^{k+1}\right)^\top \Delta \lambda ^{k+1}-\left(\Delta \lambda ^k-\Delta \lambda ^{k+1}-\rho A_2\Delta \mathbf{x}_2^{k+1}\right)^\top A_2\left(\Delta \mathbf{x}_2^k-\Delta \mathbf{x}_2^{k+1}\right)\ge 0. \end{aligned}$$
After rearranging the terms, we get
$$\begin{aligned}&\frac{1}{\rho }\left(\Delta \lambda ^k\right)^\top \Delta \lambda ^{k+1}+\left(\Delta \lambda ^k\right)^\top A_2\Delta \mathbf{x}_2^{k+1}+\left(\Delta \lambda ^{k+1}\right)^\top A_2\Delta \mathbf{x}_2^k+\rho \left(A_2\Delta \mathbf{x}_2^k\right)^\top A_2\Delta \mathbf{x}_2^{k+1}\\&\quad -\left(\Delta \lambda ^k\right)^\top A_2\Delta \mathbf{x}_2^k-\left(\Delta \lambda ^{k+1}\right)^\top A_2\Delta \mathbf{x}_2^{k+1}\ge \frac{1}{\rho }\Vert \Delta \lambda ^{k+1}\Vert ^2+\rho \Vert A_2\Delta \mathbf{x}_2^{k+1}\Vert ^2. \end{aligned}$$(6.12)
By the Cauchy–Schwarz inequality, we have \((a_1+b_1)^\top (a_2+b_2)\le (\Vert a_1+b_1\Vert ^2+\Vert a_2+b_2\Vert ^2)/2, \) or equivalently, \((a_1+b_1)^\top (a_2+b_2)-a_1^\top b_1-a_2^\top b_2 \le (\Vert a_1\Vert ^2+\Vert b_1\Vert ^2+\Vert a_2\Vert ^2+\Vert b_2\Vert ^2)/2. \) Applying this inequality to the left-hand side of (6.12) with \(a_1=\frac{1}{\sqrt{\rho }}\Delta \lambda ^k\), \(b_1=\sqrt{\rho }\,A_2\Delta \mathbf{x}_2^k\), \(a_2=\frac{1}{\sqrt{\rho }}\Delta \lambda ^{k+1}\), and \(b_2=\sqrt{\rho }\,A_2\Delta \mathbf{x}_2^{k+1}\), we have
$$\begin{aligned} \Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert ^2_H\le \frac{1}{2}\left(\Vert {\mathbf {w}}^{k-1}-{\mathbf {w}}^k\Vert ^2_H+\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert ^2_H\right), \end{aligned}$$
and thus (6.9) follows immediately.
We are now ready to improve the convergence rate from O(1 / k) to o(1 / k).
Theorem 6.1
The sequence \(\{{\mathbf {w}}^k\}\) generated by Algorithm 2 (for \(N=2\)) converges to a solution \({\mathbf {w}}^*\) of problem (1.1) in the H-norm, i.e., \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^*\Vert _H^2\rightarrow 0\), and \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert ^2_H = o(1/k)\). Therefore,
$$\begin{aligned} \Vert A_1\mathbf{x}_1^k-A_1\mathbf{x}_1^{k+1}\Vert ^2=o(1/k),\quad \Vert A_2\mathbf{x}_2^k-A_2\mathbf{x}_2^{k+1}\Vert ^2=o(1/k),\quad \Vert \lambda ^k-\lambda ^{k+1}\Vert ^2=o(1/k). \end{aligned}$$(6.13)
Proof
Using the contractive property (6.8) of the sequence \(\{{\mathbf {w}}^k\}\) along with the optimality conditions, \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^*\Vert _H^2\rightarrow 0\) follows from the standard analysis for contraction methods [19].
By (6.8), we have
$$\begin{aligned} \sum _{k=1}^{K}\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert ^2_H\le \Vert {\mathbf {w}}^1-{\mathbf {w}}^*\Vert ^2_H\quad \text {for all }K\ge 1. \end{aligned}$$
Therefore, \(\sum _{k=1}^{\infty } \Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert ^2_H<\infty \). By (6.9), \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert ^2_H\) is monotonically non-increasing and nonnegative. So Lemma 1.1 indicates that \(\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert ^2_H = o(1/k)\), which further implies that \(\Vert A_2\mathbf{x}_2^k - A_2\mathbf{x}_2^{k+1}\Vert ^2 = o(1/k)\) and \(\Vert \lambda ^k- \lambda ^{k+1}\Vert ^2 = o(1/k)\). By (6.11), we also have \(\Vert A_1\mathbf{x}_1^k- A_1\mathbf{x}_1^{k+1}\Vert ^2= o(1/k)\). Thus (6.13) follows immediately. \(\square \)
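The o(1/k) conclusion rests on a standard fact about nonnegative, summable, non-increasing sequences: if \(a_k\ge 0\), \(a_{k+1}\le a_k\), and \(\sum _k a_k<\infty \), then
$$\begin{aligned} k\,a_{2k}\le \sum _{j=k+1}^{2k}a_j\rightarrow 0\quad \text {as }k\rightarrow \infty , \end{aligned}$$
so \(a_{2k}=o(1/k)\) and, by monotonicity, \(a_k=o(1/k)\). Here this fact is applied with \(a_k=\Vert {\mathbf {w}}^k-{\mathbf {w}}^{k+1}\Vert ^2_H\).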
Remark 6.1
The proof technique based on Lemma 1.1 can be applied to improve some other existing convergence rates of O(1 / k) (e.g., [8, 21]) to o(1 / k) as well.
Deng, W., Lai, MJ., Peng, Z. et al. Parallel Multi-Block ADMM with o(1 / k) Convergence. J Sci Comput 71, 712–736 (2017). https://doi.org/10.1007/s10915-016-0318-2