
Global Convergence of ADMM in Nonconvex Nonsmooth Optimization

Abstract

In this paper, we analyze the convergence of the alternating direction method of multipliers (ADMM) for minimizing a nonconvex and possibly nonsmooth objective function, \(\phi (x_0,\ldots ,x_p,y)\), subject to coupled linear equality constraints. Our ADMM updates each of the primal variables \(x_0,\ldots ,x_p,y\), followed by updating the dual variable. We separate the variable y from \(x_i\)’s as it has a special role in our analysis. The developed convergence guarantee covers a variety of nonconvex functions such as piecewise linear functions, \(\ell _q\) quasi-norm, Schatten-q quasi-norm (\(0<q<1\)), minimax concave penalty (MCP), and smoothly clipped absolute deviation penalty. It also allows nonconvex constraints such as compact manifolds (e.g., spherical, Stiefel, and Grassman manifolds) and linear complementarity constraints. Also, the \(x_0\)-block can be almost any lower semi-continuous function. By applying our analysis, we show, for the first time, that several ADMM algorithms applied to solve nonconvex models in statistical learning, optimization on manifold, and matrix decomposition are guaranteed to converge. Our results provide sufficient conditions for ADMM to converge on (convex or nonconvex) monotropic programs with three or more blocks, as they are special cases of our model. ADMM has been regarded as a variant to the augmented Lagrangian method (ALM). We present a simple example to illustrate how ADMM converges but ALM diverges with bounded penalty parameter \(\beta \). Indicated by this example and other analysis in this paper, ADMM might be a better choice than ALM for some nonconvex nonsmooth problems, because ADMM is not only easier to implement, it is also more likely to converge for the concerned scenarios.

Notes

  1. This is the best one can hope for (except for very specific problems), since [62, Section 1] exhibits a convex 2-block problem on which ADMM fails to converge.

  2. “Globally” here means regardless of where the initial point is.

  3. A nonnegative sequence \(\{a_k\}\) induces its running best sequence \(b_k=\min \{a_i : i\le k\}\); we say \(\{a_k\}\) has a running best rate of \(o(1/k)\) if \(b_k=o(1/k)\).
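
For illustration only (not from the paper), the running best sequence in this note can be computed as below; the example sequence \(a_k\) is hypothetical.

    import math

    def running_best(a):
        # b_k = min_{i <= k} a_i
        best, out = math.inf, []
        for ak in a:
            best = min(best, ak)
            out.append(best)
        return out

    # Hypothetical sequence: large on odd k, 1/k^2 on even k; its running best is o(1/k).
    a = [1.0 if k % 2 else 1.0 / k**2 for k in range(1, 10001)]
    b = running_best(a)
    print(len(b) * b[-1])  # k * b_k, which tends to 0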

References

  1. Attouch, H., Bolte, J., Svaiter, B.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. 137(1–2), 91–129 (2013)

  2. Bach, F., Jenatton, R., Mairal, J., Obozinski, G.: Optimization with sparsity-inducing penalties. Found. Trends Mach. Learn. 4(1), 1–106 (2012)

  3. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic Press, London (2014)

  4. Birgin, E.G., Martínez, J.M.: Practical Augmented Lagrangian Methods for Constrained Optimization, vol. 10. SIAM, Philadelphia (2014)

  5. Bolte, J., Daniilidis, A., Lewis, A.: The Lojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)

  6. Bouaziz, S., Tagliasacchi, A., Pauly, M.: Sparse iterative closest point. In: Computer Graphics Forum, vol. 32, pp. 113–123. Wiley Online Library (2013)

  7. Cai, J.F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. Optim. 20(4), 1956–1982 (2010)

  8. Chartrand, R.: Nonconvex splitting for regularized low-rank \(+\) sparse decomposition. IEEE Trans. Signal Process. 60(11), 5810–5819 (2012)

  9. Chartrand, R., Wohlberg, B.: A nonconvex ADMM algorithm for group sparsity with sparse groups. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6009–6013. IEEE (2013)

  10. Chen, C., He, B., Ye, Y., Yuan, X.: The direct extension of ADMM for multi-block convex minimization problems is not necessarily convergent. Math. Program. 155, 57–79 (2016)

  11. Chen C., Yuan, X., Zeng, S., Zhang, J.: Penalty splitting methods for solving mathematical program with equilibrium constraints. Manuscript (private communication) (2016)

  12. Conn, A.R., Gould, N.I., Toint, P.: A globally convergent augmented Lagrangian algorithm for optimization with general constraints and simple bounds. SIAM J. Numer. Anal. 28(2), 545–572 (1991)

  13. Cottle, R., Dantzig, G.: Complementary pivot theory of mathematical programming. Linear Algebra Appl. 1, 103–125 (1968)

  14. Daubechies, I., DeVore, R., Fornasier, M., Güntürk, C.S.: Iteratively reweighted least squares minimization for sparse recovery. Commun. Pure Appl. Math. 63(1), 1–38 (2010)

  15. Davis, D., Yin, W.: Convergence rate analysis of several splitting schemes. In: Glowinski, R., Osher, S., Yin, W. (eds.) Splitting Methods in Communication, Imaging, Science and Engineering. Springer, New York (2016)

  16. Davis, D., Yin, W.: Convergence rates of relaxed Peaceman-Rachford and ADMM under regularity assumptions. Math. Oper. Res. 42(3), 783–805 (2017)

  17. Deng, W., Lai, M.J., Peng, Z., Yin, W.: Parallel multi-block ADMM with \(o (1/k)\) convergence. J. Sci. Comput. 71, 712–736 (2017)

  18. Ding, C., Sun, D., Sun, J., Toh, K.C.: Spectral operators of matrices. Math. Program. 168(1–2), 509–531 (2018)

  19. Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Comput. Math. Appl. 2(1), 17–40 (1976)

  20. Glowinski, R.: Numerical Methods for Nonlinear Variational Problems. Springer Series in Computational Physics. Springer, New York (1984)

  21. Glowinski, R., Marroco, A.: On the approximation by finite elements of order one, and resolution by penalisation-duality, of a class of nonlinear Dirichlet problems. ESAIM Math. Model. Numer. Anal. 9(R2), 41–76 (1975)

  22. He, B., Yuan, X.: On the \(o(1/n)\) convergence rate of the Douglas–Rachford alternating direction method. SIAM J. Numer. Anal. 50(2), 700–709 (2012)

  23. Hestenes, M.R.: Multiplier and gradient methods. J. Optim. Theory Appl. 4(5), 303–320 (1969)

  24. Hong, M., Luo, Z.Q., Razaviyayn, M.: Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM J. Optim. 26(1), 337–364 (2016)

  25. Hu, Y., Chi, E., Allen, G.I.: ADMM algorithmic regularization paths for sparse statistical machine learning. In: Glowinski, R., Osher, S., Yin, W. (eds.) Splitting Methods in Communication, Imaging, Science and Engineering. Springer, New York (2016)

  26. Ivanov, M., Zlateva, N.: Abstract subdifferential calculus and semi-convex functions. Serdica Math. J. 23(1), 35p–58p (1997)

  27. Iutzeler, F., Bianchi, P., Ciblat, P., Hachem, W.: Asynchronous distributed optimization using a randomized alternating direction method of multipliers. In: 2013 IEEE 52nd Annual Conference On Decision and Control (CDC), pp. 3671–3676. IEEE (2013)

  28. Jiang, B., Ma, S., Zhang, S.: Alternating direction method of multipliers for real and complex polynomial optimization models. Optimization 63(6), 883–898 (2014)

  29. Knopp, K.: Infinite Sequences and Series. Courier Corporation, Chelmsford (1956)

  30. Kryštof, V., Zajíček, L.: Differences of two semiconvex functions on the real line. Preprint (2015)

  31. Lai, R., Osher, S.: A splitting method for orthogonality constrained problems. J. Sci. Comput. 58(2), 431–449 (2014)

  32. Li, G., Pong, T.K.: Global convergence of splitting methods for nonconvex composite optimization. SIAM J. Optim. 25(4), 2434–2460 (2015)

  33. Li, R.C., Stewart, G.: A new relative perturbation theorem for singular subspaces. Linear Algebra Appl. 313(1), 41–51 (2000)

  34. Liavas, A.P., Sidiropoulos, N.D.: Parallel algorithms for constrained tensor factorization via the alternating direction method of multipliers. IEEE Trans. Signal Process. 63(20), 5450–5463 (2015)

  35. Łojasiewicz, S.: Sur la géométrie semi-et sous-analytique. Ann. Inst. Fourier (Grenoble) 43(5), 1575–1595 (1993)

  36. Lu, Z., Zhang, Y.: An augmented Lagrangian approach for sparse principal component analysis. Math. Program. 135(1–2), 149–193 (2012)

  37. Magnússon, S., Weeraddana, P.C., Rabbat, M.G., Fischione, C.: On the convergence of alternating direction Lagrangian methods for nonconvex structured optimization problems. IEEE Trans. Control Netw. Syst. 3(3), 296–309 (2015)

  38. Mifflin, R.: Semismooth and semiconvex functions in constrained optimization. SIAM J. Control Optim. 15(6), 959–972 (1977)

  39. Miksik, O., Vineet, V., Pérez, P., Torr, P.H., Cesson Sévigné, F.: Distributed non-convex ADMM-inference in large-scale random fields. In: British Machine Vision Conference. BMVC (2014)

  40. Möllenhoff, T., Strekalovskiy, E., Moeller, M., Cremers, D.: The primal-dual hybrid gradient method for semiconvex splittings. SIAM J. Imaging Sci. 8(2), 827–857 (2015)

  41. Oymak, S., Mohan, K., Fazel, M., Hassibi, B.: A simplified approach to recovery conditions for low rank matrices. In: 2011 IEEE International Symposium on Information Theory Proceedings (ISIT), pp. 2318–2322. IEEE (2011)

  42. Peng, Z., Xu, Y., Yan, M., Yin, W.: ARock: an algorithmic framework for asynchronous parallel coordinate updates. SIAM J. Sci. Comput. 38(5), A2851–A2879 (2016)

  43. Poliquin, R., Rockafellar, R.: Prox-regular functions in variational analysis. Trans. Am. Math. Soc. 348(5), 1805–1838 (1996)

  44. Powell, M.J.: A method for non-linear constraints in minimization problems. UKAEA (1967)

  45. Rockafellar, R.T., Wets, R.J.B.: Variational Analysis. Springer Science & Business Media (2009)

  46. Rosenberg, J., et al.: Applications of analysis on Lipschitz manifolds. In: Proceedings of Miniconferences on Harmonic Analysis and Operator Algebras (Canberra, 1987), Proceedings Centre for Mathematical Analysis, vol. 16, pp. 269–283 (1988)

  47. Shen, Y., Wen, Z., Zhang, Y.: Augmented Lagrangian alternating direction method for matrix separation based on low-rank factorization. Optim. Methods Softw. 29(2), 239–263 (2014)

  48. Slavakis, K., Giannakis, G., Mateos, G.: Modeling and optimization for big data analytics: (statistical) learning tools for our era of data deluge. IEEE Sig. Process. Mag. 31(5), 18–31 (2014)

  49. Sun, D.L., Fevotte, C.: Alternating direction method of multipliers for non-negative matrix factorization with the beta-divergence. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6201–6205. IEEE (2014)

  50. Sun, R., Luo, Z.-Q., Ye, Y.: On the expected convergence of randomly permuted ADMM. arXiv preprint arXiv:1503.06387 (2015)

  51. Wang, F., Cao, W., Xu, Z.: Convergence of multi-block Bregman ADMM for nonconvex composite problems. arXiv preprint arXiv:1505.03063 (2015)

  52. Wang, F., Xu, Z., Xu, H.K.: Convergence of Bregman alternating direction method with multipliers for nonconvex composite problems. arXiv preprint arXiv:1410.8625 (2014)

  53. Wang, X., Hong, M., Ma, S., Luo, Z.Q.: Solving multiple-block separable convex minimization problems using two-block alternating direction method of multipliers. arXiv preprint arXiv:1308.5294 (2013)

  54. Wang, Y., Zeng, J., Peng, Z., Chang, X., Xu, Z.: Linear convergence of adaptively iterative thresholding algorithm for compressed sensing. IEEE Trans. Signal Process. 63(11), 2957–2971 (2015)

  55. Watson, G.A.: Characterization of the subdifferential of some matrix norms. Linear Algebra Appl. 170, 33–45 (1992)

  56. Wen, Z., Peng, X., Liu, X., Sun, X., Bai, X.: Asset allocation under the Basel Accord risk measures. arXiv preprint arXiv:1308.1321 (2013)

  57. Wen, Z., Yang, C., Liu, X., Marchesini, S.: Alternating direction methods for classical and ptychographic phase retrieval. Inverse Prob. 28(11), 115010 (2012)

  58. Wen, Z., Yin, W.: A feasible method for optimization with orthogonality constraints. Math. Program. 142(1–2), 397–434 (2013)

  59. Wikipedia: Schatten norm—Wikipedia, the free encyclopedia (2015). (Online; Accessed 18 Oct 2015)

  60. Xu, Y., Yin, W.: A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sci. 6(3), 1758–1789 (2013)

  61. Xu, Y., Yin, W., Wen, Z., Zhang, Y.: An alternating direction algorithm for matrix completion with nonnegative factors. Front. Math. China 7(2), 365–384 (2012)

  62. Yan, M., Yin, W.: Self equivalence of the alternating direction method of multipliers. In: Glowinski, R., Osher, S., Yin, W. (eds.) Splitting Methods in Communication, Imaging, Science and Engineering, pp. 165–194. Springer, New York (2016)

  63. Yang, L., Pong, T.K., Chen, X.: Alternating direction method of multipliers for nonconvex background/foreground extraction. SIAM J. Imaging Sci. 10(1), 74–110 (2017)

  64. You, S., Peng, Q.: A non-convex alternating direction method of multipliers heuristic for optimal power flow. In: 2014 IEEE International Conference on Smart Grid Communications (SmartGridComm), pp. 788–793. IEEE (2014)

  65. Zeng, J., Lin, S., Xu, Z.: Sparse regularization: convergence of iterative jumping thresholding algorithm. IEEE Trans. Signal Process. 64(19), 5106–5117 (2016)

  66. Zeng, J., Peng, Z., Lin, S.: A Gauss–Seidel iterative thresholding algorithm for \(\ell_q\) regularized least squares regression. J. Comput. Appl. Math. 319, 220–235 (2017)

  67. Zeng, J., Lin, S., Wang, Y., Xu, Z.: \(L_{1/2}\) regularization: convergence of iterative half thresholding algorithm. IEEE Trans. Signal Process. 62(9), 2317–2329 (2014)

Acknowledgements

We would like to thank Drs. Wei Shi, Ting Kei Pong, and Qing Ling for their insightful comments, and Drs. Xin Liu and Yangyang Xu for helpful discussions. We thank the three anonymous reviewers for their review and helpful comments.

Author information

Corresponding author

Correspondence to Jinshan Zeng.

Additional information

The work of W. Yin was supported in part by NSF Grants DMS-1720237 and ECCS-1462397, and ONR Grant N00014171216. The work of J. Zeng was supported in part by the NSFC Grants (61603162, 11501440, 61772246, 61603163) and the doctoral start-up foundation of Jiangxi Normal University.

Appendix

Proof of Proposition 1

The fact that convex functions and \(C^1\) regular functions are prox-regular has been proved in the literature; see, e.g., [43]. Here we only prove the second part of the proposition.

(1): For functions \(r( x) = \sum _{i} |x_i|^q\) where \(0< q < 1\), the set of general subgradient of \(r(\cdot )\) is

$$\begin{aligned} \partial r(x) = \left\{ d=[d_1;\ldots ;d_n]\left| d_i = q\cdot \mathrm {sign}(x_i)|x_i|^{q-1} \text { if } x_i \ne 0 \text {; } d_i\in \mathbb {R}\text { if } x_i = 0\right. \right\} . \end{aligned}$$

For any two positive constants \(C>0\) and \(M>1\), take \(\gamma = \max \left\{ \frac{4({n}C^q+MC)}{c^2},q(1-q)c^{q-2}\right\} \), where \(c \triangleq \frac{1}{3}(\frac{q}{M})^{\frac{1}{1-q}}\). The exclusion set \(S_{M}\) contains the set \(\{ x|\min _{x_i\ne 0} |x_i|\le 3c\}\). For any point \(z\in \mathbb {B}(0,C)\setminus S_{M}\) and \(y\in \mathbb {B}(0,C)\), if \(\Vert z-y\Vert \le c\), then \(\mathrm {supp}(z) \subset \mathrm {supp}(y)\) and \(\Vert z\Vert _0 \le \Vert y\Vert _0\), where \(\mathbb {B}(0,C) \triangleq \{x| \Vert x\Vert <C\}\), \(\mathrm {supp}(z)\) denotes the index set of all nonzero elements of \(z\), and \(\Vert z\Vert _0\) denotes the cardinality of \(\mathrm {supp}(z)\). Define

$$\begin{aligned} y'_{i} = \left\{ \begin{array}{cc} y_{i} &{} \quad i\in \mathrm {supp}(z)\\ 0 &{} \quad i\not \in \mathrm {supp}(z) \end{array}\right. ,~i=1,\ldots , n. \end{aligned}$$

Then for any \(d\in \partial r(z)\), the following chain of inequalities holds:

$$\begin{aligned} \Vert y\Vert _q^q-\Vert z\Vert _q^q - \big<d,y-z\big> {\mathop {\ge }\limits ^{(a)}}\,&\Vert y'\Vert _q^q-\Vert z\Vert _q^q - \big <d,y'-z\big >\nonumber \\ {\mathop {\ge }\limits ^{(b)}}&- \frac{q(1-q)}{2}c^{q-2}\Vert z-y'\Vert ^2\nonumber \\ {\mathop {\ge }\limits ^{(c)}}&- \frac{q(1-q)}{2}c^{q-2}\Vert z-y\Vert ^2, \end{aligned}$$
(51)

where (a) holds because \(\Vert y\Vert _q^q = \Vert y'\Vert _q^q + \Vert y-y'\Vert _q^q\) by the definition of \(y'\), (b) holds because \(r(x)\) is twice differentiable along the line segment connecting \(z\) and \(y'\), with second-order derivative no bigger than \(q(1-q)c^{q-2}\), and (c) holds because \(\Vert z-y\Vert \ge \Vert z-y'\Vert \). If instead \(\Vert z-y\Vert > c\), then for any \(d\in \partial r(z)\), we have

$$\begin{aligned} \Vert y\Vert _q^q-\Vert z\Vert _q^q-\big <d,y-z\big >\ge -(2nC^q + 2MC) \ge -\frac{2nC^q+2MC}{c^2}\Vert y-z\Vert ^2. \end{aligned}$$
(52)

Combining (51) and (52) yields the result.
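
As a numerical sanity check (not part of the proof), one can sample points outside the exclusion set and verify that the bounds in (51) and (52) hold for \(r(x)=\sum _i |x_i|^q\); the values of C, M, n, and q below are arbitrary test choices.

    import numpy as np

    q, n, C, M = 0.5, 5, 10.0, 2.0
    c = (q / M) ** (1.0 / (1.0 - q)) / 3.0
    gamma = max(4 * (n * C**q + M * C) / c**2, q * (1 - q) * c ** (q - 2))
    r = lambda x: np.sum(np.abs(x) ** q)

    rng = np.random.default_rng(0)
    for _ in range(1000):
        # z has all entries of magnitude at least 3c, so z lies outside S_M
        z = rng.uniform(3 * c, 1.0, n) * rng.choice([-1, 1], n)
        y = z + 0.5 * c * rng.uniform(-1.0, 1.0, n)   # y close to z
        d = q * np.sign(z) * np.abs(z) ** (q - 1)     # subgradient from the formula above
        lhs = r(y) - r(z) - d @ (y - z)
        assert lhs >= -0.5 * gamma * np.sum((y - z) ** 2) - 1e-12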

(2): We now verify that the Schatten-\(q\) quasi-norm \({\Vert \cdot \Vert }_q\) is restricted prox-regular. Without loss of generality, suppose \(A\in \mathbb {R}^{n\times n}\) is a square matrix.

Suppose the singular value decomposition (SVD) of A is

$$\begin{aligned} A = U\varSigma V^T = [U_1,U_2]\cdot \left[ \begin{array}{cc} \varSigma _1 &{} 0\\ 0 &{} 0 \end{array}\right] \cdot \left[ \begin{array}{c} V_1^T\\ V_2^T \end{array}\right] , \end{aligned}$$
(53)

where \(U,V\in \mathbb {R}^{n\times n}\) are orthogonal matrices, and \(\varSigma _1\in \mathbb {R}^{K\times K}\) is diagonal whose diagonal elements are \(\sigma _i(A)\), \(i=1,\ldots ,K\). Then the general subgradient of \({\Vert A \Vert }_q^q\) [55] is

$$\begin{aligned} \partial {\Vert A \Vert }_q^q = U_1DV_1^T + \{U_2\varGamma V_2^T\big | \varGamma \text { is an arbitrary matrix}\}, \end{aligned}$$

where \(D\in \mathbb {R}^{K\times K}\) is a diagonal matrix whose ith diagonal element is \( d_i = q\sigma _i(A)^{q-1}\).

Now we are going to prove that \({\Vert \cdot \Vert }_q^q\) is restricted prox-regular, i.e., for any positive parameters \(M, P>0\), there exists \(\gamma >0\) such that for any \({\Vert B \Vert }_F<P\), \({\Vert A \Vert }_F<P\), \(A\not \in S_M = \{A| \forall ~X\in \partial {\Vert A \Vert }_q^q,~{\Vert X \Vert }_F>M\}\), and \(T = U_{1} D V_{1}^T + U_{2}\varGamma V_{2}^T\in \partial {\Vert A \Vert }_q^q\) with \({\Vert T \Vert }_F\le M\), it holds that

$$\begin{aligned} {\Vert B \Vert }_q^q - {\Vert A \Vert }_q^q - \big <T,B - A\big > \ge -\frac{\gamma }{2}{\Vert A-B \Vert }^2_F. \end{aligned}$$
(54)

Let \(\epsilon _0 = \frac{1}{3}(M/q)^{1/(q - 1)}\). If \({\Vert B - A \Vert } > \epsilon _0\), we have

$$\begin{aligned}&{\Vert B \Vert }_q^q - {\Vert A \Vert }_q^q - \big <T,B - A\big > \ge - n^2P^q - M\cdot \Vert B - A\Vert _F\nonumber \\&\quad \ge -(M\epsilon _0^{-1}+n^2P^q\epsilon _0^{-2}){\Vert A-B \Vert }_F^2. \end{aligned}$$
(55)

If \({\Vert B-A \Vert }_F<\epsilon _0\), consider the decomposition of \(B = U_B \varSigma ^B V_B^T = B_1 + B_2\) where \(B_1 = U_B \varSigma ^B_1 V_B^T\), \(\varSigma ^B_1\) is the diagonal matrix preserving elements of \(\varSigma ^B\) bigger than \(\frac{1}{3}(M/q)^{1/(q - 1)}\), and \(B_2 = U_B \varSigma ^B_2 V_B^T\) where \(\varSigma ^B_2 = \varSigma ^B - \varSigma ^B_1\).

Define the set \(S' \triangleq \{T\in {\mathbb {R}}^{n \times n}|{\Vert T \Vert }_F \le P,~ \min _{\sigma _i>0} \sigma _i(T) \ge \epsilon _0\}\). Let us prove that \(A, B_1\in S'\). If \(\min _{\sigma _i >0} \sigma _i(A) < (M/q)^{1/(q - 1)}\), then for any \(X\in \partial {\Vert A \Vert }_q^q\), \(X = U_1DV_1^T + U_2\varGamma V_2^T\) and

$$\begin{aligned} {\Vert X \Vert }_F \ge {\Vert U_1DV_1^T \Vert }_F \ge q\big (\min _{\sigma _i>0} \sigma _i\big )^{q-1} \ge M, \end{aligned}$$

which contradicts the fact that \(A\not \in S_M\). As for \(B_1\), since \({\Vert A - B \Vert }_F\le \epsilon _0\) and \(\min _{\sigma _i >0} \sigma _i(A) \ge (M/q)^{1/(q - 1)}\), applying Weyl's inequalities gives \(B_1\in S'\).

Define the function \(F:S'\subset \mathbb {R}^{n\times n}\rightarrow \mathbb {R}^{n\times n}\) by \(F(A) = U_1 D V_1^T\) for \(A = U_1\varSigma V_1^T\), where

$$\begin{aligned} D = \mathrm {diag}(q\sigma _1(A)^{q-1},\ldots ,q\sigma _n(A)^{q-1}) \end{aligned}$$

with the convention \(0^{q-1} = 0\). Based on [18, Theorem 4.1] and the compactness of \(S'\), \(F\) is Lipschitz continuous on \(S'\), i.e., there exists \(L>0\) such that for any two matrices \(A, B\in S'\), \({\Vert F(A) - F(B) \Vert }_F\le L{\Vert A - B \Vert }_F\). This implies

$$\begin{aligned} {\Vert B_1 \Vert }_q^q - {\Vert A \Vert }_q^q - \big <U_1DV_1^T,B_1 - A\big > \ge -\frac{L}{2}{\Vert B_1 - A \Vert }_F^2. \end{aligned}$$
(56)

In addition, because \({\Vert U_{2}^TU_B \Vert }_F< {\Vert B_1 - A \Vert }_F/\epsilon _0\) and \({\Vert V_{2}^TV_B \Vert }_F < {\Vert B_1 - A \Vert }_F/\epsilon _0\) (see [33]),

$$\begin{aligned} \big<U_{2}\varGamma V_{2}^T,B_1 - A\big>= \big <\varGamma ,U_{2}^TU_B\varSigma _BV_B^TV_{2}\big > \ge - \frac{M^2}{\epsilon _0^2} {\Vert B_1 - A \Vert }_F^2. \end{aligned}$$
(57)

Furthermore, \({\Vert B_2 \Vert }_q^q - \big <T,B_2\big > \ge 0\) and \({\Vert B_1 - A \Vert }_F \le {\Vert B - A \Vert }_F+{\Vert B - B_1 \Vert }_F\le 2{\Vert B - A \Vert }_F\); combining these with (56) and (57), we have

$$\begin{aligned} {\Vert B \Vert }_q^q - {\Vert A \Vert }_q^q - \big< T,B - A\big>&= {\Vert B_1 \Vert }_q^q - {\Vert A \Vert }_q^q - \big<T,B_1 - A\big> + {\Vert B_2 \Vert }_q^q - \big <T,B_2\big >\nonumber \\&\ge -\left( \frac{L}{2} + \frac{M^2}{\epsilon _0^2}\right) {\Vert B_1 - A \Vert }_F^2 \ge -\left( 2L + \frac{4M^2}{\epsilon _0^2}\right) {\Vert B - A \Vert }_F^2. \end{aligned}$$
(58)

Combining (55) and (58), we finally prove (54) with appropriate \(\gamma \).
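
For a numerical illustration (again not part of the proof), the Schatten-\(q\) quasi-norm and the smooth part \(U_1DV_1^T\) of its subgradient can be formed from the SVD as in (53); the matrix size, the perturbation scale, and \(q\) below are arbitrary.

    import numpy as np

    def schatten_q(A, q):
        return np.sum(np.linalg.svd(A, compute_uv=False) ** q)

    def subgrad_smooth_part(A, q, tol=1e-12):
        U, s, Vt = np.linalg.svd(A)
        d = np.where(s > tol, q * np.maximum(s, tol) ** (q - 1.0), 0.0)  # convention 0^{q-1} = 0
        return (U * d) @ Vt                                              # U_1 D V_1^T

    q = 0.5
    rng = np.random.default_rng(0)
    A = rng.standard_normal((4, 4))
    T = subgrad_smooth_part(A, q)
    ratios = []
    for _ in range(200):
        B = A + 1e-3 * rng.standard_normal((4, 4))
        lhs = schatten_q(B, q) - schatten_q(A, q) - np.sum(T * (B - A))
        ratios.append(lhs / np.sum((B - A) ** 2))
    print(min(ratios))   # bounded below, consistent with (54)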

(3): We need to show that the indicator function \(\iota _S\) of a p-dimensional compact \(C^2\) manifold S is restricted prox-regular. First of all, by definition, the exclusion set \(S_M\) of \(\iota _S\) is empty for any \(M>0\). Since S is compact and \(C^2\), there exist finitely many \(C^2\) homeomorphisms \(h_\eta : \mathbb {R}^{p} \mapsto \mathbb {R}^n\), \(\eta \in \{1,\ldots , m\}\), and \(\delta >0\) such that for any \(x\in S\), there exist an \(\eta \) and an \(\alpha _x\) satisfying \(x = h_\eta (\alpha _x)\in S\). Furthermore, for any \(y\in S\) with \(\Vert y - x\Vert \le \delta \), we can find an \(\alpha _y\) satisfying \(y = h_\eta (\alpha _y)\).

Note that \(\partial \iota _{S}(x)= \mathrm {Im}(J_{h_\eta }(x))^\perp \), where \(J_{h_\eta }\) is the Jacobian of \(h_\eta \). For any \(d\in \partial \iota _S(x)\), \(\Vert d\Vert \le M\) and \(\Vert x-y\Vert \le \delta \),

$$\begin{aligned} \iota _S(y) - \iota _S(x) - \big<d,y - x\big> \nonumber =&-\big<d,h_\eta (\alpha _y) - h_\eta (\alpha _x)\big> \\ \nonumber =&-\big <d,h_\eta (\alpha _y) - h_\eta (\alpha _x) - J_{h_\eta }(\alpha _y - \alpha _x)\big > \\ \nonumber \ge&-\Vert d\Vert \cdot \gamma \Vert \alpha _y - \alpha _x\Vert ^2 \\ \ge&-M \gamma C^2 \Vert x - y\Vert ^2, \end{aligned}$$
(59)

where \(\gamma \) and C are the Lipschitz constants of \(\nabla h_\eta \) and \( h^{-1}_\eta \), respectively. For any \(\Vert y-x\Vert \ge \delta \),

$$\begin{aligned} \iota _S(y) - \iota _S(x) - \big<d,y - x\big> \nonumber =&- \big <d,y - x\big > \\ \nonumber \ge&- \Vert d\Vert \cdot \Vert y - x\Vert \\ \ge&-\frac{M}{\delta }\Vert y - x\Vert ^2, \end{aligned}$$
(60)

where M is the maximum of \(\Vert d\Vert \) over \(\partial \iota _S(x)\). Combining (59) and (60) shows that \(\iota _{S}\) is restricted prox-regular. \(\square \)
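
As a concrete instance of (3) (an illustration, not taken from the paper), let S be the unit sphere \(\{x\in \mathbb {R}^n : \Vert x\Vert = 1\}\). Then \(\partial \iota _S(x) = \{\lambda x : \lambda \in \mathbb {R}\}\), and for any \(x, y\in S\) and \(d = \lambda x\) with \(\Vert d\Vert \le M\),

$$\begin{aligned} \iota _S(y) - \iota _S(x) - \big <d,y - x\big > = -\lambda \big <x,y - x\big > = \frac{\lambda }{2}\Vert y - x\Vert ^2 \ge -\frac{M}{2}\Vert y - x\Vert ^2, \end{aligned}$$

because \(\big <x,y - x\big > = \big <x,y\big > - 1 = -\frac{1}{2}\Vert y - x\Vert ^2\); hence the restricted prox-regularity inequality holds with \(\gamma = M\).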

Proof

(Lemma 1) By the definitions of H in A3(a) and \(y^k\), we have \(y^k = H(By^k)\). Therefore, \(\Vert y^{k_1} - y^{k_2}\Vert =\Vert H(By^{k_1}) - H(By^{k_2})\Vert \le {\bar{M}} \Vert By^{k_1} - By^{k_2}\Vert .\) Similarly, by the optimality of \(x^k_i\), we have \(x^k_i = F_i(A_ix_i^k)\). Therefore, \(\Vert x^{k_1}_i - x_i^{k_2}\Vert =\Vert F_i(A_ix_i^{k_1}) - F_i(A_ix_i^{k_2})\Vert \le {\bar{M}} \Vert A_ix_i^{k_1} - A_ix_i^{k_2}\Vert .\) \(\square \)

Proof

(Lemma 2) Let us first show that the y-subproblem is well defined. To begin with, we will show that h(y) is lower bounded by a quadratic function of By:

$$\begin{aligned} h(y) \ge h(H(0)) - \left( {\bar{M}}\Vert \nabla h(H(0))\Vert \right) \cdot \Vert By\Vert -\frac{L_h{\bar{M}}^2}{2} \Vert By\Vert ^2. \end{aligned}$$

By A3, we know h(y) is lower bounded by h(H(By)):

$$\begin{aligned} h(y)\ge h(H(By)). \end{aligned}$$

Because of A5 and A3, h(H(By)) is lower bounded by a quadratic function of By:

$$\begin{aligned} h(H(By))&\ge h(H(0)) + \big <\nabla h(H(0)),H(By) - H(0)\big > -\frac{L_h}{2} \Vert H(By) - H(0)\Vert ^2 \end{aligned}$$
(61)
$$\begin{aligned}&\ge h(H(0)) - \Vert \nabla h(H(0))\Vert \cdot {\bar{M}}\cdot \Vert By\Vert -\frac{L_h{\bar{M}}^2}{2} \Vert By\Vert ^2 \end{aligned}$$
(62)

Therefore h(y) is also bounded by the quadratic function:

$$\begin{aligned} h(y) \ge h(H(0)) - \Vert \nabla h(H(0))\Vert \cdot {\bar{M}}\cdot \Vert By\Vert -\frac{L_h{\bar{M}}^2}{2} \Vert By\Vert ^2. \end{aligned}$$

Recall that the \(y\)-subproblem minimizes the augmented Lagrangian with respect to \(y\); neglecting constants, it is equivalent to minimizing

$$\begin{aligned} P(y) := h(y) + \big <w^{k} + \beta {\mathbf {A}}{\mathbf {x}}^+, By\big > + \frac{\beta }{2}\Vert By\Vert ^2. \end{aligned}$$
(63)

Because h(y) is lower bounded by \(-\frac{L_h{\bar{M}}^2}{2}\Vert By\Vert ^2\) up to a linear term and a constant in \(\Vert By\Vert \), when \(\beta > L_h{\bar{M}}\) we have \(P(y)\rightarrow \infty \) as \(\Vert By\Vert \rightarrow \infty \); that is, the \(y\)-subproblem is coercive with respect to \(By\). Because P(y) is lower semi-continuous and \({{\mathrm{argmin}}}h(y) \ \text {s.t.} \ By = u\) has a unique solution for each u, a minimizer of P(y) exists and the \(y\)-subproblem is well defined.

As for the \(x_i\)-subproblem, \(i = 0,\ldots , p\), ignoring the constants yields

$$\begin{aligned} {{\mathrm{argmin}}}\ {\mathcal L}_\beta (x^{+}_{<i},x_i,x^{k}_{>i},y^k,w^k)&={{\mathrm{argmin}}}\ f(x^{+}_{<i},x_i,x^k_{>i}) \nonumber \\&\quad + \frac{\beta }{2}\Vert \frac{1}{\beta }w^k + A_{<i}x^+_{<i} + A_{>i}x^k_{>i} + A_ix_i + By^k\Vert ^2 \end{aligned}$$
(64)
$$\begin{aligned}&={{\mathrm{argmin}}}\ f(x^{+}_{<i},x_i,x^k_{>i}) + h(u) - h(u) \nonumber \\&\quad + \frac{\beta }{2}\Vert Bu-By^k-\frac{1}{\beta }w^k\Vert ^2. \end{aligned}$$
(65)

where \(u = H(-A_{<i}x^+_{<i} - A_{>i}x^k_{>i} - A_ix_i)\). The first two terms are coercive because \(A_{<i}x^+_{<i} + A_{>i}x^k_{>i} + A_ix_i + Bu = 0\) and A1 holds. The third and fourth terms are lower bounded because h is Lipschitz differentiable. Because the objective is lower semi-continuous, all the subproblems are well defined. \(\square \)
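
To make the update order concrete, the following is a schematic sketch (not the authors' code) of the ADMM analyzed in this paper for minimizing \(\phi (x_0,\ldots ,x_p,y)\) subject to \(\sum _i A_ix_i + By = 0\); the solvers solve_x and solve_y are hypothetical oracles assumed to return minimizers of the augmented Lagrangian over one block with the others fixed.

    def admm(x, y, w, A, B, beta, solve_x, solve_y, iters=100):
        # x: list of blocks x_0,...,x_p; A: list of matrices A_0,...,A_p (e.g., numpy arrays)
        for _ in range(iters):
            for i in range(len(x)):
                rest = sum(A[j] @ x[j] for j in range(len(x)) if j != i) + B @ y
                x[i] = solve_x[i](rest, w, beta)       # x_i <- argmin of L_beta over x_i
            rest = sum(A[j] @ x[j] for j in range(len(x)))
            y = solve_y(rest, w, beta)                 # y <- argmin of L_beta over y
            w = w + beta * (rest + B @ y)              # dual update w <- w + beta*(Ax + By)
        return x, y, w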

Proof

(Proposition 1) Define the augmented Lagrangian function to be

$$\begin{aligned} {\mathcal L}_{\beta }(x,y,w) = x^2 - y^2 + w(x-y) + \frac{\beta }{2} \Vert x - y\Vert ^2. \end{aligned}$$

It is clear that when \(\beta =0\), \({\mathcal L}_{\beta }\) is not lower bounded for any w. We are going to show that for any \(\beta >2\), the duality gap is not zero.

$$\begin{aligned} \inf _{x\in [-1,1],y\in \mathbb {R}}\sup _{w\in \mathbb {R}} {\mathcal L}_{\beta }(x,y,w) > \sup _{w\in \mathbb {R}}\inf _{x\in [-1,1],y\in \mathbb {R}} {\mathcal L}_{\beta }(x,y,w). \end{aligned}$$

On one hand, because \(\sup _{w\in \mathbb {R}} {\mathcal L}_{\beta }(x,y,w) = +\infty \) when \(x\ne y\) and \(\sup _{w\in \mathbb {R}} {\mathcal L}_{\beta }(x,y,w) = 0\) when \(x=y\), we have

$$\begin{aligned} \inf _{x\in [-1,1],y\in \mathbb {R}}\sup _{w\in \mathbb {R}} {\mathcal L}_{\beta }(x,y,w) = 0. \end{aligned}$$

On the other hand, let \(t=x-y\),

$$\begin{aligned}&\sup _{w\in \mathbb {R}}\inf _{x\in [-1,1],y\in \mathbb {R}} {\mathcal L}_{\beta }(x,y,w) = \sup _{w\in \mathbb {R}}\inf _{x\in [-1,1],t\in \mathbb {R}} t(2x-t)+ wt+\frac{\beta }{2} t^2 \nonumber \\&\quad = \sup _{w\in \mathbb {R}}\inf _{x\in [-1,1],t\in \mathbb {R}} (w+2x)t+\frac{\beta -2}{2} t^2 \end{aligned}$$
(66)
$$\begin{aligned}&\quad = \sup _{w\in \mathbb {R}}\inf _{x\in [-1,1]} -\frac{(w+2x)^2}{2(\beta -2)}= \sup _{w\in \mathbb {R}} -\frac{\max \{(w-2)^2,(w+2)^2\}}{2(\beta -2)}= -\frac{2}{\beta - 2}. \end{aligned}$$
(67)

This shows the duality gap is not zero (but it goes to 0 as \(\beta \) tends to \(\infty \)).
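
As a quick numerical check of (66)–(67) (illustration only; the grid sizes and \(\beta \) are arbitrary), one can evaluate the inner minimum on a grid and compare the resulting dual value with \(-2/(\beta -2)\):

    import numpy as np

    beta = 6.0
    # inner minimum over x in [-1,1] and t = x - y on a grid, for w = 0
    xs = np.linspace(-1.0, 1.0, 401)
    ts = np.linspace(-5.0, 5.0, 4001)
    X, T = np.meshgrid(xs, ts)
    L0 = T * (2 * X - T) + beta / 2 * T**2
    print(L0.min())                              # approximately -0.5

    # outer supremum over w, using the closed-form inner minimum from (67)
    ws = np.linspace(-10.0, 10.0, 2001)
    dual = (-np.maximum((ws - 2) ** 2, (ws + 2) ** 2) / (2 * (beta - 2))).max()
    print(dual, -2 / (beta - 2))                 # both approximately -0.5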

Then let us show that ALM does not converge if \(\beta ^k\) is bounded, i.e., there exists \(\beta >0\) such that \(\beta ^k\le \beta \) for any \(k\in {\mathbb {N}}\). Without loss of generality, we assume that \(\beta ^k\) equals the constant \(\beta \) for all \(k\in {\mathbb {N}}\); this does not affect the proof. ALM consists of two steps:

  1. \((x^{k+1},y^{k+1}) = \text {argmin}_{x,y} {\mathcal L}_{\beta }(x,y,w^k),\)

  2. \(w^{k+1} = w^k + \tau (x^{k+1} - y^{k+1}).\)

Since \((x^{k+1} - y^{k+1})\in \partial \psi (w^k)\) where \(\psi (w) = \inf _{x,y} {\mathcal L}_{\beta }(x,y,w)\), and we already know

$$\begin{aligned} \inf _{x,y} {\mathcal L}_{\beta }(x,y,w) = -\frac{\max ((w-2)^2,(w+2)^2)}{2(\beta -2)}, \end{aligned}$$

we have

$$\begin{aligned} w^{k+1} = \left\{ \begin{array}{cc} (1-\frac{\tau }{\beta -2}) w^k - \frac{2\tau }{\beta -2} &{} \text { if } w^{k} \ge 0\\ (1-\frac{\tau }{\beta -2}) w^k + \frac{2\tau }{\beta - 2} &{} \text { if } w^{k} \le 0 \end{array} \right. . \end{aligned}$$

Note that when \(w^k = 0\), the optimization problem \(\inf _{x,y} {\mathcal L}_{\beta }(x,y,0)\) has two distinct minimizers, which lead to two different values of \(x^{k+1}-y^{k+1}\). This shows that, no matter how small \(\tau \) is, \(w^k\) oscillates around 0 and never converges.
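
Iterating this recursion numerically (illustration only; \(\beta \), \(\tau \), and the starting point below are arbitrary) makes the oscillation visible:

    beta, tau = 6.0, 0.1
    w, hist = 0.5, []
    for k in range(200):
        if w >= 0:
            w = (1 - tau / (beta - 2)) * w - 2 * tau / (beta - 2)
        else:
            w = (1 - tau / (beta - 2)) * w + 2 * tau / (beta - 2)
        hist.append(w)
    print(hist[-4:])   # signs alternate: w^k oscillates around 0 and does not converge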

However, although the duality gap is not zero, ADMM still converges in this case. There are two ways to prove it. The first way is to check all the conditions in Theorem 1. Another way is to check the iterates directly. The ADMM iterates are

$$\begin{aligned} x^{k+1}&= \max \left( -1,\min (1,\frac{\beta }{\beta + 2}(y^k - \frac{w^k}{\beta }))\right) , \quad y^{k+1} = \frac{\beta }{\beta - 2}\big (x^{k+1}+\frac{w^k}{\beta }\big ),\nonumber \\ w^{k+1}&= w^k + \beta (x^{k+1} - y^{k+1}). \end{aligned}$$
(68)

The second and third equalities together show that \(w^{k} = -2y^k\) for all \(k\ge 1\); substituting this into the first two equalities, we obtain

$$\begin{aligned} x^{k+1} = \max \{-1,\min \{1,y^k\}\},\quad y^{k+1} = \frac{1}{\beta - 2}\left( \beta x^{k+1} - 2y^k\right) . \end{aligned}$$
(69)

Here \(|y^{k+1}| \le \frac{\beta }{\beta -2} + \frac{2}{\beta -2}|y^k|\). Thus, after finitely many iterations, \(|y^{k}| \le 2\) (assuming \(\beta >4\)). If \(|y^k| \le 1\), the ADMM sequence obviously converges. If \(|y^k| > 1\), we may assume without loss of generality that \(1<y^k<2\). Then \(x^{k+1} = 1\), which implies \(0<y^{k+1}<1\), so the ADMM sequence converges. Hence, for any initial point \(y^0\) and \(w^0\), ADMM converges. \(\square \)
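
Running the reduced recursion (69) (illustration only; \(\beta >4\) and the starting point \(y^0\) are arbitrary) confirms the convergence:

    beta, y = 6.0, 10.0
    for k in range(100):
        x = max(-1.0, min(1.0, y))             # x^{k+1} from (69)
        y = (beta * x - 2 * y) / (beta - 2)    # y^{k+1} from (69)
    print(x, y)                                # the iterates settle at x = y in [-1, 1]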

Proof

(Theorem 2) Similar to the proof of Theorem 1, we only need to verify P1–P4 in Proposition 2.

Proof of P2: Similar to Lemmas 4 and 5, we have

$$\begin{aligned}&{\mathcal L}_\beta ({\mathbf {x}}^{k},y^k,w^k)-{\mathcal L}_\beta ({\mathbf {x}}^{k+1},y^{k+1},w^{k+1}) \nonumber \\&\quad \ge -\frac{1}{\beta }\Vert w^{k} - w^{k+1}\Vert ^2 + \sum _{i=1}^p\frac{\beta - L_\phi {\bar{M}}}{2}\Vert A_ix_i^k-A_ix_i^{k+1}\Vert ^2 \nonumber \\&\quad + \frac{\beta - L_\phi {\bar{M}}}{2}\Vert By^k - By^{k+1}\Vert ^2. \end{aligned}$$
(70)

Since \(B^Tw^k = - \partial _y \phi ({\mathbf {x}}^k,y^k)\) for any \(k\in {\mathbb {N}}\), we have

$$\begin{aligned} \Vert w^k - w^{k+1}\Vert \le C_1L_\phi \left( \sum _{i=0}^p \Vert x_i^k - x_i^{k+1}\Vert + \Vert y^k - y^{k+1}\Vert \right) , \end{aligned}$$

where \(C_1 = \sigma _{\min }(B)\), with \(\sigma _{\min }(B)\) denoting the smallest positive singular value of B, and \(L_\phi \) is the Lipschitz constant of \(\phi \). Therefore, we have

$$\begin{aligned}&{\mathcal L}_\beta ({\mathbf {x}}^{k},y^k,w^k)-{\mathcal L}_\beta ({\mathbf {x}}^{k+1},y^{k+1},w^{k+1}) \nonumber \\&\quad \ge \left( \frac{\beta - L_\phi {\bar{M}}}{2} - \frac{C_1L_\phi {\bar{M}}}{\beta }\right) \sum _{i=0}^p\Vert A_ix_i^k-A_ix_i^{k+1}\Vert ^2\nonumber \\&\qquad + \left( \frac{\beta - L_\phi {\bar{M}}}{2}-\frac{C_1L_\phi {\bar{M}}}{\beta }\right) \Vert By^k - By^{k+1}\Vert ^2. \end{aligned}$$
(71)

When \(\beta > \max \{1,L_\phi {\bar{M}} + 2C_1L_\phi {\bar{M}}\}\), P2 holds.

Proof of P1: First of all, we have already shown \({\mathcal L}_\beta ({\mathbf {x}}^{k},y^k,w^k)\ge {\mathcal L}_\beta ({\mathbf {x}}^{k+1},y^{k+1},w^{k+1})\), which means \({\mathcal L}_\beta ({\mathbf {x}}^{k},y^k,w^k)\) is monotonically nonincreasing. There exists \(y'\) such that \({\mathbf {A}}{\mathbf {x}}^k + By' = 0\) and \(y' = H(By')\). In order to show \({\mathcal L}_\beta ({\mathbf {x}}^k,y^k,w^k)\) is lower bounded, we apply A1–A3 to get

$$\begin{aligned}&{\mathcal L}_\beta ({\mathbf {x}}^{k},y^k,w^k)=\phi ({\mathbf {x}}^{k},y^k)+\big<w^{k}, \sum _{i=0}^p A_ix^{k}_i + By^k\big>+ \frac{\beta }{2}\Vert \sum _{i=0}^p A_ix^{k}_i + By^k\Vert ^2\nonumber \\&\quad = \phi ({\mathbf {x}}^{k},y^k)+\big <d_y^k,y' - y^k\big>+ \frac{\beta }{2}\Vert By^k - By'\Vert ^2 \ge \phi ({\mathbf {x}}^{k},y')\nonumber \\&\qquad +\frac{\beta }{4}\Vert \sum _{i=0}^p A_ix^{k}_i + By^k\Vert ^2 > -\infty , \end{aligned}$$
(72)

for some \(d_y^k \in \partial _y \phi ({\mathbf {x}}^{k},y^k)\). This shows that \(\mathcal{L}_{\beta }({\mathbf {x}}^{k},y^k,w^k)\) is lower bounded. Reading (72) in the other direction, we observe that

$$\begin{aligned} \phi ({\mathbf {x}}^k,y')+\frac{\beta }{4}\Vert \sum _{i=0}^p A_ix^{k}_i + By^k\Vert ^2 \end{aligned}$$

is upper bounded by \({\mathcal L}_\beta ({\mathbf {x}}^0,y^0,w^0)\). Then A1 ensures that \(\{{\mathbf {x}}^k,y^k\}\) is bounded. Therefore, \(w^k\) is bounded too.

Proof of P3, P4: These follow directly from the Lipschitz differentiability of \(\phi \), so we omit the details.

\(\square \)

Cite this article

Wang, Y., Yin, W. & Zeng, J. Global Convergence of ADMM in Nonconvex Nonsmooth Optimization. J Sci Comput 78, 29–63 (2019). https://doi.org/10.1007/s10915-018-0757-z
