
A Globally Convergent Algorithm for Nonconvex Optimization Based on Block Coordinate Update


Abstract

Nonconvex optimization arises in many areas of computational science and engineering. However, most nonconvex optimization algorithms are only known to have local convergence or subsequence convergence properties. In this paper, we propose an algorithm for nonconvex optimization and establish its global convergence (of the whole sequence) to a critical point. In addition, we give its asymptotic convergence rate and numerically demonstrate its efficiency. In our algorithm, the variables of the underlying problem are either treated as one block or multiple disjoint blocks. It is assumed that each non-differentiable component of the objective function, or each constraint, applies only to one block of variables. The differentiable components of the objective function, however, can involve multiple blocks of variables together. Our algorithm updates one block of variables at a time by minimizing a certain prox-linear surrogate, along with an extrapolation to accelerate its convergence. The order of update can be either deterministically cyclic or randomly shuffled for each cycle. In fact, our convergence analysis only needs that each block be updated at least once in every fixed number of iterations. We show its global convergence (of the whole sequence) to a critical point under fairly loose conditions including, in particular, the Kurdyka–Łojasiewicz condition, which is satisfied by a broad class of nonconvex/nonsmooth applications. These results, of course, remain valid when the underlying problem is convex. We apply our convergence results to the coordinate descent iteration for non-convex regularized linear regression, as well as a modified rank-one residue iteration for nonnegative matrix factorization. We show that both applications have global convergence. Numerically, we tested our algorithm on nonnegative matrix and tensor factorization problems, where random shuffling clearly improves the chance to avoid low-quality local solutions.
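
The block update just described — a prox-linear (proximal gradient) step on one block at a time, with extrapolation and a deterministically cyclic or randomly shuffled order — can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the block gradients `grad_f`, proximal maps `prox_r`, Lipschitz estimates `L`, the fixed extrapolation weight `omega`, the step size \(1/L_i\), and the stopping rule are all simplifying placeholders to be supplied by the user.

```python
import random

def block_prox_linear(x, grad_f, prox_r, L, n_cycles=100, omega=0.5, shuffle=True):
    """Minimal sketch of a block prox-linear method with extrapolation.

    x      : list of blocks (e.g., NumPy arrays)
    grad_f : grad_f(blocks, i) -> partial gradient of the smooth term w.r.t. block i
    prox_r : prox_r(i, v, alpha) -> argmin_u r_i(u) + (1/(2*alpha)) * ||u - v||^2
    L      : L(blocks, i) -> Lipschitz constant of the block-i partial gradient
    omega  : fixed extrapolation weight (0 disables extrapolation)
    """
    s = len(x)
    x_prev = [xi.copy() for xi in x]        # value of each block before its previous update
    for _ in range(n_cycles):
        order = list(range(s))
        if shuffle:                          # reshuffle the update order once per cycle
            random.shuffle(order)
        for i in order:
            Li = L(x, i)
            x_hat = x[i] + omega * (x[i] - x_prev[i])       # extrapolated point
            x_prev[i] = x[i].copy()
            g = grad_f(x[:i] + [x_hat] + x[i+1:], i)        # partial gradient at x_hat
            x[i] = prox_r(i, x_hat - g / Li, 1.0 / Li)      # prox-linear step on block i
    return x
```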


Notes

  1. A function f is proximable if it is easy to obtain the minimizer of \(f(x)+\frac{1}{2\gamma }\Vert x-y\Vert ^2\) for any input y and \(\gamma >0\) (see the soft-thresholding example following these notes).

  2. A function F on \(\mathbb {R}^n\) is differentiable at point \({\mathbf {x}}\) if there exists a vector \({\mathbf {g}}\) such that \(\lim _{{\mathbf {h}}\rightarrow 0}\frac{|F({\mathbf {x}}+{\mathbf {h}})-F({\mathbf {x}})-{\mathbf {g}}^\top {\mathbf {h}}|}{\Vert {\mathbf {h}}\Vert }=0\).

  3. Note that from Remark 2, for convex problems, we can take a larger extrapolation weight but require it to be uniformly less than one. Hence, although our algorithm framework includes FISTA as a special case, our whole-sequence convergence result does not imply that of FISTA.

  4. Another restarting option, based on gradient information, was also tested.

  5. It is stated in [14] that the sequence generated by (42) converges to a coordinate-wise minimizer of (38). However, the result is obtained directly from [55], which only guarantees subsequence convergence.
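
To illustrate the notion of "proximable" in footnote 1: for the \(\ell _1\) norm \(f(x)=\lambda \Vert x\Vert _1\), the minimizer of \(f(x)+\frac{1}{2\gamma }\Vert x-y\Vert ^2\) is given componentwise by soft-thresholding. A minimal sketch (not from the paper; `lam` and the sample input are placeholders):

```python
import numpy as np

def prox_l1(y, gamma, lam=1.0):
    """Minimizer of lam*||x||_1 + (1/(2*gamma))*||x - y||^2 (componentwise soft-thresholding)."""
    return np.sign(y) * np.maximum(np.abs(y) - gamma * lam, 0.0)

# prox_l1(np.array([3.0, -0.2, 1.0]), gamma=0.5)  ->  array([2.5, 0. , 0.5])
```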

References

  1. Aharon, M., Elad, M., Bruckstein, A.: K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54(11), 4311–4322 (2006)

  2. Allen, G.: Sparse higher-order principal components analysis. In: International Conference on Artificial Intelligence and Statistics, pp. 27–36. (2012)

  3. Attouch, H., Bolte, J.: On the convergence of the proximal algorithm for nonsmooth functions involving analytic features. Math. Program. 116(1), 5–16 (2009)

  4. Attouch, H., Bolte, J., Redont, P., Soubeyran, A.: Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka–Lojasiewicz inequality. Math. Oper. Res. 35(2), 438–457 (2010)

  5. Attouch, H., Bolte, J., Svaiter, B.F.: Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods. Math. Program. 137(1–2), 91–129 (2013)

  6. Bagirov, A.M., Jin, L., Karmitsa, N., Al Nuaimat, A., Sultanova, N.: Subgradient method for nonconvex nonsmooth optimization. J. Optim. Theory Appl. 157(2), 416–435 (2013)

  7. Beck, A., Teboulle, M.: A fast iterative shrinkage–thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  8. Beck, A., Tetruashvili, L.: On the convergence of block coordinate descent type methods. SIAM J. Optim. 23(4), 2037–2060 (2013)

  9. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1999)

  10. Blumensath, T., Davies, M.E.: Iterative hard thresholding for compressed sensing. Appl. Comput. Harmon. Anal. 27(3), 265–274 (2009)

  11. Bolte, J., Daniilidis, A., Lewis, A.: The Lojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM J. Optim. 17(4), 1205–1223 (2007)

  12. Bolte, J., Daniilidis, A., Ley, O., Mazet, L.: Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Trans. Am. Math. Soc. 362(6), 3319–3363 (2010)

  13. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)

  14. Breheny, P., Huang, J.: Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 5(1), 232–253 (2011)

  15. Burke, J.V., Lewis, A.S., Overton, M.L.: A robust gradient sampling algorithm for nonsmooth, nonconvex optimization. SIAM J. Optim. 15(3), 751–779 (2005)

  16. Chang, K.W., Hsieh, C.J., Lin, C.J.: Coordinate descent method for large-scale l2-loss linear support vector machines. J. Mach. Learn. Res. 9, 1369–1398 (2008)

  17. Chartrand, R., Yin, W.: Iteratively reweighted algorithms for compressive sensing. In: IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. ICASSP 2008, pp. 3869–3872. IEEE (2008)

  18. Chen, X.: Smoothing methods for nonsmooth, nonconvex minimization. Math. Program. 134(1), 71–99 (2012)

  19. Donoho, D., Stodden, V.: When does non-negative matrix factorization give a correct decomposition into parts? In: Advances in Neural Information Processing Systems, vol. 16. (2003)

  20. Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)

  21. Fuduli, A., Gaudioso, M., Giallombardo, G.: Minimizing nonconvex nonsmooth functions via cutting planes and proximity control. SIAM J. Optim. 14(3), 743–756 (2004)

  22. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1), 59–99 (2016)

  23. Grippo, L., Sciandrone, M.: Globally convergent block-coordinate techniques for unconstrained optimization. Optim. Methods Softw. 10(4), 587–637 (1999)

  24. Hildreth, C.: A quadratic programming procedure. Naval Res. Logist. Q. 4(1), 79–85 (1957)

  25. Ho, N., Van Dooren, P., Blondel, V.: Descent methods for nonnegative matrix factorization. In: Numerical Linear Algebra in Signals, Systems and Control, pp. 251–293. Springer, Netherlands (2011)

  26. Hong, M., Wang, X., Razaviyayn, M., Luo, Z.Q.: Iteration complexity analysis of block coordinate descent methods. arXiv preprint arXiv:1310.6957 (2013)

  27. Hoyer, P.: Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 5, 1457–1469 (2004)

  28. Kim, J., He, Y., Park, H.: Algorithms for nonnegative matrix and tensor factorizations: a unified view based on block coordinate descent framework. J. Global Optim. 58(2), 285–319 (2014)

  29. Kolda, T., Bader, B.: Tensor decompositions and applications. SIAM Rev. 51(3), 455 (2009)

  30. Kruger, A.Y.: On Fréchet subdifferentials. J. Math. Sci. 116(3), 3325–3358 (2003)

  31. Kurdyka, K.: On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier. 48(3), 769–783 (1998)

  32. Lai, M.J., Xu, Y., Yin, W.: Improved iteratively reweighted least squares for unconstrained smoothed \(\ell _q\) minimization. SIAM J. Numer. Anal. 51(2), 927–957 (2013)

  33. Lee, D., Seung, H.: Learning the parts of objects by non-negative matrix factorization. Nature 401(6755), 788–791 (1999)

  34. Li, G., Pong, T.K.: Calculus of the exponent of Kurdyka–Lojasiewicz inequality and its applications to linear convergence of first-order methods. arXiv preprint arXiv:1602.02915 (2016)

  35. Ling, Q., Xu, Y., Yin, W., Wen, Z.: Decentralized low-rank matrix completion. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012, pp. 2925–2928. IEEE (2012)

  36. Łojasiewicz, S.: Sur la géométrie semi- et sous-analytique. Ann. Inst. Fourier (Grenoble) 43(5), 1575–1595 (1993)

  37. Lu, Z., Xiao, L.: Randomized block coordinate non-monotone gradient method for a class of nonlinear programming. arXiv preprint arXiv:1306.5918 (2013)

  38. Lu, Z., Xiao, L.: On the complexity analysis of randomized block-coordinate descent methods. Math. Program. 152(1–2), 615–642 (2015)

  39. Luo, Z.Q., Tseng, P.: On the convergence of the coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl. 72(1), 7–35 (1992)

  40. Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online dictionary learning for sparse coding. In: Proceedings of the 26th Annual International Conference on Machine Learning, pp. 689–696. ACM (2009)

  41. Mohan, K., Fazel, M.: Iterative reweighted algorithms for matrix rank minimization. J. Mach. Learn. Res. 13(1), 3441–3473 (2012)

  42. Natarajan, B.K.: Sparse approximate solutions to linear systems. SIAM J. Comput. 24(2), 227–234 (1995)

  43. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

  44. Nesterov, Y.: Introductory lectures on convex optimization: a basic course, vol. 87. Springer Science & Business Media, Berlin (2013)

  45. Nocedal, J., Wright, S.J.: Numerical Optimization, Springer Series in Operations Research and Financial Engineering., 2nd edn. Springer, New York (2006)

  46. O’Donoghue, B., Candes, E.: Adaptive restart for accelerated gradient schemes. Found. Comput. Math. 15(3), 715–732 (2013)

  47. Paatero, P., Tapper, U.: Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5(2), 111–126 (1994)

  48. Peng, Z., Wu, T., Xu, Y., Yan, M., Yin, W.: Coordinate friendly structures, algorithms and applications. Ann. Math. Sci. Appl. 1(1), 57–119 (2016)

  49. Razaviyayn, M., Hong, M., Luo, Z.Q.: A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM J. Optim. 23(2), 1126–1153 (2013)

  50. Recht, B., Fazel, M., Parrilo, P.: Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 52(3), 471–501 (2010)

  51. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1), 1–38 (2014)

  52. Rockafellar, R., Wets, R.: Variational Analysis, vol. 317. Springer, Berlin (2009)

  53. Saha, A., Tewari, A.: On the nonasymptotic convergence of cyclic coordinate descent methods. SIAM J. Optim. 23(1), 576–601 (2013)

  54. Shi, H.J.M., Tu, S., Xu, Y., Yin, W.: A primer on coordinate descent algorithms. arXiv preprint arXiv:1610.00040 (2016)

  55. Tseng, P.: Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001)

  56. Tseng, P., Yun, S.: A coordinate gradient descent method for nonsmooth separable minimization. Math. Program. 117(1), 387–423 (2009)

  57. Welling, M., Weber, M.: Positive tensor factorization. Pattern Recogn. Lett. 22(12), 1255–1261 (2001)

  58. Wen, Z., Yin, W., Zhang, Y.: Solving a low-rank factorization model for matrix completion by a nonlinear successive over-relaxation algorithm. Math. Program. Comput. 4(4), 333–361 (2012)

  59. Xu, Y.: Alternating proximal gradient method for sparse nonnegative tucker decomposition. Math. Program. Comput. 7(1), 39–70 (2015)

  60. Xu, Y., Akrotirianakis, I., Chakraborty, A.: Proximal gradient method for huberized support vector machine. Pattern Anal. Appl. 19(4), 989–1005 (2016)

  61. Xu, Y., Yin, W.: A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM J. Imaging Sci. 6(3), 1758–1789 (2013)

  62. Xu, Y., Yin, W.: A fast patch-dictionary method for whole image recovery. Inverse Probl. Imaging 10(2), 563–583 (2016)

  63. Zhang, C.H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)


Acknowledgements

Funding was provided in part by National Science Foundation (Grant No. DMS-1317602 and EECS-1462397) and Office of Naval Research (Grant No. N000141712162).

Corresponding author

Correspondence to Yangyang Xu.


Appendices

Appendix 1: Proofs of Key Lemmas

In this section, we give the proofs of the lemmas and propositions used in the paper.

1.1 Proof of Lemma 1

We prove the general case where \(\alpha _k=\frac{1}{\gamma L_k},\forall k\), and \(\tilde{\omega }_i^j\le \frac{\delta (\gamma -1)}{2(\gamma +1)}\sqrt{\tilde{L}_i^{j-1}/\tilde{L}_i^{j}},\,\forall i,j\). Assume \(b_k=i\). From the Lipschitz continuity of \(\nabla _{{\mathbf {x}}_i}f({\mathbf {x}}_{\ne i}^{k-1},{\mathbf {x}}_i)\) with respect to \({\mathbf {x}}_i\), it holds that (e.g., see Lemma 2.1 in [61])

$$\begin{aligned} f({\mathbf {x}}^{k})\le f({\mathbf {x}}^{k-1})+\langle \nabla _{{\mathbf {x}}_i}f({\mathbf {x}}^{k-1}),{\mathbf {x}}_i^{k}-{\mathbf {x}}_i^{k-1}\rangle +\frac{L_k}{2}\Vert {\mathbf {x}}_i^{k}-{\mathbf {x}}_i^{k-1}\Vert ^2. \end{aligned}$$
(53)

Since \({\mathbf {x}}_i^{k}\) is the minimizer of (2), we have

$$\begin{aligned}&\langle \nabla _{{\mathbf {x}}_i}f({\mathbf {x}}_{\ne i}^{k-1},\hat{{\mathbf {x}}}_i^k),{\mathbf {x}}_i^{k}-\hat{{\mathbf {x}}}_i^k\rangle +\frac{1}{2\alpha _k}\Vert {\mathbf {x}}_i^{k}-\hat{{\mathbf {x}}}_i^k\Vert ^2+ r_i({\mathbf {x}}_i^{k})\nonumber \\&\qquad \le \langle \nabla _{{\mathbf {x}}_i}f({\mathbf {x}}_{\ne i}^{k-1},\hat{{\mathbf {x}}}_i^k),{\mathbf {x}}_i^{k-1}-\hat{{\mathbf {x}}}_i^k\rangle + \frac{1}{2\alpha _k}\Vert {\mathbf {x}}_i^{k-1}-\hat{{\mathbf {x}}}_i^k\Vert ^2+r_i({\mathbf {x}}_i^{k-1}). \end{aligned}$$
(54)

Summing (53) and (54) and noting that \({\mathbf {x}}_j^{k}={\mathbf {x}}_j^{k-1},\forall j\ne i\), we have

$$\begin{aligned}&F({\mathbf {x}}^{k-1})-F({\mathbf {x}}^{k})\\&\quad = f({\mathbf {x}}^{k-1})+r_i({\mathbf {x}}_i^{k-1})-f({\mathbf {x}}^k)-r_i({\mathbf {x}}_i^k)\\&\quad \ge \langle \nabla _{{\mathbf {x}}_i}f({\mathbf {x}}_{\ne i}^{k-1},\hat{{\mathbf {x}}}_i^k)-\nabla _{{\mathbf {x}}_i}f({\mathbf {x}}^{k-1}),{\mathbf {x}}_i^{k}-{{\mathbf {x}}}_i^{k-1}\rangle +\frac{1}{2\alpha _k}\Vert {\mathbf {x}}_i^{k}-\hat{{\mathbf {x}}}_i^k\Vert ^2\\&\qquad -\frac{1}{2\alpha _k}\Vert {\mathbf {x}}_i^{k-1}-\hat{{\mathbf {x}}}_i^k\Vert ^2-\frac{L_k}{2}\Vert {\mathbf {x}}_i^{k}-{\mathbf {x}}_i^{k-1}\Vert ^2\\&\quad = \langle \nabla _{{\mathbf {x}}_i}f({\mathbf {x}}_{\ne i}^{k-1},\hat{{\mathbf {x}}}_i^k)-\nabla _{{\mathbf {x}}_i}f({\mathbf {x}}^{k-1}),{\mathbf {x}}_i^{k}-{{\mathbf {x}}}_i^{k-1}\rangle +\frac{1}{\alpha _k}\langle {\mathbf {x}}_i^{k}-{\mathbf {x}}_i^{k-1},{\mathbf {x}}_i^{k-1}-\hat{{\mathbf {x}}}_i^k\rangle \\&\qquad +\left( \frac{1}{2\alpha _k}-\frac{L_k}{2}\right) \Vert {\mathbf {x}}_i^{k}-{\mathbf {x}}_i^{k-1}\Vert ^2\\&\quad \ge -\Vert {\mathbf {x}}_i^{k}-{\mathbf {x}}_i^{k-1}\Vert \left( \Vert \nabla _{{\mathbf {x}}_i}f({\mathbf {x}}_{\ne i}^{k-1},\hat{{\mathbf {x}}}_i^k)-\nabla _{{\mathbf {x}}_i}f({\mathbf {x}}^{k-1})\Vert +\frac{1}{\alpha _k}\Vert {\mathbf {x}}_i^{k-1}-\hat{{\mathbf {x}}}_i^k\Vert \right) \\&\qquad +\left( \frac{1}{2\alpha _k}-\frac{L_k}{2}\right) \Vert {\mathbf {x}}_i^{k}-{\mathbf {x}}_i^{k-1}\Vert ^2\\&\quad \ge -\left( \frac{1}{\alpha _k}+L_k\right) \Vert {\mathbf {x}}_i^{k}-{\mathbf {x}}_i^{k-1}\Vert \cdot \Vert {\mathbf {x}}_i^{k-1}-\hat{{\mathbf {x}}}_i^k\Vert +\left( \frac{1}{2\alpha _k}-\frac{L_k}{2}\right) \Vert {\mathbf {x}}_i^{k}-{\mathbf {x}}_i^{k-1}\Vert ^2\\&\qquad \quad \overset{(6)}{=} -\left( \frac{1}{\alpha _k}+L_k\right) \omega _k\Vert {\mathbf {x}}_i^{k}-{\mathbf {x}}_i^{k-1}\Vert \cdot \Vert {\mathbf {x}}_i^{k-1}-\tilde{{\mathbf {x}}}_i^{d_i^{k-1}-1}\Vert +\left( \frac{1}{2\alpha _k}-\frac{L_k}{2}\right) \Vert {\mathbf {x}}_i^{k}-{\mathbf {x}}_i^{k-1}\Vert ^2\\&\quad \ge \frac{1}{4}\left( \frac{1}{\alpha _k}-L_k\right) \Vert {\mathbf {x}}_i^{k}-{\mathbf {x}}_i^{k-1}\Vert ^2-\frac{\left( 1/\alpha _k+L_k\right) ^2}{1/\alpha _k-L_k}\omega _k^2\Vert {\mathbf {x}}_i^{k-1}-\tilde{{\mathbf {x}}}_i^{d_i^{k-1}-1}\Vert ^2\\&\quad =\frac{\left( \gamma -1\right) L_k}{4}\Vert {\mathbf {x}}_i^{k}-{\mathbf {x}}_i^{k-1}\Vert ^2-\frac{\left( \gamma +1\right) ^2}{\gamma -1}L_k\omega _k^2\Vert {\mathbf {x}}_i^{k-1}-\tilde{{\mathbf {x}}}_i^{d_i^{k-1}-1}\Vert ^2. \end{aligned}$$

Here, we have used the Cauchy–Schwarz inequality in the second inequality, the Lipschitz continuity of \(\nabla _{{\mathbf {x}}_i}f({\mathbf {x}}_{\ne i}^{k-1},{\mathbf {x}}_i)\) in the third one, Young's inequality in the fourth one, the fact \({\mathbf {x}}_i^{k-1}=\tilde{{\mathbf {x}}}_i^{d_i^k-1}\) to obtain the third equality, and \(\alpha _k=\frac{1}{\gamma L_k}\) to obtain the last equality. Substituting \(\tilde{\omega }_i^j\le \frac{\delta (\gamma -1)}{2(\gamma +1)}\sqrt{\tilde{L}_i^{j-1}/\tilde{L}_i^{j}}\) and recalling (8) completes the proof.
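
As a quick numerical sanity check of the final inequality in the chain above (an aside, not part of the proof), the following sketch instantiates a single-block case \(f({\mathbf {x}})=\frac{1}{2}\Vert A{\mathbf {x}}-{\mathbf {b}}\Vert ^2\), \(r=\lambda \Vert \cdot \Vert _1\), with the exact Lipschitz constant, \(\alpha _k=\frac{1}{\gamma L_k}\), and a small extrapolation weight, and verifies the claimed lower bound on \(F({\mathbf {x}}^{k-1})-F({\mathbf {x}}^{k})\); the problem data and parameter values are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, lam, gamma, omega = 30, 20, 0.1, 1.5, 0.1        # placeholder data and parameters
A, b = rng.standard_normal((m, n)), rng.standard_normal(m)
L = np.linalg.norm(A.T @ A, 2)                          # Lipschitz constant of grad f
alpha = 1.0 / (gamma * L)

F = lambda x: 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

x_km2, x_km1 = rng.standard_normal(n), rng.standard_normal(n)       # x^{k-2}, x^{k-1}
x_hat = x_km1 + omega * (x_km1 - x_km2)                             # extrapolated point (6)
x_k = soft(x_hat - alpha * A.T @ (A @ x_hat - b), alpha * lam)      # prox-linear step (2)

lhs = F(x_km1) - F(x_k)
rhs = (gamma - 1) * L / 4 * np.sum((x_k - x_km1) ** 2) \
      - (gamma + 1) ** 2 / (gamma - 1) * L * omega ** 2 * np.sum((x_km1 - x_km2) ** 2)
assert lhs >= rhs - 1e-10      # the inequality derived above holds for this step
```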

1.2 Proof of the Claim in Remark 2

Assume \(b_k=i\) and \(\alpha _k=\frac{1}{L_k}\). When f is block multi-convex and \(r_i\) is convex, from Lemma 2.1 of [61], it follows that

$$\begin{aligned}&F({\mathbf {x}}^{k-1})-F({\mathbf {x}}^k)\\&\qquad \ge \frac{L_k}{2}\Vert {\mathbf {x}}_i^k-\hat{{\mathbf {x}}}_i^k\Vert ^2+L_k\langle \hat{{\mathbf {x}}}_i^k-{\mathbf {x}}_i^{k-1},{\mathbf {x}}_i^k-\hat{{\mathbf {x}}}_i^k\rangle \\&\qquad \quad \overset{(6)}{=} \frac{L_k}{2}\Vert {\mathbf {x}}_i^k-{{\mathbf {x}}}_i^{k-1}-\omega _k({{\mathbf {x}}}_i^{k-1}-{{\mathbf {x}}}_i^{d_i^{k-1}-1})\Vert ^2\\&\qquad +L_k\omega _k\left\langle {{\mathbf {x}}}_i^{k-1}-{{\mathbf {x}}}_i^{d_i^{k-1}-1}, {\mathbf {x}}_i^k-{{\mathbf {x}}}_i^{k-1}-\omega _k\left( {{\mathbf {x}}}_i^{k-1}-{{\mathbf {x}}}_i^{d_i^{k-1}-1}\right) \right\rangle \\&\quad = \frac{L_k}{2}\Vert {\mathbf {x}}_i^k-{{\mathbf {x}}}_i^{k-1}\Vert ^2-\frac{L_k\omega _k^2}{2}\Vert {\mathbf {x}}_i^{k-1}-{{\mathbf {x}}}_i^{d_i^{k-1}-1}\Vert ^2. \end{aligned}$$

Hence, if \(\omega _k\le \delta \sqrt{\tilde{L}_i^{j-1}/\tilde{L}_i^j}\), we have the desired result.

1.3 Proof of Proposition 1

Summing (14) over k from 1 to K gives

$$\begin{aligned} F(\mathbf{x}^0)-F(\mathbf{x}^K)\ge&\ \sum _{i=1}^s\sum _{k=1}^K\sum _{j=d_i^{k-1}+1}^{d_i^k}\left( \frac{\tilde{L}_i^j}{4}\Vert \tilde{\mathbf{x}}_i^{j-1}-\tilde{\mathbf{x}}_i^j\Vert ^2-\frac{\tilde{L}_i^{j-1}\delta ^2}{4}\Vert \tilde{\mathbf{x}}_i^{j-2}-\tilde{\mathbf{x}}_i^{j-1}\Vert ^2\right) \\ =&\ \sum _{i=1}^s\sum _{j=1}^{d_i^{K}}\left( \frac{\tilde{L}_i^j}{4}\Vert \tilde{\mathbf{x}}_i^{j-1}-\tilde{\mathbf{x}}_i^j\Vert ^2-\frac{\tilde{L}_i^{j-1}\delta ^2}{4}\Vert \tilde{\mathbf{x}}_i^{j-2}-\tilde{\mathbf{x}}_i^{j-1}\Vert ^2\right) \\ \ge&\ \sum _{i=1}^s\sum _{j=1}^{d_i^K}\frac{\tilde{L}_i^j(1-\delta ^2)}{4}\Vert \tilde{\mathbf{x}}_i^{j-1}-\tilde{\mathbf{x}}_i^j\Vert ^2\\ \ge&\ \sum _{i=1}^s\sum _{j=1}^{d_i^K}\frac{\ell (1-\delta ^2)}{4}\Vert \tilde{\mathbf{x}}_i^{j-1}-\tilde{\mathbf{x}}_i^j\Vert ^2, \end{aligned}$$

where we have used the fact \(d_i^0=0,\forall i\) in the equality, \(\tilde{\mathbf{x}}_i^{-1}=\tilde{\mathbf{x}}_i^0,\forall i\) to obtain the second inequality, and \(\tilde{L}_i^j\ge \ell , \forall i,j\) in the last inequality. Letting \(K\rightarrow \infty \) and noting \(d_i^K\rightarrow \infty \) for all i by Assumption 3, we conclude from the above inequality and the lower boundedness of F in Assumption 1 that

$$\begin{aligned} \sum _{i=1}^s\sum _{j=1}^\infty \Vert \tilde{{\mathbf {x}}}_i^{j-1}-\tilde{{\mathbf {x}}}_i^j\Vert ^2<\infty , \end{aligned}$$

which implies (15).

1.4 Proof of Proposition 2

From Corollary 5.20 and Example 5.23 of [52], we have that if \({\mathbf {prox}}_{\alpha _kr_i}\) is single valued near \({\mathbf {x}}_i^{k-1}-\alpha _k\nabla _{{\mathbf {x}}_i}f({\mathbf {x}}^{k-1})\), then \({\mathbf {prox}}_{\alpha _kr_i}\) is continuous at \({\mathbf {x}}_i^{k-1}-\alpha _k\nabla _{{\mathbf {x}}_i}f({\mathbf {x}}^{k-1})\). Let \(\hat{{\mathbf {x}}}^k_i(\omega )\) explicitly denote the extrapolated point with weight \(\omega \), namely, we take \(\hat{{\mathbf {x}}}^k_i(\omega _k)\) in (6). In addition, let \({\mathbf {x}}^k_i(\omega )={\mathbf {prox}}_{\alpha _kr_i}\big (\hat{{\mathbf {x}}}_i^k(\omega )-\alpha _k\nabla _{{\mathbf {x}}_i}f(\mathbf{x}_{\ne i}^{k-1},\hat{\mathbf{x}}_i^k(\omega ))\big )\). Note that (14) implies

$$\begin{aligned} F({\mathbf {x}}^{k-1})-F({\mathbf {x}}^k(0))\ge \Vert {\mathbf {x}}^{k-1}-{\mathbf {x}}^k(0)\Vert ^2\overset{(19)}{>}0. \end{aligned}$$
(55)

From the optimality of \({\mathbf {x}}^k_i(\omega )\), it holds that

$$\begin{aligned}&\langle \nabla _{{\mathbf {x}}_i} f(\mathbf{x}_{\ne i}^{k-1},\hat{\mathbf{x}}_i^k(\omega )), \mathbf{x}_i^k(\omega )-\hat{\mathbf{x}}_i^k(\omega )\rangle +\frac{1}{2\alpha _k}\Vert \mathbf{x}_i^k(\omega )-\hat{\mathbf{x}}_i^k(\omega )\Vert ^2+r_i(\mathbf{x}_i^k(\omega ))\\&\qquad \le \langle \nabla _{{\mathbf {x}}_i} f(\mathbf{x}_{\ne i}^{k-1},\hat{\mathbf{x}}_i^k(\omega )), {\mathbf{x}}_i-\hat{\mathbf{x}}_i^k(\omega )\rangle +\frac{1}{2\alpha _k}\Vert {\mathbf{x}}_i-\hat{\mathbf{x}}_i^k(\omega )\Vert ^2+r_i({\mathbf{x}}_i),\,\forall {\mathbf {x}}_i. \end{aligned}$$

Letting \(\omega \rightarrow 0^+\) and taking the limit superior on both sides of the above inequality, we have

$$\begin{aligned}&\langle \nabla _{{\mathbf {x}}_i} f(\mathbf{x}^{k-1}), \mathbf{x}_i^k(0)-{\mathbf{x}}_i^{k-1}\rangle +\frac{1}{2\alpha _k}\Vert \mathbf{x}_i^k(0)-{\mathbf{x}}_i^{k-1}\Vert ^2+\limsup _{\omega \rightarrow 0^+}r_i(\mathbf{x}_i^k(\omega ))\\&\qquad \le \langle \nabla _{{\mathbf {x}}_i} f(\mathbf{x}^{k-1}), {\mathbf{x}}_i-{\mathbf{x}}_i^{k-1}\rangle +\frac{1}{2\alpha _k}\Vert {\mathbf{x}}_i-{\mathbf{x}}_i^{k-1}\Vert ^2+r_i({\mathbf{x}}_i),\,\forall {\mathbf {x}}_i, \end{aligned}$$

which implies \(\underset{\omega \rightarrow 0^+}{\limsup }\, r_i(\mathbf{x}_i^k(\omega ))\le r_i(\mathbf{x}_i^k(0))\). Since \(r_i\) is lower semicontinuous, \(\underset{\omega \rightarrow 0^+}{\liminf }\, r_i(\mathbf{x}_i^k(\omega ))\ge r_i(\mathbf{x}_i^k(0))\). Hence, \(\underset{\omega \rightarrow 0^+}{\lim }r_i(\mathbf{x}_i^k(\omega ))= r_i(\mathbf{x}_i^k(0))\), and thus \(\underset{\omega \rightarrow 0^+}{\lim }F(\mathbf{x}^k(\omega ))= F(\mathbf{x}^k(0))\). Together with (55), we conclude that there exists \(\bar{\omega }_k>0\) such that \(F({\mathbf {x}}^{k-1})-F(\mathbf{x}^k(\omega ))\ge 0,\,\forall \omega \in [0,\bar{\omega }_k]\). This completes the proof.

1.5 Proof of Lemma 2

Let \({\mathbf {a}}_m\) and \({\mathbf {u}}_m\) be the vectors with their ith entries

$$\begin{aligned} ({\mathbf {a}}_m)_i=\sqrt{\alpha _{i,n_{i,m}}},\quad ({\mathbf {u}}_m)_i=A_{i,n_{i,m}}. \end{aligned}$$

Then (21) can be written as

$$\begin{aligned}&\Vert {\mathbf {a}}_{m+1}\odot {\mathbf {u}}_{m+1}\Vert ^2+(1-\beta ^2)\sum _{i=1}^s\sum _{j=n_{i,m}+1}^{n_{i,m+1}-1}\alpha _{i,j}A_{i,j}^2\nonumber \\&\quad \le \beta ^2\Vert {\mathbf {a}}_{m}\odot {\mathbf {u}}_{m}\Vert ^2+B_m\sum _{i=1}^s\sum _{j=n_{i,m-1}+1}^{n_{i,m}}A_{i,j}. \end{aligned}$$
(56)

Recall

$$\begin{aligned} \underline{\alpha }=\inf _{i,j}\alpha _{i,j},\quad \overline{\alpha }=\sup _{i,j}\alpha _{i,j}. \end{aligned}$$

Then it follows from (56) that

$$\begin{aligned} \Vert {\mathbf {a}}_{m+1}\odot {\mathbf {u}}_{m+1}\Vert ^2+\underline{\alpha }(1-\beta ^2)\sum _{i=1}^s\sum _{j=n_{i,m}+1}^{n_{i,m+1}-1}A_{i,j}^2 \le \beta ^2\Vert {\mathbf {a}}_{m}\odot {\mathbf {u}}_{m}\Vert ^2+B_m\sum _{i=1}^s\sum _{j=n_{i,m-1}+1}^{n_{i,m}}A_{i,j}. \end{aligned}$$
(57)

By the Cauchy–Schwarz inequality and noting \(n_{i,m+1}-n_{i,m}\le N,\forall i,m\), we have

$$\begin{aligned} \left( \sum _{i=1}^s\sum _{j=n_{i,m}+1}^{n_{i,m+1}-1}A_{i,j}\right) ^2\le sN\sum _{i=1}^s\sum _{j=n_{i,m}+1}^{n_{i,m+1}-1}A_{i,j}^2 \end{aligned}$$
(58)

and for any positive \(C_1\),

$$\begin{aligned}&(1+\beta )C_1\Vert {\mathbf {a}}_{m+1}\odot {\mathbf {u}}_{m+1}\Vert \left( \sum _{i=1}^s\sum _{j=n_{i,m}+1}^{n_{i,m+1}-1}A_{i,j}\right) \nonumber \\&\qquad \le \sum _{i=1}^s\sum _{j=n_{i,m}+1}^{n_{i,m+1}-1}\left( \frac{4-(1+\beta )^2}{4sN}\Vert {\mathbf {a}}_{m+1}\odot {\mathbf {u}}_{m+1}\Vert ^2+\frac{(1+\beta )^2C_1^2sN}{4-(1+\beta )^2}A_{i,j}^2\right) \nonumber \\&\qquad \le \frac{4-(1+\beta )^2}{4}\Vert {\mathbf {a}}_{m+1}\odot {\mathbf {u}}_{m+1}\Vert ^2 + \frac{(1+\beta )^2C_1^2sN}{4-(1+\beta )^2}\sum _{i=1}^s\sum _{j=n_{i,m}+1}^{n_{i,m+1}-1}A_{i,j}^2. \end{aligned}$$
(59)

Taking

$$\begin{aligned} C_1 \le \sqrt{\frac{\underline{\alpha }(1-\beta ^2)(4-(1+\beta )^2)}{4sN}}, \end{aligned}$$
(60)

we have from (58) and (59) that

$$\begin{aligned}&\frac{1+\beta }{2}\Vert {\mathbf {a}}_{m+1}\odot {\mathbf {u}}_{m+1}\Vert +C_1\sum _{i=1}^s\sum _{j=n_{i,m}+1}^{n_{i,m+1}-1}A_{i,j}\nonumber \\&\qquad \le \sqrt{\Vert {\mathbf {a}}_{m+1}\odot {\mathbf {u}}_{m+1}\Vert ^2+\underline{\alpha }(1-\beta ^2)\sum _{i=1}^s\sum _{j=n_{i,m}+1}^{n_{i,m+1}-1}A_{i,j}^2}. \end{aligned}$$
(61)

For any \(C_2>0\), it holds

$$\begin{aligned}&\sqrt{\beta ^2\Vert {\mathbf {a}}_{m}\odot {\mathbf {u}}_{m}\Vert ^2+B_m\sum _{i=1}^s\sum _{j=n_{i,m-1}+1}^{n_{i,m}}A_{i,j}}\nonumber \\&\qquad \le \beta \Vert {\mathbf {a}}_{m}\odot {\mathbf {u}}_{m}\Vert +\sqrt{B_m\sum _{i=1}^s\sum _{j=n_{i,m-1}+1}^{n_{i,m}}A_{i,j}}\nonumber \\&\qquad \le \beta \Vert {\mathbf {a}}_{m}\odot {\mathbf {u}}_{m}\Vert +C_2B_m+\frac{1}{4C_2}\sum _{i=1}^s\sum _{j=n_{i,m-1}+1}^{n_{i,m}}A_{i,j}\nonumber \\&\qquad \le \beta \Vert {\mathbf {a}}_{m}\odot {\mathbf {u}}_{m}\Vert +C_2B_m+\frac{1}{4C_2}\sum _{i=1}^s\sum _{j=n_{i,m-1}+1}^{n_{i,m}-1}A_{i,j}+\frac{\sqrt{s}}{4C_2}\Vert {\mathbf {u}}_m\Vert . \end{aligned}$$
(62)

Combining (57), (61), and (62), we have

$$\begin{aligned}&\frac{1+\beta }{2}\Vert {\mathbf {a}}_{m+1}\odot {\mathbf {u}}_{m+1}\Vert +C_1\sum _{i=1}^s\sum _{j=n_{i,m}+1}^{n_{i,m+1}-1}A_{i,j}\\&\qquad \le \beta \Vert {\mathbf {a}}_{m}\odot {\mathbf {u}}_{m}\Vert +C_2B_m+\frac{1}{4C_2}\sum _{i=1}^s\sum _{j=n_{i,m-1}+1}^{n_{i,m}-1}A_{i,j}+\frac{\sqrt{s}}{4C_2}\Vert {\mathbf {u}}_m\Vert . \end{aligned}$$

Summing the above inequality over m from \(M_1\) through \(M_2\le M\) and rearranging terms gives

$$\begin{aligned}&\sum _{m=M_1}^{M_2}\left( \frac{1-\beta }{2}\Vert {\mathbf {a}}_{m+1}\odot {\mathbf {u}}_{m+1}\Vert -\frac{\sqrt{s}}{4C_2}\Vert {\mathbf {u}}_{m+1}\Vert \right) +\left( C_1-\frac{1}{4C_2}\right) \sum _{m=M_1}^{M_2}\sum _{i=1}^s\sum _{j=n_{i,m}+1}^{n_{i,m+1}-1}A_{i,j}\nonumber \\&\qquad \le \beta \Vert {\mathbf {a}}_{M_1}\odot {\mathbf {u}}_{M_1}\Vert +C_2\sum _{m=M_1}^{M_2} B_m +\frac{1}{4C_2}\sum _{i=1}^s\sum _{j=n_{i,M_1-1}+1}^{n_{i,M_1}-1}A_{i,j}+\frac{\sqrt{s}}{4C_2}\Vert {\mathbf {u}}_{M_1}\Vert \end{aligned}$$
(63)

Take

$$\begin{aligned} C_2=\max \left( \frac{1}{2C_1},\ \frac{\sqrt{s}}{\sqrt{\underline{\alpha }}(1-\beta )}\right) . \end{aligned}$$
(64)

Then (63) implies

$$\begin{aligned}&\frac{\sqrt{\underline{\alpha }}(1-\beta )}{4}\sum _{m=M_1}^{M_2}\Vert {\mathbf {u}}_{m+1}\Vert +\frac{C_1}{2}\sum _{m=M_1}^{M_2}\sum _{i=1}^s\sum _{j=n_{i,m}+1}^{n_{i,m+1}-1}A_{i,j}\nonumber \\&\qquad \le \beta \sqrt{\overline{\alpha }}\Vert {\mathbf {u}}_{M_1}\Vert +C_2\sum _{m=M_1}^{M_2} B_m+\frac{1}{4C_2}\sum _{i=1}^s\sum _{j=n_{i,M_1-1}+1}^{n_{i,M_1}-1}A_{i,j}+\frac{\sqrt{s}}{4C_2}\Vert {\mathbf {u}}_{M_1}\Vert , \end{aligned}$$
(65)

which together with \(\sum _{i=1}^sA_{i,n_{i,m+1}}\le \sqrt{s}\Vert {\mathbf {u}}_{m+1}\Vert \) gives

$$\begin{aligned}&C_3\sum _{i=1}^s\sum _{j=n_{i,M_1}+1}^{n_{i,M_2+1}}A_{i,j}\nonumber \\&\quad = C_3\sum _{m=M_1}^{M_2}\sum _{i=1}^s\sum _{j=n_{i,m}+1}^{n_{i,m+1}}A_{i,j}\nonumber \\&\quad \le \beta \sqrt{\overline{\alpha }}\Vert {\mathbf {u}}_{M_1}\Vert +C_2\sum _{m=M_1}^{M_2} B_m+\frac{1}{4C_2}\sum _{i=1}^s\sum _{j=n_{i,M_1-1}+1}^{n_{i,M_1}-1}A_{i,j}+\frac{\sqrt{s}}{4C_2}\Vert {\mathbf {u}}_{M_1}\Vert ,\nonumber \\&\quad \le C_2\sum _{m=M_1}^{M_2} B_m+C_4\sum _{i=1}^s\sum _{j=n_{i,M_1-1}+1}^{n_{i,M_1}}A_{i,j}, \end{aligned}$$
(66)

where we have used \(\Vert {\mathbf {u}}_{M_1}\Vert \le \sum _{i=1}^sA_{i,n_{i,M_1}}\), and

$$\begin{aligned} C_3=\min \left( \frac{\sqrt{\underline{\alpha }}(1-\beta )}{4\sqrt{s}},\frac{C_1}{2}\right) ,\quad C_4=\beta \sqrt{\overline{\alpha }}+\frac{\sqrt{s}}{4C_2}. \end{aligned}$$
(67)

From (60), (64), and (67), we can take

$$\begin{aligned} C_1=\frac{\sqrt{\underline{\alpha }}(1-\beta )}{2\sqrt{sN}}\le \min \left\{ \sqrt{\frac{\underline{\alpha }(1-\beta ^2)(4-(1+\beta )^2)}{4sN}},\ \frac{\sqrt{\underline{\alpha }}(1-\beta )}{2\sqrt{s}}\right\} , \end{aligned}$$

where the inequality can be verified by noting \((1-\beta ^2)(4-(1+\beta )^2)-(1-\beta )^2\) is decreasing with respect to \(\beta \) in [0, 1]. Thus from (64) and (67), we have \(C_2=\frac{1}{2C_1},\, C_3=\frac{C_1}{2},\, C_4=\beta \sqrt{\overline{\alpha }}+\frac{\sqrt{s}C_1}{2}\). Hence, from (66), we complete the proof of (22).

If \(\lim _{m\rightarrow \infty }n_{i,m}=\infty ,\forall i\), \(\sum _{m=1}^\infty B_m<\infty \), and (21) holds for all m, then letting \(M_1=1\) and \(M_2\rightarrow \infty \) in (66) gives (23).
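
The monotonicity claim used above to justify the choice of \(C_1\) can also be checked numerically. The short sketch below (an aside, not part of the proof) evaluates \(h(\beta )=(1-\beta ^2)(4-(1+\beta )^2)-(1-\beta )^2\) on a grid and confirms that it is nonnegative and nonincreasing on [0, 1], which is exactly what the verification of (60) requires.

```python
import numpy as np

beta = np.linspace(0.0, 1.0, 10001)
h = (1 - beta**2) * (4 - (1 + beta)**2) - (1 - beta)**2
assert np.all(h >= -1e-12)           # h >= 0 on [0, 1], so the chosen C_1 satisfies (60)
assert np.all(np.diff(h) <= 1e-12)   # h is (numerically) nonincreasing on [0, 1]
```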

1.6 Proof of Proposition 3

For any i, assume that while updating the ith block to \({\mathbf {x}}_i^k\), the value of the jth block (\(j\ne i\)) is \({\mathbf {y}}_j^{(i)}\), the extrapolated point of the ith block is \({\mathbf {z}}_i\), and the Lipschitz constant of \(\nabla _{{\mathbf {x}}_i}f({\mathbf {y}}_{\ne i}^{(i)},{\mathbf {x}}_i)\) with respect to \({\mathbf {x}}_i\) is \(\tilde{L}_i\), namely,

$$\begin{aligned} {\mathbf {x}}_i^k\in {{\mathrm{\hbox {arg min}}}}_{{\mathbf {x}}_i}\langle \nabla _{{\mathbf {x}}_i}f({\mathbf {y}}_{\ne i}^{(i)},{\mathbf {z}}_i),{\mathbf {x}}_i-{\mathbf {z}}_i\rangle +\tilde{L}_i\Vert {\mathbf {x}}_i-{\mathbf {z}}_i\Vert ^2+r_i({\mathbf {x}}_i). \end{aligned}$$

Hence, \(\mathbf {0}\in \nabla _{{\mathbf {x}}_i}f({\mathbf {y}}_{\ne i}^{(i)},{\mathbf {z}}_i)+2\tilde{L}_i({\mathbf {x}}_i^k-{\mathbf {z}}_i)+\partial r_i({\mathbf {x}}_i^k),\) or equivalently,

$$\begin{aligned} \nabla _{{\mathbf {x}}_i}f({\mathbf {x}}^k)-\nabla _{{\mathbf {x}}_i}f({\mathbf {y}}_{\ne i}^{(i)},{\mathbf {z}}_i)-2\tilde{L}_i({\mathbf {x}}_i^k-{\mathbf {z}}_i)\in \nabla _{{\mathbf {x}}_i}f({\mathbf {x}}^k)+\partial r_i({\mathbf {x}}_i^k),\,\forall i. \end{aligned}$$
(68)

Note that \({\mathbf {x}}_i\) may be updated to \({\mathbf {x}}_i^k\) not at the kth iteration but at some earlier one, which must be between \(k-T\) and k by Assumption 3. In addition, for each pair (ij), there must be some \(\kappa _{i,j}\) between \(k-2T\) and k such that

$$\begin{aligned} {\mathbf {y}}_j^{(i)}={\mathbf {x}}_j^{\kappa _{i,j}}, \end{aligned}$$
(69)

and for each i, there are \(k-3T\le \kappa _1^i<\kappa _2^i\le k\) and extrapolation weight \(\tilde{\omega }_i\le 1\) such that

$$\begin{aligned} {\mathbf {z}}_i={\mathbf {x}}_i^{\kappa _2^i}+\tilde{\omega }_i({\mathbf {x}}_i^{\kappa _2^i}- {\mathbf {x}}_i^{\kappa _1^i}). \end{aligned}$$
(70)

By the triangle inequality, \(({\mathbf {y}}_{\ne i}^{(i)},{\mathbf {z}}_i)\in B_{4\rho }(\bar{{\mathbf {x}}})\) for all i. Therefore, it follows from (10) and (68) that

$$\begin{aligned} \mathrm {dist}(\mathbf {0},\partial F({\mathbf {x}}^k))\overset{(68)}{\le }&\sqrt{\sum _{i=1}^s\Vert \nabla _{{\mathbf {x}}_i}f({\mathbf {x}}^k)-\nabla _{{\mathbf {x}}_i}f({\mathbf {y}}_{\ne i}^{(i)},{\mathbf {z}}_i)-2\tilde{L}_i({\mathbf {x}}_i^k-{\mathbf {z}}_i)\Vert ^2}\nonumber \\ \le&\sum _{i=1}^s\Vert \nabla _{{\mathbf {x}}_i}f({\mathbf {x}}^k)-\nabla _{{\mathbf {x}}_i}f({\mathbf {y}}_{\ne i}^{(i)},{\mathbf {z}}_i)-2\tilde{L}_i({\mathbf {x}}_i^k-{\mathbf {z}}_i)\Vert \nonumber \\ \le&\sum _{i=1}^s\left( \Vert \nabla _{{\mathbf {x}}_i}f({\mathbf {x}}^k)-\nabla _{{\mathbf {x}}_i}f({\mathbf {y}}_{\ne i}^{(i)},{\mathbf {z}}_i)\Vert +2\tilde{L}_i\Vert {\mathbf {x}}_i^k-{\mathbf {z}}_i\Vert \right) \nonumber \\ \le&\sum _{i=1}^s\left( L_G\Vert {\mathbf {x}}^k-({\mathbf {y}}_{\ne i}^{(i)},{\mathbf {z}}_i)\Vert +2\tilde{L}_i\Vert {\mathbf {x}}_i^k-{\mathbf {z}}_i\Vert \right) \nonumber \\ \le&\sum _{i=1}^s\left( (L_G+2L)\Vert {\mathbf {x}}_i^k-{\mathbf {z}}_i\Vert +L_G\sum _{j\ne i}\Vert {\mathbf {x}}_j^k-{\mathbf {y}}_j^{(i)}\Vert \right) , \end{aligned}$$
(71)

where in the fourth inequality, we have used the Lipschitz continuity of \(\nabla _{{\mathbf {x}}_i}f({\mathbf {x}})\) with respect to \({\mathbf {x}}\), and the last inequality uses \(\tilde{L}_i\le L\). Combining (71) with (69), (70), and the triangle inequality gives the desired result.

1.7 Proof of Lemma 3

The proof follows that of Theorem 2 of [3]. When \(\gamma \ge 1\), since \(0\le A_{k-1}-A_k\le 1,\forall k\ge K\), we have \((A_{k-1}-A_k)^\gamma \le A_{k-1}-A_k\), and thus (33) implies that for all \(k\ge K\), it holds that \(A_k\le (\alpha +\beta )(A_{k-1}-A_k)\), from which item 1 immediately follows.

When \(\gamma <1\), we have \((A_{k-1}-A_k)^\gamma \ge A_{k-1}-A_k\), and thus (33) implies that for all \(k\ge K\), it holds that \(A_k\le (\alpha +\beta )(A_{k-1}-A_k)^\gamma \). Letting \(h(x)=x^{-1/\gamma }\), we have for \(k\ge K\),

$$\begin{aligned} 1&\le (\alpha +\beta )^{1/\gamma }(A_{k-1}-A_k)A_k^{-1/\gamma }\\&= (\alpha +\beta )^{1/\gamma }\left( \frac{A_{k-1}}{A_k}\right) ^{1/\gamma }(A_{k-1}-A_k)A_{k-1}^{-1/\gamma }\\&\le (\alpha +\beta )^{1/\gamma }\left( \frac{A_{k-1}}{A_k}\right) ^{1/\gamma }\int _{A_k}^{A_{k-1}}h(x)dx\\&=\frac{(\alpha +\beta )^{1/\gamma }}{1-1/\gamma }\left( \frac{A_{k-1}}{A_k}\right) ^{1/\gamma }\left( A_{k-1}^{1-1/\gamma }-A_k^{1-1/\gamma }\right) , \end{aligned}$$

where the second inequality uses the fact that h is nonincreasing. Hence,

$$\begin{aligned} A_{k}^{1-1/\gamma }-A_{k-1}^{1-1/\gamma }\ge \frac{1/\gamma -1}{(\alpha +\beta )^{1/\gamma }}\left( \frac{A_{k}}{A_{k-1}}\right) ^{1/\gamma }. \end{aligned}$$
(72)

Let \(\mu \) be the positive constant such that

$$\begin{aligned} \frac{1/\gamma -1}{(\alpha +\beta )^{1/\gamma }}\mu =\mu ^{\gamma -1}-1. \end{aligned}$$
(73)

Note that the above equation has a unique solution \(0<\mu <1\). We claim that

$$\begin{aligned} A_{k}^{1-1/\gamma }-A_{k-1}^{1-1/\gamma }\ge \mu ^{\gamma -1}-1,\ \forall k\ge K. \end{aligned}$$
(74)

It obviously holds from (72) and (73) if \(\big (\frac{A_{k}}{A_{k-1}}\big )^{1/\gamma }\ge \mu \). It also holds if \(\big (\frac{A_{k}}{A_{k-1}}\big )^{1/\gamma }\le \mu \) from the arguments

$$\begin{aligned} \left( \frac{A_{k}}{A_{k-1}}\right) ^{1/\gamma }\le \mu \Rightarrow&A_k\le \mu ^\gamma A_{k-1}\Rightarrow A_k^{1-1/\gamma }\ge \mu ^{\gamma -1}A_{k-1}^{1-1/\gamma }\\ \Rightarrow&A_{k}^{1-1/\gamma }-A_{k-1}^{1-1/\gamma }\ge (\mu ^{\gamma -1}-1)A_{k-1}^{1-1/\gamma } \ge \mu ^{\gamma -1}-1, \end{aligned}$$

where the last inequality is from \(A_{k-1}^{1-1/\gamma }\ge 1\). Hence, (74) holds, and summing it over k gives

$$\begin{aligned} A_k^{1-1/\gamma }\ge A_k^{1-1/\gamma }-A_K^{1-1/\gamma }\ge (\mu ^{\gamma -1}-1)(k-K), \end{aligned}$$

which immediately gives item 2 by letting \(\nu =(\mu ^{\gamma -1}-1)^{\frac{\gamma }{\gamma -1}}\).
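
To illustrate the two regimes of Lemma 3 (an aside, not part of the proof), the sketch below iterates the relation \(A_k=(\alpha +\beta )(A_{k-1}-A_k)^\gamma \) with equality: for \(\gamma =1\) it reproduces the geometric decay of item 1, and for \(\gamma =1/2\) it exhibits the \(O\big (k^{\gamma /(\gamma -1)}\big )=O(1/k)\) decay of item 2. The constant \(c\), standing in for \(\alpha +\beta \), is a placeholder.

```python
import numpy as np

c = 2.0                                  # plays the role of alpha + beta (placeholder)

# gamma = 1:  A_k = c*(A_{k-1} - A_k)  =>  A_k = rho*A_{k-1}  with rho = c/(1+c)   (item 1)
A, rho = 1.0, c / (1.0 + c)
for _ in range(50):
    A = rho * A
print("gamma = 1   :", A, "vs", rho**50, "(geometric decay)")

# gamma = 1/2:  A_k = c*sqrt(A_{k-1} - A_k); solving the quadratic for A_k gives
#               A_k = (-c**2 + sqrt(c**4 + 4*c**2*A_{k-1})) / 2,
#               and item 2 predicts A_k = O(k**(gamma/(gamma-1))) = O(1/k).
A = 1.0
for k in range(1, 10001):
    A = (-c**2 + np.sqrt(c**4 + 4 * c**2 * A)) / 2
print("gamma = 1/2 :", A, "with k*A_k ~", 10000 * A, "(stays bounded, roughly c**2)")
```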

Appendix 2: Solutions of (46)

In this section, we give closed-form solutions to both updates in (46). First, it is not difficult to obtain the solution of (46b):

$$\begin{aligned} {\mathbf {y}}_{\pi _i}^{k+1}=\max \left( 0,\big ({\mathbf {M}}-{\mathbf {X}}_{\pi _{<i}}^{k+1}({\mathbf {Y}}_{\pi _{<i}}^{k+1})^\top -{\mathbf {X}}_{\pi _{>i}}^{k}({\mathbf {Y}}_{\pi _{>i}}^{k})^\top \big )^\top {\mathbf {x}}_{\pi _i}^{k+1}\right) . \end{aligned}$$

Secondly, since \(L_{\pi _i}^k>0\), it is easy to write (46a) in the form

$$\begin{aligned} \min _{{\mathbf {x}}\ge 0,\,\Vert {\mathbf {x}}\Vert =1}\frac{1}{2}\Vert {\mathbf {x}}-{\mathbf {a}}\Vert ^2+{\mathbf {b}}^\top {\mathbf {x}}+C, \end{aligned}$$

which is clearly equivalent to

$$\begin{aligned} \max _{{\mathbf {x}}\ge 0,\,\Vert {\mathbf {x}}\Vert =1} {\mathbf {c}}^\top {\mathbf {x}}, \end{aligned}$$
(75)

where \({\mathbf {c}}={\mathbf {a}}-{\mathbf {b}}\). Next, we give the solution to (75) in three different cases.

Case 1 \({\mathbf {c}}<0\). Let \(i_0={{\mathrm{\hbox {arg max}}}}_i c_i\) and \(c_{\max }=c_{i_0}<0\). If more than one component equals \(c_{\max }\), choose any one of them. Then the solution to (75) is given by \(x_{i_0}=1\) and \(x_i=0,\forall i\ne i_0\), because for any \({\mathbf {x}}\ge 0\) with \(\Vert {\mathbf {x}}\Vert =1\), it holds that

$$\begin{aligned} {\mathbf {c}}^\top {\mathbf {x}}\le c_{\max }\Vert {\mathbf {x}}\Vert _1\le c_{\max }\Vert {\mathbf {x}}\Vert =c_{\max }. \end{aligned}$$

Case 2 \({\mathbf {c}}\le 0\) and \({\mathbf {c}}\not <0\). Let \({\mathbf {c}}=({\mathbf {c}}_{I_0},{\mathbf {c}}_{I_-})\), where \({\mathbf {c}}_{I_0}=\mathbf {0}\) and \({\mathbf {c}}_{I_-}<0\). Then the solution to (75) is given by \({\mathbf {x}}_{I_-}=\mathbf {0}\) and \({\mathbf {x}}_{I_0}\) being any vector satisfying \({\mathbf {x}}_{I_0}\ge 0\) and \(\Vert {\mathbf {x}}_{I_0}\Vert =1\), because \({\mathbf {c}}^\top {\mathbf {x}}\le 0\) for any \({\mathbf {x}}\ge 0\) and any such \({\mathbf {x}}\) attains the maximum value zero.

Case 3 \({\mathbf {c}}\not \le 0\). Let \({\mathbf {c}}=({\mathbf {c}}_{I_+},{\mathbf {c}}_{I_+^c})\), where \({\mathbf {c}}_{I_+}>0\) and \({\mathbf {c}}_{I_+^c}\le 0\). Then (75) has a unique solution given by \({\mathbf {x}}_{I_+}=\frac{{\mathbf {c}}_{I_+}}{\Vert {\mathbf {c}}_{I_+}\Vert }\) and \({\mathbf {x}}_{I_+^c}=\mathbf {0}\), because for any \({\mathbf {x}}\ge 0\) with \(\Vert {\mathbf {x}}\Vert =1\), it holds that

$$\begin{aligned} {\mathbf {c}}^\top {\mathbf {x}}\le {\mathbf {c}}_{I_+}^\top {\mathbf {x}}_{I_+}\le \Vert {\mathbf {c}}_{I_+}\Vert \cdot \Vert {\mathbf {x}}_{I_+}\Vert \le \Vert {\mathbf {c}}_{I_+}\Vert \cdot \Vert {\mathbf {x}}\Vert =\Vert {\mathbf {c}}_{I_+}\Vert , \end{aligned}$$

where the second inequality holds with equality if and only if \({\mathbf {x}}_{I_+}\) is collinear with \({\mathbf {c}}_{I_+}\), and the third inequality holds with equality if and only if \({\mathbf {x}}_{I_+^c}=\mathbf {0}\).
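
The three cases translate directly into a small routine. The following sketch (not the authors' code) returns one maximizer of (75), i.e., of \(\max \{{\mathbf {c}}^\top {\mathbf {x}}: {\mathbf {x}}\ge 0,\ \Vert {\mathbf {x}}\Vert =1\}\), choosing a particular solution in the non-unique cases:

```python
import numpy as np

def max_on_nonneg_sphere(c):
    """One maximizer of  max { c^T x : x >= 0, ||x||_2 = 1 }  (problem (75))."""
    c = np.asarray(c, dtype=float)
    x = np.zeros_like(c)
    pos = c > 0
    if pos.any():                    # Case 3: keep the positive part of c and normalize it
        x[pos] = c[pos] / np.linalg.norm(c[pos])
    elif (c == 0).any():             # Case 2: any unit vector supported on {i : c_i = 0}
        x[np.argmax(c == 0)] = 1.0
    else:                            # Case 1: all weight on the largest (least negative) entry
        x[np.argmax(c)] = 1.0
    return x

# max_on_nonneg_sphere([-1.0, 2.0, 0.5])  ->  array([0., 0.970..., 0.242...])
```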

Appendix 3: Proofs of Convergence of Some Examples

In this section, we give the proofs of the theorems in Sect. 3.

1.1 Proof of Theorem 6

By checking the assumptions of Theorem 2, we see that we only need to verify the boundedness of \(\{{\mathbf {Y}}^k\}\) to prove Theorem 6. Let \({\mathbf {E}}^k={\mathbf {X}}^k({\mathbf {Y}}^k)^\top -{\mathbf {M}}\). Since every iteration decreases the objective, it is easy to see that \(\{{\mathbf {E}}^k\}\) is bounded. Hence, \(\{{\mathbf {E}}^k+{\mathbf {M}}\}\) is bounded, and

$$\begin{aligned} a=\sup _k\max _{i,j}({\mathbf {E}}^k+{\mathbf {M}})_{ij}<\infty . \end{aligned}$$

Let \(y_{ij}^k\) be the (ij)th entry of \({\mathbf {Y}}^k\). Thus the columns of \({\mathbf {E}}^k+{\mathbf {M}}\) satisfy

$$\begin{aligned} a\ge {\mathbf {e}}_i^k+{\mathbf {m}}_i=\sum _{j=1}^p y_{ij}^k {\mathbf {x}}_j^k,\,\forall i, \end{aligned}$$
(76)

where \({\mathbf {x}}_j^k\) is the jth column of \({\mathbf {X}}^k\). Since \(\Vert {\mathbf {x}}_j^k\Vert =1\), we have \(\Vert {\mathbf {x}}_j^k\Vert _\infty \ge 1/\sqrt{m},\,\forall j\). Note that (76) implies that each component of \(\sum _{j=1}^p y_{ij}^k {\mathbf {x}}_j^k\) is no greater than a. Hence, from the nonnegativity of \({\mathbf {X}}^k\) and \({\mathbf {Y}}^k\) and the fact that at least one entry of \({\mathbf {x}}_j^k\) is no less than \(1/\sqrt{m}\), we have \(y_{ij}^k\le a\sqrt{m}\) for all i, j, and k. This completes the proof.
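
For concreteness, the two closed-form updates of Appendix 2 can be combined into one sweep of the iteration analyzed above. The sketch below (a simplification, not the authors' implementation) drops extrapolation, uses `max_on_nonneg_sphere` from the sketch in Appendix 2, and takes step size \(1/L_{\pi _i}^k\) with \(L_{\pi _i}^k=\Vert {\mathbf {y}}_{\pi _i}\Vert ^2\):

```python
import numpy as np

def rank_one_residue_sweep(M, X, Y):
    """One extrapolation-free sweep of the block updates for
       min 0.5*||X Y^T - M||_F^2   s.t.  X >= 0 with unit-norm columns,  Y >= 0.
    A simplified sketch built on the closed forms of Appendix 2."""
    for i in range(X.shape[1]):
        x, y = X[:, i], Y[:, i]
        R = X @ Y.T - np.outer(x, y)                    # residual without the i-th rank-one term
        L = max(float(y @ y), 1e-12)                    # Lipschitz constant of the block-x gradient
        g = (np.outer(x, y) + R - M) @ y                # partial gradient w.r.t. x
        X[:, i] = max_on_nonneg_sphere(x - g / L)       # update (46a), solved via problem (75)
        Y[:, i] = np.maximum(0.0, (M - R).T @ X[:, i])  # update (46b), closed form above
    return X, Y
```

Repeated sweeps, with a shuffled column order and extrapolation added back, give a bare-bones variant of the factorization iteration tested in the numerical experiments.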


Cite this article

Xu, Y., Yin, W. A Globally Convergent Algorithm for Nonconvex Optimization Based on Block Coordinate Update. J Sci Comput 72, 700–734 (2017). https://doi.org/10.1007/s10915-017-0376-0
