An inexact proximal augmented Lagrangian framework with arbitrary linearly convergent inner solver for composite convex optimization

Abstract

We propose an inexact proximal augmented Lagrangian framework with an explicit inner problem termination rule for composite convex optimization problems. The framework accepts an arbitrary linearly convergent inner solver, including in particular stochastic algorithms, which makes it more scalable in the face of ever-increasing problem dimensions. Each subproblem is solved inexactly with an explicit and self-adaptive stopping criterion, without requiring an a priori target accuracy to be set. When the primal and dual domains are bounded, our method achieves \(O(1/\sqrt{\epsilon })\) and \(O(1/{\epsilon })\) complexity bounds, in terms of the number of inner solver iterations, for the strongly convex and the non-strongly convex case respectively. Without the boundedness assumption, only logarithmic terms need to be added, and the above two complexity bounds increase respectively to \({\tilde{O}}(1/\sqrt{\epsilon })\) and \({\tilde{O}}(1/{\epsilon })\), which hold both for obtaining an \(\epsilon \)-optimal and an \(\epsilon \)-KKT solution. Within the general framework that we propose, we also obtain \({\tilde{O}}(1/{\epsilon })\) and \({\tilde{O}}(1/{\epsilon ^2})\) complexity bounds under a relative smoothness assumption on the differentiable component of the objective function. We show, through theoretical analysis as well as numerical experiments, the computational speedup possibly achieved by the use of randomized inner solvers for large-scale problems.

Notes

  1. L-Katyusha stands for Loopless Katyusha. The algorithm Katyusha was first proposed by Allen-Zhu [2]. The loopless variants [21, 35] have the same complexity order as the original algorithm but a simpler implementation and improved practical efficiency.

  2. https://www.csie.ntu.edu.tw/~cjlin/libsvm/.

References

  1. Alacaoglu, A., Tran-Dinh, Q., Fercoq, O., Cevher, V.: Smooth primal-dual coordinate descent algorithms for nonsmooth convex optimization. In: Advances in Neural Information Processing Systems, pp. 5852–5861 (2017)

  2. Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res. 18(1), 8194–8244 (2017)

  3. Auslender, A., Teboulle, M.: Interior projection-like methods for monotone variational inequalities. Math. Program. 104(1), 39–68 (2005). https://doi.org/10.1007/s10107-004-0568-x

  4. Beck, A., Teboulle, M.: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2(1), 183–202 (2009)

  5. Beck, A., Teboulle, M.: Smoothing and first order methods: a unified framework. SIAM J. Optim. 22(2), 557–580 (2012). https://doi.org/10.1137/100818327

  6. Belloni, A., Chernozhukov, V., Wang, L.: Square-root lasso: pivotal recovery of sparse signals via conic programming. SSRN Electron. J. (2011). https://doi.org/10.2139/ssrn.1910753

  7. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Academic Press, Cambridge (2014)

  8. Bauschke, H.H., Bolte, J., Teboulle, M.: A descent lemma beyond Lipschitz gradient continuity: first-order methods revisited and applications. Math. Oper. Res. 42, 330–348 (2016). https://doi.org/10.1287/moor.2016.0817

  9. Borwein, J.M., Vanderwerff, J.D., et al.: Convex Functions: Constructions, Characterizations and Counterexamples, vol. 109. Cambridge University Press, Cambridge (2010)

  10. Chambolle, A., Ehrhardt, M.J., Richtárik, P., Schonlieb, C.B.: Stochastic primal-dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM J. Optim. 28(4), 2783–2808 (2018)

  11. Chambolle, A., Pock, T.: A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 40(1), 120–145 (2011)

  12. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27 (2011)

  13. Chen, S., Donoho, D., Saunders, M.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1), 33–61 (1998). https://doi.org/10.1137/S1064827596304010

  14. Drusvyatskiy, D., Paquette, C.: Efficiency of minimizing compositions of convex functions and smooth maps. Math. Program. 178, 503–558 (2019)

  15. Fercoq, O., Qu, Z.: Restarting the accelerated coordinate descent method with a rough strong convexity estimate. arXiv:1803.05771 (2018)

  16. Fercoq, O., Qu, Z.: Adaptive restart of accelerated gradient methods under local quadratic growth condition. IMA J. Numer. Anal. (2019). https://doi.org/10.1093/imanum/drz007

  17. Fercoq, O., Richtárik, P.: Accelerated, parallel and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)

  18. Friedlander, M.P., Goh, G.: Efficient evaluation of scaled proximal operators. Electron. Trans. Numer. Anal. 46, 1–22 (2017)

  19. Bauschke, H.H., Combettes, P.: The Baillon–Haddad theorem revisited. J. Convex Anal. 17, 1–7 (2009)

  20. Hien, L.T.K., Zhao, R., Haskell, W.B.: An inexact primal-dual smoothing framework for large-scale non-bilinear saddle point problems. arXiv preprint arXiv:1711.03669 (2017)

  21. Kovalev, D., Horváth, S., Richtárik, P.: Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop (2019)

  22. Lan, G., Monteiro, R.D.: Iteration-complexity of first-order augmented Lagrangian methods for convex programming. Math. Program. 155(1–2), 511–547 (2016). https://doi.org/10.1007/s10107-015-0861-x

  23. Li, H., Lin, Z.: On the complexity analysis of the primal solutions for the accelerated randomized dual coordinate ascent. arXiv preprint arXiv:1807.00261 (2018)

  24. Liu, Y., Liu, X., Ma, S.: On the nonergodic convergence rate of an inexact augmented Lagrangian framework for composite convex programming. Math. Oper. Res. 44(2), 632–650 (2019). https://doi.org/10.1287/moor.2018.0939

  25. Lu, H., Freund, R., Nesterov, Y.: Relatively smooth convex optimization by first-order methods, and applications. SIAM J. Optim. 28(1), 333–354 (2018). https://doi.org/10.1137/16M1099546

  26. Lu, Z., Zhou, Z.: Iteration-complexity of first-order augmented Lagrangian methods for convex conic programming. arXiv preprint arXiv:1803.09941 (2018)

  27. Necoara, I., Nesterov, Y., Glineur, F.: Linear convergence of first order methods for non-strongly convex optimization. Math. Program. (2018). https://doi.org/10.1007/s10107-018-1232-1

  28. Necoara, I., Patrascu, A., Glineur, F.: Complexity of first-order inexact Lagrangian and penalty methods for conic convex programming. Optim. Methods Softw. 34(2), 305–335 (2019). https://doi.org/10.1080/10556788.2017.1380642

  29. Nedelcu, V., Necoara, I., Tran-Dinh, Q.: Computational complexity of inexact gradient augmented Lagrangian methods: application to constrained MPC. SIAM J. Control Optim. 52(5), 3109–3134 (2014). https://doi.org/10.1137/120897547

  30. Nesterov, Y.: A method of solving a convex programming problem with convergence rate \({O}(1/k^2)\). Sov. Math. Doklady 27(2), 372–376 (1983)

  31. Nesterov, Y.: Smooth minimization of non-smooth functions. Math. Program. 103(1), 127–152 (2005). https://doi.org/10.1007/s10107-004-0552-5

  32. Nesterov, Y., et al.: Gradient methods for minimizing composite objective function (2007)

  33. Ouyang, Y., Chen, Y., Lan, G., Pasiliao, E., Jr.: An accelerated linearized alternating direction method of multipliers. SIAM J. Imaging Sci. 8(1), 644–681 (2015)

  34. Patrascu, A., Necoara, I., Tran-Dinh, Q.: Adaptive inexact fast augmented Lagrangian methods for constrained convex optimization. Optim. Lett. 11, 609–626 (2015). https://doi.org/10.1007/s11590-016-1024-6

  35. Qian, X., Qu, Z., Richtárik, P.: L-SVRG and L-Katyusha with arbitrary sampling. arXiv:1906.01481 (2019)

  36. Rafique, H., Liu, M., Lin, Q., Yang, T.: Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060 (2018)

  37. Rockafellar, R.T.: Convex Analysis. Princeton Mathematical Series, Princeton University Press, Princeton (1970)

  38. Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math. Oper. Res. 1(2), 97–116 (1976)

  39. Rockafellar, R.T.: Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14(5), 877–898 (1976)

  40. Scokaert, P.O.M., Mayne, D.Q., Rawlings, J.B.: Suboptimal model predictive control (feasibility implies stability). IEEE Trans. Autom. Control 44(3), 648–654 (1999). https://doi.org/10.1109/9.751369

  41. Simon, N., Friedman, J., Hastie, T., Tibshirani, R.: A sparse-group lasso. J. Comput. Graph. Stat. 22, 231–245 (2013)

  42. Tibshirani, R., Saunders, M., Rosset, S., Zhu, J., Knight, K.: Sparsity and smoothness via the fused lasso. J. R. Stat. Soc. Ser. B 67, 91–108 (2005)

  43. Tran-Dinh, Q., Alacaoglu, A., Fercoq, O., Cevher, V.: An adaptive primal-dual framework for nonsmooth convex minimization. Math. Program. Comput. (2019). https://doi.org/10.1007/s12532-019-00173-3

  44. Tran-Dinh, Q., Fercoq, O., Cevher, V.: A smooth primal-dual optimization framework for nonsmooth composite convex minimization. SIAM J. Optim. 28(1), 96–134 (2018)

  45. Tseng, P.: On accelerated proximal gradient methods for convex-concave optimization. Submitted to SIAM Journal on Optimization (2008)

  46. Wang, H., Li, G., Jiang, G.: Robust regression shrinkage and consistent variable selection through the LAD-lasso. J. Bus. Econ. Stat. 25(3), 347–355 (2007)

  47. Xu, Y.: First-order methods for constrained convex programming based on linearized augmented Lagrangian function. arXiv preprint arXiv:1711.08020 (2017)

  48. Xu, Y.: Iteration complexity of inexact augmented Lagrangian methods for constrained convex programming. arXiv:1711.05812 (2017)

  49. Xu, Y., Zhang, S.: Accelerated primal-dual proximal block coordinate updating methods for constrained convex optimization. Comput. Optim. Appl. 70(1), 91–128 (2018)

  50. Yuan, X., Zeng, S., Zhang, J.: Discerning the linear convergence of ADMM for structured convex optimization through the lens of variational analysis. optimization-online (2018)

  51. Zhu, J., Rosset, S., Hastie, T., Tibshirani, R.: 1-norm support vector machines. In: Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS’03, pp. 49–56. MIT Press, Cambridge, MA, USA (2003). http://dl.acm.org/citation.cfm?id=2981345.2981352

Acknowledgements

We thank the three anonymous referees for their valuable comments, which helped improve the paper.

Author information

Corresponding author

Correspondence to Zheng Qu.

Additional information

Fei Li was supported by Hong Kong PhD Fellowship Scheme No. PF15-16399. Zheng Qu was supported by Early Career Scheme from Hong Kong Research Grants Council No. 27302016. The computations were performed using research computing facilities offered by Information Technology Services, the University of Hong Kong.

Appendices

Proof of Lemma 1

The first assertion follows from the proof of [31, Theorem 1]. See also [19, Theorem 2.1] and [5, Lemma 4.1]. The condition (20) is given by the first order optimality condition of (17). It implies

$$\begin{aligned} \varLambda (u;\lambda ,\beta ) \in \partial h \left( u-\beta (\varLambda (u;\lambda , \beta )-\lambda )\right) . \end{aligned}$$
(91)

The equality in (21) is a direct application of the Fenchel duality theorem [37]. See also [5, Equation 4.1 and 4.2]. The inequality in (21) follows by considering \(w=0\). The condition (22) follows from the first order optimality condition and (91). Finally (23) is obtained by plugging the optimal solution \(w^\star \) in (22) into (21).
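To make the optimality condition (20) (equivalently (91)) concrete, the following minimal Python sketch checks it numerically for one illustrative choice of \(h\), namely \(h=\Vert \cdot \Vert _1\), for which \(h^*\) is the indicator of \([-1,1]^d\) and \(\varLambda (u;\lambda ,\beta )\) reduces to a componentwise clipping. This particular \(h\) and the random data are our own choices, made only for illustration, and are not the general setting of the paper.

import numpy as np

rng = np.random.default_rng(0)
d, beta = 5, 0.7
u, lam = rng.standard_normal(d), rng.standard_normal(d)

# For h = ||.||_1 the conjugate h* is the indicator of [-1,1]^d, so the maximizer
# defining Lambda(u; lam, beta) is the projection of lam + u/beta onto [-1,1]^d.
Lam = np.clip(lam + u / beta, -1.0, 1.0)

# Condition (20): Lambda lies in the subdifferential of h at u - beta*(Lambda - lam),
# i.e. componentwise Lam[i] = sign(v[i]) if v[i] != 0, and |Lam[i]| <= 1 if v[i] = 0.
v = u - beta * (Lam - lam)
ok = all(
    np.isclose(Lam[i], np.sign(v[i])) if abs(v[i]) > 1e-12 else abs(Lam[i]) <= 1 + 1e-12
    for i in range(d)
)
print("condition (20) holds componentwise:", ok)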

Some useful lemmas

We first state two useful lemmas.

Lemma 6

Let \(\psi (\cdot ):\mathbb {R}^d\rightarrow \mathbb {R}\cup \{+\infty \}\) be a convex function. Define:

$$\begin{aligned} {\tilde{\psi }}(x):=\inf _w\{h(p(x)-w)+\psi (w)\}. \end{aligned}$$

Then condition (15) ensures the convexity of \({\tilde{\psi }}\).

Proof

For any \(x, y\in \mathbb {R}^n\) and \(\alpha \in [0,1]\), let \(z= \alpha x+ (1- \alpha )y\). By condition (15),

$$\begin{aligned} h\left( p(z)- \alpha u- (1- \alpha ) v\right) \le \alpha h(p(x)- u)+ (1- \alpha )h(p(y)- v),\forall u,v\in \mathbb {R}^d. \end{aligned}$$

It follows that

$$\begin{aligned} \tilde{\psi }(z)&= \inf _\omega \left\{ h(p(z)- \omega )+ \psi (\omega ) \right\} \\&= \inf _{u,v} \left\{ h\left( p(z)- \alpha u- (1- \alpha )v\right) + \psi (\alpha u+ (1- \alpha )v) \right\} \\&\le \inf _{u,v} \left\{ \alpha h(p(x)- u)+ (1- \alpha )h(p(y)- v)+ \alpha \psi (u)+ (1- \alpha )\psi (v) \right\} \\&= \alpha \inf _u \left\{ h(p(x)- u)+ \psi (u) \right\} + (1- \alpha )\inf _v \left\{ h(p(y)- v)+ \psi (v) \right\} \\&= \alpha \tilde{\psi }(x)+ (1- \alpha )\tilde{\psi }(y). \end{aligned}$$

\(\square \)

Similarly, we can show the following result.

Lemma 7

Let \(\psi (\cdot ):\mathbb {R}^n\rightarrow \mathbb {R}\cup \{+\infty \}\) be a convex function. Define:

$$\begin{aligned} {\tilde{\psi }}(w):=\inf _x\{h(p(x)-w)+\psi (x)\}. \end{aligned}$$

Then condition (15) ensures the convexity of \(\tilde{\psi }\).

Inexact proximal point algorithm and inexact augmented Lagrangian method

1.1 Inexact proximal point method

Let \({\mathcal {T}}:\mathbb {R}^{n+d}\rightarrow \mathbb {R}^{n+d}\) be a maximal monotone operator and \({\mathcal {J}}_{\rho }= ({\mathcal {I}}+ \rho {\mathcal {T}})^{-1}\) be the resolvent of \({\mathcal {T}}\), where \({\mathcal {I}}\) denotes the identity operator. Then for any \(z^\star \) such that \(0\in {\mathcal {T}}(z^\star )\) [39],

$$\begin{aligned} \left\| {\mathcal {J}}_{\rho }(z)- z^\star \right\| ^2+ \left\| {\mathcal {J}}_{\rho }(z)- z\right\| ^2\le \left\| z- z^\star \right\| ^2. \end{aligned}$$
(92)
[Algorithm 4 (the inexact proximal point method) appears here as an image in the original article.]
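The following minimal Python sketch illustrates the inexact proximal point iteration behind Algorithm 4 under our own toy assumptions: \({\mathcal {T}}\) is taken to be the gradient of a smooth convex quadratic, and each resolvent \({\mathcal {J}}_{\rho _s}(z^s)\) is only approximated, by running gradient descent on the proximal subproblem until the requested accuracy \(\varepsilon _s\) is certified through strong convexity. It is a sketch of the mechanism only, not of the inner solvers analysed in the paper.

import numpy as np

rng = np.random.default_rng(1)
n = 20
A = rng.standard_normal((30, n))
b = rng.standard_normal(30)
Q, q = A.T @ A, -A.T @ b               # T = grad f with f(z) = 0.5*||Az - b||^2

def approx_resolvent(z, rho, eps):
    """Return x with ||x - J_rho(z)|| <= eps, where J_rho(z) minimizes
    f(x) + ||x - z||^2 / (2*rho)."""
    x = z.copy()
    L = np.linalg.norm(Q, 2) + 1.0 / rho        # smoothness of the subproblem
    g = Q @ x + q + (x - z) / rho
    # the subproblem is (1/rho)-strongly convex, hence ||x - J_rho(z)|| <= rho*||g||
    while rho * np.linalg.norm(g) > eps:
        x -= g / L
        g = Q @ x + q + (x - z) / rho
    return x

z = np.zeros(n)
for s in range(15):                    # z^{s+1} ~ J_{rho_s}(z^s) with tolerance eps_s
    rho_s, eps_s = 2.0, 0.5 ** s       # summable inexactness levels
    z = approx_resolvent(z, rho_s, eps_s)

print("distance to the true minimizer:", np.linalg.norm(z + np.linalg.solve(Q, q)))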

Lemma 8

[39] Let \(\{z^{s}\}\) be the sequence generated by Algorithm 4. Then for any \(z^\star \) such that \(0\in {\mathcal {T}}(z^\star )\),

$$\begin{aligned} \left\| z^{s+1}- z^\star \right\| \le \left\| z^0- z^\star \right\| + \sum _{i= 0}^{s}\varepsilon _i \\ \left\| z^{s+1}- z^{s}\right\| \le \left\| z^0- z^\star \right\| + \sum _{i= 0}^{s}\varepsilon _i . \end{aligned}$$

We now give a stochastic generalization of Algorithm 4.

[Algorithm 5 (the stochastic generalization of Algorithm 4) appears here as an image in the original article.]

We then extend Lemma 8 for Algorithm 5.

Lemma 9

Let \(\{z^{s}\}\) be the sequence generated by Algorithm 5. Then for any \(z^\star \) such that \(0\in {\mathcal {T}}(z^\star )\),

$$\begin{aligned} \mathbb {E}\left[ \left\| z^{s+1}- z^\star \right\| \right] \le \left\| z^0- z^\star \right\| + \sum _{i= 0}^{s}\varepsilon _i \\ \mathbb {E}\left[ \left\| z^{s+1}- z^{s}\right\| \right] \le \left\| z^0- z^\star \right\| + \sum _{i= 0}^{s}\varepsilon _i \\ \left( \mathbb {E}\left[ \left\| z^{s+1}- z^\star \right\| ^2\right] \right) ^{1/2}\le \left\| z^0- z^\star \right\| + \sum _{i= 0}^{s}\varepsilon _i . \end{aligned}$$

Proof

By (92), we know that for all \(i\ge 0\)

$$\begin{aligned} \left\| z^{i+1}- z^\star \right\|&\le \left\| z^{i+ 1}- {\mathcal {J}}_{\rho _i}(z^i)\right\| + \left\| {\mathcal {J}}_{\rho _i}(z^i)- z^\star \right\| \\&\le \left\| z^{i+ 1}- {\mathcal {J}}_{\rho _i}(z^i)\right\| + \left\| z^i- z^\star \right\| . \end{aligned}$$

Taking expectation on both sides, we get

$$\begin{aligned} \mathbb {E}\left[ \left\| z^{i+1}- z^\star \right\| \right] \le \mathbb {E}\left[ \left\| z^{i+ 1}- {\mathcal {J}}_{\rho _i}(z^i)\right\| \right] + \mathbb {E}\left[ \left\| z^i- z^\star \right\| \right] . \end{aligned}$$

By the definition of \(z^{i+1}\), we have \(\left( \mathbb {E}\left\| z^{i+1}- {\mathcal {J}}_{\rho _i}(z^{i})\right\| \right) ^2\le \mathbb {E}\left\| z^{i+1}- {\mathcal {J}}_{\rho _i}(z^{i})\right\| ^2\le \varepsilon _i^2\) and therefore

$$\begin{aligned} \mathbb {E}\left[ \left\| z^{i+1}- z^\star \right\| \right] \le \varepsilon _i+ \mathbb {E}\left[ \left\| z^i- z^\star \right\| \right] . \end{aligned}$$

The first estimate is derived by summing the above inequality from \(i= 0\) to \(s\).

By (92), we know that for all \(s\ge 0\)

$$\begin{aligned} \left\| z^{s+1}- z^s\right\| \le \left\| z^{s+ 1}- {\mathcal {J}}_{\rho _s}(z^s)\right\| + \left\| {\mathcal {J}}_{\rho _s}(z^s)- z^s\right\| \le \left\| z^{s+ 1}- {\mathcal {J}}_{\rho _s}(z^s)\right\| + \left\| z^s- z^\star \right\| . \end{aligned}$$

Taking expectation on both sides,

$$\begin{aligned} \mathbb {E}\left[ \left\| z^{s+1}- z^s\right\| \right] \le \mathbb {E}\left[ \left\| z^{s+ 1}- {\mathcal {J}}_{\rho _s}(z^s)\right\| \right] + \mathbb {E}\left[ \left\| z^s- z^\star \right\| \right] \le \varepsilon _s+ \mathbb {E}\left[ \left\| z^s- z^\star \right\| \right] . \end{aligned}$$

Together with the first estimate, the second estimate is derived.

The third estimate is derived from (92):

$$\begin{aligned} 0&\le \left\| {\mathcal {J}}_{\rho _s}(z^{s})- z^{s}\right\| ^2\le \left\| z^{s}- z^\star \right\| ^2- \left\| {\mathcal {J}}_{\rho _s}(z^{s})- z^\star \right\| ^2 \\&= \left\| z^{s}- z^\star \right\| ^2- \left\| {\mathcal {J}}_{\rho _s}(z^{s})- z^{s+1}+ z^{s+1}- z^\star \right\| ^2 \\&\le \left\| z^{s}- z^\star \right\| ^2- \left\| z^{s+1}- z^\star \right\| ^2- \left\| {\mathcal {J}}_{\rho _s}(z^{s})- z^{s+1}\right\| ^2\\&\quad + 2 \left\| {\mathcal {J}}_{\rho _s}(z^{s})- z^{s+1}\right\| \left\| z^{s+1}- z^\star \right\| \end{aligned}$$

Taking expectation on both sides we have:

$$\begin{aligned} 0&\le \mathbb {E}\left[ \left\| z^{s}- z^\star \right\| ^2\right] - \mathbb {E}\left[ \left\| z^{s+1}- z^\star \right\| ^2\right] \\&\quad - \mathbb {E}\left[ \left\| {\mathcal {J}}_{\rho _s}(z^{s})- z^{s+1}\right\| ^2\right] + 2 \mathbb {E}\left[ \left\| {\mathcal {J}}_{\rho _s}(z^{s})- z^{s+1}\right\| \left\| z^{s+1}- z^\star \right\| \right] \\&\le \mathbb {E}\left[ \left\| z^{s}- z^\star \right\| ^2\right] - \mathbb {E}\left[ \left\| z^{s+1}- z^\star \right\| ^2\right] \\&\quad - \mathbb {E}\left[ \left\| {\mathcal {J}}_{\rho _s}(z^{s})- z^{s+1}\right\| ^2\right] + 2\left( \mathbb {E}\left[ \left\| {\mathcal {J}}_{\rho _s}(z^{s})- z^{s+1}\right\| ^2\right] \mathbb {E}\left[ \left\| z^{s+1}- z^\star \right\| ^2\right] \right) ^{1/2} \\&= \mathbb {E}\left[ \left\| z^{s}- z^\star \right\| ^2\right] - \left( \left( \mathbb {E}\left[ \left\| z^{s+1}- z^\star \right\| ^2\right] \right) ^{1/2}- \left( \mathbb {E}\left[ \left\| {\mathcal {J}}_{\rho _s}(z^{s})- z^{s+1}\right\| ^2\right] \right) ^{1/2}\right) ^2 \end{aligned}$$

where in the second inequality we use the Cauchy–Schwarz inequality \(\mathbb {E}[XY]\le (\mathbb {E}[X^2])^{1/2}(\mathbb {E}[Y^2])^{1/2}\). Therefore

$$\begin{aligned} \left( \mathbb {E}\left[ \left\| z^{s+1}- z^\star \right\| ^2\right] \right) ^{1/2}- \varepsilon _s&\le \left( \mathbb {E}\left[ \left\| z^{s+1}- z^\star \right\| ^2\right] \right) ^{1/2}- \left( \mathbb {E}\left[ \left\| {\mathcal {J}}_{\rho _s}(z^{s})- z^{s+1}\right\| ^2\right] \right) ^{1/2}\\&\le \left( \mathbb {E}\left[ \left\| z^{s}- z^\star \right\| ^2\right] \right) ^{1/2} \end{aligned}$$

Then summing up the latter inequalities from \(s= 0\) we obtain the third inequality. \(\square \)

1.2 Inexact ALM

We define the maximal monotone operator \({\mathcal {T}}_{l}\) as follows.

$$\begin{aligned} {\mathcal {T}}_l(x;\lambda )&=\left\{ (v;u): (v;-u)\in \partial L(x;\lambda )\right\} \\&=\left\{ \begin{pmatrix} \nabla f(x)+\partial g(x)+\nabla p(x)\lambda \\ -p(x)+\partial h^*(\lambda ) \end{pmatrix}\right\} \end{aligned}$$

In the following we denote

$$\begin{aligned} \begin{aligned} L^\star (y,\lambda ,\beta )&:=\min _x L(x;y,\lambda ,\beta ),\\ x^\star (y,\lambda , \beta )&:=\arg \min _x L(x;y,\lambda ,\beta ),p^\star (y,\lambda ,\beta ):=p(x^\star (y,\lambda ,\beta )). \end{aligned} \end{aligned}$$
(93)

We further let \(\varLambda ^\star (y,\lambda ,\beta ):=\varLambda (p^\star (y,\lambda ,\beta );\lambda ,\beta )\). By the first-order optimality condition and (18), we know that

$$\begin{aligned}&0\in \nabla f(x^\star (y,\lambda ,\beta ))+\partial g(x^\star (y,\lambda ,\beta ))\\&\qquad +\nabla p(x^\star (y, \lambda ,\beta )) \varLambda ^\star (y,\lambda ,\beta )+\beta (x^\star (y,\lambda ,\beta )-y) \end{aligned}$$

Secondly we know from (20) that

$$\begin{aligned} p^\star (y,\lambda ,\beta )-\beta (\varLambda ^\star (y,\lambda , \beta )-\lambda )\in \partial h^*(\varLambda ^\star (y,\lambda ,\beta )). \end{aligned}$$

It follows that

$$\begin{aligned} ( {\mathcal {I}}+\beta ^{-1}{\mathcal {T}}_l)^{-1}(y;\lambda )=(x^\star (y,\lambda ,\beta );\varLambda ^\star (y,\lambda ,\beta )) \end{aligned}$$
(94)

Lemma 10

For any \(x\in \mathbb {R}^n\) we have,

$$\begin{aligned} \begin{aligned} L(x;y,\lambda ,\beta )-L^\star (y,\lambda ,\beta )&\ge \frac{\beta }{2}\Vert x-x^\star (y,\lambda ,\beta )\Vert ^2\\&\quad +\frac{\beta }{2} \Vert \varLambda (p(x);\lambda ,\beta )- \varLambda (p^\star (y,\lambda ,\beta );\lambda ,\beta )\Vert ^2. \end{aligned} \end{aligned}$$
(95)

Proof

In this proof we fix \(y\in \mathbb {R}^n\), \(\lambda \in \mathbb {R}^d\) and \(\beta >0\). Recall the definitions in (93). Define

$$\begin{aligned} L(x, w; y, \lambda ,\beta ):= & {} f(x)+g(x)+ h(p(x)-w)+\frac{1}{2\beta }\Vert w\Vert ^2+\langle w,\lambda \rangle \\&+\frac{\beta }{2}\Vert x-y\Vert ^2-\frac{\beta }{2}\Vert x-x^\star (y,\lambda ,\beta )\Vert ^2. \end{aligned}$$

Then by (21),

$$\begin{aligned} \min _w L(x, w; y, \lambda ,\beta )= L(x;y,\lambda ,\beta )-\frac{\beta }{2}\Vert x-x^\star (y,\lambda ,\beta )\Vert ^2. \end{aligned}$$
(96)

Since \(L(x;y,\lambda ,\beta )-\frac{\beta }{2}\Vert x-x^\star (y,\lambda ,\beta )\Vert ^2\) is a convex function with \(x^\star (y,\lambda ,\beta )\) being a critical point, it follows that

$$\begin{aligned} \min _x \min _w L(x, w; y, \lambda ,\beta )=L^\star (y, \lambda ,\beta ). \end{aligned}$$
(97)

Denote

$$\begin{aligned} H(w;y, \lambda ,\beta ):=\min _x L(x, w;y, \lambda ,\beta ). \end{aligned}$$
(98)

In view of (22),

$$\begin{aligned} \begin{aligned} L(x;y, \lambda ,\beta )-\frac{\beta }{2}\Vert x-x^\star (y,\lambda ,\beta )\Vert ^2&=L(x, \beta (\varLambda (p(x); \lambda ,\beta )-\lambda ); y,\lambda ,\beta )\\&\overset{(98)}{\ge } H(\beta (\varLambda (p(x);\lambda ,\beta )-\lambda ); y,\lambda ,\beta ). \end{aligned} \end{aligned}$$
(99)

Note that

$$\begin{aligned} \min _w H(w;y,\lambda ,\beta )&=\min _w \min _x L(x,w;y,\lambda ,\beta )\nonumber \\&=\min _x \min _w L(x,w;y,\lambda ,\beta )\overset{(97)}{=}L^\star (y,\lambda ,\beta ). \end{aligned}$$
(100)

Denote \(\varLambda ^\star (y,\lambda ,\beta )=\varLambda (p^\star (y,\lambda ,\beta ); \lambda ,\beta )\). It follows that,

$$\begin{aligned}&H( \beta (\varLambda ^\star (y,\lambda ,\beta )-\lambda );y,\lambda ,\beta )\\&\quad \ge \min _w H(w;y,\lambda ,\beta )\overset{(100)}{=}L^\star (y,\lambda ,\beta )=L(x^\star ( y,\lambda ,\beta ); y,\lambda ,\beta ). \end{aligned}$$

Using again (99) with \(x=x^\star ( y,\lambda ,\beta )\) we deduce

$$\begin{aligned} H( \beta (\varLambda ^\star (y,\lambda ,\beta )-\lambda );y,\lambda ,\beta ) = \min _w H(w;y,\lambda ,\beta ). \end{aligned}$$
(101)

Moreover, it follows from Lemma 7 that \(H( w;y, \lambda ,\beta )\) is \(1/\beta \)-strongly convex with respect to w. Thus,

$$\begin{aligned}&L(x;y,\lambda ,\beta )-L^\star (y,\lambda ,\beta )-\frac{\beta }{2}\Vert x-x^\star (y,\lambda ,\beta )\Vert ^2\\&\quad \overset{(99)+(100)}{\ge } H(\beta (\varLambda (p(x);\lambda ,\beta )-\lambda );y,\lambda ,\beta )-\min _w H(w;y,\lambda ,\beta ) \\&\quad \overset{(101)}{\ge } \quad \frac{1}{2\beta }\Vert \beta (\varLambda (p(x);\lambda ,\beta )-\lambda )-\beta (\varLambda ^\star (y,\lambda ,\beta )-\lambda ) \Vert ^2\\&\quad =\frac{\beta }{2} \Vert \varLambda (p(x);\lambda ,\beta )- \varLambda ^\star (y,\lambda ,\beta )\Vert ^2. \end{aligned}$$

\(\square \)

We can then establish the following well known link between inexact ALM and inexact PPA.

Proposition 3

(Compare with [39]) Algorithm 1 is a special case of Algorithm 5 with \({\mathcal {T}}={\mathcal {T}}_l\), \(\rho _s=1/\beta _s\) and \(\varepsilon _s= \sqrt{2\epsilon _s/\beta _s}\).

Proof

This follows from (94) and Lemma 10. \(\square \)
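For concreteness, here is a minimal Python sketch of how an inexact proximal augmented Lagrangian step looks in the special case \(p(x)=Ax-b\) and \(h\) the indicator of \(\{0\}\) (a linear equality constraint), for which \(h(u;\lambda ,\beta )=\langle \lambda ,u \rangle +\Vert u\Vert ^2/(2\beta )\) and \(\varLambda (u;\lambda ,\beta )=\lambda +u/\beta \). The toy data, the plain gradient-descent inner solver, its stopping rule and the constant choice of \(\beta _s\) are all our own simplifications for illustration; they are not the adaptive rule analysed in the paper.

import numpy as np

rng = np.random.default_rng(2)
n, m = 20, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
# Toy instance: min 0.5*||x||^2  s.t.  Ax = b,  i.e. f(x) = 0.5*||x||^2, g = 0,
# p(x) = Ax - b and h the indicator of {0}.

beta = 1.0     # kept constant here only to keep the toy subproblems well conditioned
x, lam = np.zeros(n), np.zeros(m)
for s in range(30):
    y, eps_s = x.copy(), 0.5 ** s      # proximal centre y = x^{s-1}, target accuracy
    # Subproblem: min_x 0.5*||x||^2 + <lam, Ax-b> + ||Ax-b||^2/(2*beta) + (beta/2)*||x-y||^2,
    # solved by gradient descent until the beta-strong-convexity error bound is <= eps_s.
    L = 1.0 + np.linalg.norm(A, 2) ** 2 / beta + beta
    grad = x + A.T @ (lam + (A @ x - b) / beta) + beta * (x - y)
    while np.linalg.norm(grad) / beta > eps_s:
        x -= grad / L
        grad = x + A.T @ (lam + (A @ x - b) / beta) + beta * (x - y)
    lam += (A @ x - b) / beta          # multiplier update Lambda(p(x^s); lam^s, beta_s)

print("feasibility ||Ax - b|| :", np.linalg.norm(A @ x - b))
print("stationarity ||x + A^T lam|| :", np.linalg.norm(x + A.T @ lam))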

Missing proofs

1.1 Proofs in Section 2.2

Proof of Lemma 2

The convexity of \({\tilde{\psi }}\) follows from (21) and Lemma 6 with \(\psi (w):=\frac{1}{2\beta }\Vert w\Vert ^2+\langle w,\lambda \rangle \). The gradient formula follows from (18).

Proof of Lemma 3

This is a direct consequence of Proposition 3 and Lemma 9.

Proof of Corollary 1

By Lemma 3, we have

$$\begin{aligned} \mathbb {E}\left[ \left\| (x^{s},\lambda ^{s+ 1})- (x^{s-1},\lambda ^{s})\right\| \right] \le \left\| (x^{-1},\lambda ^0)- (x^\star ,\lambda ^\star )\right\| + \frac{2\sqrt{\epsilon _0/\beta _0}}{1-\sqrt{\eta /\rho }},\forall s\ge 0, \end{aligned}$$

and

$$\begin{aligned} \mathbb {E}\left[ \left\| (x^{s},\lambda ^{s+ 1})- (x^\star ,\lambda ^\star )\right\| ^2\right] \le \left( \left\| (x^{-1},\lambda ^0)- (x^\star ,\lambda ^\star )\right\| + \frac{2\sqrt{\epsilon _0/\beta _0}}{1-\sqrt{\eta /\rho }} \right) ^2,\forall s\ge 0. \end{aligned}$$

Consequently,

$$\begin{aligned} \mathbb {E}\left[ \left\| \lambda ^{s+ 1}-\lambda ^{s}\right\| \right] \le \left\| (x^{-1},\lambda ^0)- (x^\star ,\lambda ^\star )\right\| + \frac{2\sqrt{\epsilon _0/\beta _0}}{1-\sqrt{\eta /\rho }},\forall s\ge 0, \end{aligned}$$

and

$$\begin{aligned}&\max \left( \mathbb {E}\left[ \left\| x^{s}-x^\star \right\| ^2\right] , \mathbb {E}\left[ \left\| \lambda ^{s+1}-\lambda ^\star \right\| ^2\right] \right) \\&\quad \le \left( \left\| (x^{-1},\lambda ^0)- (x^\star ,\lambda ^\star )\right\| + \frac{2\sqrt{\epsilon _0/\beta _0}}{1-\sqrt{\eta /\rho }} \right) ^2,\forall s\ge 0. \end{aligned}$$

We then conclude.

Proof of Theorem 1

First,

$$\begin{aligned} \begin{array}{ll} h_1(p_1(x^s))- h(p(x^s);\lambda ^s,\beta _s) &{}\overset{(23)}{=} h_1(p_1(x^s))-h_1(p_1(x^s)-\beta _s(\lambda _1^{s+1}-\lambda _1^s))\\ &{}\qquad -\frac{\beta _s}{2}(\Vert \lambda ^{s+1}\Vert ^2-\Vert \lambda ^s\Vert ^2)\\ {} &{}\le L_{h_1} \beta _s \Vert \lambda _1^{s+1}-\lambda _1^s\Vert +\frac{\beta _s}{2}(\Vert \lambda ^{s}\Vert ^2-\Vert \lambda ^{s+1}\Vert ^2). \end{array} \end{aligned}$$
(102)

Then we know that

$$\begin{aligned} F(x^s)-L(x^s; x^{s-1}, \lambda ^s, \beta _s)&= h_1(p_1(x^s))- h(p(x^s);\lambda ^s,\beta _s)-\frac{\beta _s}{2}\Vert x^{s}-x^{s-1}\Vert ^2\\&\overset{(102)}{\le } L_{h_1} \beta _s \Vert \lambda _1^{s+1}-\lambda _1^s\Vert +\frac{\beta _s}{2}(\Vert \lambda ^{s}\Vert ^2-\Vert \lambda ^{s+1}\Vert ^2)\\&\quad -\frac{\beta _s}{2}\Vert x^{s}-x^{s-1}\Vert ^2. \end{aligned}$$

Since \(H_s(\cdot )\) is \(\beta _s\)-strongly convex, we know that

$$\begin{aligned} L^\star (x^{s-1}, \lambda ^s,\beta _s)&\le L(x^\star ; x^{s-1}, \lambda ^s, \beta _s)-\frac{\beta _s}{2}\Vert x^\star -x^\star (x^{s-1}, \lambda ^s, \beta _s)\Vert ^2 \\&\overset{(21)}{\le } F^\star +\frac{\beta _s}{2}\Vert x^\star -x^{s-1}\Vert ^2-\frac{\beta _s}{2}\Vert x^\star -x^\star (x^{s-1}, \lambda ^s, \beta _s)\Vert ^2. \end{aligned}$$

Combining the latter two bounds we get

$$\begin{aligned} F(x^s)-F^\star&\le L(x^s; x^{s-1}, \lambda ^s, \beta _s)-L^\star (x^{s-1}, \lambda ^s,\beta _s)+ L_{h_1} \beta _s \Vert \lambda _1^{s+1}-\lambda _1^s\Vert \\&\quad +\frac{\beta _s}{2}(\Vert \lambda ^{s}\Vert ^2-\Vert \lambda ^{s+1}\Vert ^2)+\frac{\beta _s}{2}\Vert x^\star -x^{s-1}\Vert ^2 \\&\quad -\frac{\beta _s}{2}\Vert x^\star -x^\star (x^{s-1}, \lambda ^s, \beta _s)\Vert ^2 -\frac{\beta _s}{2}\Vert x^{s}-x^{s-1}\Vert ^2. \end{aligned}$$

Furthermore, by convexity of \(h_1(\cdot )\),

$$\begin{aligned} \inf _x F(x)+ \langle \lambda _2^\star , p_2(x) \rangle -h_2^*(\lambda _2^\star ) \ge \inf _x f(x)+g(x)+\langle \lambda ^\star , p(x) \rangle -h^*(\lambda ^\star )=D(\lambda ^\star ). \end{aligned}$$

Now we apply the strong duality assumption (11) to obtain:

$$\begin{aligned} F(x^s)+ \langle \lambda _2^\star , p_2(x^s) \rangle -h_2^*(\lambda _2^\star )\ge \inf _x F(x)+ \langle \lambda _2^\star , p_2(x) \rangle -h_2^*(\lambda _2^\star ) \ge F^\star . \end{aligned}$$

Consequently,

$$\begin{aligned}&F(x^s)-F^\star \ge \langle \lambda _2^\star , -p_2(x^s) \rangle + h_2^*(\lambda _2^\star ) \\&\quad \ge \sup _v \langle \lambda _2^\star , v-p_2(x^s) \rangle -h_2(v) \ge -\Vert \lambda _2^\star \Vert {\text {dist}}(p_2(x^s),{\mathcal {K}}). \end{aligned}$$

From (20) we know

$$\begin{aligned} p_2(x^s)-\beta _s(\lambda _2^{s+1}-\lambda _2^s)\in {\mathcal {K}}, \end{aligned}$$

and thus

$$\begin{aligned} {\text {dist}}(p_2(x^s), {\mathcal {K}})\le \beta _s\Vert \lambda _2^{s+1}-\lambda _2^s\Vert . \end{aligned}$$

Proof of Corollary 2

Taking expectation on both sides of the bounds in Theorem 1 we have:

$$\begin{aligned}&\mathbb {E}\left[ F(x^s)-F^\star \right] \le \epsilon _s+ L_{h_1} \beta _s\left( \mathbb {E}\left\| \lambda ^{s+ 1}_1\right\| + \mathbb {E}\left\| \lambda ^s_1\right\| \right) \\&\quad +\frac{\beta _s}{2}\mathbb {E}\left[ \left\| \lambda ^s\right\| ^2\right] + \frac{\beta _s}{2}\mathbb {E}\left[ \left\| x^s- x^{s- 1}\right\| ^2\right] , \\&\mathbb {E}\left[ F(x^s)-F^\star \right] \ge -\beta _s\Vert \lambda _2^\star \Vert \mathbb {E}\left[ \left\| \lambda ^{s+ 1}- \lambda ^s\right\| \right] ,\\&\mathbb {E}[{\text {dist}}(p_2(x^s), {\mathcal {K}})]\le \beta _s \mathbb {E}\left[ \left\| \lambda ^{s+ 1}- \lambda ^s\right\| \right] . \end{aligned}$$

By condition (a) in Assumption 1, we have for all \(s\ge 0\), \(\lambda _1^s\in {\text {dom}}(h_1^*)\) and \(\left\| \lambda _1^s\right\| \le L_{h_1}\) due to [9, Proposition 4.4.6]. Then using Corollary 1, the above bounds can be relaxed as:

$$\begin{aligned}&\mathbb {E}[ F(x^s)-F^\star ] \le \epsilon _s+ 2L_{h_1}^2 \beta _s +c_0\beta _s, \\&\mathbb {E}[F(x^s)-F^\star ] \ge -\beta _s\Vert \lambda _2^\star \Vert \sqrt{c_0},\\&\mathbb {E}[{\text {dist}}(p_2(x^s), {\mathcal {K}})]\le \beta _s \sqrt{c_0}. \end{aligned}$$

We then conclude by noting that (32) guarantees

$$\begin{aligned} \max (\epsilon _0+ 2L_{h_1}^2 \beta _0 +c_0\beta _0, \beta _0\Vert \lambda _2^\star \Vert \sqrt{c_0}, \beta _0 \sqrt{c_0})\le \epsilon \rho ^{-s}. \end{aligned}$$
(103)

1.2 Proof of Proposition 1

This section is devoted to the proof of Proposition 1.

Lemma 11

For any \(x\in \mathbb {R}^n\), \(\lambda ,\lambda '\in \mathbb {R}^d\) and \(\beta ,\beta '\in \mathbb {R}_+\) we have,

$$\begin{aligned} \begin{aligned}&L(x;y,\lambda ,\beta )-L(x;y',\lambda ',\beta ')+\frac{\beta }{2}\Vert \varLambda (p(x); \lambda ,\beta )-\lambda \Vert ^2 \\&\qquad -\frac{\beta '}{2}\Vert \varLambda (p(x);\lambda ',\beta ')-\lambda '\Vert ^2\\&\quad \le \langle \varLambda (p(x);\lambda ,\beta )-\varLambda (p(x);\lambda ',\beta '), \beta '( \varLambda (p(x);\lambda ',\beta ')-\lambda ') \rangle \\&\qquad +\frac{\beta }{2}\Vert x-y\Vert ^2-\frac{\beta '}{2}\Vert x-y'\Vert ^2, \end{aligned} \end{aligned}$$
(104)

and

$$\begin{aligned} \begin{aligned}&L(x;y,\lambda ,\beta )-L(x;y',\lambda ',\beta ')+\frac{\beta }{2}\Vert \varLambda (p(x);\lambda ,\beta )-\lambda \Vert ^2\\&\qquad -\frac{\beta '}{2}\Vert \varLambda (p(x);\lambda ',\beta ')-\lambda '\Vert ^2\\&\quad \ge \langle \varLambda (p(x);\lambda ,\beta )-\varLambda (p(x);\lambda ',\beta '), \beta ( \varLambda (p(x);\lambda ,\beta )-\lambda ) \rangle \\&\qquad +\frac{\beta }{2}\Vert x-y\Vert ^2-\frac{\beta '}{2}\Vert x-y'\Vert ^2. \end{aligned} \end{aligned}$$
(105)

Proof

By the definitions (24), (16) and (17), we have

$$\begin{aligned}&L(x;y, \lambda ,\beta )-L(x;y',\lambda ',\beta ')+\frac{\beta }{2}\Vert \varLambda (p(x);\lambda ,\beta )-\lambda \Vert ^2-\frac{\beta '}{2}\Vert \varLambda (p(x);\lambda ',\beta ')-\lambda '\Vert ^2\\&\quad =\langle \varLambda (p(x);\lambda ,\beta )-\varLambda (p(x);\lambda ',\beta '), p(x) \rangle - h^*( \varLambda (p(x);\lambda ,\beta ))\\&\qquad + h^*( \varLambda (p(x);\lambda ',\beta '))+\frac{\beta }{2}\Vert x-y\Vert ^2-\frac{\beta '}{2}\Vert x-y'\Vert ^2. \end{aligned}$$

Next we apply (20) to get

$$\begin{aligned} h^*( \varLambda (p(x);\lambda ',\beta '))&\ge h^*( \varLambda (p(x);\lambda ,\beta ))\\&\quad +\langle \varLambda (p(x);\lambda ,\beta )-\varLambda (p(x);\lambda ',\beta ') ,\beta (\varLambda (p(x);\lambda , \beta )-\lambda )-p(x) \rangle , \end{aligned}$$

and

$$\begin{aligned} h^*( \varLambda (p(x);\lambda ,\beta ))&\ge h^*( \varLambda (p(x);\lambda ',\beta '))\\&\quad +\langle \varLambda (p(x);\lambda ,\beta )\\&-\varLambda (p(x);\lambda ',\beta ') ,p(x)-\beta '(\varLambda (p(x);\lambda ', \beta ')-\lambda ') \rangle . \end{aligned}$$

Combining the first display with the second of these inequalities yields (104), and combining it with the first yields (105). \(\square \)

Lemma 12

Consider any \(u, \lambda , \lambda '\in \mathbb {R}^d\) and \(\beta ,\beta '\in \mathbb {R}_+\). Condition (a) in Assumption 1 ensures:

$$\begin{aligned} \begin{aligned}&\Vert \beta (\varLambda (u;\lambda , \beta )-\lambda )-\beta '(\varLambda (u;\lambda ', \beta ')-\lambda ')\Vert \\&\qquad \le \sqrt{ ((\beta +\beta ')L_{h_1} + \Vert \beta \lambda _1-\beta '\lambda _1'\Vert )^2+\Vert \beta \lambda _2-\beta '\lambda _2'\Vert ^2}. \end{aligned} \end{aligned}$$
(106)

Proof

Denote

$$\begin{aligned} \varLambda _i(u_i;\lambda _i, \beta ):=\arg \max _{\xi _i}\left\{ \langle \xi _i, u_i \rangle -h_i^*(\xi _i)- \frac{\beta }{2}\Vert \xi _i-\lambda _i\Vert ^2 \right\} , i=1,2, \end{aligned}$$
(107)

so that \(\varLambda (u;\lambda , \beta )=\left( \varLambda _1(u_1;\lambda _1, \beta ); \varLambda _{2}(u_{2};\lambda _{2}, \beta )\right) \). We can then decompose (20) into two independent conditions:

$$\begin{aligned} \varLambda _i(u_i;\lambda _i,\beta )\in \partial h_i(u_i-\beta (\varLambda _i(u_i;\lambda _i, \beta )-\lambda _i)),i=1,2. \end{aligned}$$
(108)

By condition (a) in Assumption 1,

$$\begin{aligned} \Vert \varLambda _1(u_1;\lambda _1, \beta )\Vert \le L_{h_1} \end{aligned}$$
(109)

which yields directly

$$\begin{aligned} \Vert \beta (\varLambda _1(u_1;\lambda _1, \beta )-\lambda _1)-\beta '(\varLambda _1(u_1;\lambda _1', \beta ')-\lambda _1')\Vert \le (\beta +\beta ')L_{h_1}+\Vert \beta \lambda _1-\beta '\lambda _1'\Vert . \end{aligned}$$
(110)

On the other hand, since \(h_2\) is an indicator function, \(\partial h_2\) is a cone and (108) implies

$$\begin{aligned} \beta \varLambda _2(u_2;\lambda _2,\beta )\in \partial h_2(u_2-\beta (\varLambda _2(u_2;\lambda _2, \beta )-\lambda _2)). \end{aligned}$$
(111)

The latter condition further leads to

$$\begin{aligned}&\langle \beta \varLambda _2(u_2;\lambda _2,\beta )-\beta '\varLambda _2(u_2;\lambda '_2,\beta '),\beta (\varLambda _2(u_2;\lambda _2, \beta )-\lambda _2)\\&\quad -\beta '(\varLambda _2(u_2;\lambda _2', \beta ')-\lambda _2') \rangle \le 0, \end{aligned}$$

which by the Cauchy–Schwarz inequality implies

$$\begin{aligned} \Vert \beta (\varLambda _2(u_2;\lambda _2,\beta )-\lambda _2)-\beta '(\varLambda _2(u_2;\lambda '_2,\beta ')-\lambda _2')\Vert \le \Vert \beta \lambda _2-\beta '\lambda _2'\Vert . \end{aligned}$$

Then (106) is obtained by simple algebra. \(\square \)

Remark 9

If

$$\begin{aligned} h(u)=\left\{ \begin{array}{ll}0 &{}\quad \mathrm {if~} u=b\\ +\infty &{} \quad \mathrm {otherwise } \end{array}\right. \end{aligned}$$

for some constant vector \(b\in \mathbb {R}^d\), then by (20) we have

$$\begin{aligned} u-\beta (\varLambda (u;\lambda , \beta )-\lambda )=b, \end{aligned}$$

for any \(u,\lambda \in \mathbb {R}^d\) and \(\beta \ge 0\). In this special case a refinement of Lemma 12 can be stated as follows:

$$\begin{aligned} \Vert \beta (\varLambda (u;\lambda , \beta )-\lambda )-\beta '(\varLambda (u;\lambda ', \beta ')-\lambda ')\Vert =0. \end{aligned}$$
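For completeness, the remark can be spelled out: solving the displayed relation for \(\varLambda \) gives

$$\begin{aligned} \varLambda (u;\lambda ,\beta )=\lambda +\frac{u-b}{\beta },\qquad \text {so that}\qquad \beta (\varLambda (u;\lambda ,\beta )-\lambda )=u-b, \end{aligned}$$

and since \(u-b\) does not depend on \((\lambda ,\beta )\), the difference \(\beta (\varLambda (u;\lambda , \beta )-\lambda )-\beta '(\varLambda (u;\lambda ', \beta ')-\lambda ')\) vanishes, which is exactly the stated refinement.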

Lemma 13

Consider any \(0<\beta /2<\beta '\) and any \(w,w',y,y'\in \mathbb {R}^n\). We have

$$\begin{aligned} \begin{aligned}&-\frac{\beta }{2}\Vert w'-w\Vert ^2+\frac{\beta }{2}\Vert w'-y\Vert ^2-\frac{\beta '}{2}\Vert w'-y'\Vert ^2\\&\qquad \le \frac{\beta }{2}\Vert w-y'\Vert ^2+ \frac{\beta (2\beta '+\beta )}{2(2\beta '-\beta )}\Vert y-y'\Vert ^2. \end{aligned} \end{aligned}$$
(112)

Proof

We first recall the following basic inequality:

$$\begin{aligned} \Vert u+v\Vert ^2\le (1+a)\Vert u\Vert ^2+(1+1/a)\Vert v\Vert ^2,\forall u,v\in \mathbb {R}^n, a>0. \end{aligned}$$
(113)

In view of (113) and the fact that \(\beta '>\beta /2\), we know that

$$\begin{aligned}&-\frac{\beta }{2}\Vert w'-w\Vert ^2\le \frac{\beta }{2}\Vert w-y'\Vert ^2-\frac{\beta }{4}\Vert w'-y'\Vert ^2,\\&-\frac{\beta '+\beta /2}{2}\Vert w'-y'\Vert ^2\le \frac{\beta (2\beta '+\beta )}{2(2\beta '-\beta )}\Vert y-y'\Vert ^2 -\frac{\beta }{2}\Vert w'-y\Vert ^2. \end{aligned}$$

Combining the latter two inequalities we get (112). \(\square \)

Using the above four lemmas, we establish a relation between \(L(x; y',\lambda ', \beta ') -L^\star (y',\lambda ',\beta ')\) and \(L(x; y, \lambda , \beta )-L^\star (y,\lambda ,\beta )\).

Proposition 4

For any \(x,y,y'\in \mathbb {R}^n\), \(\lambda ,\lambda '\in \mathbb {R}^d\) and \(0<\beta /2<\beta '\), we have

$$\begin{aligned}&L(x; y',\lambda ', \beta ') -L^\star (y',\lambda ',\beta ')- \left( L(x; y, \lambda , \beta )-L^\star (y,\lambda ,\beta )\right) \nonumber \\&\quad \le \Vert \lambda -\lambda '\Vert \sqrt{ ((\beta +\beta ')L_{h_1} + \Vert \beta \lambda _1-\beta '\lambda _1'\Vert )^2+\Vert \beta \lambda _2-\beta '\lambda _2'\Vert ^2}\nonumber \\&\qquad +{\beta }\Vert \lambda -\lambda '\Vert ^2+ \frac{\beta -\beta '}{2}\Vert \varLambda (p(x); \lambda ',\beta ')-\lambda '\Vert ^2\nonumber \\&\qquad +\frac{\beta '-\beta }{2}\Vert \varLambda (p^\star (y',\lambda ',\beta '); \lambda ,\beta )-\lambda \Vert ^2\nonumber \\&\qquad +\frac{ \beta }{2} \Vert \varLambda (p^\star (y,\lambda ,\beta );\lambda ,\beta )-\varLambda (p(x);\lambda ,\beta )\Vert ^2\nonumber \\&\qquad +\frac{\beta }{2}\Vert x^\star (y,\lambda ,\beta )-y'\Vert ^2 +\frac{\beta (2\beta '+\beta )}{2(2\beta '-\beta )}\Vert y-y'\Vert ^2\nonumber \\&\qquad -\frac{\beta }{2}\Vert x-y\Vert ^2+\frac{\beta '}{2}\Vert x-y'\Vert ^2. \end{aligned}$$
(114)

Proof

We first separate \(L(x; y', \lambda ', \beta ') -L^\star (y',\lambda ',\beta ')\) into four parts:

$$\begin{aligned}&L(x; y',\lambda ', \beta ') -L^\star (y',\lambda ',\beta ')\\&\quad =\underbrace{L(x; y,\lambda , \beta )-L^\star (y,\lambda ,\beta )}_{\varDelta _1}+\underbrace{L(x;y', \lambda ', \beta ')-L(x; y,\lambda , \beta )}_{\varDelta _2}\\&\qquad +\underbrace{L(x^\star (y',\lambda ',\beta ');y, \lambda ,\beta ) -L^\star (y',\lambda ',\beta ')}_{\varDelta _3}\\&\quad +\underbrace{L^\star (y,\lambda ,\beta )-L(x^\star (y',\lambda ',\beta '); y,\lambda ,\beta )}_{\varDelta _4}. \end{aligned}$$

By Lemma 11,

$$\begin{aligned} \varDelta _2&\le \frac{\beta }{2}\Vert \varLambda (p(x);\lambda ,\beta )-\lambda \Vert ^2-\frac{\beta '}{2}\Vert \varLambda (p(x);\lambda ',\beta ')-\lambda '\Vert ^2-\frac{\beta }{2}\Vert x-y\Vert ^2\\&\quad +\frac{\beta '}{2}\Vert x-y'\Vert ^2+ \langle \varLambda (p(x);\lambda ',\beta ')-\varLambda (p(x);\lambda ,\beta ), \beta ( \varLambda (p(x);\lambda ,\beta )-\lambda ) \rangle , \end{aligned}$$

and

$$\begin{aligned} \varDelta _3&\le \frac{\beta '}{2}\Vert \varLambda (p^\star (y',\lambda ',\beta ');\lambda ',\beta ')-\lambda '\Vert ^2-\frac{\beta }{2}\Vert \varLambda (p^\star (y',\lambda ',\beta ');\lambda ,\beta )-\lambda \Vert ^2\\&+ \langle \varLambda (p^\star (y',\lambda ',\beta ');\lambda ,\beta )-\varLambda (p^\star (y',\lambda ',\beta ');\lambda ',\beta '),\\&\beta '( \varLambda (p^\star (y',\lambda ',\beta ');\lambda ',\beta ')-\lambda ') \rangle \\&\quad +\frac{\beta }{2}\Vert x^\star (y',\lambda ',\beta ')-y\Vert ^2-\frac{\beta '}{2}\Vert x^\star (y',\lambda ',\beta ')-y'\Vert ^2. \end{aligned}$$

We then get

$$\begin{aligned} \varDelta _2+\varDelta _3&\le -\frac{\beta }{2}\Vert \varLambda (p(x);\lambda ,\beta )-\lambda \Vert ^2-\frac{\beta '}{2}\Vert \varLambda (p(x);\lambda ',\beta ')-\lambda '\Vert ^2\\&\quad + \langle \varLambda (p(x);\lambda ',\beta ') -\lambda ', \beta ( \varLambda (p(x);\lambda ,\beta )-\lambda ) \rangle \\&\quad -\frac{\beta '}{2}\Vert \varLambda (p^\star (y',\lambda ',\beta ');\lambda ',\beta ')-\lambda '\Vert ^2\\&\quad -\frac{\beta }{2}\Vert \varLambda (p^\star (y',\lambda ',\beta ');\lambda ,\beta )-\lambda \Vert ^2\\&\quad + \langle \varLambda (p^\star (y',\lambda ',\beta ');\lambda ,\beta )-\lambda , \beta '( \varLambda (p^\star (y',\lambda ',\beta ');\lambda ',\beta ')-\lambda ') \rangle \\&\quad +\langle \lambda -\lambda ',\beta '( \varLambda (p^\star (y',\lambda ',\beta ');\lambda ', \beta ')-\lambda ') -\beta ( \varLambda (p(x);\lambda ,\beta )-\lambda ) \rangle \\&\quad -\frac{\beta }{2}\Vert x-y\Vert ^2+\frac{\beta '}{2}\Vert x-y'\Vert ^2 +\frac{\beta }{2}\Vert x^\star (y',\lambda ',\beta ')-y\Vert ^2\\&\quad -\frac{\beta '}{2}\Vert x^\star (y',\lambda ',\beta ')-y'\Vert ^2\\&\le \frac{\beta -\beta '}{2}\Vert \varLambda (p(x);\lambda ',\beta ') -\lambda '\Vert ^2+ \frac{\beta '-\beta }{2}\Vert \varLambda (p^\star (y',\lambda ',\beta ');\lambda ,\beta )-\lambda \Vert ^2\\&\quad +\langle \lambda -\lambda ',\beta '( \varLambda (p^\star (y',\lambda ',\beta ');\lambda ',\beta ')-\lambda ') -\beta ( \varLambda (p(x);\lambda ,\beta )-\lambda ) \rangle \\&\quad -\frac{\beta }{2}\Vert x-y\Vert ^2+\frac{\beta '}{2}\Vert x-y'\Vert ^2+\frac{\beta }{2}\Vert x^\star (y',\lambda ',\beta ') -y\Vert ^2\\&\quad -\frac{\beta '}{2}\Vert x^\star (y',\lambda ',\beta ')-y'\Vert ^2, \end{aligned}$$

where the last inequality simply relies on \(2\langle x,y \rangle \le \Vert x\Vert ^2+\Vert y\Vert ^2\). Further, according to Lemma 10,

$$\begin{aligned} \varDelta _4&\le -\frac{\beta }{2} \Vert \varLambda (p^\star (y',\lambda ',\beta ');\lambda ,\beta )- \varLambda (p^\star (y,\lambda ,\beta );\lambda ,\beta )\Vert ^2\\&\quad -\frac{\beta }{2}\Vert x^\star (y',\lambda ',\beta ')- x^\star (y,\lambda ,\beta )\Vert ^2. \end{aligned}$$

Therefore,

$$\begin{aligned}&\varDelta _2+\varDelta _3+\varDelta _4 - \frac{\beta -\beta '}{2}\Vert \varLambda (p(x);\lambda ',\beta ') -\lambda '\Vert ^2- \frac{\beta '-\beta }{2}\Vert \varLambda (p^\star (y',\lambda ',\beta ');\lambda ,\beta )-\lambda \Vert ^2\nonumber \\&\quad \le \langle \lambda -\lambda ',\beta '( \varLambda (p^\star (y',\lambda ',\beta ');\lambda ',\beta ')-\lambda ') -\beta ( \varLambda (p(x);\lambda ,\beta )-\lambda ) \rangle \nonumber \\&\qquad -\frac{\beta }{2} \Vert \varLambda (p^\star (y',\lambda ',\beta ');\lambda ,\beta )- \varLambda (p^\star (y,\lambda ,\beta );\lambda ,\beta )\Vert ^2\nonumber \\&\qquad -\frac{\beta }{2}\Vert x^\star (y',\lambda ',\beta ')- x^\star (y,\lambda ,\beta )\Vert ^2-\frac{\beta }{2}\Vert x-y\Vert ^2+\frac{\beta '}{2}\Vert x-y'\Vert ^2\nonumber \\&\qquad +\frac{\beta }{2}\Vert x^\star (y',\lambda ',\beta ')-y\Vert ^2-\frac{\beta '}{2}\Vert x^\star (y',\lambda ',\beta ')-y'\Vert ^2\nonumber \\&\quad =\langle \lambda -\lambda ',\beta '( \varLambda (p^\star (y',\lambda ',\beta ');\lambda ',\beta ')-\lambda ') -\beta ( \varLambda (p^\star (y',\lambda ',\beta ');\lambda ,\beta )-\lambda ) \rangle \nonumber \\&\qquad +\beta \langle \lambda -\lambda ',\varLambda (p^\star (y,\lambda ,\beta );\lambda ,\beta )-\varLambda (p(x);\lambda ,\beta ) \rangle \nonumber \\&\qquad +\beta \langle \lambda -\lambda ', \varLambda (p^\star (y',\lambda ',\beta ');\lambda ,\beta )-\varLambda (p^\star (y,\lambda ,\beta );\lambda ,\beta ) \rangle \nonumber \\&\qquad -\frac{\beta }{2} \Vert \varLambda (p^\star (y',\lambda ',\beta ');\lambda ,\beta )- \varLambda (p^\star (y,\lambda ,\beta );\lambda ,\beta )\Vert ^2\nonumber \\&\qquad -\frac{\beta }{2}\Vert x^\star (y',\lambda ',\beta ')- x^\star (y,\lambda ,\beta )\Vert ^2-\frac{\beta }{2}\Vert x-y\Vert ^2+\frac{\beta '}{2}\Vert x-y'\Vert ^2\nonumber \\&\qquad +\frac{\beta }{2}\Vert x^\star (y',\lambda ',\beta ')-y\Vert ^2-\frac{\beta '}{2}\Vert x^\star (y',\lambda ',\beta ')-y'\Vert ^2\nonumber \\&\quad \le \Vert \lambda -\lambda '\Vert \sqrt{ ((\beta +\beta ')L_{h_1} + \Vert \beta \lambda _1-\beta '\lambda _1'\Vert )^2+\Vert \beta \lambda _2-\beta '\lambda _2'\Vert ^2}+{\beta }\Vert \lambda -\lambda '\Vert ^2 \nonumber \\&\qquad +\frac{ \beta }{2} \Vert \varLambda (p^\star (y,\lambda ,\beta );\lambda ,\beta )-\varLambda (p(x);\lambda ,\beta )\Vert ^2-\frac{\beta }{2}\Vert x^\star (y',\lambda ',\beta ')- x^\star (y,\lambda ,\beta )\Vert ^2\nonumber \\&\qquad -\frac{\beta }{2}\Vert x-y\Vert ^2+\frac{\beta '}{2}\Vert x-y'\Vert ^2+\frac{\beta }{2}\Vert x^\star (y',\lambda ',\beta ')-y\Vert ^2 -\frac{\beta '}{2}\Vert x^\star (y',\lambda ',\beta ')-y'\Vert ^2, \end{aligned}$$
(115)

where the last inequality follows from Lemma 12 and the Cauchy–Schwarz inequality. Now we apply Lemma 13 with \(w=x^\star (y,\lambda ,\beta )\) and \(w'=x^\star (y',\lambda ',\beta ')\) to obtain:

$$\begin{aligned}&-\frac{\beta }{2}\Vert x^\star (y',\lambda ',\beta ')-x^\star (y,\lambda ,\beta )\Vert ^2+\frac{\beta }{2}\Vert x^\star (y',\lambda ',\beta ')\nonumber \\&\quad -y\Vert ^2-\frac{\beta '}{2}\Vert x^\star (y',\lambda ',\beta ')-y'\Vert ^2\nonumber \\&\quad \le \frac{\beta }{2}\Vert x^\star (y,\lambda ,\beta )-y'\Vert ^2+ \frac{\beta (2\beta '+\beta )}{2(2\beta '-\beta )}\Vert y-y'\Vert ^2. \end{aligned}$$
(116)

Plugging (116) into (115), we derive (114). \(\square \)

Now we are ready to give a proof for Proposition 1.

Proof of Proposition 1

We apply Proposition 4 with \(\lambda =\lambda ^s\), \(\lambda '=\lambda ^{s+1}\), \(\beta =\beta _s\), \(\beta '=\beta _{s+1}\), \(x=x^s\), \(y=x^{s-1}\) and \(y'=x^s\) to obtain

$$\begin{aligned}&H_{s+1}(x^s) -H_{s+1}^\star -\left( H_{s}(x^s) -H_{s}^\star \right) \\&\quad \le \Vert \lambda ^s-\lambda ^{s+1}\Vert \sqrt{ \left( (\beta _s+\beta _{s+1})L_{h_1}+\Vert \beta _s \lambda _1^s-\beta _{s+1}\lambda _1^{s+1}\Vert \right) ^2 +\Vert \beta _s \lambda _2^s-\beta _{s+1}\lambda _2^{s+1}\Vert ^2 }\\&\qquad +{\beta _s}\Vert \lambda ^s-\lambda ^{s+1}\Vert ^2+ \frac{\beta _s-\beta _{s+1}}{2}\Vert \varLambda (p(x^s); \lambda ^{s+1},\beta _{s+1})-\lambda ^{s+1}\Vert ^2\\&\qquad + \frac{\beta _{s+1}-\beta _s}{2}\Vert \varLambda (p^\star (x^s,\lambda ^{s+1},\beta _{s+1}); \lambda ^s,\beta _s)-\lambda ^s\Vert ^2\\&\qquad +\frac{ \beta _s}{2} \Vert \varLambda (p^\star (x^{s-1},\lambda ^s,\beta _s);\lambda ^s,\beta _s) -\varLambda (p(x^s);\lambda ^{s},\beta _s)\Vert ^2\\&\qquad +\frac{\beta _s}{2}\Vert x^\star (x^{s-1},\lambda ^s,\beta _s)-x^s\Vert ^2+ \frac{\beta _s(2\beta _{s+1}+\beta _s)}{2(2\beta _{s+1}-\beta _s)}\Vert x^{s-1}-x^s\Vert ^2 -\frac{\beta _s}{2}\Vert x^s-x^{s-1}\Vert ^2. \end{aligned}$$

We apply Lemma 10 with \(x=x^s\), \(y=x^{s-1}\), \(\lambda =\lambda ^s\) and \(\beta =\beta _s\) and get:

$$\begin{aligned}&\frac{ \beta _s}{2} \Vert \varLambda (p^\star (x^{s-1},\lambda ^s,\beta _s);\lambda ^s,\beta _s)-\varLambda (p(x^s);\lambda ^{s},\beta _s)\Vert ^2+\frac{\beta _s}{2}\Vert x^\star (x^{s-1},\lambda ^s,\beta _s)-x^s\Vert ^2\\&\quad \le H_{s}(x^s) -H_{s}^\star . \end{aligned}$$

Furthermore, since \(\beta _{s+1}\le \beta _s\) we have,

$$\begin{aligned} \frac{\beta _{s+1}-\beta _s}{2}\Vert \varLambda (p^\star (x^s,\lambda ^{s+1},\beta _{s+1}); \lambda ^s,\beta _s)-\lambda ^s\Vert ^2\le 0. \end{aligned}$$

We then derive (34) by the latter three bounds.

Remark 10

If

$$\begin{aligned} h(u)=\left\{ \begin{array}{ll}0 &{}\quad \mathrm {if~} u=b\\ +\infty &{} \quad \mathrm {otherwise } \end{array}\right. \end{aligned}$$

for some constant vector \(b\in \mathbb {R}^d\), for the reason stated in Remark 9, the number of inner iterations \(m_{s+1}\) in Algorithm 2 can be taken as the smallest integer satisfying

$$\begin{aligned}&2\epsilon _s+{{\beta _s}}\Vert \lambda ^{s+1}-\lambda ^{s}\Vert ^2+ \frac{\beta _s-\beta _{s+1}}{2}\Vert \varLambda (p(x^s); \lambda ^{s+1},\beta _{s+1})-\lambda ^{s+1}\Vert ^2+ \frac{\beta _s^2}{2\beta _{s+1}-\beta _s}\Vert x^{s-1}-x^s\Vert ^2 \\&\le {2^{\lfloor m_{s+1}/ K_{s+1}\rfloor }} {\epsilon _{s+1}/2 } . \end{aligned}$$

1.3 Proofs in Section 3.1

Proof of Corollary 3

By (36), we have

$$\begin{aligned} \mathbb {E}\left[ H_{s+1}\left( x^{s+1}\right) -H_{s+1}^\star |{\mathcal {F}}_s\right] \le {2^{-\lfloor m_{s+1}/K_{s+1}\rfloor }} \left( H_{s+1}(x^s)-H_{s+1}^\star \right) . \end{aligned}$$

Then we apply Proposition 1 and obtain

$$\begin{aligned} \begin{array}{ll} \mathbb {E}\left[ H_{s+1}\left( x^{s+1}\right) -H_{s+1}^\star | {\mathcal {F}}_s \right] &{}\le {2^{1-\lfloor m_{s+1}/K_{s+1}\rfloor }} \left( H_{s}(x^s)-H_{s}^\star \right) \\ {} &{} \quad + {2^{-\lfloor m_{s+1}/K_{s+1}\rfloor }} M_s . \end{array} \end{aligned}$$
(117)

If (38) holds, then

$$\begin{aligned} {2^{-\lfloor m_{s+1}/K_{s+1}\rfloor }} \le \frac{\epsilon _{s+1}}{4\epsilon _s}, {2^{-\lfloor m_{s+1}/K_{s+1}\rfloor }} M_s\le \frac{\epsilon _{s+1}}{2}. \end{aligned}$$

It follows that

$$\begin{aligned} \mathbb {E}\left[ H_{s+1}\left( x^{s+1}\right) -H_{s+1}^\star | {\mathcal {F}}_s\right] \le \frac{\epsilon _{s+1}}{2\epsilon _s} \left( H_{s}(x^s)-H_{s}^\star \right) + \frac{\epsilon _{s+1}}{2}. \end{aligned}$$

Then (39) is guaranteed by taking expectation on both sides of the last inequality.

1.4 Proofs in Section 3.2

Proof of Lemma 4

We first bound

$$\begin{aligned}&\mathbb {E}[ \left( (\beta _s+\beta _{s+1})L_{h_1}+\Vert \beta _s \lambda _1^s-\beta _{s+1}\lambda _1^{s+1}\Vert \right) ^2 +\Vert \beta _s \lambda _2^s-\beta _{s+1}\lambda _2^{s+1}\Vert ^2 ]\\&\quad \le 2(\beta _s+\beta _{s+1})^2 L^2_{h_1}+2\mathbb {E}[ \Vert \beta _s \lambda ^s-\beta _{s+1}\lambda ^{s+1}\Vert ^2]\\&\quad \le 2(\beta _s+\beta _{s+1})^2 L^2_{h_1}+4(\beta _s^2+\beta _{s+1}^2) c\\&\quad \le 4(\beta _s^2+\beta _{s+1}^2)(L^2_{h_1}+c). \end{aligned}$$

Since

$$\begin{aligned} \lambda ^{s+1}=\varLambda (p(x^s);\lambda ^{s},\beta _{s}), \end{aligned}$$

by Lemma 12 we have

$$\begin{aligned}&\Vert \beta _{s+1}\left( \varLambda (p(x^s);\lambda ^{s+1},\beta _{s+1})-\lambda ^{s+1}\right) -\beta _s(\lambda ^{s+1}-\lambda ^{s})\Vert \end{aligned}$$
(118)
$$\begin{aligned}&\quad \le \sqrt{ \left( (\beta _s+\beta _{s+1})L_{h_1}+\Vert \beta _s \lambda _1^s-\beta _{s+1}\lambda _1^{s+1}\Vert \right) ^2 +\Vert \beta _s \lambda _2^s-\beta _{s+1}\lambda _2^{s+1}\Vert ^2 }. \end{aligned}$$
(119)

Therefore,

$$\begin{aligned}&\Vert \varLambda (p(x^s);\lambda ^{s+1},\beta _{s+1})-\lambda ^{s+1}\Vert \\&\quad \le \beta _{s+1}^{-1} \beta _s \Vert \lambda ^{s+1}-\lambda ^s\Vert \\&\qquad + \beta _{s+1}^{-1}\sqrt{ \left( (\beta _s+\beta _{s+1})L_{h_1}+\Vert \beta _s \lambda _1^s-\beta _{s+1}\lambda _1^{s+1}\Vert \right) ^2 +\Vert \beta _s \lambda _2^s-\beta _{s+1}\lambda _2^{s+1}\Vert ^2 }. \end{aligned}$$

It follows that

$$\begin{aligned}&\mathbb {E}[ \Vert \varLambda (p(x^s);\lambda ^{s+1},\beta _{s+1})-\lambda ^{s+1}\Vert ^2]\le 2 \beta _{s+1}^{-2} \beta ^2_s c+8\beta _{s+1}^{-2 } (\beta _s^2+\beta _{s+1}^2)(L^2_{h_1}+c) \end{aligned}$$
(120)

By \(\mathbb {E}[XY]\le (\mathbb {E}[X^2])^{1/2}(\mathbb {E}[Y^2])^{1/2}\), we get

$$\begin{aligned} \begin{aligned}&\mathbb {E}\left[ \Vert \lambda ^{s+1}-\lambda ^{s}\Vert \sqrt{ \left( (\beta _s+\beta _{s+1})L_{h_1}+\Vert \beta _s \lambda _1^s-\beta _{s+1}\lambda _1^{s+1}\Vert \right) ^2 +\Vert \beta _s \lambda _2^s-\beta _{s+1}\lambda _2^{s+1}\Vert ^2 }\right] \\&\quad \le \sqrt{4c(\beta _s^2+\beta _{s+1}^2)(L^2_{h_1}+c)}. \end{aligned} \end{aligned}$$
(121)

Combining (44), (120) and (121), we then get an upper bound for \(\mathbb {E}[M_{s}]\):

$$\begin{aligned} \begin{array}{ll} \mathbb {E}[M_{s}]&{}\le {{\beta _s}}c+ \frac{\beta _s-\beta _{s+1}}{2}\left( 2 \beta _{s+1}^{-2} \beta ^2_s c+8\beta _{s+1}^{-2 } (\beta _s^2+\beta _{s+1}^2)(L^2_{h_1}+c)\right) \\ &{}\quad + \frac{\beta _s^2}{2\beta _{s+1}-\beta _s}c + \sqrt{4c(\beta _s^2+\beta _{s+1}^2)(L^2_{h_1}+c)} \\ &{}\le {{\beta _s}}c+ \beta _s\left( \beta _{s+1}^{-2} \beta ^2_s c+4\beta _{s+1}^{-2 } (\beta _s^2+\beta _{s+1}^2)(L^2_{h_1}+c)\right) \\ &{} \quad + \frac{\beta _s^2}{2\beta _{s+1}-\beta _s}c + 2 \beta _s \sqrt{c(1+\beta _{s+1}^2\beta _s^{-2})(L^2_{h_1}+c)} \\ &{} \le 2{{\beta _s}}c + \beta _s\left( \beta _{s+1}^{-2} \beta ^2_s c+(5+4\beta _{s+1}^{-2 } \beta _s^2+\beta _{s+1}^2 \beta _s^{-2})(L^2_{h_1}+c)\right) + \frac{\beta _s^2}{2\beta _{s+1}-\beta _s}c \end{array}, \end{aligned}$$

where the last inequality used \(2\sqrt{ab}\le a+b \) for any \(a,b>0\). Next we plug in \(\beta _s=\beta _0\rho ^s\) to obtain

$$\begin{aligned} \mathbb {E}[M_s]&\le \beta _s \left( 2c+\rho ^{-2}c+(9+\rho ^{-2})(L^2_{h_1}+c)+(2\rho -1)^{-1}c\right) \\&\le \beta _s \left( (11+2\rho ^{-2})(L^2_{h_1}+c)+(2\rho -1)^{-1}c\right) . \end{aligned}$$

Proof of Proposition 2

Since Algorithm 2 is a special case of Algorithm 1 with \(\beta _s=\beta _0 \rho ^s\) and \(\epsilon _s=\epsilon _0 \eta ^s\), we know from Corollary 1 that (44) holds with \(c=4c_0\). Applying Lemma 4 we know that

$$\begin{aligned} \mathbb {E}[M_s]\le C\beta _s , \end{aligned}$$

with \(C=(11+2\rho ^{-2})(L^2_{h_1}+4c_0)+4(2\rho -1)^{-1}c_0\). If \(m_{s+1}\) is the smallest integer satisfying (38), then

$$\begin{aligned} m_{s+1}\le K_{s+1}\left( \log _2 \left( 4\epsilon _s \epsilon ^{-1}_{s+1} + 2M_s \epsilon ^{-1}_{s+1}\right) +1\right) +1. \end{aligned}$$
(122)

By the concavity of the \(\log _2\) function we get

$$\begin{aligned} \mathbb {E}[m_{s+1}]&\le K_{s+1}\left( \log _2 \left( 4\epsilon _s \epsilon ^{-1}_{s+1} + 2C\beta _s \epsilon ^{-1}_{s+1}\right) +1\right) +1 \\&=K_{s+1}\left( \log _2 \left( 4 \eta ^{-1} + 2C\beta _0 \epsilon _0^{-1}\eta ^{-1}\rho ^s\eta ^{-s}\right) +1\right) +1. \end{aligned}$$

Since \(\rho >\eta \), we get

$$\begin{aligned} \mathbb {E}[m_{s+1}]&\le K_{s+1}\left( \log _2 \left( \left( 4 \eta ^{-1} + 2C\beta _0 \epsilon _0^{-1}\eta ^{-1}\right) \rho ^s\eta ^{-s}\right) +1\right) +1\\ {}&=K_{s+1}\left( \log _2 \left( 4 \eta ^{-1} + 2C\beta _0 \epsilon _0^{-1}\eta ^{-1}\right) +1+\log _2 \left( \rho ^s\eta ^{-s}\right) \right) +1\\&=K_{s+1}\left( s\log _2\left( \rho \eta ^{-1}\right) +c_2\right) + 1. \end{aligned}$$

Proof of Theorem 2

By Corollary 2, (48) holds if

$$\begin{aligned} s\ge \frac{\ln (c_1/\epsilon )}{\ln (1/\rho )}. \end{aligned}$$

Thus (48) is true for some integer s satisfying

$$\begin{aligned} s\le \frac{\ln (c_1/\epsilon )}{\ln (1/\rho )}+1=\frac{\ln (c_1/(\epsilon \rho ))}{\ln (1/\rho )}. \end{aligned}$$
(123)

Since \(\epsilon \le \epsilon _0\), we know that \(\epsilon \le c_1\) and

$$\begin{aligned} s\le \frac{\ln (c_1/(\epsilon \rho ))}{\ln (1/\rho )}=\frac{\ln (c^\ell _1/(\epsilon ^\ell \rho ^\ell ))}{\ell \ln (1/\rho )}\le \frac{c^\ell _1}{\epsilon ^\ell \rho ^\ell \ell \ln (1/\rho )}, \end{aligned}$$
(124)

where in the last inequality we used \(\ln a\le a\) for any \(a\ge 1\). In view of (47), we have

$$\begin{aligned} \sum _{t=1}^s K_t&\le \varsigma s+\frac{\omega }{\beta _0^\ell } \sum _{t=1}^s \rho ^{-\ell t}\le \varsigma s+\frac{\omega \rho ^{-\ell (s+1)}}{\beta _0^\ell (\rho ^{-\ell }-1)} \overset{(123)}{\le } \varsigma s+ \frac{\omega c_1^\ell }{\beta _0^\ell (1-\rho ^\ell )\rho ^\ell \epsilon ^\ell }\\&\overset{(124)}{\le } \left( \frac{ \varsigma c^\ell _1}{\rho ^\ell \ell \ln (1/\rho )}+ \frac{\omega c_1^\ell }{\beta _0^\ell (1-\rho ^\ell )\rho ^\ell } \right) \frac{1}{\epsilon ^\ell }. \end{aligned}$$

Then we apply Proposition 2 to obtain

$$\begin{aligned} \sum _{t=1}^s \mathbb {E}[m_t]&\le s \left( 1+\log _2({\rho }/{\eta })+c_2\right) \left( \frac{ \varsigma c^\ell _1}{\rho ^\ell \ell \ln (1/\rho )}+ \frac{\omega c_1^\ell }{\beta _0^\ell (1-\rho ^\ell )\rho ^\ell } \right) \frac{1}{\epsilon ^\ell } \\&\!\!\overset{(123)}{\le } \frac{1+\log _2({\rho }/{\eta })+c_2}{\ln (1/\rho )}\left( \frac{ \varsigma c^\ell _1}{\rho ^\ell \ell \ln (1/\rho )}+ \frac{\omega c_1^\ell }{\beta _0^\ell (1-\rho ^\ell )\rho ^\ell } \right) \frac{1}{\epsilon ^\ell }\ln \frac{c_1}{\epsilon \rho }. \end{aligned}$$

1.5 Proof in Section 5.1

Proof of Corollary 5

If \(K_s\) satisfies (62), then

$$\begin{aligned} K_s\le 2\sqrt{\frac{2(L\beta _0+\Vert A\Vert ^2)}{\mu _g\beta _s+\beta ^2_s}}+1\le \left\{ \begin{array}{ll} \frac{2\sqrt{2(L\beta _0+\Vert A\Vert ^2)/\mu _g}}{ \sqrt{\beta _s}}+1 &{} \mathrm {if} ~\mu _g>0 \\ \frac{2\sqrt{2(L\beta _0+\Vert A\Vert ^2)}}{{\beta _s}}+1 &{} \mathrm {if} ~\mu _g=0 \end{array}\right. \end{aligned}$$

We then apply Corollary 4.

The proofs of Corollaries 6 and 7 are similar.

1.6 Proofs in Section 5.2

We first state a useful lemma.

Lemma 14

For any \(u,\lambda \in \mathbb {R}^d\), \(\beta >0\),

$$\begin{aligned} \Vert \varLambda (u;\lambda ,\beta )\Vert \le L_{h_1}+ \beta ^{-1}{\text {dist}}(u_2+\beta \lambda _2, {\mathcal {K}}) \end{aligned}$$
(125)

Proof

From (21),

$$\begin{aligned}&h(u;\lambda ,\beta )=\min _z\left\{ h(z)+\frac{1}{2\beta }\Vert u+\beta \lambda -z\Vert ^2-\frac{\beta }{2}\Vert \lambda \Vert ^2 \right\} \end{aligned}$$
(126)

with optimal solution

$$\begin{aligned} z^\star =u+\beta \lambda -\beta \varLambda (u;\lambda ,\beta ). \end{aligned}$$

In particular, \({\text {dist}}(u_2+\beta \lambda _2, {\mathcal {K}})^2=\beta ^2\Vert \varLambda _2(u_2;\lambda _2,\beta )\Vert ^2\). Together with (109) we obtain the desired bound. \(\square \)
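As a concrete illustration, take \({\mathcal {K}}\) to be the nonnegative orthant (our choice here, not the general cone of the paper). Then, componentwise,

$$\begin{aligned} \varLambda _2(u_2;\lambda _2,\beta )=\min \left( \lambda _2+\beta ^{-1}u_2,\,0\right) ,\qquad z_2^\star =\max \left( u_2+\beta \lambda _2,\,0\right) , \end{aligned}$$

so that \({\text {dist}}(u_2+\beta \lambda _2, {\mathcal {K}})=\Vert \min (u_2+\beta \lambda _2,0)\Vert =\beta \Vert \varLambda _2(u_2;\lambda _2,\beta )\Vert \), in agreement with the identity used in the proof.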

Proof of Lemma 5

$$\begin{aligned}&\Vert \nabla p(x) \varLambda (p(x);\lambda ^s,\beta _s)-\nabla p(y) \varLambda (p(y);\lambda ^s,\beta _s) \Vert \\&\quad \le \Vert \nabla p(x)-\nabla p(y)\Vert \Vert \varLambda (p(x);\lambda ^s,\beta _s)\Vert +\Vert \nabla p(y)\Vert \Vert \varLambda (p(x);\lambda ^s,\beta _s)\\&\qquad -\varLambda (p(y);\lambda ^s,\beta _s) \Vert \\&\quad \overset{(125)+(19)}{\le } L_{\nabla p} \Vert x-y\Vert \left( L_{h_1}+ \beta _s^{-1}{\text {dist}}(p_2(x)+\beta _s\lambda ^s_2, {\mathcal {K}})\right) \\&\qquad + M_{\nabla p}\Vert p(x)-p(y)\Vert \beta _s^{-1}\\&\quad \le \left( L_{\nabla p} \left( L_{h_1}+ \beta _s^{-1}{\text {dist}}(p_2(x)+\beta _s\lambda ^s_2, {\mathcal {K}})\right) + M^2_{\nabla p}\beta _s^{-1}\right) \Vert x-y\Vert . \end{aligned}$$

Note that by (74) and the definition of \(d_s\),

$$\begin{aligned} {\text {dist}}(p_2(x)+\beta _s\lambda _2^s, {\mathcal {K}})\le d_s. \end{aligned}$$

1.7 Proofs in Section 6.2

Proof of Theorem 3

We know from the basic property of proximal gradient step [32] that

$$\begin{aligned} \Vert x^{s} -{\tilde{x}}^s\Vert ^2 \le 2\left( H_s(\tilde{x}^s)-H_s^\star \right) /L_{s}. \end{aligned}$$

By Line 4 in Algorithm 3,

$$\begin{aligned} 0\in \nabla \phi _s({\tilde{x}}^s)+L_{s}(x^s-{\tilde{x}}^s)+{\beta _s}( x^s-x^{s-1})+\partial g(x^s). \end{aligned}$$

Therefore,

$$\begin{aligned}&{\text {dist}}(0, \nabla \phi _s( x^s)+\partial g(x^s))\\ {}&\quad \le L_{s} \Vert \tilde{x}^s-x^s\Vert +\Vert \nabla \phi _s(x^s)-\nabla \phi _s({\tilde{x}}^s)\Vert +\beta _s \Vert x^s-x^{s-1} \Vert \\&\quad \le 2L_{s}\Vert {\tilde{x}}^s-x^s\Vert +\beta _s \Vert x^s-x^{s-1} \Vert \end{aligned}$$

By (18) we have \(\nabla \phi _s( x^s)= \nabla f( x^s)+\nabla p(x^s) \lambda ^{s+1}\); combining this with the last two bounds we get

$$\begin{aligned} {\text {dist}}(0, \nabla f( x^s)+\nabla p(x^s) \lambda ^{s+1}+\partial g( x^{s}))^2\le 16 L_{s} \left( H_s(\tilde{x}^s)-H_s^\star \right) +2\beta _s^2\Vert x^s-x^{s-1}\Vert ^2. \end{aligned}$$

Secondly we know from (20) that

$$\begin{aligned} p(x^s)-\beta _s(\lambda ^{s+1}-\lambda ^s)\in \partial h^*(\lambda ^{s+1}). \end{aligned}$$

It follows that

$$\begin{aligned} {\text {dist}}(0, p(x^s)-\partial h^*(\lambda ^{s+1}))\le \beta _s \Vert \lambda ^{s+1}-\lambda ^s\Vert . \end{aligned}$$

Proof of Corollary 9

By (82), the sequence \(\{({\tilde{x}}^s, x^s, \lambda ^s)\}\) satisfies the same bounds (in expectation) as in Corollary 1. Hence,

$$\begin{aligned}&\mathbb {E}\left[ {\text {dist}}(0,\partial _x L(x^s, \lambda ^{s+1})) \right] \le \sqrt{16 L_{s}\epsilon _s+8c_0\beta _s^2}\le \sqrt{16\gamma \epsilon _0/\beta _0+8c_0\beta _0}\rho ^s,\\&\mathbb {E}\left[ {\text {dist}}(0,\partial _{\lambda }L(x^s, \lambda ^{s+1})) \right] \le \beta _0\sqrt{c_0} \rho ^s. \end{aligned}$$

About this article

Cite this article

Li, F., Qu, Z. An inexact proximal augmented Lagrangian framework with arbitrary linearly convergent inner solver for composite convex optimization. Math. Prog. Comp. 13, 583–644 (2021). https://doi.org/10.1007/s12532-021-00205-x
