Accelerating incremental gradient optimization with curvature information

Abstract

This paper studies an acceleration technique for the incremental aggregated gradient (IAG) method through the use of curvature information for solving strongly convex finite sum optimization problems. These optimization problems arise in large-scale learning applications. Our technique utilizes a curvature-aided gradient tracking step to produce accurate gradient estimates incrementally using Hessian information. We propose and analyze two methods utilizing the new technique, the curvature-aided IAG (CIAG) method and the accelerated CIAG (A-CIAG) method, which are analogous to the gradient method and Nesterov’s accelerated gradient method, respectively. Setting \(\kappa\) to be the condition number of the objective function, we prove R-linear convergence rates of \(1 - \frac{4c_0 \kappa }{(\kappa +1)^2}\) for the CIAG method and \(1 - \sqrt{\frac{c_1}{2\kappa }}\) for the A-CIAG method, where \(c_0,c_1 \le 1\) are constants inversely proportional to the distance between the initial point and the optimal solution. When the initial iterate is close to the optimal solution, the R-linear convergence rates match those of the gradient and accelerated gradient methods, even though CIAG and A-CIAG operate in an incremental setting with strictly lower computation complexity. Numerical experiments confirm our findings. The source codes used for this paper can be found at http://github.com/hoitowai/ciag/.
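
To make the curvature-aided gradient tracking step concrete, the following is a minimal Python sketch of a CIAG-style update (it is not the released implementation linked above; the oracles grad_f and hess_f, the cyclic component order, and the use of NumPy arrays are illustrative assumptions). The aggregates b and H store the gradient and Hessian information from the last visit to every component, so that b + H @ theta reproduces the curvature-aided gradient estimate while only one component is refreshed per iteration:

def ciag(grad_f, hess_f, theta0, m, gamma, n_iters):
    # theta0 is a NumPy array; grad_f(i, t) and hess_f(i, t) return the gradient and
    # Hessian of the i-th component at t.  tau[i] is the iterate at the last visit to f_i.
    theta = theta0.copy()
    tau = [theta0.copy() for _ in range(m)]
    b = sum(grad_f(i, tau[i]) - hess_f(i, tau[i]) @ tau[i] for i in range(m))
    H = sum(hess_f(i, tau[i]) for i in range(m))
    for k in range(n_iters):
        i = k % m                                   # visit one component per iteration
        b -= grad_f(i, tau[i]) - hess_f(i, tau[i]) @ tau[i]
        H -= hess_f(i, tau[i])
        tau[i] = theta.copy()                       # refresh component i at the current iterate
        b += grad_f(i, tau[i]) - hess_f(i, tau[i]) @ tau[i]
        H += hess_f(i, tau[i])
        theta = theta - gamma * (b + H @ theta)     # curvature-aided gradient surrogate step
    return theta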

References

  1. Agarwal, A., Bottou, L.: A lower bound for the optimization of finite sums. In: International Conference on Machine Learning, pp. 78–86 (2015)

  2. Arjevani, Y., Shamir, O.: Dimension-free iteration complexity of finite sum optimization problems. In: Advances in Neural Information Processing Systems 29, pp. 3540–3548 (2016)

  3. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific, Belmont (1999)

  4. Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optim. Mach. Learn. 2010(1–38), 3 (2011)

  5. Blatt, D., Hero, A.O., Gauchman, H.: A convergent incremental gradient method with a constant step size. SIAM J. Optim. 18(1), 29–51 (2007)

  6. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

  7. Bubeck, S., et al.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015)

  8. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011)

  9. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: a fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems 27, pp. 1646–1654 (2014)

  10. Feyzmahdavian, H.R., Aytekin, A., Johansson, M.: A delayed proximal gradient method with linear convergence rate. In: IEEE International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. IEEE (2014)

  11. Gower, R.M., Roux, N.L., Bach, F.: Tracking the gradients using the Hessian: a new look at variance reducing stochastic methods. In: AISTATS (2018)

  12. Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.: A globally convergent incremental Newton method. Math. Program. 151(1), 283–313 (2015)

  13. Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.: Why random reshuffling beats stochastic gradient descent. Math. Program. https://doi.org/10.1007/s10107-019-01440-w (2019)

  14. Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.: On the convergence rate of incremental aggregated gradient algorithms. SIAM J. Optim. 27(2), 1035–1048 (2017)

  15. Lan, G., Zhou, Y.: An optimal randomized incremental gradient method. Math. Program. 171, 167–215 (2018)

  16. Mairal, J.: Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim. 25(2), 829–855 (2015)

  17. Mokhtari, A., Eisen, M., Ribeiro, A.: IQN: an incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 28(2), 1670–1698 (2018)

  18. Nedić, A., Bertsekas, D.P.: Convergence rate of incremental subgradient algorithms. In: Uryasev, S., Pardalos, P.M. (eds.) Stochastic Optimization: Algorithms and Applications. Applied Optimization, vol. 54. Springer, Boston (2001)

  19. Nedic, A., Bertsekas, D.P.: Incremental subgradient methods for nondifferentiable optimization. SIAM J. Optim. 12(1), 109–138 (2001)

  20. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2013)

  21. Nitanda, A.: Stochastic proximal gradient descent with acceleration techniques. In: Advances in Neural Information Processing Systems, pp. 1574–1582 (2014)

  22. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  23. Rodomanov, A., Kropotov, D.: A superlinearly-convergent proximal Newton-type method for the optimization of finite sums. In: International Conference on Machine Learning, pp. 2597–2605 (2016)

  24. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)

  25. Schmidt, M., Roux, N.L., Bach, F.R.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems 24, pp. 1458–1466 (2011)

  26. So, A.M.C., Zhou, Z.: Non-asymptotic convergence analysis of inexact gradient methods for machine learning without strong convexity. Optim. Methods Softw. 32(4), 963–992 (2017)

  27. Vanli, N.D., Gürbüzbalaban, M., Ozdaglar, A.: A stronger convergence result on the proximal incremental aggregated gradient method. arXiv preprint arXiv:1611.08022 (2016)

  28. Vapnik, V.N.: An overview of statistical learning theory. IEEE Trans. Neural Netw. 10(5), 988–999 (1999)

  29. Wai, H.T., Shi, W., Nedić, A., Scaglione, A.: Curvature-aided incremental aggregated gradient method. In: Proceedings of Allerton (2017)

  30. Xiao, L., Zhang, T.: A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim. 24(4), 2057–2075 (2014)

  31. Zheng, S., Meng, Q., Wang, T., Chen, W., Yu, N., Ma, Z.M., Liu, T.Y.: Asynchronous stochastic gradient descent with delay compensation. In: Proceedings of the 34th International Conference on Machine Learning, vol. 70, pp. 4120–4129. JMLR.org (2017)

Acknowledgements

This work has been partially supported by the NSF Grant CCF-1717391 and CUHK Direct Grant #4055113.

Corresponding author

Correspondence to Hoi-To Wai.

Additional information

In memory of Dr. Wei Shi, a respected friend and talented scholar.

Appendices

A Proof of Proposition 3

Let us express the gradient error as

$${{{\varvec{e}}}}_{\textsf{CIAG}}^k = \sum _{i=1}^m \left( {\nabla }f_i ( {\varvec{\theta }}^{\tau _i^k} ) + {\nabla }^2 f_i ( {\varvec{\theta }}^{\tau _i^k} ) ( {\varvec{\theta }}^k - {\varvec{\theta }}^{\tau _i^k} ) - {\nabla }f_i ( {\varvec{\theta }}^k ) \right)$$
(59)

Applying Lemma 1:

$$\begin{aligned} \begin{aligned}&\Vert {{{\varvec{e}}}}_{\textsf{CIAG}}^k \Vert \le \sum _{i=1}^m \frac{L_{H,i}}{2} \Vert {\varvec{\theta }}^{\tau _i^k} - {\varvec{\theta }}^k \Vert ^2 \le \sum _{i=1}^m \frac{L_{H,i}}{2} \underbrace{(k - \tau _i^k)}_{\le K} \sum _{j=\tau _i^k}^{k-1} \Vert {\varvec{\theta }}^{j+1} - {\varvec{\theta }}^j \Vert ^2 \\&\quad \le \frac{K L_H}{2} \sum _{j= (k-K)_{++}}^{k-1} \Vert {\varvec{\theta }}^{j+1} - {\varvec{\theta }}^j \Vert ^2 \le \frac{K L_H}{2} \gamma ^2 \sum _{j=(k-K)_{++}}^{k-1} \Vert {{{\varvec{e}}}}_{\textsf{CIAG}}^j + {\nabla }F({\varvec{\theta }}^j) \Vert ^2 \\&\quad \le \gamma ^2 K L_H \sum _{j=(k-K)_{++}}^{k-1} \left( \Vert {{{\varvec{e}}}}_{\textsf{CIAG}}^j \Vert ^2 + \Vert {\nabla }F({\varvec{\theta }}^j) \Vert ^2 \right) \;. \end{aligned} \end{aligned}$$
(60)

Furthermore, we have

$$\begin{aligned}&\Vert {\nabla }F({\varvec{\theta }}^j) \Vert ^2 = \Vert {\nabla }F({\varvec{\theta }}^j) - {\nabla }F({\varvec{\theta }}^\star ) \Vert ^2 \le L^2 V^{(j)}, \end{aligned}$$
(61)
$$\begin{aligned}&\Vert {{{\varvec{e}}}}_{\textsf{CIAG}}^j \Vert \overset{(a)}{\le } \sum _{i=1}^m L_{H,i} \left( V^{(j)} + V^{(\tau _i^j)} \right) \le 2 L_H \max _{ \ell \in \{ \tau _i^j \}_{i=1}^m \cup \{j\} } V^{(\ell )} \;, \end{aligned}$$
(62)

where (a) is due to \(\Vert {{{\varvec{a}}}} - {{{\varvec{b}}}} \Vert ^2 \le 2 (\Vert {{{\varvec{a}}}}\Vert ^2 + \Vert {{{\varvec{b}}}} \Vert ^2)\). Plugging these back into (60) and using \(\tau _i^{k-K} \ge k - 2K\) gives:

$$\begin{aligned} \begin{aligned} \Vert {{{\varvec{e}}}}_{\textsf{CIAG}}^k \Vert&\le \gamma ^2 K L_H \sum _{j=(k-K)_{++}}^{k-1} \left( L^2 V^{(j)} + \left( 2 L_H \max _{ \ell \in \{ \tau _i^j \}_{i=1}^m \cup \{j\} } V^{(\ell )} \right) ^2 \right) \\&\le \gamma ^2 K^2 L_H \left( L^2 \max _{ (k-K)_{++} \le \ell \le k-1 } V^{(\ell )} + 4 L_H^2 \max _{ (k-2K)_{++} \le \ell \le k-1 } (V^{(\ell )})^2 \right) \;. \end{aligned} \end{aligned}$$
(63)
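
As a numerical sanity check of the first inequality above (Lemma 1 applied to each component), the snippet below uses toy logistic components f_i(t) = log(1 + exp(a_i^T t)), for which L_{H,i} = ||a_i||^3/(6*sqrt(3)) is a valid Hessian-Lipschitz constant; the construction and all parameter values are our own illustrative choices, not taken from the paper:

import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 8
A = rng.normal(size=(m, d))                 # rows a_i of the toy components

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_f(i, t):
    return sigma(A[i] @ t) * A[i]

def hess_f(i, t):
    s = sigma(A[i] @ t)
    return s * (1.0 - s) * np.outer(A[i], A[i])

L_H = np.linalg.norm(A, axis=1) ** 3 / (6.0 * np.sqrt(3.0))   # per-component L_{H,i}

worst = 0.0
for _ in range(200):
    t, t_old = rng.normal(size=d), rng.normal(size=d)
    err = sum(np.linalg.norm(grad_f(i, t_old) + hess_f(i, t_old) @ (t - t_old) - grad_f(i, t))
              for i in range(m))
    bound = np.sum(L_H / 2.0) * np.linalg.norm(t - t_old) ** 2
    worst = max(worst, err / bound)
print("max (error / bound) over random trials:", worst)        # stays below 1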

B Step 3 in the Proof of Theorem 1

Combining Propositions 1 and 3 yields

$$\begin{aligned} \begin{aligned} V^{(k+1)}&\le \left( 1 - 2\gamma \frac{ \mu L }{\mu + L}\right) V^{(k)} \\&\quad + 2 \gamma ^3 K^2 L_H \left( L^2 \max _{ (k-K)_{++} \le \ell \le k } (V^{(\ell )})^{\frac{3}{2}}+ 4 L_H^2 \max _{ (k-2K)_{++} \le \ell \le k } (V^{(\ell )})^{\frac{5}{2}} \right) \\&\quad + 2 \gamma ^6 K^4 L_H^2 \left( L^4 \max _{ (k-K)_{++} \le \ell \le k-1 } (V^{(\ell )})^2 + 16 L_H^4 \max _{ (k-2K)_{++} \le \ell \le k-1 } (V^{(\ell )})^4 \right) , \end{aligned} \end{aligned}$$
(64)

which is the exact form of Eq. (44). The right hand side of (64) can be decomposed into two parts: the first term is of the same order as \(V^{(k)}\), while the remaining terms are delayed and higher-order terms in \(V^{(\ell )}\).

Observe that (64) is a special case of (48) in Proposition 5 with \(R^{(k)} = V^{(k)}\), \(M=2K+1\), \(p=1 - 2 \gamma \mu L / (\mu + L)\) and

$$\begin{aligned} \begin{aligned}&q_1 = 2 \gamma ^3 K^2 L^2 L_H,~\eta _1 = 3/2,~q_2 = 8 \gamma ^3 K^2 L_H^3,~\eta _2 = 5/2 \;, \\&q_3 = 2 \gamma ^6 K^4 L_H^2 L^4,~\eta _3 = 2,~q_4 = 32 \gamma ^6 K^4 L_H^6,~\eta _4 = 4 \;. \end{aligned} \end{aligned}$$
(65)

The corresponding convergence condition in (49) can be satisfied if

$$\begin{aligned} \begin{aligned}&\gamma ^5 ~ 2K^4 L_H^2 \left( L^4 V^{(1)} + 16 L_H^4 (V^{(1)})^3 \right)< \frac{ \mu L }{ \mu + L } \\&\text{and}~~\gamma ^2 ~ 2K^2 L_H \left( L^2 (V^{(1)})^{1/2} + 4 L_H^2 (V^{(1)})^{3/2} \right) < \frac{ \mu L }{ \mu + L } \;, \end{aligned} \end{aligned}$$
(66)

which is implied by (28). The proof is thus concluded.

C Proof of Proposition 5

The proof of the proposition is divided into two parts. We first show that under (49), the sequence \(\{ R^{(k)} \}_{k \ge 1}\) converges linearly as in part (a) of the proposition; then we show that the rate of convergence is asymptotically given by p as in part (b) of the proposition [cf. (50)].

The first part of the proof proceeds by induction on \(\ell \ge 1\), with the claim:

$$\begin{aligned} R^{(k)} \le \delta ^\ell ~ R^{(1)},~\forall ~k=(\ell -1)M + 2,..., \ell M + 1\;. \end{aligned}$$
(67)

The base case when \(\ell =1\) can be straightforwardly established:

$$\begin{aligned} \begin{aligned}&\textstyle R^{(2)} \le p R^{(1)} + \sum _{j=1}^J q_j (R^{(1)})^{\eta _j} \le \delta R^{(1)} \;, \\&\vdots \\&\textstyle R^{(M+1)} \le p R^{(M)} + \sum _{j=1}^J q_j (R^{(1)})^{\eta _j} \le \delta R^{(1)} \;. \end{aligned} \end{aligned}$$
(68)

Suppose that the statement (67) is true up to \(\ell =c\); for \(\ell =c+1\), we have:

$$\begin{aligned} \begin{aligned} R^{( cM+ 2)}&\le p R^{( cM+1 )} + \sum _{j=1}^J q_j \max _{ k' \in [ (c-1)M + 2, cM +1 ] } (R^{(k')})^{\eta _j} \\&\le p \left( \delta ^c R^{(1)} \right) + \sum _{j=1}^J q_j \left( \delta ^c R^{(1)} \right) ^{\eta _j} \le \delta ^c ~ \left( pR^{(1)} + \sum _{j=1}^J q_j (R^{(1)})^{\eta _j} \right) \le \delta ^{c+1} R^{(1)} \;. \end{aligned} \end{aligned}$$

A similar statement also holds for \(R^{(k)}\) with \(k=cM+3,...,(c+1)M+1\). We thus conclude with:

$$\begin{aligned} R^{(k)} \le \delta ^{ \lceil (k-1) / M \rceil } ~ R^{(1)},~\forall ~ k \ge 1 \;, \end{aligned}$$
(69)

which proves the first part of the proposition.

The second part of the proof establishes the asymptotic linear rate of convergence in (50). We consider the upper bound sequence \(\{ \bar{R}^{(k)} \}_{k \ge 1}\) such that \(\bar{R}^{(1)} = R^{(1)}\) and the inequality (48) is tight for \(\{ \bar{R}^{(k)} \}_{k \ge 1}\). Obviously, it also holds that \(\bar{R}^{(k)} \le \delta ^{ \lceil (k-1) / M \rceil } \bar{R}^{(1)}\) for all \(k \ge 1\). Now, observe that

$$\begin{aligned} \frac{\bar{R}^{(k+1)}}{\bar{R}^{(k)}} = p + \frac{ \sum _{j=1}^J q_j \max _{ k' \in [(k-M+1)_{++}, k] } (\bar{R}^{(k')})^{\eta _j} }{ \bar{R}^{(k)} } \;. \end{aligned}$$
(70)

For any \(k' \in [k-M+1,k]\) and any \(\eta > 1\), we have:

$$\begin{aligned} \begin{aligned}&\frac{ (\bar{R}^{(k')})^{\eta } }{ \bar{R}^{(k)} } = \frac{ \bar{R}^{(k')} }{ \bar{R}^{(k)} } ~ (\bar{R}^{(k')})^{\eta -1} \le \frac{ \bar{R}^{(k')} }{ \bar{R}^{(k)} } (R^{(1)})^{\eta -1} \delta ^{ (\lceil \frac{k'-1}{M} \rceil )(\eta -1) }\;. \end{aligned} \end{aligned}$$
(71)

Note that as \(\bar{R}^{(k+1)} / \bar{R}^{(k)} \ge p\), we have:

$$\begin{aligned} \frac{ (\bar{R}^{(k')})^{\eta } }{ \bar{R}^{(k)} } \le p^{-M} (R^{(1)})^{\eta -1} \delta ^{ (\lceil \frac{k'-1}{M} \rceil )(\eta -1) } \;. \end{aligned}$$
(72)

Taking \(k \rightarrow \infty\) shows that the right hand side vanishes. As a result, we have \(\lim _{k \rightarrow \infty } \bar{R}^{(k+1)} / \bar{R}^{(k)} = p\). This proves part (b) of the proposition.
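
To illustrate parts (a) and (b) of the proposition, the short Python simulation below iterates the recursion (48) with equality for illustrative values of p, M, q_j and eta_j chosen so that condition (49) holds; the sequence decays R-linearly and the successive ratio approaches p, as claimed:

p, M = 0.9, 5
q = [0.02, 0.01]
eta = [1.5, 2.0]
R = [1.0]                                   # R^{(1)}; here sum_j q_j * R^{(1)}^(eta_j - 1) = 0.03 < 1 - p
for _ in range(300):
    window = R[-M:]                         # the last (at most) M iterates
    R.append(p * R[-1] + sum(qj * max(window) ** ej for qj, ej in zip(q, eta)))
print("final iterate:", R[-1])
print("asymptotic ratio R^(k+1)/R^(k):", R[-1] / R[-2], "  (compare with p =", p, ")")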

D Proof of Proposition 2

The following proof is partially inspired by [7, 21, 25]. For simplicity, we drop the subscript ACIAG in \({{{\varvec{g}}}}_{\textsf{ACIAG}}^k\) and \({{{\varvec{e}}}}_{\textsf{ACIAG}}^k\). Define \(\rho \mathrel{\mathop :}=1 - \sqrt{\mu \gamma }\) and the estimation sequence as:

$$\begin{aligned} \begin{aligned} \varPhi _1 ( {\varvec{\theta }})&\mathrel{\mathop :}=F ( {\varvec{\theta }}_{ex}^1 ) + \frac{ \mu }{2} \Vert {\varvec{\theta }}- {\varvec{\theta }}_{ex}^1 \Vert ^2 \\ \varPhi _{k+1}( {\varvec{\theta }})&\mathrel{\mathop :}=\rho ~\varPhi _k ( {\varvec{\theta }}) + \sqrt{\mu \gamma } \left( F( {\varvec{\theta }}_{ex}^k) + \langle {{{\varvec{g}}}}^k, {\varvec{\theta }}- {\varvec{\theta }}_{ex}^k \rangle + \frac{\mu }{2} \Vert {\varvec{\theta }}- {\varvec{\theta }}_{ex}^k \Vert ^2 \right) \;, \end{aligned} \end{aligned}$$
(73)

where \({{{\varvec{g}}}}^k \mathrel{\mathop :}={{{\varvec{b}}}}^k + {{{\varvec{H}}}}^k {\varvec{\theta }}_{ex}^k\) is the gradient surrogate used in (17). Recall that \({{{\varvec{e}}}}^k \mathrel{\mathop :}={{{\varvec{g}}}}^k - {\nabla }F( {\varvec{\theta }}_{ex}^k )\) is the gradient error. The following inequality, which holds for all \({\varvec{\theta }}\in \mathbb{R}^d\), can be immediately obtained using (73) and the \(\mu\)-strong convexity of \(F({\varvec{\theta }})\):

$$\begin{aligned} \begin{aligned}&\varPhi _{k+1} ({\varvec{\theta }}) - F({\varvec{\theta }}) = \rho \varPhi _k ( {\varvec{\theta }}) - F({\varvec{\theta }}) \\&\qquad + \sqrt{\mu \gamma } \left( F( {\varvec{\theta }}_{ex}^k) + \langle {\nabla }F({\varvec{\theta }}_{ex}^k) + {{{\varvec{e}}}}^k, {\varvec{\theta }}- {\varvec{\theta }}_{ex}^k \rangle + \frac{\mu }{2} \Vert {\varvec{\theta }}- {\varvec{\theta }}_{ex}^k \Vert ^2 \right) \\&\quad \le \rho \left( \varPhi _k ( {\varvec{\theta }}) - F({\varvec{\theta }}) \right) + \sqrt{\mu \gamma } \langle {{{\varvec{e}}}}^k, {\varvec{\theta }}- {\varvec{\theta }}_{ex}^k \rangle \\&\quad \le \rho ^k \left( \varPhi _1( {\varvec{\theta }}) - F({\varvec{\theta }}) \right) + \sum _{\ell =1}^k \rho ^{k-\ell } \sqrt{ \mu \gamma } \langle {{{\varvec{e}}}}^\ell , {\varvec{\theta }}- {\varvec{\theta }}_{ex}^\ell \rangle \;. \end{aligned} \end{aligned}$$
(74)

To facilitate our development, let us denote:

$$\begin{aligned} \varPhi _k^\star \mathrel{\mathop :}=\min _{ {\varvec{\theta }}} \varPhi _k ( {\varvec{\theta }}),~~{{{\varvec{v}}}}^k \mathrel{\mathop :}=\arg \min _{ {\varvec{\theta }}} \varPhi _k ( {\varvec{\theta }}) \;. \end{aligned}$$
(75)

By setting \({\varvec{\theta }}= {\varvec{\theta }}^\star\) in (74), we have:

$$\begin{aligned} \begin{aligned}&\varPhi _{k+1}^\star - F({\varvec{\theta }}^\star ) \le \varPhi _{k+1}({\varvec{\theta }}^\star ) - F({\varvec{\theta }}^\star ) \\&\quad \le \rho ^k \left( \frac{\mu }{2} \Vert {\varvec{\theta }}^\star - {\varvec{\theta }}_{ex}^1 \Vert ^2 + F({\varvec{\theta }}_{ex}^1) - F({\varvec{\theta }}^\star ) \right) + \sum _{\ell =1}^k \rho ^{k-\ell } \sqrt{ \mu \gamma } \langle {{{\varvec{e}}}}^\ell , {\varvec{\theta }}^\star - {\varvec{\theta }}_{ex}^\ell \rangle \\&\quad \le 2 \rho ^k \left( F({\varvec{\theta }}^1) - F({\varvec{\theta }}^\star ) \right) + \sum _{\ell =1}^k \rho ^{k-\ell } \sqrt{ \mu \gamma } \langle {{{\varvec{e}}}}^\ell , {\varvec{\theta }}^\star - {\varvec{\theta }}_{ex}^\ell \rangle \;. \end{aligned} \end{aligned}$$
(76)

Now, if \(F( {\varvec{\theta }}^{k+1} ) \le \varPhi _{k+1}^\star\), then the inequality above shows the evolution of the optimality gap \(h^{(k)}\). This motivates our next step, relating \(F( {\varvec{\theta }}^{k+1} )\) to \(\varPhi _{k+1}^\star\).

Lower bounding \(\varPhi _{k+1}^\star\) in the presence of errors. Since \({\nabla }^2 \varPhi _k ( {\varvec{\theta }}) = \mu {{{\varvec{I}}}}\), the function \(\varPhi _k({\varvec{\theta }})\) is quadratic and we can represent \(\varPhi _k({\varvec{\theta }})\) alternatively as

$$\begin{aligned} \varPhi _k ({\varvec{\theta }}) = \varPhi _k^\star + \frac{\mu }{2} \Vert {\varvec{\theta }}- {{{\varvec{v}}}}^k \Vert ^2 \;. \end{aligned}$$
(77)

By substituting (77) into the definition of \(\varPhi _{k+1} ({\varvec{\theta }})\) in (73) and evaluating the first order optimality condition of the latter, we have:

$$\begin{aligned} \begin{aligned}&\sqrt{\mu \gamma } ( {{{\varvec{g}}}}^k + \mu ( {{{\varvec{v}}}}^{k+1} - {\varvec{\theta }}_{ex}^k ) ) + \rho ~ \mu ( {{{\varvec{v}}}}^{k+1} - {{{\varvec{v}}}}^k ) = {{{\varvec{0}}}} \;,\\&\Longrightarrow {{{\varvec{v}}}}^{k+1} = \rho {{{\varvec{v}}}}^k + \sqrt{ \mu \gamma } {\varvec{\theta }}_{ex}^k - \sqrt{\frac{\gamma }{\mu }} {{{\varvec{g}}}}^k \;. \end{aligned} \end{aligned}$$
(78)

By setting \({\varvec{\theta }}={\varvec{\theta }}_{ex}^k\) in (73) and using the recursive definition of \(\varPhi _{k+1} ({\varvec{\theta }})\), we obtain

$$\begin{aligned} \begin{aligned} \varPhi _{k+1} ( {\varvec{\theta }}_{ex}^k )&= \rho \varPhi _{k} ( {\varvec{\theta }}_{ex}^k ) + \sqrt{\mu \gamma } F( {\varvec{\theta }}_{ex}^k ) = \rho \left( \varPhi _k^\star + \frac{\mu }{2} \Vert {\varvec{\theta }}_{ex}^k - {{{\varvec{v}}}}^k \Vert ^2 \right) + \sqrt{\mu \gamma } F( {\varvec{\theta }}_{ex}^k ) \;, \end{aligned} \end{aligned}$$
(79)

while setting \({\varvec{\theta }}={\varvec{\theta }}_{ex}^k\) in (77) and using (78) gives us:

$$\begin{aligned} \begin{aligned} \varPhi _{k+1} ( {\varvec{\theta }}_{ex}^k )&= \varPhi _{k+1}^\star + \frac{\mu }{2} \left( \rho ^2 \Vert {\varvec{\theta }}_{ex}^k - {{{\varvec{v}}}}^k \Vert ^2 + \frac{\gamma }{\mu } \Vert {{{\varvec{g}}}}^k \Vert ^2 + 2 \rho \sqrt{\frac{\gamma }{\mu }} \langle {{{\varvec{g}}}}^k, {\varvec{\theta }}_{ex}^k - {{{\varvec{v}}}}^k \rangle \right) \;. \end{aligned} \end{aligned}$$
(80)

Comparing the right hand side of (79) and (80) shows:

$$\begin{aligned} \begin{aligned} \varPhi _{k+1}^\star&= \rho \left( \varPhi _k^\star + \frac{\mu }{2} \Vert {\varvec{\theta }}_{ex}^k - {{{\varvec{v}}}}^k \Vert ^2 \right) + \sqrt{\mu \gamma } F( {\varvec{\theta }}_{ex}^k ) \\&\quad - \frac{\mu }{2}\left( \rho ^2 \Vert {\varvec{\theta }}_{ex}^k - {{{\varvec{v}}}}^k \Vert ^2 + \frac{\gamma }{\mu } \Vert {{{\varvec{g}}}}^k \Vert ^2 + 2 \rho \sqrt{\frac{\gamma }{\mu }} \langle {{{\varvec{g}}}}^k, {\varvec{\theta }}_{ex}^k - {{{\varvec{v}}}}^k \rangle \right) \\&= \rho \varPhi _k^\star + \sqrt{\mu \gamma } F({\varvec{\theta }}_{ex}^k) + \frac{\mu }{2} \rho \sqrt{\mu \gamma } \Vert {\varvec{\theta }}_{ex}^k - {{{\varvec{v}}}}^k \Vert ^2 - \frac{\gamma }{2} \Vert {{{\varvec{g}}}}^k \Vert ^2 - \rho \sqrt{\mu \gamma } \langle {{{\varvec{g}}}}^k, {\varvec{\theta }}_{ex}^k - {{{\varvec{v}}}}^k \rangle \;. \end{aligned} \end{aligned}$$

Using the fact \({{{\varvec{v}}}}^k - {\varvec{\theta }}_{ex}^k = (\sqrt{\mu \gamma })^{-1} \left( {\varvec{\theta }}_{ex}^k - {\varvec{\theta }}^k \right)\) (proven in Sect. D.1), we have

$$\begin{aligned} \begin{aligned} \varPhi _{k+1}^\star&= \rho \varPhi _k^\star + \sqrt{\mu \gamma } F({\varvec{\theta }}_{ex}^k) + \frac{\mu }{2} \frac{ \rho }{ \sqrt{\mu \gamma } } \Vert {\varvec{\theta }}_{ex}^k - {\varvec{\theta }}^k \Vert ^2 - \frac{ \gamma }{2} \Vert {{{\varvec{g}}}}^k \Vert ^2 - \rho \langle {{{\varvec{g}}}}^k, {\varvec{\theta }}^k - {\varvec{\theta }}_{ex}^k \rangle \;. \end{aligned} \end{aligned}$$
(81)

We obtain the following chain:

$$\begin{aligned} \begin{aligned}&F( {\varvec{\theta }}^{k+1} ) - \varPhi _{k+1}^\star \overset{(a)}{\le } F( {\varvec{\theta }}_{ex}^k ) - \gamma \langle {\nabla }F( {\varvec{\theta }}_{ex}^k ), {{{\varvec{g}}}}^k \rangle + \frac{L \gamma ^2}{2} \Vert {{{\varvec{g}}}}^k \Vert ^2 - \varPhi _{k+1}^\star \\&\quad \overset{(b)}{=} \rho ~ \left( F({\varvec{\theta }}_{ex}^k ) + \langle {{{\varvec{g}}}}^k, {\varvec{\theta }}^k - {\varvec{\theta }}_{ex}^k \rangle - \varPhi _k^\star \right) \\&\qquad -\gamma \langle {\nabla }F( {\varvec{\theta }}_{ex}^k ), {{{\varvec{g}}}}^k \rangle + \frac{\gamma }{2} \left( 1 + L \gamma \right) \Vert {{{\varvec{g}}}}^k \Vert ^2 - \frac{\mu }{2} \frac{ \rho }{ \sqrt{\mu \gamma } } \Vert {\varvec{\theta }}_{ex}^k - {\varvec{\theta }}^k \Vert ^2 \\&\quad \overset{(c)}{=} \rho ~ \left( F({\varvec{\theta }}_{ex}^k) + \langle {\nabla }F({\varvec{\theta }}_{ex}^k), {\varvec{\theta }}^k - {\varvec{\theta }}_{ex}^k \rangle - \varPhi _k^\star \right) -\gamma \langle {\nabla }F( {\varvec{\theta }}_{ex}^k ), {{{\varvec{g}}}}^k \rangle \\&\qquad + \rho \langle {{{\varvec{e}}}}^k, {\varvec{\theta }}^k - {\varvec{\theta }}_{ex}^k \rangle + \frac{\gamma }{2} \left( 1 + L \gamma \right) \Vert {{{\varvec{g}}}}^k \Vert ^2 - \frac{\mu }{2} \frac{ \rho }{ \sqrt{\mu \gamma } } \Vert {\varvec{\theta }}_{ex}^k - {\varvec{\theta }}^k \Vert ^2 \\&\quad \overset{(d)}{\le } \rho ~ \left( F({\varvec{\theta }}^k) - \varPhi _k^\star + \langle {{{\varvec{e}}}}^k, {\varvec{\theta }}^k - {\varvec{\theta }}_{ex}^k \rangle \right) - \frac{\mu }{2} \frac{ 1 - \mu \gamma }{ \sqrt{\mu \gamma } } \Vert {\varvec{\theta }}_{ex}^k - {\varvec{\theta }}^k \Vert ^2 \\&\qquad + \frac{\gamma }{2} \left( 1 + L \gamma \right) \Vert {{{\varvec{g}}}}^k \Vert ^2 - \gamma \langle {\nabla }F( {\varvec{\theta }}_{ex}^k ), {{{\varvec{g}}}}^k \rangle \\&\quad \overset{(e)}{\le } \rho ~ \left( F({\varvec{\theta }}^k) - \varPhi _k^\star + \langle {{{\varvec{e}}}}^k, {\varvec{\theta }}^k - {\varvec{\theta }}_{ex}^k \rangle \right) - \frac{\mu }{2} \frac{ 1 - \mu \gamma }{ \sqrt{\mu \gamma } } \Vert {\varvec{\theta }}_{ex}^k - {\varvec{\theta }}^k \Vert ^2 + \gamma \Vert {{{\varvec{e}}}}^k \Vert ^2 \;, \end{aligned} \end{aligned}$$
(82)

where (a) is due to the L-smoothness of F; (b) is due to (81); (c) is obtained by expanding \({{{\varvec{g}}}}^k\) as \({\nabla }F({\varvec{\theta }}_{ex}^k) + {{{\varvec{e}}}}^k\); (d) is obtained by adding and subtracting \((\mu /2) \Vert {\varvec{\theta }}^k - {\varvec{\theta }}_{ex}^k \Vert ^2\) inside the first bracket, applying the identity \(\rho + \rho / \sqrt{\mu \gamma } = (1 - \mu \gamma ) / \sqrt{\mu \gamma }\), and using the \(\mu\)-strong convexity of F; and (e) is due to the following chain of inequalities:

$$\begin{aligned} \begin{aligned}&\frac{\gamma }{2} \left( 1 + L \gamma \right) \Vert {{{\varvec{g}}}}^k \Vert ^2 - \gamma \langle {\nabla }F( {\varvec{\theta }}_{ex}^k ), {{{\varvec{g}}}}^k \rangle \\&\quad \le \frac{\gamma }{2} \left( 1 + L \gamma \right) \left( \Vert {{{\varvec{e}}}}^k \Vert ^2 + \Vert {\nabla }F( {\varvec{\theta }}_{ex}^k ) \Vert ^2 \right) + \frac{ L \gamma ^2 }{2} \left( \Vert {\nabla }F({\varvec{\theta }}_{ex}^k ) \Vert ^2 + \Vert {{{\varvec{e}}}}^k \Vert ^2 \right) - \gamma \Vert {\nabla }F( {\varvec{\theta }}_{ex}^k ) \Vert ^2 \\&\quad = \left( \frac{\gamma }{2} + L \gamma ^2 \right) \Vert {{{\varvec{e}}}}^k \Vert ^2 + \left( -\frac{\gamma }{2} + L \gamma ^2 \right) \Vert {\nabla }F( {\varvec{\theta }}_{ex}^k ) \Vert ^2 \le \gamma \Vert {{{\varvec{e}}}}^k \Vert ^2 \;. \end{aligned} \end{aligned}$$

As \(\varPhi _1( {\varvec{\theta }}^1 ) = F( {\varvec{\theta }}^1 ) = \varPhi _1^\star\), applying the inequality (82) recursively shows:

$$\begin{aligned} \begin{aligned}&F( {\varvec{\theta }}^{k+1} ) - \varPhi _{k+1}^\star \le \\&\sum _{\ell =1}^k \rho ^{k-\ell } \left( (1-\sqrt{\mu \gamma }) \langle {{{\varvec{e}}}}^\ell , {\varvec{\theta }}^\ell - {\varvec{\theta }}_{ex}^\ell \rangle + \gamma \Vert {{{\varvec{e}}}}^\ell \Vert ^2 - \frac{\mu }{2} \frac{1-\mu \gamma }{\sqrt{\mu \gamma }} \Vert {\varvec{\theta }}_{ex}^\ell - {\varvec{\theta }}^\ell \Vert ^2 \right) \;. \end{aligned} \end{aligned}$$
(83)

Importantly, (83) establishes a lower bound on \(\varPhi _{k+1}^\star\) in terms of \(F({\varvec{\theta }}^{k+1})\) and \({{{\varvec{e}}}}^k\).

Proving Proposition 2. Finally, summing up (83) and (76) gives:

$$\begin{aligned} \begin{aligned} h^{(k+1)}&\le 2 \rho ^k h^{(1)} + \sum _{\ell =1}^k \rho ^{k-\ell } \left( \sqrt{\mu \gamma } \langle {{{\varvec{e}}}}^\ell , {\varvec{\theta }}^\star - {\varvec{\theta }}_{ex}^\ell \rangle \right. \\&\quad \left. +\, \rho \langle {{{\varvec{e}}}}^\ell , {\varvec{\theta }}^\ell - {\varvec{\theta }}_{ex}^\ell \rangle + \gamma \Vert {{{\varvec{e}}}}^\ell \Vert ^2 - \frac{\mu }{2} \frac{1-\mu \gamma }{\sqrt{\mu \gamma }} \Vert {\varvec{\theta }}_{ex}^\ell - {\varvec{\theta }}^\ell \Vert ^2 \right) \\&= 2 \rho ^k h^{(1)} + \sum _{\ell =1}^k \rho ^{k-\ell } \left( \sqrt{\mu \gamma } \langle {{{\varvec{e}}}}^\ell , {\varvec{\theta }}^\star - {\varvec{\theta }}^\ell \rangle \right. \\&\quad \left. +\, \langle {{{\varvec{e}}}}^\ell , {\varvec{\theta }}^\ell - {\varvec{\theta }}_{ex}^\ell \rangle + \gamma \Vert {{{\varvec{e}}}}^\ell \Vert ^2 - \frac{\mu }{2} \frac{1-\mu \gamma }{\sqrt{\mu \gamma }} \Vert {\varvec{\theta }}_{ex}^\ell - {\varvec{\theta }}^\ell \Vert ^2 \right) \;. \end{aligned} \end{aligned}$$
(84)

Let us take a look at the last summands in the above inequality: for any \(\ell \ge 1\),

$$\begin{aligned} \begin{aligned}&\sqrt{\mu \gamma } \langle {{{\varvec{e}}}}^\ell , {\varvec{\theta }}^\star - {\varvec{\theta }}^\ell \rangle + \langle {{{\varvec{e}}}}^\ell , {\varvec{\theta }}^\ell - {\varvec{\theta }}_{ex}^\ell \rangle + \gamma \Vert {{{\varvec{e}}}}^\ell \Vert ^2 - \frac{\mu }{2} \frac{1-\mu \gamma }{\sqrt{\mu \gamma }} \Vert {\varvec{\theta }}_{ex}^\ell - {\varvec{\theta }}^\ell \Vert ^2 \\&\quad \overset{(a)}{\le } \sqrt{\mu \gamma } \Vert {{{\varvec{e}}}}^\ell \Vert \Vert {\varvec{\theta }}^\star - {\varvec{\theta }}^\ell \Vert + \left( \gamma + \frac{ \sqrt{\gamma / \mu } }{ 1 - \mu \gamma } \right) \Vert {{{\varvec{e}}}}^\ell \Vert ^2 - \frac{\mu }{4} \frac{1-\mu \gamma }{\sqrt{\mu \gamma }} \Vert {\varvec{\theta }}_{ex}^\ell - {\varvec{\theta }}^\ell \Vert ^2 \\&\quad \overset{(b)}{\le } \sqrt{2 \gamma h^{(\ell )}} \Vert {{{\varvec{e}}}}^\ell \Vert + \left( \gamma + \frac{ \sqrt{\gamma / \mu } }{ 1 - \mu \gamma } \right) \Vert {{{\varvec{e}}}}^\ell \Vert ^2 - \frac{\mu }{4} \frac{1-\mu \gamma }{\sqrt{\mu \gamma }} \Vert {\varvec{\theta }}_{ex}^\ell - {\varvec{\theta }}^\ell \Vert ^2 \\&\quad \overset{(c)}{\le } \sqrt{2 \gamma h^{(\ell )}} \Vert {{{\varvec{e}}}}^\ell \Vert + \sqrt{\frac{9\gamma }{\mu }} \Vert {{{\varvec{e}}}}^\ell \Vert ^2 - \frac{\mu }{4} \frac{1-\mu \gamma }{\sqrt{\mu \gamma }} \Vert {\varvec{\theta }}_{ex}^\ell - {\varvec{\theta }}^\ell \Vert ^2 \;, \end{aligned} \end{aligned}$$
(85)

where (a) follows from the fact \(\langle {{{\varvec{e}}}}^\ell , {\varvec{\theta }}^\ell - {\varvec{\theta }}_{ex}^\ell \rangle \le (1/2) ( \Vert {{{\varvec{e}}}}^\ell \Vert ^2 / c + c \Vert {\varvec{\theta }}^\ell - {\varvec{\theta }}_{ex}^\ell \Vert ^2 )\) for any \(c > 0\), where we have set \(c = \frac{\mu }{2} \frac{1-\mu \gamma }{\sqrt{\mu \gamma }}\); (b) is due to the relation \(\Vert {\varvec{\theta }}^\ell - {\varvec{\theta }}^\star \Vert \le \sqrt{2 h^{(\ell )} / \mu }\); (c) is due to \(\gamma + \frac{ \sqrt{\gamma / \mu } }{ 1 - \mu \gamma } \le 3 \sqrt{ \gamma / \mu }\), which can be verified by replacing \(\gamma\) with its upper bound 1/(2L) in the denominator of the fraction on the left-hand side. Combining (84) and (85) yields the desired result of Proposition 2.

D.1 Proof of the equality

We prove \({{{\varvec{v}}}}^k - {\varvec{\theta }}_{ex}^k = (\sqrt{\mu \gamma })^{-1} \left( {\varvec{\theta }}_{ex}^k - {\varvec{\theta }}^k \right)\) using induction on k. Clearly, the said equality holds for \(k=1\) since \({{{\varvec{v}}}}^1 = {\varvec{\theta }}^1 = {\varvec{\theta }}_{ex}^1\), and we assume that it holds up to k. Consider:

$$\begin{aligned} \begin{aligned}&{{{\varvec{v}}}}^{k+1} - {\varvec{\theta }}_{ex}^{k+1} = \rho {{{\varvec{v}}}}^k + \sqrt{ \mu \gamma } {\varvec{\theta }}_{ex}^k - \sqrt{\frac{\gamma }{\mu }} {{{\varvec{g}}}}^k - {\varvec{\theta }}_{ex}^{k+1} \\&\quad =\rho ( {{{\varvec{v}}}}^k - {\varvec{\theta }}_{ex}^k ) + {\varvec{\theta }}_{ex}^k - \sqrt{\frac{\gamma }{\mu }} {{{\varvec{g}}}}^k - {\varvec{\theta }}_{ex}^{k+1} = \frac{ \rho }{ \sqrt{\mu \gamma } } ( {\varvec{\theta }}_{ex}^k - {\varvec{\theta }}^k ) + {\varvec{\theta }}_{ex}^k - \sqrt{\frac{\gamma }{\mu }} {{{\varvec{g}}}}^k - {\varvec{\theta }}_{ex}^{k+1} \;, \end{aligned} \end{aligned}$$

where we have used the induction hypothesis. Furthermore, using \({\varvec{\theta }}^{k+1} = {\varvec{\theta }}_{ex}^k - \gamma {{{\varvec{g}}}}^k\),

$$\begin{aligned} \begin{aligned}&{{{\varvec{v}}}}^{k+1} - {\varvec{\theta }}_{ex}^{k+1} = \sqrt{\mu \gamma }^{-1} \left( \rho ({\varvec{\theta }}_{ex}^k - {\varvec{\theta }}^k) + \sqrt{\mu \gamma } ( {\varvec{\theta }}_{ex}^k - {\varvec{\theta }}_{ex}^{k+1} ) - \gamma {{{\varvec{g}}}}^k \right) \\&\quad \overset{(a)}{=} \sqrt{ \mu \gamma }^{-1} \left( \sqrt{\mu \gamma } ( {\varvec{\theta }}^{k+1} - {\varvec{\theta }}_{ex}^{k+1} ) + \rho ({\varvec{\theta }}^{k+1} - {\varvec{\theta }}^k ) \right) = \sqrt{ \mu \gamma }^{-1} \left( {\varvec{\theta }}_{ex}^{k+1} - {\varvec{\theta }}^{k+1} \right) \;, \end{aligned} \end{aligned}$$
(86)

where (a) is due to \(\rho ({\varvec{\theta }}^{k+1} - {\varvec{\theta }}^k ) = (1 + \sqrt{\mu \gamma } ) ( {\varvec{\theta }}_{ex}^{k+1} - {\varvec{\theta }}^{k+1} )\).
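
The identity can also be confirmed numerically. In the sketch below (a toy recursion with arbitrary surrogate directions g^k; the extrapolation weight alpha = (1 - sqrt(mu*gamma))/(1 + sqrt(mu*gamma)) is the one implied by (a); all parameter values are illustrative), the relation v^k - theta_ex^k = (theta_ex^k - theta^k)/sqrt(mu*gamma) holds at every step up to machine precision:

import numpy as np

rng = np.random.default_rng(0)
d, mu, gamma = 4, 0.5, 0.1
s = np.sqrt(mu * gamma)
rho = 1.0 - s
alpha = (1.0 - s) / (1.0 + s)               # so that rho*(theta^{k+1}-theta^k) = (1+s)*(theta_ex^{k+1}-theta^{k+1})

theta = rng.normal(size=d)
theta_ex = theta.copy()                     # theta_ex^1 = theta^1
v = theta.copy()                            # v^1 = theta^1
for _ in range(50):
    g = rng.normal(size=d)                  # the identity holds for an arbitrary surrogate g^k
    v = rho * v + s * theta_ex - np.sqrt(gamma / mu) * g    # update (78)
    theta_next = theta_ex - gamma * g                       # main step theta^{k+1} = theta_ex^k - gamma g^k
    theta_ex = theta_next + alpha * (theta_next - theta)    # extrapolation
    theta = theta_next
    assert np.allclose(v - theta_ex, (theta_ex - theta) / s)
print("identity verified over 50 random steps")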

E Proof of Proposition 4

We begin by observing that due to the \(L_{H,i}\)-Lipschitz continuity of the Hessian of \(f_i\) and using Lemma 1, we have:

$$\begin{aligned} \begin{aligned}&\Vert {{{\varvec{e}}}}_{\textsf{ACIAG}}^\ell \Vert = \Vert {{{\varvec{g}}}}_{\textsf{ACIAG}}^\ell - {\nabla }F( {\varvec{\theta }}_{ex}^\ell ) \Vert \le \sum _{i=1}^m \frac{L_{H,i}}{2} \Vert {\varvec{\theta }}_{ex}^\ell - {\varvec{\theta }}_{ex}^{\tau _i^\ell } \Vert ^2 \;. \end{aligned} \end{aligned}$$
(87)

Now, expanding the right hand side of (87) gives:

$$\begin{aligned} \begin{aligned} \Vert {{{\varvec{e}}}}_{\textsf{ACIAG}}^\ell \Vert&\le \sum _{i=1}^m \frac{L_{H,i}}{2} \Big \Vert {\varvec{\theta }}_{ex}^\ell - {\varvec{\theta }}_{ex}^{\tau _i^\ell } \Big \Vert ^2 \le \sum _{i=1}^m \frac{L_{H,i}}{2} ~ \underbrace{( \ell - \tau _i^\ell )}_{\le K} \sum _{j=\tau _i^\ell }^{\ell -1} \Vert {\varvec{\theta }}_{ex}^{j+1} - {\varvec{\theta }}_{ex}^j \Vert ^2 \\&\quad \le \frac{K L_{H}}{2} \sum _{j=( \ell -K )_{++}}^{\ell -1} \Vert {\varvec{\theta }}_{ex}^{j+1} - {\varvec{\theta }}_{ex}^j \Vert ^2 = \frac{K L_{H}}{2} \sum _{j=( \ell -K )_{++}}^{\ell -1} \Vert \gamma {{{\varvec{g}}}}_{\textsf{ACIAG}}^j + \underbrace{\alpha ( {\varvec{\theta }}^{j+1} - {\varvec{\theta }}^j)}_{= {\varvec{\theta }}_{ex}^{j+1} - {\varvec{\theta }}^{j+1}} \Vert ^2 \\&\quad \le \frac{3 K L_H}{2} \sum _{j=( \ell -K )_{++}}^{\ell -1} \left( \gamma ^2 \left( \Vert {{{\varvec{e}}}}^j \Vert ^2 + \Vert {\nabla }F( {\varvec{\theta }}_{ex}^j ) \Vert ^2 \right) + \Vert {\varvec{\theta }}_{ex}^{j+1} - {\varvec{\theta }}^{j+1} \Vert ^2 \right) \;. \end{aligned} \end{aligned}$$
(88)

Remarkably, the above bound resembles that of Proposition 3 with the exception of the last term that depends on \({\varvec{\theta }}_{ex}^{j+1} - {\varvec{\theta }}^{j+1}\). This is included to account for the extrapolated iterates used in the A-CIAG method.

To obtain the upper bound on \(\Vert {{{\varvec{e}}}}_{\textsf{ACIAG}}^\ell \Vert\) claimed in Proposition 4, we next upper bound \(\Vert {{{\varvec{e}}}}_{\textsf{ACIAG}}^j \Vert ^2\) and \(\Vert {\nabla }F({\varvec{\theta }}_{ex}^j) \Vert ^2\), respectively. Firstly,

$$\begin{aligned} \begin{aligned} \Vert {{{\varvec{e}}}}_{\textsf{ACIAG}}^j \Vert&\le \sum _{i=1}^m \frac{L_{H,i}}{2} \Big \Vert {\varvec{\theta }}_{ex}^j - {\varvec{\theta }}_{ex}^{\tau _i^j} \Big \Vert ^2 \\&\le \sum _{i=1}^m L_{H,i} \left( (1+\alpha )^2 \Vert {\varvec{\theta }}^j - {\varvec{\theta }}^{\tau _i^j} \Vert ^2 + \alpha ^2 \Vert {\varvec{\theta }}^{j-1} - {\varvec{\theta }}^{\tau _i^j-1} \Vert ^2 \right) \;. \end{aligned} \end{aligned}$$
(89)

Noticing that \(\Vert {\varvec{\theta }}^j - {\varvec{\theta }}^{\tau _i^j} \Vert ^2 \le 2 ( \Vert {\varvec{\theta }}^j - {\varvec{\theta }}^\star \Vert ^2 + \Vert {\varvec{\theta }}^{\tau _i^j} - {\varvec{\theta }}^\star \Vert ^2 ) \le (4/\mu ) ( h^{(j)} + h^{(\tau _i^j)} )\), it follows from (89) that

$$\begin{aligned} \begin{aligned} \Vert {{{\varvec{e}}}}_{\textsf{ACIAG}}^j \Vert&\le \frac{4}{\mu } \sum _{i=1}^m L_{H,i} \left( (1+\alpha )^2 ( h^{(j)} + h^{(\tau _i^j)} ) + \alpha ^2 ( h^{(j-1)} + h^{(\tau _i^j - 1)} ) \right) \\&\le \frac{ 8 L_H }{\mu } \left( (1+\alpha )^2 + \alpha ^2 \right) \max _{ (j- K-1)_{++} \le q \le j } h^{(q)} \le \frac{ 40 L_H }{\mu } \max _{ (j- K-1)_{++} \le q \le j } h^{(q)} \;, \end{aligned} \end{aligned}$$
(90)

which implies

$$\begin{aligned} \begin{aligned} \sum _{j=(\ell -K)_{++}}^{\ell -1} \Vert {{{\varvec{e}}}}_{\textsf{ACIAG}}^j \Vert ^2&\le K \left( \frac{ 40 L_H }{\mu }\right) ^2 \max _{ (\ell - 2K-1)_{++} \le q \le \ell } ( h^{(q)} )^2 \;. \end{aligned} \end{aligned}$$
(91)

Secondly,

$$\begin{aligned} \begin{aligned} \Vert {\nabla }F( {\varvec{\theta }}_{ex}^j ) \Vert ^2&\le 2L^2 \left( \Vert {\varvec{\theta }}^j - {\varvec{\theta }}^\star \Vert ^2 + \Vert {\varvec{\theta }}^j - {\varvec{\theta }}^{j-1} \Vert ^2 \right) \le \frac{4L^2}{\mu } \left( 3 h^{(j)} + 2 h^{(j-1)} \right) \;, \end{aligned} \end{aligned}$$
(92)

thus

$$\begin{aligned} \begin{aligned} \sum _{j=(\ell -K)_{++}}^{\ell -1} \Vert {\nabla }F( {\varvec{\theta }}_{ex}^j ) \Vert ^2&\le \frac{20L^2 K }{\mu } \max _{ (\ell - K - 1)_{++} \le q \le \ell -1} h^{(q)} \;. \end{aligned} \end{aligned}$$
(93)

Substituting (91) and (93) into the right hand side of (88) verifies Proposition 4.

F Step 3 in the Proof of Theorem 2

To proceed with the proof, let us define the following quantity:

$$\begin{aligned} \begin{aligned}&\tilde{E}^{(\ell )} \mathrel{\mathop :}=\gamma ^{\frac{5}{2}} \sqrt{\frac{9}{2}} K^2 L_H \left( \left( \frac{40L_H}{\mu } \right) ^2 \max _{ (\ell -2K-1)_{++} \le q \le \ell } (h^{(q)})^2 + \frac{20L^2}{\mu } \max _{ (\ell -K-1)_{++} \le q \le \ell } h^{(q)} \right) \\&\quad + \gamma ^{\frac{9}{2}} \frac{ 81 K^4 L_H^2 }{4 \sqrt{\mu }} \left( \left( \frac{40L_H}{\mu } \right) ^4 \max _{ (\ell -2K-1)_{++} \le q \le \ell } (h^{(q)})^4 + \left( \frac{20L^2}{\mu } \right) ^2 \max _{ (\ell -K-1)_{++} \le q \le \ell } (h^{(q)})^2 \right) \;. \end{aligned} \end{aligned}$$

Using Proposition 4, we obtain:

$$\begin{aligned} \begin{aligned}&\sqrt{2 \gamma h^{(\ell )}} \Vert {{{\varvec{e}}}}_{\textsf{ACIAG}}^\ell \Vert + \sqrt{\frac{9\gamma }{\mu }} \Vert {{{\varvec{e}}}}_{\textsf{ACIAG}}^\ell \Vert ^2 \\&\quad \le \tilde{E}^{(\ell )} + \sum _{j=(\ell -K+1)_{++}}^{\ell } \left( \sqrt{\frac{9 \gamma h^{(\ell )} K^2 L_H^2}{2}} \Vert {\varvec{\theta }}^{j} - {\varvec{\theta }}_{ex}^{j} \Vert ^2 + \frac{27 K^3 L_H^2}{4} \sqrt{\frac{9\gamma }{\mu }} \Vert {\varvec{\theta }}^j - {\varvec{\theta }}_{ex}^j \Vert ^4 \right) \;. \end{aligned} \end{aligned}$$
(94)

We need to further bound \(h^{(k)}\) [recall (41) in Proposition 2] in terms of itself to create a ‘recursion’ for \(h^{(k)}\). To upper bound the right hand side of (41), let us start from (94). It follows that

$$\begin{aligned} \begin{aligned}&\sum _{\ell =1}^k \rho ^{k-\ell } \left( \sqrt{2 \gamma h^{(\ell )}} \Vert {{{\varvec{e}}}}^\ell \Vert + \sqrt{\frac{9\gamma }{\mu }} \Vert {{{\varvec{e}}}}^\ell \Vert ^2 - \frac{\mu }{4} \frac{1-\mu \gamma }{\sqrt{\mu \gamma }} \Vert {\varvec{\theta }}_{ex}^\ell - {\varvec{\theta }}^\ell \Vert ^2 \right) \le \sum _{\ell =1}^k \rho ^{k-\ell } \Bigg ( \tilde{E}^{(\ell )} \\&\quad + \left( \sum _{j=\ell }^{\min \{k,\ell +K-1\}} \left( \sqrt{\frac{9 \gamma K^2 L_H^2 h^{(j)}}{2} } + \frac{81 K^3 L_H^2}{4} \sqrt{\frac{\gamma }{\mu }} \Vert {\varvec{\theta }}^\ell - {\varvec{\theta }}_{ex}^\ell \Vert ^2 \right) - \frac{\mu }{4} \frac{1-\mu \gamma }{\sqrt{\mu \gamma }} \right) \Vert {\varvec{\theta }}^\ell - {\varvec{\theta }}_{ex}^\ell \Vert ^2 \Bigg ). \end{aligned} \end{aligned}$$

Moreover, we observe for \(\ell \ge 2\):

$$\begin{aligned} \Vert {\varvec{\theta }}^\ell - {\varvec{\theta }}_{ex}^\ell \Vert ^2 \le 2 ( \Vert {\varvec{\theta }}^\ell - {\varvec{\theta }}^\star \Vert ^2 + \Vert {\varvec{\theta }}^{\ell -1} - {\varvec{\theta }}^\star \Vert ^2 ) \le \frac{4}{\mu } \left( h^{(\ell )} + h^{(\ell -1)} \right) \;. \end{aligned}$$
(95)

The coefficient in front of the last \(\Vert {\varvec{\theta }}^\ell - {\varvec{\theta }}_{ex}^\ell \Vert ^2\) term can be upper bounded as:

$$\begin{aligned} \tilde{C}^{(\ell ,k)} \mathrel{\mathop :}=\gamma K^2 L_H \sqrt{\frac{9}{2}} \max _{ \ell \le q \le \min \{ \ell +K-1,k \}} (h^{(q)})^{\frac{1}{2}} + {\gamma } \frac{81 K^4 L_H^2}{\mu ^{\frac{3}{2}}} \left( h^{(\ell )} + h^{(\ell -1)} \right) - \frac{\mu }{4} \frac{1-\mu \gamma }{\sqrt{\mu }}. \end{aligned}$$

Let us now define

$$\begin{aligned} \begin{aligned}&E^{(\ell ,k)} \mathrel{\mathop :}=\tilde{E}^{(\ell )} + \tilde{C}^{(\ell ,k)} \frac{\Vert {\varvec{\theta }}^\ell - {\varvec{\theta }}_{ex}^\ell \Vert ^2}{\sqrt{\gamma }} \;, \end{aligned} \end{aligned}$$
(96)

where \(E^{(\ell ,k)} = E^{(\ell ,k -1)}\) for all \(k \ge \ell + m\). Applying Proposition 2 readily shows

$$\begin{aligned} h^{(k+1)} \le 2 ( 1 - \sqrt{\mu \gamma } )^k h^{(1)} + \sum _{\ell =1}^k (1 - \sqrt{\mu \gamma })^{k - \ell } E^{(\ell ,k)} \;. \end{aligned}$$
(97)

Concluding the Proof of Theorem 2. Our goal is to analyze (97) using Proposition 6. Let us recognize that:

$$\begin{aligned} \begin{aligned} R^{(k)}&= \bar{h}^{(k)},~p = (1-\sqrt{\mu \gamma }),~b = 2,~M= 2K+1,~\eta _1 = \frac{3}{2},~\eta _2 = \frac{5}{2},~\eta _3 = 2,~\eta _4 = 4,\\ s_1&= \gamma ^{\frac{5}{2}} \sqrt{\frac{9}{2}} K^2 L_H \frac{20L^2}{\mu },~ s_2 = \gamma ^{\frac{5}{2}} \sqrt{\frac{9}{2}} K^2 L_H \left( \frac{40L_H}{\mu } \right) ^2, \\ s_3&= \gamma ^{\frac{9}{2}} \frac{81 K^4 L_H^2}{4\sqrt{\mu }} \left( \frac{20L^2}{\mu }\right) ^2,~ s_4 = \gamma ^{\frac{9}{2}} \frac{81 K^4 L_H^2}{4\sqrt{\mu }} \left( \frac{40L_H}{\mu } \right) ^4 \;,\\ c&= \frac{\mu }{4} \frac{1 - \mu \gamma }{\sqrt{\mu }},~D^{(\ell )} = \frac{ \Vert {\varvec{\theta }}^\ell - {\varvec{\theta }}_{ex}^\ell \Vert ^2 }{\sqrt{\gamma }},~f( \bar{h}^{(q)} ) = \gamma \left( K^2 L_H \sqrt{\frac{9}{2}} (\bar{h}^{(q)})^{\frac{1}{2}} + \frac{162 K^4 L_H^2}{\mu ^{\frac{3}{2}}} \bar{h}^{(q)} \right) \;. \end{aligned} \end{aligned}$$

The conditions in (55) are satisfied when

$$\begin{aligned} \begin{aligned}&\frac{\sqrt{\mu }}{4} - \gamma \left( K^2 L_H \sqrt{9} (\bar{h}^{(1)})^{\frac{1}{2}} + \frac{324 K^4 L_H^2}{\mu ^{\frac{3}{2}}} \bar{h}^{(1)} + \frac{\mu ^{\frac{3}{2}}}{4} \right) \ge 0 \\&\Longleftrightarrow \gamma \le \frac{\sqrt{\mu }}{4} \left( K^2 L_H \sqrt{9} (\bar{h}^{(1)})^{\frac{1}{2}} + \frac{324 K^4 L_H^2}{\mu ^{\frac{3}{2}}} \bar{h}^{(1)} + \frac{\mu ^{\frac{3}{2}}}{4} \right) ^{-1} \mathrel{\mathop :}=\frac{\bar{c}_3}{L} \;, \end{aligned} \end{aligned}$$
(98)

and

$$\begin{aligned} \begin{aligned} 1 > (1-\sqrt{\mu \gamma })&+ \gamma ^{\frac{5}{2}} \sqrt{\frac{9}{2}} K^2 L_H \left( \frac{20L^2}{\mu } (2 \bar{h}^{(1)})^{\frac{1}{2}} + \left( \frac{40L_H}{\mu } \right) ^2 (2 \bar{h}^{(1)})^{\frac{3}{2}} \right) \\&+ \gamma ^{\frac{9}{2}} \frac{81 K^4 L_H^2}{4\sqrt{\mu }} \left( \left( \frac{20L^2}{\mu }\right) ^2 (2 \bar{h}^{(1)} ) + \left( \frac{40L_H}{\mu } \right) ^4 (2 \bar{h}^{(1)})^3 \right) \;, \end{aligned} \end{aligned}$$
(99)

which is implied by

$$\begin{aligned} \begin{aligned}&\gamma< \left( \frac{\sqrt{\mu }}{\sqrt{18} K^2 L_H}\left( \frac{20L^2}{\mu } (2 \bar{h}^{(1)})^{\frac{1}{2}} + \left( \frac{40L_H}{\mu } \right) ^2 (2 \bar{h}^{(1)})^{\frac{3}{2}} \right) ^{-1} \right) ^{\frac{1}{2}} \mathrel{\mathop :}=\frac{\bar{c}_1}{L}~~~~\text{and} \\&\gamma < \left( \frac{2 {\mu }}{81 K^4 L_H^2} \left( \left( \frac{20L^2}{\mu }\right) ^2 (2 \bar{h}^{(1)} ) + \left( \frac{40L_H}{\mu } \right) ^4 (2 \bar{h}^{(1)})^3 \right) ^{-1} \right) ^{\frac{1}{4}} \mathrel{\mathop :}=\frac{\bar{c}_2}{L} \;. \end{aligned} \end{aligned}$$
(100)

Substituting these constants into Proposition 6 proves the claims in Theorem 2.
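
For concreteness, the small helper below (an illustrative sketch; the function name and the example arguments are our own, not part of the paper) evaluates the three step-size bounds in (98) and (100); their minimum gives an admissible gamma for invoking Proposition 6:

import numpy as np

def aciag_stepsize_bound(mu, L, L_H, K, h1_bar):
    # The three bounds below transcribe (98) and (100); gamma must not exceed any of them
    # (the paper writes them as c_3/L, c_1/L and c_2/L, respectively).
    g3 = (np.sqrt(mu) / 4.0) / (3.0 * K**2 * L_H * np.sqrt(h1_bar)
                                + 324.0 * K**4 * L_H**2 * h1_bar / mu**1.5
                                + mu**1.5 / 4.0)
    g1 = np.sqrt((np.sqrt(mu) / (np.sqrt(18.0) * K**2 * L_H))
                 / (20.0 * L**2 / mu * np.sqrt(2.0 * h1_bar)
                    + (40.0 * L_H / mu) ** 2 * (2.0 * h1_bar) ** 1.5))
    g2 = ((2.0 * mu / (81.0 * K**4 * L_H**2))
          / ((20.0 * L**2 / mu) ** 2 * (2.0 * h1_bar)
             + (40.0 * L_H / mu) ** 4 * (2.0 * h1_bar) ** 3)) ** 0.25
    return min(g1, g2, g3)

print(aciag_stepsize_bound(mu=1.0, L=10.0, L_H=1.0, K=5, h1_bar=0.1))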

G Proof of Proposition 6

Define \(\{ \bar{R}^{(k)} \}_{k \ge 1}\) that satisfies:

$$\begin{aligned} \bar{R}^{(k+1)} = p^k b \bar{R}^{(1)} + \sum _{\ell =1}^k p^{k-\ell } \left( \sum _{j=1}^J s_j \max _{ (\ell - M)_{++} \le q \le \ell } (\bar{R}^{(q)})^{\eta _j} \right) ,~~\bar{R}^{(1)} = R^{(1)} \;. \end{aligned}$$
(101)

By subtracting \(p \bar{R}^{(k)}\) from \(\bar{R}^{(k+1)}\), (101) can be alternatively expressed as:

$$\begin{aligned} \bar{R}^{(k+1)} - p \bar{R}^{(k)} = \sum _{j=1}^J s_j \max _{ (k- M)_{++} \le q \le k } (\bar{R}^{(q)})^{\eta _j} \;. \end{aligned}$$
(102)

Now, consider statements (i) and (ii) in (56) as the following event:

$$\begin{aligned} \begin{aligned} \mathcal{E}_z = \Big \{&~ \bar{R}^{((z-1)M + k+1)} \ge R^{((z-1)M + k+1)}, ~\bar{R}^{((z-1)M + k+1)} \le \delta ^z (b \bar{R}^{(1)} ),~ k = 1,..., M \Big \} \;, \end{aligned} \end{aligned}$$

for all \(z \ge 1\). We shall prove that \(\mathcal{E}_z\) is true for \(z=1,2,...\) using induction.

Base case with \(z=1\). To prove \(\mathcal{E}_1\), let us apply another induction on k inside the event. For the base case of \(k=1\),

$$\begin{aligned} \begin{aligned} \bar{R}^{(2)}&\ge p ( b R^{(1)} ) + \sum _{j=1}^J s_j (R^{(1)})^{\eta _j} - ( \bar{f} - f( R^{(1)})) D^{(1)} = R^{(2)} \;, \end{aligned} \end{aligned}$$
(103)

where we used the fact \(\bar{f} \ge f( b R^{(1)} ) \ge f( R^{(1)} )\). Furthermore, the base case holds as:

$$\begin{aligned} \bar{R}^{(2)} = (b \bar{R}^{(1)}) \left( p + (1/b) \sum _{j=1}^J s_j ( \bar{R}^{(1)} )^{\eta _j - 1} \right) \le \delta ( b \bar{R}^{(1)} ) \;. \end{aligned}$$
(104)

For the induction step, suppose that the statements in \(\mathcal{E}_1\) are true up to \(k=k' - 1\), so that \(\bar{R}^{(k')} \ge R^{(k')}\) and \(\bar{R}^{(k')} \le \delta ( b \bar{R}^{(1)} )\). Considering the case of \(k=k'\), we observe that \(\bar{f} \ge f( b R^{(1)} ) \ge f (\delta b R^{(1)} ) \ge f( \bar{R}^{(q)} ) \ge f( R^{(q)} )\) for all \(q=1,...,k'\). Therefore, we can lower bound \(\bar{R}^{(k'+1)}\) as:

$$\begin{aligned} \begin{aligned}&\bar{R}^{(k'+1)} = p^{k'} ( b \bar{R}^{(1)} ) + \sum _{\ell =1}^{k'} p^{k'-\ell } \left( \sum _{j=1}^J s_j \max _{ (\ell -M)_{++} \le q \le \ell } (\bar{R}^{(q)})^{\eta _j} \right) \\&\quad \ge p^{k'} ( b R^{(1)} ) + \sum _{\ell =1}^{k'} p^{k'-\ell } \left( \sum _{j=1}^J s_j \max _{ (\ell -M)_{++} \le q \le \ell } (R^{(q)})^{\eta _j} - \left( \bar{f} - \max _{\ell \le q \le k'} f(R^{(q)}) \right) D^{(\ell )} \right) , \end{aligned} \end{aligned}$$

where the right hand side is exactly \(R^{(k'+1)}\); also, using (102), we can show:

$$\begin{aligned} \begin{aligned} \bar{R}^{(k'+1)}&\le ( b \bar{R}^{(1)} ) \left( \delta p + \sum _{j=1}^J s_j (b \bar{R}^{(1)})^{\eta _j-1} \right) \le \delta ( b \bar{R}^{(1)} ) \;. \end{aligned} \end{aligned}$$
(105)

Induction case. Suppose that \(\mathcal{E}_z\) is true for all z up to \(z'\); we consider the case when \(z = z' + 1\). Once again, we apply another induction on k. In the base case of \(k = 1\) and \(z=z' + 1\), we have

$$\begin{aligned} \begin{aligned}&\bar{R}^{(z'M+2)} = p^{z'M+1} ( b \bar{R}^{(1)} ) + \sum _{\ell =1}^{z'M+1} p^{z'M+1-\ell } \left( \sum _{j=1}^J s_j \max _{ (\ell -M)_{++} \le q \le \ell } (\bar{R}^{(q)})^{\eta _j} \right) \\&\quad \ge p^{z'M+1} ( b R^{(1)} ) + \sum _{\ell =1}^{z'M+1} p^{z'M+1-\ell } \left( \sum _{j=1}^J s_j \max _{ (\ell -M)_{++} \le q \le \ell } (R^{(q)})^{\eta _j} \right. \\&\qquad \left. - \left( \bar{f} - \max _{\ell \le q \le z'M + 1} f(R^{(q)}) \right) D^{(\ell )} \right) = R^{(z'M+2)} \;, \end{aligned} \end{aligned}$$

where we used \(\bar{f} \ge f( b R^{(1)} ) \ge f ( \bar{R}^{(q)} ) \ge f( R^{(q)} )\) for all q up to \(q = z'M+1\) (by the induction hypothesis). Furthermore, the base case holds since:

$$\begin{aligned} \begin{aligned} \bar{R}^{(z'M+2)}&= p \bar{R}^{(z'M+1)} + \sum _{j=1}^J s_j \max _{ (z'M+1-M)_{++} \le q \le z'M+1 } ( \bar{R}^{(q)} )^{\eta _j} \\&\quad \le \delta ^{z'} (b \bar{R}^{(1)}) \left( p + \sum _{j=1}^J s_j (\delta ^{z'})^{\eta _j-1} (b \bar{R}^{(1)})^{\eta _j-1} \right) \le \delta ^{z'+1} ( b \bar{R}^{(1)} ) \;. \end{aligned} \end{aligned}$$
(106)

Let the statements in \(\mathcal{E}_z\) be true up to \(k=k' - 1\), \(z=z'+1\). With \(k = k'\),

$$\begin{aligned} \begin{aligned} \bar{R}^{( z'M + k' + 1 )}&\ge p^{z'M+k'} ( b R^{(1)} ) + \sum _{\ell =1}^{z'M+k'} p^{z'M+k'-\ell } \left( \sum _{j=1}^J s_j \max _{ (\ell -M)_{++} \le q \le \ell } (R^{(q)})^{\eta _j} \right. \\&\quad \left. - \left( \bar{f} - \max _{\ell \le q \le z'M + k'} f(R^{(q)}) \right) D^{(\ell )} \right) = R^{(z'M + k' + 1)} \;,\\ \bar{R}^{(z'M+k'+1)}&\le \delta ^{z'} (b \bar{R}^{(1)}) \left( \delta p + \sum _{j=1}^J s_j (\delta ^{z'})^{\eta _j-1} (b \bar{R}^{(1)})^{\eta _j-1} \right) \le \delta ^{z'+1} ( b \bar{R}^{(1)} ) \;. \end{aligned} \end{aligned}$$
(107)

The induction case is thus proven. This shows that the event \(\mathcal{E}_z\) is true for all \(z \ge 1\).

Proving statement (iii). We apply statement (ii) to prove (iii). From (102),

$$\begin{aligned} \begin{aligned} \frac{ \bar{R}^{(k+1)} }{ \bar{R}^{(k)} }&= p + \frac{1}{ \bar{R}^{(k)} } \sum _{j=1}^J s_j \max _{ (k-M)_{++} \le q \le k} (\bar{R}^{(q)} )^{\eta _j} \;. \end{aligned} \end{aligned}$$
(108)

For any \(q \in [(k-M)_{++}, k]\), we have

$$\begin{aligned} \frac{ (\bar{R}^{(q)})^{\eta _j} }{\bar{R}^{(k)}} = \frac{ \bar{R}^{(q)} }{ \bar{R}^{(k)} } (\bar{R}^{(q)})^{\eta _j - 1} \le \frac{ \bar{R}^{(q)} }{ \bar{R}^{(k)} } \left( \delta ^{\lceil (q-1) / M \rceil } ( b R^{(1)} ) \right) ^{\eta _j - 1} \;. \end{aligned}$$
(109)

Since \(\eta _j > 1\) and \(|q-k| \le M\), we have \(\delta ^{\lceil (q-1) / M \rceil ( \eta _j - 1 )} \rightarrow 0\) as \(k \rightarrow \infty\); moreover, as \(\bar{R}^{(k+1)} / \bar{R}^{(k)} \ge p\) for all \(k \ge 1\), we have \(\bar{R}^{(q)} / \bar{R}^{(k)} \le p^{-M}\) for all such q. Therefore, we get

$$\begin{aligned} \lim _{ k \rightarrow \infty } \frac{ \max _{ (k-M)_{++} \le q \le k} (\bar{R}^{(q)} )^{\eta _j} }{ \bar{R}^{(k)} } = 0,~\forall ~j \Longrightarrow \lim _{ k \rightarrow \infty } \frac{ \bar{R}^{(k+1)} }{ \bar{R}^{(k)} } = p \;. \end{aligned}$$
(110)

Cite this article

Wai, HT., Shi, W., Uribe, C.A. et al. Accelerating incremental gradient optimization with curvature information. Comput Optim Appl 76, 347–380 (2020). https://doi.org/10.1007/s10589-020-00183-1
