# Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods

## Abstract

In this paper we study several classes of stochastic optimization algorithms enriched with heavy ball momentum. Among the methods studied are: stochastic gradient descent, stochastic Newton, stochastic proximal point and stochastic dual subspace ascent. This is the first time momentum variants of several of these methods are studied. We choose to perform our analysis in a setting in which all of the above methods are equivalent: convex quadratic problems. We prove global non-asymptotic linear convergence rates for all methods and various measures of success, including primal function values, primal iterates, and dual function values. We also show that the primal iterates converge at an accelerated linear rate in a somewhat weaker sense. This is the first time a linear rate is shown for the stochastic heavy ball method (i.e., stochastic gradient descent method with momentum). Under somewhat weaker conditions, we establish a sublinear convergence rate for Cesàro averages of primal iterates. Moreover, we propose a novel concept, which we call stochastic momentum, aimed at decreasing the cost of performing the momentum step. We prove linear convergence of several stochastic methods with stochastic momentum, and show that in some sparse data regimes and for sufficiently small momentum parameters, these methods enjoy better overall complexity than methods with deterministic momentum. Finally, we perform extensive numerical testing on artificial and real datasets, including data coming from average consensus problems.


1. In addition, these three methods are identical to the stochastic fixed point method (with relaxation) for solving the fixed point problem $$x = {\mathbb {E}\left[ \varPi _{\mathcal{L}_\mathbf{S}}(x)\right] }$$, where $$\mathcal{L}_{\mathbf{S}}$$ is the set of solutions of $$\mathbf{S}^\top \mathbf{A}x = \mathbf{S}^\top b$$, which is a sketched version of the linear system (2), and can be seen as a stochastic approximation of the set $$\mathcal{L}{:}{=}\{x\;:\; \mathbf{A}x = b\}$$.
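To make the sketched sets $$\mathcal{L}_{\mathbf{S}}$$ and the fixed point property concrete, here is a minimal numerical sketch in the special case $$\mathbf{B}=\mathbf{I}$$ with unit coordinate vector sketches $$\mathbf{S}=e_i$$ (the randomized Kaczmarz setting), where $$\mathcal{L}_{\mathbf{S}}=\{x: a_i^\top x = b_i\}$$ and the projection has an explicit formula; the dimensions and seed are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Consistent system Ax = b with B = I and unit-vector sketches S = e_i,
# so L_S = {x : a_i^T x = b_i} and the projection onto L_S is explicit.
m, n = 20, 5
A = rng.standard_normal((m, n))
x_true = rng.standard_normal(n)
b = A @ x_true

def project_sketched(x, i):
    """Euclidean projection of x onto L_S = {y : a_i^T y = b_i}."""
    a = A[i]
    return x - (a @ x - b[i]) / (a @ a) * a

# Any solution of Ax = b is a fixed point of the map x -> E[Pi_{L_S}(x)]:
avg = np.mean([project_sketched(x_true, i) for i in range(m)], axis=0)
assert np.allclose(avg, x_true)   # x_true = E[Pi_{L_S}(x_true)]
```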

2. In the rest of the paper we consider projection with respect to an arbitrary Euclidean norm.

3. Note that for $$\mathbf {B}=\mathbf {I}$$ it holds that $$\mathbf {M}^{\dagger _{\mathbf {I}}}=\mathbf {M}^{\dagger }$$ and hence the $$\mathbf {I}$$-pseudoinverse reduces to the standard Moore-Penrose pseudoinverse.

4. A more popular, and certainly theoretically much better understood, alternative to Polyak’s momentum is the momentum introduced by Nesterov [60, 62], leading to the famous accelerated gradient descent (AGD) method. This method converges non-asymptotically and globally, with the optimal sublinear rate $$\mathcal{O}(\sqrt{L/\epsilon })$$ when applied to minimizing a smooth convex objective function (class $$\mathcal{F}^{1,1}_{0,L}$$), and with the optimal linear rate $$\mathcal{O}(\sqrt{L/\mu } \log (1/\epsilon ))$$ when minimizing smooth strongly convex functions (class $$\mathcal{F}^{1,1}_{\mu ,L}$$). Recently, variants of Nesterov’s momentum have also been introduced for the acceleration of stochastic gradient descent. We refer the interested reader to [1, 26, 35, 36, 41, 95, 96] and the references therein. Both Nesterov’s and Polyak’s update rules are known in the literature as “momentum” methods. In this paper, however, we focus exclusively on Polyak’s heavy ball momentum.

5. This choice implies that the Hessian of the stochastic function $$f_{\mathbf{S}}(x) = \tfrac{1}{2}\Vert \mathbf{A}x - b\Vert ^2_{\mathbf{H}}$$ is a projection matrix (in the $$\mathbf{B}$$ inner product). This fact is immensely useful throughout the analysis. It also implies that SGD without momentum satisfies the decrease identity

\begin{aligned} \Vert x_{k+1}-x_*\Vert _{\mathbf{B}}^2 = \Vert x_k - x_*\Vert _{\mathbf{B}}^2 - 2\omega (2-\omega ) f_{\mathbf{S}_k}(x_k), \end{aligned}

where $$x_*$$ is the projection of $$x_0$$ onto the solution space of the linear system $$\mathbf{A}x=b$$. The above identity holds for any matrix $$\mathbf{S}_k$$; note that it does not involve any expectation. If $$\mathbf{S}_k$$ is chosen randomly, as it is throughout our paper, then this is an identity between two random variables: the left- and the right-hand side. One interesting consequence of the identity is that $$\omega =1$$ is a natural (and in some sense optimal) stepsize for SGD without momentum. Indeed, fixing $$\mathbf{S}_k$$ and $$x_k$$, the decrease in squared distance is maximized at $$\omega =1$$. Lastly, the choice of $$\mathbf{H}$$ is what makes all the various methods we consider in this paper equivalent; it is in this sense the canonical choice of pseudo-metric.
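The identity can be verified per sample, with no expectation involved. Below is a small numerical check in the randomized Kaczmarz special case ($$\mathbf{B}=\mathbf{I}$$, $$\mathbf{S}_k=e_i$$), where $$f_{\mathbf{S}}(x)=(a_i^\top x - b_i)^2/(2\Vert a_i\Vert ^2)$$; the dimensions, seed, and stepsize are illustrative, not taken from the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(1)

# Randomized Kaczmarz instance: B = I, S_k = e_i, so
# f_S(x) = (a_i^T x - b_i)^2 / (2 ||a_i||^2) and
# grad f_S(x) = (a_i^T x - b_i) a_i / ||a_i||^2.
m, n = 30, 8
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)   # a solution of the consistent system
b = A @ x_star

omega = 1.5                       # any relaxation parameter in (0, 2)
x = rng.standard_normal(n)
err0 = np.linalg.norm(x - x_star)**2
for _ in range(100):
    i = rng.integers(m)
    a = A[i]
    r = a @ x - b[i]
    f_S = r**2 / (2 * a @ a)                  # stochastic function value
    x_new = x - omega * r / (a @ a) * a       # SGD step without momentum
    # Decrease identity from footnote 5 (holds deterministically, per sample):
    lhs = np.linalg.norm(x_new - x_star)**2
    rhs = np.linalg.norm(x - x_star)**2 - 2 * omega * (2 - omega) * f_S
    assert np.isclose(lhs, rhs)
    x = x_new
```

Since $$2\omega (2-\omega ) f_{\mathbf{S}_k}(x_k)\ge 0$$ for $$\omega \in (0,2)$$, the squared distance to $$x_*$$ is monotonically nonincreasing along the trajectory.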

6. While the Hessian is not self-adjoint with respect to the standard inner product, it is self-adjoint with respect to the inner product $$\langle \mathbf{B}x, y\rangle$$ which we use as the canonical inner product in $$\mathbb {R}^n$$.

7. The gradient is computed with respect to the inner product $$\langle x, y\rangle _{\mathbf{B}} {:}{=}\langle \mathbf{B}x, y\rangle$$. Since $$\langle x, y\rangle = \langle \mathbf{B}^{-1}x, y\rangle _{\mathbf{B}}$$, this gradient is obtained from the standard gradient by applying to it the linear transformation $$\mathbf{B}^{-1}$$.

8. In this method we take the $$\mathbf{B}$$-pseudoinverse of the Hessian of $$f_{\mathbf{S}_k}$$ instead of the classical inverse, as the inverse does not exist. When $$\mathbf{B}=\mathbf{I}$$, the $$\mathbf{B}$$-pseudoinverse specializes to the standard Moore–Penrose pseudoinverse.

9. In this case, the equivalence only works for $$0<\omega \le 1$$.

10. In the plots of Fig. 1, the hyperplane of each update is chosen in an alternating fashion for illustration purposes.

11. The experiments were repeated with various values of the main parameters and initializations, and similar results were obtained in all cases.

12. Recall that in our setting we have $$f(x_*)=0$$ for the optimal solution $$x_*$$ of the best approximation problem; thus $$f(x)-f(x_*)=f(x)$$. The function values $$f(x_k)$$ refer to function (37) in the case of RK and to function (39) for RCD. For block variants, the objective function of problem (1) also has a closed-form expression, but it can be very difficult to compute. In these cases one can instead evaluate the quantity $$\Vert \mathbf {A}x-b\Vert ^2_{\mathbf {B}}$$.

13. Note that in the first experiment we use Gaussian matrices, which by construction are full rank with probability 1; as a result, the consistent linear systems have a unique solution. Thus, for any starting point $$x_0$$, the vector z used to create the linear system is the solution mSGD converges to. This is not true for general consistent linear systems whose matrix is not full rank. In this case, the solution $$x_{*}=\varPi _{\mathcal{L}}^{\mathbf {B}}(x_0)$$ that mSGD converges to is not necessarily equal to z. For this reason, when evaluating the relative error measure $$\Vert x_k-x_*\Vert ^2_\mathbf {B}/ \Vert x_0-x_*\Vert ^2_\mathbf {B}$$, one should be careful to use the value $$x_*=x_0+\mathbf {A}^\dagger (b- \mathbf {A}x_0)\overset{x_0=0}{=} \mathbf {A}^\dagger b$$.
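The distinction the footnote makes can be checked numerically in the Euclidean case $$\mathbf{B}=\mathbf{I}$$: for a rank-deficient consistent system, the generating vector z is a solution but generally differs from the min-norm solution $$\mathbf{A}^\dagger b$$ that mSGD started from $$x_0=0$$ converges to. The sizes and rank below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

# Rank-deficient consistent system: z generates b, but z need not be the
# point mSGD converges to from x0 = 0; that point is x* = A^+ b.
m, n, r = 20, 10, 4
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank 4 w.p. 1
z = rng.standard_normal(n)
b = A @ z

x_star = np.linalg.pinv(A) @ b   # = Pi_L(0), the min-norm solution
assert np.allclose(A @ x_star, b)                  # x_star solves the system
assert np.linalg.norm(x_star) <= np.linalg.norm(z) + 1e-12  # minimal norm
assert not np.allclose(x_star, z)   # generally different from z
```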

14. RCD converges to the optimal solution only in the case of positive definite matrices. For this reason we use $$\mathbf {A}= \mathbf {P}^\top \mathbf {P}\in \mathbb {R}^{n \times n}$$, which is full rank (hence positive definite) with probability 1.

15. To pre-compute the solution $$x_*$$ of each linear system $$\mathbf {A}_g x=b_g$$ we use the closed-form expression of the projection (14).

16. The matrix $$\mathbf {A}$$ of the linear system is the incidence matrix of the graph; it is known that the Laplacian matrix equals $$\mathbf {L}=\mathbf {A}^\top \mathbf {A}$$ and that $$\Vert \mathbf {A}\Vert ^2_F=2m$$.

17. The lower bound on $$\beta$$ is tight; the upper bound is not. However, we do not care much about the regime of large $$\beta$$, as $$\beta$$ is the convergence rate and hence is only interesting if smaller than 1.
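As a complement to the footnotes above, the heavy ball update studied in the paper, $$x_{k+1} = x_k - \omega \nabla f_{\mathbf{S}_k}(x_k) + \beta (x_k - x_{k-1})$$, can be sketched in the randomized Kaczmarz special case ($$\mathbf{B}=\mathbf{I}$$, $$\mathbf{S}_k=e_i$$). All sizes, the seed, and the parameter values $$\omega =1$$, $$\beta =0.3$$ are illustrative choices, not the paper's tuned settings:

```python
import numpy as np

rng = np.random.default_rng(3)

# Heavy ball (momentum) variant of randomized Kaczmarz, a special case of
# mSGD: x_{k+1} = x_k - omega * grad f_{S_k}(x_k) + beta * (x_k - x_{k-1}).
m, n = 50, 10
A = rng.standard_normal((m, n))   # full column rank w.p. 1: unique solution
x_star = rng.standard_normal(n)
b = A @ x_star

omega, beta = 1.0, 0.3            # stepsize and momentum (illustrative)
x_prev = np.zeros(n)
x = np.zeros(n)
err0 = np.linalg.norm(x - x_star)**2
for _ in range(2000):
    i = rng.integers(m)
    a = A[i]
    grad = (a @ x - b[i]) / (a @ a) * a       # grad f_{S_k}(x) for B = I
    x, x_prev = x - omega * grad + beta * (x - x_prev), x
err_final = np.linalg.norm(x - x_star)**2
print(err_final / err0)   # relative error after 2000 stochastic steps
```

Since the Gaussian matrix is full column rank with probability 1, the iterates converge to the unique solution $$x_*$$ regardless of the starting point.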

## References

1. Allen-Zhu, Z.: Katyusha: the first direct acceleration of stochastic gradient methods. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1200–1205. ACM (2017)

2. Allen-Zhu, Z., Qu, Z., Richtárik, P., Yuan, Y.: Even faster accelerated coordinate descent using non-uniform sampling. In: International Conference on Machine Learning, pp. 1110–1119 (2016)

3. Arnold, S., Manzagol, P., Babanezhad, R., Mitliagkas, I., Roux, N.: Reducing the variance in online optimization by transporting past gradients. arXiv preprint arXiv:1906.03532 (2019)

4. Bertsekas, D.: Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. Optim. Mach. Learn. 2010(1–38), 3 (2011)

5. Blatt, D., Hero, A., Gauchman, H.: A convergent incremental gradient method with a constant step size. SIAM J. Optim. 18(1), 29–51 (2007)

6. Boyd, S., Ghosh, A., Prabhakar, B., Shah, D.: Randomized gossip algorithms. IEEE Trans. Inf. Theory 14(SI), 2508–2530 (2006)

7. Byrne, C.: Applied Iterative Methods. AK Peters, Wellesley (2008)

8. Can, B., Gurbuzbalaban, M., Zhu, L.: Accelerated linear convergence of stochastic momentum methods in Wasserstein distances. In: International Conference on Machine Learning, pp. 891–901 (2019)

9. Chambolle, A., Ehrhardt, M., Richtárik, P., Schönlieb, C.: Stochastic primal–dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM J. Optim. 28(4), 2783–2808 (2018)

10. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27 (2011)

11. Csiba, D., Richtárik, P.: Global convergence of arbitrary-block gradient methods for generalized Polyak–Lojasiewicz functions. arXiv preprint arXiv:1709.03014 (2017)

12. De Abreu, N.M.M.: Old and new results on algebraic connectivity of graphs. Linear Algebra Appl. 423(1), 53–73 (2007)

13. Defazio, A.: A simple practical accelerated method for finite sums. In: Advances in Neural Information Processing Systems, pp. 676–684 (2016)

14. Defazio, A., Bach, F., Lacoste-Julien, S.: SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems, pp. 1646–1654 (2014)

15. Devraj, A., Bušic, A., Meyn, S.: Optimal matrix momentum stochastic approximation and applications to q-learning. arXiv preprint arXiv:1809.06277 (2018)

16. Devraj, A., Bušić, A., Meyn, S.: Zap meets momentum: stochastic approximation algorithms with optimal convergence rate. arXiv preprint arXiv:1809.06277 (2018)

17. Dimakis, A., Kar, S., Moura, J., Rabbat, M., Scaglione, A.: Gossip algorithms for distributed signal processing. Proc. IEEE 98(11), 1847–1864 (2010)

18. Elaydi, S.: An Introduction to Difference Equations. Springer, Berlin (2005)

19. Eldar, Y., Needell, D.: Acceleration of randomized Kaczmarz method via the Johnson–Lindenstrauss lemma. Numer. Algorithms 58(2), 163–177 (2011)

20. Fercoq, O., Richtárik, P.: Accelerated, parallel, and proximal coordinate descent. SIAM J. Optim. 25(4), 1997–2023 (2015)

21. Fillmore, J., Marx, M.: Linear recursive sequences. SIAM Rev. 10(3), 342–353 (1968)

22. Gadat, S., Panloup, F., Saadane, S.: Stochastic heavy ball. Electron. J. Stat. 12(1), 461–529 (2018)

23. Geman, S.: A limit theorem for the norm of random matrices. Ann. Probab. 8, 252–261 (1980)

24. Ghadimi, E., Feyzmahdavian, H., Johansson, M.: Global convergence of the heavy-ball method for convex optimization. In: Control Conference (ECC), 2015 European, pp. 310–315. IEEE (2015)

25. Ghadimi, E., Shames, I., Johansson, M.: Multi-step gradient methods for networked optimization. IEEE Trans. Signal Process. 61(21), 5417–5429 (2013)

26. Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Math. Program. 156(1–2), 59–99 (2016)

27. Gower, R., Goldfarb, D., Richtárik, P.: Stochastic block BFGS: squeezing more curvature out of data. In: International Conference on Machine Learning, pp. 1869–1878 (2016)

28. Gower, R., Richtárik, P.: Randomized iterative methods for linear systems. SIAM J. Matrix Anal. Appl. 36(4), 1660–1690 (2015)

29. Gower, R., Richtárik, P.: Stochastic dual ascent for solving linear systems. arXiv preprint arXiv:1512.06890 (2015)

30. Gower, R., Richtárik, P.: Linearly convergent randomized iterative methods for computing the pseudoinverse. arXiv preprint arXiv:1612.06255 (2016)

31. Gower, R.M., Richtárik, P.: Randomized quasi-Newton updates are linearly convergent matrix inversion algorithms. SIAM J. Matrix Anal. Appl. 38(4), 1380–1409 (2017)

32. Gurbuzbalaban, M., Ozdaglar, A., Parrilo, P.: On the convergence rate of incremental aggregated gradient algorithms. SIAM J. Optim. 27(2), 1035–1048 (2017)

33. Hanzely, F., Konečný, J., Loizou, N., Richtárik, P., Grishchenko, D.: Privacy preserving randomized gossip algorithms. arXiv preprint arXiv:1706.07636 (2017)

34. Hanzely, F., Konečnỳ, J., Loizou, N., Richtárik, P., Grishchenko, D.: A privacy preserving randomized gossip algorithm via controlled noise insertion. In: NeurIPS Privacy Preserving Machine Learning Workshop (2018)

35. Jalilzadeh, A., Shanbhag, U., Blanchet, J., Glynn, P.: Optimal smoothed variable sample-size accelerated proximal methods for structured nonsmooth stochastic convex programs. arXiv preprint arXiv:1803.00718 (2018)

36. Jofré, A., Thompson, P.: On variance reduction for stochastic smooth convex optimization with multiplicative noise. Math. Program. 174, 1–40 (2017)

37. Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)

38. Kaczmarz, S.: Angenäherte auflösung von systemen linearer gleichungen. Bulletin International de l’Academie Polonaise des Sciences et des Lettres 35, 355–357 (1937)

39. Konečný, J., Liu, J., Richtárik, P., Takáč, M.: Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Signal Process. 10(2), 242–255 (2016)

40. Konečný, J., Richtárik, P.: Semi-stochastic gradient descent methods. Front. Appl. Math. Stat. 3(9), 1–14 (2017)

41. Kovalev, D., Horváth, S., Richtárik, P.: Don’t jump through hoops and remove those loops: SVRG and Katyusha are better without the outer loop. arXiv preprint arXiv:1901.08689 (2019)

42. Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)

43. Lee, Y., Sidford, A.: Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In: 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS), pp. 147–156. IEEE (2013)

44. Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)

45. Leventhal, D., Lewis, A.: Randomized methods for linear constraints: convergence rates and conditioning. Math. Oper. Res. 35(3), 641–654 (2010)

46. Liu, J., Wright, S.: An accelerated randomized Kaczmarz algorithm. Math. Comput. 85(297), 153–178 (2016)

47. Loizou, N., Rabbat, M., Richtárik, P.: Provably accelerated randomized gossip algorithms. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7505–7509. IEEE (2019)

48. Loizou, N., Richtárik, P.: A new perspective on randomized gossip algorithms. In: 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pp. 440–444. IEEE (2016)

49. Loizou, N., Richtárik, P.: Linearly convergent stochastic heavy ball method for minimizing generalization error. In: NIPS-Workshop on Optimization for Machine Learning (2017)

50. Loizou, N., Richtárik, P.: Accelerated gossip via stochastic heavy ball method. In: 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 927–934. IEEE (2018)

51. Loizou, N., Richtárik, P.: Convergence analysis of inexact randomized iterative methods. arXiv preprint arXiv:1903.07971 (2019)

52. Loizou, N., Richtárik, P.: Revisiting randomized gossip algorithms: general framework, convergence rates and novel block and accelerated protocols. arXiv preprint arXiv:1905.08645 (2019)

53. Ma, A., Needell, D., Ramdas, A.: Convergence properties of the randomized extended Gauss–Seidel and Kaczmarz methods. SIAM J. Matrix Anal. Appl. 36(4), 1590–1604 (2015)

54. Ma, J., Yarats, D.: Quasi-hyperbolic momentum and Adam for deep learning. arXiv preprint arXiv:1810.06801 (2018)

55. Needell, D.: Randomized Kaczmarz solver for noisy linear systems. BIT Numer. Math. 50(2), 395–403 (2010)

56. Needell, D., Srebro, N., Ward, R.: Stochastic gradient descent and the randomized Kaczmarz algorithm. Math. Program. Ser. A 155(1), 549–573 (2016)

57. Needell, D., Tropp, J.: Paved with good intentions: analysis of a randomized block Kaczmarz method. Linear Algebra Appl. 441, 199–221 (2014)

58. Needell, D., Zhao, R., Zouzias, A.: Randomized block Kaczmarz method with projection for solving least squares. Linear Algebra Appl. 484, 322–343 (2015)

59. Nemirovskii, A., Yudin, D.: Problem Complexity and Method Efficiency in Optimization. Wiley Interscience, Hoboken (1983)

60. Nesterov, Y.: A method of solving a convex programming problem with convergence rate $$O(1/k^2)$$. Sov. Math. Dokl. 27, 372–376 (1983)

61. Nesterov, Y.: Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM J. Optim. 22(2), 341–362 (2012)

62. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2013)

63. Nutini, J., Schmidt, M., Laradji, I., Friedlander, M., Koepke, H.: Coordinate descent converges faster with the Gauss–Southwell rule than random selection. In: International Conference on Machine Learning, pp. 1632–1641 (2015)

64. Nutini, J., Sepehry, B., Laradji, I., Schmidt, M., Koepke, H., Virani, A.: Convergence rates for greedy Kaczmarz algorithms, and faster randomized Kaczmarz rules using the orthogonality graph. In: Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pp. 547–556. AUAI Press (2016)

65. Ochs, P., Brox, T., Pock, T.: iPiasco: inertial proximal algorithm for strongly convex optimization. J. Math. Imaging Vis. 53(2), 171–181 (2015)

66. Ochs, P., Chen, Y., Brox, T., Pock, T.: iPiano: inertial proximal algorithm for nonconvex optimization. SIAM J. Imaging Sci. 7(2), 1388–1419 (2014)

67. Penrose, M.: Random Geometric Graphs, vol. 5. Oxford University Press, Oxford (2003)

68. Polyak, B.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)

69. Polyak, B.: Introduction to Optimization. Translations Series in Mathematics and Engineering. Optimization Software, New York (1987)

70. Popa, C.: Least-squares solution of overdetermined inconsistent linear systems using Kaczmarz’s relaxation. Int. J. Comput. Math. 55(1–2), 79–89 (1995)

71. Popa, C.: Convergence rates for Kaczmarz-type algorithms. Numer. Algorithms 79(1), 1–17 (2018)

72. Qu, Z., Richtárik, P.: Coordinate descent with arbitrary sampling I: algorithms and complexity. Optim. Methods Softw. 31(5), 829–857 (2016)

73. Qu, Z., Richtárik, P.: Coordinate descent with arbitrary sampling II: expected separable overapproximation. Optim. Methods Softw. 31(5), 858–884 (2016)

74. Qu, Z., Richtárik, P., Takáč, M., Fercoq, O.: SDNA: stochastic dual Newton ascent for empirical risk minimization. In: International Conference on Machine Learning (2016)

75. Qu, Z., Richtárik, P., Zhang, T.: Quartz: randomized dual coordinate ascent with arbitrary sampling. In: Advances in Neural Information Processing Systems, pp. 865–873 (2015)

76. Richtárik, P., Takáč, M.: Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Math. Program. 144(1–2), 1–38 (2014)

77. Richtárik, P., Takáč, M.: Parallel coordinate descent methods for big data optimization. Math. Program. 156(1–2), 433–484 (2016)

78. Richtárik, P., Takáč, M.: Stochastic reformulations of linear systems: algorithms and convergence theory. arXiv:1706.01108 (2017)

79. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)

80. Schmidt, M., Le Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)

81. Schöpfer, F., Lorenz, D.: Linear convergence of the randomized sparse Kaczmarz method. Math. Program. 173, 1–28 (2018)

82. Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res. 14(1), 567–599 (2013)

83. Strohmer, T., Vershynin, R.: A randomized Kaczmarz algorithm with exponential convergence. J. Fourier Anal. Appl. 15(2), 262–278 (2009)

84. Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International Conference on Machine Learning, vol. 28, pp. 1139–1147 (2013)

85. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR, pp. 1–9 (2015)

86. Tseng, P.: An incremental gradient (-projection) method with momentum term and adaptive stepsize rule. SIAM J. Optim. 8(2), 506–531 (1998)

87. Tu, S., Venkataraman, S., Wilson, A., Gittens, A., Jordan, M., Recht, B.: Breaking locality accelerates block Gauss–Seidel. In: International Conference on Machine Learning (2017)

88. Wilson, A.C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. In: Advances in Neural Information Processing Systems, pp. 4148–4158 (2017)

89. Wright, S.: Coordinate descent algorithms. Math. Program. 151(1), 3–34 (2015)

90. Xiang, H., Zhang, L.: Randomized iterative methods with alternating projections. arXiv preprint arXiv:1708.09845 (2017)

91. Xu, P., He, B., De Sa, C., Mitliagkas, I., Re, C.: Accelerated stochastic power iteration. In: International Conference on Artificial Intelligence and Statistics, pp. 58–67 (2018)

92. Yang, T., Lin, Q., Li, Z.: Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257 (2016)

93. Zavriev, S., Kostyuk, F.: Heavy-ball method in nonconvex optimization problems. Comput. Math. Model. 4(4), 336–341 (1993)

94. Zhang, J., Mitliagkas, I., Ré, C.: Yellowfin and the art of momentum tuning. arXiv preprint arXiv:1706.03471 (2017)

95. Zhou, K.: Direct acceleration of SAGA using sampled negative momentum. arXiv preprint arXiv:1806.11048 (2018)

96. Zhou, K., Shang, F., Cheng, J.: A simple stochastic variance reduced algorithm with fast convergence rates. In: Proceedings of the 35th International Conference on Machine Learning, PMLR, vol. 80, pp. 5980–5989 (2018)

97. Zouzias, A., Freris, N.: Randomized extended Kaczmarz for solving least squares. SIAM J. Matrix Anal. Appl. 34(2), 773–793 (2013)

## Author information


### Corresponding author

Correspondence to Nicolas Loizou.

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Work done while the first author was a PhD student at School of Mathematics, The University of Edinburgh.

## Appendices

### Lemma 9

Fix $$F_1=F_0\ge 0$$ and let $$\{F_k\}_{k\ge 0}$$ be a sequence of nonnegative real numbers satisfying the relation

\begin{aligned} F_{k+1}\le a_1F_k +a_2 F_{k-1}, \quad \forall k\ge 1, \end{aligned}
(42)

where $$a_2 \ge 0$$, $$a_1 + a_2 <1$$ and at least one of the coefficients $$a_1,a_2$$ is positive. Then the sequence satisfies the relation $$F_{k+1}\le q^{k} (1+ \delta ) F_0$$ for all $$k\ge 1,$$ where $$q=\frac{a_1+\sqrt{a_1^2+4a_2}}{2}$$ and $$\delta =q-a_1\ge 0$$. Moreover,

\begin{aligned} q \ge a_1 + a_2, \end{aligned}
(43)

with equality if and only if $$a_2=0$$ (in which case $$q=a_1$$ and $$\delta =0$$).

### Proof

Choose $$\delta = \frac{-a_1+\sqrt{a_1^2+4a_2}}{2}$$. We claim $$\delta \ge 0$$ and $$a_2 \le (a_1+\delta )\delta$$. Indeed, non-negativity of $$\delta$$ follows from $$a_2\ge 0$$, while the second relation follows from the fact that $$\delta$$ satisfies

\begin{aligned} (a_1+\delta )\delta - a_2 = 0. \end{aligned}
(44)

In view of these two relations, adding $$\delta F_k$$ to both sides of (42), we get

\begin{aligned} F_{k+1} + \delta F_k \le (a_1+\delta )F_k + a_2 F_{k-1} \le (a_1+\delta )(F_k + \delta F_{k-1}) = q(F_k+\delta F_{k-1}). \end{aligned}
(45)

Let us now argue that $$0<q<1$$. Non-negativity of q follows from non-negativity of $$a_2$$. Clearly, as long as $$a_2>0$$, q is positive. If $$a_2=0$$, then $$a_1>0$$ by assumption, which implies that q is positive. The inequality $$q<1$$ follows directly from the assumption $$a_1+a_2<1$$. By unrolling the recurrence (45), we obtain $$F_{k+1} \le F_{k+1} + \delta F_k \le q^k (F_1+ \delta F_0) = q^{k}(1+\delta ) F_{0}.$$

Finally, let us establish (43). Noting that $$a_1 = q-\delta$$, and since in view of (44) we have $$a_2=q\delta$$, we conclude that $$a_1+a_2 = q + \delta (q-1) \le q$$, where the inequality follows from $$q <1$$. $$\square$$
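Lemma 9 can be sanity-checked numerically by iterating the recurrence (42) with equality (the worst case) and comparing against the claimed geometric bound; the coefficient values below are illustrative:

```python
import numpy as np

# Numerical check of Lemma 9: F_{k+1} <= a1 F_k + a2 F_{k-1} implies
# F_{k+1} <= q^k (1 + delta) F_0 with q = (a1 + sqrt(a1^2 + 4 a2)) / 2.
a1, a2 = 0.7, 0.2            # a2 >= 0, a1 + a2 < 1
q = (a1 + np.sqrt(a1**2 + 4 * a2)) / 2
delta = q - a1

F = [1.0, 1.0]               # F_1 = F_0 = 1
for k in range(1, 50):
    F.append(a1 * F[k] + a2 * F[k - 1])   # worst case: equality in (42)
for k in range(1, 50):
    assert F[k + 1] <= q**k * (1 + delta) * F[0] + 1e-12
assert q >= a1 + a2          # inequality (43)
```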

The following identities were established in prior work. For completeness, we include different (and somewhat simpler) proofs here.

### Lemma 10

For all $$x \in \mathbb {R}^n$$ we have

\begin{aligned} f_{\mathbf {S}}(x) = \frac{1}{2}\Vert \nabla f_{\mathbf {S}}(x)\Vert ^2_{\mathbf {B}}. \end{aligned}
(46)

Moreover, if $$x_*\in \mathcal{L}$$ (i.e., if $$x_*$$ satisfies $$\mathbf{A}x_* =b$$), then for all $$x\in \mathbb {R}^n$$ we have

\begin{aligned} f_{\mathbf {S}}(x) = \frac{1}{2}\langle \nabla f_{\mathbf {S}}(x),x-x_* \rangle _{\mathbf {B}}, \end{aligned}
(47)

and

\begin{aligned} f(x) = \frac{1}{2}\langle \nabla f(x),x-x_* \rangle _{\mathbf {B}}. \end{aligned}
(48)

### Proof

In view of (10), and since $$\mathbf{Z}\mathbf{B}^{-1} \mathbf{Z}= \mathbf{Z}$$, we have

\begin{aligned} \Vert \nabla f_{\mathbf {S}}(x)\Vert ^2_{\mathbf {B}}&\overset{(10)}{=} \Vert \mathbf {B}^{-1} \mathbf{Z}(x-x_*)\Vert ^2_{\mathbf {B}} \\&= (x-x_*)^\top \mathbf{Z}\mathbf {B}^{-1} \mathbf {Z}(x-x_*) = (x-x_*)^\top \mathbf{Z}(x-x_*) \\&\overset{(7)}{=} (x-x_*)^\top \mathbf{A}^\top \mathbf{H}\mathbf{A}(x-x_*) = (\mathbf{A}x-b)^\top \mathbf{H}(\mathbf{A}x- b ) \quad \overset{(6)}{=}\quad 2f_{\mathbf {S}}(x). \end{aligned}

Moreover,

\begin{aligned} \langle \nabla f_{\mathbf {S}}(x),x-x_* \rangle _{\mathbf {B}}&\overset{(10)}{=} \langle \mathbf {B}^{-1} \mathbf {Z}(x-x_*),x-x_* \rangle _{\mathbf {B}}\\&= (x-x_*)^\top \mathbf {Z}\mathbf {B}^{-1} \mathbf {B}(x-x_*) \quad = \quad 2f_{\mathbf {S}}(x). \end{aligned}

By taking expectations in the last identity with respect to the random matrix $$\mathbf {S}$$, we get $$\langle \nabla f(x),x-x_* \rangle _{\mathbf {B}}=2f(x).$$ $$\square$$
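Identities (46) and (47) can be verified numerically in the randomized Kaczmarz special case ($$\mathbf{B}=\mathbf{I}$$, $$\mathbf{S}=e_i$$), where $$f_{\mathbf{S}}$$ and its gradient have explicit formulas; the instance below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

# Check of (46)-(47) for B = I and S = e_i (randomized Kaczmarz), where
# f_S(x) = (a^T x - b_i)^2 / (2 ||a||^2) and
# grad f_S(x) = (a^T x - b_i) a / ||a||^2.
m, n = 15, 6
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star                   # consistent system, x_star in L

x = rng.standard_normal(n)
for i in range(m):
    a = A[i]
    r = a @ x - b[i]
    f_S = r**2 / (2 * a @ a)
    g = r / (a @ a) * a          # grad f_S(x)
    assert np.isclose(f_S, 0.5 * np.dot(g, g))            # identity (46)
    assert np.isclose(f_S, 0.5 * np.dot(g, x - x_star))   # identity (47)
```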

### Lemma 11

For all $$x \in \mathbb {R}^n$$ and $$x_* \in \mathcal{L}$$

\begin{aligned} \lambda _{\min }^+ f(x) \le \frac{1}{2} \Vert \nabla f(x) \Vert ^2_{\mathbf {B}} \le \lambda _{\max } f(x) \end{aligned}
(49)

and

\begin{aligned} f(x) \le \frac{\lambda _{\max }}{2} \Vert x-x_*\Vert ^2_{\mathbf {B}}. \end{aligned}
(50)

Moreover, if exactness is satisfied, and we let $$x_* =\varPi ^{\mathbf {B}}_{\mathcal{L}}(x)$$, we have

\begin{aligned} \frac{\lambda _{\min }^+}{2} \Vert x-x_*\Vert ^2_{\mathbf {B}} \le f(x) . \end{aligned}
(51)

Finally, let us present a simple lemma stating an identity that we use in our main proofs. This preliminary result is known to hold for the case of Euclidean norms ($$\mathbf {B}=\mathbf {I}$$). We provide the proof for the more general $$\mathbf {B}$$-norm for completeness.

### Lemma 12

Let $$a, b, c$$ be arbitrary vectors in $$\mathbb {R}^n$$ and let $$\mathbf {B}$$ be a positive definite matrix. Then the following identity holds: $$2 \langle a-c,c-b \rangle _{\mathbf {B}}=\Vert a-b\Vert ^2_{\mathbf {B}}-\Vert c-b\Vert ^2_{\mathbf {B}}-\Vert a-c\Vert ^2_{\mathbf {B}}.$$

### Proof

\begin{aligned} LHS= & {} 2 \langle a-c,c-b \rangle _{\mathbf {B}} = 2(a-c)^\top \mathbf {B}(c-b)\\= & {} 2a^\top \mathbf {B}c-2a^\top \mathbf {B}b-2c^\top \mathbf {B}c+2c^\top \mathbf {B}b \end{aligned}

and

\begin{aligned} RHS= & {} \Vert a-b\Vert ^2_{\mathbf {B}}-\Vert c-b\Vert ^2_{\mathbf {B}}-\Vert a-c\Vert ^2_{\mathbf {B}}\\= & {} (a-b)^\top \mathbf {B}(a-b)- (c-b)^\top \mathbf {B}(c-b)-(a-c)^\top \mathbf {B}(a-c)\\= & {} a^\top \mathbf {B}a-a^\top \mathbf {B}b-b^\top \mathbf {B}a+b^\top \mathbf {B}b-c^\top \mathbf {B}c+c^\top \mathbf {B}b+b^\top \mathbf {B}c-b^\top \mathbf {B}b\\&- a^\top \mathbf {B}a+a^\top \mathbf {B}c+c^\top \mathbf {B}a-c^\top \mathbf {B}c \\= & {} 2a^\top \mathbf {B}c-2a^\top \mathbf {B}b-2c^\top \mathbf {B}c+2c^\top \mathbf {B}b \end{aligned}

Thus the LHS (left-hand side) equals the RHS (right-hand side), which completes the proof. $$\square$$
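As a quick sanity check, the three-point identity of Lemma 12 can be verified numerically for random vectors and a random positive definite $$\mathbf{B}$$:

```python
import numpy as np

rng = np.random.default_rng(6)

# Check of Lemma 12's three-point identity in a random B-inner product.
n = 7
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)       # positive definite B
a, b, c = rng.standard_normal((3, n))

lhs = 2 * (a - c) @ B @ (c - b)
rhs = ((a - b) @ B @ (a - b)
       - (c - b) @ B @ (c - b)
       - (a - c) @ B @ (a - c))
assert np.isclose(lhs, rhs)
```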

### Appendix 2: Proof of Theorem 1

First, in view of the update rule (22), we decompose

\begin{aligned} \Vert x_{k+1}-x_*\Vert ^2_{\mathbf {B}}&= \Vert x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*\Vert ^2_{\mathbf {B}} \\&\quad +2\beta \langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, x_k-x_{k-1} \rangle _{\mathbf {B}} +\beta ^2\Vert x_k-x_{k-1}\Vert ^2_{\mathbf {B}}. \end{aligned}
(52)

We will now analyze the three expressions separately. Using identities (46) and (47), the first expression can be written as

\begin{aligned} \Vert x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*\Vert ^2_{\mathbf {B}}&= \Vert x_k-x_*\Vert ^2_{\mathbf {B}} -2\omega \langle \nabla f_{\mathbf {S}_k}(x_k), x_k-x_* \rangle _{\mathbf {B}} +\omega ^2 \Vert \nabla f_{\mathbf {S}_k}(x_k)\Vert ^2_{\mathbf {B}} \\&= \Vert x_k-x_*\Vert ^2_{\mathbf {B}} -2\omega (2-\omega ) f_{\mathbf {S}_k}(x_k). \end{aligned}
(53)

We will now bound the second expression. First, we have

\begin{aligned} 2\beta \langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, x_k-x_{k-1} \rangle _{\mathbf {B}}&= 2\beta \langle x_k-x_*, x_k-x_{k-1} \rangle _{\mathbf {B}} \\&\quad + 2\omega \beta \langle \nabla f_{\mathbf {S}_k}(x_k), x_{k-1}-x_k \rangle _{\mathbf {B}}. \end{aligned}
(54)

Using the identity from Lemma 12 for the vectors $$x_k, x_*$$ and $$x_{k-1}$$ we obtain:

\begin{aligned} 2 \langle x_k-x_*, x_*-x_{k-1} \rangle _{\mathbf {B}}= \Vert x_k-x_{k-1}\Vert ^2_{\mathbf {B}}- \Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}}-\Vert x_k-x_*\Vert ^2_{\mathbf {B}}. \end{aligned}

Substituting this into (54) gives

\begin{aligned} 2\beta \langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, x_k-x_{k-1} \rangle _{\mathbf {B}}&= \beta \Vert x_k-x_*\Vert ^2_{\mathbf {B}} +\beta \Vert x_k-x_{k-1}\Vert ^2_{\mathbf {B}} -\beta \Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}} \\&\quad + 2\omega \beta \langle \nabla f_{\mathbf {S}_k}(x_k), x_{k-1}-x_k \rangle _{\mathbf {B}}. \end{aligned}
(55)

The third expression can be bounded as

\begin{aligned} \beta ^2\Vert x_k-x_{k-1}\Vert ^2_{\mathbf {B}} \le 2\beta ^2\Vert x_{k}-x_*\Vert ^2_{\mathbf {B}}+2\beta ^2\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}}, \end{aligned}
(56)

where we used the estimate $$\Vert a+b\Vert ^2_{\mathbf {B}} \le 2\Vert a\Vert ^2_{\mathbf {B}}+2\Vert b\Vert ^2_{\mathbf {B}}$$ with $$a=x_k-x_*$$ and $$b=x_*-x_{k-1}$$.

By substituting the bounds (53), (55), (56) into (52) we obtain

\begin{aligned}&\Vert x_{k+1}-x_*\Vert ^2_{\mathbf {B}}\\&\quad \le \Vert x_k-x_*\Vert ^2_{\mathbf {B}}-2\omega (2-\omega )f_{\mathbf {S}_k}(x_k)\\&\qquad + \beta \Vert x_k-x_*\Vert ^2_{\mathbf {B}}+\beta \Vert x_{k}-x_{k-1}\Vert ^2_{\mathbf {B}}-\beta \Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}} \\&\qquad + 2\omega \beta \langle \nabla f_{\mathbf {S}_k}(x_k),x_{k-1}- x_k \rangle _{\mathbf {B}} + 2\beta ^2\Vert x_{k}-x_*\Vert ^2_{\mathbf {B}}+2\beta ^2\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}}\\&\quad \le (1+3\beta + 2\beta ^2)\Vert x_k-x_*\Vert ^2_{\mathbf {B}}+ (\beta + 2\beta ^2 )\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}}-2\omega (2-\omega )f_{\mathbf {S}_k}(x_k)\\&\qquad + 2\omega \beta \langle \nabla f_{\mathbf {S}_k}(x_k),x_{k-1}- x_k \rangle _{\mathbf {B}}. \end{aligned}

Now by first taking expectation with respect to $$\mathbf{S}_k$$, we obtain:

\begin{aligned} \mathbb {E}_{\mathbf{S}_k}[\Vert x_{k+1}-x_*\Vert ^2_{\mathbf {B}}]\le & {} (1+3\beta +2\beta ^2)\Vert x_k-x_*\Vert ^2_{\mathbf {B}}+ (\beta +2\beta ^2)\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}} \\&-2\omega (2-\omega )f(x_k) + 2\omega \beta \langle \nabla f(x_k),x_{k-1}- x_k \rangle _{\mathbf {B}}\\\le & {} (1+3\beta +2\beta ^2)\Vert x_k-x_*\Vert ^2_{\mathbf {B}}+ (\beta +2\beta ^2)\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}} \\&-2\omega (2-\omega )f(x_k) + 2\omega \beta (f(x_{k-1})-f(x_k))\\= & {} (1+3\beta +2\beta ^2)\Vert x_k-x_*\Vert ^2_{\mathbf {B}}+ (\beta +2\beta ^2)\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}} \\&- (2\omega (2-\omega ) +2\omega \beta )f(x_k) + 2\omega \beta f(x_{k-1}). \end{aligned}

where in the second step we used the convexity inequality $$\langle \nabla f(x_k),x_{k-1}- x_k \rangle _{\mathbf {B}} \le f(x_{k-1})-f(x_k)$$ and the fact that $$\omega \beta \ge 0$$, which follows from the assumptions. We now apply inequalities (50) and (51), obtaining

\begin{aligned} \mathbb {E}_{\mathbf{S}_k}[\Vert x_{k+1}-x_*\Vert ^2_{\mathbf {B}}]\le & {} \underbrace{(1+3\beta +2\beta ^2 - (\omega (2-\omega ) +\omega \beta )\lambda _{\min }^+)}_{a_1}\Vert x_k-x_*\Vert ^2_{\mathbf {B}} \\&\quad + \underbrace{(\beta +2\beta ^2 + \omega \beta \lambda _{\max })}_{a_2}\Vert x_{k-1}-x_*\Vert ^2_{\mathbf {B}}. \end{aligned}

By taking expectation again, and letting $$F_k{:}{=}\mathbb {E}[\Vert x_{k}-x_*\Vert ^2_{\mathbf {B}}]$$, we get the relation

\begin{aligned} F_{k+1} \le a_1 F_k + a_2 F_{k-1} . \end{aligned}
(57)

It suffices to apply Lemma 9 to the relation (57). The conditions of the lemma are satisfied. Indeed, $$a_2\ge 0$$, and if $$a_2=0$$, then $$\beta =0$$ and hence $$a_1=1-\omega (2-\omega )\lambda _{\min }^+>0$$. The condition $$a_1+a_2<1$$ holds by assumption.

The convergence result in function values, $$\mathbb {E}[f(x_k)]$$, follows as a corollary by applying inequality (50) to (23).
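The two-term recursion above can also be observed empirically. Below is a minimal numerical sketch of SGD with heavy ball momentum on a consistent linear system, specialized to uniform single-row sketches with $$\mathbf {B}=\mathbf {I}$$ (the randomized Kaczmarz case); the dimensions and the parameters omega and beta are illustrative choices, not values derived from the conditions of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 20
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star  # consistent system, as assumed throughout the paper

omega, beta = 1.0, 0.1  # illustrative parameters (not certified by the theorem)
x_prev = x = np.zeros(n)
errs = []
for k in range(20000):
    i = rng.integers(m)  # uniform single-row sketch
    a_i = A[i]
    grad = (a_i @ x - b[i]) / (a_i @ a_i) * a_i  # nabla f_{S_k}(x) for this sketch
    x, x_prev = x - omega * grad + beta * (x - x_prev), x
    errs.append(np.linalg.norm(x - x_star) ** 2)

print(errs[0], errs[-1])  # squared error decays toward numerical noise
```

On this well-conditioned random instance the squared error drops by many orders of magnitude; the recursion $$F_{k+1} \le a_1 F_k + a_2 F_{k-1}$$ governs its expected decay.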

### Appendix 3: Proof of Theorem 3

Let $$p_t=\frac{\beta }{1-\beta }(x_t-x_{t-1})$$ and $$d_t = \Vert x_t + p_t -x_*\Vert _{\mathbf{B}}^2$$. In view of (22), we can write

\begin{aligned} x_{t+1}+p_{t+1}&= x_{t+1}+\frac{\beta }{1-\beta }(x_{t+1}-x_{t}) \nonumber \\&\overset{(22)}{=} x_{t}-\omega \nabla f_{\mathbf {S}_t}(x_t)+\beta (x_t-x_{t-1})\nonumber \\&\qquad +\frac{\beta }{1-\beta }\left( -\omega \nabla f_{\mathbf {S}_t}(x_t)+\beta (x_t-x_{t-1})\right) \nonumber \\&= x_{t}-[\omega +\frac{\beta }{1-\beta }\omega ] \nabla f_{\mathbf {S}_t}(x_t)+[\beta +\frac{\beta ^2}{1-\beta }](x_t-x_{t-1})\nonumber \\&= x_{t}-\frac{\omega }{1-\beta }\nabla f_{\mathbf {S}_t}(x_t)+\frac{\beta }{1-\beta }(x_t-x_{t-1})\nonumber \\&= x_t+p_t-\frac{\omega }{1-\beta } \nabla f_{\mathbf {S}_t}(x_t) \end{aligned}
(58)

and therefore

\begin{aligned} d_{t+1}&\overset{(58)}{=} \left\| x_t+p_t-\frac{\omega }{1-\beta } \nabla f_{\mathbf {S}_t}(x_t) -x_* \right\| ^2_{\mathbf {B}} \\&= d_t -2 \frac{\omega }{1-\beta } \langle x_t+p_t-x_*, \nabla f_{\mathbf {S}_t}(x_t) \rangle _{\mathbf {B}} + \frac{\omega ^2}{(1-\beta )^2} \Vert \nabla f_{\mathbf {S}_t}(x_t)\Vert ^2_{\mathbf {B}}\\&= d_t -\frac{2\omega }{1-\beta } \langle x_t-x_*, \nabla f_{\mathbf {S}_t}(x_t) \rangle _{\mathbf {B}} - \frac{2 \omega \beta }{(1-\beta )^2} \langle x_t-x_{t-1}, \nabla f_{\mathbf {S}_t}(x_t) \rangle _{\mathbf {B}}\\&\quad + \frac{\omega ^2}{(1-\beta )^2} \Vert \nabla f_{\mathbf {S}_t}(x_t)\Vert ^2_{\mathbf {B}}. \end{aligned}

Taking expectation with respect to the random matrix $$\mathbf {S}_t$$ we obtain:

\begin{aligned} \mathbb {E}_{\mathbf{S}_t}[d_{t+1}]&= d_t -\frac{2\omega }{1-\beta } \langle x_t-x_*, \nabla f(x_t) \rangle _{\mathbf {B}} - \frac{2 \omega \beta }{(1-\beta )^2} \langle x_t-x_{t-1}, \nabla f(x_t) \rangle _{\mathbf {B}} \\&\quad + \frac{\omega ^2}{(1-\beta )^2} 2 f(x_t) \\&\overset{(48)}{=} d_t -\frac{4\omega }{1-\beta } f(x_t) - \frac{2 \omega \beta }{(1-\beta )^2} \langle x_t-x_{t-1}, \nabla f(x_t) \rangle _{\mathbf {B}} + \frac{\omega ^2}{(1-\beta )^2} 2 f(x_t)\\&\le d_t -\frac{4\omega }{1-\beta } f(x_t) - \frac{2 \omega \beta }{(1-\beta )^2} [f(x_t)-f(x_{t-1})] + \frac{\omega ^2}{(1-\beta )^2} 2 f(x_t)\\&= d_t + \left[ -\frac{4\omega }{1-\beta } - \frac{2 \omega \beta }{(1-\beta )^2} +\frac{2 \omega ^2}{(1-\beta )^2}\right] f(x_t) + \frac{2 \omega \beta }{(1-\beta )^2} f(x_{t-1}), \end{aligned}

where the inequality follows from convexity of f. After rearranging the terms we get

\begin{aligned} \mathbb {E}_{\mathbf{S}_t}[d_{t+1}] + \frac{2 \omega \beta }{(1-\beta )^2} f(x_t) + \alpha f(x_t) \le d_t + \frac{2 \omega \beta }{(1-\beta )^2} f(x_{t-1}), \end{aligned}

where $$\alpha = \frac{4\omega }{1-\beta } -\frac{2 \omega ^2}{(1-\beta )^2} > 0$$. Taking expectations again and using the tower property, we get

\begin{aligned} \theta _{t+1} + \alpha \mathbb {E}[f(x_t)] \le \theta _t, \qquad t=1,2,\dots , \end{aligned}
(59)

where $$\theta _t = \mathbb {E}[d_t] + \frac{2 \omega \beta }{(1-\beta )^2}\mathbb {E}[ f(x_{t-1})]$$. By summing up (59) for $$t=1,\dots , k$$ we get

\begin{aligned} \sum _{t=1}^k \mathbb {E}[f(x_t)] \le \frac{\theta _1-\theta _{k+1}}{\alpha } \le \frac{\theta _1}{\alpha }. \end{aligned}
(60)

Finally, using Jensen’s inequality, we get

\begin{aligned} \mathbb {E}[f(\hat{x}_k)] = \mathbb {E}\left[ f\left( \frac{1}{k}\sum _{t=1}^k x_t\right) \right] \le \mathbb {E}\left[ \frac{1}{k}\sum _{t=1}^k f(x_t)\right] = \frac{1}{k}\sum _{t=1}^k \mathbb {E}[f(x_t)] \overset{(60)}{\le } \frac{\theta _1}{\alpha k}. \end{aligned}

It remains to note that $$\theta _1 = \Vert x_0-x_*\Vert _{\mathbf{B}}^2 + \frac{2\omega \beta }{(1-\beta )^2 }f(x_0).$$
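To illustrate the $$\mathcal {O}(1/k)$$ guarantee for Cesàro averages, the sketch below tracks $$f(\hat{x}_k)$$ along the same randomized Kaczmarz specialization ($$\mathbf {B}=\mathbf {I}$$, uniform single-row sketches); here f is the corresponding stochastic reformulation objective, and the parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 40, 15
A = rng.standard_normal((m, n))
b = A @ rng.standard_normal(n)  # consistent system
row_norms = np.einsum('ij,ij->i', A, A)

def f(x):
    # f(x) = E[f_S(x)] for uniform single-row sketches with B = I
    return 0.5 * np.mean((A @ x - b) ** 2 / row_norms)

omega, beta = 1.0, 0.2  # illustrative parameters
x_prev = x = np.zeros(n)
running_sum = np.zeros(n)
f_hat = []
for k in range(1, 4001):
    i = rng.integers(m)
    a_i = A[i]
    grad = (a_i @ x - b[i]) / row_norms[i] * a_i
    x, x_prev = x - omega * grad + beta * (x - x_prev), x
    running_sum += x
    f_hat.append(f(running_sum / k))  # f at the Cesaro average hat{x}_k

print(f_hat[9], f_hat[-1])  # the averaged objective keeps shrinking
```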

### Appendix 4: Proof of Theorem 4

In the proof of Theorem 4 the following two lemmas are used.

### Lemma 13

Assume exactness. Let $$x\in \mathbb {R}^n$$ and $$x_* = \varPi _\mathcal {L}^\mathbf {B}(x)$$. If $$\lambda _i=0$$, then $$u_i^\top \mathbf {B}^{1/2} (x-x_*)=0$$.

### Lemma 14

([18, 21]) Consider the second degree linear homogeneous recurrence relation:

\begin{aligned} r_{k+1}= a_1r_k+a_2 r_{k-1} \end{aligned}
(61)

with initial conditions $$r_0,r_1 \in \mathbb {R}$$. Assume that the constant coefficients $$a_1$$ and $$a_2$$ satisfy the inequality $$a_1^2 +4a_2<0$$ (the roots of the characteristic equation $$t^2-a_1t-a_2=0$$ are imaginary). Then there are complex constants $$C_0$$ and $$C_1$$ (depending on the initial conditions $$r_0$$ and $$r_1$$) such that:

\begin{aligned} r_k=2 M^k (C_0 \cos ( \theta k) + C_1 \sin (\theta k)) \end{aligned}

where $$M= \sqrt{\frac{a_1^2}{4}+\frac{-a_1^2-4a_2}{4}} =\sqrt{-a_2}$$ and $$\theta$$ is such that $$a_1=2 M \cos (\theta )$$ and $$\sqrt{-a_1^2-4a_2}=2 M \sin (\theta )$$.
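The closed form in Lemma 14 is easy to verify numerically against direct iteration of (61). The coefficients and initial conditions below are arbitrary values satisfying $$a_1^2+4a_2<0$$; the constants $$C_0, C_1$$ are solved from $$r_0, r_1$$ as in the lemma.

```python
import math

a1, a2 = 1.2, -0.5  # a1^2 + 4*a2 = -0.56 < 0, so the characteristic roots are imaginary
M = math.sqrt(-a2)
theta = math.acos(a1 / (2 * M))  # a1 = 2 M cos(theta)

r0, r1 = 1.0, 0.7
# Solve r_k = 2 M^k (C0 cos(k theta) + C1 sin(k theta)) for C0, C1 at k = 0, 1
C0 = r0 / 2.0
C1 = (r1 / (2 * M) - C0 * math.cos(theta)) / math.sin(theta)
closed = lambda k: 2 * M ** k * (C0 * math.cos(k * theta) + C1 * math.sin(k * theta))

# Direct iteration of r_{k+1} = a1 r_k + a2 r_{k-1}
rs = [r0, r1]
for _ in range(28):
    rs.append(a1 * rs[-1] + a2 * rs[-2])

max_err = max(abs(rs[k] - closed(k)) for k in range(30))
print(max_err)  # agreement up to floating point roundoff
```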

We can now turn to the proof of Theorem 4. Plugging in the expression for the stochastic gradient, mSGD can be written in the form

\begin{aligned} x_{k+1}= & {} x_k -\omega \nabla f_{\mathbf {S}_k}(x_k) + \beta (x_k - x_{k-1}) \nonumber \\&\overset{(10)}{=}&x_k- \omega {\mathbf {B}}^{-1} \mathbf {Z}_k(x_k-x_*) + \beta (x_k - x_{k-1}). \end{aligned}
(62)

Subtracting $$x_*$$ from both sides of (62), we get

\begin{aligned} x_{k+1}-x_*= & {} (\mathbf {I}- \omega {\mathbf {B}}^{-1} \mathbf {Z}_k)(x_k-x_*) + \beta (x_k -x_* +x_* - x_{k-1})\\= & {} \left( (1+\beta )\mathbf {I}- \omega {\mathbf {B}}^{-1} \mathbf {Z}_k\right) (x_k-x_*) - \beta (x_{k-1}-x_*). \end{aligned}

Multiplying the last identity from the left by $$\mathbf {B}^{1/2}$$, we get

\begin{aligned} \mathbf {B}^{1/2} (x_{k+1}-x_*)= & {} \left( (1+\beta )\mathbf {I}- \omega \mathbf {B}^{-1/2} \mathbf {Z}_k \mathbf {B}^{-1/2}\right) \mathbf {B}^{1/2}(x_{k} -x_*) \\&- \beta \mathbf {B}^{1/2}(x_{k-1}-x_*). \end{aligned}

Taking expectations, conditioned on $$x_k$$ (that is, the expectation is with respect to $$\mathbf {S}_k$$):

\begin{aligned} \mathbf {B}^{1/2} \mathbb {E}[x_{k+1} -x_* \;|\; x_k]= & {} \left( (1+\beta )\mathbf {I}- \omega \mathbf {B}^{-1/2} \mathbb {E}[\mathbf {Z}] \mathbf {B}^{-1/2}\right) \mathbf {B}^{1/2}(x_{k} -x_*) \\&- \beta \mathbf {B}^{1/2}(x_{k-1}-x_*) . \end{aligned}

Taking expectations again, and using the tower property, we get

\begin{aligned} \mathbf {B}^{1/2} \mathbb {E}[x_{k+1} -x_*]= & {} \mathbf {B}^{1/2}\mathbb {E}\left[ \mathbb {E}[x_{k+1} -x_* \;|\; x_k]\right] \\= & {} \left( (1+\beta )\mathbf {I}- \omega \mathbf {B}^{-1/2} \mathbb {E}[\mathbf {Z}] \mathbf {B}^{-1/2} \right) \mathbf {B}^{1/2} \mathbb {E}[x_{k} -x_*] \\&- \beta \mathbf {B}^{1/2} \mathbb {E}[x_{k-1}-x_*]. \end{aligned}

Plugging the eigenvalue decomposition $${\mathbf {U}}\varvec{\varLambda } {{\mathbf {U}}}^\top$$ of the matrix $$\mathbf {W}=\mathbf {B}^{-1/2} \mathbb {E}[\mathbf {Z}] \mathbf {B}^{-1/2}$$ into the above, and multiplying both sides from the left by $${{\mathbf {U}}}^\top$$, we obtain

\begin{aligned} {{\mathbf {U}}}^\top \mathbf {B}^{1/2} \mathbb {E}[x_{k+1} -x_*]&= {{\mathbf {U}}}^\top \left( (1+\beta )\mathbf {I}- \omega {\mathbf {U}}\varvec{\varLambda } {{\mathbf {U}}}^\top \right) \mathbf {B}^{1/2} \mathbb {E}[x_{k} -x_*]\nonumber \\&\quad \, - \beta {{\mathbf {U}}}^\top \mathbf {B}^{1/2} \mathbb {E}[x_{k-1}-x_*]. \end{aligned}
(63)

Let us define $$s_k{:}{=}{{\mathbf {U}}}^\top \mathbf {B}^{1/2} \mathbb {E}[x_{k} -x_*] \in \mathbb {R}^n$$. Then relation (63) takes the form of the recursion

\begin{aligned} s_{k+1}= [(1+\beta )\mathbf{I}- \omega \varvec{\varLambda } ] s_k - \beta s_{k-1}, \end{aligned}

which can be written in a coordinate-by-coordinate form as follows:

\begin{aligned} s_{k+1}^i= [(1+\beta ) - \omega \lambda _i ] s_k^i - \beta s_{k-1}^i \quad \text {for all} \quad i= 1,2,3,\ldots ,n, \end{aligned}
(64)

where $$s_k^i$$ indicates the ith coordinate of $$s_k$$.

We will now fix i and analyze recursion (64) using Lemma 14. Note that (64) is a second degree linear homogeneous recurrence relation of the form (61) with $$a_1=1+\beta - \omega \lambda _i$$ and $$a_2=- \beta$$. Recall that $$0\le \lambda _i \le 1$$ for all i. Since we assume that $$0< \omega \le 1/\lambda _{\max }$$, we know that $$0\le \omega \lambda _i \le 1$$ for all i. We now consider two cases:

1.

$$\lambda _i =0$$.

In this case, (64) takes the form:

\begin{aligned} s_{k+1}^i=(1+\beta )s_k^i-\beta s_{k-1}^i. \end{aligned}
(65)

Applying Proposition 2, we know that $$x_*=\varPi _\mathcal {L}^\mathbf {B}(x_0)=\varPi _\mathcal {L}^\mathbf {B}(x_1)$$. Using Lemma 13 twice, once for $$x=x_0$$ and then for $$x=x_1$$, we observe that $$s_0^i=u_i^\top \mathbf {B}^{1/2} (x_0-x_*)=0$$ and $$s_1^i=u_i^\top \mathbf {B}^{1/2} (x_1-x_*)=0$$. Finally, in view of (65) we conclude that

\begin{aligned} s_k^i=0 \quad \text {for all} \quad k\ge 0 . \end{aligned}
(66)
2.

$$\lambda _i >0$$.

Since $$0<\omega \lambda _i \le 1$$ and $$\beta \ge 0$$, we have $$1+\beta - \omega \lambda _i \ge 0$$ and hence

\begin{aligned} a_1^2+4a_2=(1+\beta -\omega \lambda _i)^2-4\beta \le (1+\beta -\omega \lambda _{\min }^+)^2-4\beta < 0, \end{aligned}

where the last inequality can be shown to hold for $$\left( 1-\sqrt{\omega \lambda _{\min }^+}\right) ^2< \beta < 1$$. Applying Lemma 14, the following bound can be deduced:

\begin{aligned} |s_k^i|= & {} 2(-a_2)^{k/2} |C_0 \cos (\theta k) +C_1 \sin (\theta k)| \; \le \; 2 \beta ^{k/2} P_i, \end{aligned}
(67)

where $$P_i$$ is a constant depending on the initial conditions (we can simply choose $$P_i = |C_0| + |C_1|$$).

Now putting the two cases together, for all $$k\ge 0$$ we have

\begin{aligned} \Vert \mathbb {E}[x_{k} -x_*]\Vert _{\mathbf {B}}^2&= \mathbb {E}[x_{k} -x_*]^\top \mathbf {B}\mathbb {E}[x_{k} -x_*] \; = \; \mathbb {E}[x_{k} -x_*]^\top \mathbf {B}^{1/2} \mathbf {U}{\mathbf {U}}^\top \mathbf {B}^{1/2} \mathbb {E}[x_{k} -x_*] \\&= \Vert {\mathbf {U}}^\top \mathbf {B}^{1/2} \mathbb {E}[x_{k} -x_*] \Vert _2^2 = \Vert s_k\Vert ^2 = \sum _{i=1}^{n} (s_k^i)^2 \\&= \sum _{i: \lambda _i=0} (s_k^i)^2 + \sum _{i: \lambda _i>0} (s_k^i)^2 \; \overset{(66)}{=}\; \sum _{i: \lambda _i>0} (s_k^i)^2\\&\overset{(67)}{\le } \sum _{i: \lambda _i >0} 4 \beta ^k P_i^2 \\&= \beta ^k C, \end{aligned}

where $$C=4\sum _{i: \lambda _i >0} P_i^2$$.
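The $$\beta ^{k/2}$$ envelope in case 2 can be seen directly on the scalar recursion (64). The values of omega, lambda and beta below are illustrative, with beta chosen inside the interval $$\left( (1-\sqrt{\omega \lambda })^2, 1\right)$$ so that the characteristic roots are complex.

```python
import math

omega, lam = 1.0, 0.25  # illustrative stepsize and eigenvalue
beta = (1 - math.sqrt(omega * lam)) ** 2 + 0.05  # = 0.30, inside ((1-sqrt(omega*lam))^2, 1)

a1, a2 = 1 + beta - omega * lam, -beta
assert a1 ** 2 + 4 * a2 < 0  # complex-root regime of Lemma 14

# Iterate s_{k+1} = (1 + beta - omega*lam) s_k - beta s_{k-1}
s_prev, s = 1.0, 1.0
vals = [abs(s_prev), abs(s)]
for _ in range(198):
    s, s_prev = a1 * s + a2 * s_prev, s
    vals.append(abs(s))

# |s_k| <= 2 beta^{k/2} (|C0| + |C1|), so this ratio stays bounded while |s_k| -> 0
ratios = [vals[k] / beta ** (k / 2) for k in range(len(vals))]
print(max(ratios), vals[-1])
```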

### Appendix 5: Proof of Theorem 7

The proof follows a similar pattern to that of Theorem 1. However, stochasticity in the momentum term introduces an additional layer of complexity, which we shall tackle by utilizing a more involved version of the tower property.

For simplicity, let $$i=i_k$$ and $$r_{k}^i {:}{=}\left( e_i^\top (x_k-x_{k-1})\right) e_i$$. First, we decompose

\begin{aligned} \Vert x_{k+1}-x_*\Vert ^2= & {} \Vert x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)+\gamma r_k^i -x_*\Vert ^2 \nonumber \\= & {} \Vert x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*\Vert ^2+2\langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)\nonumber \\&-x_*, \gamma r_k^i \rangle + \gamma ^2\Vert r_k^i\Vert ^2. \end{aligned}
(68)

We shall use the tower property in the form

\begin{aligned} \mathbb {E}[\mathbb {E}[\mathbb {E}[ X \;|\; x_k, \mathbf{S}_k] \;|\; x_k]] = \mathbb {E}[X], \end{aligned}
(69)

where X is some random variable. We shall perform the three expectations in order, from the innermost to the outermost. Applying the inner expectation to the identity (68), we get

\begin{aligned} \mathbb {E}[\Vert x_{k+1}-x_*\Vert ^2 \;|\; x_k, \mathbf{S}_k]&= \Vert x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*\Vert ^2 \nonumber \\&\quad + 2\gamma \langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, \mathbb {E}[r_k^i \;|\; x_k, \mathbf{S}_k] \rangle \nonumber \\&\quad + \gamma ^2\, \mathbb {E}[\Vert r_k^i\Vert ^2 \;|\; x_k, \mathbf{S}_k]. \end{aligned}
(70)

We will now analyze the three expressions separately. The first expression is constant under the expectation, and hence we can write

\begin{aligned} \mathbb {E}[\Vert x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*\Vert ^2 \;|\; x_k, \mathbf{S}_k] \le \Vert x_k-x_*\Vert ^2-2\omega (2-\omega )f_{\mathbf {S}_k}(x_k). \end{aligned}
(71)

We will now bound the second expression. Using the identity

\begin{aligned} \mathbb {E}[r_k^i \;|\; x_k, \mathbf{S}_k] = \mathbb {E}_i [r_k^i] = \sum _{i=1}^n \frac{1}{n}r_k^i = \frac{1}{n}(x_k-x_{k-1}), \end{aligned}
(72)

we can write

\begin{aligned} 2\, \mathbb {E}[\langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, \gamma r_k^i \rangle \;|\; x_k, \mathbf{S}_k]&= \tfrac{2\gamma }{n} \langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, x_k-x_{k-1} \rangle \nonumber \\&= \tfrac{2\gamma }{n} \langle x_k-x_*, x_k-x_{k-1} \rangle \nonumber \\&\quad + \tfrac{2\omega \gamma }{n} \langle \nabla f_{\mathbf {S}_k}(x_k), x_{k-1}-x_k \rangle . \end{aligned}
(73)

Using the fact that for arbitrary vectors $$a,b,c \in \mathbb {R}^n$$ we have the identity $$2 \langle a-c,c-b \rangle =\Vert a-b\Vert ^2-\Vert c-b\Vert ^2-\Vert a-c\Vert ^2,$$ we obtain

\begin{aligned} 2 \langle x_k-x_*, x_*-x_{k-1} \rangle = \Vert x_k-x_{k-1}\Vert ^2- \Vert x_{k-1}-x_*\Vert ^2-\Vert x_k-x_*\Vert ^2. \end{aligned}

Substituting this into (73) gives

\begin{aligned} 2\, \mathbb {E}[\langle x_k-\omega \nabla f_{\mathbf {S}_k}(x_k)-x_*, \gamma r_k^i \rangle \;|\; x_k, \mathbf{S}_k]&= \tfrac{\gamma }{n} \Vert x_k-x_*\Vert ^2+ \tfrac{\gamma }{n} \Vert x_{k}-x_{k-1}\Vert ^2 - \tfrac{\gamma }{n} \Vert x_{k-1}-x_*\Vert ^2 \nonumber \\&\quad + \tfrac{2\omega \gamma }{n} \langle \nabla f_{\mathbf {S}_k}(x_k), x_{k-1}-x_k \rangle . \end{aligned}
(74)

The third expression can be bounded as

\begin{aligned} \gamma ^2\, \mathbb {E}[\Vert r_k^i\Vert ^2 \;|\; x_k, \mathbf{S}_k] = \tfrac{\gamma ^2}{n}\Vert x_k-x_{k-1}\Vert ^2 \le \tfrac{2\gamma ^2}{n}\Vert x_{k}-x_*\Vert ^2 + \tfrac{2\gamma ^2}{n}\Vert x_{k-1}-x_*\Vert ^2. \end{aligned}
(75)

By substituting the bounds (71), (74), (75) into (70) we obtain

\begin{aligned} \mathbb {E}[\Vert x_{k+1}-x_*\Vert ^2 \;|\; x_k, \mathbf{S}_k ]&\le \Vert x_k-x_*\Vert ^2-2\omega (2-\omega ) f_{\mathbf {S}_k}(x_k)\nonumber \\&\quad + \tfrac{\gamma }{n} \Vert x_k-x_*\Vert ^2+ \tfrac{\gamma }{n} \Vert x_{k}-x_{k-1}\Vert ^2 -\tfrac{\gamma }{n} \Vert x_{k-1}-x_*\Vert ^2 \nonumber \\&\quad + 2\omega \tfrac{\gamma }{n} \langle \nabla f_{\mathbf {S}_k}(x_k), x_{k-1}- x_k \rangle + 2\tfrac{\gamma ^2}{n} \Vert x_{k}-x_*\Vert ^2 \nonumber \\&\quad + 2\tfrac{\gamma ^2}{n}\Vert x_{k-1}-x_*\Vert ^2 \nonumber \\&\overset{(56)}{\le } \left( 1+3\tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n}\right) \Vert x_k-x_*\Vert ^2+ \left( \tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n} \right) \Vert x_{k-1}-x_*\Vert ^2 \nonumber \\&\quad - 2\omega (2-\omega )f_{\mathbf {S}_k}(x_k) + 2\omega \tfrac{\gamma }{n} \langle \nabla f_{\mathbf {S}_k}(x_k),x_{k-1}- x_k \rangle . \end{aligned}
(76)

We now take the middle expectation (see (69)) and apply it to inequality (76):

\begin{aligned}&\mathbb {E}[\mathbb {E}[\Vert x_{k+1}-x_*\Vert ^2 \;|\; x_k, \mathbf{S}_k ] \;|\; x_k] \\&\quad \le \left( 1+3\tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n}\right) \Vert x_k-x_*\Vert ^2+ \left( \tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n} \right) \Vert x_{k-1}-x_*\Vert ^2 \\&\qquad -2\omega (2-\omega )f(x_k) + 2\omega \tfrac{\gamma }{n} \langle \nabla f(x_k),x_{k-1}- x_k \rangle \\&\quad \le \left( 1+3\tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n}\right) \Vert x_k-x_*\Vert ^2+ \left( \tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n} \right) \Vert x_{k-1}-x_*\Vert ^2 \\&\qquad -2\omega (2-\omega )f(x_k) + 2\omega \tfrac{\gamma }{n}(f(x_{k-1})-f(x_k))\\&\quad = \left( 1+3\tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n}\right) \Vert x_k-x_*\Vert ^2+ \left( \tfrac{\gamma }{n} + 2\tfrac{\gamma ^2}{n} \right) \Vert x_{k-1}-x_*\Vert ^2 \\&\qquad - \left( 2\omega (2-\omega ) +2\omega \tfrac{\gamma }{n}\right) f(x_k) + 2\omega \tfrac{\gamma }{n} f(x_{k-1}). \end{aligned}

In the second step we used the inequality $$\langle \nabla f(x_k),x_{k-1}- x_k \rangle \le f(x_{k-1})-f(x_k)$$ (convexity of f) and the fact that $$\omega \gamma \ge 0$$, which follows from the assumptions. We now apply inequalities (50) and (51), obtaining

\begin{aligned}&\mathbb {E}[\mathbb {E}[\Vert x_{k+1}-x_*\Vert ^2 \;|\; x_k, \mathbf{S}_k ] \;|\; x_k]\\&\quad \le \underbrace{\left( 1+3\tfrac{\gamma }{n}+2\tfrac{\gamma ^2}{n} - \left( \omega (2-\omega ) +\omega \tfrac{\gamma }{n}\right) \lambda _{\min }^+ \right) }_{a_1}\Vert x_k-x_*\Vert ^2 \\&\qquad + \underbrace{\tfrac{1}{n}\left( \gamma +2\gamma ^2 + \omega \gamma \lambda _{\max }\right) }_{a_2}\Vert x_{k-1}-x_*\Vert ^2. \end{aligned}

By taking expectation again (outermost expectation in the tower rule (69)), and letting $$F_k{:}{=}\mathbb {E}[\Vert x_{k}-x_*\Vert ^2]$$, we get the relation

\begin{aligned} F_{k+1} \le a_1 F_k + a_2 F_{k-1} . \end{aligned}
(77)

It suffices to apply Lemma 9 to the relation (77). The conditions of the lemma are satisfied. Indeed, $$a_2\ge 0$$, and if $$a_2=0$$, then $$\gamma =0$$ and hence $$a_1=1-\omega (2-\omega )\lambda _{\min }^+>0$$. The condition $$a_1+a_2<1$$ holds by assumption.

The convergence result in function values follows as a corollary by applying inequality (50) to (32).
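For completeness, here is a minimal numerical sketch of SGD with stochastic momentum in the setting of this proof ($$\mathbf {B}=\mathbf {I}$$, uniform single-row sketches, i.e., randomized Kaczmarz), where the momentum correction touches a single uniformly chosen coordinate per iteration. The parameters omega and gamma are illustrative, not values certified by Theorem 7.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 50, 20
A = rng.standard_normal((m, n))
x_star = rng.standard_normal(n)
b = A @ x_star  # consistent system

omega, gamma = 1.0, 0.2  # illustrative parameters
x_prev = x = np.zeros(n)
for k in range(20000):
    i = rng.integers(m)  # row sketch
    j = rng.integers(n)  # coordinate i_k for the stochastic momentum step
    a_i = A[i]
    grad = (a_i @ x - b[i]) / (a_i @ a_i) * a_i
    x_new = x - omega * grad
    x_new[j] += gamma * (x[j] - x_prev[j])  # stochastic momentum: single-coordinate correction
    x, x_prev = x_new, x

err = float(np.linalg.norm(x - x_star) ** 2)
print(err)  # linear convergence, as with deterministic momentum
```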

### Appendix 6: Notation glossary

For the frequently used notation, see Table 8.
