# Analysis of biased stochastic gradient descent using sequential semidefinite programs

## Abstract

We present a convergence rate analysis for biased stochastic gradient descent (SGD), where individual gradient updates are corrupted by computation errors. We develop stochastic quadratic constraints to formulate a small linear matrix inequality (LMI) whose feasible points lead to convergence bounds of biased SGD. Based on this LMI condition, we develop a sequential minimization approach to analyze the intricate trade-offs that couple stepsize selection, convergence rate, optimization accuracy, and robustness to gradient inaccuracy. We also provide feasible points for this LMI and obtain theoretical formulas that quantify the convergence properties of biased SGD under various assumptions on the loss functions.

This is a preview of subscription content, log in to check access.

## Notes

1. 1.

When $$\delta =c=0$$, this rate bound does not reduce to $$\rho ^2=1-2m\alpha +O(\alpha ^2)$$. This is due to the inherent differences between the analyses of biased SGD and the standard SGD. See Remark 4 for a detailed explanation.

2. 2.

This case is a variant of the common assumption $$\frac{1}{n}\sum _{i=1}^n\left\| \nabla f_i(x)\right\| ^2 \le \beta$$. One can check that this case holds for several $$\ell _2$$-regularized problems including SVM and logistic regression.

3. 3.

The loss functions for SVM are non-smooth, and $$u_k$$ is actually updated using the subgradient information. For simplicity, we abuse our notation and use $$\nabla f_i$$ to denote the subgradient of $$f_i$$ for SVM problems.

4. 4.

Ensuring such a condition in practice can be challenging for many cases since it heavily relies on the estimations of problem parameters.

5. 5.

When $$M_{21}=0$$, this condition always holds. When $$\delta =0$$, this condition is equivalent to $$M_{21}\alpha \le 1$$. Hence the above corollary can be directly applied if $$M_{21}=0$$ or $$\delta =0$$. If $$M_{21}> 0$$ and $$\delta >0$$, the condition $$M_{21}\left( \alpha +\frac{\delta ^2}{m}\right) \le 1$$ can be rewritten as a condition on $$\alpha$$ in a case-by-case manner.

## References

1. 1.

Agarwal, A., Bartlett, P.L., Ravikumar, P., Wainwright, M.J.: Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory 58(5), 3235–3249 (2012)

2. 2.

Arora, S., Ge, R., Ma, T., Moitra, A.: Simple, efficient, and neural algorithms for sparse coding. In: Conference on Learning Theory, pp. 113–149 (2015)

3. 3.

Bertsekas, D.: Nonlinear Programming, 2nd edn. Athena scientific, Belmont (2002)

4. 4.

Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186 (2010)

5. 5.

Bottou, L., Curtis, F., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

6. 6.

Bottou, L., LeCun, Y.: Large scale online learning. Adv. Neural Inf. Process. Syst. 16, 217 (2004)

7. 7.

Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends® Mach. Learn. 8(3–4), 231–357 (2015)

8. 8.

Chen, Y., Candes, E.: Solving random quadratic systems of equations is nearly as easy as solving linear systems. In: Advances in Neural Information Processing Systems, pp. 739–747 (2015)

9. 9.

d’Aspremont, A.: Smooth optimization with approximate gradient. SIAM J. Optim. 19(3), 1171–1183 (2008)

10. 10.

De Klerk, E., Glineur, F., Taylor, A.: On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optim. Lett. 11(7), 1185–1199 (2017)

11. 11.

Defazio, A., Bach, F., Lacoste-Julien, S.: Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems (2014)

12. 12.

Defazio, A., Domke, J., Caetano, T.: Finito: A faster, permutable incremental gradient method for big data problems. In: Proceedings of the 31st International Conference on Machine Learning, pp. 1125–1133 (2014)

13. 13.

Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)

14. 14.

Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014)

15. 15.

Feyzmahdavian, H., Aytekin, A., Johansson, M.: A delayed proximal gradient method with linear convergence rate. In: 2014 IEEE International Workshop on Machine Learning for Signal Processing, pp. 1–6 (2014)

16. 16.

Grant, M., Boyd, S.: Graph implementations for nonsmooth convex programs. In: Blondel, V., Boyd, S., Kimura, H. (eds.) Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences, pp. 95–110. Springer (2008). http://stanford.edu/~boyd/graph_dcp.html

17. 17.

Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx (2014)

18. 18.

Hu, B., Seiler, P., Rantzer, A.: A unified analysis of stochastic optimization methods using jump system theory and quadratic constraints. In: Proceedings of the 2017 Conference on Learning Theory, vol. 65, pp. 1157–1189 (2017)

19. 19.

Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)

20. 20.

Lee, J.C., Valiant, P.: Optimizing star-convex functions. In: 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 603–614 (2016)

21. 21.

Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)

22. 22.

Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems, pp. 451–459 (2011)

23. 23.

Nedić, A., Bertsekas, D.: Convergence rate of incremental subgradient algorithms. In: Stochastic Optimization: Algorithms and Applications, pp. 223–264 (2001)

24. 24.

Needell, D., Ward, R., Srebro, N.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In: Advances in Neural Information Processing Systems, pp. 1017–1025 (2014)

25. 25.

Nishihara, R., Lessard, L., Recht, B., Packard, A., Jordan, M.: A general analysis of the convergence of ADMM. In: Proceedings of the 32nd International Conference on Machine Learning, pp. 343–352 (2015)

26. 26.

Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

27. 27.

Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In: Advances in Neural Information Processing Systems (2012)

28. 28.

Schmidt, M., Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)

29. 29.

Schmidt, M., Roux, N.L., Bach, F.R.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems, pp. 1458–1466 (2011)

30. 30.

Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res. 14(1), 567–599 (2013)

31. 31.

Sun, R., Luo, Z.Q.: Guaranteed matrix completion via non-convex factorization. IEEE Trans. Inf. Theory 62(11), 6535–6579 (2016)

32. 32.

Taylor, A., Bach, F.: Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. In: Proceedings of the 2019 Conference on Learning Theory, pp. 2934–2992 (2019)

33. 33.

Taylor, A., Hendrickx, J., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)

34. 34.

Taylor, A., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 27(3), 1283–1313 (2017)

35. 35.

Taylor, A., Van Scoy, B., Lessard, L.: Lyapunov functions for first-order methods: Tight automated convergence guarantees. In: Proceedings of the 35th International Conference on Machine Learning, pp. 4897–4906 (2018)

36. 36.

Teo, C., Smola, A., Vishwanathan, S., Le, Q.: A scalable modular convex solver for regularized risk minimization. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 727–736 (2007)

## Author information

Authors

### Corresponding author

Correspondence to Bin Hu.

### Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work is supported by the NSF Awards 1656951, 1750162, 1254129, and the NASA Langley NRA Cooperative Agreement NNX12AM55A. Bin Hu and Laurent Lessard also acknowledge support from the Wisconsin Institute for Discovery, the College of Engineering, and the Department of Electrical and Computer Engineering at the University of Wisconsin–Madison.

## Appendices

### Proof of Theorem 1

First notice that since $$i_k$$ is uniformly distributed on $$\{1,\dots ,n\}$$ and $$x_k$$ and $$i_k$$ are independent, we have:

\begin{aligned} \,{\mathbb {E}} \bigl ( u_k \,\big |\, x_k \bigr ) = \,{\mathbb {E}} \bigl ( \nabla f_{i_k}(x_k) \,\big |\, x_k \bigr ) = \frac{1}{n}\sum _{i=1}^n \nabla f_i(x_k) = \nabla g(x_k) \end{aligned}

Consequently, we have:

\begin{aligned} \,{\mathbb {E}} \left( \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}^{\mathsf {T}}\! \begin{bmatrix} -2mI_p &{} I_p \\ I_p &{} 0_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}\bigg |\ x_k\right) = \begin{bmatrix} x_k-x_\star \\ \nabla g(x_k) \end{bmatrix}^{\mathsf {T}}\! \begin{bmatrix} -2m I_p &{} I_p\\ I_p &{} 0_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ \nabla g(x_k) \end{bmatrix} \ge 0 \end{aligned}
(54)

where the inequality in (54) follows from the definition of $$g \in {\mathcal {S}}(m,\infty )$$.

Next we prove (12), let’s start with Case I, the boundedness constraint $$\Vert \nabla f_i(x_k)\Vert \le \beta$$ implies that $$\left\| u_k\right\| \le \beta$$ for all k. Rewrite as a quadratic form to obtain:

\begin{aligned} \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}^{\mathsf {T}}\begin{bmatrix} 0_p &{}\quad 0_p \\ 0_p &{}\quad -I_p \end{bmatrix} \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix} \ge -\beta ^2 \end{aligned}
(55)

The boundedness constraint $$\Vert \nabla f_i(x_k)-mx_k\Vert \le \beta$$ implies that:

\begin{aligned} \left\| u_k-m(x_k-x_\star )\right\| ^2&\le \left\| (u_k-mx_k)+mx_\star \right\| ^2 + \left\| (u_k-mx_k)-mx_\star \right\| ^2\\&=2\left\| u_k-mx_k\right\| ^2 + 2m^2\left\| x_\star \right\| ^2 \\&\le 2\beta ^2 + 2m^2\left\| x_\star \right\| ^2 \end{aligned}

As in the proof of Case I, rewrite the above inequality as a quadratic form and we obtain the second row of Table 1.

To prove the three remaining cases, we begin by showing that an inequality of the following form holds for each $$f_i$$:

\begin{aligned} \begin{bmatrix} x_k-x_\star \\ \nabla f_i(x_k)-\nabla f_i(x_\star ) \end{bmatrix}^{\mathsf {T}}\begin{bmatrix} M_{11}I_p &{}\quad M_{12}I_p \\ M_{21}I_p &{}\quad -2I_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ \nabla f_i(x_k)-\nabla f_i(x_\star ) \end{bmatrix} \ge 0 \end{aligned}
(56)

The verification for (56) follows directly from the definitions of L-smoothness and convexity. In the smooth case (Definition 1), for example, $$\Vert \nabla f_i(x_k)-\nabla f_i(x_\star )\Vert \le L\Vert x_k-x_\star \Vert$$. So (56) holds with $$M_{11}=2L^2$$, $$M_{12}=M_{21}=0$$. The cases for $${\mathcal {F}}(0,L)$$ and $${\mathcal {F}}(m,L)$$ follow directly from Definition 2. In Table 1, we always have $$M_{22}=-1$$. Therefore,

\begin{aligned}&\,{\mathbb {E}} \left( \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}^{\mathsf {T}}\begin{bmatrix} M_{11}I_p &{} M_{12}I_p \\ M_{21}I_p &{} M_{22}I_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}\,\,\bigg |\ \, x_k\right) \nonumber \\&\quad =\frac{1}{n} \sum _{i=1}^n \begin{bmatrix} x_k-x_\star \\ \nabla f_i(x_k)\end{bmatrix}^{\mathsf {T}}\begin{bmatrix} M_{11}I_p &{} M_{12}I_p \\ M_{21}I_p &{} 0_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ \nabla f_i(x_k)\end{bmatrix}- \frac{1}{n} \sum _{i=1}^n \Vert \nabla f_i(x_k)\Vert ^2 \end{aligned}
(57)

Since $$\frac{1}{n}\sum _{i=1}^n \nabla f_i(x_\star ) = \nabla g(x_\star ) = 0$$, the first term on the right side of (57) is equal to

\begin{aligned} \frac{1}{n} \sum _{i=1}^n \begin{bmatrix} x_k-x_\star \\ \nabla f_i(x_k)-\nabla f_i(x_\star )\end{bmatrix} \begin{bmatrix} M_{11}I_p &{} M_{12}I_p \\ M_{21}I_p &{} 0_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ \nabla f_i(x_k)-\nabla f_i(x_\star ) \end{bmatrix} \end{aligned}

Based on the constraint condition (56), we know that the above term is greater than or equal to $$\frac{2}{n} \sum _{i=1}^n \Vert \nabla f_i(x_k)-\nabla f_i(x_\star ) \Vert ^2$$. Substituting this fact back into (57) leads to the inequality:

\begin{aligned}&\,{\mathbb {E}} \left( \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}^{\mathsf {T}}\begin{bmatrix} M_{11}I_p &{} M_{12}I_p \\ M_{21}I_p &{} M_{22}I_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}\,\,\bigg |\ \, x_k\right) \nonumber \\&\quad \ge \frac{1}{n} \sum _{i=1}^n \left( 2\Vert \nabla f_i(x_k)-\nabla f_i(x_\star ) \Vert ^2- \Vert \nabla f_i(x_k)\Vert ^2\right) \nonumber \\&\quad = \frac{1}{n} \sum _{i=1}^n \left( \Vert \nabla f_i(x_k)-2\nabla f_i(x_\star ) \Vert ^2- 2\Vert \nabla f_i(x_\star )\Vert ^2\right) \nonumber \\&\quad \ge -\frac{2}{n} \sum _{i=1}^n \Vert \nabla f_i(x_\star )\Vert ^2 \end{aligned}
(58)

Taking the expectation of both sides, we arrive at (12), as desired. Now we are ready to prove our main theorem. By Schur complement, (10) is equivalent to (15), which can be further rewritten as

\begin{aligned} \left( \begin{bmatrix} 1-\rho _k^2 &{}\quad -\alpha _k &{}\quad -\alpha _k\\ alpha_k &{}\quad \alpha _k^2 &{}\quad \alpha _k^2 \\ alpha_k &{}\quad \alpha _k^2 &{}\quad \alpha _k^2 \end{bmatrix} +\nu _k\!\begin{bmatrix} -2m &{}\quad 1 &{}\quad 0\\ 1 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 0 \end{bmatrix} +\lambda _k\!\begin{bmatrix} M_{11} &{}\quad M_{12} &{}\quad 0\\ M_{21} &{}\quad M_{22} &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 0 \end{bmatrix} +\mu _k\!\begin{bmatrix} 0 &{}\quad 0 &{}\quad 0\\ 0 &{}\quad \delta ^2 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad -1 \end{bmatrix}\right) \otimes I_p\preceq 0 \end{aligned}
(59)

Since $$x_{k+1}-x_\star =x_k-x_\star -\alpha _k(u_k+e_k)$$, we have

\begin{aligned} \begin{bmatrix} x_k-x_\star \\ u_k \\ e_k \end{bmatrix}^{\mathsf {T}}\left( \begin{bmatrix} 1 &{}\quad -\alpha _k &{}\quad -\alpha _k\\ alpha_k &{}\quad \alpha _k^2 &{}\quad \alpha _k^2 \\ -\alpha _k &{}\quad \alpha _k^2 &{}\quad \alpha _k^2 \end{bmatrix}\otimes I_p\right) \begin{bmatrix} x_k-x_\star \\ u_k \\ e_k \end{bmatrix}=\Vert x_{k+1}-x_\star \Vert ^2 \end{aligned}
(60)

Now we can left and right multiply (59) by $$[(x_k-x_\star )^{\mathsf {T}}, u_k^{\mathsf {T}}, e_k^{\mathsf {T}}]$$ and $$[(x_k-x_\star )^{\mathsf {T}}, u_k^{\mathsf {T}}, e_k^{\mathsf {T}}]^{\mathsf {T}}$$, and apply the inequalities (4), (54), and (12) to get the desired conclusion. $$\square$$

### Proof of Lemma 2

We use an induction argument to prove Item 1. For simplicity, we denote (48) as $$V_{k+1}=h(V_k)$$. Suppose we have $$V_k=h(V_{k-1})$$ and $$V_{k-1}>V_\star$$. We are going to show $$V_{k+1}=h(V_k)$$ and $$V_k>V_\star$$. We can rewrite (48) as

\begin{aligned} V_{k+1}=\,\, \mathop {{{\,\mathrm{minimize}\,}}}\limits _{\zeta > 0} \quad A_k(1+Z_k^{-1}) + B_k(1+Z_k) \end{aligned}
(61)

where $$A_k$$, $$B_k$$, and $$Z_k$$ are defined as

\begin{aligned} A_k&=\frac{m^2 V_k^2 \left( c^2+(2G^2+{\tilde{M}} V_k)\delta ^2\right) }{(2G^2+{\tilde{M}} V_k)^2}\\ B_k&=\frac{(2G^2 V_k+({\tilde{M}}-m^2) V_k^2)((2G^2+{\tilde{M}} V_k)(1-\delta ^2)-c^2)}{(2G^2+{\tilde{M}} V_k)^2}\\ Z_k&=\frac{\bigl (2 G^2+{\tilde{M}} V_k\bigr )(\delta ^2+\zeta _k )+c^2}{(2G^2+{\tilde{M}} V_k)(1-\delta ^2)-c^2} \end{aligned}

Note that $$A_k \ge 0$$ and $$B_k \ge 0$$ due to the condition $$2G^2(1-\delta ^2) \ge c^2$$. The objective in (61) therefore has a form very similar to the objective in (26). Applying Lemma 1, we deduce that $$V_{k+1} =(\sqrt{A_k}+\sqrt{B_k})^2$$, which is the same as (49). The associated $$Z_k^opt$$ is $$\sqrt{\tfrac{A_k}{B_k}}$$. To ensure this is a feasible choice, it remains to check that the associated $$\zeta _k^opt > 0$$ as well. Via algebraic manipulations, one can show that $$\zeta _k > 0$$ is equivalent to $$V_k > V_\star$$. We can also verify $$A_k$$ is a monotonically increasing function of $$V_k$$, and $$B_k$$ is a monotonically nondecreasing function of $$V_k$$. Hence h is a monotonically increasing function. Also notice $$V_\star$$ is a fixed point of (49). Therefore, if we assume $$V_k=h(V_{k-1})$$ and $$V_{k-1}>V_\star$$, we have $$V_k=h(V_{k-1})>h(V_\star )=V_\star$$. Hence we guarantee $$\zeta _k>0$$ and $$V_{k+1}=h(V_k)$$. By similar arguments, one can verify $$V_1=h(V_0)$$. And it is assumed that $$V_0>V_\star$$. This completes the induction argument.

Item 2 follows from a similar argument to the one used in Sect. 2.2. Finally, Item 3 can be proven by choosing a sufficiently small constant stepsize $$\alpha$$ to make $${\hat{U}}_k$$ arbitrarily close to $$V_\star$$. Since $$V_\star \le V_k \le {\hat{U}}_k$$, we conclude that $$\lim _{k\rightarrow \infty } V_k = V_\star$$, as required. $$\square$$

## Rights and permissions

Reprints and Permissions

Hu, B., Seiler, P. & Lessard, L. Analysis of biased stochastic gradient descent using sequential semidefinite programs. Math. Program. (2020). https://doi.org/10.1007/s10107-020-01486-1

• Accepted:

• Published: