Analysis of biased stochastic gradient descent using sequential semidefinite programs

Hu, Bin; Seiler, Peter; Lessard, Laurent

doi:10.1007/s10107-020-01486-1

Analysis of biased stochastic gradient descent using sequential semidefinite programs

Full Length Paper
Series A
Published: 23 March 2020

Volume 187, pages 383–408, (2021)
Cite this article

Mathematical Programming Submit manuscript

1911 Accesses
14 Citations
Explore all metrics

Abstract

We present a convergence rate analysis for biased stochastic gradient descent (SGD), where individual gradient updates are corrupted by computation errors. We develop stochastic quadratic constraints to formulate a small linear matrix inequality (LMI) whose feasible points lead to convergence bounds of biased SGD. Based on this LMI condition, we develop a sequential minimization approach to analyze the intricate trade-offs that couple stepsize selection, convergence rate, optimization accuracy, and robustness to gradient inaccuracy. We also provide feasible points for this LMI and obtain theoretical formulas that quantify the convergence properties of biased SGD under various assumptions on the loss functions.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Variance Controlled Stochastic Method with Biased Estimation for Faster Non-convex Optimization

Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization

Article 27 September 2023

Stochastic Gradient Method with Barzilai–Borwein Step for Unconstrained Nonlinear Optimization

Article 01 January 2021

Notes

When $\delta =c=0$, this rate bound does not reduce to $\rho ^2=1-2m\alpha +O(\alpha ^2)$. This is due to the inherent differences between the analyses of biased SGD and the standard SGD. See Remark 4 for a detailed explanation.
This case is a variant of the common assumption $\frac{1}{n}\sum _{i=1}^n\left\| \nabla f_i(x)\right\| ^2 \le \beta $. One can check that this case holds for several $\ell _2$-regularized problems including SVM and logistic regression.
The loss functions for SVM are non-smooth, and $u_k$ is actually updated using the subgradient information. For simplicity, we abuse our notation and use $\nabla f_i$ to denote the subgradient of $f_i$ for SVM problems.
Ensuring such a condition in practice can be challenging for many cases since it heavily relies on the estimations of problem parameters.
When $M_{21}=0$, this condition always holds. When $\delta =0$, this condition is equivalent to $M_{21}\alpha \le 1$. Hence the above corollary can be directly applied if $M_{21}=0$ or $\delta =0$. If $M_{21}> 0$ and $\delta >0$, the condition $M_{21}\left( \alpha +\frac{\delta ^2}{m}\right) \le 1$ can be rewritten as a condition on $\alpha $ in a case-by-case manner.

References

Agarwal, A., Bartlett, P.L., Ravikumar, P., Wainwright, M.J.: Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization. IEEE Trans. Inf. Theory 58(5), 3235–3249 (2012)
Article MathSciNet Google Scholar
Arora, S., Ge, R., Ma, T., Moitra, A.: Simple, efficient, and neural algorithms for sparse coding. In: Conference on Learning Theory, pp. 113–149 (2015)
Bertsekas, D.: Nonlinear Programming, 2nd edn. Athena scientific, Belmont (2002)
Google Scholar
Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010, pp. 177–186 (2010)
Bottou, L., Curtis, F., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
Article MathSciNet Google Scholar
Bottou, L., LeCun, Y.: Large scale online learning. Adv. Neural Inf. Process. Syst. 16, 217 (2004)
Google Scholar
Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends® Mach. Learn. 8(3–4), 231–357 (2015)
Article Google Scholar
Chen, Y., Candes, E.: Solving random quadratic systems of equations is nearly as easy as solving linear systems. In: Advances in Neural Information Processing Systems, pp. 739–747 (2015)
d’Aspremont, A.: Smooth optimization with approximate gradient. SIAM J. Optim. 19(3), 1171–1183 (2008)
Article MathSciNet Google Scholar
De Klerk, E., Glineur, F., Taylor, A.: On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optim. Lett. 11(7), 1185–1199 (2017)
Article MathSciNet Google Scholar
Defazio, A., Bach, F., Lacoste-Julien, S.: Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In: Advances in Neural Information Processing Systems (2014)
Defazio, A., Domke, J., Caetano, T.: Finito: A faster, permutable incremental gradient method for big data problems. In: Proceedings of the 31st International Conference on Machine Learning, pp. 1125–1133 (2014)
Devolder, O., Glineur, F., Nesterov, Y.: First-order methods of smooth convex optimization with inexact oracle. Math. Program. 146(1–2), 37–75 (2014)
Article MathSciNet Google Scholar
Drori, Y., Teboulle, M.: Performance of first-order methods for smooth convex minimization: a novel approach. Math. Program. 145(1–2), 451–482 (2014)
Article MathSciNet Google Scholar
Feyzmahdavian, H., Aytekin, A., Johansson, M.: A delayed proximal gradient method with linear convergence rate. In: 2014 IEEE International Workshop on Machine Learning for Signal Processing, pp. 1–6 (2014)
Grant, M., Boyd, S.: Graph implementations for nonsmooth convex programs. In: Blondel, V., Boyd, S., Kimura, H. (eds.) Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences, pp. 95–110. Springer (2008). http://stanford.edu/~boyd/graph_dcp.html
Grant, M., Boyd, S.: CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx (2014)
Hu, B., Seiler, P., Rantzer, A.: A unified analysis of stochastic optimization methods using jump system theory and quadratic constraints. In: Proceedings of the 2017 Conference on Learning Theory, vol. 65, pp. 1157–1189 (2017)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
Lee, J.C., Valiant, P.: Optimizing star-convex functions. In: 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pp. 603–614 (2016)
Lessard, L., Recht, B., Packard, A.: Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 26(1), 57–95 (2016)
Article MathSciNet Google Scholar
Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems, pp. 451–459 (2011)
Nedić, A., Bertsekas, D.: Convergence rate of incremental subgradient algorithms. In: Stochastic Optimization: Algorithms and Applications, pp. 223–264 (2001)
Needell, D., Ward, R., Srebro, N.: Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. In: Advances in Neural Information Processing Systems, pp. 1017–1025 (2014)
Nishihara, R., Lessard, L., Recht, B., Packard, A., Jordan, M.: A general analysis of the convergence of ADMM. In: Proceedings of the 32nd International Conference on Machine Learning, pp. 343–352 (2015)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Article MathSciNet Google Scholar
Roux, N., Schmidt, M., Bach, F.: A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In: Advances in Neural Information Processing Systems (2012)
Schmidt, M., Roux, N., Bach, F.: Minimizing finite sums with the stochastic average gradient. Math. Program. 162(1–2), 83–112 (2017)
Article MathSciNet Google Scholar
Schmidt, M., Roux, N.L., Bach, F.R.: Convergence rates of inexact proximal-gradient methods for convex optimization. In: Advances in Neural Information Processing Systems, pp. 1458–1466 (2011)
Shalev-Shwartz, S., Zhang, T.: Stochastic dual coordinate ascent methods for regularized loss. J. Mach. Learn. Res. 14(1), 567–599 (2013)
MathSciNet MATH Google Scholar
Sun, R., Luo, Z.Q.: Guaranteed matrix completion via non-convex factorization. IEEE Trans. Inf. Theory 62(11), 6535–6579 (2016)
Article MathSciNet Google Scholar
Taylor, A., Bach, F.: Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions. In: Proceedings of the 2019 Conference on Learning Theory, pp. 2934–2992 (2019)
Taylor, A., Hendrickx, J., Glineur, F.: Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Math. Program. 161(1–2), 307–345 (2017)
Article MathSciNet Google Scholar
Taylor, A., Hendrickx, J.M., Glineur, F.: Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 27(3), 1283–1313 (2017)
Article MathSciNet Google Scholar
Taylor, A., Van Scoy, B., Lessard, L.: Lyapunov functions for first-order methods: Tight automated convergence guarantees. In: Proceedings of the 35th International Conference on Machine Learning, pp. 4897–4906 (2018)
Teo, C., Smola, A., Vishwanathan, S., Le, Q.: A scalable modular convex solver for regularized risk minimization. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 727–736 (2007)

Download references

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, University of Illinois at Urbana–Champaign, 306 N Wright St., Urbana, IL, 61801, USA
Bin Hu
Department of Electrical Engineering and Computer Science, University of Michigan, 1301 Beal Avenue, Ann Arbor, MI, 48109, USA
Peter Seiler
Department of Electrical and Computer Engineering, Wisconsin Institute for Discovery, University of Wisconsin–Madison, 330 N. Orchard St., Madison, WI, 53715, USA
Laurent Lessard

Authors

Bin Hu
View author publications
You can also search for this author in PubMed Google Scholar
Peter Seiler
View author publications
You can also search for this author in PubMed Google Scholar
Laurent Lessard
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Bin Hu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work is supported by the NSF Awards 1656951, 1750162, 1254129, and the NASA Langley NRA Cooperative Agreement NNX12AM55A. Bin Hu and Laurent Lessard also acknowledge support from the Wisconsin Institute for Discovery, the College of Engineering, and the Department of Electrical and Computer Engineering at the University of Wisconsin–Madison.

Appendices

Appendix

Proof of Theorem 1

First notice that since $i_k$ is uniformly distributed on $\{1,\dots ,n\}$ and $x_k$ and $i_k$ are independent, we have:

$$\begin{aligned} \,{\mathbb {E}} \bigl ( u_k \,\big |\, x_k \bigr ) = \,{\mathbb {E}} \bigl ( \nabla f_{i_k}(x_k) \,\big |\, x_k \bigr ) = \frac{1}{n}\sum _{i=1}^n \nabla f_i(x_k) = \nabla g(x_k) \end{aligned}$$

Consequently, we have:

$$\begin{aligned} \,{\mathbb {E}} \left( \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}^{\mathsf {T}}\! \begin{bmatrix} -2mI_p &{} I_p \\ I_p &{} 0_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}\bigg |\ x_k\right) = \begin{bmatrix} x_k-x_\star \\ \nabla g(x_k) \end{bmatrix}^{\mathsf {T}}\! \begin{bmatrix} -2m I_p &{} I_p\\ I_p &{} 0_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ \nabla g(x_k) \end{bmatrix} \ge 0 \end{aligned}$$

(54)

where the inequality in (54) follows from the definition of $g \in {\mathcal {S}}(m,\infty )$.

Next we prove (12), let’s start with Case I, the boundedness constraint $\Vert \nabla f_i(x_k)\Vert \le \beta $ implies that $\left\| u_k\right\| \le \beta $ for all k. Rewrite as a quadratic form to obtain:

$$\begin{aligned} \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}^{\mathsf {T}}\begin{bmatrix} 0_p &{}\quad 0_p \\ 0_p &{}\quad -I_p \end{bmatrix} \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix} \ge -\beta ^2 \end{aligned}$$

(55)

The boundedness constraint $\Vert \nabla f_i(x_k)-mx_k\Vert \le \beta $ implies that:

$$\begin{aligned} \left\| u_k-m(x_k-x_\star )\right\| ^2&\le \left\| (u_k-mx_k)+mx_\star \right\| ^2 + \left\| (u_k-mx_k)-mx_\star \right\| ^2\\&=2\left\| u_k-mx_k\right\| ^2 + 2m^2\left\| x_\star \right\| ^2 \\&\le 2\beta ^2 + 2m^2\left\| x_\star \right\| ^2 \end{aligned}$$

As in the proof of Case I, rewrite the above inequality as a quadratic form and we obtain the second row of Table 1.

To prove the three remaining cases, we begin by showing that an inequality of the following form holds for each $f_i$:

$$\begin{aligned} \begin{bmatrix} x_k-x_\star \\ \nabla f_i(x_k)-\nabla f_i(x_\star ) \end{bmatrix}^{\mathsf {T}}\begin{bmatrix} M_{11}I_p &{}\quad M_{12}I_p \\ M_{21}I_p &{}\quad -2I_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ \nabla f_i(x_k)-\nabla f_i(x_\star ) \end{bmatrix} \ge 0 \end{aligned}$$

(56)

The verification for (56) follows directly from the definitions of L-smoothness and convexity. In the smooth case (Definition 1), for example, $\Vert \nabla f_i(x_k)-\nabla f_i(x_\star )\Vert \le L\Vert x_k-x_\star \Vert $. So (56) holds with $M_{11}=2L^2$, $M_{12}=M_{21}=0$. The cases for ${\mathcal {F}}(0,L)$ and ${\mathcal {F}}(m,L)$ follow directly from Definition 2. In Table 1, we always have $M_{22}=-1$. Therefore,

$$\begin{aligned}&\,{\mathbb {E}} \left( \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}^{\mathsf {T}}\begin{bmatrix} M_{11}I_p &{} M_{12}I_p \\ M_{21}I_p &{} M_{22}I_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}\,\,\bigg |\ \, x_k\right) \nonumber \\&\quad =\frac{1}{n} \sum _{i=1}^n \begin{bmatrix} x_k-x_\star \\ \nabla f_i(x_k)\end{bmatrix}^{\mathsf {T}}\begin{bmatrix} M_{11}I_p &{} M_{12}I_p \\ M_{21}I_p &{} 0_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ \nabla f_i(x_k)\end{bmatrix}- \frac{1}{n} \sum _{i=1}^n \Vert \nabla f_i(x_k)\Vert ^2 \end{aligned}$$

(57)

Since $\frac{1}{n}\sum _{i=1}^n \nabla f_i(x_\star ) = \nabla g(x_\star ) = 0$, the first term on the right side of (57) is equal to

$$\begin{aligned} \frac{1}{n} \sum _{i=1}^n \begin{bmatrix} x_k-x_\star \\ \nabla f_i(x_k)-\nabla f_i(x_\star )\end{bmatrix} \begin{bmatrix} M_{11}I_p &{} M_{12}I_p \\ M_{21}I_p &{} 0_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ \nabla f_i(x_k)-\nabla f_i(x_\star ) \end{bmatrix} \end{aligned}$$

Based on the constraint condition (56), we know that the above term is greater than or equal to $\frac{2}{n} \sum _{i=1}^n \Vert \nabla f_i(x_k)-\nabla f_i(x_\star ) \Vert ^2$. Substituting this fact back into (57) leads to the inequality:

$$\begin{aligned}&\,{\mathbb {E}} \left( \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}^{\mathsf {T}}\begin{bmatrix} M_{11}I_p &{} M_{12}I_p \\ M_{21}I_p &{} M_{22}I_p\end{bmatrix} \begin{bmatrix} x_k-x_\star \\ u_k \end{bmatrix}\,\,\bigg |\ \, x_k\right) \nonumber \\&\quad \ge \frac{1}{n} \sum _{i=1}^n \left( 2\Vert \nabla f_i(x_k)-\nabla f_i(x_\star ) \Vert ^2- \Vert \nabla f_i(x_k)\Vert ^2\right) \nonumber \\&\quad = \frac{1}{n} \sum _{i=1}^n \left( \Vert \nabla f_i(x_k)-2\nabla f_i(x_\star ) \Vert ^2- 2\Vert \nabla f_i(x_\star )\Vert ^2\right) \nonumber \\&\quad \ge -\frac{2}{n} \sum _{i=1}^n \Vert \nabla f_i(x_\star )\Vert ^2 \end{aligned}$$

(58)

Taking the expectation of both sides, we arrive at (12), as desired. Now we are ready to prove our main theorem. By Schur complement, (10) is equivalent to (15), which can be further rewritten as

$$\begin{aligned} \left( \begin{bmatrix} 1-\rho _k^2 &{}\quad -\alpha _k &{}\quad -\alpha _k\\ alpha_k &{}\quad \alpha _k^2 &{}\quad \alpha _k^2 \\ alpha_k &{}\quad \alpha _k^2 &{}\quad \alpha _k^2 \end{bmatrix} +\nu _k\!\begin{bmatrix} -2m &{}\quad 1 &{}\quad 0\\ 1 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 0 \end{bmatrix} +\lambda _k\!\begin{bmatrix} M_{11} &{}\quad M_{12} &{}\quad 0\\ M_{21} &{}\quad M_{22} &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 0 \end{bmatrix} +\mu _k\!\begin{bmatrix} 0 &{}\quad 0 &{}\quad 0\\ 0 &{}\quad \delta ^2 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad -1 \end{bmatrix}\right) \otimes I_p\preceq 0 \end{aligned}$$

(59)

Since $x_{k+1}-x_\star =x_k-x_\star -\alpha _k(u_k+e_k)$, we have

$$\begin{aligned} \begin{bmatrix} x_k-x_\star \\ u_k \\ e_k \end{bmatrix}^{\mathsf {T}}\left( \begin{bmatrix} 1 &{}\quad -\alpha _k &{}\quad -\alpha _k\\ alpha_k &{}\quad \alpha _k^2 &{}\quad \alpha _k^2 \\ -\alpha _k &{}\quad \alpha _k^2 &{}\quad \alpha _k^2 \end{bmatrix}\otimes I_p\right) \begin{bmatrix} x_k-x_\star \\ u_k \\ e_k \end{bmatrix}=\Vert x_{k+1}-x_\star \Vert ^2 \end{aligned}$$

(60)

Now we can left and right multiply (59) by $[(x_k-x_\star )^{\mathsf {T}}, u_k^{\mathsf {T}}, e_k^{\mathsf {T}}]$ and $[(x_k-x_\star )^{\mathsf {T}}, u_k^{\mathsf {T}}, e_k^{\mathsf {T}}]^{\mathsf {T}}$, and apply the inequalities (4), (54), and (12) to get the desired conclusion. $\square $

Proof of Lemma 2

We use an induction argument to prove Item 1. For simplicity, we denote (48) as $V_{k+1}=h(V_k)$. Suppose we have $V_k=h(V_{k-1})$ and $V_{k-1}>V_\star $. We are going to show $V_{k+1}=h(V_k)$ and $V_k>V_\star $. We can rewrite (48) as

$$\begin{aligned} V_{k+1}=\,\, \mathop {{{\,\mathrm{minimize}\,}}}\limits _{\zeta > 0} \quad A_k(1+Z_k^{-1}) + B_k(1+Z_k) \end{aligned}$$

(61)

where $A_k$, $B_k$, and $Z_k$ are defined as

$$\begin{aligned} A_k&=\frac{m^2 V_k^2 \left( c^2+(2G^2+{\tilde{M}} V_k)\delta ^2\right) }{(2G^2+{\tilde{M}} V_k)^2}\\ B_k&=\frac{(2G^2 V_k+({\tilde{M}}-m^2) V_k^2)((2G^2+{\tilde{M}} V_k)(1-\delta ^2)-c^2)}{(2G^2+{\tilde{M}} V_k)^2}\\ Z_k&=\frac{\bigl (2 G^2+{\tilde{M}} V_k\bigr )(\delta ^2+\zeta _k )+c^2}{(2G^2+{\tilde{M}} V_k)(1-\delta ^2)-c^2} \end{aligned}$$

Note that $A_k \ge 0$ and $B_k \ge 0$ due to the condition $2G^2(1-\delta ^2) \ge c^2$. The objective in (61) therefore has a form very similar to the objective in (26). Applying Lemma 1, we deduce that $V_{k+1} =(\sqrt{A_k}+\sqrt{B_k})^2$, which is the same as (49). The associated $Z_k^opt $ is $\sqrt{\tfrac{A_k}{B_k}}$. To ensure this is a feasible choice, it remains to check that the associated $\zeta _k^opt > 0$ as well. Via algebraic manipulations, one can show that $\zeta _k > 0$ is equivalent to $V_k > V_\star $. We can also verify $A_k$ is a monotonically increasing function of $V_k$, and $B_k$ is a monotonically nondecreasing function of $V_k$. Hence h is a monotonically increasing function. Also notice $V_\star $ is a fixed point of (49). Therefore, if we assume $V_k=h(V_{k-1})$ and $V_{k-1}>V_\star $, we have $V_k=h(V_{k-1})>h(V_\star )=V_\star $. Hence we guarantee $\zeta _k>0$ and $V_{k+1}=h(V_k)$. By similar arguments, one can verify $V_1=h(V_0)$. And it is assumed that $V_0>V_\star $. This completes the induction argument.

Item 2 follows from a similar argument to the one used in Sect. 2.2. Finally, Item 3 can be proven by choosing a sufficiently small constant stepsize $\alpha $ to make ${\hat{U}}_k$ arbitrarily close to $V_\star $. Since $V_\star \le V_k \le {\hat{U}}_k$, we conclude that $\lim _{k\rightarrow \infty } V_k = V_\star $, as required. $\square $

Rights and permissions

Reprints and permissions

About this article

Cite this article

Hu, B., Seiler, P. & Lessard, L. Analysis of biased stochastic gradient descent using sequential semidefinite programs. Math. Program. 187, 383–408 (2021). https://doi.org/10.1007/s10107-020-01486-1

Download citation

Received: 25 June 2018
Accepted: 02 March 2020
Published: 23 March 2020
Issue Date: May 2021
DOI: https://doi.org/10.1007/s10107-020-01486-1

Keywords

Mathematics Subject Classification

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Analysis of biased stochastic gradient descent using sequential semidefinite programs

Abstract

Access this article

Similar content being viewed by others

A Variance Controlled Stochastic Method with Biased Estimation for Faster Non-convex Optimization

Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization

Stochastic Gradient Method with Barzilai–Borwein Step for Unconstrained Nonlinear Optimization

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

Appendix

Proof of Theorem 1

Proof of Lemma 2

Rights and permissions

About this article

Cite this article

Keywords

Mathematics Subject Classification

Navigation

Analysis of biased stochastic gradient descent using sequential semidefinite programs

Abstract

Access this article

Similar content being viewed by others

A Variance Controlled Stochastic Method with Biased Estimation for Faster Non-convex Optimization

Unified Analysis of Stochastic Gradient Methods for Composite Convex and Smooth Optimization

Stochastic Gradient Method with Barzilai–Borwein Step for Unconstrained Nonlinear Optimization

Notes

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendices

Appendix

Proof of Theorem 1

Proof of Lemma 2

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Mathematics Subject Classification

Search

Navigation