Second-Order Step-Size Tuning of SGD for Non-Convex Optimization

Abstract

With the aim of a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. To do so, one estimates curvature, based on a local quadratic model and using only noisy gradient approximations. One obtains a new stochastic first-order method (Step-Tuned SGD), enhanced by second-order information, which can be seen as a stochastic version of the classical Barzilai-Borwein method. Our theoretical results ensure almost sure convergence to the critical set and we provide convergence rates. Experiments on deep residual network training illustrate the favorable properties of our approach. For such networks we observe, during training, both a sudden drop of the loss and an improvement of test accuracy at medium stages, yielding better results than SGD, RMSprop, or ADAM.

Notes

  1. For a fair comparison we implement this method with the scaling-factor \(\alpha \) of Algorithm 1.

  2. There is also the possibility of computing additional estimates as [41] previously did for a stochastic BFGS algorithm, but this would double the computational cost.

  3. Step-Tuned SGD achieves the same small level of error as SGD when doing additional epochs thanks to the decay schedule present in Algorithm 2.

  4. Default values: \((\nu ,\beta ,{\tilde{m}},{\tilde{M}},\delta ) = (2,0.9,0.5,2,0.001)\).

  5. An alternative common practice consists in manually decaying the step-size at pre-defined epochs. This technique, although efficient in practice for achieving state-of-the-art results, makes the comparison of algorithms harder; hence we stick to a usual Robbins-Monro type of decay.

References

  1. Alber YI, Iusem AN, Solodov MV (1998) On the projected subgradient method for nonsmooth convex optimization in a Hilbert space. Math Program 81(1):23–35

  2. Allen-Zhu Z (2018) Natasha 2: faster non-convex optimization than SGD. In: Advances in Neural Information Processing Systems (NIPS), pp 2675–2686

  3. Alvarez F, Cabot A (2004) Steepest descent with curvature dynamical system. J Optim Theory Appl 120(2):247–273

  4. Babaie-Kafaki S, Fatemi M (2013) A modified two-point stepsize gradient algorithm for unconstrained minimization. Optim Methods Softw 28(5):1040–1050

  5. Barakat A, Bianchi P (2018) Convergence of the ADAM algorithm from a dynamical system viewpoint. arXiv:1810.02263

  6. Barzilai J, Borwein JM (1988) Two-point step size gradient methods. IMA J Numer Anal 8(1):141–148

  7. Bertsekas DP, Hager W, Mangasarian O (1998) Nonlinear programming. Athena Scientific, Belmont, MA

  8. Biglari F, Solimanpur M (2013) Scaling on the spectral gradient method. J Optim Theory Appl 158:626–635

  9. Bolte J, Pauwels E (2020) A mathematical model for automatic differentiation in machine learning. In: advances in Neural Information Processing Systems (NIPS)

  10. Carmon Y, Duchi JC, Hinder O, Sidford A (2017) Convex until proven guilty: dimension-free acceleration of gradient descent on non-convex functions. In: proceedings of the international conference on machine learning (ICML), pp 654–663

  11. Castera C, Bolte J, Févotte C, Pauwels E (2021) An inertial Newton algorithm for deep learning. J Mach Learn Res 22(134):1–31

  12. Curtis FE, Guo W (2016) Handling nonpositive curvature in a limited memory steepest descent method. IMA J Numer Anal 36(2):717–742

  13. Curtis FE, Robinson DP (2019) Exploiting negative curvature in deterministic and stochastic optimization. Math Program 176(1–2):69–94

  14. Dai Y, Yuan J, Yuan YX (2002) Modified two-point stepsize gradient methods for unconstrained optimization. Comput Optim Appl 22(1):103–109

  15. Davis D, Drusvyatskiy D, Kakade S, Lee JD (2020) Stochastic subgradient method converges on tame functions. Found Comput Math 20(1):119–154

  16. Duchi J, Hazan E, Singer Y (2011) Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res, 12(7)

  17. Duchi JC, Ruan F (2018) Stochastic methods for composite and weakly convex optimization problems. SIAM J Optim 28(4):3229–3259

  18. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778

  19. Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507

  20. Hunter JD (2007) Matplotlib: a 2D graphics environment. Comput Sci Eng 9(3):90–95

  21. Idelbayev Y (2018) Proper ResNet implementation for CIFAR10/CIFAR100 in PyTorch. https://github.com/akamaster/pytorch_resnet_cifar10

  22. Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: proceedings of the international conference on machine learning (ICML), pp 448–456

  23. Johnson R, Zhang T (2013) Accelerating stochastic gradient descent using predictive variance reduction. In: advances in neural information processing systems (NIPS), pp 315–323

  24. Kingma DP, Ba J (2015) Adam: a method for stochastic optimization. In: proceedings of the international conference on learning representations (ICLR)

  25. Krishnan S, Xiao Y, Saurous RA (2018) Neumann optimizer: a practical optimization algorithm for deep neural networks. In: proceedings of the international conference on learning representations (ICLR)

  26. Krizhevsky A (2009) Learning multiple layers of features from tiny images. Tech. rep, Canadian Institute for Advanced Research

  27. LeCun Y, Bottou L, Bengio Y, Haffner P et al (1998) Gradient-based learning applied to document recognition. Proc IEEE 86(11):2278–2324

  28. LeCun Y, Cortes C, Burges C (2010) MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist

  29. Li X, Orabona F (2019) On the convergence of stochastic gradient descent with adaptive stepsizes. In: proceedings of the international conference on artificial intelligence and statistics (AISTATS), pp 983–992

  30. Liang J, Xu Y, Bao C, Quan Y, Ji H (2019) Barzilai-Borwein-based adaptive learning rate for deep learning. Pattern Recognit Lett 128:197–203

  31. Lin M, Chen Q, Yan S (2013) Network in network. arXiv:1312.4400

  32. Liu M, Yang T (2017) On noisy negative curvature descent: Competing with gradient descent for faster non-convex optimization. arXiv:1709.08571

  33. Martens J, Grosse R (2015) Optimizing neural networks with kronecker-factored approximate curvature. In: proceedings of the international conference on machine learning (ICML), pp 2408–2417

  34. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. In: advances in neural information processing systems (NIPS), pp 8026–8037

  35. Raydan M (1997) The Barzilai and Borwein gradient method for the large scale unconstrained minimization problem. SIAM J Optim 7(1):26–33

  36. Robbins H, Monro S (1951) A stochastic approximation method. Ann Math Stat 22(1):400–407

  37. Robbins H, Siegmund D (1971) A convergence theorem for non negative almost supermartingales and some applications. In: optimizing methods in statistics, Elsevier, pp 233–257

  38. Robles-Kelly A, Nazari A (2019) Incorporating the Barzilai-Borwein adaptive step size into subgradient methods for deep network training. In: 2019 digital image computing: techniques and applications (DICTA), pp 1–6

  39. Van Rossum G (1995) Python reference manual. CWI (Centre for Mathematics and Computer Science)

  40. Royer CW, Wright SJ (2018) Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. SIAM J Optim 28(2):1448–1477

  41. Schraudolph NN, Yu J, Günter S (2007) A stochastic Quasi-Newton method for online convex optimization. In: proceedings of the international conference on artificial intelligence and statistics (AISTATS)

  42. Tan C, Ma S, Dai YH, Qian Y (2016) Barzilai-Borwein step size for stochastic gradient descent. In: advances in neural information processing systems (NIPS), pp 685–693

  43. Tieleman T, Hinton G (2012) Lecture 6.5-RMSprop: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw Mach Learn 4(2):26–31

  44. van der Walt S, Colbert SC, Varoquaux G (2011) The NumPy array: a structure for efficient numerical computation. Comput Sci Eng 13(2):22–30

  45. Wilson AC, Roelofs R, Stern M, Srebro N, Recht B (2017) The marginal value of adaptive gradient methods in machine learning. In: advances in neural information processing systems (NIPS), pp 4148–4158

  46. Xiao Y, Wang Q, Wang D (2010) Notes on the Dai-Yuan-Yuan modified spectral gradient method. J Comput Appl Math 234(10):2986–2992

  47. Zhuang J, Tang T, Ding Y, Tatikonda SC, Dvornek N, Papademetris X, Duncan J (2020) Adabelief optimizer: adapting stepsizes by the belief in observed gradients. Advances in Neural Information Processing Systems (NIPS) 33

Acknowledgements

The authors acknowledge the support of the European Research Council (ERC FACTORY-CoG-6681839), the Agence Nationale de la Recherche (ANR 3IA-ANITI, ANR-17-EURE-0010 CHESS, ANR-19-CE23-0017 MASDOL) and the Air Force Office of Scientific Research (FA9550-18-1-0226). Part of the numerical experiments were done using the OSIRIM platform of IRIT, supported by the CNRS, the FEDER, Région Occitanie and the French government (http://osirim.irit.fr/site/en). We thank the development teams of the following libraries that were used in the experiments: Python [39], Numpy [44], Matplotlib [20], PyTorch [34], and the PyTorch implementation of ResNets from [21]. We thank Emmanuel Soubies and Sixin Zhang for useful discussions and Sébastien Gadat for pointing out flaws in the original proof.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The last three authors are listed in alphabetical order.

Appendices

Appendix A: Details About Deep Learning Experiments

In addition to the method described in Sect. 5.1, we provide in Table 1 a summary of each problem considered.

Table 1 Setting of the four different deep learning experiments

In the DL experiments of Sect. 5, we display the training error and the test accuracy of each algorithm as a function of the number of stochastic gradient estimates computed. Due to their adaptive procedures, ADAM, RMSprop and Step-Tuned SGD have additional sub-routines in comparison to SGD. Thus, in Table 2 we additionally provide the wall-clock time per epoch of these methods relative to SGD. Unlike the number of back-propagations performed, wall-clock time depends on many factors: the network and datasets considered, the computer used, and most importantly, the implementation. Regarding implementation, we emphasize that we used the versions of SGD, ADAM and RMSprop provided in PyTorch, which are fully optimized (and in particular parallelized). Table 2 indicates that Step-Tuned SGD is slower than other adaptive methods for large networks, but this is due to our non-parallel implementation. On small networks (where the benefit of parallel computing is small), we observe that running Step-Tuned SGD for one epoch is actually faster than running SGD. In conclusion, the number of back-propagations is a more suitable metric for comparing the algorithms, and all methods considered require a single back-propagation per iteration.

Table 2 Relative wall-clock time per epoch compared to SGD

Appendix B: Proof of the Theoretical Results

We state a lemma that we will use to prove Theorem 1.

Preliminary Lemma

The result is the following.

Lemma 1

([1, Proposition 2]) Let \((u_k)_{k\in {\mathbb {N}}}\) and \((v_k)_{k\in {\mathbb {N}}}\) be two non-negative real sequences. Assume that \(\sum _{k=0}^{+\infty } u_k v_k <+\infty \), and \(\sum _{k=0}^{+\infty } v_k =+\infty \). If there exists a constant \(C>0\) such that \(\forall k\in {\mathbb {N}}, \vert u_{k+1} - u_k \vert \le C v_k\), then \(u_k\xrightarrow [k\rightarrow +\infty ]{}0\).

Proof of the main theorem

We can now prove Theorem 1.

Proof of Theorem 1

We first clarify the random process induced by the draw of the mini-batches. Algorithm 2 takes a sequence of mini-batches as input. This sequence is represented by the random variables \(({\mathsf {B}}_k)_{k\in {\mathbb {N}}}\) as described in Sect. 3.2. Each of these random variables is independent of the others. In particular, for \(k\in {\mathbb {N}}_{>0}\), \({\mathsf {B}}_k\) is independent of the previous mini-batches \({\mathsf {B}}_0,\ldots , {\mathsf {B}}_{k-1}\). For convenience, we will denote \(\underline{{\mathsf {B}}}_k = \left\{ {\mathsf {B}}_0,\ldots ,{\mathsf {B}}_k\right\} \), the mini-batches up to iteration k. Due to the randomness of the mini-batches, the algorithm is a random process as well. As such, \(\theta _{k}\) is a random variable with a deterministic dependence on \(\underline{{\mathsf {B}}}_{k-1}\) and is independent of \({\mathsf {B}}_k\). However, \(\theta _{k+\frac{1}{2}}\) and \({\mathsf {B}}_{k}\) are not independent. Similarly, we constructed \(\gamma _k\) such that it is a random variable with a deterministic dependence on \(\underline{{\mathsf {B}}}_{k-1}\), which is independent of \({\mathsf {B}}_k\). This dependency structure will be crucial to derive and bound conditional expectations. Finally, we highlight the following important identity, for any \(k\in {\mathbb {N}}_{>0}\),

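$$\begin{aligned} {\mathbb {E}}\left[ \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] = \nabla {\mathcal {J}}(\theta _k). \end{aligned}$$
(25)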

Indeed, the iterate \(\theta _{k}\) is a deterministic function of \(\underline{{\mathsf {B}}}_{k-1}\), so taking the expectation over \({\mathsf {B}}_k\), which is independent of \(\underline{{\mathsf {B}}}_{k-1}\), we recover the full gradient of \({\mathcal {J}}\) as the distribution of \({\mathsf {B}}_k\) is the same as that of \({\mathsf {S}}\) in Sect. 3.2. Notice in addition that a similar identity does not hold for \(\theta _{k+\frac{1}{2}}\) (as it depends on \({\mathsf {B}}_k\)).
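
Identity (25) can be checked numerically on a toy problem. The sketch below is only an illustration under assumptions that are not taken from the paper: a small least-squares loss with hypothetical data `A` and `b`, and mini-batches drawn uniformly among all subsets of a fixed size. At a fixed point playing the role of \(\theta _k\), averaging the mini-batch gradient over every possible mini-batch recovers the full gradient.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N, P, batch_size = 6, 3, 2          # tiny sizes so all mini-batches can be enumerated
A = rng.normal(size=(N, P))         # hypothetical data, not from the paper
b = rng.normal(size=N)
theta = rng.normal(size=P)          # fixed point: plays the role of theta_k

def grad_batch(theta, idx):
    """Gradient of J_B(theta) = mean over n in B of 0.5 * (A_n^T theta - b_n)^2."""
    idx = list(idx)
    r = A[idx] @ theta - b[idx]
    return A[idx].T @ r / len(idx)

full_grad = grad_batch(theta, range(N))

# Expectation over B_k: uniform average over all subsets of size batch_size.
batches = list(itertools.combinations(range(N), batch_size))
mean_batch_grad = np.mean([grad_batch(theta, B) for B in batches], axis=0)

max_gap = float(np.max(np.abs(mean_batch_grad - full_grad)))
print(max_gap)  # zero up to floating-point rounding
```

The check relies on `theta` being fixed before the mini-batch is drawn; a point computed from the mini-batch itself, such as \(\theta _{k+\frac{1}{2}}\), does not satisfy the same identity.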

We now provide estimates that will be used extensively in the rest of the proof. The gradient of the loss function \(\nabla {\mathcal {J}}\) is locally Lipschitz continuous as \({\mathcal {J}}\) is twice continuously differentiable. By assumption, there exists a compact convex set \({\mathsf {C}}\subset {\mathbb {R}}^P\), such that with probability 1, the sequence of iterates \((\theta _{k})_{k\in \frac{1}{2}{\mathbb {N}}}\) belongs to \({\mathsf {C}}\). Therefore, by local Lipschitz continuity, the restriction of \(\nabla {\mathcal {J}}\) to \({\mathsf {C}}\) is Lipschitz continuous on \({\mathsf {C}}\). Similarly, each \(\nabla {\mathcal {J}}_n\) is also Lipschitz continuous on \({\mathsf {C}}\). We denote by \(L>0\) a Lipschitz constant common to each \(\nabla {\mathcal {J}}_n\), \(n=1,\ldots , N\). Notice that the Lipschitz continuity is preserved by averaging, in other words,

$$\begin{aligned} \forall {\mathsf {B}}\subseteq \left\{ 1,\ldots ,N\right\} ,\forall \psi _1,\psi _2\in {\mathsf {C}}, \quad \Vert \nabla {\mathcal {J}}_{\mathsf {B}}(\psi _1) -\nabla {\mathcal {J}}_{\mathsf {B}}(\psi _2) \Vert \le L\Vert \psi _1-\psi _2\Vert . \end{aligned}$$
(26)

In addition, using the continuity of the \(\nabla {\mathcal {J}}_n\)’s, there exists a constant \(C_2>0\), such that,

$$\begin{aligned} \forall {\mathsf {B}}\subseteq \left\{ 1,\ldots ,N\right\} ,\forall \psi \in {\mathsf {C}}, \quad \Vert \nabla {\mathcal {J}}_{\mathsf {B}}(\psi )\Vert \le C_2. \end{aligned}$$
(27)

Finally, for a function \(g:{\mathbb {R}}^P\rightarrow {\mathbb {R}}\) with L-Lipschitz continuous gradient, we recall the following inequality called descent lemma (see for example [7, Proposition A.24]). For any \(\theta \in {\mathbb {R}}^P\) and any \(d\in {\mathbb {R}}^P\),

$$\begin{aligned} g(\theta +d) \le g(\theta ) + \langle \nabla g(\theta ), d\rangle + \frac{L}{2}\Vert d \Vert ^2. \end{aligned}$$
(28)

In our case, since \(\nabla {\mathcal {J}}\) is only L-Lipschitz continuous on \({\mathsf {C}}\), which is convex, we have a similar bound for \({\mathcal {J}}\) on \({\mathsf {C}}\): for any \(\theta \in {\mathsf {C}}\) and any \(d\in {\mathbb {R}}^P\) such that \(\theta +d\in {\mathsf {C}}\),

$$\begin{aligned} {\mathcal {J}}(\theta +d) \le {\mathcal {J}}(\theta ) + \langle \nabla {\mathcal {J}}(\theta ), d\rangle + \frac{L}{2}\Vert d \Vert ^2. \end{aligned}$$
(29)
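
As a sanity check of the descent inequality (28), the sketch below evaluates both sides at random points and directions. The test function is an assumption made for illustration: \(g(\theta ) = \frac{1}{N}\sum _n \phi (A_n^T\theta -b_n)\) with \(\phi (t)=t^2/(1+t^2)\) and random data. Since \(\phi ''(t) = (2-6t^2)/(1+t^2)^3\) satisfies \(\vert \phi ''\vert \le 2\), the value \(L = 2\lambda _{\max }(A^TA)/N\) is a valid Lipschitz constant for \(\nabla g\).

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 50, 5
A = rng.normal(size=(N, P))   # hypothetical data
b = rng.normal(size=N)

def g(theta):
    t = A @ theta - b
    return np.mean(t**2 / (1.0 + t**2))

def grad_g(theta):
    t = A @ theta - b
    return A.T @ (2.0 * t / (1.0 + t**2) ** 2) / N

# |phi''| <= 2, so the Hessian norm is at most 2 * lambda_max(A^T A) / N.
L = 2.0 * np.linalg.eigvalsh(A.T @ A)[-1] / N

# Smallest observed gap between the quadratic upper bound and g(theta + d).
worst_slack = np.inf
for _ in range(1000):
    theta = rng.normal(size=P)
    d = rng.normal(size=P)
    upper = g(theta) + grad_g(theta) @ d + 0.5 * L * (d @ d)
    worst_slack = min(worst_slack, upper - g(theta + d))

print(worst_slack)  # non-negative when the inequality holds at every sample
```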

Let \(\theta _0\in {\mathbb {R}}^P\) and let \((\theta _{k})_{k\in \frac{1}{2}{\mathbb {N}}}\) be a sequence generated by Algorithm 2 initialized at \(\theta _0\). By assumption this sequence belongs to \({\mathsf {C}}\) almost surely. To simplify, for \(k\in {\mathbb {N}}\), we denote \(\eta _k = \alpha \gamma _k (k+1)^{-(1/2+\delta )}\). Fix an iteration \(k\in {\mathbb {N}}\). We can use (29) with \(\theta = \theta _k\) and \(d = -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)\), almost surely (with respect to the boundedness assumption),

$$\begin{aligned} {\mathcal {J}}(\theta _{k+\frac{1}{2}}) \le {\mathcal {J}}(\theta _k) - \eta _k \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k) \rangle + \frac{\eta _k^2}{2}L \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)\Vert ^2. \end{aligned}$$
(30)

Similarly with \(\theta = \theta _{k+\frac{1}{2}}\) and \(d = -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\), almost surely,

$$\begin{aligned} {\mathcal {J}}(\theta _{k+1}) \le {\mathcal {J}}(\theta _{k+\frac{1}{2}}) - \eta _k \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle + \frac{\eta _k^2}{2}L \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\Vert ^2. \end{aligned}$$
(31)

We combine (30) and (31), almost surely,

$$\begin{aligned} \begin{aligned}&{\mathcal {J}}(\theta _{k+1}) \le {\mathcal {J}}(\theta _{k}) - \eta _k \left( \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\rangle + \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \right) \\&\quad + \frac{\eta _k^2}{2}L \left( \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)\Vert ^2+ \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\Vert ^2\right) . \end{aligned} \end{aligned}$$
(32)
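
Inequalities (30) and (31) correspond to two consecutive gradient steps that reuse the same mini-batch \({\mathsf {B}}_k\) and the same step-size \(\eta _k\). A minimal sketch of that iteration structure follows. It is not Algorithm 2: the curvature-based rule for \(\gamma _k\) is not restated in this appendix, so `gamma` is a hypothetical placeholder held constant in \([{\tilde{m}},{\tilde{M}}]\), and the quadratic loss, the data and the value of `alpha` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, batch_size = 200, 10, 20
A = rng.normal(size=(N, P))            # hypothetical data
b = A @ rng.normal(size=P)             # consistent targets, so the minimum of J is 0

def loss(theta):
    return 0.5 * np.mean((A @ theta - b) ** 2)

def grad_batch(theta, idx):
    r = A[idx] @ theta - b[idx]
    return A[idx].T @ r / len(idx)

alpha, delta = 0.1, 0.001              # alpha is an assumed value; delta is the default of Note 4
m_tilde, M_tilde = 0.5, 2.0            # default bounds of Note 4
gamma = float(np.clip(1.0, m_tilde, M_tilde))   # placeholder for the tuned gamma_k

theta = np.zeros(P)
initial_loss = loss(theta)
for k in range(500):
    B_k = rng.choice(N, size=batch_size, replace=False)       # mini-batch B_k
    eta_k = alpha * gamma * (k + 1) ** -(0.5 + delta)         # eta_k as defined before (30)
    theta_half = theta - eta_k * grad_batch(theta, B_k)       # theta_{k+1/2}
    theta = theta_half - eta_k * grad_batch(theta_half, B_k)  # theta_{k+1}, same B_k
final_loss = loss(theta)
print(initial_loss, final_loss)
```

Because `eta_k` is computed before `B_k` is used, it depends only on earlier mini-batches, which mirrors the independence property the proof relies on.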

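Using the boundedness assumption and (27), and replacing \(C_2\) by \(\max (C_2,C_2^2)\) if necessary so that the squared norms are bounded by the same constant, almost surely,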

$$\begin{aligned} \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)\Vert ^2 \le C_2 \quad \text {and}\quad \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\Vert ^2 \le C_2. \end{aligned}$$
(33)

So almost surely,

$$\begin{aligned} \begin{aligned} {\mathcal {J}}(\theta _{k+1}) \le {\mathcal {J}}(\theta _{k})&- \eta _k \left( \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\rangle + \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \right) \\ {}&+ \eta _k^2L C_2. \end{aligned} \end{aligned}$$
(34)

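Then, we take the expectation of (34) over \({\mathsf {B}}_k\), conditionally on \(\underline{{\mathsf {B}}}_{k-1}\) (the mini-batches used up to iteration \(k-1\)), and obtain,

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \le {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right]&- {\mathbb {E}}\left[ \eta _k \left( \langle \nabla {\mathcal {J}}(\theta _k), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\rangle + \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \right) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \\&+ {\mathbb {E}}\left[ \eta _k^2 \,|\, \underline{{\mathsf {B}}}_{k-1}\right] L C_2. \end{aligned} \end{aligned}$$
(35)

As explained at the beginning of the proof, \(\theta _{k}\) is a deterministic function of \(\underline{{\mathsf {B}}}_{k-1}\), thus, \({\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] = {\mathcal {J}}(\theta _{k})\). Similarly, by construction \(\eta _k\) is independent of the current mini-batch \({\mathsf {B}}_k\): it is a deterministic function of \(\underline{{\mathsf {B}}}_{k-1}\). Hence, (35) reads,

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \le {\mathcal {J}}(\theta _{k})&- \eta _k \langle \nabla {\mathcal {J}}(\theta _k), {\mathbb {E}}\left[ \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \rangle - \eta _k {\mathbb {E}}\left[ \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \\&+ \eta _k^2 L C_2. \end{aligned} \end{aligned}$$
(36)

Then, we use the fact that \({\mathbb {E}}\left[ \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] = \nabla {\mathcal {J}}(\theta _k)\), which is identity (25). Overall, we obtain,

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \le {\mathcal {J}}(\theta _{k}) - \eta _k \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2 + \eta _k^2 L C_2 - \eta _k {\mathbb {E}}\left[ \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \,|\, \underline{{\mathsf {B}}}_{k-1}\right] . \end{aligned}$$
(37)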

We will now bound the last term of (37). First we write,

$$\begin{aligned} \begin{aligned}&-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \\&\quad =-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}})- \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k})\rangle - \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle . \end{aligned} \end{aligned}$$
(38)

Using the Cauchy-Schwarz inequality, as well as (26) and (27), almost surely,

$$\begin{aligned} \begin{aligned} |\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}})- \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k})\rangle |&\le \Vert \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}) \Vert \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}})- \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k})\Vert \\&\le \Vert \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}) \Vert L\Vert \theta _{k+\frac{1}{2}}-\theta _{k}\Vert \\&\le \Vert \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}) \Vert L\Vert -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_k}(\theta _{k})\Vert \\&\le LC_2^2\eta _k. \end{aligned} \end{aligned}$$
(39)

Hence,

$$\begin{aligned} \begin{aligned}&-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \le LC_2^2\eta _k - \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle . \end{aligned} \end{aligned}$$
(40)

We perform similar computations on the last term of (40), almost surely

$$\begin{aligned} \begin{aligned}&-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle \\&= -\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}})-\nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle \\&\le \Vert \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}})-\nabla {\mathcal {J}}(\theta _{k})\Vert \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \Vert - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle \\&\le LC_2\Vert \theta _{k+\frac{1}{2}}-\theta _{k}\Vert - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle \\&\le LC_2^2\eta _k - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle . \end{aligned} \end{aligned}$$
(41)

Finally we obtain by combining (38), (40) and (41), almost surely,

$$\begin{aligned} \begin{aligned}&-\langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \le 2LC_2^2\eta _k - \langle \nabla {\mathcal {J}}(\theta _{k}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \rangle . \end{aligned} \end{aligned}$$
(42)

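Going back to the last term of (37), we have, taking the conditional expectation of (42) and using identity (25), almost surely

$$\begin{aligned} \begin{aligned} - \eta _k {\mathbb {E}}\left[ \langle \nabla {\mathcal {J}}(\theta _{k+\frac{1}{2}}), \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k+\frac{1}{2}}) \rangle \,|\, \underline{{\mathsf {B}}}_{k-1}\right]&\le 2LC_2^2\eta _k^2 - \eta _k \langle \nabla {\mathcal {J}}(\theta _{k}), {\mathbb {E}}\left[ \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _{k}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \rangle \\&= 2LC_2^2\eta _k^2 - \eta _k \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2. \end{aligned} \end{aligned}$$
(43)

In the end we obtain, for an arbitrary iteration \(k\in {\mathbb {N}}\), almost surely

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \le {\mathcal {J}}(\theta _{k}) - 2\eta _k \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2 + \eta _k^2 L(C_2+2C_2^2). \end{aligned}$$
(44)

To simplify we assume that \({{\tilde{M}}}\ge \nu \) (otherwise set \(\tilde{M} = \max ({{\tilde{M}}},\nu )\)). We use the fact that \(\eta _k\in [\frac{\alpha \tilde{m}}{(k+1)^{1/2+\delta }},\frac{\alpha {{\tilde{M}}}}{(k+1)^{1/2+\delta }}]\), to obtain almost surely,

$$\begin{aligned} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1}) \,|\, \underline{{\mathsf {B}}}_{k-1}\right] \le {\mathcal {J}}(\theta _{k}) - 2\frac{\alpha {\tilde{m}}}{(k+1)^{1/2+\delta }} \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2 + \frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L(C_2+2C_2^2). \end{aligned}$$
(45)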

Since \(\delta >0\), the last term is summable, so we can invoke the Robbins-Siegmund convergence theorem [37] to obtain that, almost surely, \(({\mathcal {J}}(\theta _{k}))_{k\in {\mathbb {N}}}\) converges and,

$$\begin{aligned} \sum _{k=0}^{+\infty }\frac{1}{(k+1)^{1/2+\delta }}\Vert \nabla {\mathcal {J}}(\theta _k) \Vert ^2 < + \infty . \end{aligned}$$
(46)

Since \(\sum _{k=0}^{+\infty }\frac{1}{(k+1)^{1/2+\delta }}=+\infty \), this implies at least that almost surely,

$$\begin{aligned} \liminf _{k\rightarrow \infty }\Vert \nabla {\mathcal {J}}(\theta _k) \Vert ^2=0. \end{aligned}$$
(47)
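
The argument uses two properties of the weights: \(\sum _k (k+1)^{-(1/2+\delta )}\) diverges while \(\sum _k (k+1)^{-(1+2\delta )}\) converges, which is the case whenever \(0<\delta \le 1/2\). The sketch below compares partial sums of both series. The value \(\delta = 0.25\) is chosen here purely for illustration, because with the default \(\delta = 0.001\) both effects are very slow to observe numerically; the helper name `partial_sums` is hypothetical.

```python
import numpy as np

def partial_sums(K, delta):
    """Partial sums over k = 0..K-1 of the two weight sequences used in the proof."""
    k = np.arange(K, dtype=float)
    first_order = np.sum((k + 1) ** -(0.5 + delta))        # weights appearing in (46)
    second_order = np.sum((k + 1) ** -(1.0 + 2 * delta))   # remainder weights, as in (49)
    return float(first_order), float(second_order)

delta = 0.25   # illustrative value, not the paper's default
for K in (10**2, 10**4, 10**6):
    print(K, partial_sums(K, delta))
```

The first partial sum keeps growing with K, roughly like \(K^{1/2-\delta }\), while the second stays below \(1 + 1/(2\delta )\) by comparison with an integral.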

To prove that in addition \(\displaystyle \lim _{k\rightarrow \infty }\Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2 = 0\), we will use Lemma 1 with \(u_k = \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\) and \(v_k = \frac{1}{(k+1)^{1/2+\delta }}\), for all \(k\in {\mathbb {N}}\). So we need to prove that there exists \(C_3>0\) such that \(\vert u_{k+1} - u_k\vert \le C_3 v_k\). To do so, we use the L-Lipschitz continuity of the gradients on \({\mathsf {C}}\), triangle inequalities and (27). It holds, almost surely, for all \(k \in {\mathbb {N}}\)

$$\begin{aligned}&\left| \Vert \nabla {\mathcal {J}}(\theta _{k+1})\Vert ^2-\Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right| \nonumber \\&\quad = \;\left( \;\Vert \nabla {\mathcal {J}}(\theta _{k+1})\Vert + \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert \;\right) \;\times \; \left| \;\Vert \;\nabla {\mathcal {J}}(\theta _{k+1})\Vert - \Vert \nabla {\mathcal {J}}(\theta _{k})\;\Vert \;\right| \nonumber \\&\quad \le 2C_2 \left| \Vert \nabla {\mathcal {J}}(\theta _{k+1})\Vert - \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert \right| \nonumber \\&\quad \le 2C_2 \Vert \nabla {\mathcal {J}}(\theta _{k+1})-\nabla {\mathcal {J}}(\theta _{k})\Vert \nonumber \\&\quad \le 2C_2 L \Vert \theta _{k+1}-\theta _{k}\Vert \\&\quad \le 2C_2 L \left\| -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k) -\eta _k\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\right\| \nonumber \\&\quad \le 2C_2 L\frac{\alpha {\tilde{M}}}{(k+1)^{1/2+\delta }} \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _k)+\nabla {\mathcal {J}}_{{\mathsf {B}}_{k}}(\theta _{k+\frac{1}{2}})\Vert \nonumber \\&\quad \le 4C_2^2 L\frac{\alpha {\tilde{M}}}{(k+1)^{1/2+\delta }}.\nonumber \end{aligned}$$
(48)

So taking \(C_3 =4C_2^2 L\alpha {\tilde{M}} \), by Lemma 1, almost surely, \(\lim _{k\rightarrow +\infty } \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2=0\). This concludes the almost sure convergence proof.

As for the rate, consider the expectation of (45) (with respect to the random variables \(({\mathsf {B}}_k)_{k\in {\mathbb {N}}}\)). The tower property of the conditional expectation gives \( {\mathbb {E}}[{\mathbb {E}}[{\mathcal {J}}(\theta _{k+1})|\underline{{\mathsf {B}}}_{k-1}]]={\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\right] \), so we obtain, for all \(k\in {\mathbb {N}}\),

$$\begin{aligned} \begin{aligned} 2\frac{\alpha {\tilde{m}}}{(k+1)^{1/2+\delta }}{\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2\right] \le&{\mathbb {E}}\left[ {\mathcal {J}}(\theta _k)\right] - {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\right] + \frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L(C_2+2C_2^2). \end{aligned} \end{aligned}$$
(49)

Then for \(K\ge 1\), we sum from 0 to \(K-1\),

$$\begin{aligned} \begin{aligned} \sum _{k=0}^{K-1}2\frac{\alpha {\tilde{m}}}{(k+1)^{1/2+\delta }}&{\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _k)\Vert ^2\right] \\&\le \sum _{k=0}^{K-1}{\mathbb {E}}\left[ {\mathcal {J}}(\theta _k)\right] -\sum _{k=0}^{K-1} {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{k+1})\right] + \sum _{k=0}^{K-1} \frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L(C_2+2C_2^2)\\&={\mathcal {J}}(\theta _0) - {\mathbb {E}}\left[ {\mathcal {J}}(\theta _{K})\right] + \sum _{k=0}^{K-1}\frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L(C_2+2C_2^2)\\&\le {\mathcal {J}}(\theta _0) - \inf _{\psi \in {\mathbb {R}}^P}{\mathcal {J}}(\psi ) + \sum _{k=0}^{K-1}\frac{\alpha ^2{\tilde{M}}^2}{(k+1)^{1+2\delta }}L(C_2+2C_2^2),\ \end{aligned} \end{aligned}$$
(50)

The right-hand side is bounded uniformly in K, since the last sum converges (\(1+2\delta >1\)). Hence there is a constant \(C_4>0\) such that for any \(K\ge 1\), dropping the non-negative term \(k=0\), it holds,

$$\begin{aligned} C_4\ge \sum _{k=0}^K \frac{1}{(k+1)^{1/2+\delta }} {\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right]&\ge \sum _{k=1}^K \frac{1}{(k+1)^{1/2+\delta }} {\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right] \nonumber \\&\ge \min _{k\in \left\{ 1,\ldots ,K\right\} }{\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right] \sum _{k=1}^K \frac{1}{(k+1)^{1/2+\delta }} \nonumber \\&\ge \frac{K}{(K+1)^{1/2+\delta }}\min _{k\in \left\{ 1,\ldots ,K\right\} } {\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}(\theta _{k})\Vert ^2\right] , \end{aligned}$$
(51)

and since \(K/(K+1)^{1/2+\delta }\ge 2^{-(1/2+\delta )}K^{1/2-\delta }\) for \(K\ge 1\), we obtain the rate. \(\square \)

Proof of the Corollary

Before proving the corollary we recall the following result.

Lemma 2

Let \(g:{\mathbb {R}}^P\rightarrow {\mathbb {R}}\) be an L-Lipschitz continuous and differentiable function. Then \(\nabla g\) is uniformly bounded on \({\mathbb {R}}^P\).
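
The lemma is recalled without proof; a short justification, given here as a sketch, goes through directional derivatives. For any \(\theta ,d\in {\mathbb {R}}^P\), each difference quotient of g along d is bounded by \(L\Vert d\Vert \), so

$$\begin{aligned} \langle \nabla g(\theta ), d\rangle = \lim _{t\rightarrow 0^+}\frac{g(\theta +td)-g(\theta )}{t} \le L\Vert d\Vert . \end{aligned}$$

Taking \(d=\nabla g(\theta )\) gives \(\Vert \nabla g(\theta )\Vert ^2\le L\Vert \nabla g(\theta )\Vert \), hence \(\Vert \nabla g(\theta )\Vert \le L\) for every \(\theta \in {\mathbb {R}}^P\).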

We can now prove the corollary.

Proof of Corollary 1

The proof is very similar to that of Theorem 1. Denote by L the Lipschitz constant of \(\nabla {\mathcal {J}}\). Then, the descent lemma (30) holds surely. Furthermore, since each \({\mathcal {J}}_n\), \(n\in \{1,\ldots ,N\}\), is Lipschitz, so is \({\mathcal {J}}\), and globally Lipschitz functions have uniformly bounded gradients (Lemma 2), so \(\nabla {\mathcal {J}}\) is uniformly bounded. This is enough to obtain (45). Similarly, at iteration \(k\in {\mathbb {N}}\), \({\mathbb {E}}\left[ \Vert \nabla {\mathcal {J}}_{{\mathsf {B}}_{k}} (\theta _k)\Vert \right] \) is also uniformly bounded. Overall, these arguments allow us to follow the lines of the proof of Theorem 1, and the same conclusions follow. \(\square \)

Appendix C: Details on the Synthetic Experiments

We detail the non-convex regression problem that we presented in Figs. 2 and 3. Given a matrix \(A\in {\mathbb {R}}^{N \times P}\) and a vector \(b\in {\mathbb {R}}^N\), denote by \(A_n\) the n-th row of A. The problem consists in minimizing a loss function of the form,

$$\begin{aligned} \theta \in {\mathbb {R}}^P\mapsto {\mathcal {J}}(\theta ) = \frac{1}{N}\sum _{n=1}^{N} \phi (A_n^T\theta -b_n), \end{aligned}$$
(52)

where the non-convexity comes from the function \(t\in {\mathbb {R}}\mapsto \phi (t) = t^2/(1+t^2)\). For more details on the initialization of A and b we refer to [10], where this problem was initially proposed. In the experiments of Fig. 3, the mini-batch approximation was made by selecting a subset of the rows of A, which amounts to computing only a few terms of the full sum in (52). We used \(N=500\), \(P=30\) and mini-batches of size 50.
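
A minimal sketch of the loss (52) and of its mini-batch gradient, with the sizes quoted above. The initialization of A and b follows [10] and is not reproduced in this appendix, so the standard normal draws below are a stand-in assumption, and the function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, batch_size = 500, 30, 50
A = rng.normal(size=(N, P))    # assumption: stand-in for the initialization of [10]
b = rng.normal(size=N)         # assumption: same remark

def phi(t):
    return t**2 / (1.0 + t**2)

def phi_prime(t):
    return 2.0 * t / (1.0 + t**2) ** 2

def loss(theta, rows=None):
    """J(theta) over all rows, or the mini-batch loss over the given row indices."""
    rows = np.arange(N) if rows is None else rows
    return np.mean(phi(A[rows] @ theta - b[rows]))

def grad(theta, rows=None):
    """Gradient of the loss above, restricted to the same rows."""
    rows = np.arange(N) if rows is None else rows
    return A[rows].T @ phi_prime(A[rows] @ theta - b[rows]) / len(rows)

theta = rng.normal(size=P)
batch = rng.choice(N, size=batch_size, replace=False)   # subset of the rows of A
g_full, g_batch = grad(theta), grad(theta, batch)

# Central finite-difference check of the full gradient along a random direction.
d = rng.normal(size=P)
eps = 1e-6
fd = (loss(theta + eps * d) - loss(theta - eps * d)) / (2 * eps)
fd_error = abs(fd - g_full @ d)
print(loss(theta), fd_error)
```

Since \(0\le \phi <1\), the loss always lies in \([0,1)\), whatever the data.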

In the deterministic setting we ran each algorithm for 250 iterations and selected the hyper-parameters of each algorithm such that they achieved \(\vert {\mathcal {J}}(\theta )-{\mathcal {J}}^\star \vert <10^{-1}\) as fast as possible. In the mini-batch experiments we ran each algorithm for 250 epochs and selected the hyper-parameters that yielded the smallest value of \({\mathcal {J}}(\theta )\) after 50 epochs.

Appendix D: Description of Auxiliary Algorithms

We specify the heuristic algorithms used in Fig. 3 and discussed in Sect. 3.3. Note that the step-size in Algorithm 5 is equivalent to Expected-GV but is written differently to avoid storing an additional gradient estimate.

[Figures c, d and e: pseudocode listings of the auxiliary algorithms]

About this article

Cite this article

Castera, C., Bolte, J., Févotte, C. et al. Second-Order Step-Size Tuning of SGD for Non-Convex Optimization. Neural Process Lett 54, 1727–1752 (2022). https://doi.org/10.1007/s11063-021-10705-5

Keywords

  • Non-convex optimization
  • Deep learning
  • Stochastic optimization
  • Adaptive methods
  • Mini-batch algorithms