Deterministic Neural Networks Optimization from a Continuous and Energy Point of View


Abstract

Getting an efficient neural network can be a very difficult task for engineers and researchers because of the huge number of hyperparameters to tune and their interconnections. To make the tuning step easier and more understandable, this work focuses on probably one of the most important levers for improving neural network efficiency: the optimizer. In recent years, a great number of algorithms have been developed, but they require careful tuning to be efficient. To avoid this long and experimental step, we look for generic and desirable properties for non-convex optimization. For this purpose, the optimizers are reinterpreted or analyzed as discretizations of continuous dynamical systems. This continuous framework offers many mathematical tools, such as Lyapunov stability, to interpret the sensitivity of the optimizer with respect to the initial guess. By enforcing the discrete decrease of Lyapunov functionals, new robust and efficient optimizers are designed. They also considerably simplify the tuning of hyperparameters (learning rate, momentum, etc.). These Lyapunov-based algorithms outperform several state-of-the-art optimizers on different benchmarks from the literature. Drawing its inspiration from the numerical analysis of PDEs, this paper emphasizes the essential role of some hidden energy/entropy quantities for machine learning tasks.


Data availability

The implementation of the optimizers in C/C++ and the Python (matplotlib) script used to generate the graphics can be found in the git repository: https://github.com/bbensaid30/NN_Levenberg_Marquardt/tree/shaman.

Notes

  1. (https://gitlab.com/numerical_shaman/shaman).

  2. The threshold 0.01 is only there to discard iterations where the increase is negligible; it has no real consequence on the results.

  3. Tensorflow documentation: link

References

  1. Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov's accelerated forward-backward method is actually \(o(k^{-2})\). SIAM J. Optim. (2015). https://doi.org/10.1137/15M1046095

  2. Barakat, A., Bianchi, P.: Convergence and dynamical behavior of the adam algorithm for nonconvex stochastic optimization. SIAM J. Optim. 31(1), 244–274 (2021). https://doi.org/10.1137/19M1263443

  3. Behery, G., El-Harby, A., El-Bakry, M.: Reorganizing neural network system for two spirals and linear low-density polyethylene copolymer problems. Appl. Comput. Intell. Soft Comput. (2009). https://doi.org/10.1155/2009/721370

  4. Berner, J., Grohs, P., Kutyniok, G., Petersen, P.: The Modern Mathematics of Deep Learning (2021). https://doi.org/10.48550/ARXIV.2105.04026

  5. Bof, N., Carli, R., Schenato, L.: Lyapunov theory for discrete time systems (2018). https://doi.org/10.48550/ARXIV.1809.05289

  6. Bourriaud, A., Loubère, R., Turpault, R.: A priori neural networks versus a posteriori mood loop: a high accurate 1d fv scheme testing bed. J. Sci. Comput. (2020). https://doi.org/10.1007/s10915-020-01282-1

  7. Chalup, S.K., Wiklendt, L.: Variations of the two-spiral task. Connect. Sci. 19(2), 183–199 (2007). https://doi.org/10.1080/09540090701398017

  8. Chartier, P., Hairer, E., Vilmart, G.: Numerical integrators based on modified differential equations. Math. Comput. 76, 1941–1953 (2007). https://doi.org/10.1090/S0025-5718-07-01967-9

  9. Chetaev, N.G.: The Stability of Motion, 2nd rev. edn. Translated from the Russian by M. Nadler; translation editors A.W. Babister and J. Burlak. Pergamon Press, New York (1961)

  10. Das, H., Reza, N.: On gröbner basis and their uses in solving system of polynomial equations and graph coloring. J. Math. Stat. 14, 175–182 (2018). https://doi.org/10.3844/jmssp.2018.175.182

  11. Demeure, N.: Compromise between precision and performance in high performance computing. Theses, École Normale supérieure Paris-Saclay (2021). https://tel.archives-ouvertes.fr/tel-03116750

  12. Després, B., Jourdren, H.: Machine learning design of volume of fluid schemes for compressible flows. J. Comput. Phys. 408, 109275 (2020). https://doi.org/10.1016/j.jcp.2020.109275

  13. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS (2010)

  14. Griffiths, D., Sanz-Serna, J.: On the scope of the method of modified equations. SIAM J. Sci. Stat. Comput. 7, 994–1008 (1986). https://doi.org/10.1137/0907067

  15. Hagan, M., Menhaj, M.: Training feedforward networks with the marquardt algorithm. IEEE Trans. Neural Netw. 5(6), 989–993 (1994). https://doi.org/10.1109/72.329697

  16. Kalkbrener, M.: Solving systems of algebraic equations by using gröbner bases. In: EUROCAL (1987)

  17. Kelley Pace, R., Barry, R.: Sparse spatial autoregressions. Stat. Probab. Lett. 33(3), 291–297 (1997). https://doi.org/10.1016/S0167-7152(96)00140-X

  18. Khalil, H.K.: Nonlinear Systems, 3rd edn. Prentice-Hall, Upper Saddle River, NJ (2002). https://cds.cern.ch/record/1173048

  19. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. International Conference on Learning Representations (2014). https://doi.org/10.48550/ARXIV.1412.6980

  20. Kluth, G., Humbird, K.D., Spears, B.K., Peterson, J.L., Scott, H.A., Patel, M.V., Koning, J., Marinak, M., Divol, L., Young, C.V.: Deep learning for nlte spectral opacities. Phys. Plasmas 27(5), 052707 (2020). https://doi.org/10.1063/5.0006784

  21. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. Rep. 0, University of Toronto, Toronto, Ontario (2009)

  22. Lamy, C., Dubroca, B., Nicolaï, P., Tikhonchuk, V., Feugeas, J.L.: Modeling of electron nonlocal transport in plasmas using artificial neural networks. Phys. Rev. E 105, 055201 (2022). https://doi.org/10.1103/PhysRevE.105.055201

  23. LaSalle, J.: Stability theory for ordinary differential equations. J. Differen. Equ. 4(1), 57–65 (1968). https://doi.org/10.1016/0022-0396(68)90048-X

  24. Lazard, D.: Solving zero-dimensional algebraic systems. J. Symb. Comput. 13, 117–132 (1992)

  25. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998). https://doi.org/10.1109/5.726791

  26. Li, C., Miao, Y., Maio, Q.: A method to judge the stability of dynamical system. IFAC Proc. Vol. 28(16), 101–105 (1995). https://doi.org/10.1016/S1474-6670(17)45161-5

  27. Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., Han, J.: On the variance of the adaptive learning rate and beyond. In: ICLR (2020). https://doi.org/10.48550/ARXIV.1908.03265

  28. Lydia, A., Francis, S.: Adagrad - an optimizer for stochastic gradient descent. Int. J. Inf. Comput. Sci. 6, 566–568 (2019)

  29. Mo, W., Luo, X., Zhong, Y., Jiang, W.: Image recognition using convolutional neural network combined with ensemble learning algorithm. J. Phys: Conf. Ser. 1237, 022026 (2019). https://doi.org/10.1088/1742-6596/1237/2/022026

  30. Nesterov, Y.: A method for solving the convex programming problem with convergence rate \(o(1/k^2)\). Proc. USSR Acad. Sci. 269, 543–547 (1983)

  31. Novello, P., Poëtte, G., Lugato, D., Congedo, P.M.: Explainable Hyperparameters Optimization using Hilbert-Schmidt Independence Criterion (2021). https://hal.archives-ouvertes.fr/hal-03128298. Accepted in Journal of Sci. Comput

  32. Novello, P., Poëtte, G., Lugato, D., Peluchon, S., Congedo, P.M.: Accelerating hypersonic reentry simulations using deep learning-based hybridization (with guarantees) (2022). https://arxiv.org/abs/2209.13434.

  33. Polyak, B.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964). https://doi.org/10.1016/0041-5553(64)90137-5

  34. Raissi, M., Perdikaris, P., Karniadakis, G.: Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019). https://doi.org/10.1016/j.jcp.2018.10.045

  35. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of adam and beyond. In: International Conference on Learning Representations (2018). https://doi.org/10.48550/ARXIV.1904.09237

  36. Richardson, F., Reynolds, D., Dehak, N.: Deep neural network approaches to speaker and language recognition. IEEE Signal Process. Lett. 22(10), 1671–1675 (2015). https://doi.org/10.1109/LSP.2015.2420092

  37. Ripoll, J.F., Kluth, G., Has, S., Fischer, A., Mougeot, M., Camporeale, E.: Exploring pitch-angle diffusion during high speed streams with neural networks. In: 2022 3rd URSI Atlantic and Asia Pacific Radio Science Meeting (AT-AP-RASC), pp. 1–4 (2022). https://doi.org/10.23919/AT-AP-RASC54737.2022.9814235

  38. Rouche, N., Habets, P., Laloy, M.: Stability Theory by Liapunov’s Direct Method. Applied Mathematical Sciences. 3Island Press (1977). https://books.google.fr/books?id=yErqoQEACAAJ

  39. Shi, B., Du, S., Jordan, M., Su, W.: Understanding the acceleration phenomenon via high-resolution differential equations. Math. Program. (2021). https://doi.org/10.1007/s10107-021-01681-8

  40. Smith, L.: Cyclical learning rates for training neural networks. In: IEEE winter conference on applications of computer vision, pp. 464–472 (2017)

  41. Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17(153), 1–43 (2016)

  42. Tian, D., Liu, Y., Wang, J.: Fuzzy neural network structure identification based on soft competitive learning. Int. J. Hybrid Intell. Syst. 4, 231–242 (2007). https://doi.org/10.3233/HIS-2007-4403

  43. Tieleman, T., Hinton, G., et al.: Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. 4(2), 26–31 (2012)

  44. Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016). https://doi.org/10.1073/pnas.1614734113

  45. Wilkinson, J.H.: Modern error analysis. SIAM Rev. 13(4), 548–568 (1971). https://doi.org/10.1137/1013095

  46. Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of accelerated methods in optimization. J. Mach. Learn. Res. 22, 5040–5073 (2021)

  47. Xu, Z., Dai, A.M., Kemp, J., Metz, L.: Learning an adaptive learning rate schedule. NeurIPS, arXiv (2019). https://doi.org/10.48550/ARXIV.1909.09712

  48. You, K., Long, M., Wang, J., Jordan, M.: How does learning rate decay help modern neural networks (2019)

  49. Zeiler, M.: ADADELTA: an adaptive learning rate method (2012). https://doi.org/10.48550/ARXIV.1212.5701

Funding

This work was supported by CEA/CESTA and LRC Anabase.

Author information

Corresponding author

Correspondence to Bilel Bensaid.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Analytical benchmarks and their subtleties

All the analytical examples of this section are based on the cost function (7), recalled here:

$$\begin{aligned} {\mathcal {R}}(w,b)=\frac{1}{4}\left[ g(b)^2+g(w+b)^2\right] . \end{aligned}$$

Their general structures are the same, that is to say \({\mathcal {N}}=(1)\) and \({\mathcal {G}}=(g)\), but they are built by choosing different activation functions; each new activation function in this section leads to a more difficult learning problem. Here, when Q is claimed to be a good approximation of H, it means that \(Q(\theta ^*)=H(\theta ^*)\); if not, it means that \( \displaystyle {\lim _{\theta \rightarrow \theta ^*}}\Vert \frac{Q(\theta )}{S(\theta )}\Vert = \infty \). In this appendix, several benchmarks are presented with their corresponding properties, which are summed up in Tables 5, 6, 7, 8 and 9. We recall that Q does not involve any second order derivatives, see (1), contrary to S, see (2). The following points sum up the subtleties of each new analytical benchmark we introduce:

  • Benchmark #1 admits four global minima, and the quasi-Hessian is sufficient to describe the curvature of the surface around them. It is an ideal case (the four minima are computed explicitly after (26)).

  • Benchmark #2 presents one global minimum and three local minima of different values. In the neighborhood of the global minimum, the quasi-Hessian is a good approximation of the Hessian (this is called the LM hypothesis). However, the curvature at the other minima involves the second order derivatives of u (the network output). This example helps to identify the influence of the curvature of u on the attractive power of the minima. But given that the best minimum is the only one that satisfies the LM hypothesis, the attractive power can also be impacted by the fact that this minimum is global: if the region of attractivity of a minimum is larger, it is hard to know whether this is due to the second order information or to its training performance.

  • With this in mind, benchmark #4 presents global/local minima that are equivalent with respect to the LM hypothesis. For the same reasons, benchmark #5 satisfies this assumption around the local minimum but not in the vicinity of the global one. Therefore, a comparison with benchmark #2 makes it possible to get an idea of the role of H(u) in the proportions found by a given algorithm. Some minima of these examples have a non-invertible/zero Hessian, which allows a possible study of the influence of singular curvature.

  • To be complete, concerning benchmark #3, the four global minima involve the curvature of u. Besides, this example seems hard to train with classical algorithms because of its stiffness, see Table 2.

All the above characteristics of the benchmarks are summed up in Tables 5, 6, 7, 8 and 9. The term “order of dominating terms” refers to the asymptotic expansion of \({\mathcal {R}}(\theta )-{\mathcal {R}}(\theta ^*)\) around a minimum \(\theta ^*\). The values at the minima \(\theta ^*\), i.e. \({\mathcal {R}}(\theta ^*)\), are given in the same order as the set of minima.

Benchmark #1 The activation function is simply set to:

$$\begin{aligned} g(z)=z^2-1. \end{aligned}$$
(26)
Table 5 Minimum properties for benchmark #1
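
As a quick check, the gradient of the cost (7) reads, for a generic activation g,

$$\begin{aligned} \nabla {\mathcal {R}}(w,b) = \frac{1}{2}\begin{pmatrix} g(w+b)\,g'(w+b) \\ g(b)\,g'(b)+g(w+b)\,g'(w+b) \end{pmatrix}, \end{aligned}$$

and for \(g(z)=z^2-1\) the zero-cost points are exactly those with \(b=\pm 1\) and \(w+b=\pm 1\), i.e. the four global minima \((w,b)\in \{(0,1),(-2,1),(2,-1),(0,-1)\}\).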

Benchmark #2 In this case:

$$\begin{aligned} g(z)=2z^3-3z^2+5. \end{aligned}$$
(27)
Table 6 Minimum properties for benchmark #2

Benchmark #3 In this case:

$$\begin{aligned} g(z)=z^4-2z^2+3. \end{aligned}$$
(28)
Table 7 Minimum properties for benchmark #3

Benchmark #4 In this case:

$$\begin{aligned} g(z)=z^5-4z^4+2z^3+8z^2-11z-12. \end{aligned}$$
(29)


Table 8 Minimum properties for benchmark #4

Benchmark #5 In this case:

$$\begin{aligned} g(z) = 35z^8-360z^7+1540z^6-3528z^5+4620z^4-3360z^3+1120z^2+1. \end{aligned}$$
(30)
Table 9 Minimum properties for benchmark #5

B Proof of the stability/attractivity theorem 1

In order to prove the stability/attractivity theorem 1, we rely on classical theorems such as the Lyapunov/LaSalle/Salvadori theorems to study the behavior in the vicinity of minima, and on the Chetaev/Salvadori theorems for the study of maxima and saddle points. These are usually formalised for a general ODE of the form:

$$\begin{aligned} Y'(t) = F(t,Y(t)), \end{aligned}$$

with unknown \(Y(t)\in {\mathbb {R}}^n\) and \(F:{\mathbb {R}}\times {\mathbb {R}}^n \rightarrow {\mathbb {R}}^n\). The theorems we use next have a common point: they require finding functionals V that satisfy certain assumptions. Each such functional provides an additional redundant scalar equation of the form

$$\begin{aligned} {\dot{V}}(Y(t))= \nabla V (Y(t)) \cdot F(t,Y(t)) + \partial _t V(Y(t)). \end{aligned}$$

Depending on the properties satisfied by V (positive definiteness, negative definiteness, etc.), we conclude on the stability/attractivity in the vicinity of the minima/maxima/saddle points. Let us remark that the functions V(t) defined in Sect. 4 are not strictly speaking Lyapunov functions because \(V(\theta ^*)\ne 0\). Indeed, it is necessary and sufficient to subtract \({\mathcal {R}}(\theta ^*)\) from them.

In the following, we denote by \({\mathcal {Z}}=\{Y\in {\mathbb {R}}^N, {\dot{V}}(Y)=0\}\) the set of points of \({\mathbb {R}}^N\) at which \({\dot{V}}=V'\) vanishes, and by \({\mathcal {M}}\) the largest invariant set included in \({\mathcal {Z}}\).

Proof

(Theorem 1 for GD) Let us choose \(V(\theta ) = {\mathcal {R}}(\theta ) - {\mathcal {R}}(\theta ^*)\) where \(\theta ^*\) is a stationary point. For GD, we have

$$\begin{aligned} {\dot{V}}(\theta ) = -\Vert \nabla {\mathcal {R}}(\theta )\Vert ^2. \end{aligned}$$
(31)
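
For completeness, (31) follows from the chain rule along the gradient flow \(\theta '(t)=-\nabla {\mathcal {R}}(\theta (t))\) associated with GD:

$$\begin{aligned} {\dot{V}}(\theta (t)) = \nabla {\mathcal {R}}(\theta (t))\cdot \theta '(t) = -\Vert \nabla {\mathcal {R}}(\theta (t))\Vert ^2. \end{aligned}$$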

For GD, \({\dot{V}}\) as computed above is negative definite around isolated critical points. We have immediately:

  • if \(\theta ^*\) is an isolated local minimum, then V is positive definite.

  • If \(\theta ^*\) is an isolated local maximum, then V is negative definite.

  • If \(\theta ^*\) is an isolated saddle point, V is indefinite.

The stability/instability result for GD in theorem 1 is thus established by the Lyapunov theorem and an extension of the Chetaev theorem. For the attractivity of GD, i.e. for the convergence towards \({\mathcal {E}}\), it is sufficient to note that \({\mathcal {Z}}={\mathcal {E}}\) and to use LaSalle's theorem. \(\square \)

Proof

(Theorem 1 for Momentum) For Momentum, the function F is given, for all \((v,\theta ) \in {\mathbb {R}}^N\times {\mathbb {R}}^N\), by:

$$\begin{aligned} F(v,\theta ) = \begin{pmatrix} -\bar{\beta _1}v-\bar{\beta _1} \nabla {\mathcal {R}}(\theta ) \\ v \end{pmatrix}. \end{aligned}$$
(32)

The stationary points, i.e. \((v,\theta )\in {\mathbb {R}}^N\times {\mathbb {R}}^N\) such that \(F(v,\theta )=0\), are of the form \((0,\theta ^*)\) where \(\theta ^*\) is a critical point of \({\mathcal {R}}\). Set

$$\begin{aligned} V(v,\theta ) = \frac{\Vert v\Vert ^2}{2} + \bar{\beta _1}({\mathcal {R}}(\theta )-{\mathcal {R}}(\theta ^*)), \end{aligned}$$

then we have

$$\begin{aligned} {\dot{V}}(v,\theta ) = -\bar{\beta _1}\Vert v\Vert ^2. \end{aligned}$$
(33)

The difficulty for Momentum (in contrast to the proof for GD) is that \({\dot{V}}\) is only negative semi-definite. So it is impossible to apply, for example, the extension of the Chetaev theorem. The trick is to see Momentum as a dissipative Lagrangian system. Let us define:

  • a kinetic energy \(T(v)=\frac{\Vert v\Vert ^2}{2}\),

  • a potential energy \(\Pi (\theta ) = \bar{\beta _1}{\mathcal {R}}(\theta )\),

  • and a dissipative term \(Q(v) = -\bar{\beta _1}v\).

The Lagrangian is \(L=T-\Pi \) and the equation of motion is:

$$\begin{aligned} \frac{d}{dt}\frac{\partial L}{\partial v}-\frac{\partial L}{\partial \theta } = Q. \end{aligned}$$

With the above definitions of \(T, \Pi ,Q\), the above equation corresponds to the Momentum ODE.
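
Indeed, a direct computation with these definitions gives

$$\begin{aligned} \frac{d}{dt}\frac{\partial L}{\partial v}-\frac{\partial L}{\partial \theta } = v'(t) + \bar{\beta _1}\nabla {\mathcal {R}}(\theta (t)) = -\bar{\beta _1}v(t) = Q(v(t)), \end{aligned}$$

which, together with \(\theta '(t)=v(t)\), is exactly the Momentum vector field (32).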

It is sufficient to check the assumptions of the Salvadori theorem (theorem 5.2 p114 [38]):

  1. If \(\theta ^*\) is a minimum for \({\mathcal {R}}\), then \(\theta ^*\) is a minimum for \(\Pi \). The same holds for maxima and saddle points.

  2. \(\theta ^*\) is an isolated point.

  3. The dissipation is complete, that is to say:

    $$\begin{aligned} Q(v) \cdot v = -\bar{\beta _1} \Vert v\Vert ^2 \le -a(\Vert v\Vert ). \end{aligned}$$

    where \(\left( a: x\in {\mathbb {R}}^+ \mapsto \bar{\beta _1}x^2\right) \in {\mathcal {K}}\). Let us recall that \({\mathcal {K}}\) is, classically in stability theory, the set of continuous, strictly increasing functions a from \({\mathbb {R}}^+\) to \({\mathbb {R}}^+\) such that \(a(0)=0\).

The stability/instability result of theorem 1 for Momentum is then established.

It remains to prove the convergence towards \({\mathcal {E}}\) thanks to the invariance principle [38]. First, note that \({\mathcal {Z}}=\{0\} \times {\mathbb {R}}^N\). Take a solution \((v(t),\theta (t))\) starting in the largest invariant set \({\mathcal {M}}\) included in \({\mathcal {Z}}\). As \({\mathcal {M}}\) is invariant, for all \(t\ge 0\), \(v(t)=0\). From the Momentum vector field (32), we have that \(\theta '(t)=v(t)\). As \(v(t)=0=\theta '(t)\), we deduce that the trajectory \(\theta (t)\) is constant. The second equation \(v'(t) = -\bar{\beta _1}v(t)-\bar{\beta _1} \nabla {\mathcal {R}}(\theta (t))\) then gives that \(\forall t\ge 0\), \(\theta (t)=\theta ^* \in {\mathcal {E}}\). As a consequence, \({\mathcal {M}}=\{0\} \times {\mathcal {E}}\) and we can conclude thanks to LaSalle's invariance principle. \(\square \)

Proof

(Theorem 1 for AWB) For AWB, the function F is given, for all \((v,s,\theta ) \in {\mathbb {R}}^N \times {\mathbb {R}}_+^N \times {\mathbb {R}}^N\), by:

$$\begin{aligned} F(v,s,\theta ) = \begin{pmatrix} -\bar{\beta _1}v+\bar{\beta _1} \nabla {\mathcal {R}}(\theta ) \\ -\bar{\beta _2}s+\bar{\beta _2} \nabla {\mathcal {R}}(\theta )^2 \\ -\frac{v}{\sqrt{s+\epsilon _a}} \end{pmatrix}. \end{aligned}$$

In order to apply LaSalle's invariance principle, the definition domain of F has to be an open connected set: first, note that for \(t>0\), \(\theta (0)\notin {\mathcal {E}}\) implies \(s(t)>0\), see (14). This means we can restrict the study to \({\mathbb {R}}^N \times ({\mathbb {R}}_+^N\setminus \{0\}) \times {\mathbb {R}}^N\), since we establish an asymptotic property. The stationary points are of the form \((0,0,\theta ^*)\) where \(\theta ^*\) is a critical point of \({\mathcal {R}}\). Let us set:

$$\begin{aligned} V(v,s,\theta ) = \frac{1}{2}\displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{\sqrt{s_i+\epsilon _a}}} +\bar{\beta _1}({\mathcal {R}}(\theta )-{\mathcal {R}}(\theta ^*)). \end{aligned}$$

Its derivative is:

$$\begin{aligned} {\dot{V}}(v,s,\theta ) = -\frac{\bar{\beta _2}}{4}\displaystyle {\sum _{i=1}^{N}\frac{v_i^2(\frac{\partial {\mathcal {R}}}{\partial \theta _i})^2(\theta )}{(s_i+\epsilon _a)\sqrt{s_i+\epsilon _a}}} +\displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{\sqrt{s_i+\epsilon _a}}}\left[ -\bar{\beta _1} +\frac{\bar{\beta _2}}{4} \frac{s_i}{s_i+\epsilon _a}\right] . \end{aligned}$$
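
For clarity, this follows by differentiating V along the flow and then substituting the vector field F given above:

$$\begin{aligned} {\dot{V}}(v,s,\theta ) = \sum _{i=1}^{N}\left[ \frac{v_i v_i'}{\sqrt{s_i+\epsilon _a}} - \frac{v_i^2\, s_i'}{4(s_i+\epsilon _a)^{3/2}}\right] + \bar{\beta _1}\,\nabla {\mathcal {R}}(\theta )\cdot \theta ', \end{aligned}$$

where the cross terms \(\bar{\beta _1} v_i \frac{\partial {\mathcal {R}}}{\partial \theta _i}(\theta )/\sqrt{s_i+\epsilon _a}\) produced by \(v'\) and \(\theta '\) cancel each other.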

Noting that \(\frac{s_i}{s_i+\epsilon _a}<1\) as \(s_i \ge 0\) we have:

$$\begin{aligned} {\dot{V}}(v,s,\theta ) \le \displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{\sqrt{s_i+\epsilon _a}}}\left[ -\bar{\beta _1} +\frac{\bar{\beta _2}}{4} \right] . \end{aligned}$$

As \(\bar{\beta _2} < 4 \bar{\beta _1}\), \({\dot{V}}\) is negative semi-definite.

We now apply the LaSalle invariance principle. From the above inequality, we have \({\mathcal {Z}}=\{(0,s,\theta ),\ (s,\theta ) \in {\mathbb {R}}_+^{N}\times {\mathbb {R}}^{N}\}\). Take a solution \((v(t),s(t),\theta (t))\) starting in the largest invariant set \({\mathcal {M}}\) included in \({\mathcal {Z}}\). As \({\mathcal {M}}\) is invariant, for all \(t\ge 0\), \(v(t)=0\). From (11), we can deduce several properties:

  • from \(\theta '(t) = -\frac{v(t)}{\sqrt{s(t)+\epsilon _a}}\), we obtain that for all \(t \ge 0\):

    $$\begin{aligned} \theta '(t)=0. \end{aligned}$$
  • From the equation \(v'(t) = -\bar{\beta _1}v(t)+\bar{\beta _1} \nabla {\mathcal {R}}(\theta (t))\), we deduce that \(\forall t\ge 0\):

    $$\begin{aligned} \nabla {\mathcal {R}}(\theta (t))=0. \end{aligned}$$
  • From the equation \(s'(t) = -\bar{\beta _2}s(t)+\bar{\beta _2} \nabla {\mathcal {R}}(\theta (t))^2\), we deduce that:

    $$\begin{aligned} s(t)=s(0)e^{-\bar{\beta _2}t}. \end{aligned}$$

Hence for all \(t \ge 0\):

$$\begin{aligned} \theta (t) = \theta ^* \in {\mathcal {E}}. \end{aligned}$$

Therefore the following equality holds:

$$\begin{aligned} \{0\}\times {\mathbb {R}}_+^N \times {\mathcal {E}} = {\mathcal {M}}. \end{aligned}$$

According to LaSalle's invariance principle [23], every bounded trajectory \((v(t),s(t),\theta (t))\) converges to the set \({\mathcal {M}}\). The characterization of \({\mathcal {M}}\) proved above then shows that:

$$\begin{aligned} \displaystyle {\lim _{t \rightarrow +\infty }} d(\theta (t),{\mathcal {E}})=0, \end{aligned}$$

that is, the attractivity of AWB is proved. \(\square \)

Proof

(Theorem 1 for Adam) For Adam, the function F is given, for all \((t,v,s,\theta ) \in {\mathbb {R}}^+ \times {\mathbb {R}}^N \times {\mathbb {R}}_+^N \times {\mathbb {R}}^N\), by:

$$\begin{aligned} F(t,v,s,\theta ) = \begin{pmatrix} -\bar{\beta _1}v+\bar{\beta _1} \nabla {\mathcal {R}}(\theta ) \\ -\bar{\beta _2}s+\bar{\beta _2} \nabla {\mathcal {R}}(\theta )^2 \\ -f(t)\frac{v}{\sqrt{s+\epsilon _a}} \end{pmatrix}. \end{aligned}$$

Let us recall that f(t) is defined in 4.

For the same reason as in the previous proof for AWB, we can restrict the study to \(({\mathbb {R}}^+\setminus \{0\}) \times {\mathbb {R}}^N \times ({\mathbb {R}}_+^N\setminus \{0\}) \times {\mathbb {R}}^N\). The stationary points are of the form \((0,0,\theta ^*)\) where \(\theta ^*\) is a critical point of \({\mathcal {R}}\).

Let us set:

$$\begin{aligned} V(t,v,s,\theta ) = \frac{f(t)}{2}\displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{\sqrt{s_i+\epsilon _a}}} +\bar{\beta _1}{\mathcal {R}}(\theta ). \end{aligned}$$

Its derivative is:

$$\begin{aligned} {\dot{V}}(t,v,s,\theta )&= -\frac{\bar{\beta _2}f(t)}{4}\displaystyle {\sum _{i=1}^{N}\frac{v_i^2(\frac{\partial {\mathcal {R}}}{\partial \theta _i})^2(\theta )}{(s_i+\epsilon _a)\sqrt{s_i+\epsilon _a}}}-\bar{\beta _1}f(t)\displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{\sqrt{s_i+\epsilon _a}}}\\&\quad + \displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{4\sqrt{s_i+\epsilon _a}}}\left[ 2f'(t)+(\bar{\beta _2}-4\bar{\beta _1})f(t)\frac{s_i}{s_i+\epsilon _a} \right] . \end{aligned}$$

Now, in order to prove the attractivity of Adam, it remains to check the hypotheses of Section 5.9, p. 206 (Chapter 8) of [38], together with Corollary 4.15. These assumptions are recalled below: there exist a continuous function \(\psi \) and, for every compact set \(K \subset {\mathbb {R}}^N \times {\mathbb {R}}_+^N \times {\mathbb {R}}^N\), constants \(A \in {\mathbb {R}}\) and \(B>0\) such that:

  (i) F is bounded on K:

    $$\begin{aligned} \forall t\in {\mathbb {R}}_+, \forall (v,s,\theta )\in K, \Vert F(t,v,s,\theta )\Vert \le A. \end{aligned}$$
  (ii) V is bounded from below:

    $$\begin{aligned} \forall t\in {\mathbb {R}}_+, \forall (v,s,\theta )\in K, V(t,v,s,\theta )\ge -B. \end{aligned}$$
  (iii) \({\dot{V}}\) is bounded from above, independently of the time:

    $$\begin{aligned} \forall t\in {\mathbb {R}}_+, \forall (v,s,\theta )\in K, {\dot{V}}(t,v,s,\theta ) \le -\psi (v,s,\theta ) \le 0. \end{aligned}$$
  (iv) The limit set is compact.

  (v) The ODE is asymptotically autonomous. This means that the translation \(F_a(t,v,s,\theta ) {:}{=}F(t+a,v,s,\theta )\) converges as a goes to \(\infty \); we denote the limit function by \(F^*\).

Setting \({\mathcal {Z}}{:}{=}\{(v,s,\theta ) \in {\mathbb {R}}^N \times {\mathbb {R}}_+^N \times {\mathbb {R}}^N, \psi (v,s,\theta )=0\}\), the theorem claims that the solution converges towards the largest invariant set included in \({\mathcal {Z}}\) for the limit equation:

$$\begin{aligned} \frac{d}{dt}(v(t),s(t),\theta (t))^T=F^*(v(t),s(t),\theta (t)). \end{aligned}$$

In order to prove point (iii), it is necessary to obtain an upper bound on \({\dot{V}}\). This is done by showing that \(g(t) = 2f'(t) + (\bar{\beta _2}-4\bar{\beta _1})f(t)\) is negative on \({\mathbb {R}}_+\) when \(\bar{\beta _2} < \bar{\beta _1}\), which then yields the desired bound on \({\dot{V}}\). We can rewrite g as:

$$\begin{aligned} g(t) = \frac{P(t)}{2\sqrt{1-e^{-\bar{\beta _2}t}}(1-e^{-\bar{\beta _1}t})^2}, \end{aligned}$$

where:

$$\begin{aligned} P(t) = 2(\bar{\beta _2}-4\bar{\beta _1})+8\bar{\beta _1}e^{-\bar{\beta _2}t}+(4\bar{\beta _1}-2\bar{\beta _2})e^{-\bar{\beta _1}t} -4\bar{\beta _1}e^{-(\bar{\beta _1}+\bar{\beta _2})t}. \end{aligned}$$
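
For the reader's convenience, the derivative of P is

$$\begin{aligned} P'(t) = -8\bar{\beta _1}\bar{\beta _2}e^{-\bar{\beta _2}t}-\bar{\beta _1}(4\bar{\beta _1}-2\bar{\beta _2})e^{-\bar{\beta _1}t}+4\bar{\beta _1}(\bar{\beta _1}+\bar{\beta _2})e^{-(\bar{\beta _1}+\bar{\beta _2})t}. \end{aligned}$$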

As for all \(t\ge 0\), \(e^{-(\bar{\beta _1}+\bar{\beta _2})t} \le e^{-\bar{\beta _1}t}\):

$$\begin{aligned} P'(t) \le 2\bar{\beta _1}\bar{\beta _2}\left( 3e^{-\bar{\beta _1}t} -4 e^{-\bar{\beta _2}t}\right) \le 0. \end{aligned}$$

Note that \(P(0) = 2(\bar{\beta _2}-4\bar{\beta _1})+8\bar{\beta _1}+(4\bar{\beta _1}-2\bar{\beta _2})-4\bar{\beta _1} = 0\), so P is nonpositive on \({\mathbb {R}}_+\) and therefore g is nonpositive on \({\mathbb {R}}_+\).

Let us analyse the function f to obtain a bound independent of the time. It is straightforward that \(\displaystyle {\lim _{t \rightarrow 0}} f(t) = \frac{\bar{\beta _2}}{\bar{\beta _1}}\) and \(\displaystyle {\lim _{t\rightarrow +\infty }} f(t) = 1\). Given that f is continuous on \({\mathbb {R}}_+^*\), f is bounded on \({\mathbb {R}}_+\). From now on, we consider the continuous extension of f and, with a slight abuse of notation, still call it f. Denoting by \(m>0\) the lower bound of f on \({\mathbb {R}}_+\) (f is strictly positive), the following inequality is verified (assumption (iii)):

$$\begin{aligned} {\dot{V}}(t,v,s,\theta ) \le -\bar{\beta _1}m\displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{\sqrt{s_i+\epsilon _a}}} {=}{:}-\psi (v,s,\theta ) \le 0. \end{aligned}$$
(34)

The positivity of \({\mathcal {R}}\) gives the second hypothesis ii:

$$\begin{aligned} \forall t,v,s,\theta , V(t,v,s,\theta ) \ge 0. \end{aligned}$$
(35)

As f is bounded, we deduce that for every compact set \(K \subset {\mathbb {R}}^N \times {\mathbb {R}}_+^N \times {\mathbb {R}}^N\), there exists \(A \in {\mathbb {R}}\) (a continuous function is bounded on a compact set) such that (assumption (i)):

$$\begin{aligned} \forall t\in {\mathbb {R}}^+, (v,s,\theta )\in K, \Vert F(t,v,s,\theta )\Vert \le A. \end{aligned}$$
(36)

As the assumptions of Corollary 4.15 (the \({\mathcal {C}}^1\) regularity of F and V is straightforward), namely (36), (35) and (34), are verified, it only remains to prove that the ODE is asymptotically autonomous (assumption (v)). Since \(f(t) \rightarrow 1\) as \(t\rightarrow +\infty \), a simple calculation gives:

$$\begin{aligned} F^*(v,s,\theta ) = \begin{pmatrix} -\bar{\beta _1}v+\bar{\beta _1} \nabla {\mathcal {R}}(\theta ) \\ -\bar{\beta _2}s+\bar{\beta _2} \nabla {\mathcal {R}}(\theta )^2 \\ -\frac{v}{\sqrt{s+\epsilon _a}} \end{pmatrix}. \end{aligned}$$

Consider a bounded trajectory \((v(t),s(t),\theta (t))\). It is a classical result (see theorem 4, p. 364 of appendix 2 in [38]) that its limit set is compact (assumption (iv)). The theorem of section 5.9 then claims that the trajectory converges towards the largest invariant set of the limit ODE

$$\begin{aligned} \frac{d}{dt}(v(t),s(t),\theta (t))^T=F^*(v(t),s(t),\theta (t)), \end{aligned}$$

included in the set:

$$\begin{aligned} {\mathcal {Z}}=\{(0,s,\theta ), (s,\theta ) \in {\mathbb {R}}_+^N \times {\mathbb {R}}^{N}\}. \end{aligned}$$

Now note that this limit equation is exactly the ODE associated with AWB, so the study has already been done above. \(\square \)

Remark

This methodology is useful for many NN optimizers not studied in this paper. For instance, we can apply the above methodology to RMSProp [43]. The following choices of V are possible (keeping the same notations as for Adam):

$$\begin{aligned} {V}(t) = \bar{\beta _2} {\mathcal {R}}(\theta (t)), \end{aligned}$$

or

$$\begin{aligned} {V}(t) = \displaystyle {\sum _{i=1}^{N} \sqrt{s_i(t)}} + \bar{\beta _2} {\mathcal {R}}(\theta (t)). \end{aligned}$$

C Tests on analytical examples: supplement

In this appendix, we give more quantitative results than in Sect. 3.3 on benchmark #2 (activation (27)) with the classical optimizers and the new ones developed in this paper. These quantitative results are summed up in Table 10. Note that the default settings for the parameters of the classical optimizers are the same as in Sect. 6.1. First, Table 10 shows that the performances observed are not specific to one example: the algorithms have been tested on all the benchmarks of Appendix A. Contrary to benchmark #3 of Sect. 6.1, this benchmark allows us to see whether the global minimum \((0,-1)\) gives an advantage or not. AWB is very troubling because, contrary to Adam, it strongly favours the worst minimum (0, 1): a direct advantage of the biased steps appears clearly. A very promising insight is that the algorithms that enforce the energy decrease promote the global minimum: for EGD, half of the trajectories converge to \((0,-1)\). A minimal illustration of this energy-decrease principle is sketched below.
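
As a purely illustrative companion to this observation, the following self-contained C++ sketch applies the underlying principle (accept a gradient step only if it makes the energy \({\mathcal {R}}\) strictly decrease, halving the step size otherwise) to benchmark #2. It is not the EGD implementation of the repository; the initial guess, the nominal learning rate and the step-halving rule are arbitrary choices made here for illustration.

    #include <cmath>
    #include <cstdio>

    // Benchmark #2 activation (27) and its derivative: g(z) = 2z^3 - 3z^2 + 5.
    static double g(double z)  { return 2*z*z*z - 3*z*z + 5; }
    static double dg(double z) { return 6*z*z - 6*z; }

    // Cost (7): R(w,b) = (1/4) [ g(b)^2 + g(w+b)^2 ], and its gradient.
    static double R(double w, double b) {
        const double gb = g(b), gwb = g(w + b);
        return 0.25 * (gb * gb + gwb * gwb);
    }
    static void gradR(double w, double b, double &dw, double &db) {
        dw = 0.5 * g(w + b) * dg(w + b);
        db = 0.5 * (g(b) * dg(b) + g(w + b) * dg(w + b));
    }

    int main() {
        double w = 0.5, b = 0.5;   // arbitrary initial guess
        const double eta = 0.1;    // nominal learning rate
        for (int n = 0; n < 100000; ++n) {
            double dw, db;
            gradR(w, b, dw, db);
            if (std::sqrt(dw * dw + db * db) < 1e-10) break;   // (near-)critical point
            // Energy-decrease enforcement: halve the step until R strictly decreases.
            const double R0 = R(w, b);
            double e = eta;
            int halvings = 0;
            while (R(w - e * dw, b - e * db) >= R0 && halvings < 60) { e *= 0.5; ++halvings; }
            if (halvings == 60) break;   // no decreasing step found at machine precision
            w -= e * dw;
            b -= e * db;
        }
        std::printf("final point (w,b) = (%.6f, %.6f), R = %.3e\n", w, b, R(w, b));
        return 0;
    }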

Table 10 Algorithms performances on benchmark #2

D Tests on machine learning tasks: supplement

The default settings are the same as in Sect. 6.2. Table 11 concerns the square function \((x,y) \mapsto (x^2,y^2)\), using 50 training/testing points uniformly sampled on \([-1,1]^2\). The architecture is the following: \({\mathcal {N}}=\left( 2,15,2\right) \) and \({\mathcal {G}}=\left( reLU,linear\right) \), so \(N=2\times 15+15+15\times 2+2=77\). Here the assumptions of propositions 1 and 4 are not verified. However, the differences between the classical methods and the adaptive ones are even more striking.

Table 11 Performances for the square function \((x,y) \mapsto (x^2,y^2)\)

E Proofs of propositions 5 and 6

The proofs consist in checking the three points of Sect. 7 using Taylor expansions.

Proof

(Proposition 5) Since \(\nabla {\mathcal {R}}\) is Lipschitz-continuous, the explicit Euler scheme is 0-stable. The second point to check is simple: the initial condition is exact with \(l_0=\theta (0)-\theta _0=0\). Assume that \(\theta \) is the solution of the modified equation (22). The local error is given by:

$$\begin{aligned} l_{n} = \frac{\theta (t_{n+1})-\theta (t_n)}{\eta }+\nabla {\mathcal {R}}(\theta (t_n)). \end{aligned}$$

Using equation (22) and Taylor formula, there exists \(t_n \le \alpha _n \le t_{n+1}\) such that:

$$\begin{aligned} l_n = \frac{\eta }{2}\left[ \theta ''(t_n)-H(\theta (t_n))\nabla {\mathcal {R}}(\theta (t_n)) \right] +\frac{\eta ^2}{6}\theta ^{(3)}(\alpha _n). \end{aligned}$$
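
To make this explicit: assuming, as used below, that the modified equation (22) takes the form \(\theta '(t) = -\left( I_N+\frac{\eta }{2}H(\theta (t))\right) \nabla {\mathcal {R}}(\theta (t))\), the third order Taylor expansion

$$\begin{aligned} \frac{\theta (t_{n+1})-\theta (t_n)}{\eta } = \theta '(t_n)+\frac{\eta }{2}\theta ''(t_n)+\frac{\eta ^2}{6}\theta ^{(3)}(\alpha _n) \end{aligned}$$

combined with the definition of \(l_n\) yields the expression above.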

Let us denote by \(\nabla ^3 {\mathcal {R}}\) the tensor of third derivatives. By differentiating (22), we get (omitting the argument \(\theta (t_n)\) in the right-hand side):

$$\begin{aligned} \theta '' = \frac{\eta }{2} \left[ \nabla ^3 {\mathcal {R}} : \left( I_N + \frac{\eta }{2}H\right) \nabla {\mathcal {R}}\right] \nabla {\mathcal {R}} + \left( I_N + \frac{\eta }{2}H\right) H\left( I_N + \frac{\eta }{2}H\right) \nabla {\mathcal {R}}. \end{aligned}$$

Rearranging the terms it follows:

$$\begin{aligned} l_n = \left( \frac{\eta }{2}\right) ^2 \biggl [ \left[ \nabla ^3 {\mathcal {R}} : \left( I_N + \frac{\eta }{2}H\right) \nabla {\mathcal {R}}\right] \nabla {\mathcal {R}} + H^2\nabla {\mathcal {R}}+\frac{\eta }{2}H^3\nabla {\mathcal {R}} \biggr ] + \frac{\eta ^2}{6}\theta ^{(3)}(\alpha _n). \end{aligned}$$

From the regularity theorem for parametric equations, it follows that \((\eta ,t) \mapsto \theta _{\eta }(t)\) is \({\mathcal {C}}^{\infty }([0,1] \times [0,T])\). Therefore the term in front of \(\eta ^2\) is uniformly bounded in \(\eta \) in a neighborhood of the origin. This ends the proof. \(\square \)

Proof

(Proposition 6) The 0-stability has already been discussed in Remark 3. The second point is verified since \(l_0=\theta (0)-\theta _0=0\). By performing the change of variable \(\bar{\beta _1} = \frac{\beta _1}{\eta }\), we can write the algorithm as the following finite difference scheme:

$$\begin{aligned} \frac{\theta _{n+1}-2\theta _n+\theta _{n-1}}{\eta ^2}+\bar{\beta _1}\frac{\theta _n-\theta _{n-1}}{\eta }+\bar{\beta _1}\nabla {\mathcal {R}}(\theta _n) = 0. \end{aligned}$$
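
Equivalently, multiplying this scheme by \(\eta ^2\) and using \(\bar{\beta _1}=\beta _1/\eta \), the update rearranges into the momentum-type form

$$\begin{aligned} \theta _{n+1} = \theta _n + (1-\beta _1)\left( \theta _n-\theta _{n-1}\right) - \beta _1\eta \nabla {\mathcal {R}}(\theta _n). \end{aligned}$$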

With this modification, the ODE (23) can be written as:

$$\begin{aligned} \theta ''(t) + \bar{\beta _1}\left( 1+\frac{\eta \bar{\beta _1}}{2} \right) \theta '(t) + \bar{\beta _1}\left( 1+\frac{\eta \bar{\beta _1}}{2} \right) \nabla {\mathcal {R}}(\theta (t)) = 0. \end{aligned}$$

By the Taylor formula and after some manipulations (using the equation above), there exist \(t_n \le \alpha _n^0 \le t_{n+1}\), \(t_{n-1} \le \alpha _n^1 \le t_n\) and \(t_{n-1} \le \alpha _n^2 \le t_n\) such that:

$$\begin{aligned} l_n = \eta ^2 \left[ \frac{\bar{\beta _1}^3}{4}\theta '(t_n)+\frac{\bar{\beta _1}^3}{4}\nabla {\mathcal {R}}(\theta (t_n)) + \frac{\theta ^{(4)}(\alpha _n^0) + \theta ^{(4)}(\alpha _n^1) +4\bar{\beta _1}\theta ^{(3)}(\alpha _n^2)}{24}\right] . \end{aligned}$$

We conclude as in the proof of Proposition 5, using the regularity theorem. \(\square \)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bensaid, B., Poëtte, G. & Turpault, R. Deterministic Neural Networks Optimization from a Continuous and Energy Point of View. J Sci Comput 96, 14 (2023). https://doi.org/10.1007/s10915-023-02215-4
