Deterministic Neural Networks Optimization from a Continuous and Energy Point of View


Abstract

Getting an efficient neural network can be a very difficult task for engineers and researchers because of the huge number of hyperparameters to tune and their interconnections. To make the tuning step easier and more understandable, this work focuses on probably one of the most important levers for improving neural network efficiency: the optimizer. In recent years, a great number of algorithms have been developed, but they require careful tuning to be efficient. To avoid this long and experimental step, we look for generic and desirable properties for non-convex optimization. For this purpose, the optimizers are reinterpreted or analyzed as discretizations of continuous dynamical systems. This continuous framework offers many mathematical tools, such as Lyapunov stability, to interpret the sensitivity of the optimizer with respect to the initial guess. By enforcing the discrete decrease of Lyapunov functionals, new robust and efficient optimizers are designed. They also considerably simplify the tuning of hyperparameters (learning rate, momentum, etc.). These Lyapunov-based algorithms outperform several state-of-the-art optimizers on different benchmarks from the literature. Drawing its inspiration from the numerical analysis of PDEs, this paper emphasizes the essential role of some hidden energy/entropy quantities for machine learning tasks.


Data availability

The implementation of the optimizers in C/C++ and the Python (matplotlib) script used to generate the graphics can be found in the git repository: https://github.com/bbensaid30/NN_Levenberg_Marquardt/tree/shaman.

Notes

  1. (https://gitlab.com/numerical_shaman/shaman).

  2. The threshold 0.01 is only there to discard iterations where the increase is negligible; it has no real consequence on the results.

  3. Tensorflow documentation: link

References

  1. Attouch, H., Peypouquet, J.: The rate of convergence of Nesterov's accelerated forward-backward method is actually \(o(k^{-2})\). SIAM J. Optim. (2015). https://doi.org/10.1137/15M1046095

  2. Barakat, A., Bianchi, P.: Convergence and dynamical behavior of the adam algorithm for nonconvex stochastic optimization. SIAM J. Optim. 31(1), 244–274 (2021). https://doi.org/10.1137/19M1263443

  3. Behery, G., El-Harby, A., El-Bakry, M.: Reorganizing neural network system for two spirals and linear low-density polyethylene copolymer problems. Appl. Comput. Intell. Soft Comput. (2009). https://doi.org/10.1155/2009/721370

  4. Berner, J., Grohs, P., Kutyniok, G., Petersen, P.: The Modern Mathematics of Deep Learning (2021). https://doi.org/10.48550/ARXIV.2105.04026

  5. Bof, N., Carli, R., Schenato, L.: Lyapunov theory for discrete time systems (2018). https://doi.org/10.48550/ARXIV.1809.05289

  6. Bourriaud, A., Loubère, R., Turpault, R.: A priori neural networks versus a posteriori mood loop: a high accurate 1d fv scheme testing bed. J. Sci. Comput. (2020). https://doi.org/10.1007/s10915-020-01282-1

  7. Chalup, S.K., Wiklendt, L.: Variations of the two-spiral task. Connect. Sci. 19(2), 183–199 (2007). https://doi.org/10.1080/09540090701398017

  8. Chartier, P., Hairer, E., Vilmart, G.: Numerical integrators based on modified differential equations. Math. Comput. 76, 1941–1953 (2007). https://doi.org/10.1090/S0025-5718-07-01967-9

  9. Chetaev, N.G.: The Stability of Motion, 2nd rev. edn. Translated from the Russian by M. Nadler; translation editors A.W. Babister and J. Burlak. Pergamon Press, New York (1961)

  10. Das, H., Reza, N.: On gröbner basis and their uses in solving system of polynomial equations and graph coloring. J. Math. Stat. 14, 175–182 (2018). https://doi.org/10.3844/jmssp.2018.175.182

  11. Demeure, N.: Compromise between precision and performance in high performance computing. Theses, École Normale supérieure Paris-Saclay (2021). https://tel.archives-ouvertes.fr/tel-03116750

  12. Després, B., Jourdren, H.: Machine learning design of volume of fluid schemes for compressible flows. J. Comput. Phys. 408, 109275 (2020). https://doi.org/10.1016/j.jcp.2020.109275

  13. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS (2010)

  14. Griffiths, D., Sanz-Serna, J.: On the scope of the method of modified equations. SIAM J. Sci. Stat. Comput. 7, 994–1008 (1986). https://doi.org/10.1137/0907067

  15. Hagan, M., Menhaj, M.: Training feedforward networks with the marquardt algorithm. IEEE Trans. Neural Netw. 5(6), 989–993 (1994). https://doi.org/10.1109/72.329697

  16. Kalkbrener, M.: Solving systems of algebraic equations by using gröbner bases. In: EUROCAL (1987)

  17. Kelley Pace, R., Barry, R.: Sparse spatial autoregressions. Stat. Probab. Lett. 33(3), 291–297 (1997). https://doi.org/10.1016/S0167-7152(96)00140-X

  18. Khalil, H.K.: Nonlinear Systems, 3rd edn. Prentice-Hall, Upper Saddle River, NJ (2002). https://cds.cern.ch/record/1173048

  19. Kingma, D., Ba, J.: Adam: a method for stochastic optimization. International Conference on Learning Representations (2014). https://doi.org/10.48550/ARXIV.1412.6980

  20. Kluth, G., Humbird, K.D., Spears, B.K., Peterson, J.L., Scott, H.A., Patel, M.V., Koning, J., Marinak, M., Divol, L., Young, C.V.: Deep learning for nlte spectral opacities. Phys. Plasmas 27(5), 052707 (2020). https://doi.org/10.1063/5.0006784

  21. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. Rep. 0, University of Toronto, Toronto, Ontario (2009)

  22. Lamy, C., Dubroca, B., Nicolaï, P., Tikhonchuk, V., Feugeas, J.L.: Modeling of electron nonlocal transport in plasmas using artificial neural networks. Phys. Rev. E 105, 055201 (2022). https://doi.org/10.1103/PhysRevE.105.055201

  23. LaSalle, J.: Stability theory for ordinary differential equations. J. Differen. Equ. 4(1), 57–65 (1968). https://doi.org/10.1016/0022-0396(68)90048-X

  24. Lazard, D.: Solving zero-dimensional algebraic systems. J. Symb. Comput. 13, 117–132 (1992)

  25. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86, 2278–2324 (1998). https://doi.org/10.1109/5.726791

  26. Li, C., Miao, Y., Maio, Q.: A method to judge the stability of dynamical system. IFAC Proc. Vol. 28(16), 101–105 (1995). https://doi.org/10.1016/S1474-6670(17)45161-5

  27. Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., Han, J.: On the variance of the adaptive learning rate and beyond. In: ICLR (2020). https://doi.org/10.48550/ARXIV.1908.03265

  28. Lydia, A., Francis, S.: Adagrad - an optimizer for stochastic gradient descent. Int. J. Inf. Comput. Sci. 6, 566–568 (2019)

  29. Mo, W., Luo, X., Zhong, Y., Jiang, W.: Image recognition using convolutional neural network combined with ensemble learning algorithm. J. Phys: Conf. Ser. 1237, 022026 (2019). https://doi.org/10.1088/1742-6596/1237/2/022026

  30. Nesterov, Y.: A method for solving the convex programming problem with convergence rate \(o(1/k^2)\). Proc. USSR Acad. Sci. 269, 543–547 (1983)

  31. Novello, P., Poëtte, G., Lugato, D., Congedo, P.M.: Explainable Hyperparameters Optimization using Hilbert-Schmidt Independence Criterion (2021). https://hal.archives-ouvertes.fr/hal-03128298. Accepted in Journal of Sci. Comput

  32. Novello, P., Poëtte, G., Lugato, D., Peluchon, S., Congedo, P.M.: Accelerating hypersonic reentry simulations using deep learning-based hybridization (with guarantees) (2022). https://arxiv.org/abs/2209.13434.

  33. Polyak, B.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964). https://doi.org/10.1016/0041-5553(64)90137-5

  34. Raissi, M., Perdikaris, P., Karniadakis, G.: Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707 (2019). https://doi.org/10.1016/j.jcp.2018.10.045

  35. Reddi, S.J., Kale, S., Kumar, S.: On the convergence of adam and beyond. In: International Conference on Learning Representations (2018). https://doi.org/10.48550/ARXIV.1904.09237

  36. Richardson, F., Reynolds, D., Dehak, N.: Deep neural network approaches to speaker and language recognition. IEEE Signal Process. Lett. 22(10), 1671–1675 (2015). https://doi.org/10.1109/LSP.2015.2420092

  37. Ripoll, J.F., Kluth, G., Has, S., Fischer, A., Mougeot, M., Camporeale, E.: Exploring pitch-angle diffusion during high speed streams with neural networks. In: 2022 3rd URSI Atlantic and Asia Pacific Radio Science Meeting (AT-AP-RASC), pp. 1–4 (2022). https://doi.org/10.23919/AT-AP-RASC54737.2022.9814235

  38. Rouche, N., Habets, P., Laloy, M.: Stability Theory by Liapunov’s Direct Method. Applied Mathematical Sciences. 3Island Press (1977). https://books.google.fr/books?id=yErqoQEACAAJ

  39. Shi, B., Du, S., Jordan, M., Su, W.: Understanding the acceleration phenomenon via high-resolution differential equations. Math. Program. (2021). https://doi.org/10.1007/s10107-021-01681-8

  40. Smith, L.: Cyclical learning rates for training neural networks. In: IEEE winter conference on applications of computer vision, pp. 464–472 (2017)

  41. Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17(153), 1–43 (2016)

  42. Tian, D., Liu, Y., Wang, J.: Fuzzy neural network structure identification based on soft competitive learning. Int. J. Hybrid Intell. Syst. 4, 231–242 (2007). https://doi.org/10.3233/HIS-2007-4403

  43. Tieleman, T., Hinton, G., et al.: Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA Neural Netw. Mach. Learn. 4(2), 26–31 (2012)

  44. Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. 113(47), E7351–E7358 (2016). https://doi.org/10.1073/pnas.1614734113

  45. Wilkinson, J.H.: Modern error analysis. SIAM Rev. 13(4), 548–568 (1971). https://doi.org/10.1137/1013095

  46. Wilson, A.C., Recht, B., Jordan, M.I.: A Lyapunov analysis of accelerated methods in optimization. J. Mach. Learn. Res. 22, 5040–5073 (2021)

  47. Xu, Z., Dai, A.M., Kemp, J., Metz, L.: Learning an adaptive learning rate schedule. NeurIPS, arXiv (2019). https://doi.org/10.48550/ARXIV.1909.09712

  48. You, K., Long, M., Wang, J., Jordan, M.: How does learning rate decay help modern neural networks (2019)

  49. Zeiler, M.: ADADELTA: an adaptive learning rate method (2012). https://doi.org/10.48550/ARXIV.1212.5701

Funding

This work was supported by CEA/CESTA and LRC Anabase.

Author information

Corresponding author

Correspondence to Bilel Bensaid.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Analytical benchmarks and their subtleties

All the analytical examples of this section are based on the cost function (7), recalled here:

$$\begin{aligned} {\mathcal {R}}(w,b)=\frac{1}{4}\left[ g(b)^2+g(w+b)^2\right] . \end{aligned}$$

Their general structures are the same, that is to say \({\mathcal {N}}=(1)\) and \({\mathcal {G}}=(g)\), but they are built by choosing different activation functions; each new activation function in this section leads to a more difficult learning problem. Here, when Q is claimed to be a good approximation of H, it means that \(Q(\theta ^*)=H(\theta ^*)\); if not, it means that \( \displaystyle {\lim _{\theta \rightarrow \theta ^*}}\Vert \frac{Q(\theta )}{S(\theta )}\Vert = \infty \). In this appendix, several benchmarks are presented with their corresponding properties, which are summed up in Tables 5, 6, 7, 8 and 9. We recall that Q does not involve any second order derivatives, see (1), contrary to S, see (2). The following points sum up the subtleties of each new analytical benchmark we introduce:

  • Benchmark #1 admits four global minima, and the quasi-Hessian is sufficient to describe the curvature of the surface around them. It is an ideal case (the four minima are computed explicitly after (26)).

  • Benchmark #2 presents one global minimum and three local minima of different values. In the neighborhood of the global minimum, the quasi-Hessian is a good approximation of the Hessian (this is called the LM hypothesis). However, the curvature at the other minima involves the second order derivatives of u (the network output). This example helps to identify the influence of the curvature of u on the attractive power of the minima. But given that the best minimum is the only one that satisfies the LM hypothesis, the attractive power can also be impacted by the fact that this minimum is global: if the region of attractivity of a minimum is larger, it is hard to know whether this is due to the second order information or to its training performance.

  • With this in mind, benchmark #4 presents global/local minima that are equivalent with respect to the LM hypothesis. For the same reasons, benchmark #5 satisfies this assumption around the local minimum but not in the vicinity of the global one. Therefore, a comparison with benchmark #2 makes it possible to get an idea of the role of H(u) in the proportions found by a given algorithm. Some minima of these examples have a non-invertible/zero Hessian, which allows a possible study of the influence of singular curvature.

  • To be complete, concerning benchmark #3, the four global minima involve the curvature of u. Besides, this example seems hard to train with classical algorithms because of its stiffness, see Table 2.

All the above characteristics of the benchmarks are summed up in Tables 5, 6, 7, 8 and 9. The term “order of dominating terms” refers to the asymptotic expansion of \({\mathcal {R}}(\theta )-{\mathcal {R}}(\theta ^*)\) around a minimum \(\theta ^*\). The values at the minima \(\theta ^*\), i.e. \({\mathcal {R}}(\theta ^*)\), are given in the same order as the set of minima.

Benchmark #1 The activation function is simply set to:

$$\begin{aligned} g(z)=z^2-1. \end{aligned}$$
(26)
Table 5 Minimum properties for benchmark #1
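
As a quick check, the gradient of the cost (7) reads, for a generic activation g,

$$\begin{aligned} \nabla {\mathcal {R}}(w,b) = \frac{1}{2}\begin{pmatrix} g(w+b)\,g'(w+b) \\ g(b)\,g'(b)+g(w+b)\,g'(w+b) \end{pmatrix}, \end{aligned}$$

and for \(g(z)=z^2-1\) the zero-cost points are exactly those with \(b=\pm 1\) and \(w+b=\pm 1\), i.e. the four global minima \((w,b)\in \{(0,1),(-2,1),(2,-1),(0,-1)\}\).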

Benchmark #2 In this case:

$$\begin{aligned} g(z)=2z^3-3z^2+5. \end{aligned}$$
(27)
Table 6 Minimum properties for benchmark #2

Benchmark #3 In this case:

$$\begin{aligned} g(z)=z^4-2z^2+3. \end{aligned}$$
(28)
Table 7 Minimum properties for benchmark #3

Benchmark #4 In this case:

$$\begin{aligned} g(z)=z^5-4z^4+2z^3+8z^2-11z-12. \end{aligned}$$
(29)


Table 8 Minimum properties for benchmark #4

Benchmark #5 In this case:

$$\begin{aligned} g(z) = 35z^8-360z^7+1540z^6-3528z^5+4620z^4-3360z^3+1120z^2+1. \end{aligned}$$
(30)
Table 9 Minimum properties for benchmark #5

B Proof of the stability/attractivity theorem 1

In order to prove the stability/attractivity theorem 1, we rely on classical theorems such as the Lyapunov/LaSalle/Salvadori theorems to study the behavior in the vicinity of minima, and on the Chetaev/Salvadori theorems for the study of maxima and saddle points. These are usually formalised for a general ODE of the form:

$$\begin{aligned} Y'(t) = F(t,Y(t)), \end{aligned}$$

with unknown \(Y(t)\in {\mathbb {R}}^n\) and \(F:{\mathbb {R}}\times {\mathbb {R}}^n \rightarrow {\mathbb {R}}^n\). The theorems we use next have a common point: they require finding functionals V that satisfy certain assumptions. Each such functional provides an additional redundant scalar equation of the form

$$\begin{aligned} {\dot{V}}(Y(t))= \nabla V (Y(t)) \cdot F(t,Y(t)) + \partial _t V(Y(t)). \end{aligned}$$

Depending on the properties satisfied by V (positive definiteness, negative definiteness, etc.), we conclude on the stability/attractivity in the vicinity of the minima/maxima/saddle points. Let us remark that the functions V(t) defined in Sect. 4 are not strictly speaking Lyapunov functions because \(V(\theta ^*)\ne 0\). Indeed, it is necessary and sufficient to subtract \({\mathcal {R}}(\theta ^*)\) from them.

In the following, we denote by \({\mathcal {Z}}=\{Y\in {\mathbb {R}}^N, {\dot{V}}(Y)=0\}\) the set of points of \({\mathbb {R}}^N\) at which \({\dot{V}}=V'\) vanishes, and by \({\mathcal {M}}\) the largest invariant set included in \({\mathcal {Z}}\).

Proof

(Theorem 1 for GD) Let us choose \(V(\theta ) = {\mathcal {R}}(\theta ) - {\mathcal {R}}(\theta ^*)\) where \(\theta ^*\) is a stationary point. For GD, we have

$$\begin{aligned} {\dot{V}}(\theta ) = -\Vert \nabla {\mathcal {R}}(\theta )\Vert ^2. \end{aligned}$$
(31)
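
For completeness, (31) follows from the chain rule along the gradient flow \(\theta '(t)=-\nabla {\mathcal {R}}(\theta (t))\) associated with GD:

$$\begin{aligned} {\dot{V}}(\theta (t)) = \nabla {\mathcal {R}}(\theta (t))\cdot \theta '(t) = -\Vert \nabla {\mathcal {R}}(\theta (t))\Vert ^2. \end{aligned}$$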

For GD, \({\dot{V}}\) as computed above is negative definite around isolated critical points. We have immediately:

  • if \(\theta ^*\) is an isolated local minimum, then V is positive definite.

  • If \(\theta ^*\) is an isolated local maximum, then V is negative definite.

  • If \(\theta ^*\) is an isolated saddle point, V is indefinite.

The stability/instability result for GD in theorem 1 is thus established by the Lyapunov theorem and an extension of the Chetaev theorem. For the attractivity of GD, i.e. for the convergence towards \({\mathcal {E}}\), it is sufficient to note that \({\mathcal {Z}}={\mathcal {E}}\) and to use LaSalle's theorem. \(\square \)

Proof

(Theorem 1 for Momentum) For Momentum, the function F is given, for all \((v,\theta ) \in {\mathbb {R}}^N\times {\mathbb {R}}^N\), by:

$$\begin{aligned} F(v,\theta ) = \begin{pmatrix} -\bar{\beta _1}v-\bar{\beta _1} \nabla {\mathcal {R}}(\theta ) \\ v \end{pmatrix}. \end{aligned}$$
(32)

The stationary points, i.e. \((v,\theta )\in {\mathbb {R}}^N\times {\mathbb {R}}^N\) such that \(F(v,\theta )=0\), are of the form \((0,\theta ^*)\) where \(\theta ^*\) is a critical point of \({\mathcal {R}}\). Set

$$\begin{aligned} V(v,\theta ) = \frac{\Vert v\Vert ^2}{2} + \bar{\beta _1}({\mathcal {R}}(\theta )-{\mathcal {R}}(\theta ^*)), \end{aligned}$$

then we have

$$\begin{aligned} {\dot{V}}(v,\theta ) = -\bar{\beta _1}\Vert v\Vert ^2. \end{aligned}$$
(33)

The difficulty for Momentum (in contrast to the proof for GD) is that \({\dot{V}}\) is only negative semi-definite. So it is impossible to apply, for example, the extension of the Chetaev theorem. The trick is to see Momentum as a dissipative Lagrangian system. Let us define:

  • a kinetic energy \(T(v)=\frac{\Vert v\Vert ^2}{2}\),

  • a potential energy \(\Pi (\theta ) = \bar{\beta _1}{\mathcal {R}}(\theta )\),

  • and a dissipative term \(Q(v) = -\bar{\beta _1}v\).

The Lagrangian is \(L=T-\Pi \) and the equation of motion is:

$$\begin{aligned} \frac{d}{dt}\frac{\partial L}{\partial v}-\frac{\partial L}{\partial \theta } = Q. \end{aligned}$$

With the above definitions of \(T, \Pi ,Q\), the above equation corresponds to the Momentum ODE.
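
Indeed, a direct computation with these definitions gives

$$\begin{aligned} \frac{d}{dt}\frac{\partial L}{\partial v}-\frac{\partial L}{\partial \theta } = v'(t) + \bar{\beta _1}\nabla {\mathcal {R}}(\theta (t)) = -\bar{\beta _1}v(t) = Q(v(t)), \end{aligned}$$

which, together with \(\theta '(t)=v(t)\), is exactly the Momentum vector field (32).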

It is sufficient to check the assumptions of the Salvadori theorem (theorem 5.2 p114 [38]):

  1. If \(\theta ^*\) is a minimum for \({\mathcal {R}}\), then \(\theta ^*\) is a minimum for \(\Pi \). The same holds for maxima and saddle points.

  2. \(\theta ^*\) is an isolated point.

  3. The dissipation is complete, that is to say:

    $$\begin{aligned} Q(v) \cdot v = -\bar{\beta _1} \Vert v\Vert ^2 \le -a(\Vert v\Vert ). \end{aligned}$$

    where \(\left( a: x\in {\mathbb {R}}^+ \mapsto \bar{\beta _1}x^2\right) \in {\mathcal {K}}\). Let us recall that \({\mathcal {K}}\) is, classically in stability theory, the set of continuous, strictly increasing functions a from \({\mathbb {R}}^+\) to \({\mathbb {R}}^+\) such that \(a(0)=0\).

The stability/instability result of theorem 1 for Momentum is then established.

It remains to prove the convergence towards \({\mathcal {E}}\) thanks to the invariance principle [38]. First, note that \({\mathcal {Z}}=\{0\} \times {\mathbb {R}}^N\). Take a solution \((v(t),\theta (t))\) starting in the largest invariant set \({\mathcal {M}}\) included in \({\mathcal {Z}}\). As \({\mathcal {M}}\) is invariant, for all \(t\ge 0\), \(v(t)=0\). From the Momentum vector field (32), we have that \(\theta '(t)=v(t)\). As \(v(t)=0=\theta '(t)\), we deduce that the trajectory \(\theta (t)\) is constant. The second equation \(v'(t) = -\bar{\beta _1}v(t)-\bar{\beta _1} \nabla {\mathcal {R}}(\theta (t))\) then gives that \(\forall t\ge 0\), \(\theta (t)=\theta ^* \in {\mathcal {E}}\). As a consequence, \({\mathcal {M}}=\{0\} \times {\mathcal {E}}\) and we can conclude thanks to LaSalle's invariance principle. \(\square \)

Proof

(Theorem 1 for AWB) For AWB, the function F is given, for all \((v,s,\theta ) \in {\mathbb {R}}^N \times {\mathbb {R}}_+^N \times {\mathbb {R}}^N\), by:

$$\begin{aligned} F(v,s,\theta ) = \begin{pmatrix} -\bar{\beta _1}v+\bar{\beta _1} \nabla {\mathcal {R}}(\theta ) \\ -\bar{\beta _2}s+\bar{\beta _2} \nabla {\mathcal {R}}(\theta )^2 \\ -\frac{v}{\sqrt{s+\epsilon _a}} \end{pmatrix}. \end{aligned}$$

In order to apply LaSalle's invariance principle, the definition domain of F has to be an open connected set: first, note that for \(t>0\), \(\theta (0)\notin {\mathcal {E}}\) implies \(s(t)>0\), see (14). This means we can restrict the study to \({\mathbb {R}}^N \times ({\mathbb {R}}_+^N\setminus \{0\}) \times {\mathbb {R}}^N\), since we establish an asymptotic property. The stationary points are of the form \((0,0,\theta ^*)\) where \(\theta ^*\) is a critical point of \({\mathcal {R}}\). Let us set:

$$\begin{aligned} V(v,s,\theta ) = \frac{1}{2}\displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{\sqrt{s_i+\epsilon _a}}} +\bar{\beta _1}({\mathcal {R}}(\theta )-{\mathcal {R}}(\theta ^*)). \end{aligned}$$

Its derivative is:

$$\begin{aligned} {\dot{V}}(v,s,\theta ) = -\frac{\bar{\beta _2}}{4}\displaystyle {\sum _{i=1}^{N}\frac{v_i^2(\frac{\partial {\mathcal {R}}}{\partial \theta _i})^2(\theta )}{(s_i+\epsilon _a)\sqrt{s_i+\epsilon _a}}} +\displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{\sqrt{s_i+\epsilon _a}}}\left[ -\bar{\beta _1} +\frac{\bar{\beta _2}}{4} \frac{s_i}{s_i+\epsilon _a}\right] . \end{aligned}$$
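
For clarity, this follows by differentiating V along the flow and then substituting the vector field F given above:

$$\begin{aligned} {\dot{V}}(v,s,\theta ) = \sum _{i=1}^{N}\left[ \frac{v_i v_i'}{\sqrt{s_i+\epsilon _a}} - \frac{v_i^2\, s_i'}{4(s_i+\epsilon _a)^{3/2}}\right] + \bar{\beta _1}\,\nabla {\mathcal {R}}(\theta )\cdot \theta ', \end{aligned}$$

where the cross terms \(\bar{\beta _1} v_i \frac{\partial {\mathcal {R}}}{\partial \theta _i}(\theta )/\sqrt{s_i+\epsilon _a}\) produced by \(v'\) and \(\theta '\) cancel each other.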

Noting that \(\frac{s_i}{s_i+\epsilon _a}<1\) as \(s_i \ge 0\) we have:

$$\begin{aligned} {\dot{V}}(v,s,\theta ) \le \displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{\sqrt{s_i+\epsilon _a}}}\left[ -\bar{\beta _1} +\frac{\bar{\beta _2}}{4} \right] . \end{aligned}$$

As \(\bar{\beta _2} < 4 \bar{\beta _1}\), \({\dot{V}}\) is negative semi-definite.

We now apply the LaSalle invariance principle. From the above inequality, we have \({\mathcal {Z}}=\{(0,s,\theta ),\ (s,\theta ) \in {\mathbb {R}}_+^{N}\times {\mathbb {R}}^{N}\}\). Take a solution \((v(t),s(t),\theta (t))\) starting in the largest invariant set \({\mathcal {M}}\) included in \({\mathcal {Z}}\). As \({\mathcal {M}}\) is invariant, for all \(t\ge 0\), \(v(t)=0\). From (11), we can deduce several properties:

  • from \(\theta '(t) = -\frac{v(t)}{\sqrt{s(t)+\epsilon _a}}\), we obtain that for all \(t \ge 0\):

    $$\begin{aligned} \theta '(t)=0. \end{aligned}$$
  • From the equation \(v'(t) = -\bar{\beta _1}v(t)+\bar{\beta _1} \nabla {\mathcal {R}}(\theta (t))\), we deduce that \(\forall t\ge 0\):

    $$\begin{aligned} \nabla {\mathcal {R}}(\theta (t))=0. \end{aligned}$$
  • From the equation \(s'(t) = -\bar{\beta _2}s(t)+\bar{\beta _2} \nabla {\mathcal {R}}(\theta (t))^2\), we deduce that:

    $$\begin{aligned} s(t)=s(0)e^{-\bar{\beta _2}t}. \end{aligned}$$

Hence for all \(t \ge 0\):

$$\begin{aligned} \theta (t) = \theta ^* \in {\mathcal {E}}. \end{aligned}$$

Therefore the following equality holds:

$$\begin{aligned} \{0\}\times {\mathbb {R}}_+^N \times {\mathcal {E}} = {\mathcal {M}}. \end{aligned}$$

According to LaSalle's invariance principle [23], every bounded trajectory \((v(t),s(t),\theta (t))\) converges to the set \({\mathcal {M}}\). The characterization of \({\mathcal {M}}\) proved above then shows that:

$$\begin{aligned} \displaystyle {\lim _{t \rightarrow +\infty }} d(\theta (t),{\mathcal {E}})=0, \end{aligned}$$

that is, the attractivity of AWB is proved. \(\square \)

Proof

(Theorem 1 for Adam) For Adam, the function F is given, for all \((t,v,s,\theta ) \in {\mathbb {R}}^+ \times {\mathbb {R}}^N \times {\mathbb {R}}_+^N \times {\mathbb {R}}^N\), by:

$$\begin{aligned} F(t,v,s,\theta ) = \begin{pmatrix} -\bar{\beta _1}v+\bar{\beta _1} \nabla {\mathcal {R}}(\theta ) \\ -\bar{\beta _2}s+\bar{\beta _2} \nabla {\mathcal {R}}(\theta )^2 \\ -f(t)\frac{v}{\sqrt{s+\epsilon _a}} \end{pmatrix}. \end{aligned}$$

Let us recall that f(t) is defined in 4.

For the same reason as in the previous proof for AWB, we can restrict the study to \(({\mathbb {R}}^+\setminus \{0\}) \times {\mathbb {R}}^N \times ({\mathbb {R}}_+^N\setminus \{0\}) \times {\mathbb {R}}^N\). The stationary points are of the form \((0,0,\theta ^*)\) where \(\theta ^*\) is a critical point of \({\mathcal {R}}\).

Let us set:

$$\begin{aligned} V(t,v,s,\theta ) = \frac{f(t)}{2}\displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{\sqrt{s_i+\epsilon _a}}} +\bar{\beta _1}{\mathcal {R}}(\theta ). \end{aligned}$$

Its derivative is:

$$\begin{aligned} {\dot{V}}(t,v,s,\theta )&= -\frac{\bar{\beta _2}f(t)}{4}\displaystyle {\sum _{i=1}^{N}\frac{v_i^2(\frac{\partial {\mathcal {R}}}{\partial \theta _i})^2(\theta )}{(s_i+\epsilon _a)\sqrt{s_i+\epsilon _a}}}-\bar{\beta _1}f(t)\displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{\sqrt{s_i+\epsilon _a}}}\\&\quad + \displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{4\sqrt{s_i+\epsilon _a}}}\left[ 2f'(t)+(\bar{\beta _2}-4\bar{\beta _1})f(t)\frac{s_i}{s_i+\epsilon _a} \right] . \end{aligned}$$

Now, in order to prove the attractivity of Adam, it remains to check the hypotheses of Section 5.9, p. 206 (Chapter 8) of [38], together with Corollary 4.15. These assumptions are recalled below: there exist a continuous function \(\psi \) and, for every compact set \(K \subset {\mathbb {R}}^N \times {\mathbb {R}}_+^N \times {\mathbb {R}}^N\), constants \(A \in {\mathbb {R}}\) and \(B>0\) such that:

  (i) F is bounded on K:

    $$\begin{aligned} \forall t\in {\mathbb {R}}_+, \forall (v,s,\theta )\in K, \Vert F(t,v,s,\theta )\Vert \le A. \end{aligned}$$
  (ii) V is bounded from below:

    $$\begin{aligned} \forall t\in {\mathbb {R}}_+, \forall (v,s,\theta )\in K, V(t,v,s,\theta )\ge -B. \end{aligned}$$
  (iii) \({\dot{V}}\) is bounded from above, independently of the time:

    $$\begin{aligned} \forall t\in {\mathbb {R}}_+, \forall (v,s,\theta )\in K, {\dot{V}}(t,v,s,\theta ) \le -\psi (v,s,\theta ) \le 0. \end{aligned}$$
  (iv) The limit set is compact.

  (v) The ODE is asymptotically autonomous. This means that the translation \(F_a(t,v,s,\theta ) {:}{=}F(t+a,v,s,\theta )\) converges as a goes to \(\infty \); we denote the limit function by \(F^*\).

Setting \({\mathcal {Z}}{:}{=}\{(v,s,\theta ) \in {\mathbb {R}}^N \times {\mathbb {R}}_+^N \times {\mathbb {R}}^N, \psi (v,s,\theta )=0\}\), the theorem claims that the solution converges towards the largest invariant set included in \({\mathcal {Z}}\) for the limit equation:

$$\begin{aligned} \frac{d}{dt}(v(t),s(t),\theta (t))^T=F^*(v(t),s(t),\theta (t)). \end{aligned}$$

In order to prove point (iii), it is necessary to obtain an upper bound on \({\dot{V}}\). This is done by showing that \(g(t) = 2f'(t) + (\bar{\beta _2}-4\bar{\beta _1})f(t)\) is negative on \({\mathbb {R}}_+\) when \(\bar{\beta _2} < \bar{\beta _1}\), which then yields the desired bound on \({\dot{V}}\). We can rewrite g as:

$$\begin{aligned} g(t) = \frac{P(t)}{2\sqrt{1-e^{-\bar{\beta _2}t}}(1-e^{-\bar{\beta _1}t})^2}, \end{aligned}$$

where:

$$\begin{aligned} P(t) = 2(\bar{\beta _2}-4\bar{\beta _1})+8\bar{\beta _1}e^{-\bar{\beta _2}t}+(4\bar{\beta _1}-2\bar{\beta _2})e^{-\bar{\beta _1}t} -4\bar{\beta _1}e^{-(\bar{\beta _1}+\bar{\beta _2})t}. \end{aligned}$$
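
For the reader's convenience, the derivative of P is

$$\begin{aligned} P'(t) = -8\bar{\beta _1}\bar{\beta _2}e^{-\bar{\beta _2}t}-\bar{\beta _1}(4\bar{\beta _1}-2\bar{\beta _2})e^{-\bar{\beta _1}t}+4\bar{\beta _1}(\bar{\beta _1}+\bar{\beta _2})e^{-(\bar{\beta _1}+\bar{\beta _2})t}. \end{aligned}$$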

As for all \(t\ge 0\), \(e^{-(\bar{\beta _1}+\bar{\beta _2})t} \le e^{-\bar{\beta _1}t}\):

$$\begin{aligned} P'(t) \le 2\bar{\beta _1}\bar{\beta _2}\left( 3e^{-\bar{\beta _1}t} -4 e^{-\bar{\beta _2}t}\right) \le 0. \end{aligned}$$

Note that \(P(0) = 2(\bar{\beta _2}-4\bar{\beta _1})+8\bar{\beta _1}+(4\bar{\beta _1}-2\bar{\beta _2})-4\bar{\beta _1} = 0\), so P is nonpositive on \({\mathbb {R}}_+\) and therefore g is nonpositive on \({\mathbb {R}}_+\).

Let us analyse the function f to obtain a bound independent of the time. It is straightforward that \(\displaystyle {\lim _{t \rightarrow 0}} f(t) = \frac{\bar{\beta _2}}{\bar{\beta _1}}\) and \(\displaystyle {\lim _{t\rightarrow +\infty }} f(t) = 1\). Given that f is continuous on \({\mathbb {R}}_+^*\), f is bounded on \({\mathbb {R}}_+\). From now on, we consider the continuous extension of f and, with a slight abuse of notation, still call it f. Denoting by \(m>0\) the lower bound of f on \({\mathbb {R}}_+\) (f is strictly positive), the following inequality is verified (assumption (iii)):

$$\begin{aligned} {\dot{V}}(t,v,s,\theta ) \le -\bar{\beta _1}m\displaystyle {\sum _{i=1}^{N}\frac{v_i^2}{\sqrt{s_i+\epsilon _a}}} {=}{:}-\psi (v,s,\theta ) \le 0. \end{aligned}$$
(34)

The positivity of \({\mathcal {R}}\) gives the second hypothesis ii:

$$\begin{aligned} \forall t,v,s,\theta , V(t,v,s,\theta ) \ge 0. \end{aligned}$$
(35)

As f is bounded, we deduce that for every compact set \(K \subset {\mathbb {R}}^N \times {\mathbb {R}}_+^N \times {\mathbb {R}}^N\), there exists \(A \in {\mathbb {R}}\) (a continuous function is bounded on a compact set) such that (assumption (i)):

$$\begin{aligned} \forall t\in {\mathbb {R}}^+, (v,s,\theta )\in K, \Vert F(t,v,s,\theta )\Vert \le A. \end{aligned}$$
(36)

As the assumptions of Corollary 4.15 (the \({\mathcal {C}}^1\) regularity of F and V is straightforward), namely (36), (35) and (34), are verified, it only remains to prove that the ODE is asymptotically autonomous (assumption (v)). Since \(f(t) \rightarrow 1\) as \(t\rightarrow +\infty \), a simple calculation gives:

$$\begin{aligned} F^*(v,s,\theta ) = \begin{pmatrix} -\bar{\beta _1}v+\bar{\beta _1} \nabla {\mathcal {R}}(\theta ) \\ -\bar{\beta _2}s+\bar{\beta _2} \nabla {\mathcal {R}}(\theta )^2 \\ -\frac{v}{\sqrt{s+\epsilon _a}} \end{pmatrix}. \end{aligned}$$

Consider a bounded trajectory \((v(t),s(t),\theta (t))\). It is a classical result (see theorem 4, p. 364 of appendix 2 in [38]) that its limit set is compact (assumption (iv)). The theorem of section 5.9 then claims that the trajectory converges towards the largest invariant set of the limit ODE

$$\begin{aligned} \frac{d}{dt}(v(t),s(t),\theta (t))^T=F^*(v(t),s(t),\theta (t)), \end{aligned}$$

included in the set:

$$\begin{aligned} {\mathcal {Z}}=\{(0,s,\theta ), (s,\theta ) \in {\mathbb {R}}_+^N \times {\mathbb {R}}^{N}\}. \end{aligned}$$

Now note that this limit equation is exactly the ODE associated with AWB, so the study has already been done above. \(\square \)

Remark

This methodology is useful for many NN optimizers not studied in this paper. For instance, we can apply the above methodology to RMSProp [43]. The following choices of V are possible (keeping the same notations as for Adam):

$$\begin{aligned} {V}(t) = \bar{\beta _2} {\mathcal {R}}(\theta (t)), \end{aligned}$$

or

$$\begin{aligned} {V}(t) = \displaystyle {\sum _{i=1}^{N} \sqrt{s_i(t)}} + \bar{\beta _2} {\mathcal {R}}(\theta (t)). \end{aligned}$$

C Tests on analytical examples: supplement

In this appendix, we give more quantitative results than in Sect. 3.3 on benchmark #2 (activation (27)) with the classical optimizers and the new ones developed in this paper. These quantitative results are summed up in Table 10. Note that the default settings for the parameters of the classical optimizers are the same as in Sect. 6.1. First, Table 10 shows that the performances observed are not specific to one example: the algorithms have been tested on all the benchmarks of Appendix A. Contrary to benchmark #3 of Sect. 6.1, this benchmark allows us to see whether the global minimum \((0,-1)\) gives an advantage or not. AWB is very troubling because, contrary to Adam, it strongly favours the worst minimum (0, 1): a direct advantage of the biased steps appears clearly. A very promising insight is that the algorithms that enforce the energy decrease promote the global minimum: for EGD, half of the trajectories converge to \((0,-1)\). A minimal illustration of this energy-decrease principle is sketched below.
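
As a purely illustrative companion to this observation, the following self-contained C++ sketch applies the underlying principle (accept a gradient step only if it makes the energy \({\mathcal {R}}\) strictly decrease, halving the step size otherwise) to benchmark #2. It is not the EGD implementation of the repository; the initial guess, the nominal learning rate and the step-halving rule are arbitrary choices made here for illustration.

    #include <cmath>
    #include <cstdio>

    // Benchmark #2 activation (27) and its derivative: g(z) = 2z^3 - 3z^2 + 5.
    static double g(double z)  { return 2*z*z*z - 3*z*z + 5; }
    static double dg(double z) { return 6*z*z - 6*z; }

    // Cost (7): R(w,b) = (1/4) [ g(b)^2 + g(w+b)^2 ], and its gradient.
    static double R(double w, double b) {
        const double gb = g(b), gwb = g(w + b);
        return 0.25 * (gb * gb + gwb * gwb);
    }
    static void gradR(double w, double b, double &dw, double &db) {
        dw = 0.5 * g(w + b) * dg(w + b);
        db = 0.5 * (g(b) * dg(b) + g(w + b) * dg(w + b));
    }

    int main() {
        double w = 0.5, b = 0.5;   // arbitrary initial guess
        const double eta = 0.1;    // nominal learning rate
        for (int n = 0; n < 100000; ++n) {
            double dw, db;
            gradR(w, b, dw, db);
            if (std::sqrt(dw * dw + db * db) < 1e-10) break;   // (near-)critical point
            // Energy-decrease enforcement: halve the step until R strictly decreases.
            const double R0 = R(w, b);
            double e = eta;
            int halvings = 0;
            while (R(w - e * dw, b - e * db) >= R0 && halvings < 60) { e *= 0.5; ++halvings; }
            if (halvings == 60) break;   // no decreasing step found at machine precision
            w -= e * dw;
            b -= e * db;
        }
        std::printf("final point (w,b) = (%.6f, %.6f), R = %.3e\n", w, b, R(w, b));
        return 0;
    }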

Table 10 Algorithms performances on benchmark #2

D Tests on machine learning tasks: supplement

The default settings are the same as in Sect. 6.2. Table 11 concerns the square function \((x,y) \mapsto (x^2,y^2)\), using 50 training/testing points uniformly sampled on \([-1,1]^2\). The architecture is the following: \({\mathcal {N}}=\left( 2,15,2\right) \) and \({\mathcal {G}}=\left( reLU,linear\right) \), so \(N=2\times 15+15+15\times 2+2=77\). Here the assumptions of propositions 1 and 4 are not verified. However, the differences between the classical methods and the adaptive ones are even more striking.

Table 11 Performances for the square function \((x,y) \mapsto (x^2,y^2)\)

E Proofs of propositions 5 and 6

The proofs consist in checking the three points of Sect. 7 using Taylor expansions.

Proof

(Proposition 5) Since \(\nabla {\mathcal {R}}\) is Lipschitz-continuous, the explicit Euler scheme is 0-stable. The second point to check is simple: the initial condition is exact with \(l_0=\theta (0)-\theta _0=0\). Assume that \(\theta \) is the solution of the modified equation (22). The local error is given by:

$$\begin{aligned} l_{n} = \frac{\theta (t_{n+1})-\theta (t_n)}{\eta }+\nabla {\mathcal {R}}(\theta (t_n)). \end{aligned}$$

Using equation (22) and Taylor formula, there exists \(t_n \le \alpha _n \le t_{n+1}\) such that:

$$\begin{aligned} l_n = \frac{\eta }{2}\left[ \theta ''(t_n)-H(\theta (t_n))\nabla {\mathcal {R}}(\theta (t_n)) \right] +\frac{\eta ^2}{6}\theta ^{(3)}(\alpha _n). \end{aligned}$$
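
To make this explicit: assuming, as used below, that the modified equation (22) takes the form \(\theta '(t) = -\left( I_N+\frac{\eta }{2}H(\theta (t))\right) \nabla {\mathcal {R}}(\theta (t))\), the third order Taylor expansion

$$\begin{aligned} \frac{\theta (t_{n+1})-\theta (t_n)}{\eta } = \theta '(t_n)+\frac{\eta }{2}\theta ''(t_n)+\frac{\eta ^2}{6}\theta ^{(3)}(\alpha _n) \end{aligned}$$

combined with the definition of \(l_n\) yields the expression above.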

Let us denote by \(\nabla ^3 {\mathcal {R}}\) the tensor of third derivatives. By differentiating (22), we get (omitting the argument \(\theta (t_n)\) in the right-hand side):

$$\begin{aligned} \theta '' = \frac{\eta }{2} \left[ \nabla ^3 {\mathcal {R}} : \left( I_N + \frac{\eta }{2}H\right) \nabla {\mathcal {R}}\right] \nabla {\mathcal {R}} + \left( I_N + \frac{\eta }{2}H\right) H\left( I_N + \frac{\eta }{2}H\right) \nabla {\mathcal {R}}. \end{aligned}$$

Rearranging the terms it follows:

$$\begin{aligned} l_n = \left( \frac{\eta }{2}\right) ^2 \biggl [ \left[ \nabla ^3 {\mathcal {R}} : \left( I_N + \frac{\eta }{2}H\right) \nabla {\mathcal {R}}\right] \nabla {\mathcal {R}} + H^2\nabla {\mathcal {R}}+\frac{\eta }{2}H^3\nabla {\mathcal {R}} \biggr ] + \frac{\eta ^2}{6}\theta ^{(3)}(\alpha _n). \end{aligned}$$

From the regularity theorem for parametric equations, it follows that \((\eta ,t) \mapsto \theta _{\eta }(t)\) is \({\mathcal {C}}^{\infty }([0,1] \times [0,T])\). Therefore the term in front of \(\eta ^2\) is uniformly bounded in \(\eta \) in a neighborhood of the origin. This ends the proof. \(\square \)

Proof

(Proposition 6) The 0-stability has already been discussed in Remark 3. The second point is verified since \(l_0=\theta (0)-\theta _0=0\). By performing the change of variable \(\bar{\beta _1} = \frac{\beta _1}{\eta }\), we can write the algorithm as the following finite difference scheme:

$$\begin{aligned} \frac{\theta _{n+1}-2\theta _n+\theta _{n-1}}{\eta ^2}+\bar{\beta _1}\frac{\theta _n-\theta _{n-1}}{\eta }+\bar{\beta _1}\nabla {\mathcal {R}}(\theta _n) = 0. \end{aligned}$$
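
Equivalently, multiplying this scheme by \(\eta ^2\) and using \(\bar{\beta _1}=\beta _1/\eta \), the update rearranges into the momentum-type form

$$\begin{aligned} \theta _{n+1} = \theta _n + (1-\beta _1)\left( \theta _n-\theta _{n-1}\right) - \beta _1\eta \nabla {\mathcal {R}}(\theta _n). \end{aligned}$$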

With this modification, the ODE (23) can be written as:

$$\begin{aligned} \theta ''(t) + \bar{\beta _1}\left( 1+\frac{\eta \bar{\beta _1}}{2} \right) \theta '(t) + \bar{\beta _1}\left( 1+\frac{\eta \bar{\beta _1}}{2} \right) \nabla {\mathcal {R}}(\theta (t)) = 0. \end{aligned}$$

By the Taylor formula and after some manipulations (using the equation above), there exist \(t_n \le \alpha _n^0 \le t_{n+1}\), \(t_{n-1} \le \alpha _n^1 \le t_n\) and \(t_{n-1} \le \alpha _n^2 \le t_n\) such that:

$$\begin{aligned} l_n = \eta ^2 \left[ \frac{\bar{\beta _1}^3}{4}\theta '(t_n)+\frac{\bar{\beta _1}^3}{4}\nabla {\mathcal {R}}(\theta (t_n)) + \frac{\theta ^{(4)}(\alpha _n^0) + \theta ^{(4)}(\alpha _n^1) +4\bar{\beta _1}\theta ^{(3)}(\alpha _n^2)}{24}\right] . \end{aligned}$$

We conclude as in the proof of Proposition 5, using the regularity theorem. \(\square \)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Bensaid, B., Poëtte, G. & Turpault, R. Deterministic Neural Networks Optimization from a Continuous and Energy Point of View. J Sci Comput 96, 14 (2023). https://doi.org/10.1007/s10915-023-02215-4
