Energy Conservation in Infinitely Wide Neural-Networks

Eguchi, Shu; Amaba, Takafumi

doi:10.1007/978-3-030-86380-7_15

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12894)

Included in the following conference series:

International Conference on Artificial Neural Networks

Abstract

A three-layered neural network (NN), consisting of an input layer, a wide hidden layer and an output layer, has three types of parameters. Two of them are pre-neuronal, namely the thresholds and the weights applied to the input data. The third is the post-neuronal weights applied after activation. This paper consists of two parts. First, we consider three stochastic processes, each constructed by summing one type of parameter over the neurons at a given epoch; the neuron index is regarded as a second time variable, distinct from the epoch. In a wide neural network with the neural-tangent-kernel (NTK) parametrization, it is well known that the parameters barely move from their initial values during learning. We show, however, that the stochastic process associated with the post-neuronal parameters does vary during learning, while the stochastic processes associated with the pre-neuronal parameters do not. Consequently, the types of parameters can be distinguished by looking at these stochastic processes. Second, we show that the variance (a sort of “energy”) of the parameters in the infinitely wide neural network is conserved during learning, and thus gives a conserved quantity of learning.

References

1. Chizat, L., Oyallon, E., Bach, F.: On lazy training in differentiable programming. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32 (NeurIPS 2019). Curran Associates, Inc. (2018)
2. Goldberg, P., Williams, C., Bishop, C.: Regression with input-dependent noise: a Gaussian process treatment. In: Advances in Neural Information Processing Systems, vol. 10 (NIPS 1997). MIT Press (1998)
3. Ikeda, N., Watanabe, S.: Stochastic Differential Equations and Diffusion Processes, 2nd edn. North-Holland Mathematical Library, vol. 24. North-Holland Publishing Co., Amsterdam; Kodansha Ltd., Tokyo, p. xvi+555 (1989). ISBN: 0-444-87378-3
4. Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: convergence and generalization in neural networks. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 8571–8580. Curran Associates, Inc. (2018)
5. Lee, J., Bahri, Y., Novak, R., Schoenholz, S., Pennington, J., Sohl-Dickstein, J.: Deep neural networks as Gaussian processes. In: International Conference on Learning Representations (ICLR 2018) (2018)
6. Lee, J., et al.: Wide neural networks of any depth evolve as linear models under gradient descent. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32 (NeurIPS 2019). Curran Associates, Inc. (2019)
7. Neal, R.M.: Priors for infinite networks. In: Bayesian Learning for Neural Networks, pp. 29–53. Springer, New York (1996). https://doi.org/10.1007/978-1-4612-0745-0_2
8. Sonoda, S., Murata, N.: Neural network with unbounded activation functions is universal approximator. Appl. Comput. Harmonic Anal. 43(2), 233–268 (2017)
9. Suzuki, T.: Generalization bound of globally optimal non-convex neural network training: transportation map estimation by infinite dimensional Langevin dynamics. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33 (NeurIPS 2020), pp. 19224–19237. Curran Associates, Inc. (2020)
10. Suzuki, T., Akiyama, S.: Benefit of deep learning with non-convex noisy gradient descent: provable excess risk bound and superiority to kernel methods. To appear in International Conference on Learning Representations (ICLR 2021) (2021)
11. Murata, N.: An integral representation of functions using three-layered networks and their approximation bounds. Neural Netw. 9(6), 947–956 (1996)
12. Williams, C.: Computing with infinite networks. In: Mozer, M.C., Jordan, M., Petsche, T. (eds.) Advances in Neural Information Processing Systems, vol. 9 (NIPS 1996). MIT Press (1997)

Acknowledgments

The authors would like to express their appreciation to Professor Masaru Tanaka and Professor Jun Fujiki, who provided valuable comments and advice.

Author information

Authors and Affiliations

Fukuoka University, 8-19-1 Nanakuma, Jônan-ku, Fukuoka, 814-0180, Japan
Shu Eguchi & Takafumi Amaba

Corresponding author

Correspondence to Shu Eguchi.

Editor information

Editors and Affiliations

Comenius University in Bratislava, Bratislava, Slovakia
Igor Farkaš
iMotions A/S, Copenhagen, Denmark
Paolo Masulli
University of Tübingen, Tübingen, Baden-Württemberg, Germany
Sebastian Otte
Universität Hamburg, Hamburg, Germany
Stefan Wermter

A Proof of Theorem 1 and Theorem 2

Recall that the activation function $\sigma $ is assumed to be non-negative and Lipschitz continuous. Then $\sigma $ is differentiable almost everywhere and its Lipschitz constant can be expressed as $ \Vert \sigma ^{\prime } \Vert _{\infty } := \mathrm {ess}\sup \vert \sigma ^{\prime } \vert $, where $\sigma ^{\prime }$ is the almost-everywhere-defined derivative of $\sigma $. We put $ \vert \mathcal {X} \vert := \max _{i = 1,2,\ldots , n} \vert x_{i} \vert $, where $\{ x_i \}_{i=1}^{n}$ is the input data. Note that the loss function $ L ( \theta ) = \frac{1}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i} ( \theta ) - y_{i} \big )^{2} $ depends on the width m, as do the outputs $ \hat{y}_{i} ( \theta ) = \frac{1}{\sqrt{m}} \sum _{j=1}^{m} \sigma ( a_{j} x_{i} + a_{0,j} ) b_{j} $.
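The definitions of the outputs and the loss can be made concrete with a small sketch. It is only an illustration under stated assumptions: the softplus activation (non-negative and 1-Lipschitz), scalar inputs, standard normal initialization of all three parameter families, and the helper names `softplus`, `init_params`, `predict` and `loss` are choices made here, not taken from the paper.

```python
import math
import random

def softplus(z):
    # Assumed activation: non-negative and 1-Lipschitz, with sigma(0) = log 2.
    # Written in a form that does not overflow for large |z|.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def init_params(m, rng):
    # theta = (a0, a, b); every coordinate drawn i.i.d. from N(0, 1) (assumption).
    a0 = [rng.gauss(0.0, 1.0) for _ in range(m)]
    a = [rng.gauss(0.0, 1.0) for _ in range(m)]
    b = [rng.gauss(0.0, 1.0) for _ in range(m)]
    return a0, a, b

def predict(theta, x):
    # y_hat(theta) = m^{-1/2} * sum_j sigma(a_j * x + a_{0,j}) * b_j
    a0, a, b = theta
    m = len(b)
    return sum(softplus(a[j] * x + a0[j]) * b[j] for j in range(m)) / math.sqrt(m)

def loss(theta, xs, ys):
    # L(theta) = (1/n) * sum_i (y_hat_i(theta) - y_i)^2
    n = len(xs)
    return sum((predict(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / n

rng = random.Random(0)
xs, ys = [-1.0, 0.5, 1.0], [0.3, -0.2, 0.8]  # toy data, made up for the example
theta = init_params(200, rng)
initial_loss = loss(theta, xs, ys)
```

The factor $1/\sqrt{m}$ in `predict` is what makes both the outputs and the loss depend on the width.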

A.1 Preliminaries on the Gradient Flow $\frac{\mathrm {d}}{\mathrm {d} t} \theta (t) = - \frac{1}{2} ( \nabla _{\theta } L ) ( \theta (t) )$

Lemma 1

Along the gradient flow, we have $ L( \theta (t) ) \le L( \theta (0) ) $ for $t\ge 0$.
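Lemma 1 is stated without proof. One standard way to see it, sketched here under the assumption that the chain rule may be applied along the trajectory (recall that $\sigma $ is only differentiable almost everywhere), is to differentiate the loss along the flow:

$$\begin{aligned} \frac{\mathrm {d}}{\mathrm {d}t} L ( \theta (t) ) = \Big \langle ( \nabla _{\theta } L ) ( \theta (t) ), \frac{\mathrm {d}}{\mathrm {d}t} \theta (t) \Big \rangle = - \frac{1}{2} \big \Vert ( \nabla _{\theta } L ) ( \theta (t) ) \big \Vert ^{2} \le 0 . \end{aligned}$$

Hence $t \mapsto L ( \theta (t) )$ is non-increasing on $[0, \infty )$.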

In the coordinates $ \theta (t) = ( \boldsymbol{a}_{0}(t), \boldsymbol{a}(t), \boldsymbol{b}(t) ) = ( \{ a_{0,j}(t) \}_{j=1}^{m}, \{ a_{j}(t) \}_{j=1}^{m}, \{ b_{j}(t) \}_{j=1}^{m} ) $, the gradient flow $ \frac{\mathrm {d}}{\mathrm {d} t} \theta (t) = - \frac{1}{2} ( \nabla _{\theta } L) ( \theta (t) ) $ can be written as follows: for $j = 1,2,\ldots , m$ and $t \in \mathbb {R}$,

$$\begin{aligned} \begin{aligned} \frac{\mathrm {d}}{\mathrm {d}t} a_{0,j} (t)&= - \frac{1}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i}( \theta (t) ) - y_{i} \big ) \sigma ^{\prime } \big ( a_{j}(t) x_{i} + a_{0,j}(t) \big ) \frac{ b_{j}(t) }{ \sqrt{m} }, \\ \frac{\mathrm {d}}{\mathrm {d}t} a_{j} (t)&= - \frac{1}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i}( \theta (t) ) - y_{i} \big ) \sigma ^{\prime } \big ( a_{j}(t) x_{i} + a_{0,j}(t) \big ) x_{i} \frac{ b_{j}(t) }{ \sqrt{m} }, \\ \frac{\mathrm {d}}{\mathrm {d}t} b_{j} (t)&= - \frac{1}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i}( \theta (t) ) - y_{i} \big ) \sigma \big ( a_{j}(t) x_{i} + a_{0,j}(t) \big ) \frac{ 1 }{ \sqrt{m} }. \end{aligned} \end{aligned}$$

(1)
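The system (1) can be explored numerically with an explicit Euler discretization. The sketch below is not the authors' procedure: the step size `dt`, the softplus activation, the toy data and the helper names are assumptions made for illustration. It also records the loss after each step, which gives an empirical view of Lemma 1 when the step is small.

```python
import math
import random

def softplus(z):
    # Assumed activation (non-negative, 1-Lipschitz).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def softplus_prime(z):
    # Derivative of softplus (the logistic function), written via tanh for stability.
    return 0.5 * (1.0 + math.tanh(0.5 * z))

def outputs(a0, a, b, xs):
    m = len(b)
    return [sum(softplus(a[j] * x + a0[j]) * b[j] for j in range(m)) / math.sqrt(m)
            for x in xs]

def loss(a0, a, b, xs, ys):
    return sum((p - y) ** 2 for p, y in zip(outputs(a0, a, b, xs), ys)) / len(xs)

def euler_step(a0, a, b, xs, ys, dt):
    # One explicit Euler step of system (1): theta <- theta + dt * d(theta)/dt.
    m, n = len(b), len(xs)
    res = [p - y for p, y in zip(outputs(a0, a, b, xs), ys)]  # y_hat_i - y_i
    scale = dt / (n * math.sqrt(m))
    new_a0, new_a, new_b = [], [], []
    for j in range(m):
        g0 = ga = gb = 0.0
        for r, x in zip(res, xs):
            z = a[j] * x + a0[j]
            g0 += r * softplus_prime(z) * b[j]
            ga += r * softplus_prime(z) * x * b[j]
            gb += r * softplus(z)
        new_a0.append(a0[j] - scale * g0)
        new_a.append(a[j] - scale * ga)
        new_b.append(b[j] - scale * gb)
    return new_a0, new_a, new_b

rng = random.Random(0)
m = 50
a0 = [rng.gauss(0.0, 1.0) for _ in range(m)]
a = [rng.gauss(0.0, 1.0) for _ in range(m)]
b = [rng.gauss(0.0, 1.0) for _ in range(m)]
xs, ys = [-1.0, 0.5, 1.0], [0.3, -0.2, 0.8]  # toy data, made up for the example

losses = [loss(a0, a, b, xs, ys)]
for _ in range(50):
    a0, a, b = euler_step(a0, a, b, xs, ys, dt=0.01)
    losses.append(loss(a0, a, b, xs, ys))
```

All three updates for neuron $j$ are computed from the values before the step, which mirrors the simultaneous evolution in (1).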

Proposition 1

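For $m = 1,2,3,\ldots $, $j = 1,2,\ldots ,m$ and $t \ge 0$, we have

$$\begin{aligned} \begin{aligned} F_{j} (t)&\le \left( F_{j} (0) + \frac{ \sigma (0) t }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } \right) \exp \left( \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } \, t \right) , \end{aligned} \end{aligned}$$

where $ F_{j} (t) := \vert a_{0,j} (t) \vert + \vert a_{j} (t) \vert + \vert b_{j} (t) \vert $.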

Proof

We begin by estimating $a_{j}(t)$. Let $\dot{a}_j(s) := \frac{\mathrm {d}}{\mathrm {d}s} a_j (s)$. By the fundamental theorem of calculus, the triangle inequality and (1), we have

$$\begin{aligned} \begin{aligned} \vert a_j (t) \vert&\le \vert a_j (0) \vert + \int _0^t \Big \vert \frac{1}{n} \sum _{i=1}^n \big ( \hat{y}_{i} (\theta (s)) - y_i \big ) \sigma ^{\prime } \big ( a_j (s) x_i + a_{0,j}(s) \big ) x_i \frac{ b_j (s) }{ \sqrt{m} } \Big \vert \mathrm {d}s \\&\le \vert a_j (0) \vert + \int _0^t \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \left( \frac{1}{n} \sum _{i=1}^n \vert \hat{y}_{i} ( \theta (s) ) - y_i \vert \right) \vert b_j(s) \vert \mathrm {d}s. \end{aligned} \end{aligned}$$

Since it holds that

$$\begin{aligned} \begin{aligned} \frac{1}{n} \sum _{i=1}^{n} \vert \hat{y}_{i}( \theta (s) ) - y_{i} \vert \le \sqrt{ L ( \theta (s) ) } \le \sqrt{ L ( \theta (0) ) } \end{aligned} \end{aligned}$$

(2)

by virtue of Jensen’s inequality and Lemma 1, we obtain

$$\begin{aligned} \begin{aligned} \vert a_j (t) \vert&\le \vert a_j (0) \vert + \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \sqrt{ L ( \theta (0) ) } \int _0^t \vert b_j(s) \vert \mathrm {d}s. \end{aligned} \end{aligned}$$

(3)
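For completeness, the first inequality in (2) can be spelled out as follows. Writing $r_{i} := \hat{y}_{i} ( \theta (s) ) - y_{i}$ (a shorthand introduced only for this remark), Jensen's inequality for the convex function $x \mapsto x^{2}$ gives

$$\begin{aligned} \left( \frac{1}{n} \sum _{i=1}^{n} \vert r_{i} \vert \right) ^{2} \le \frac{1}{n} \sum _{i=1}^{n} r_{i}^{2} = L ( \theta (s) ) , \end{aligned}$$

and taking square roots yields the claim. The second inequality in (2) is Lemma 1 applied at time $s$.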

Similarly, we have

$$\begin{aligned} \begin{aligned} \vert a_{0,j} (t) \vert&\le \vert a_{0,j} (0) \vert + \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } }{ \sqrt{m} } \sqrt{ L ( \theta (0) ) } \int _0^t \vert b_j(s) \vert \mathrm {d}s. \end{aligned} \end{aligned}$$

(4)

For $b_{j}(t)$, by estimating in a manner similar to $\vert a_{j}(t) \vert $, we get

$$\begin{aligned} \begin{aligned} \vert b_j (t) \vert&\le \vert b_j (0) \vert + \int _0^t \frac{1}{n} \sum _{i=1}^n \vert \hat{y}_{i} ( \theta (s) ) - y_i \vert \cdot \sigma \big ( a_j (s) x_i + a_{0,j} (s) \big ) \frac{1}{\sqrt{m}} \mathrm {d}s . \end{aligned} \end{aligned}$$

Using the estimate $ \sigma \big ( a_j x_i + a_{0,j} \big ) \le \sigma (0) + \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert \vert a_j \vert + \vert a_{0,j} \vert ) $ together with (2), we obtain

$$\begin{aligned} \begin{aligned} \vert b_j (t) \vert&\le \vert b_j (0) \vert + \frac{ \sigma (0) t }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } + \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } \int _0^t \big ( \vert a_j (s) \vert + \vert a_{0,j} (s) \vert \big ) \mathrm {d}s. \end{aligned} \end{aligned}$$

(5)

By putting estimates (3), (4) and (5) together, we have

$$\begin{aligned} \begin{aligned} F_{j} (t)&\le F_{j} (0) + \frac{ \sigma (0) t }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } + \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } \int _0^t F_{j} (s) \mathrm {d}s. \end{aligned} \end{aligned}$$

Now, by applying Grönwall’s inequality, we reach the conclusion.
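The form of Grönwall's inequality presumably intended here is the one with a non-decreasing forcing term; the shorthand $c_{1}$, $c_{2}$ below is used only to keep the display short. If a continuous function $F \ge 0$ satisfies $ F(t) \le \alpha (t) + c_{1} \int _{0}^{t} F(s) \mathrm {d}s $ for $t \ge 0$, with $\alpha $ non-decreasing and $c_{1} \ge 0$, then $ F(t) \le \alpha (t) \mathrm {e}^{ c_{1} t } $. Applying this with $ \alpha (t) = F_{j} (0) + c_{2} t $ gives

$$\begin{aligned} F_{j} (t) \le \big ( F_{j} (0) + c_{2} t \big ) \mathrm {e}^{ c_{1} t }, \qquad c_{1} = \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) }{\sqrt{m}} \sqrt{ L ( \theta (0) ) }, \quad c_{2} = \frac{ \sigma (0) }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } . \end{aligned}$$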

Proposition 2

For every $j = 1,2,\ldots , m$, we have

(i)
$\displaystyle \int _{0}^{t} F_{j} (u) \mathrm {d}u \le G_{j}(t) $,
(ii)
$\displaystyle \int _{0}^{t} \max \big \{ \vert \dot{a}_{0,j} (u) \vert , \vert \dot{a}_{j} (u) \vert , \vert \dot{b}_{j} (u) \vert \big \} \mathrm {d}u \le \sqrt{ \frac{ L ( \theta (0) ) }{ m } } \left\{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) G_{j}(t) + \sigma (0) t \right\} $,

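where $ F_{j} (u) := \vert a_{0,j} (u) \vert + \vert a_{j} (u) \vert + \vert b_{j} (u) \vert $ and

$$\begin{aligned} G_{j} (t) := \left( F_{j} (0) + \frac{ \sigma (0) }{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) } \right) t \, \exp \left( \frac{ 2 \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) }{ \sqrt{m} } \sqrt{ L ( \theta (0) ) } \, t \right) . \end{aligned}$$

(6)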

Note that each $ G_{j} (t) $ depends on the width m of the network.

Proof

(i) Put $ c_{1} = \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } $ and $ c_{2} = \frac{ \sigma (0) }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } $. Then by Proposition 1, we have

$$\begin{aligned} \begin{aligned} \int _{0}^{t} F_{j} (u) \mathrm {d}u&\le \int _{0}^{t} \left( F_{j} (0) + c_{2} u \right) \mathrm {e}^{ c_{1}u } \mathrm {d}u \le F_{j} (0) \frac{ \mathrm {e}^{c_{1}t} - 1 }{ c_{1} t } t + \frac{ c_{2} }{ c_{1} } t \cdot \mathrm {e}^{ c_{1} t }. \end{aligned} \end{aligned}$$

Since it holds that $ \frac{ \mathrm {e}^{x} - 1 }{ x } \le \mathrm {e}^{2x} $ for $x > 0$, we obtain

$$\begin{aligned} \begin{aligned} \int _{0}^{t} F_{j} (u) \mathrm {d}u&\le F_{j} (0) \mathrm {e}^{2c_{1}t} \cdot t + \frac{ c_{2} }{ c_{1} } t \cdot \mathrm {e}^{ c_{1} t } \le \left( F_{j}(0) + \frac{ c_{2} }{ c_{1} } \right) t \cdot \mathrm {e}^{ 2c_{1} t } = G_{j} (t). \end{aligned} \end{aligned}$$
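The elementary inequality used in the last step can be checked directly. For $x > 0$, a short verification reads

$$\begin{aligned} \mathrm {e}^{x} - 1 = \int _{0}^{x} \mathrm {e}^{s} \mathrm {d}s \le x \mathrm {e}^{x} \le x \mathrm {e}^{2x} , \end{aligned}$$

so that $ \frac{ \mathrm {e}^{x} - 1 }{ x } \le \mathrm {e}^{2x} $; the sharper bound $\mathrm {e}^{x}$ also holds, but the weaker one suffices here. The remaining step only uses $ \mathrm {e}^{ c_{1} t } \le \mathrm {e}^{ 2 c_{1} t } $ for $c_{1} t \ge 0$.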

(ii) We treat only $ \int _{0}^{t} \vert \dot{b}_{j} (u) \vert \mathrm {d}u $; the other parameters are handled in the same way. By (1) and (2), we get $ \int _{0}^{t} \vert \dot{b}_{j} (u) \vert \mathrm {d}u \le \sqrt{ \frac{ L ( \theta (0) ) }{ m } } \int _{0}^{t} \max _{1 \le i \le n} \sigma \big ( a_{j} (u) x_{i} + a_{0,j} (u) \big ) \mathrm {d}u $. Then, using $ \sigma \big ( a_{j} (u) x_{i} + a_{0,j} (u) \big ) \le \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) F_{j} (u) + \sigma (0) $ for every $i$ together with (i), we obtain the conclusion.

Proposition 3

For all $p > 0$, we have the following: $ \limsup _{m \rightarrow \infty } \mathbf {E} [ \big ( \sqrt{ L ( \theta (0) ) } \big )^{p} ] < \infty $, $ \limsup _{m \rightarrow \infty } \mathbf {E} [ G_{j} (t)^{p} ] < \infty $ and $ \limsup _{m \rightarrow \infty } \mathbf {E} [ \big ( \sqrt{ L ( \theta (0) ) } \, G_{j} (t) \big )^{p} ] < \infty $.

Proof

The last estimate follows from the first two estimates and the Cauchy–Schwarz inequality. Since the first estimate is obvious, we show only the second. For this, it is sufficient to show that

(7)

In the following, we write $a_{0,j} (0) = a_{0,j}$, $a_{j} (0) = a_{j}$ and $b_{j} (0) = b_{j}.$ First, we note that $ \sqrt{L ( \theta (0) )} \le \frac{1}{\sqrt{n}} \sum _{i=1}^{n} \vert \hat{y}_{i} ( \theta (0) ) - y_{i} \vert \le \frac{1}{\sqrt{n}} \sum _{i=1}^{n} \big ( \vert \hat{y}_{i} ( \theta (0) ) \vert + \vert y_{i} \vert \big ) $. Then by using Hölder’s inequality, we get

Since we have $ ( \hat{y}_{i} ( \theta (0) ) \mid \boldsymbol{a}_{0}, \boldsymbol{a} ) \sim \mathrm {N} \big ( 0, \frac{1}{m} \sum _{j=1}^{m} \sigma ( a_{j} x_{i} + a_{0,j} )^{2} \big ) $,

Furthermore, by Jensen’s inequality and independence,

We can show that $ \sigma ( a_{1} x_{i} + a_{0,1} )^{2} \le 16 \{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) \}^{2} \{ ( a_{0,1} )^{2} + ( a_{1} )^{2} \} + ( \sigma (0) )^{2} $. Hence

The right-hand-side is finite if $ \frac{ 8 p^{2} n \Vert \sigma ^{\prime } \Vert _{\infty }^{2} ( \vert \mathcal {X} \vert + 1 )^{2} }{ m } - \frac{1}{2} < 0 $, that is, $ m > 16 p^{2} n \Vert \sigma ^{\prime } \Vert _{\infty }^{2} ( \vert \mathcal {X} \vert + 1 )^{2} $, and then it is decreasing with respect to m. By putting all together, (7) is proved.

A.2 Proof of Theorem 1

It is enough to prove that both of $\{ A^{(m)}(t) \}_{m=1}^{\infty }$ and $\{ B^{(m)}(t) \}_{m=1}^{\infty }$ are tight. For this, from [3, Chapter I, Section 4, Theorem 4.3], it is sufficient to show that (i) $ \sup _{m} \mathbf {E} \big [ \vert A_{0}^{(m)}(t) \vert + \vert B_{0}^{(m)} (t) \vert \big ] < \infty $ and (ii) there exist $\gamma , \alpha > 0$ such that

$$\begin{aligned} \begin{aligned} \sup _{m} \sup _{ \begin{array}{c} s, u \in [0,1]: \\ s \ne u \end{array} } \left( \frac{ \mathbf {E} \big [ \vert A_{s}^{(m)} (t) - A_{u}^{(m)}(t) \vert ^{\gamma } \big ] }{ \vert s - u \vert ^{ 1 + \alpha } } + \frac{ \mathbf {E} \big [ \vert B_{s}^{(m)} (t) - B_{u}^{(m)}(t) \vert ^{\gamma } \big ] }{ \vert s - u \vert ^{ 1 + \alpha } } \right) < \infty . \end{aligned} \end{aligned}$$

(i) is clear since $ A_{0}^{(m)} (t) = B_{0}^{(m)} (t) = 0 $. Thus we show only (ii). We will only show the one for $A^{(m)}(t)$. Since $ A^{(m)} (t) $ is a piecewise linear interpolation of values on $\{ s_{k} = \frac{k}{m} \}_{k=0}^{m}$, it suffices to show that for some $\gamma , \alpha > 0$, it holds that

$$\begin{aligned} \sup _{m} \sup _{ \begin{array}{c} 1 \le k, j \le m: \\ k \ne j \end{array} } \frac{ \mathbf {E} \big [ \vert A_{s_{k}}^{(m)} (t) - A_{s_{j}}^{(m)}(t) \vert ^{\gamma } \big ] }{ \vert s_{k} - s_{j} \vert ^{ 1 + \alpha } } < \infty . \end{aligned}$$

(8)
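To make the object in (8) concrete, the piecewise-linear process can be sketched in a few lines. The construction below assumes, as suggested by (9), that $A^{(m)}_{s_k}(t) = \frac{1}{\sqrt{m}} \sum _{l=1}^{k} a_l(t)$ at the knots $s_k = k/m$ with linear interpolation in between; the function names are hypothetical and the example only uses values at initialization.

```python
import math
import random

def partial_sum_process(values):
    # knots[k] = A_{k/m} = m^{-1/2} * (values[0] + ... + values[k-1]); knots[0] = 0.
    m = len(values)
    knots = [0.0]
    for v in values:
        knots.append(knots[-1] + v / math.sqrt(m))
    return knots

def evaluate(knots, s):
    # Piecewise-linear interpolation of the knot values at a point s in [0, 1].
    m = len(knots) - 1
    pos = s * m
    k = min(int(pos), m - 1)  # index of the knot to the left (clamped at s = 1)
    frac = pos - k
    return (1.0 - frac) * knots[k] + frac * knots[k + 1]

# At initialization the a_l(0) are i.i.d. N(0, 1), so the process is a rescaled random walk.
rng = random.Random(0)
a_init = [rng.gauss(0.0, 1.0) for _ in range(1000)]
path = partial_sum_process(a_init)
midpoint_value = evaluate(path, 0.5)
```

Because the path is linear between knots, increments over arbitrary $s, u$ are controlled by increments between knots, which is why the reduction to (8) is possible.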

Let $k,j \in \{ 1,2,\ldots , m \}$ be arbitrary. Without loss of generality, we assume that $j < k$. Then we have

$$\begin{aligned} \begin{aligned} \vert A_{s_{k}}^{(m)} (t) - A_{s_{j}}^{(m)}(t) \vert&\le \frac{1}{\sqrt{m}} \Big \vert \sum _{l=j+1}^{k} \big ( a_l (t) - \mathbf {E} [ a_l (t) ] \big ) \Big \vert + \frac{1}{\sqrt{m}} \Big \vert \sum _{l=j+1}^{k} \mathbf {E} [ a_l (t) ] \Big \vert . \end{aligned} \end{aligned}$$

(9)

We now estimate the two terms on the right-hand side.

Lemma 2

With $G_{l}(t)$ defined in (6), we have

$$\begin{aligned} \begin{aligned} \Big \vert \sum _{l=j+1}^k \big ( a_l (t) - \mathbf {E} [ a_l (t) ] \big ) \Big \vert \le \Big \vert \sum _{l=j+1}^k a_l (0) \Big \vert + \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \sum _{l=j+1}^{k} \big ( \sqrt{ L ( \theta (0) ) } \, G_{l}(t) + \mathbf {E} [ \sqrt{ L ( \theta (0) ) } \, G_{l}(t) ] \big ) . \end{aligned} \end{aligned}$$

Proof

Since $\mathbf {E} [ a_{l}(0) ] = 0$, we have $ a_l (t) - \mathbf {E} [ a_l (t) ] = \int _0^t \dot{a}_l (u) \mathrm {d}u - \int _0^t \mathbf {E} [ \dot{a}_l (u) ] \mathrm {d}u + a_l (0) $. By summing up this over $l=j+1 , j+2, \ldots , k$ and by using (1) and (2),

$$\begin{aligned} \begin{aligned} \Big \vert \sum _{l=j+1}^k \big ( a_l (t) - \mathbf {E} [ a_l (t) ] \big ) \Big \vert \le \Big \vert \sum _{l=j+1}^k a_l (0) \Big \vert&+ \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \sqrt{ L ( \theta (0) ) } \sum _{l=j+1}^k \int _0^t \vert b_l (u) \vert \mathrm {d}u \\&\!\!\!+ \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \mathbf {E} \big [ \sqrt{ L ( \theta (0) ) } \sum _{l=j+1}^k \int _0^t \vert b_l (u) \vert \mathrm {d} u \big ] . \end{aligned} \end{aligned}$$

Finally, by applying Proposition 2, we get the conclusion.

Lemma 3

We have $\displaystyle \Big \vert \sum _{l=j+1}^k \mathbf {E} [ a_l (t) ] \Big \vert \le \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \mathbf {E} \big [ \sqrt{ L ( \theta (0) ) } \, \sum _{l=j+1}^k G_{l} (t) \big ] . $

Proof

By (1), $ \mathbf {E} [ a_l (t) ] = \int _0^t \mathbf {E} [ -\frac{1}{n} \sum _{i=1}^n \big ( \hat{y}_{i} ( \theta (u) ) - y_i \big ) \sigma ^{\prime } \big ( a_l (u) x_i + a_{0,l} (u) \big ) x_i \frac{ b_l (u) }{ \sqrt{m} } ] \mathrm {d}u $. By taking the sum over $l = j+1, j+2, \ldots , k$, we have

$$\begin{aligned} \begin{aligned} \Big \vert \sum _{l=j+1}^k \mathbf {E} [ a_l (t) ] \Big \vert&\le \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \mathbf {E} \Big [ \sqrt{ L ( \theta (0) ) } \sum _{l=j+1}^k \int _0^t \vert b_l (u) \vert \mathrm {d}u \Big ] . \end{aligned} \end{aligned}$$

Then by using Proposition 2, we reach the conclusion.

Turning back to Eq. (9), we apply Lemma 2 and Lemma 3 to get

$$\begin{aligned} \begin{aligned} \vert A_{s_{k}}^{(m)} (t) - A_{s_{j}}^{(m)}(t) \vert&\le \frac{1}{\sqrt{m}} \Big \vert \sum _{l=j+1}^{k} a_{l} (0) \Big \vert + \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{m} \sum _{l=j+1}^{k} \big ( 2 H_{l} (t) + \mathbf {E} [ H_{l} (t) ] \big ) , \end{aligned} \end{aligned}$$

where $ H_{l} (t) = \sqrt{ L ( \theta (0) ) } \, G_{l} (t) $. By the elementary estimate $ ( x+y )^{4} \le 2^{4} ( x^{4} + y^{4} ) $,

$$\begin{aligned} \begin{aligned}&\big ( A_{s_{k}}^{(m)} (t) - A_{s_{j}}^{(m)}(t) \big )^{4} \\&\le \frac{2^{4}}{m^{2}} \left( \sum _{l=j+1}^{k} a_{l} (0) \right) ^{4} + 2^{4} \Vert \sigma ^{\prime } \Vert _{\infty }^{4} \vert \mathcal {X} \vert ^{4} \left( \frac{ k-j }{ m } \right) ^{4} \left( \frac{1}{k-j} \sum _{l=j+1}^{k} \big ( 2H_{l} (t) + \mathbf {E} [ H_{l} (t) ] \big ) \right) ^{4} . \end{aligned} \end{aligned}$$

Therefore $ \mathbf {E} [ \big ( A_{s_{k}}^{(m)} (t) - A_{s_{j}}^{(m)}(t) \big )^{4} ] \le \frac{2^{4}}{m^{2}} I + 2^{4} \Vert \sigma ^{\prime } \Vert _{\infty }^{4} \vert \mathcal {X} \vert ^{4} ( s_{k} - s_{j} )^{4} I\!I $. Here,

$$\begin{aligned} I := \mathbf {E} [ \Big ( \sum _{l=j+1}^{k} a_{l} (0) \Big )^{4} ], \quad I\!I := \mathbf {E} [ \Big ( \frac{1}{k-j} \sum _{l=j+1}^{k} \big ( 2H_{l} (t) + \mathbf {E} [ H_{l} (t) ] \big ) \Big )^{4} ] . \end{aligned}$$

First, we shall focus on $I\!I$. By Jensen’s inequality,

$$\begin{aligned} \begin{aligned} I\!I \le \frac{1}{k-j} \sum _{l=j+1}^{k} \mathbf {E} [ \big ( 2H_{l} (t) + \mathbf {E} [ H_{l} (t) ] \big )^{4} ] = \mathbf {E} [ \big ( 2H_{1} (t) + \mathbf {E} [ H_{1} (t) ] \big )^{4} ] . \end{aligned} \end{aligned}$$

On the other hand, for I, since $ a_{1}(0), a_{2}(0), \ldots , a_{m}(0) $ are independent and identically distributed, and each of them is distributed in $\mathrm {N} ( 0, 1 )$, we have $ I = 3 (k-j)^{2} $. Hence

$$\begin{aligned} \begin{aligned}&\mathbf {E} [ \big ( A_{s_{k}}^{(m)} (t) - A_{s_{j}}^{(m)}(t) \big )^{4} ] \\&\le 2^{4} \cdot 3 ( s_{k} - s_{j} )^{2} + 2^{4} \Vert \sigma ^{\prime } \Vert _{\infty }^{4} \vert \mathcal {X} \vert ^{4} ( s_{k} - s_{j} )^{4} \mathbf {E} [ \big ( 2 H_{1} (t) + \mathbf {E} [ H_{1} (t) ] \big )^{4} ]. \end{aligned} \end{aligned}$$
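The value $I = 3 (k-j)^{2}$ follows from the fourth moment of a Gaussian. As a brief sketch, put $ S := \sum _{l=j+1}^{k} a_{l} (0) $ (notation used only here). As a sum of $k-j$ independent $\mathrm {N}(0,1)$ variables, $S \sim \mathrm {N} ( 0, k-j )$, and since $\mathbf {E} [ Z^{4} ] = 3$ for $Z \sim \mathrm {N}(0,1)$,

$$\begin{aligned} I = \mathbf {E} [ S^{4} ] = (k-j)^{2} \, \mathbf {E} [ Z^{4} ] = 3 (k-j)^{2}, \qquad \frac{2^{4}}{m^{2}} I = 2^{4} \cdot 3 \left( \frac{k-j}{m} \right) ^{2} = 2^{4} \cdot 3 ( s_{k} - s_{j} )^{2} . \end{aligned}$$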

Finally, by noting Proposition 3, we see that (8) holds for $\gamma = 4$ and $\alpha = 1$.

A.3 Proof of Theorem 2

By the law of large numbers, we see that $ \frac{1}{m} \sum _{j=1}^m \big ( b_j (0) \big )^{2} \rightarrow \mathbf {E} [ \big ( b_j (0) \big )^{2} ] = 1 $ as $m \rightarrow \infty $. Then it suffices to show that

$$\begin{aligned} \begin{aligned}&\mathbf {E} \big [ \Big \vert \frac{1}{m} \sum _{j=1}^m \big ( b_j (t)- \mathbf {E} [ b_j (t) ] \big )^2 - \frac{1}{m} \sum _{j=1}^m \big ( b_j (0) \big )^{2} \Big \vert \big ] \rightarrow 0. \end{aligned} \end{aligned}$$

Since $ b_j (t) - \mathbf {E} [ b_j (t) ] = b_j (0) + \int _0^t \big ( \dot{b}_j (u) - \mathbf {E} [ \dot{b}_j (u) ] \big ) \mathrm {d}u $, we have $ ( b_j (t) - \mathbf {E} [ b_j (t) ] )^{2} - ( b_j (0) )^{2} = \big ( \int _0^t \big ( \dot{b}_j (u) - \mathbf {E} [ \dot{b}_j (u) ] \big ) \mathrm {d}u \big )^2 + 2 b_j (0) \int _0^t ( \dot{b}_j (u) - \mathbf {E} [ \dot{b}_j (u) ] ) \mathrm {d}u $. Thus we have

$$\begin{aligned} \begin{aligned}&\Big \vert \frac{1}{m} \sum _{j=1}^m \big ( b_j (t) - \mathbf {E} [ b_j (t) ] \big )^2 - \frac{1}{m} \sum _{j=1}^m \big ( b_j (0) \big )^{2} \Big \vert \\&\le \frac{1}{m} \sum _{j=1}^m \Big \vert \left( \int _0^t \big ( \dot{b}_j (u) - \mathbf {E} [ \dot{b}_j (u) ] \big ) \mathrm {d}u \right) ^2 + 2 b_j (0) \int _0^t \big ( \dot{b}_j (u) - \mathbf {E} [ \dot{b}_j (u) ] \big ) \mathrm {d}u \Big \vert . \end{aligned} \end{aligned}$$

By taking the expectation, we get

$$\begin{aligned} \begin{aligned}&\mathbf {E} \big [ \Big \vert \frac{1}{m} \sum _{j=1}^m \big ( b_j (t)- \mathbf {E} [ b_j (t) ] \big )^2 - \frac{1}{m} \sum _{j=1}^m \big ( b_j (0) \big )^{2} \Big \vert \big ] \\&\le \frac{1}{m} \sum _{j=1}^m \left\{ \mathbf {E} \big [ \left( \int _0^t \big ( \vert \dot{b}_j (u) \vert + \mathbf {E} \big [ \vert \dot{b}_j (u) \vert \big ] \big ) \mathrm {d}u \right) ^2 \big ] + 2 \mathbf {E} \big [ \vert b_j (0) \vert \int _0^t \big ( \vert \dot{b}_j (u) \vert + \mathbf {E} \big [ \vert \dot{b}_j (u) \vert \big ] \big ) \mathrm {d}u \big ] \right\} . \end{aligned} \end{aligned}$$

For the term $\int _{0}^{t} \vert \dot{b}_{j} (u) \vert \mathrm {d}u$ appearing above, we know by Proposition 2 that

$$\begin{aligned} \begin{aligned} \int _{0}^{t} \vert \dot{b}_{j} (u) \vert \mathrm {d}u&\le \sqrt{ \frac{ L ( \theta (0) ) }{ m } } \left\{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) G_{j}(t) + \sigma (0) t \right\} =: \frac{ M_{j} (t) }{ \sqrt{m} }, \end{aligned} \end{aligned}$$

where we note that $M_j (t)$ depends on the width m. Thus, $ \int _{0}^{t} \big ( \vert \dot{b}_{j} (u) \vert + \mathbf {E} \big [ \vert \dot{b}_{j} (u) \vert \big ] \big ) \mathrm {d}u \le \frac{ M_{j} (t) + \mathbf {E} [ M_{j} (t) ] }{ \sqrt{m} } $. By Proposition 3, we have $\displaystyle \limsup _{m \rightarrow \infty } \mathbf {E} \big [ \big ( M_{1}(t) + \mathbf {E} [ M_{1}(t) ] \big )^2 \big ] < \infty $ and $\displaystyle \limsup _{m \rightarrow \infty } \mathbf {E} \big [ \vert b_{1} (0) \vert \big ( M_{1}(t) + \mathbf {E} [ M_{1}(t) ] \big ) \big ] < \infty $. Hence as $m \rightarrow \infty $,

$$\begin{aligned} \begin{aligned}&\mathbf {E} \big [ \Big \vert \frac{1}{m} \sum _{j=1}^m \big ( b_j (t)- \mathbf {E} [ b_j (t) ] \big )^2 - \frac{1}{m} \sum _{j=1}^m \big ( b_j (0) \big )^{2} \Big \vert \big ]\\&\le \frac{1}{m} \sum _{j=1}^m \left\{ \frac{ \mathbf {E} \big [ \big ( M_{j} (t) + \mathbf {E} [ M_{j} (t) ] \big )^2 \big ] }{ m } + 2 \frac{ \mathbf {E} \big [ \vert b_j (0) \vert \big ( M_{j} (t) + \mathbf {E} [ M_{j} (t) ] \big ) \big ] }{ \sqrt{m} } \right\} \\&= \frac{ \mathbf {E} \big [ \big ( M_{1}(t) + \mathbf {E} [ M_{1}(t) ] \big )^2 \big ] }{ m } + 2 \frac{ \mathbf {E} \big [ \vert b_{1} (0) \vert \big ( M_{1}(t) + \mathbf {E} [ M_{1}(t) ] \big ) \big ] }{ \sqrt{m} } \rightarrow 0. \end{aligned} \end{aligned}$$
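The conservation statement can be illustrated numerically, with caveats. The sketch below integrates the flow with explicit Euler steps and compares $\frac{1}{m} \sum _{j} b_{j}^{2}$ before and after training. It is a simplified illustration, not the quantity in the proof: the centering by $\mathbf {E} [ b_{j} (t) ]$ is omitted (estimating it would require averaging over many initializations), and the softplus activation, toy data, width, horizon and the function name `train_b_energy` are all assumptions made here.

```python
import math
import random

def softplus(z):
    # Assumed activation (non-negative, 1-Lipschitz).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def train_b_energy(m, xs, ys, t_end, dt, seed=0):
    # Returns (1/m) * sum_j b_j^2 at time 0 and at time t_end under Euler steps.
    rng = random.Random(seed)
    a0 = [rng.gauss(0.0, 1.0) for _ in range(m)]
    a = [rng.gauss(0.0, 1.0) for _ in range(m)]
    b = [rng.gauss(0.0, 1.0) for _ in range(m)]
    n = len(xs)
    scale = dt / (n * math.sqrt(m))
    energy_start = sum(v * v for v in b) / m
    for _ in range(int(round(t_end / dt))):
        act = [[softplus(a[j] * x + a0[j]) for x in xs] for j in range(m)]
        res = [sum(act[j][i] * b[j] for j in range(m)) / math.sqrt(m) - ys[i]
               for i in range(n)]  # residuals y_hat_i - y_i
        for j in range(m):
            g0 = ga = gb = 0.0
            for i in range(n):
                z = a[j] * xs[i] + a0[j]
                slope = 0.5 * (1.0 + math.tanh(0.5 * z))  # derivative of softplus
                g0 += res[i] * slope * b[j]
                ga += res[i] * slope * xs[i] * b[j]
                gb += res[i] * act[j][i]
            # All three gradients use the pre-step values of neuron j.
            a0[j] -= scale * g0
            a[j] -= scale * ga
            b[j] -= scale * gb
    energy_end = sum(v * v for v in b) / m
    return energy_start, energy_end

e0, e1 = train_b_energy(m=1000, xs=[-1.0, 0.5, 1.0], ys=[0.3, -0.2, 0.8],
                        t_end=0.5, dt=0.01)
drift = abs(e1 - e0)
```

In such a run the drift is expected to shrink as the width grows, in line with the $1/m$ and $1/\sqrt{m}$ factors in the final display, while `e0` stays close to 1 by the law of large numbers.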

About this paper

Cite this paper

Eguchi, S., Amaba, T. (2021). Energy Conservation in Infinitely Wide Neural-Networks. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2021. ICANN 2021. Lecture Notes in Computer Science, vol 12894. Springer, Cham. https://doi.org/10.1007/978-3-030-86380-7_15

DOI: https://doi.org/10.1007/978-3-030-86380-7_15
Published: 07 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86379-1
Online ISBN: 978-3-030-86380-7
eBook Packages: Computer Science (R0)