Energy Conservation in Infinitely Wide Neural-Networks

  • Conference paper

Artificial Neural Networks and Machine Learning – ICANN 2021 (ICANN 2021)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12894)


Abstract

A three-layered neural-network (NN), which consists of an input layer, a wide hidden layer and an output layer, has three types of parameters. Two of them are pre-neuronal: the thresholds and the weights applied to the input data. The remaining one is the post-neuronal weights applied after activation. The present paper consists of two parts. First, we consider three types of stochastic processes, each constructed by summing one type of parameter over all neurons at each epoch; the neuron number is regarded as a second time parameter, distinct from the epoch. In a wide neural-network with the neural-tangent-kernel (NTK) parametrization, it is well known that the parameters hardly move from their initial values during learning. We show, however, that the stochastic process associated with the post-neuronal parameters does vary during learning, whereas the stochastic processes associated with the pre-neuronal parameters do not. This result makes it possible to distinguish the type of a parameter by inspecting its associated stochastic process. Second, we show that the variance (a sort of “energy”) of the parameters in the infinitely wide neural-network is conserved during learning, and thus gives a conserved quantity of the learning dynamics.


References

  1. Chizat, L., Oyallon, E., Bach, F.: On lazy training in differentiable programming. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32 (NeurIPS 2019). Curran Associates, Inc. (2019)

  2. Goldberg, P., Williams, C., Bishop, C.: Regression with input-dependent noise: a Gaussian process treatment. In: Advances in Neural Information Processing Systems, vol. 10 (NIPS 1997). MIT Press (1998)

  3. Ikeda, N., Watanabe, S.: Stochastic Differential Equations and Diffusion Processes, 2nd edn. North-Holland Mathematical Library, vol. 24. North-Holland Publishing Co., Amsterdam; Kodansha Ltd., Tokyo (1989). ISBN 0-444-87378-3

  4. Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: convergence and generalization in neural networks. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp. 8571–8580. Curran Associates, Inc. (2018)

  5. Lee, J., Bahri, Y., Novak, R., Schoenholz, S., Pennington, J., Sohl-Dickstein, J.: Deep neural networks as Gaussian processes. In: International Conference on Learning Representations (ICLR 2018) (2018)

  6. Lee, J., et al.: Wide neural networks of any depth evolve as linear models under gradient descent. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32 (NeurIPS 2019). Curran Associates, Inc. (2019)

  7. Neal, R.M.: Priors for infinite networks. In: Bayesian Learning for Neural Networks, pp. 29–53. Springer, New York (1996). https://doi.org/10.1007/978-1-4612-0745-0_2

  8. Sonoda, S., Murata, N.: Neural network with unbounded activation functions is universal approximator. Appl. Comput. Harmon. Anal. 43(2), 233–268 (2017)

  9. Suzuki, T.: Generalization bound of globally optimal non-convex neural network training: transportation map estimation by infinite dimensional Langevin dynamics. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems, vol. 33 (NeurIPS 2020), pp. 19224–19237. Curran Associates, Inc. (2020)

  10. Suzuki, T., Akiyama, S.: Benefit of deep learning with non-convex noisy gradient descent: provable excess risk bound and superiority to kernel methods. In: International Conference on Learning Representations (ICLR 2021) (2021)

  11. Murata, N.: An integral representation of functions using three-layered networks and their approximation bounds. Neural Netw. 9(6), 947–956 (1996)

  12. Williams, C.: Computing with infinite networks. In: Mozer, M.C., Jordan, M., Petsche, T. (eds.) Advances in Neural Information Processing Systems, vol. 9 (NIPS 1996). MIT Press (1997)


Acknowledgments

The authors would like to express their appreciation to Professor Masaru Tanaka and Professor Jun Fujiki, who provided valuable comments and advice.

Corresponding author

Correspondence to Shu Eguchi.


A Proof of Theorem 1 and Theorem 2

Recall that the activation function \(\sigma \) has been assumed to be non-negative and Lipschitz continuous. Then \(\sigma \) is differentiable almost everywhere, and its Lipschitz constant can be expressed as \( \Vert \sigma ^{\prime } \Vert _{\infty } := \mathrm {ess}\sup \vert \sigma ^{\prime } \vert \), where \(\sigma ^{\prime }\) is the almost-everywhere-defined derivative of \(\sigma \). We put \( \vert \mathcal {X} \vert := \max _{i = 1,2,\ldots , n} \vert x_{i} \vert \), where \(\{ x_i \}_{i=1}^{n}\) is the input data. Note that the loss function \( L ( \theta ) = \frac{1}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i} ( \theta ) - y_{i} \big )^{2} \) depends on the width m, as do the outputs \( \hat{y}_{i} ( \theta ) = \frac{1}{\sqrt{m}} \sum _{j=1}^{m} \sigma ( a_{j} x_{i} + a_{0,j} ) b_{j} \).
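
To make the setting concrete, the following NumPy sketch implements the forward pass and the loss in exactly this NTK parametrization. It is only an illustration: the array names, the choice of ReLU (one admissible non-negative Lipschitz activation) and the toy data are ours, not part of the paper.

```python
import numpy as np

def init_params(m, rng):
    """Thresholds a0, input weights a and output weights b, each i.i.d. N(0, 1)."""
    return rng.standard_normal(m), rng.standard_normal(m), rng.standard_normal(m)

def predict(x, a0, a, b, sigma=lambda z: np.maximum(z, 0.0)):
    """NTK-parametrized output: y_hat_i = (1/sqrt(m)) * sum_j sigma(a_j x_i + a0_j) b_j."""
    m = a0.shape[0]
    pre = np.outer(x, a) + a0            # shape (n, m): a_j x_i + a0_j
    return sigma(pre) @ b / np.sqrt(m)   # shape (n,)

def loss(x, y, a0, a, b):
    """L(theta) = (1/n) sum_i (y_hat_i - y_i)^2."""
    return np.mean((predict(x, a0, a, b) - y) ** 2)

rng = np.random.default_rng(0)
a0, a, b = init_params(m=1000, rng=rng)
x = np.linspace(-1.0, 1.0, 20)           # toy inputs, |X| = 1
y = np.sin(np.pi * x)                    # toy targets
print(loss(x, y, a0, a, b))
```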

A.1 Preliminaries on the Gradient Flow \(\frac{\mathrm {d}}{\mathrm {d} t} \theta (t) = - \frac{1}{2} ( \nabla _{\theta } L ) ( \theta (t) )\)

Lemma 1

Along the gradient flow, we have \( L( \theta (t) ) \le L( \theta (0) ) \) for \(t\ge 0\).

In the coordinates \( \theta (t) = ( \boldsymbol{a}_{0}(t), \boldsymbol{a}(t), \boldsymbol{b}(t) ) = ( \{ a_{0,j}(t) \}_{j=1}^{m}, \{ a_{j}(t) \}_{j=1}^{m},\)\( \{ b_{j}(t) \}_{j=1}^{m} ) \), the gradient flow \( \frac{\mathrm {d}}{\mathrm {d} t} \theta (t) = - \frac{1}{2} ( \nabla _{\theta } L) ( \theta (t) ) \) can be written as follows: for \(j = 1,2,\ldots , m\) and \(t \in \mathbb {R}\),

$$\begin{aligned} \begin{aligned} \frac{\mathrm {d}}{\mathrm {d}t} a_{0,j} (t)&= - \frac{1}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i}( \theta (t) ) - y_{i} \big ) \sigma ^{\prime } \big ( a_{j}(t) x_{i} + a_{0,j}(t) \big ) \frac{ b_{j}(t) }{ \sqrt{m} }, \\ \frac{\mathrm {d}}{\mathrm {d}t} a_{j} (t)&= - \frac{1}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i}( \theta (t) ) - y_{i} \big ) \sigma ^{\prime } \big ( a_{j}(t) x_{i} + a_{0,j}(t) \big ) x_{i} \frac{ b_{j}(t) }{ \sqrt{m} }, \\ \frac{\mathrm {d}}{\mathrm {d}t} b_{j} (t)&= - \frac{1}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i}( \theta (t) ) - y_{i} \big ) \sigma \big ( a_{j}(t) x_{i} + a_{0,j}(t) \big ) \frac{ 1 }{ \sqrt{m} }. \end{aligned} \end{aligned}$$
(1)
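
For example, the third equation in (1) is simply the chain rule applied to the loss: since \( \frac{\partial \hat{y}_{i}}{\partial b_{j}} ( \theta ) = \frac{1}{\sqrt{m}} \sigma ( a_{j} x_{i} + a_{0,j} ) \), we have

$$\begin{aligned} \frac{\partial L}{\partial b_{j}} ( \theta ) = \frac{2}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i} ( \theta ) - y_{i} \big ) \frac{ \sigma ( a_{j} x_{i} + a_{0,j} ) }{ \sqrt{m} }, \end{aligned}$$

and multiplying by the factor \( -\frac{1}{2} \) from the gradient flow gives the stated equation for \( b_{j} \); the equations for \( a_{j} \) and \( a_{0,j} \) follow in the same way, the inner derivative producing the additional factors \( \sigma ^{\prime } ( a_{j} x_{i} + a_{0,j} ) x_{i} b_{j} \) and \( \sigma ^{\prime } ( a_{j} x_{i} + a_{0,j} ) b_{j} \), respectively.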

Proposition 1

For \(m = 1,2,3,\ldots \), \(j = 1,2,\ldots ,m\) and \(t \ge 0\), we have

$$\begin{aligned} F_{j} (t) \le \left( F_{j} (0) + \frac{ \sigma (0) t }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } \right) \exp \left( \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } \, t \right) , \end{aligned}$$

where \( F_{j} (t) := \vert a_{0,j} (t) \vert + \vert a_{j} (t) \vert + \vert b_{j} (t) \vert \).

Proof

We begin by estimating \(a_{j}(t)\). Let \(\dot{a}_j(s) := \frac{\mathrm {d}}{\mathrm {d}s} a_j (s)\). By the fundamental theorem of calculus, the triangle inequality and (1), we have

$$\begin{aligned} \begin{aligned} \vert a_j (t) \vert&\le \vert a_j (0) \vert + \int _0^t \Big \vert \frac{1}{n} \sum _{i=1}^n \big ( \hat{y}_{i} (\theta (s)) - y_i \big ) \sigma ^{\prime } \big ( a_j (s) x_i + a_{0,j}(s) \big ) x_i \frac{ b_j (s) }{ \sqrt{m} } \Big \vert \mathrm {d}s \\&\le \vert a_j (0) \vert + \int _0^t \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \left( \frac{1}{n} \sum _{i=1}^n \vert \hat{y}_{i} ( \theta (s) ) - y_i \vert \right) \vert b_j(s) \vert \mathrm {d}s. \end{aligned} \end{aligned}$$

Since it holds that

$$\begin{aligned} \begin{aligned} \frac{1}{n} \sum _{i=1}^{n} \vert \hat{y}_{i}( \theta (s) ) - y_{i} \vert \le \sqrt{ L ( \theta (s) ) } \le \sqrt{ L ( \theta (0) ) } \end{aligned} \end{aligned}$$
(2)

by virtue of Jensen’s inequality and Lemma 1, we obtain

$$\begin{aligned} \begin{aligned} \vert a_j (t) \vert&\le \vert a_j (0) \vert + \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \sqrt{ L ( \theta (0) ) } \int _0^t \vert b_j(s) \vert \mathrm {d}s. \end{aligned} \end{aligned}$$
(3)
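
For clarity, the first inequality in (2) is the Cauchy–Schwarz (equivalently, Jensen) bound

$$\begin{aligned} \frac{1}{n} \sum _{i=1}^{n} \vert \hat{y}_{i}( \theta (s) ) - y_{i} \vert \le \left( \frac{1}{n} \sum _{i=1}^{n} \big ( \hat{y}_{i}( \theta (s) ) - y_{i} \big )^{2} \right) ^{1/2} = \sqrt{ L ( \theta (s) ) }, \end{aligned}$$

and the second inequality in (2) is exactly Lemma 1.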

Similarly, we have

$$\begin{aligned} \begin{aligned} \vert a_{0,j} (t) \vert&\le \vert a_{0,j} (0) \vert + \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } }{ \sqrt{m} } \sqrt{ L ( \theta (0) ) } \int _0^t \vert b_j(s) \vert \mathrm {d}s. \end{aligned} \end{aligned}$$
(4)

For \(b_{j}(t)\), estimating in the same manner as for \(\vert a_{j}(t) \vert \), we get

$$\begin{aligned} \begin{aligned} \vert b_j (t) \vert&\le \vert b_j (0) \vert + \int _0^t \frac{1}{n} \sum _{i=1}^n \vert \hat{y}_{i} ( \theta (s) ) - y_i \vert \cdot \sigma \big ( a_j (s) x_i + a_{0,j} (s) \big ) \frac{1}{\sqrt{m}} \mathrm {d}s . \end{aligned} \end{aligned}$$

Using the estimate \( \sigma \big ( a_j x_i + a_{0,j} \big ) \le \sigma (0) + \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert \vert a_j \vert + \vert a_{0,j} \vert ) \) together with (2), we obtain

$$\begin{aligned} \begin{aligned} \vert b_j (t) \vert&\le \vert b_j (0) \vert + \frac{ \sigma (0) t }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } + \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } \int _0^t \big ( \vert a_j (s) \vert + \vert a_{0,j} (s) \vert \big ) \mathrm {d}s. \end{aligned} \end{aligned}$$
(5)

By putting estimates (3), (4) and (5) together, we have

$$\begin{aligned} \begin{aligned} F_{j} (t)&\le F_{j} (0) + \frac{ \sigma (0) t }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } + \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } \int _0^t F_{j} (s) \mathrm {d}s. \end{aligned} \end{aligned}$$

Now, by applying Grönwall’s inequality, we reach the conclusion.
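
Here Grönwall's inequality is used in its standard integral form: if \( u(t) \le \alpha (t) + c \int _0^t u(s) \, \mathrm {d}s \) for all \(t \ge 0\), with \(u\) continuous, \(\alpha \) non-decreasing and \(c \ge 0\) a constant, then

$$\begin{aligned} u(t) \le \alpha (t) \, \mathrm {e}^{ct}, \qquad t \ge 0. \end{aligned}$$

Applying this with \( u = F_{j} \), \( \alpha (t) = F_{j}(0) + \frac{ \sigma (0) t }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } \) and \( c = \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } \) yields the bound stated in Proposition 1.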

Proposition 2

For every \(j = 1,2,\ldots , m\), we have

  (i) \(\displaystyle \int _{0}^{t} F_{j} (u) \mathrm {d}u \le G_{j}(t) \),

  (ii) \(\displaystyle \int _{0}^{t} \max \big \{ \vert \dot{a}_{0,j} (u) \vert , \vert \dot{a}_{j} (u) \vert , \vert \dot{b}_{j} (u) \vert \big \} \mathrm {d}u \le \sqrt{ \frac{ L ( \theta (0) ) }{ m } } \left\{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) G_{j}(t) + \sigma (0) t \right\} \),

where \( F_{j} (u) := \vert a_{0,j} (u) \vert + \vert a_{j} (u) \vert + \vert b_{j} (u) \vert \) and

$$\begin{aligned} G_{j} (t) := \left( F_{j}(0) + \frac{ \sigma (0) }{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) } \right) t \, \exp \left( \frac{ 2 \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) }{ \sqrt{m} } \sqrt{ L ( \theta (0) ) } \, t \right) . \end{aligned}$$
(6)

Note that each \( G_{j} (t) \) depends on the width m of the network.

Proof

(i) Put \( c_{1} = \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } \) and \( c_{2} = \frac{ \sigma (0) }{\sqrt{m}} \sqrt{ L ( \theta (0) ) } \). Then by Proposition 1, we have

$$\begin{aligned} \begin{aligned} \int _{0}^{t} F_{l} (u) \mathrm {d}u&\le \int _{0}^{t} \left( F_{l} (0) + c_{2} u \right) \mathrm {e}^{ c_{1}u } \mathrm {d}u \le F_{l} (0) \frac{ \mathrm {e}^{c_{1}t} - 1 }{ c_{1} t } t + \frac{ c_{2} }{ c_{1} } t \cdot \mathrm {e}^{ c_{1} t }. \end{aligned} \end{aligned}$$

Since it holds that \( \frac{ \mathrm {e}^{x} - 1 }{ x } \le \mathrm {e}^{2x} \) for \(x > 0\), we obtain

$$\begin{aligned} \begin{aligned} \int _{0}^{t} F_{l} (u) \mathrm {d}u&\le F_{l} (0) \mathrm {e}^{2c_{1}t} \cdot t + \frac{ c_{2} }{ c_{1} } t \cdot \mathrm {e}^{ c_{1} t } \le \left( F_{l}(0) + \frac{ c_{2} }{ c_{1} } \right) t \cdot \mathrm {e}^{ 2c_{1} t } = G_{l} (t). \end{aligned} \end{aligned}$$

(ii) We give the proof only for \( \int _{0}^{t} \vert \dot{b}_{j} (u) \vert \mathrm {d}u \); the other parameters are treated in the same way. By (1) and (2), we get \( \int _{0}^{t} \vert \dot{b}_{j} (u) \vert \mathrm {d}u \le \sqrt{ \frac{ L ( \theta (0) ) }{ m } } \int _{0}^{t} \sigma \big ( a_{j} (u) x_{i} + a_{0,j} (u) \big ) \mathrm {d}u \). Then, by using \( \sigma \big ( a_{j} (u) x_{i} + a_{0,j} (u) \big ) \le \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) F_{j} (u) + \sigma (0) \) and (i), we obtain the conclusion.
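
Spelled out, the chain of estimates in (ii) reads

$$\begin{aligned} \int _{0}^{t} \vert \dot{b}_{j} (u) \vert \mathrm {d}u \le \sqrt{ \frac{ L ( \theta (0) ) }{ m } } \int _{0}^{t} \big \{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) F_{j} (u) + \sigma (0) \big \} \mathrm {d}u \le \sqrt{ \frac{ L ( \theta (0) ) }{ m } } \left\{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) G_{j}(t) + \sigma (0) t \right\} , \end{aligned}$$

and the integrals of \( \vert \dot{a}_{j} \vert \) and \( \vert \dot{a}_{0,j} \vert \) are bounded by the same right-hand side, using \( \vert b_{j}(u) \vert \le F_{j}(u) \) in place of the bound on \(\sigma \).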

Proposition 3

For all \(p > 0\), we have the following: \( \limsup _{m \rightarrow \infty } \mathbf {E} [ \big ( \sqrt{ L ( \theta (0) ) } \big )^{p} ] < \infty \), \( \limsup _{m \rightarrow \infty } \mathbf {E} [ G_{j} (t)^{p} ] < \infty \) and \( \limsup _{m \rightarrow \infty } \mathbf {E} [ \big ( \sqrt{ L ( \theta (0) ) } \, G_{j} (t) \big )^{p} ] < \infty \).

Proof

The last estimate follows from the first two and the Cauchy–Schwarz inequality. Since the first estimate is obvious, we show only the second. For this, it is sufficient to show that

$$\begin{aligned} \limsup _{m \rightarrow \infty } \mathbf {E} \Big [ \exp \Big ( \frac{ p }{ \sqrt{m} } \sqrt{ L ( \theta (0) ) } \Big ) \Big ] < \infty \quad \text { for every } p > 0. \end{aligned}$$
(7)

In the following, we write \(a_{0,j} (0) = a_{0,j}\), \(a_{j} (0) = a_{j}\) and \(b_{j} (0) = b_{j}.\) First, we note that \( \sqrt{L ( \theta (0) )} \le \frac{1}{\sqrt{n}} \sum _{i=1}^{n} \vert \hat{y}_{i} ( \theta (0) ) - y_{i} \vert \le \frac{1}{\sqrt{n}} \sum _{i=1}^{n} \big ( \vert \hat{y}_{i} ( \theta (0) ) \vert + \vert y_{i} \vert \big ) \). Then by using Hölder’s inequality, we get

Since we have \( ( \hat{y}_{i} ( \theta (0) ) \mid \boldsymbol{a}_{0}, \boldsymbol{a} ) \sim \mathrm {N} \big ( 0, \frac{1}{m} \sum _{j=1}^{m} \sigma ( a_{j} x_{i} + a_{0,j} )^{2} \big ) \),

Furthermore, by Jensen’s inequality and independence,

We can show that \( \sigma ( a_{1} x_{i} + a_{0,1} )^{2} \le 16 \{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) \}^{2} \{ ( a_{0,1} )^{2} + ( a_{1} )^{2} \} + ( \sigma (0) )^{2} \). Hence

The right-hand side is finite if \( \frac{ 8 p^{2} n \Vert \sigma ^{\prime } \Vert _{\infty }^{2} ( \vert \mathcal {X} \vert + 1 )^{2} }{ m } - \frac{1}{2} < 0 \), that is, if \( m > 16 p^{2} n \Vert \sigma ^{\prime } \Vert _{\infty }^{2} ( \vert \mathcal {X} \vert + 1 )^{2} \), and it is then decreasing in m. Putting everything together, (7) is proved.
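
The threshold \( \frac{1}{2} \) comes from the standard Gaussian exponential-moment identity: for \( Z \sim \mathrm {N}(0,1) \) and a constant \( c < \frac{1}{2} \),

$$\begin{aligned} \mathbf {E} \big [ \mathrm {e}^{ c Z^{2} } \big ] = \frac{1}{ \sqrt{ 1 - 2c } }, \end{aligned}$$

while the expectation is infinite for \( c \ge \frac{1}{2} \); it is applied here to the independent standard Gaussian variables \( a_{0,1} \) and \( a_{1} \) with \( c = \frac{ 8 p^{2} n \Vert \sigma ^{\prime } \Vert _{\infty }^{2} ( \vert \mathcal {X} \vert + 1 )^{2} }{ m } \), which gives the stated condition on m.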

A.2 Proof of Theorem 1

It is enough to prove that both of the families \(\{ A^{(m)}(t) \}_{m=1}^{\infty }\) and \(\{ B^{(m)}(t) \}_{m=1}^{\infty }\) are tight. For this, by [3, Chapter I, Section 4, Theorem 4.3], it is sufficient to show that (i) \( \sup _{m} \mathbf {E} \big [ \vert A_{0}^{(m)}(t) \vert + \vert B_{0}^{(m)} (t) \vert \big ] < \infty \) and (ii) there exist \(\gamma , \alpha > 0\) such that

$$\begin{aligned} \begin{aligned} \sup _{m} \sup _{ \begin{array}{c} s, u \in [0,1]: \\ s \ne u \end{array} } \left( \frac{ \mathbf {E} \big [ \vert A_{s}^{(m)} (t) - A_{u}^{(m)}(t) \vert ^{\gamma } \big ] }{ \vert s - u \vert ^{ 1 + \alpha } } + \frac{ \mathbf {E} \big [ \vert B_{s}^{(m)} (t) - B_{u}^{(m)}(t) \vert ^{\gamma } \big ] }{ \vert s - u \vert ^{ 1 + \alpha } } \right) < \infty . \end{aligned} \end{aligned}$$

(i) is clear since \( A_{0}^{(m)} (t) = B_{0}^{(m)} (t) = 0 \). Thus we show only (ii), and only for \(A^{(m)}(t)\). Since \( A^{(m)} (t) \) is a piecewise linear interpolation of its values on \(\{ s_{k} = \frac{k}{m} \}_{k=0}^{m}\), it suffices to show that, for some \(\gamma , \alpha > 0\), it holds that

$$\begin{aligned} \sup _{m} \sup _{ \begin{array}{c} 1 \le k, j \le m: \\ k \ne j \end{array} } \frac{ \mathbf {E} \big [ \vert A_{s_{k}}^{(m)} (t) - A_{s_{j}}^{(m)}(t) \vert ^{\gamma } \big ] }{ \vert s_{k} - s_{j} \vert ^{ 1 + \alpha } } < \infty . \end{aligned}$$
(8)

Let \(k,j \in \{ 1,2,\ldots , m \}\) be arbitrary. Without loss of generality, we assume that \(j < k\). Then we have

$$\begin{aligned} \begin{aligned} \vert A_{s_{k}}^{(m)} (t) - A_{s_{j}}^{(m)}(t) \vert&\le \frac{1}{\sqrt{m}} \Big \vert \sum _{l=j+1}^{k} \big ( a_l (t) - \mathbf {E} [ a_l (t) ] \big ) \Big \vert + \frac{1}{\sqrt{m}} \Big \vert \sum _{l=j+1}^{k} \mathbf {E} [ a_l (t) ] \Big \vert . \end{aligned} \end{aligned}$$
(9)

We estimate the two terms on the right-hand side separately.

Lemma 2

With \(G_{l}(t)\) defined in (6), we have

$$\begin{aligned} \begin{aligned} \Big \vert \sum _{l=j+1}^k \big ( a_l (t) - \mathbf {E} [ a_l (t) ] \big ) \Big \vert \le \Big \vert \sum _{l=j+1}^k a_l (0) \Big \vert + \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \sum _{l=j+1}^{k} \big ( \sqrt{ L ( \theta (0) ) } \, G_{l}(t) + \mathbf {E} [ \sqrt{ L ( \theta (0) ) } \, G_{l}(t) ] \big ) . \end{aligned} \end{aligned}$$

Proof

Since \(\mathbf {E} [ a_{l}(0) ] = 0\), we have \( a_l (t) - \mathbf {E} [ a_l (t) ] = \int _0^t \dot{a}_l (u) \mathrm {d}u - \int _0^t \mathbf {E} [ \dot{a}_l (u) ] \mathrm {d}u + a_l (0) \). Summing this over \(l=j+1 , j+2, \ldots , k\) and using (1) and (2), we obtain

$$\begin{aligned} \begin{aligned} \Big \vert \sum _{l=j+1}^k \big ( a_l (t) - \mathbf {E} [ a_l (t) ] \big ) \Big \vert \le \Big \vert \sum _{l=j+1}^k a_l (0) \Big \vert&+ \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \sqrt{ L ( \theta (0) ) } \sum _{l=j+1}^k \int _0^t \vert b_l (u) \vert \mathrm {d}u \\&\!\!\!+ \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \mathbf {E} \big [ \sqrt{ L ( \theta (0) ) } \sum _{l=j+1}^k \int _0^t \vert b_l (u) \vert \mathrm {d} u \big ] . \end{aligned} \end{aligned}$$

Finally, by applying Proposition 2, we get the conclusion.

Lemma 3

We have \(\displaystyle \Big \vert \sum _{l=j+1}^k \mathbf {E} [ a_l (t) ] \Big \vert \le \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \mathbf {E} \big [ \sqrt{ L ( \theta (0) ) } \, \sum _{l=j+1}^k G_{l} (t) \big ] . \)

Proof

By (1), \( \mathbf {E} [ a_l (t) ] = \int _0^t \mathbf {E} [ -\frac{1}{n} \sum _{i=1}^n \big ( \hat{y}_{i} ( \theta (u) ) - y_i \big ) \sigma ^{\prime } \big ( a_l (u) x_i + a_{0,l} (u) \big ) x_i \frac{ b_l (u) }{ \sqrt{m} } ] \mathrm {d}u \). By taking the sum over \(l = j+1, j+2, \ldots , k\), we have

$$\begin{aligned} \begin{aligned} \Big \vert \sum _{l=j+1}^k \mathbf {E} [ a_l (t) ] \Big \vert&\le \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{ \sqrt{m} } \mathbf {E} \Big [ \sqrt{ L ( \theta (0) ) } \sum _{l=j+1}^k \int _0^t \vert b_l (u) \vert \mathrm {d}u \Big ] . \end{aligned} \end{aligned}$$

Then by using Proposition 2, we reach the conclusion.

Turning back to Eq. (9), we apply Lemma 2 and Lemma 3 to get

$$\begin{aligned} \begin{aligned} \vert A_{s_{k}}^{(m)} (t) - A_{s_{j}}^{(m)}(t) \vert&\le \frac{1}{\sqrt{m}} \Big \vert \sum _{l=j+1}^{k} a_{l} (0) \Big \vert + \frac{ \Vert \sigma ^{\prime } \Vert _{\infty } \vert \mathcal {X} \vert }{m} \sum _{l=j+1}^{k} \big ( 2 H_{l} (t) + \mathbf {E} [ H_{l} (t) ] \big ) , \end{aligned} \end{aligned}$$

where \( H_{l} (t) = \sqrt{ L ( \theta (0) ) } \, G_{l} (t) \). By the elementary estimate \( ( x+y )^{4} \le 2^{4} ( x^{4} + y^{4} ) \),

$$\begin{aligned} \begin{aligned}&\big ( A_{s_{k}}^{(m)} (t) - A_{s_{j}}^{(m)}(t) \big )^{4} \\&\le \frac{2^{4}}{m^{2}} \left( \sum _{l=j+1}^{k} a_{l} (0) \right) ^{4} + 2^{4} \Vert \sigma ^{\prime } \Vert _{\infty }^{4} \vert \mathcal {X} \vert ^{4} \left( \frac{ k-j }{ m } \right) ^{4} \left( \frac{1}{k-j} \sum _{l=j+1}^{k} \big ( 2H_{l} (t) + \mathbf {E} [ H_{l} (t) ] \big ) \right) ^{4} . \end{aligned} \end{aligned}$$

Therefore \( \mathbf {E} [ \big ( A_{s_{k}}^{(m)} (t) - A_{s_{j}}^{(m)}(t) \big )^{4} ] \le \frac{2^{4}}{m^{2}} I + 2^{4} \Vert \sigma ^{\prime } \Vert _{\infty }^{4} \vert \mathcal {X} \vert ^{4} ( s_{k} - s_{j} )^{4} I\!I \). Here,

$$\begin{aligned} I := \mathbf {E} [ \Big ( \sum _{l=j+1}^{k} a_{l} (0) \Big )^{4} ], \quad I\!I := \mathbf {E} [ \Big ( \frac{1}{k-j} \sum _{l=j+1}^{k} \big ( 2H_{l} (t) + \mathbf {E} [ H_{l} (t) ] \big ) \Big )^{4} ] . \end{aligned}$$

First, we shall focus on \(I\!I\). By Jensen’s inequality,

$$\begin{aligned} \begin{aligned} I\!I \le \frac{1}{k-j} \sum _{l=j+1}^{k} \mathbf {E} [ \big ( 2H_{l} (t) + \mathbf {E} [ H_{l} (t) ] \big )^{4} ] = \mathbf {E} [ \big ( 2H_{1} (t) + \mathbf {E} [ H_{1} (t) ] \big )^{4} ] . \end{aligned} \end{aligned}$$

On the other hand, for I, since \( a_{1}(0), a_{2}(0), \ldots , a_{m}(0) \) are independent and identically distributed, each as \(\mathrm {N} ( 0, 1 )\), the sum \( \sum _{l=j+1}^{k} a_{l} (0) \) is distributed as \(\mathrm {N} ( 0, k-j )\), whose fourth moment is \( 3 (k-j)^{2} \); hence \( I = 3 (k-j)^{2} \). Therefore

$$\begin{aligned} \begin{aligned}&\mathbf {E} [ \big ( A_{s_{k}}^{(m)} (t) - A_{s_{j}}^{(m)}(t) \big )^{4} ] \\&\le 2^{4} \cdot 3 ( s_{k} - s_{j} )^{2} + 2^{4} \Vert \sigma ^{\prime } \Vert _{\infty }^{4} \vert \mathcal {X} \vert ^{4} ( s_{k} - s_{j} )^{4} \mathbf {E} [ \big ( 2 H_{1} (t) + \mathbf {E} [ H_{1} (t) ] \big )^{4} ]. \end{aligned} \end{aligned}$$

Finally, since \( \vert s_{k} - s_{j} \vert \le 1 \) gives \( ( s_{k} - s_{j} )^{4} \le ( s_{k} - s_{j} )^{2} \), and since the last expectation is bounded uniformly in m by Proposition 3, we see that (8) holds for \(\gamma = 4\) and \(\alpha = 1\).

A.3 Proof of Theorem 2

By the law of large numbers, we see that \( \frac{1}{m} \sum _{j=1}^m \big ( b_j (0) \big )^{2} \rightarrow \mathbf {E} [ \big ( b_1 (0) \big )^{2} ] = 1 \) as \(m \rightarrow \infty \). Thus it suffices to show that

$$\begin{aligned} \begin{aligned}&\mathbf {E} \big [ \Big \vert \frac{1}{m} \sum _{j=1}^m \big ( b_j (t)- \mathbf {E} [ b_j (t) ] \big )^2 - \frac{1}{m} \sum _{j=1}^m \big ( b_j (0) \big )^{2} \Big \vert \big ] \rightarrow 0. \end{aligned} \end{aligned}$$

Since \( b_j (t) - \mathbf {E} [ b_j (t) ] = b_j (0) + \int _0^t \big ( \dot{b}_j (u) - \mathbf {E} [ \dot{b}_j (u) ] \big ) \mathrm {d}u \), we have \( ( b_j (t) - \mathbf {E} [ b_j (t) ] )^{2} - ( b_j (0) )^{2} = \big ( \int _0^t \big ( \dot{b}_j (u) - \mathbf {E} [ \dot{b}_j (u) ] \big ) \mathrm {d}u \big )^2 + 2 b_j (0) \int _0^t ( \dot{b}_j (u) - \mathbf {E} [ \dot{b}_j (u) ] ) \mathrm {d}u \). Thus we have

$$\begin{aligned} \begin{aligned}&\Big \vert \frac{1}{m} \sum _{j=1}^m \big ( b_j (t) - \mathbf {E} [ b_j (t) ] \big )^2 - \frac{1}{m} \sum _{j=1}^m \big ( b_j (0) \big )^{2} \Big \vert \\&\le \frac{1}{m} \sum _{j=1}^m \Big \vert \left( \int _0^t \big ( \dot{b}_j (u) - \mathbf {E} [ \dot{b}_j (u) ] \big ) \mathrm {d}u \right) ^2 + 2 b_j (0) \int _0^t \big ( \dot{b}_j (u) - \mathbf {E} [ \dot{b}_j (u) ] \big ) \mathrm {d}u \Big \vert . \end{aligned} \end{aligned}$$

By taking the expectation, we get

$$\begin{aligned} \begin{aligned}&\mathbf {E} \big [ \Big \vert \frac{1}{m} \sum _{j=1}^m \big ( b_j (t)- \mathbf {E} [ b_j (t) ] \big )^2 - \frac{1}{m} \sum _{j=1}^m \big ( b_j (0) \big )^{2} \Big \vert \big ] \\&\le \frac{1}{m} \sum _{j=1}^m \left\{ \mathbf {E} \big [ \left( \int _0^t \big ( \vert \dot{b}_j (u) \vert + \mathbf {E} \big [ \vert \dot{b}_j (u) \vert \big ] \big ) \mathrm {d}u \right) ^2 \big ] + 2 \mathbf {E} \big [ \vert b_j (0) \vert \int _0^t \big ( \vert \dot{b}_j (u) \vert + \mathbf {E} \big [ \vert \dot{b}_j (u) \vert \big ] \big ) \mathrm {d}u \big ] \right\} . \end{aligned} \end{aligned}$$

For the term \(\int _{0}^{t} \vert \dot{b}_{j} (u) \vert \mathrm {d}u\) appearing above, we know from Proposition 2 that

$$\begin{aligned} \begin{aligned} \int _{0}^{t} \vert \dot{b}_{j} (u) \vert \mathrm {d}u&\le \sqrt{ \frac{ L ( \theta (0) ) }{ m } } \left\{ \Vert \sigma ^{\prime } \Vert _{\infty } ( \vert \mathcal {X} \vert + 1 ) G_{j}(t) + \sigma (0) t \right\} =: \frac{ M_{j} (t) }{ \sqrt{m} }, \end{aligned} \end{aligned}$$

where we note that \(M_j (t)\) depends on the width m. Thus, \( \int _{0}^{t} \big ( \vert \dot{b}_{j} (u) \vert + \mathbf {E} \big [ \vert \dot{b}_{j} (u) \vert \big ] \big ) \mathrm {d}u \le \frac{ M_{j} (t) + \mathbf {E} [ M_{j} (t) ] }{ \sqrt{m} } \). By Proposition 3, we have \(\displaystyle \limsup _{m \rightarrow \infty } \mathbf {E} \big [ \big ( M_{1}(t) + \mathbf {E} [ M_{1}(t) ] \big )^2 \big ] < \infty \) and \(\displaystyle \limsup _{m \rightarrow \infty } \mathbf {E} \big [ \vert b_{1} (0) \vert \big ( M_{1}(t) + \mathbf {E} [ M_{1}(t) ] \big ) \big ] < \infty \). Hence, as \(m \rightarrow \infty \),

$$\begin{aligned} \begin{aligned}&\mathbf {E} \big [ \Big \vert \frac{1}{m} \sum _{j=1}^m \big ( b_j (t)- \mathbf {E} [ b_j (t) ] \big )^2 - \frac{1}{m} \sum _{j=1}^m \big ( b_j (0) \big )^{2} \Big \vert \big ]\\&\le \frac{1}{m} \sum _{j=1}^m \left\{ \frac{ \mathbf {E} \big [ \big ( M_{j} (t) + \mathbf {E} [ M_{j} (t) ] \big )^2 \big ] }{ m } + 2 \frac{ \mathbf {E} \big [ \vert b_j (0) \vert \big ( M_{j} (t) + \mathbf {E} [ M_{j} (t) ] \big ) \big ] }{ \sqrt{m} } \right\} \\&= \frac{ \mathbf {E} \big [ \big ( M_{1}(t) + \mathbf {E} [ M_{1}(t) ] \big )^2 \big ] }{ m } + 2 \frac{ \mathbf {E} \big [ \vert b_{1} (0) \vert \big ( M_{1}(t) + \mathbf {E} [ M_{1}(t) ] \big ) \big ] }{ \sqrt{m} } \rightarrow 0. \end{aligned} \end{aligned}$$
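
As an informal numerical illustration of the conservation statement just proved (not part of the original argument), the following sketch trains a width-m network by plain gradient descent on \( L(\theta ) \) and tracks the empirical variance \( \frac{1}{m} \sum _{j=1}^{m} ( b_{j}(t) - \bar{b}(t) )^{2} \) of the post-neuronal parameters; for large m it should stay close to its initial value 1. The training loop, step size, activation (ReLU) and toy data are our own choices for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_prime(z):
    return (z > 0.0).astype(z.dtype)  # almost-everywhere derivative of ReLU

def train_and_track_b_variance(m=4000, n=20, steps=200, lr=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # i.i.d. N(0, 1) initialization of a0 (thresholds), a (input weights), b (output weights)
    a0, a, b = (rng.standard_normal(m) for _ in range(3))
    x = np.linspace(-1.0, 1.0, n)
    y = np.sin(np.pi * x)

    variances = []
    for _ in range(steps):
        pre = np.outer(x, a) + a0                   # (n, m): a_j x_i + a0_j
        act = relu(pre)
        y_hat = act @ b / np.sqrt(m)                # NTK-parametrized outputs
        resid = (y_hat - y) / n                     # residuals scaled by 1/n
        # gradients of L(theta) = (1/n) sum_i (y_hat_i - y_i)^2
        grad_b = 2.0 * act.T @ resid / np.sqrt(m)
        common = relu_prime(pre) * b / np.sqrt(m)   # (n, m)
        grad_a = 2.0 * (common * x[:, None]).T @ resid
        grad_a0 = 2.0 * common.T @ resid
        a0, a, b = a0 - lr * grad_a0, a - lr * grad_a, b - lr * grad_b
        variances.append(np.mean((b - b.mean()) ** 2))
    return variances

var_b = train_and_track_b_variance()
print(var_b[0], var_b[-1])  # both should remain close to 1 for large m
```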

Cite this paper

Eguchi, S., Amaba, T. (2021). Energy Conservation in Infinitely Wide Neural-Networks. In: Farkaš, I., Masulli, P., Otte, S., Wermter, S. (eds) Artificial Neural Networks and Machine Learning – ICANN 2021. ICANN 2021. Lecture Notes in Computer Science, vol. 12894. Springer, Cham. https://doi.org/10.1007/978-3-030-86380-7_15
