Abstract
In this chapter, we derive neural networks from the viewpoint of physical models. A neural network is a nonlinear function that maps an input to an output, and in supervised learning, specifying the network amounts to specifying a function called the error function. By regarding the output as dynamical degrees of freedom and the input as an external field, various neural networks and their deepened versions emerge from simple Hamiltonians. Training (learning) is a procedure for reducing the value of the error function, and we will learn the specific method of backpropagation using the bra-ket notation popular in quantum mechanics. Finally, we will look at the "universal approximation theorem," which explains why neural networks can express relations between various types of data.
Notes
- 1.
For conventional explanations, we recommend Ref. [20]. Reading it alongside this book may complement your understanding.
- 2.
Readers new to analytical mechanics may simply read "mechanical degrees of freedom" as coordinates (or momenta) and "Hamiltonian" as energy; no problem arises from this translation.
- 3.
In the Boltzmann distribution, the exponent contains the factor H∕(k B T), the Hamiltonian divided by the temperature. We redefine J as Jk B T to absorb the temperature, so that the temperature does not appear explicitly in the exponent.
- 4.
The similarity between the sigmoid function and the Fermi distribution function can be understood from the fact that d takes binary values in the present system: d = 0 corresponds to the state in which the fermion site is unoccupied (a vacancy), and d = 1 to the state in which a fermion excitation exists.
- 5.
By the way, \(-\sum P(\mathbf {x}, d) \log P(d|\mathbf {x}) \), which is equivalent to the "J-independent part" of (3.13), is called the conditional entropy. Because relative entropy is non-negative, the first term of (3.13) must be greater than this conditional entropy. If (3.14), which approximates the first term, falls below this bound, it is a clear sign of over-training. Since that term is difficult to evaluate in practice, we ignore it in the following.
- 6.
- 7.
In this way, increasing the number of degrees of freedom while sharing the parameters corresponds to handling an ensemble of h bits, and improves the accuracy in a statistical sense. This is thought to improve performance [30]. In our case, J is shifted by 0.5, which simplifies 〈h〉 as shown in the main text, so the computational cost is lower than that of an ordinary ensemble.
- 8.
Here, every |h l〉 is expanded in the basis |m〉, meaning that all the vectors have the same dimension. However, if the range of the sum in (3.59) is written out explicitly rather than abbreviated, this notation also applies when the dimensions differ.
- 9.
Visit Nielsen's website for an intuitive understanding of the proof.
- 10.
The situation is the same as in physics calculations at zero temperature, where the Fermi distribution function becomes a step function.
- 11.
This part corresponds to the denseness argument in Cybenko's proof.
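The correspondence in Note 4 between the sigmoid function and the occupation probability of a binary variable can be checked numerically. Below is a minimal sketch; the energy assignment H(0) = 0, H(1) = −J and the convention k B T = 1 are illustrative choices for this example, not the only possibility.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_occupied(J):
    """Boltzmann probability of d = 1 for a binary variable with
    energies H(0) = 0, H(1) = -J (illustrative choice, k_B T = 1)."""
    weights = np.exp([0.0, J])         # e^{-H(d)} for d = 0, 1
    return weights[1] / weights.sum()  # = e^J / (1 + e^J) = sigmoid(J)

# The occupation probability coincides with the sigmoid of J
for J in [-2.0, 0.0, 3.0]:
    assert np.isclose(p_occupied(J), sigmoid(J))
```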
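The bound in Note 5 — the cross-entropy is never smaller than the conditional entropy — follows from the non-negativity of relative entropy, and can be illustrated with a toy joint distribution. All the numbers below are arbitrary examples generated for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution P(x, d) over 4 inputs and a binary label d
P_xd = rng.random((4, 2))
P_xd /= P_xd.sum()
P_d_given_x = P_xd / P_xd.sum(axis=1, keepdims=True)  # true conditional P(d|x)

def cross_entropy(Q_d_given_x):
    """-sum_{x,d} P(x,d) log Q(d|x)."""
    return -(P_xd * np.log(Q_d_given_x)).sum()

# Conditional entropy: cross-entropy evaluated with Q = P
H_cond = cross_entropy(P_d_given_x)

# Any other model conditional Q gives a cross-entropy at least as large
Q = rng.random((4, 2))
Q /= Q.sum(axis=1, keepdims=True)
assert cross_entropy(Q) >= H_cond
```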
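The zero-temperature limit mentioned in Note 10 can also be seen directly: as T → 0, a Fermi-type function 1∕(e^{x∕T} + 1) approaches a step function. A small numerical sketch (the sample points are arbitrary):

```python
import numpy as np

def fermi(x, T):
    """Fermi-distribution-like function; approaches a step as T -> 0."""
    return 1.0 / (np.exp(x / T) + 1.0)

for T in [1.0, 0.1, 0.01]:
    print(T, fermi(-1.0, T), fermi(1.0, T))
# As T decreases, the value approaches 1 for x < 0 and 0 for x > 0,
# i.e. the function sharpens into a step.
```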
References
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814 (2010)
Teh, Y.W., Hinton, G.E.: Rate-coded restricted Boltzmann machines for face recognition. In: Advances in Neural Information Processing Systems, pp. 908–914 (2001)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science (1985)
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)
Nielsen, M.: A visual proof that neural nets can compute any function. http://neuralnetworksanddeeplearning.com/chap4.html.
Lee, H., Ge, R., Ma, T., Risteski, A., Arora, S.: On the ability of neural nets to express distributions. arXiv preprint arXiv:1702.07028 (2017)
Sonoda, S., Murata, N.: Neural network with unbounded activation functions is universal approximator. Appl. Comput. Harmon. Anal. 43(2), 233–268 (2017)
Appendices
Column: Statistical Mechanics and Quantum Mechanics
Canonical Distribution in Statistical Mechanics
The statistical mechanics used in this book is based on the canonical distribution at a given temperature. The physical situation is as follows. Suppose we have some physical degree of freedom, such as a spin or a particle position. Denote it as d. Since this is a physical degree of freedom, it should have an energy that depends on its value. Let it be H(d). When this system is immersed in an environment at temperature T (called a heat bath), the probability of realizing the value d with energy H(d) is known to be
\( P(d) = \frac {1}{Z} e^{-H(d)/(k_B T)} . \)
Here, k B is an important constant called the Boltzmann constant, and Z is called the partition function, defined as
\( Z = \sum _d e^{-H(d)/(k_B T)} . \)
In the main text, we put k B T = 1.
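As a sketch, the canonical probabilities and the partition function for a finite set of states can be computed directly. The two-state energies below are an arbitrary illustration, with k B T = 1 as in the main text.

```python
import numpy as np

def canonical(energies, kT=1.0):
    """Return Boltzmann probabilities P(d) = e^{-H(d)/kT} / Z, and Z itself."""
    weights = np.exp(-np.asarray(energies, dtype=float) / kT)
    Z = weights.sum()
    return weights / Z, Z

# Two states with H = -1 and H = +1 (illustrative values), at k_B T = 1
probs, Z = canonical([-1.0, 1.0])
print(probs, Z)  # probabilities sum to 1; the lower-energy state is more likely
```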
Simple example: law of equipartition of energy
As an example, let d be the position x and momentum p of a particle in a box of length L, with the standard kinetic energy \( H(x,p) = \frac {p^2}{2m} \) as its energy. The expectation value of the energy is
\( \langle E \rangle = \frac {1}{Z} \int _0^L dx \int _{-\infty }^{\infty } dp \, \frac {p^2}{2m} \, e^{-\frac {p^2}{2 m k_B T}} . \)
Here
\( Z = \int _0^L dx \int _{-\infty }^{\infty } dp \, e^{-\frac {p^2}{2 m k_B T}} = L \sqrt {2\pi m k_B T} . \)
The calculation of 〈E〉 looks a bit difficult, but using \( \beta = \frac {1}{k_B T} \) we find
\( \langle E \rangle = -\frac {\partial }{\partial \beta } \log Z = \frac {1}{2\beta } = \frac {1}{2} k_B T . \)
In other words, when the temperature T is high (the system is hot), the expectation value of the energy is high, and when the temperature is low, it is low. This is consistent with our intuition. If we consider the case of three spatial dimensions, we obtain the famous formula \( \frac {3}{2}k_BT \) for the expectation value of the energy.
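The result \( \langle E \rangle = \frac {1}{2} k_B T \) can also be checked numerically by sampling momenta from the Boltzmann weight, which is a Gaussian of variance m k B T. The mass and temperature values below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
m, kT = 1.0, 2.0  # mass and temperature k_B T (arbitrary illustrative values)

# Sample momenta from the Boltzmann factor e^{-p^2 / (2 m k_B T)},
# i.e. a Gaussian with mean 0 and variance m k_B T.
p = rng.normal(0.0, np.sqrt(m * kT), size=1_000_000)
E_mean = (p**2 / (2 * m)).mean()

print(E_mean, kT / 2)  # the sample mean agrees with <E> = k_B T / 2
```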
Bracket Notation in Quantum Mechanics
In the derivation of the backpropagation method, bracket notation was introduced as a convenient shorthand. It is nothing more than writing a vector as a ket,
\( |a\rangle = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{pmatrix} . \)
There are some notational conveniences. For example, the inner product is conventionally written as
\( \langle a | b \rangle = \sum _i a_i b_i , \)
which means that the inner product is regarded as the "product of the matrices": the bra \( \langle a | \) is the row vector obtained by transposing the ket, and multiplying a 1 × N matrix by an N × 1 matrix yields a number. Also in the main text we encountered
\( |a\rangle \langle b| , \)
and this too can be regarded as a "matrix," formed by the matrix product of a column vector and a row vector, with components \( (|a\rangle \langle b|)_{ij} = a_i b_j . \)
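In NumPy these conventions can be mimicked by treating kets as column vectors and bras as row vectors; the vectors below are arbitrary examples.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])  # components of the ket |a>
b = np.array([4.0, 5.0, 6.0])  # components of the ket |b>

# <a|b>: bra (row vector) times ket (column vector) gives a 1x1 "matrix"
inner = a[np.newaxis, :] @ b[:, np.newaxis]
assert np.isclose(inner[0, 0], np.dot(a, b))

# |a><b|: ket (column) times bra (row) gives an N x N matrix (outer product)
outer = a[:, np.newaxis] @ b[np.newaxis, :]
assert np.allclose(outer, np.outer(a, b))
```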
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Tanaka, A., Tomiya, A., Hashimoto, K. (2021). Basics of Neural Networks. In: Deep Learning and Physics. Mathematical Physics Studies. Springer, Singapore. https://doi.org/10.1007/978-981-33-6108-9_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-6107-2
Online ISBN: 978-981-33-6108-9