Abstract
In this chapter, we derive neural networks from the viewpoint of physical models. A neural network is a nonlinear function that maps an input to an output, and in supervised learning, specifying the network amounts to specifying a function called the error function. By regarding the output as dynamical degrees of freedom and the input as an external field, various neural networks and their deepened versions emerge from simple Hamiltonians. Training (learning) is a procedure for reducing the value of the error function, and we will learn the specific method of backpropagation using the bra-ket notation popular in quantum mechanics. Finally, we will look at the "universal approximation theorem," which explains why neural networks can express relations between various types of data.
Notes
- 1.
For conventional explanations, we recommend Ref. [20]. Reading it alongside this book may complement your understanding.
- 2.
Readers new to analytical mechanics may simply read "mechanical degrees of freedom" as coordinates (or momenta) and "Hamiltonian" as energy; no problem arises from this translation.
- 3.
In the Boltzmann distribution, the exponent contains the factor H∕(k B T), the Hamiltonian divided by the temperature. We redefine J as Jk B T to absorb the temperature, so that the temperature does not appear explicitly in the exponent.
- 4.
The similarity between the sigmoid function and the Fermi distribution function can be understood from the fact that d takes binary values in the present system: d = 0 corresponds to the state in which the fermion site is unoccupied (a vacancy), and d = 1 to the state in which a fermion excitation exists.
- 5.
By the way, \(-\sum P(\mathbf {x}, d) \log P(d|\mathbf {x}) \), which is equivalent to the "J-independent part" of (3.13), is called the conditional entropy. Because relative entropy is non-negative, the first term of (3.13) must be greater than this conditional entropy. If (3.14), which approximates the first term, falls below this bound, it is a clear sign of over-training. Since that term is difficult to evaluate in practice, we ignore it in the following.
- 6.
- 7.
In this way, increasing the number of degrees of freedom while sharing the parameters corresponds to handling an ensemble of h bits, and improves the accuracy in a statistical sense. This is thought to improve performance [30]. In our case, J is shifted by 0.5, which simplifies 〈h〉 as shown in the main text, so the computational cost is lower than that of an ordinary ensemble.
- 8.
Here, every |h l〉 is expanded in the basis |m〉, meaning that all the vectors have the same dimension. However, if the range of the sum in (3.59) is written out explicitly rather than abbreviated, this notation also applies when the dimensions differ.
- 9.
Visit Nielsen's website for an intuitive understanding of the proof.
- 10.
The situation is the same as in physics calculations at zero temperature, where the Fermi distribution function becomes a step function.
- 11.
This part corresponds to the denseness argument in Cybenko's proof.
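The correspondence in Note 4 between the sigmoid function and the occupation probability of a binary variable can be checked numerically. Below is a minimal sketch; the energy assignment H(0) = 0, H(1) = −J and the convention k B T = 1 are illustrative choices for this example, not the only possibility.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_occupied(J):
    """Boltzmann probability of d = 1 for a binary variable with
    energies H(0) = 0, H(1) = -J (illustrative choice, k_B T = 1)."""
    weights = np.exp([0.0, J])         # e^{-H(d)} for d = 0, 1
    return weights[1] / weights.sum()  # = e^J / (1 + e^J) = sigmoid(J)

# The occupation probability coincides with the sigmoid of J
for J in [-2.0, 0.0, 3.0]:
    assert np.isclose(p_occupied(J), sigmoid(J))
```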
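The bound in Note 5 — the cross-entropy is never smaller than the conditional entropy — follows from the non-negativity of relative entropy, and can be illustrated with a toy joint distribution. All the numbers below are arbitrary examples generated for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution P(x, d) over 4 inputs and a binary label d
P_xd = rng.random((4, 2))
P_xd /= P_xd.sum()
P_d_given_x = P_xd / P_xd.sum(axis=1, keepdims=True)  # true conditional P(d|x)

def cross_entropy(Q_d_given_x):
    """-sum_{x,d} P(x,d) log Q(d|x)."""
    return -(P_xd * np.log(Q_d_given_x)).sum()

# Conditional entropy: cross-entropy evaluated with Q = P
H_cond = cross_entropy(P_d_given_x)

# Any other model conditional Q gives a cross-entropy at least as large
Q = rng.random((4, 2))
Q /= Q.sum(axis=1, keepdims=True)
assert cross_entropy(Q) >= H_cond
```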
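The zero-temperature limit mentioned in Note 10 can also be seen directly: as T → 0, a Fermi-type function 1∕(e^{x∕T} + 1) approaches a step function. A small numerical sketch (the sample points are arbitrary):

```python
import numpy as np

def fermi(x, T):
    """Fermi-distribution-like function; approaches a step as T -> 0."""
    return 1.0 / (np.exp(x / T) + 1.0)

for T in [1.0, 0.1, 0.01]:
    print(T, fermi(-1.0, T), fermi(1.0, T))
# As T decreases, the value approaches 1 for x < 0 and 0 for x > 0,
# i.e. the function sharpens into a step.
```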
References
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814 (2010)
Teh, Y.W., Hinton, G.E.: Rate-coded restricted Boltzmann machines for face recognition. In: Advances in Neural Information Processing Systems, pp. 908–914 (2001)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science (1985)
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989)
Nielsen, M.: A visual proof that neural nets can compute any function. http://neuralnetworksanddeeplearning.com/chap4.html.
Lee, H., Ge, R., Ma, T., Risteski, A., Arora, S.: On the ability of neural nets to express distributions. arXiv preprint arXiv:1702.07028 (2017)
Sonoda, S., Murata, N.: Neural network with unbounded activation functions is universal approximator. Appl. Comput. Harmon. Anal. 43(2), 233–268 (2017)
Appendices
Column: Statistical Mechanics and Quantum Mechanics
Canonical Distribution in Statistical Mechanics
The statistical mechanics used in this book is based on the canonical distribution at a given temperature. The physical situation is as follows. Suppose we have some physical degree of freedom, such as a spin or a particle position. Denote it as d. Since this is a physical degree of freedom, it should have an energy that depends on its value. Let it be H(d). When this system is immersed in an environment at temperature T (called a heat bath), the probability of realizing the value d with energy H(d) is known to be
\( P(d) = \frac {1}{Z} e^{-H(d)/(k_B T)} . \)
Here, k B is an important constant called the Boltzmann constant, and Z is called the partition function, defined as
\( Z = \sum _d e^{-H(d)/(k_B T)} . \)
In the main text, we put k B T = 1.
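As a sketch, the canonical probabilities and the partition function for a finite set of states can be computed directly. The two-state energies below are an arbitrary illustration, with k B T = 1 as in the main text.

```python
import numpy as np

def canonical(energies, kT=1.0):
    """Return Boltzmann probabilities P(d) = e^{-H(d)/kT} / Z, and Z itself."""
    weights = np.exp(-np.asarray(energies, dtype=float) / kT)
    Z = weights.sum()
    return weights / Z, Z

# Two states with H = -1 and H = +1 (illustrative values), at k_B T = 1
probs, Z = canonical([-1.0, 1.0])
print(probs, Z)  # probabilities sum to 1; the lower-energy state is more likely
```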
Simple example: law of equipartition of energy
As an example, let d be the position x and momentum p of a particle in a box of length L, with the standard kinetic energy \( H(x,p) = \frac {p^2}{2m} \) as its energy. The expectation value of the energy is
\( \langle E \rangle = \frac {1}{Z} \int _0^L dx \int _{-\infty }^{\infty } dp \, \frac {p^2}{2m} \, e^{-\frac {p^2}{2 m k_B T}} . \)
Here
\( Z = \int _0^L dx \int _{-\infty }^{\infty } dp \, e^{-\frac {p^2}{2 m k_B T}} = L \sqrt {2\pi m k_B T} . \)
The calculation of 〈E〉 looks a bit difficult, but using \( \beta = \frac {1}{k_B T} \) we find
\( \langle E \rangle = -\frac {\partial }{\partial \beta } \log Z = \frac {1}{2\beta } = \frac {1}{2} k_B T . \)
In other words, when the temperature T is high (the system is hot), the expectation value of the energy is high, and when the temperature is low, it is low. This is consistent with our intuition. If we consider the case of three spatial dimensions, we obtain the famous formula \( \frac {3}{2}k_BT \) for the expectation value of the energy.
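The result \( \langle E \rangle = \frac {1}{2} k_B T \) can also be checked numerically by sampling momenta from the Boltzmann weight, which is a Gaussian of variance m k B T. The mass and temperature values below are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
m, kT = 1.0, 2.0  # mass and temperature k_B T (arbitrary illustrative values)

# Sample momenta from the Boltzmann factor e^{-p^2 / (2 m k_B T)},
# i.e. a Gaussian with mean 0 and variance m k_B T.
p = rng.normal(0.0, np.sqrt(m * kT), size=1_000_000)
E_mean = (p**2 / (2 * m)).mean()

print(E_mean, kT / 2)  # the sample mean agrees with <E> = k_B T / 2
```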
Bracket Notation in Quantum Mechanics
In the derivation of the backpropagation method, bracket notation was introduced as a convenient shorthand. It is nothing more than writing a vector as a ket,
\( |a\rangle = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_N \end{pmatrix} . \)
There are some notational conveniences. For example, the inner product is conventionally written as
\( \langle a | b \rangle = \sum _i a_i b_i , \)
which means that the inner product is regarded as the "product of the matrices": the bra \( \langle a | \) is the row vector obtained by transposing the ket, and multiplying a 1 × N matrix by an N × 1 matrix yields a number. Also in the main text we encountered
\( |a\rangle \langle b| , \)
and this too can be regarded as a "matrix," formed by the matrix product of a column vector and a row vector, with components \( (|a\rangle \langle b|)_{ij} = a_i b_j . \)
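In NumPy these conventions can be mimicked by treating kets as column vectors and bras as row vectors; the vectors below are arbitrary examples.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])  # components of the ket |a>
b = np.array([4.0, 5.0, 6.0])  # components of the ket |b>

# <a|b>: bra (row vector) times ket (column vector) gives a 1x1 "matrix"
inner = a[np.newaxis, :] @ b[:, np.newaxis]
assert np.isclose(inner[0, 0], np.dot(a, b))

# |a><b|: ket (column) times bra (row) gives an N x N matrix (outer product)
outer = a[:, np.newaxis] @ b[np.newaxis, :]
assert np.allclose(outer, np.outer(a, b))
```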
Copyright information
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Tanaka, A., Tomiya, A., Hashimoto, K. (2021). Basics of Neural Networks. In: Deep Learning and Physics. Mathematical Physics Studies. Springer, Singapore. https://doi.org/10.1007/978-981-33-6108-9_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-33-6107-2
Online ISBN: 978-981-33-6108-9