
Introduction to Machine Learning

Deep Learning and Physics

Part of the book series: Mathematical Physics Studies ((MPST))


Abstract

In this chapter, we learn the general theory of machine learning. We take a look at examples of what learning is, what it means to say that a machine has "learned," and what relative entropy is. We also learn how to handle data in probability theory, and describe "generalization" and its importance in learning.


Notes

  1.

    Richard Feynman said [17], "We can imagine that this complicated array of moving things which constitutes 'the world' is something like a great chess game being played by the gods, and we are observers of the game. We do not know what the rules of the game are; all we are allowed to do is to watch the playing. Of course, if we watch long enough, we may eventually catch on to a few of the rules." It is a good parable that captures the essence of the inverse problem of guessing rules and structures. Machine learning, in terms of this parable, would mean that the observer is a machine instead of a human.

  2.

    Modified NIST (MNIST) is based on a database of handwritten character images created by the National Institute of Standards and Technology (NIST).

  3.

    A database created by the Canadian Institute for Advanced Research (CIFAR). The "10" in "CIFAR-10" indicates that there are 10 teacher labels (classes). There is also a dataset with more fine-grained labels, CIFAR-100.

  4.

    This process corresponds to P(x, d) = P(x|d)P(d), but in practice it is easier to collect data in the following order:

    1. Take an image x.
    2. Judge the label of the image and set it to d.
    3. Record (x, d).

    This process corresponds to P(x, d) = P(d|x)P(x). By Bayes' theorem (see the column in this chapter), the two sampling orders should give the same result, provided the data-generating probabilities exist.
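    As a quick numerical illustration (not from the book; the three-by-two joint distribution P_joint below and the variable names are invented for this sketch), the following Python snippet draws samples in both orders and checks that the empirical joint frequencies agree:

import numpy as np

rng = np.random.default_rng(0)

# Invented joint distribution P(x, d) over x in {0, 1, 2} and d in {0, 1}.
P_joint = np.array([[0.10, 0.20],
                    [0.30, 0.15],
                    [0.05, 0.20]])
P_x = P_joint.sum(axis=1)                 # P(x)
P_d = P_joint.sum(axis=0)                 # P(d)
P_d_given_x = P_joint / P_x[:, None]      # P(d|x)
P_x_given_d = P_joint / P_d[None, :]      # P(x|d)

N = 50_000

# Order "label first": d ~ P(d), then x ~ P(x|d), i.e. P(x, d) = P(x|d)P(d).
d1 = rng.choice(2, size=N, p=P_d)
x1 = np.array([rng.choice(3, p=P_x_given_d[:, d]) for d in d1])

# Order "image first": x ~ P(x), then d ~ P(d|x), i.e. P(x, d) = P(d|x)P(x).
x2 = rng.choice(3, size=N, p=P_x)
d2 = np.array([rng.choice(2, p=P_d_given_x[x]) for x in x2])

# Both empirical joint frequency tables approach P_joint, up to sampling noise.
print(np.histogram2d(x1, d1, bins=[3, 2])[0] / N)
print(np.histogram2d(x2, d2, bins=[3, 2])[0] / N)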

  5.

    Needless to say, the probabilities here are all classical; quantum theory has nothing to do with them.

  6.

    As mentioned in Chap. 1, relative entropy is also called the Kullback–Leibler divergence. Although it measures a kind of "distance," it does not satisfy the symmetry axiom of a distance, so it is called a divergence.
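    A minimal numerical sketch of this asymmetry in Python (the two three-outcome distributions and the helper function relative_entropy are my own illustration, not the book's):

import numpy as np

def relative_entropy(p, q):
    """Relative entropy (KL divergence) D_KL(p || q) of two discrete
    probability vectors, assuming all entries of q are nonzero."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]

print(relative_entropy(P, Q))  # about 0.22
print(relative_entropy(Q, P))  # about 0.19 -- a different value, hence not symmetric
print(relative_entropy(P, P))  # 0.0 -- vanishes when the two distributions coincide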

  7.

    In general, "generalization error" often refers to the expectation value of the error function (which we will describe later). As shown later, the two are essentially the same thing.

  8.

    This is the same as using maximum likelihood estimation, just as we did when we introduced relative entropy in Chap. 1.

  9.

    If the reader is familiar with experimental physics, recall overfitting. See also [20].

  10.

    As described in the footnote below, this holds only when the discussion is restricted to binary classification. In addition, the inequality can be used only when both the generalization error and the empirical error have been normalized to take values in [0, 1] by some method [21].

  11.

    The exact definition of the VC dimension is as follows. We define the model Q_J as

    $$\displaystyle \begin{aligned} Q_J(\mathbf{x}, \mathbf{d}) = Q_J(\mathbf{d}|\mathbf{x})P(\mathbf{x}) \, , \end{aligned} $$
    (2.12)

    as we will do later. Also, using a function f_J with parameter J and the Dirac delta function δ, we write

    $$\displaystyle \begin{aligned} Q_J(\mathbf{d}|\mathbf{x}) = \delta(f_J(\mathbf{x}) - \mathbf{d}). \end{aligned} $$
    (2.13)

    Suppose further that we are working on the problem of binary classification with d = 0, 1. This means that f_J(x) assigns the input x to either 0 or 1. Now, if there are # data points, there are 2^# possible ways of assigning 0/1 to them. If we can vary J fully, then [f_J(x[1]), f_J(x[2]), …, f_J(x[#])] can realize all 2^# possible 0/1 assignments, and the model has the ability to fit the data completely (its capacity is saturated relative to the number of data points #). The VC dimension is the maximum value of # for which such a situation can be realized.
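    The following Python sketch (a toy example, not from the book) makes the "all 2^# labelings" condition concrete by brute force, for a hypothetical one-dimensional threshold model f_J(x) = 1 if x ≥ J, else 0:

import numpy as np

def can_shatter(points, classifiers):
    """Return True if the family `classifiers` realizes all 2^n possible
    0/1 labelings of the given points."""
    n = len(points)
    realized = {tuple(int(f(x)) for x in points) for f in classifiers}
    return len(realized) == 2 ** n

# Hypothetical threshold model: f_J(x) = 1 if x >= J else 0, with J swept
# over a fine grid standing in for "varying J fully".
thresholds = np.linspace(-2.0, 2.0, 401)
family = [lambda x, J=J: x >= J for J in thresholds]

print(can_shatter([0.3], family))       # True: a single point can be shattered
print(can_shatter([0.3, 0.7], family))  # False: the labeling (1, 0) is unreachable
# So the VC dimension of this threshold model is 1: the largest number of
# points for which all 2^# labelings can be realized.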

  12.

    Over-training is the situation of falling into this state.

  13.

    A well-known index based on a similar idea is Akaike's Information Criterion (AIC) [22]:

    $$\displaystyle \begin{aligned} \mathrm{AIC} = - 2 \log L + 2k \, , \end{aligned} $$
    (2.16)

    where L is the maximum likelihood and k is the number of model parameters. The AIC is also related to the χ² quantity that determines the goodness of the fit [23].
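    A small Python sketch of how the AIC trades fit quality against the number of parameters (the Gaussian-noise likelihood, the toy data, and the helper aic_least_squares are my own illustration, not the book's):

import numpy as np

def aic_least_squares(y, y_pred, k):
    """AIC = -2 log L + 2k for a least-squares fit, using the Gaussian
    maximum likelihood with sigma^2 estimated as RSS / n; k counts all
    fitted parameters, including sigma^2."""
    n = len(y)
    sigma2 = np.sum((y - y_pred) ** 2) / n
    log_L = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    return -2.0 * log_L + 2.0 * k

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)   # data generated by a line

for degree in (1, 3, 5):
    coeffs = np.polyfit(x, y, degree)
    y_pred = np.polyval(coeffs, x)
    # k = (degree + 1) polynomial coefficients plus 1 for the noise variance.
    print(degree, aic_least_squares(y, y_pred, k=degree + 2))
# Higher-degree fits lower the residual slightly, but the 2k penalty tends
# to favor the simple linear model that actually generated the data.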

  14.

    For example, according to Theorem 20.6 of [25], if the number of learning parameters of a neural network whose activation function is a simple step function (described in the next section) is N_J, the VC dimension of the network is of order \( N_J \log N_J \). ResNet is not such a neural network, but let us estimate its VC dimension with this formula for reference. According to Table 6 of [24], a ResNet with 110 layers (= 1700 parameters) has an average error rate of 6.61% on the classification of CIFAR-10 (60,000 data), while the above formula gives a VC dimension of 12,645.25, and the second term of (2.11) is 0.57. Since the errors in the inequality are scaled to [0, 1], the error rate can be read as at most about 10%, and the above-mentioned error rate of 6.61% is well within this. The classification error on ImageNet [26] (approximately \(10^7\) data) is given in Table 4 of the same paper: the top-5 error rate is 4.49% with 152 layers, while the same simple calculation gives about 9%, so reality is again better than the upper limit from the inequality. The inequality (2.11) is a formula for binary classification, whereas CIFAR-10 has 10 classes and ImageNet has 1000 classes, so the estimates here should be taken only as a rough reference.
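    For what it is worth, the rough arithmetic behind the quoted VC-dimension estimate can be reproduced in a couple of lines of Python (taking the natural logarithm and the quoted parameter count at face value):

import math

N_J = 1700                          # parameter count quoted above
vc_estimate = N_J * math.log(N_J)   # order-of-magnitude VC dimension N_J log N_J
print(round(vc_estimate, 2))        # 12645.25, reproducing the value quoted above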

  15.

    In fact, the first derivative ∇_J D_KL(P||Q_J) alone is not enough to minimize the error. What we should really look at is the Hessian, corresponding to the second derivative, but this is not practical because of its computational cost.
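    A toy Python sketch (my own illustration, with an invented quadratic error) of why second-derivative information helps: on a quadratic error with an ill-conditioned Hessian, many small gradient steps are needed, while a single Newton step using the Hessian lands on the minimum. For realistic networks the Hessian is a huge matrix, which is exactly the cost problem mentioned above.

import numpy as np

# Toy quadratic error E(J) = 0.5 * J^T A J, whose Hessian is A.
A = np.diag([1.0, 100.0])          # ill-conditioned second derivative
grad = lambda J: A @ J             # first derivative

J_gd = np.array([1.0, 1.0])
eps = 0.009                        # step size; must stay below 2/100 for stability
for _ in range(200):
    J_gd = J_gd - eps * grad(J_gd)            # plain gradient descent

J_nt = np.array([1.0, 1.0])
J_nt = J_nt - np.linalg.solve(A, grad(J_nt))  # one Newton step using the Hessian

print(J_gd)   # still visibly away from the minimum in the flat direction
print(J_nt)   # [0. 0.]: the Hessian-based step reaches the minimum at once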

  16.

    If the value of ε is too large, the approximation "≈" in expression (2.19) becomes poor, and the actual parameter update may behave in unintended ways. This is related to the gradient explosion problem described in Chap. 4.
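    A one-dimensional Python sketch (my own illustration) of this instability, using the toy error E(J) = a J²/2, for which the update J ← J − ε a J stays stable only when ε < 2/a:

# Gradient descent on E(J) = 0.5 * a * J**2; the update is J <- J - eps * a * J.
a = 10.0
for eps in (0.05, 0.15, 0.25):
    J = 1.0
    for _ in range(30):
        J = J - eps * a * J
    print(eps, J)
# eps = 0.05: |1 - eps*a| = 0.5, J decays smoothly toward 0
# eps = 0.15: |1 - eps*a| = 0.5, J oscillates in sign but still converges
# eps = 0.25: |1 - eps*a| = 1.5 > 1, |J| grows without bound ("explodes")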

References

  1. Samuel, A.L.: Some studies in machine learning using the game of checkers. II – Recent progress. In: Computer Games I, pp. 366–400. Springer (1988)

  2. Feynman, R.P., Leighton, R.B., Sands, M.: The Feynman Lectures on Physics: Mainly Electromagnetism and Matter, vol. 2. Addison-Wesley, Reading, reprinted (1977)

  3. LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)

  4. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, Citeseer (2009)

  5. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)

  6. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. The MIT Press (2018)

  7. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Selected Papers of Hirotugu Akaike, pp. 199–213. Springer (1998)

  8. Borsanyi, S., et al.: Ab initio calculation of the neutron–proton mass difference. Science 347, 1452–1455 (2015)

  9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  10. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press (2014)

  11. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)

  12. Kawaguchi, K., Kaelbling, L.P., Bengio, Y.: Generalization in deep learning. arXiv preprint arXiv:1710.05468 (2017)

  13. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)



Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Tanaka, A., Tomiya, A., Hashimoto, K. (2021). Introduction to Machine Learning. In: Deep Learning and Physics. Mathematical Physics Studies. Springer, Singapore. https://doi.org/10.1007/978-981-33-6108-9_2


  • DOI: https://doi.org/10.1007/978-981-33-6108-9_2


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-33-6107-2

  • Online ISBN: 978-981-33-6108-9

  • eBook Packages: Physics and Astronomy (R0)
