Abstract
In this chapter, we learn the general theory of machine learning. We shall look at examples of what learning is, what it means for a machine to have “learned,” and what relative entropy is. We will learn how data are handled in probability theory, and describe “generalization” and its importance in learning.
Notes
- 1.
Richard Feynman said [17], “We can imagine that this complicated array of moving things which constitutes “the world” is something like a great chess game being played by the gods, and we are observers of the game. We do not know what the rules of the game are; all we are allowed to do is to watch the playing. Of course, if we watch long enough, we may eventually catch on to a few of the rules.” It is a good parable that captures the essence of the inverse problem of guessing rules and structures. Machine learning, in terms of this parable, would mean that the observer is a machine instead of a human.
- 2.
Modified NIST (MNIST) is based on a database of handwritten character images created by the National Institute of Standards and Technology (NIST).
- 3.
A database created by the Canadian Institute for Advanced Research (CIFAR). The “10” in “CIFAR-10” indicates that there are 10 class (teacher) labels. There is also a dataset with finer-grained labels, CIFAR-100.
- 4.
This process corresponds to P(x, d) = P(x|d)P(d), but in practice it is easier to collect data in the following order:
$$\displaystyle \begin{aligned} &\text{1. Take an image}\ \mathbf{x}. \\ &\text{2. Judge the label of the image and set it to}\ \mathbf{d}. \\ &\text{3. Record}\ (\mathbf{x}, \mathbf{d}). \end{aligned} $$This process corresponds to P(x, d) = P(d|x)P(x). By Bayes’ theorem (see the column in this chapter), the resulting samples follow the same distribution, provided we admit the existence of a data-generating probability.
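The claim that the two collection orders yield the same sampling can be checked numerically. The sketch below uses a hypothetical two-valued toy distribution (all numbers are illustrative, not from the text) and draws from the joint both as P(d|x)P(x) and as P(x|d)P(d):

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical toy joint distribution P(x, d) over x in {0, 1}, d in {0, 1}.
P = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# Marginals derived from the joint.
P_x = {x: P[(x, 0)] + P[(x, 1)] for x in (0, 1)}
P_d = {d: P[(0, d)] + P[(1, d)] for d in (0, 1)}

def sample_via_x_first():
    # Order "take an image x, then judge its label d": P(d|x)P(x).
    x = 0 if random.random() < P_x[0] else 1
    d = 0 if random.random() < P[(x, 0)] / P_x[x] else 1
    return (x, d)

def sample_via_d_first():
    # Order "fix a label d, then draw an image x": P(x|d)P(d).
    d = 0 if random.random() < P_d[0] else 1
    x = 0 if random.random() < P[(0, d)] / P_d[d] else 1
    return (x, d)

n = 100_000
c1 = Counter(sample_via_x_first() for _ in range(n))
c2 = Counter(sample_via_d_first() for _ in range(n))
for pair in sorted(P):
    # Both empirical frequencies agree with the joint, up to sampling noise.
    print(pair, c1[pair] / n, c2[pair] / n, P[pair])
```

Both sampling orders reproduce the same joint distribution, as Bayes’ theorem guarantees.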
- 5.
Needless to say, the probabilities here are all classical; quantum theory has nothing to do with them.
- 6.
As mentioned in Chap. 1, relative entropy is also called the Kullback–Leibler divergence. Although it measures a “distance,” it does not satisfy the symmetry axiom required of a distance, so it is called a divergence.
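The asymmetry mentioned here is easy to see numerically; a minimal sketch with illustrative distributions p and q (not taken from the text):

```python
import math

def kl(p, q):
    # D_KL(p || q) = sum_i p_i log(p_i / q_i); assumes q_i > 0 wherever p_i > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.1, 0.1]
q = [0.4, 0.3, 0.3]

print(kl(p, q))  # D_KL(P||Q)
print(kl(q, p))  # D_KL(Q||P): a different value, so symmetry fails
print(kl(p, p))  # 0 when the two distributions coincide
```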
- 7.
In general, “generalization error” often refers to the expectation value of the error function (described later). As shown later, the two are essentially the same thing.
- 8.
This is the same as using maximum likelihood estimation, just as we did when we introduced relative entropy in Chap. 1.
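The equivalence between maximum likelihood estimation and minimizing the relative entropy to the empirical distribution can be illustrated with a toy Bernoulli model (the data and the parameter grid are hypothetical):

```python
import math

# Toy binary data; P-hat is the empirical distribution of the sample.
data = [1, 1, 0, 1, 0, 1, 1, 0]
p_hat = sum(data) / len(data)  # empirical frequency of 1 (= 0.625)

def neg_log_likelihood(theta):
    # Minimizing this is maximum likelihood estimation.
    return -sum(math.log(theta if d == 1 else 1 - theta) for d in data)

def kl_to_model(theta):
    # D_KL(P_hat || Q_theta) for a Bernoulli model Q_theta.
    return (p_hat * math.log(p_hat / theta)
            + (1 - p_hat) * math.log((1 - p_hat) / (1 - theta)))

# Scan theta on a grid; both criteria pick the same parameter.
grid = [i / 1000 for i in range(1, 1000)]
theta_mle = min(grid, key=neg_log_likelihood)
theta_kl = min(grid, key=kl_to_model)
print(theta_mle, theta_kl)  # both 0.625, the empirical mean
```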
- 9.
If the reader is familiar with experimental physics, recall overfitting. See also [20].
- 10.
As described in the footnote below, this holds only when restricted to binary classification. In addition, the inequality applies when both the generalization error and the empirical error have been normalized to take values in [0, 1] by some method [21].
- 11.
The exact definition of the VC dimension is as follows. We define the model QJ as
$$\displaystyle \begin{aligned} Q_J(\mathbf{x}, \mathbf{d}) = Q_J(\mathbf{d}|\mathbf{x})P(\mathbf{x}) \, , \end{aligned} $$(2.12)as we will do later. Also, using a function fJ with parameter J and the Dirac delta function δ, we write
$$\displaystyle \begin{aligned} Q_J(\mathbf{d}|\mathbf{x}) = \delta(f_J(\mathbf{x}) - \mathbf{d}). \end{aligned} $$(2.13)Suppose further that we are working on the problem of binary classification with d = 0, 1. This means that fJ(x) assigns the input x to either 0 or 1. Now, if there are # data points, there are 2^# possible ways of assigning 0/1. If we can vary J freely and [fJ(x[1]), fJ(x[2]), …, fJ(x[#])] can realize all possible 0/1 assignments, then this model has the ability to fit the data completely (the capacity is saturated relative to the number of data points #). The VC dimension is the maximum value of # for which such a situation is realized.
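The shattering condition in this definition can be checked by brute force for a small family. The sketch below uses a hypothetical 1-D threshold family (not a model from the text): it realizes all labelings of any single point but not of two points, so its VC dimension is 1.

```python
def realizable_labelings(points, classifiers):
    # All 0/1 assignments on the points that some classifier in the family realizes.
    return {tuple(f(x) for x in points) for f in classifiers}

def shatters(points, classifiers):
    # The family shatters the points iff it realizes all 2^# labelings.
    return len(realizable_labelings(points, classifiers)) == 2 ** len(points)

# Hypothetical 1-D family: threshold functions f_theta(x) = 1 if x > theta else 0.
thresholds = [i / 10 for i in range(-10, 21)]
classifiers = [lambda x, t=t: int(x > t) for t in thresholds]

print(shatters([0.5], classifiers))       # one point: shattered -> True
print(shatters([0.3, 0.7], classifiers))  # two points: labeling (1, 0) unrealizable -> False
```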
- 12.
Over-training is the situation of falling into this state.
- 13.
A well-known criterion based on a similar idea is Akaike’s Information Criterion (AIC) [22]:
$$\displaystyle \begin{aligned} \mathrm{AIC} = - 2 \log L + 2k \, , \end{aligned} $$(2.16)where L is the maximum likelihood and k is the number of model parameters. The AIC is also related to the χ2 statistic that determines the goodness of fit [23].
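Equation (2.16) is straightforward to evaluate; a minimal sketch with purely illustrative likelihood values:

```python
def aic(log_likelihood_max, k):
    # AIC = -2 log L + 2k; smaller is better, so the 2k term penalizes parameters.
    return -2.0 * log_likelihood_max + 2 * k

# Hypothetical comparison: a 2-parameter model with slightly lower likelihood
# can still beat a 10-parameter model that fits the sample a bit better.
print(aic(-100.0, 2))   # 204.0
print(aic(-98.0, 10))   # 216.0 -> the simpler model wins
```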
- 14.
For example, according to Theorem 20.6 of [25], if the number of learning parameters of a neural network having a simple step function as an activation function (described in the next section) is NJ, the VC dimension of the neural network is of the order of \( N_J \log N_J \). ResNet is not such a neural network, but let us estimate its VC dimension with this formula for reference. For example, according to Table 6 of [24], a ResNet with 110 layers (=1700 parameters) has an average error rate of 6.61% for the classification of CIFAR-10 (60,000 data), while the VC dimension is 12,645.25 according to the above formula, and the second term of (2.11) is 0.57. Since the errors are scaled to [0,1] in the inequalities, the error rate can be read as at most about 10%, and the above-mentioned error rate of 6.61% is well within this. A classification error for ImageNet [26] (with approximately \(10^7\) data) is given in Table 4 of the same paper: a top-5 error rate of 4.49% with 152 layers, while the same simple calculation gives about 9%, so reality is again better than the upper limit of the inequality. The inequality (2.11) is a formula for binary classification, but CIFAR-10 has 10 classes and ImageNet has 1000 classes, so the estimates here should be taken only as a rough reference.
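The numbers quoted in this footnote can be reproduced as follows. The second computation uses sqrt(d log(N/d)/N) as an assumed stand-in for the second term of (2.11), whose exact form is given in the main text, so treat it as a sketch only:

```python
import math

# Rough VC-dimension estimate N_J log N_J for a step-activation network
# (Theorem 20.6 of [25]).
N_J = 1700                     # parameter count used in the footnote's estimate
d_vc = N_J * math.log(N_J)
print(round(d_vc, 2))          # 12645.25

# Assumed stand-in for the second term of (2.11): sqrt(d_vc * log(N / d_vc) / N).
N = 60_000                     # CIFAR-10 sample size
gap = math.sqrt(d_vc * math.log(N / d_vc) / N)
print(round(gap, 2))           # 0.57
```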
- 15.
In fact, the first derivative ∇J DKL(P||QJ) alone is not enough to minimize the error. We should also look at the Hessian, which corresponds to the second derivative, but this is not practical because of its computational cost.
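A first-order update of the kind discussed here can be sketched for a one-parameter Bernoulli model (the setup and all values are illustrative, not from the text):

```python
import math

# Minimal sketch: gradient descent on D_KL(P || Q_theta) for a Bernoulli model
# Q_theta; p is the (assumed known) data probability of outcome 1.
p = 0.7

def kl(theta):
    return p * math.log(p / theta) + (1 - p) * math.log((1 - p) / (1 - theta))

def grad(theta):
    # d/dtheta D_KL = -p/theta + (1 - p)/(1 - theta); zero exactly at theta = p.
    return -p / theta + (1 - p) / (1 - theta)

theta, lr = 0.5, 0.05
for _ in range(200):
    theta -= lr * grad(theta)  # first-order update, no Hessian needed

print(round(theta, 3))  # converges to p = 0.7
```

For a single parameter the Hessian is cheap, but for NJ parameters it has NJ^2 entries, which is what makes second-order methods impractical at scale.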
References
Samuel, A.L.: Some studies in machine learning using the game of checkers. II – recent progress. In: Computer Games I, pp. 366–400. Springer (1988)
Feynman, R.P., Leighton, R.B., Sands, M.: The Feynman Lectures on Physics: Mainly Electromagnetism and Matter, vol. 2. Addison-Wesley, Reading, reprinted (1977)
LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)
Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, Citeseer (2009)
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. The MIT Press (2018)
Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Selected Papers of Hirotugu Akaike, pp. 199–213. Springer (1998)
Borsanyi, S., et al.: Ab initio calculation of the neutron-proton mass difference. Science 347, 1452–1455 (2015)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press (2014)
Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
Kawaguchi, K., Kaelbling, L.P., Bengio, Y.: Generalization in deep learning. arXiv preprint arXiv:1710.05468 (2017)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Tanaka, A., Tomiya, A., Hashimoto, K. (2021). Introduction to Machine Learning. In: Deep Learning and Physics. Mathematical Physics Studies. Springer, Singapore. https://doi.org/10.1007/978-981-33-6108-9_2
Print ISBN: 978-981-33-6107-2
Online ISBN: 978-981-33-6108-9