
Introduction to Machine Learning

Deep Learning and Physics

Part of the book series: Mathematical Physics Studies ((MPST))


Abstract

In this chapter, we learn the general theory of machine learning. We take a look at examples of what learning is, what it means to say that a machine has "learned," and what relative entropy is. We also learn how to handle data in probability theory, and describe "generalization" and its importance in learning.


Notes

  1.

    Richard Feynman said [17], "We can imagine that this complicated array of moving things which constitutes 'the world' is something like a great chess game being played by the gods, and we are observers of the game. We do not know what the rules of the game are; all we are allowed to do is to watch the playing. Of course, if we watch long enough, we may eventually catch on to a few of the rules." It is a good parable that captures the essence of the inverse problem of guessing rules and structures. Machine learning, in terms of this parable, would mean that the observer is a machine instead of a human.

  2.

    Modified NIST (MNIST) is based on a database of handwritten character images created by the National Institute of Standards and Technology (NIST).

  3.

    A database created by the Canadian Institute for Advanced Research (CIFAR). The "10" in "CIFAR-10" indicates that there are 10 teacher labels (classes). There is also a dataset with more fine-grained labels, CIFAR-100.

  4.

    This process corresponds to P(x, d) = P(x|d)P(d), but in practice it is easier to collect data in the following order:

    1. Take an image x.
    2. Judge the label of the image and set it to d.
    3. Record (x, d).

    This process corresponds to P(x, d) = P(d|x)P(x). By Bayes' theorem (see the column in this chapter), the two sampling orders should give the same result, provided the data-generating probabilities exist.
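    As a quick numerical illustration (not from the book; the three-by-two joint distribution P_joint below and the variable names are invented for this sketch), the following Python snippet draws samples in both orders and checks that the empirical joint frequencies agree:

import numpy as np

rng = np.random.default_rng(0)

# Invented joint distribution P(x, d) over x in {0, 1, 2} and d in {0, 1}.
P_joint = np.array([[0.10, 0.20],
                    [0.30, 0.15],
                    [0.05, 0.20]])
P_x = P_joint.sum(axis=1)                 # P(x)
P_d = P_joint.sum(axis=0)                 # P(d)
P_d_given_x = P_joint / P_x[:, None]      # P(d|x)
P_x_given_d = P_joint / P_d[None, :]      # P(x|d)

N = 50_000

# Order "label first": d ~ P(d), then x ~ P(x|d), i.e. P(x, d) = P(x|d)P(d).
d1 = rng.choice(2, size=N, p=P_d)
x1 = np.array([rng.choice(3, p=P_x_given_d[:, d]) for d in d1])

# Order "image first": x ~ P(x), then d ~ P(d|x), i.e. P(x, d) = P(d|x)P(x).
x2 = rng.choice(3, size=N, p=P_x)
d2 = np.array([rng.choice(2, p=P_d_given_x[x]) for x in x2])

# Both empirical joint frequency tables approach P_joint, up to sampling noise.
print(np.histogram2d(x1, d1, bins=[3, 2])[0] / N)
print(np.histogram2d(x2, d2, bins=[3, 2])[0] / N)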

  5.

    Needless to say, the probabilities here are all classical; quantum theory has nothing to do with them.

  6.

    As mentioned in Chap. 1, relative entropy is also called the Kullback–Leibler divergence. Although it measures a kind of "distance," it does not satisfy the symmetry axiom of a distance, so it is called a divergence.
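    A minimal numerical sketch of this asymmetry in Python (the two three-outcome distributions and the helper function relative_entropy are my own illustration, not the book's):

import numpy as np

def relative_entropy(p, q):
    """Relative entropy (KL divergence) D_KL(p || q) of two discrete
    probability vectors, assuming all entries of q are nonzero."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.5, 0.3]

print(relative_entropy(P, Q))  # about 0.22
print(relative_entropy(Q, P))  # about 0.19 -- a different value, hence not symmetric
print(relative_entropy(P, P))  # 0.0 -- vanishes when the two distributions coincide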

  7.

    In general, "generalization error" often refers to the expectation value of the error function (which we will describe later). As shown later, the two are essentially the same thing.

  8.

    This is the same as using maximum likelihood estimation, just as we did when we introduced relative entropy in Chap. 1.

  9.

    If the reader is familiar with experimental physics, recall overfitting. See also [20].

  10.

    As described in the footnote below, this holds only when the discussion is restricted to binary classification. In addition, the inequality can be used only when both the generalization error and the empirical error have been normalized to take values in [0, 1] by some method [21].

  11.

    The exact definition of the VC dimension is as follows. We define the model Q_J as

    $$\displaystyle \begin{aligned} Q_J(\mathbf{x}, \mathbf{d}) = Q_J(\mathbf{d}|\mathbf{x})P(\mathbf{x}) \, , \end{aligned} $$
    (2.12)

    as we will do later. Also, using a function f_J with parameter J and the Dirac delta function δ, we write

    $$\displaystyle \begin{aligned} Q_J(\mathbf{d}|\mathbf{x}) = \delta(f_J(\mathbf{x}) - \mathbf{d}). \end{aligned} $$
    (2.13)

    Suppose further that we are working on the problem of binary classification with d = 0, 1. This means that f_J(x) assigns the input x to either 0 or 1. Now, if there are # data points, there are 2^# possible ways of assigning 0/1 to them. If we can vary J fully, then [f_J(x[1]), f_J(x[2]), …, f_J(x[#])] can realize all 2^# possible 0/1 assignments, and the model has the ability to fit the data completely (its capacity is saturated relative to the number of data points #). The VC dimension is the maximum value of # for which such a situation can be realized.
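    The following Python sketch (a toy example, not from the book) makes the "all 2^# labelings" condition concrete by brute force, for a hypothetical one-dimensional threshold model f_J(x) = 1 if x ≥ J, else 0:

import numpy as np

def can_shatter(points, classifiers):
    """Return True if the family `classifiers` realizes all 2^n possible
    0/1 labelings of the given points."""
    n = len(points)
    realized = {tuple(int(f(x)) for x in points) for f in classifiers}
    return len(realized) == 2 ** n

# Hypothetical threshold model: f_J(x) = 1 if x >= J else 0, with J swept
# over a fine grid standing in for "varying J fully".
thresholds = np.linspace(-2.0, 2.0, 401)
family = [lambda x, J=J: x >= J for J in thresholds]

print(can_shatter([0.3], family))       # True: a single point can be shattered
print(can_shatter([0.3, 0.7], family))  # False: the labeling (1, 0) is unreachable
# So the VC dimension of this threshold model is 1: the largest number of
# points for which all 2^# labelings can be realized.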

  12.

    Over-training is the situation of falling into this state.

  13.

    A well-known index based on a similar idea is Akaike's Information Criterion (AIC) [22]:

    $$\displaystyle \begin{aligned} \mathrm{AIC} = - 2 \log L + 2k \, , \end{aligned} $$
    (2.16)

    where L is the maximum likelihood and k is the number of model parameters. The AIC is also related to the χ² quantity that determines the goodness of the fit [23].
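    A small Python sketch of how the AIC trades fit quality against the number of parameters (the Gaussian-noise likelihood, the toy data, and the helper aic_least_squares are my own illustration, not the book's):

import numpy as np

def aic_least_squares(y, y_pred, k):
    """AIC = -2 log L + 2k for a least-squares fit, using the Gaussian
    maximum likelihood with sigma^2 estimated as RSS / n; k counts all
    fitted parameters, including sigma^2."""
    n = len(y)
    sigma2 = np.sum((y - y_pred) ** 2) / n
    log_L = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)
    return -2.0 * log_L + 2.0 * k

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)   # data generated by a line

for degree in (1, 3, 5):
    coeffs = np.polyfit(x, y, degree)
    y_pred = np.polyval(coeffs, x)
    # k = (degree + 1) polynomial coefficients plus 1 for the noise variance.
    print(degree, aic_least_squares(y, y_pred, k=degree + 2))
# Higher-degree fits lower the residual slightly, but the 2k penalty tends
# to favor the simple linear model that actually generated the data.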

  14.

    For example, according to Theorem 20.6 of [25], if the number of learning parameters of a neural network whose activation function is a simple step function (described in the next section) is N_J, the VC dimension of the network is of order \( N_J \log N_J \). ResNet is not such a neural network, but let us estimate its VC dimension with this formula for reference. According to Table 6 of [24], a ResNet with 110 layers (= 1700 parameters) has an average error rate of 6.61% on the classification of CIFAR-10 (60,000 data), while the above formula gives a VC dimension of 12,645.25, and the second term of (2.11) is 0.57. Since the errors in the inequality are scaled to [0, 1], the error rate can be read as at most about 10%, and the above-mentioned error rate of 6.61% is well within this. The classification error on ImageNet [26] (approximately \(10^7\) data) is given in Table 4 of the same paper: the top-5 error rate is 4.49% with 152 layers, while the same simple calculation gives about 9%, so reality is again better than the upper limit from the inequality. The inequality (2.11) is a formula for binary classification, whereas CIFAR-10 has 10 classes and ImageNet has 1000 classes, so the estimates here should be taken only as a rough reference.
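    For what it is worth, the rough arithmetic behind the quoted VC-dimension estimate can be reproduced in a couple of lines of Python (taking the natural logarithm and the quoted parameter count at face value):

import math

N_J = 1700                          # parameter count quoted above
vc_estimate = N_J * math.log(N_J)   # order-of-magnitude VC dimension N_J log N_J
print(round(vc_estimate, 2))        # 12645.25, reproducing the value quoted above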

  15.

    In fact, the first derivative ∇_J D_KL(P||Q_J) alone is not enough to minimize the error. What we should really look at is the Hessian, corresponding to the second derivative, but this is not practical because of its computational cost.
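    A toy Python sketch (my own illustration, with an invented quadratic error) of why second-derivative information helps: on a quadratic error with an ill-conditioned Hessian, many small gradient steps are needed, while a single Newton step using the Hessian lands on the minimum. For realistic networks the Hessian is a huge matrix, which is exactly the cost problem mentioned above.

import numpy as np

# Toy quadratic error E(J) = 0.5 * J^T A J, whose Hessian is A.
A = np.diag([1.0, 100.0])          # ill-conditioned second derivative
grad = lambda J: A @ J             # first derivative

J_gd = np.array([1.0, 1.0])
eps = 0.009                        # step size; must stay below 2/100 for stability
for _ in range(200):
    J_gd = J_gd - eps * grad(J_gd)            # plain gradient descent

J_nt = np.array([1.0, 1.0])
J_nt = J_nt - np.linalg.solve(A, grad(J_nt))  # one Newton step using the Hessian

print(J_gd)   # still visibly away from the minimum in the flat direction
print(J_nt)   # [0. 0.]: the Hessian-based step reaches the minimum at once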

  16.

    If the value of ε is too large, the approximation "≈" in expression (2.19) becomes poor, and the actual parameter update may behave in unintended ways. This is related to the gradient explosion problem described in Chap. 4.
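    A one-dimensional Python sketch (my own illustration) of this instability, using the toy error E(J) = a J²/2, for which the update J ← J − ε a J stays stable only when ε < 2/a:

# Gradient descent on E(J) = 0.5 * a * J**2; the update is J <- J - eps * a * J.
a = 10.0
for eps in (0.05, 0.15, 0.25):
    J = 1.0
    for _ in range(30):
        J = J - eps * a * J
    print(eps, J)
# eps = 0.05: |1 - eps*a| = 0.5, J decays smoothly toward 0
# eps = 0.15: |1 - eps*a| = 0.5, J oscillates in sign but still converges
# eps = 0.25: |1 - eps*a| = 1.5 > 1, |J| grows without bound ("explodes")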

References

  1. Samuel, A.L.: Some studies in machine learning using the game of checkers. II – Recent progress. In: Computer Games I, pp. 366–400. Springer (1988)

  2. Feynman, R.P., Leighton, R.B., Sands, M.: The Feynman Lectures on Physics: Mainly Electromagnetism and Matter, vol. 2. Addison-Wesley, Reading, reprinted (1977)

  3. LeCun, Y.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/ (1998)

  4. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Technical report, Citeseer (2009)

  5. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)

  6. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of Machine Learning. The MIT Press (2018)

  7. Akaike, H.: Information theory and an extension of the maximum likelihood principle. In: Selected Papers of Hirotugu Akaike, pp. 199–213. Springer (1998)

  8. Borsanyi, S., et al.: Ab initio calculation of the neutron–proton mass difference. Science 347, 1452–1455 (2015)

  9. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  10. Shalev-Shwartz, S., Ben-David, S.: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press (2014)

  11. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)

  12. Kawaguchi, K., Kaelbling, L.P., Bengio, Y.: Generalization in deep learning. arXiv preprint arXiv:1710.05468 (2017)

  13. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)



Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Tanaka, A., Tomiya, A., Hashimoto, K. (2021). Introduction to Machine Learning. In: Deep Learning and Physics. Mathematical Physics Studies. Springer, Singapore. https://doi.org/10.1007/978-981-33-6108-9_2


  • DOI: https://doi.org/10.1007/978-981-33-6108-9_2


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-33-6107-2

  • Online ISBN: 978-981-33-6108-9

  • eBook Packages: Physics and Astronomy (R0)
