
Unsupervised Deep Learning

Deep Learning and Physics

Part of the book series: Mathematical Physics Studies ((MPST))


Abstract

In this chapter, at the end of Part I, we explain Boltzmann machines and generative adversarial networks (GANs). Neither model is a “find the answer” network of the kind given in Chap. 3; rather, the network itself gives the probability distribution of the input. Boltzmann machines have historically been a cornerstone of neural networks and are described by the Hamiltonian statistical mechanics of multi-particle spin systems; they form an important bridge between machine learning and physics. Generative adversarial networks are also one of the important topics in deep learning in recent years, and we try to explain them from a physical point of view.


Notes

  1.

    In addition to generative models, tasks such as clustering and principal component analysis belong to unsupervised learning. Note that, since the data are provided in advance, this setting differs from schemes such as reinforcement learning, where data are not provided in advance. Reinforcement learning is a very important machine learning technique that is not discussed in this book; instead, we provide some references here. The standard textbook is Ref. [76] by Sutton et al., which also describes the historical background.

  2.

    It is very difficult to compute the partition function with the Monte Carlo method. Roughly speaking, the partition function requires information about all states, whereas the Monte Carlo method concentrates on the states that contribute to expectation values.
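The footnote's point can be made concrete with a toy Ising chain (our own illustration, not from the book): computing Z exactly forces a sum over all 2^N states, while a Metropolis chain estimates the energy expectation value without ever computing Z, because only ratios of Boltzmann weights enter the accept/reject step.

```python
import itertools, math, random

# Toy 1D periodic Ising chain with J = 1 (illustrative example).
N, beta = 8, 0.5

def energy(s):
    # Nearest-neighbor coupling with periodic boundary conditions
    return -sum(s[i] * s[(i + 1) % N] for i in range(N))

# Exact: needs ALL 2^N states -- this is what makes Z hard in general.
states = list(itertools.product([-1, 1], repeat=N))
Z = sum(math.exp(-beta * energy(s)) for s in states)
E_exact = sum(energy(s) * math.exp(-beta * energy(s)) for s in states) / Z

# Monte Carlo: never computes Z, only ratios of Boltzmann weights.
random.seed(0)
s = [random.choice([-1, 1]) for _ in range(N)]
samples = []
for step in range(20000):
    i = random.randrange(N)
    dE = 2 * s[i] * (s[(i - 1) % N] + s[(i + 1) % N])  # energy change of a flip
    if dE <= 0 or random.random() < math.exp(-beta * dE):
        s[i] = -s[i]
    if step > 2000:  # discard thermalization
        samples.append(energy(s))
E_mc = sum(samples) / len(samples)
print(abs(E_mc - E_exact))  # small, even though Z was never estimated
```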

  3.

    The effective Hamiltonian is defined to satisfy the following equation: \( \exp [-H_J^{\mathrm{eff}}(\mathbf{x})] = \sum_{\mathbf{h}} \exp [-H_J(\mathbf{x},\mathbf{h})]/Z_J \).
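For a binary restricted Boltzmann machine the hidden sum in this definition can be done in closed form. The sketch below (our own notation: weights W, visible bias b, hidden bias c, and the constant Z_J dropped) checks the closed form against a brute-force sum over all hidden configurations.

```python
import numpy as np

# For H(x, h) = -b.x - c.h - x.W.h with binary h in {0, 1}, summing over h
# factorizes per hidden unit, giving the closed-form "free energy" of x
# (the footnote's H_eff, up to the overall constant Z_J).

rng = np.random.default_rng(0)
nv, nh = 4, 3
W = rng.normal(size=(nv, nh))
b, c = rng.normal(size=nv), rng.normal(size=nh)

def H(x, h):
    return -(b @ x + c @ h + x @ W @ h)

def H_eff(x):
    # Each binary hidden unit contributes a factor (1 + exp(c_j + (xW)_j))
    return -(b @ x + np.sum(np.log1p(np.exp(c + x @ W))))

x = rng.integers(0, 2, size=nv).astype(float)
# Brute-force check: sum exp(-H) over all 2^nh hidden configurations
Z_h = sum(np.exp(-H(x, np.array(h, dtype=float)))
          for h in np.ndindex(*(2,) * nh))
print(np.isclose(np.exp(-H_eff(x)), Z_h))  # True
```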

  4.

    The contrastive divergence method is abbreviated as the “CD method.” What is used here is also called the CD-1 method. In general, the contrastive divergence method is called the CD-k method, where k is the number of heat-bath sampling steps performed between x and h.
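A minimal CD-1 update for a binary RBM can be sketched as follows (our own notation and hyperparameters, not the book's): one Gibbs step x → h → x′ replaces the intractable model expectation in the log-likelihood gradient.

```python
import numpy as np

# CD-1 sketch for a binary RBM with weights W, visible bias b, hidden bias c.
rng = np.random.default_rng(1)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

nv, nh, lr = 6, 4, 0.1
W = 0.01 * rng.normal(size=(nv, nh))
b, c = np.zeros(nv), np.zeros(nh)

def cd1_step(x):
    global W, b, c
    # Up: sample hidden units given the data vector (heat-bath step)
    ph = sigmoid(c + x @ W)
    h = (rng.random(nh) < ph).astype(float)
    # Down: reconstruct visible units, then recompute hidden probabilities
    px = sigmoid(b + W @ h)
    xr = (rng.random(nv) < px).astype(float)
    phr = sigmoid(c + xr @ W)
    # CD-1 gradient: data statistics minus one-step reconstruction statistics
    W += lr * (np.outer(x, ph) - np.outer(xr, phr))
    b += lr * (x - xr)
    c += lr * (ph - phr)

data = rng.integers(0, 2, size=(50, nv)).astype(float)
for epoch in range(5):
    for x in data:
        cd1_step(x)
print(W.shape)  # (6, 4)
```

CD-k would simply repeat the down-up pair of sampling steps k times before taking the statistics.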

  5.

    By the way, since z usually lives in a space of several hundred dimensions, there is little practical difference between the Gaussian distribution and the uniform distribution on the sphere, owing to the curse of dimensionality: the higher the dimension, the more the volume of a ball concentrates in a thin shell near its surface.
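This concentration can be checked numerically (our own illustration): the norm of a standard Gaussian vector in d dimensions concentrates around √d, so high-dimensional Gaussian samples effectively live on a thin spherical shell.

```python
import numpy as np

# Rescaled radius |z| / sqrt(d) of standard Gaussian samples: its mean stays
# near 1 while its spread shrinks roughly like 1 / sqrt(2d).
rng = np.random.default_rng(0)
for d in (2, 100, 10000):
    z = rng.normal(size=(2000, d))
    r = np.linalg.norm(z, axis=1) / np.sqrt(d)
    print(d, round(r.mean(), 3), round(r.std(), 4))
```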

  6.

    This is written with the sigmoid function \( \sigma(u) = (1 + e^{-u})^{-1} \) as

    $$\displaystyle \begin{aligned} -\langle \log \sigma (D(x)) \rangle_{x\sim P(x)} - \langle \log (1-\sigma (D(x))) \rangle_{x\sim Q_G(x)}\end{aligned} $$
    (6.34)

    and it is similar to the cross-entropy error (3.16), the error function derived for binary classification in Chap. 3. In fact, this is identical to the binary classification problem of discriminating whether an item of data is real (x ∼ P(x)) or not (x ∼ Q_G(x)).
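The identification with binary cross-entropy can be verified directly in code (our own sketch: D values stand for the discriminator's logits, labels are 1 for real and 0 for generated samples).

```python
import numpy as np

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

def gan_d_loss(d_real, d_fake):
    # -<log sigma(D(x))>_{x~P} - <log(1 - sigma(D(x)))>_{x~Q_G}, as in (6.34)
    return -(np.mean(np.log(sigmoid(d_real)))
             + np.mean(np.log(1.0 - sigmoid(d_fake))))

def bce(logits, labels):
    # Standard binary cross-entropy with labels real = 1, fake = 0
    p = sigmoid(logits)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

rng = np.random.default_rng(0)
d_real, d_fake = rng.normal(1, 1, 100), rng.normal(-1, 1, 100)
logits = np.concatenate([d_real, d_fake])
labels = np.concatenate([np.ones(100), np.zeros(100)])
# The two expressions agree up to the overall normalization of the average
# (here a factor 2, since BCE averages over both populations at once).
print(np.isclose(gan_d_loss(d_real, d_fake), 2 * bce(logits, labels)))  # True
```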

  7.

    The minimax theorem states that if f(x, y) is a convex (concave) function with respect to the first (second) variable x (y), then

    $$\displaystyle \begin{aligned} \min_x \max_y f (x, y) = \max_y \min_x f (x, y) \, . \end{aligned} $$
    (6.42)

    Taking f = V_D and writing the solution of this minimax problem as \( \overline {G}, \overline {D} \), one can prove that they satisfy the Nash equilibrium conditions (6.39) and (6.40). The proof uses the condition V_G + V_D = 0 (the zero-sum condition).
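The exchange of min and max can be checked numerically on a simple convex-concave example (our own illustration): f(x, y) = x² + xy − y² is convex in x, concave in y, with its saddle point at the origin.

```python
import numpy as np

# Grid-search check of min_x max_y f = max_y min_x f for a convex-concave f.
xs = np.linspace(-2, 2, 401)
ys = np.linspace(-2, 2, 401)
X, Y = np.meshgrid(xs, ys, indexing="ij")
F = X**2 + X * Y - Y**2

min_max = F.max(axis=1).min()   # min over x of (max over y)
max_min = F.min(axis=0).max()   # max over y of (min over x)
print(min_max, max_min)  # both 0 at the saddle point (0, 0)
```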

  8.

    Just as V_D in the original GAN corresponds to the cross entropy, the objective function (6.49), called the hinge loss, corresponds to the error function of a support vector machine (which is not described in this book).
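The two losses mentioned here can be compared side by side (a sketch in our own notation: label t ∈ {+1, −1}, logit u). The hinge loss is exactly zero once the margin t·u reaches 1, whereas the cross-entropy (logistic) loss only decays smoothly.

```python
import numpy as np

def hinge(u, t):
    # SVM-style hinge loss: penalizes margins t*u below 1, zero beyond
    return np.maximum(0.0, 1.0 - t * u)

def cross_entropy(u, t):
    # -log sigma(t * u): the logistic loss used by the original GAN objective
    return np.log1p(np.exp(-t * u))

u = np.array([-2.0, 0.0, 0.5, 2.0])
print(hinge(u, +1))                       # zero once the margin reaches 1
print(np.round(cross_entropy(u, +1), 3))  # strictly positive everywhere
```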

  9.

    This is a detailed version of the proof provided in the appendix of the original paper.

  10.

    To be precise, it is better to say \( P(x)-Q_{G^*}(x) \geq 0 \) almost everywhere. The meaning of this expression is explained in the next footnote.

  11.

    Let us prove it by reductio ad absurdum. The negation of the statement that (6.66) holds almost everywhere is

    $$\displaystyle \begin{aligned} S=\{ x \,|\, D^*(x) > m \} \ \text{contributes to the integral}.\end{aligned} $$
    (6.63)

    We define a new \(\tilde {D}(x) = \min (m, D^*(x) )\); substituting \( D=\tilde {D} \) into V_D(G^*, D) in (6.53) gives

    $$\displaystyle \begin{aligned} V_D(G^*, \tilde{D}) &= \int_{S \cup S^c} dx \Big[ P(x) \tilde{D}(x) + Q_{G^*}(x) \max ( 0, m- \tilde{D}(x) ) \Big] \\ &= \int_{S } dx \Big[ P(x) \underbrace{ \tilde{D}(x) }_{ =m < D^*(x) } + Q_{G^*}(x) \underbrace{ \max ( 0, m- \tilde{D}(x) ) }_{ =0 = \max(0, m - D^*(x)) } \Big] \\ &\quad +\int_{S^c} dx \Big[ P(x) \underbrace{ \tilde{D}(x) }_{ =D^*(x) } + Q_{G^*}(x) \max ( 0, m- \underbrace{ \tilde{D}(x) }_{ =D^*(x) } ) \Big] \\ &< \int_{S \cup S^c} dx \Big[ P(x) D^*(x) + Q_{G^*}(x) \max ( 0, m- D^*(x) ) \Big] = V_D(G^*, D^*) \, .\end{aligned} $$
    (6.64)

    This inequality contradicts the Nash equilibrium definition (6.40), according to which D^* minimizes V_D(G^*, D) over D.

  12.

    This is equivalent to

    $$\displaystyle \begin{aligned} \int 1_{ D^*(x) > m} dx = 0 \, .\end{aligned} $$
    (6.65)

    In other words, the set S = {x | D^*(x) > m} has measure zero, so it does not contribute to the integral.

  13.

    This generalization is not necessary to derive the final form of the WGAN, but considering the Helmholtz free energy makes the derivation easier to understand.

  14.

    The “+ 1” term is not necessary, but it makes the later discussion cleaner.

  15.

    For example, in the actual computational algorithm there is a way to reduce the amount of computation compared to the original zero-temperature problem [85].

  16.

    As another example, support vector machines (which are not described in this book) also have a duality.

  17.

    Actually, spectral normalization was introduced to stabilize the training of conventional GANs rather than for use in WGANs, and the authors are not aware of successful examples of spectral normalization in WGAN implementations. Since this normalization makes the network K-Lipschitz continuous for some constant K, however, there is no reason why it could not be used for the WGAN.
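The core of spectral normalization can be sketched in a few lines (our own implementation, not the authors' code): estimate the largest singular value of a weight matrix by power iteration and divide it out, so that the corresponding linear layer becomes 1-Lipschitz.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_normalize(W, n_iter=50):
    # Power iteration: u and v converge to the top left/right singular vectors
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # estimated largest singular value
    return W / sigma

W = rng.normal(size=(8, 5))
Wn = spectral_normalize(W)
# After normalization the spectral norm (top singular value) is 1.
print(np.isclose(np.linalg.svd(Wn, compute_uv=False)[0], 1.0))  # True
```

In practice only one or a few power-iteration steps are done per training update, reusing the vectors from the previous step.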

  18.

    Noteworthy deep generative models that we have not been able to introduce here include the variational auto-encoder (VAE) [88, 89] and nonlinear independent component estimation (NICE) [90]. A brief review is provided by the physicist L. Wang [91].

  19.

    Yet another transform

    $$\displaystyle \begin{aligned} \text{(6.109)} = \int dx \sum_d Q_{J^* G^*} (x, d) \log \frac{Q_{J^* G^*} (x, d) }{ Q_{G^*}(x) Q(d) }, \qquad Q_{J^* G^*} (x, d) = Q_{J^*} (d|x)\, Q_{G^*}(x) \end{aligned} $$
    (6.108)

    provides a quantity called the mutual information between the generated images and the classification labels. Reference [93] used this to hack the IS, that is, to generate images with an unusually large IS (even though the images make little sense to the human eye). As this example shows, a higher IS is not always better.
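The Inception Score itself is easy to state in code (our own implementation sketch): it is the exponential of the mean KL divergence between the per-image label distribution p(y|x) and the marginal p(y), which by the identity above is the mutual information between images and predicted labels.

```python
import numpy as np

def inception_score(p_yx):
    # p_yx: (n_images, n_classes) rows of classifier softmax outputs
    p_y = p_yx.mean(axis=0)                                   # marginal p(y)
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)  # KL(p(y|x)||p(y))
    return np.exp(kl.mean())

n_cls = 10
# Best case: every image is classified confidently and all classes appear
sharp = np.full((n_cls, n_cls), 1e-12)
np.fill_diagonal(sharp, 1.0)
sharp /= sharp.sum(axis=1, keepdims=True)
# Worst case: the classifier is uniform on every image
flat = np.full((20, n_cls), 1.0 / n_cls)

print(round(inception_score(sharp), 2), inception_score(flat))
```

The score thus ranges from 1 (uninformative) up to the number of classes (sharp and diverse), which is why maximizing the mutual information directly can inflate it without improving the images.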

  20.

    There is another index called the Fréchet inception distance (FID) [95]. It is the Wasserstein distance (also called the Fréchet distance, from which the name FID follows) between the distribution of data images and the distribution of generated images in the hidden layer (feature space) of an image classification network, under the assumption that the features of the classification network follow a Gaussian distribution.
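Under the Gaussian assumption the Wasserstein-2 distance has a closed form, which is all the FID computes (our own sketch; for simplicity the matrix square root assumes a symmetric positive-semidefinite product, which holds in the examples below):

```python
import numpy as np

def sqrtm_psd(A):
    # Square root of a symmetric PSD matrix via eigendecomposition
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

def fid(mu1, C1, mu2, C2):
    # ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2}) for N(mu1,C1), N(mu2,C2)
    covmean = sqrtm_psd(C1 @ C2)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(C1 + C2 - 2.0 * covmean))

d = 4
mu, C = np.zeros(d), np.eye(d)
print(fid(mu, C, mu, C))        # 0.0 for identical Gaussians
print(fid(mu, C, mu + 1.0, C))  # 4.0: only the mean term contributes
```

In the real FID, mu and C are the empirical mean and covariance of Inception-network features of the data images and the generated images, respectively.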

  21.

    The name “Inception” comes from the title of the popular Hollywood movie Inception: the title of the GoogLeNet paper is “Going deeper with convolutions,” while the main character’s line in the movie is “We need to go deeper.” The original paper even cites the movie (via an article summarizing its background). It is a witty name that makes anyone who has seen the movie grin.

References

  1. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)


  2. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)

  3. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)


  4. Sutton, R.S., Barto, A.G., et al.: Introduction to Reinforcement Learning, vol. 135. MIT Press, Cambridge, MA (1998)


  5. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), 1771–1800 (2002)


  6. Bengio, Y., Delalleau, O.: Justifying and generalizing contrastive divergence. Neural Comput. 21(6), 1601–1621 (2009)


  7. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)


  8. von Neumann, J.: Zur Theorie der Gesellschaftsspiele [On the theory of games of strategy]. Mathematische Annalen 100(1), 295–320 (1928)

  9. Zhao, J., Mathieu, M., LeCun, Y.: Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126 (2016)


  10. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)

  11. Villani, C.: Optimal Transport: Old and New, vol. 338. Springer Science & Business Media (2008)


  12. Peyré, G., Cuturi, M., et al.: Computational optimal transport. Found. Trends Mach. Learn. 11(5–6), 355–607 (2019)

  13. Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transport. In: Advances in Neural Information Processing Systems, pp. 2292–2300 (2013)


  14. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Advances in Neural Information Processing Systems, pp. 5767–5777 (2017)

  15. Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957 (2018)


  16. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)

  17. Tolstikhin, I., Bousquet, O., Gelly, S., Schoelkopf, B.: Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558 (2017)


  18. Dinh, L., Krueger, D., Bengio, Y.: NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516 (2014)

  19. Wang, L.: Generative models for physicists (2018). https://wangleiphy.github.io/lectures/PILtutorial.pdf

  20. Dodge, S., Karam, L.: A study and comparison of human and deep learning recognition performance under visual distortions. In: 26th International Conference on Computer Communication and Networks (ICCCN), pp. 1–7. IEEE (2017)


  21. Barratt, S., Sharma, R.: A note on the inception score. arXiv preprint arXiv:1801.01973 (2018)


  22. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, pp. 2234–2242 (2016)

  23. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, pp. 6626–6637 (2017)

  24. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)


  25. Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223 (2011)


  26. Miyato, T., Koyama, M.: cGANs with projection discriminator. arXiv preprint arXiv:1802.05637 (2018)


  27. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)


  28. Maeda, S., Aoki, Y., Ishii, S.: Training of Markov chain with detailed balance learning (in Japanese). In: Proceedings of the 19th Meeting of Japan Neural Network Society, pp. 40–41 (2009)


  29. Liu, J., Qi, Y., Meng, Z.Y., Fu, L.: Self-learning Monte Carlo method. Phys. Rev. B 95(4), 041101 (2017)


Column: Self-Learning Monte Carlo Method


According to Ref. [100], the contrastive divergence method is a kind of optimization of an “error function”

$$\displaystyle \begin{aligned} K_{ex}(\theta) = D_{KL} \Big( P_\theta (\mathbf{x}' | \mathbf{x}) P(\mathbf{x}) \Big| \Big| P_\theta (\mathbf{x} | \mathbf{x}') P(\mathbf{x}') \Big) \, . \end{aligned} $$
(6.111)

At θ = θ_0, where this quantity vanishes, the detailed balance condition is satisfied, so the target distribution P(x) is the convergence destination of the Markov chain \( P_{\theta _0} (\mathbf {x} '| \mathbf {x}) \). A little manipulation of this equation using Bayes’ theorem gives

$$\displaystyle \begin{aligned} K_{ex}(\theta) &= D_{KL} \Big( P_\theta (\mathbf{x}' | \mathbf{x}) P(\mathbf{x}) \Big| \Big| P_\theta (\mathbf{x}' | \mathbf{x}) \frac{e^{-H_\theta^{\text{eff}} (\mathbf{x}) }} {e^{-H_\theta^{\text{eff}} (\mathbf{x}') }} P(\mathbf{x}') \Big) \\ &= - \Big\langle \log \frac{ P(\mathbf{x}') e^{- H_\theta^{\text{eff}} (\mathbf{x})} }{ P(\mathbf{x}) e^{- H_\theta^{\text{eff}} (\mathbf{x}')} } \Big\rangle_{P_\theta(\mathbf{x}' | \mathbf{x} ) P(\mathbf{x}) } \geq 0 \, . \end{aligned} $$
(6.112)

The last inequality follows from a property of the relative entropy. Therefore, roughly speaking, the contrastive divergence method amounts to bringing the following quantity closer to 1:

$$\displaystyle \begin{aligned} \frac{ P(\mathbf{x}') e^{- H_\theta^{\text{eff}} (\mathbf{x})} }{ P(\mathbf{x}) e^{- H_\theta^{\text{eff}} (\mathbf{x}')} } \, \end{aligned} $$
(6.113)

In fact, the transition probability obtained from the heat-bath method together with a Metropolis test using this factor satisfies the detailed balance condition exactly:

$$\displaystyle \begin{aligned} P_\theta^{ex} (\mathbf{x}' | \mathbf{x} ) &= \min\Big( 1 , \frac{ P(\mathbf{x}') e^{- H_\theta^{\text{eff}} (\mathbf{x})} }{ P(\mathbf{x}) e^{- H_\theta^{\text{eff}} (\mathbf{x}')} } \Big) P_\theta (\mathbf{x}' | \mathbf{x} ) \\ &= \frac{ P(\mathbf{x}') e^{- H_\theta^{\text{eff}} (\mathbf{x})} }{ P(\mathbf{x}) e^{- H_\theta^{\text{eff}} (\mathbf{x}')} } \min\Big( \frac{ P(\mathbf{x}) e^{- H_\theta^{\text{eff}} (\mathbf{x}')} }{ P(\mathbf{x}') e^{- H_\theta^{\text{eff}} (\mathbf{x})} } , 1 \Big) \frac{e^{-H_\theta^{\text{eff}} (\mathbf{x}' )}} {e^{-H_\theta^{\text{eff}} (\mathbf{x} )}} P_\theta (\mathbf{x} | \mathbf{x}' ) \\ &= \frac{ P(\mathbf{x}') }{ P(\mathbf{x}) } P_\theta^{ex} ( \mathbf{x} | \mathbf{x}' ) \, . {} \end{aligned} $$
(6.114)

In this way, the desired convergence is guaranteed even if the probability defined by \( H_\theta ^{\text{eff}} \) does not exactly match the target P. Note that implementing this correction term requires access to the value of P(x). For example, if the target is written in terms of some Hamiltonian,

$$\displaystyle \begin{aligned} P(\mathbf{x}) = \frac{e^{-H_{\text{true}} (\mathbf{x})}}{Z} \, , {} \end{aligned} $$
(6.115)

this is possible. In this case, the training samples are snapshots of configurations of the statistical-mechanical system defined by this Hamiltonian. A machine-learning Metropolis method that repeats the cycle of (1) training \( H_\theta ^{\text{eff}} \) with configurations generated by a Markov chain Monte Carlo method for (6.115), and (2) generating new configurations with a Markov chain of type (6.114), is called a self-learning Monte Carlo method, which has been actively studied since 2016 [101].
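The Metropolis correction of type (6.114) can be illustrated with a deliberately simple toy problem (entirely our own construction): the "effective" model is a Gaussian, H_eff(x) = x²/2, while the true target is H_true(x) = x⁴/4. Proposing from the effective model and accepting with the ratio (6.113) makes the chain converge to the true target even though H_eff is wrong.

```python
import math, random

random.seed(0)
H_true = lambda x: x**4 / 4.0   # target: P(x) ∝ exp(-H_true(x))
H_eff = lambda x: x**2 / 2.0    # imperfect effective model (standard Gaussian)

x, samples = 0.0, []
for step in range(200000):
    # Heat-bath proposal: an independent draw from exp(-H_eff), i.e. N(0, 1)
    xp = random.gauss(0.0, 1.0)
    # Metropolis test with the ratio P(x') e^{-H_eff(x)} / (P(x) e^{-H_eff(x')})
    log_r = -(H_true(xp) - H_true(x)) + (H_eff(xp) - H_eff(x))
    if log_r >= 0 or random.random() < math.exp(log_r):
        x = xp
    samples.append(x * x)

# Exact value: <x^2> = 2 * Gamma(3/4) / Gamma(1/4) ≈ 0.676 under the target
print(round(sum(samples) / len(samples), 2))
```

In an actual self-learning Monte Carlo computation, the Gaussian proposal is replaced by a Markov chain built from the trained effective Hamiltonian, and the same correction factor restores exact detailed balance with respect to the true Hamiltonian.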


Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Tanaka, A., Tomiya, A., Hashimoto, K. (2021). Unsupervised Deep Learning. In: Deep Learning and Physics. Mathematical Physics Studies. Springer, Singapore. https://doi.org/10.1007/978-981-33-6108-9_6


  • DOI: https://doi.org/10.1007/978-981-33-6108-9_6


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-33-6107-2

  • Online ISBN: 978-981-33-6108-9

  • eBook Packages: Physics and Astronomy (R0)
