
Unsupervised Deep Learning

Deep Learning and Physics

Part of the book series: Mathematical Physics Studies ((MPST))


Abstract

In this chapter, at the end of Part I, we explain Boltzmann machines and generative adversarial networks (GANs). Neither model is a “find the answer” network of the kind given in Chap. 3; rather, the network itself gives the probability distribution of the input. Boltzmann machines have historically been a cornerstone of neural networks and are described by the Hamiltonian statistical mechanics of multi-particle spin systems; they form an important bridge between machine learning and physics. Generative adversarial networks are also one of the important topics in deep learning in recent years, and we try to explain them from a physical point of view.


Notes

  1.

    In addition to generative models, tasks such as clustering and principal component analysis belong to unsupervised learning. Note that, since the data are provided in advance, this setting differs from schemes such as reinforcement learning, where data are not provided in advance. Reinforcement learning is a very important machine learning technique that is not discussed in this book; instead, we provide some references here. The standard textbook is Ref. [76] by Sutton et al., which also describes the historical background.

  2.

    It is very difficult to compute the partition function with the Monte Carlo method. Roughly speaking, the partition function requires information about all states, whereas the Monte Carlo method concentrates on the states that contribute to expectation values.
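The footnote's point can be made concrete with a toy Ising chain (our own illustration, not from the book): computing Z exactly forces a sum over all 2^N states, while a Metropolis chain estimates the energy expectation value without ever computing Z, because only ratios of Boltzmann weights enter the accept/reject step.

```python
import itertools, math, random

# Toy 1D periodic Ising chain with J = 1 (illustrative example).
N, beta = 8, 0.5

def energy(s):
    # Nearest-neighbor coupling with periodic boundary conditions
    return -sum(s[i] * s[(i + 1) % N] for i in range(N))

# Exact: needs ALL 2^N states -- this is what makes Z hard in general.
states = list(itertools.product([-1, 1], repeat=N))
Z = sum(math.exp(-beta * energy(s)) for s in states)
E_exact = sum(energy(s) * math.exp(-beta * energy(s)) for s in states) / Z

# Monte Carlo: never computes Z, only ratios of Boltzmann weights.
random.seed(0)
s = [random.choice([-1, 1]) for _ in range(N)]
samples = []
for step in range(20000):
    i = random.randrange(N)
    dE = 2 * s[i] * (s[(i - 1) % N] + s[(i + 1) % N])  # energy change of a flip
    if dE <= 0 or random.random() < math.exp(-beta * dE):
        s[i] = -s[i]
    if step > 2000:  # discard thermalization
        samples.append(energy(s))
E_mc = sum(samples) / len(samples)
print(abs(E_mc - E_exact))  # small, even though Z was never estimated
```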

  3.

    The effective Hamiltonian is defined to satisfy the following equation: \( \exp [-H_J^{\mathrm{eff}}(\mathbf{x})] = \sum_{\mathbf{h}} \exp [-H_J(\mathbf{x},\mathbf{h})]/Z_J \).
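For a binary restricted Boltzmann machine the hidden sum in this definition can be done in closed form. The sketch below (our own notation: weights W, visible bias b, hidden bias c, and the constant Z_J dropped) checks the closed form against a brute-force sum over all hidden configurations.

```python
import numpy as np

# For H(x, h) = -b.x - c.h - x.W.h with binary h in {0, 1}, summing over h
# factorizes per hidden unit, giving the closed-form "free energy" of x
# (the footnote's H_eff, up to the overall constant Z_J).

rng = np.random.default_rng(0)
nv, nh = 4, 3
W = rng.normal(size=(nv, nh))
b, c = rng.normal(size=nv), rng.normal(size=nh)

def H(x, h):
    return -(b @ x + c @ h + x @ W @ h)

def H_eff(x):
    # Each binary hidden unit contributes a factor (1 + exp(c_j + (xW)_j))
    return -(b @ x + np.sum(np.log1p(np.exp(c + x @ W))))

x = rng.integers(0, 2, size=nv).astype(float)
# Brute-force check: sum exp(-H) over all 2^nh hidden configurations
Z_h = sum(np.exp(-H(x, np.array(h, dtype=float)))
          for h in np.ndindex(*(2,) * nh))
print(np.isclose(np.exp(-H_eff(x)), Z_h))  # True
```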

  4.

    The contrastive divergence method is abbreviated as the “CD method.” What is used here is also called the CD-1 method. In general, the contrastive divergence method is called the CD-k method, where k is the number of heat-bath sampling steps performed between x and h.
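A minimal CD-1 update for a binary RBM can be sketched as follows (our own notation and hyperparameters, not the book's): one Gibbs step x → h → x′ replaces the intractable model expectation in the log-likelihood gradient.

```python
import numpy as np

# CD-1 sketch for a binary RBM with weights W, visible bias b, hidden bias c.
rng = np.random.default_rng(1)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

nv, nh, lr = 6, 4, 0.1
W = 0.01 * rng.normal(size=(nv, nh))
b, c = np.zeros(nv), np.zeros(nh)

def cd1_step(x):
    global W, b, c
    # Up: sample hidden units given the data vector (heat-bath step)
    ph = sigmoid(c + x @ W)
    h = (rng.random(nh) < ph).astype(float)
    # Down: reconstruct visible units, then recompute hidden probabilities
    px = sigmoid(b + W @ h)
    xr = (rng.random(nv) < px).astype(float)
    phr = sigmoid(c + xr @ W)
    # CD-1 gradient: data statistics minus one-step reconstruction statistics
    W += lr * (np.outer(x, ph) - np.outer(xr, phr))
    b += lr * (x - xr)
    c += lr * (ph - phr)

data = rng.integers(0, 2, size=(50, nv)).astype(float)
for epoch in range(5):
    for x in data:
        cd1_step(x)
print(W.shape)  # (6, 4)
```

CD-k would simply repeat the down-up pair of sampling steps k times before taking the statistics.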

  5.

    By the way, since z usually lives in a space of several hundred dimensions, there is little practical difference between the Gaussian distribution and the uniform distribution on the sphere, owing to the curse of dimensionality: the higher the dimension, the more the volume of a ball concentrates in a thin shell near its surface.
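This concentration can be checked numerically (our own illustration): the norm of a standard Gaussian vector in d dimensions concentrates around √d, so high-dimensional Gaussian samples effectively live on a thin spherical shell.

```python
import numpy as np

# Rescaled radius |z| / sqrt(d) of standard Gaussian samples: its mean stays
# near 1 while its spread shrinks roughly like 1 / sqrt(2d).
rng = np.random.default_rng(0)
for d in (2, 100, 10000):
    z = rng.normal(size=(2000, d))
    r = np.linalg.norm(z, axis=1) / np.sqrt(d)
    print(d, round(r.mean(), 3), round(r.std(), 4))
```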

  6.

    This is written with the sigmoid function \( \sigma(u) = (1 + e^{-u})^{-1} \) as

    $$\displaystyle \begin{aligned} -\langle \log \sigma (D(x)) \rangle_{x\sim P(x)} - \langle \log (1-\sigma (D(x))) \rangle_{x\sim Q_G(x)}\end{aligned} $$
    (6.34)

    and it is similar to the cross-entropy error (3.16), the error function derived for binary classification in Chap. 3. In fact, this is identical to the binary classification problem of discriminating whether an item of data is real (x ∼ P(x)) or not (x ∼ Q_G(x)).
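The identification with binary cross-entropy can be verified directly in code (our own sketch: D values stand for the discriminator's logits, labels are 1 for real and 0 for generated samples).

```python
import numpy as np

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

def gan_d_loss(d_real, d_fake):
    # -<log sigma(D(x))>_{x~P} - <log(1 - sigma(D(x)))>_{x~Q_G}, as in (6.34)
    return -(np.mean(np.log(sigmoid(d_real)))
             + np.mean(np.log(1.0 - sigmoid(d_fake))))

def bce(logits, labels):
    # Standard binary cross-entropy with labels real = 1, fake = 0
    p = sigmoid(logits)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

rng = np.random.default_rng(0)
d_real, d_fake = rng.normal(1, 1, 100), rng.normal(-1, 1, 100)
logits = np.concatenate([d_real, d_fake])
labels = np.concatenate([np.ones(100), np.zeros(100)])
# The two expressions agree up to the overall normalization of the average
# (here a factor 2, since BCE averages over both populations at once).
print(np.isclose(gan_d_loss(d_real, d_fake), 2 * bce(logits, labels)))  # True
```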

  7.

    The minimax theorem states that if f(x, y) is a convex (concave) function with respect to the first (second) variable x (y), then

    $$\displaystyle \begin{aligned} \min_x \max_y f (x, y) = \max_y \min_x f (x, y) \, . \end{aligned} $$
    (6.42)

    Taking f = V_D and writing the solution of this minimax problem as \( \overline {G}, \overline {D} \), one can prove that they satisfy the Nash equilibrium conditions (6.39) and (6.40). The proof uses the condition V_G + V_D = 0 (the zero-sum condition).
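The exchange of min and max can be checked numerically on a simple convex-concave example (our own illustration): f(x, y) = x² + xy − y² is convex in x, concave in y, with its saddle point at the origin.

```python
import numpy as np

# Grid-search check of min_x max_y f = max_y min_x f for a convex-concave f.
xs = np.linspace(-2, 2, 401)
ys = np.linspace(-2, 2, 401)
X, Y = np.meshgrid(xs, ys, indexing="ij")
F = X**2 + X * Y - Y**2

min_max = F.max(axis=1).min()   # min over x of (max over y)
max_min = F.min(axis=0).max()   # max over y of (min over x)
print(min_max, max_min)  # both 0 at the saddle point (0, 0)
```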

  8.

    Just as V_D in the original GAN corresponds to the cross entropy, the objective function (6.49), called the hinge loss, corresponds to the error function of a support vector machine (which is not described in this book).
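The two losses mentioned here can be compared side by side (a sketch in our own notation: label t ∈ {+1, −1}, logit u). The hinge loss is exactly zero once the margin t·u reaches 1, whereas the cross-entropy (logistic) loss only decays smoothly.

```python
import numpy as np

def hinge(u, t):
    # SVM-style hinge loss: penalizes margins t*u below 1, zero beyond
    return np.maximum(0.0, 1.0 - t * u)

def cross_entropy(u, t):
    # -log sigma(t * u): the logistic loss used by the original GAN objective
    return np.log1p(np.exp(-t * u))

u = np.array([-2.0, 0.0, 0.5, 2.0])
print(hinge(u, +1))                       # zero once the margin reaches 1
print(np.round(cross_entropy(u, +1), 3))  # strictly positive everywhere
```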

  9.

    This is a detailed version of the proof provided in the appendix of the original paper.

  10.

    To be precise, it is better to say \( P(x)-Q_{G^*}(x) \geq 0 \) almost everywhere. The meaning of this expression is explained in the next footnote.

  11.

    Let us prove it by reductio ad absurdum. The negation of the statement that (6.66) holds almost everywhere is

    $$\displaystyle \begin{aligned} S=\{ x \,|\, D^*(x) > m \} \ \text{contributes to the integral}.\end{aligned} $$
    (6.63)

    We define a new \(\tilde {D}(x) = \min (m, D^*(x) )\); substituting \( D=\tilde {D} \) into V_D(G^*, D) in (6.53) gives

    $$\displaystyle \begin{aligned} V_D(G^*, \tilde{D}) &= \int_{S \cup S^c} dx \Big[ P(x) \tilde{D}(x) + Q_{G^*}(x) \max ( 0, m- \tilde{D}(x) ) \Big] \\ &= \int_{S } dx \Big[ P(x) \underbrace{ \tilde{D}(x) }_{ =m < D^*(x) } + Q_{G^*}(x) \underbrace{ \max ( 0, m- \tilde{D}(x) ) }_{ =0 = \max(0, m - D^*(x)) } \Big] \\ &\quad +\int_{S^c} dx \Big[ P(x) \underbrace{ \tilde{D}(x) }_{ =D^*(x) } + Q_{G^*}(x) \max ( 0, m- \underbrace{ \tilde{D}(x) }_{ =D^*(x) } ) \Big] \\ &< \int_{S \cup S^c} dx \Big[ P(x) D^*(x) + Q_{G^*}(x) \max ( 0, m- D^*(x) ) \Big] = V_D(G^*, D^*) \, .\end{aligned} $$
    (6.64)

    This inequality contradicts the Nash equilibrium definition (6.40), according to which D^* minimizes V_D(G^*, D) over D.

  12.

    This is equivalent to

    $$\displaystyle \begin{aligned} \int 1_{ D^*(x) > m} dx = 0 \, .\end{aligned} $$
    (6.65)

    In other words, the set S = {x | D^*(x) > m} has measure zero, so it does not contribute to the integral.

  13.

    This generalization is not necessary to derive the final form of the WGAN, but considering the Helmholtz free energy makes the derivation easier to understand.

  14.

    The “+ 1” term is not necessary, but it makes the later discussion cleaner.

  15.

    For example, in the actual computational algorithm there is a way to reduce the amount of computation compared to the original zero-temperature problem [85].

  16.

    As another example, support vector machines (which are not described in this book) also have a duality.

  17.

    Actually, spectral normalization was introduced to stabilize the training of conventional GANs rather than for use in WGANs, and the authors are not aware of successful examples of spectral normalization in WGAN implementations. Since this normalization makes the network K-Lipschitz continuous for some constant K, however, there is no reason why it could not be used for the WGAN.
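The core of spectral normalization can be sketched in a few lines (our own implementation, not the authors' code): estimate the largest singular value of a weight matrix by power iteration and divide it out, so that the corresponding linear layer becomes 1-Lipschitz.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_normalize(W, n_iter=50):
    # Power iteration: u and v converge to the top left/right singular vectors
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v  # estimated largest singular value
    return W / sigma

W = rng.normal(size=(8, 5))
Wn = spectral_normalize(W)
# After normalization the spectral norm (top singular value) is 1.
print(np.isclose(np.linalg.svd(Wn, compute_uv=False)[0], 1.0))  # True
```

In practice only one or a few power-iteration steps are done per training update, reusing the vectors from the previous step.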

  18.

    Noteworthy deep generative models that we have not been able to introduce here include the variational auto-encoder (VAE) [88, 89] and nonlinear independent component estimation (NICE) [90]. A brief review is provided by the physicist L. Wang [91].

  19.

    Yet another transform

    $$\displaystyle \begin{aligned} \text{(6.109)} = \int dx \sum_d Q_{J^* G^*} (x, d) \log \frac{Q_{J^* G^*} (x, d) }{ Q_{G^*}(x) Q(d) }, \qquad Q_{J^* G^*} (x, d) = Q_{J^*} (d|x)\, Q_{G^*}(x) \end{aligned} $$
    (6.108)

    provides a quantity called the mutual information between the generated images and the classification labels. Reference [93] used this to hack the IS, that is, to generate images with an unusually large IS (even though the images make little sense to the human eye). As this example shows, a higher IS is not always better.
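The Inception Score itself is easy to state in code (our own implementation sketch): it is the exponential of the mean KL divergence between the per-image label distribution p(y|x) and the marginal p(y), which by the identity above is the mutual information between images and predicted labels.

```python
import numpy as np

def inception_score(p_yx):
    # p_yx: (n_images, n_classes) rows of classifier softmax outputs
    p_y = p_yx.mean(axis=0)                                   # marginal p(y)
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)  # KL(p(y|x)||p(y))
    return np.exp(kl.mean())

n_cls = 10
# Best case: every image is classified confidently and all classes appear
sharp = np.full((n_cls, n_cls), 1e-12)
np.fill_diagonal(sharp, 1.0)
sharp /= sharp.sum(axis=1, keepdims=True)
# Worst case: the classifier is uniform on every image
flat = np.full((20, n_cls), 1.0 / n_cls)

print(round(inception_score(sharp), 2), inception_score(flat))
```

The score thus ranges from 1 (uninformative) up to the number of classes (sharp and diverse), which is why maximizing the mutual information directly can inflate it without improving the images.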

  20.

    There is another index called the Fréchet inception distance (FID) [95]. It is the Wasserstein distance (also called the Fréchet distance, from which the name FID follows) between the distribution of data images and the distribution of generated images in the hidden layer (feature space) of an image classification network, under the assumption that the features of the classification network follow a Gaussian distribution.
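Under the Gaussian assumption the Wasserstein-2 distance has a closed form, which is all the FID computes (our own sketch; for simplicity the matrix square root assumes a symmetric positive-semidefinite product, which holds in the examples below):

```python
import numpy as np

def sqrtm_psd(A):
    # Square root of a symmetric PSD matrix via eigendecomposition
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

def fid(mu1, C1, mu2, C2):
    # ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 (C1 C2)^{1/2}) for N(mu1,C1), N(mu2,C2)
    covmean = sqrtm_psd(C1 @ C2)
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(C1 + C2 - 2.0 * covmean))

d = 4
mu, C = np.zeros(d), np.eye(d)
print(fid(mu, C, mu, C))        # 0.0 for identical Gaussians
print(fid(mu, C, mu + 1.0, C))  # 4.0: only the mean term contributes
```

In the real FID, mu and C are the empirical mean and covariance of Inception-network features of the data images and the generated images, respectively.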

  21.

    The name “Inception” comes from the title of the popular Hollywood movie Inception: the title of the GoogLeNet paper is “Going deeper with convolutions,” while the main character’s line in the movie is “We need to go deeper.” The original paper even cites the movie (via an article summarizing its background). It is a witty name that makes anyone who has seen the movie grin.

References

  1. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)


  2. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)

  3. Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)


  4. Sutton, R.S., Barto, A.G., et al.: Introduction to Reinforcement Learning, vol. 135. MIT Press, Cambridge, MA (1998)


  5. Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural Comput. 14(8), 1771–1800 (2002)


  6. Bengio, Y., Delalleau, O.: Justifying and generalizing contrastive divergence. Neural Comput. 21(6), 1601–1621 (2009)


  7. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, pp. 2672–2680 (2014)


  8. von Neumann, J.: Zur Theorie der Gesellschaftsspiele [On the theory of games of strategy]. Mathematische Annalen 100(1), 295–320 (1928)

  9. Zhao, J., Mathieu, M., LeCun, Y.: Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126 (2016)


  10. Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017)

  11. Villani, C.: Optimal Transport: Old and New, vol. 338. Springer Science & Business Media (2008)


  12. Peyré, G., Cuturi, M., et al.: Computational optimal transport. Found. Trends Mach. Learn. 11(5–6), 355–607 (2019)

  13. Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transport. In: Advances in Neural Information Processing Systems, pp. 2292–2300 (2013)


  14. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A.C.: Improved training of Wasserstein GANs. In: Advances in Neural Information Processing Systems, pp. 5767–5777 (2017)

  15. Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957 (2018)


  16. Kingma, D.P., Welling, M.: Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114 (2013)

  17. Tolstikhin, I., Bousquet, O., Gelly, S., Schoelkopf, B.: Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558 (2017)


  18. Dinh, L., Krueger, D., Bengio, Y.: NICE: non-linear independent components estimation. arXiv preprint arXiv:1410.8516 (2014)

  19. Wang, L.: Generative models for physicists (2018). https://wangleiphy.github.io/lectures/PILtutorial.pdf

  20. Dodge, S., Karam, L.: A study and comparison of human and deep learning recognition performance under visual distortions. In: 26th International Conference on Computer Communication and Networks (ICCCN), pp. 1–7. IEEE (2017)


  21. Barratt, S., Sharma, R.: A note on the inception score. arXiv preprint arXiv:1801.01973 (2018)


  22. Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, pp. 2234–2242 (2016)

  23. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, pp. 6626–6637 (2017)

  24. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)


  25. Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223 (2011)


  26. Miyato, T., Koyama, M.: cGANs with projection discriminator. arXiv preprint arXiv:1802.05637 (2018)


  27. Brock, A., Donahue, J., Simonyan, K.: Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096 (2018)


  28. Maeda, S., Aoki, Y., Ishii, S.: Training of Markov chain with detailed balance learning (in Japanese). In: Proceedings of the 19th Meeting of Japan Neural Network Society, pp. 40–41 (2009)


  29. Liu, J., Qi, Y., Meng, Z.Y., Fu, L.: Self-learning Monte Carlo method. Phys. Rev. B 95(4), 041101 (2017)


Column: Self-Learning Monte Carlo Method


According to Ref. [100], the contrastive divergence method is a kind of optimization of an “error function”

$$\displaystyle \begin{aligned} K_{ex}(\theta) = D_{KL} \Big( P_\theta (\mathbf{x}' | \mathbf{x}) P(\mathbf{x}) \Big| \Big| P_\theta (\mathbf{x} | \mathbf{x}') P(\mathbf{x}') \Big) \, . \end{aligned} $$
(6.111)

At θ = θ_0, where this quantity vanishes, the detailed balance condition is satisfied, so the target distribution P(x) is the convergence destination of the Markov chain \( P_{\theta _0} (\mathbf {x} '| \mathbf {x}) \). A little manipulation of this equation using Bayes’ theorem gives

$$\displaystyle \begin{aligned} K_{ex}(\theta) &= D_{KL} \Big( P_\theta (\mathbf{x}' | \mathbf{x}) P(\mathbf{x}) \Big| \Big| P_\theta (\mathbf{x}' | \mathbf{x}) \frac{e^{-H_\theta^{\text{eff}} (\mathbf{x}) }} {e^{-H_\theta^{\text{eff}} (\mathbf{x}') }} P(\mathbf{x}') \Big) \\ &= - \Big\langle \log \frac{ P(\mathbf{x}') e^{- H_\theta^{\text{eff}} (\mathbf{x})} }{ P(\mathbf{x}) e^{- H_\theta^{\text{eff}} (\mathbf{x}')} } \Big\rangle_{P_\theta(\mathbf{x}' | \mathbf{x} ) P(\mathbf{x}) } \geq 0 \, . \end{aligned} $$
(6.112)

The last inequality follows from a property of the relative entropy. Therefore, roughly speaking, the contrastive divergence method amounts to bringing the following quantity closer to 1:

$$\displaystyle \begin{aligned} \frac{ P(\mathbf{x}') e^{- H_\theta^{\text{eff}} (\mathbf{x})} }{ P(\mathbf{x}) e^{- H_\theta^{\text{eff}} (\mathbf{x}')} } \, \end{aligned} $$
(6.113)

In fact, the transition probability obtained from the heat-bath method together with a Metropolis test using this factor satisfies the detailed balance condition exactly:

$$\displaystyle \begin{aligned} P_\theta^{ex} (\mathbf{x}' | \mathbf{x} ) &= \min\Big( 1 , \frac{ P(\mathbf{x}') e^{- H_\theta^{\text{eff}} (\mathbf{x})} }{ P(\mathbf{x}) e^{- H_\theta^{\text{eff}} (\mathbf{x}')} } \Big) P_\theta (\mathbf{x}' | \mathbf{x} ) \\ &= \frac{ P(\mathbf{x}') e^{- H_\theta^{\text{eff}} (\mathbf{x})} }{ P(\mathbf{x}) e^{- H_\theta^{\text{eff}} (\mathbf{x}')} } \min\Big( \frac{ P(\mathbf{x}) e^{- H_\theta^{\text{eff}} (\mathbf{x}')} }{ P(\mathbf{x}') e^{- H_\theta^{\text{eff}} (\mathbf{x})} } , 1 \Big) \frac{e^{-H_\theta^{\text{eff}} (\mathbf{x}' )}} {e^{-H_\theta^{\text{eff}} (\mathbf{x} )}} P_\theta (\mathbf{x} | \mathbf{x}' ) \\ &= \frac{ P(\mathbf{x}') }{ P(\mathbf{x}) } P_\theta^{ex} ( \mathbf{x} | \mathbf{x}' ) \, . {} \end{aligned} $$
(6.114)

In this way, the desired convergence is guaranteed even if the probability defined by \( H_\theta ^{\text{eff}} \) does not exactly match the target P. Note that implementing this correction term requires access to the value of P(x). For example, if the target is written in terms of some Hamiltonian,

$$\displaystyle \begin{aligned} P(\mathbf{x}) = \frac{e^{-H_{\text{true}} (\mathbf{x})}}{Z} \, , {} \end{aligned} $$
(6.115)

this is possible. In this case, the training samples are snapshots of configurations of the statistical-mechanical system defined by this Hamiltonian. A machine-learning Metropolis method that repeats the cycle of (1) training \( H_\theta ^{\text{eff}} \) with configurations generated by a Markov chain Monte Carlo method for (6.115), and (2) generating new configurations with a Markov chain of type (6.114), is called a self-learning Monte Carlo method, which has been actively studied since 2016 [101].
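The Metropolis correction of type (6.114) can be illustrated with a deliberately simple toy problem (entirely our own construction): the "effective" model is a Gaussian, H_eff(x) = x²/2, while the true target is H_true(x) = x⁴/4. Proposing from the effective model and accepting with the ratio (6.113) makes the chain converge to the true target even though H_eff is wrong.

```python
import math, random

random.seed(0)
H_true = lambda x: x**4 / 4.0   # target: P(x) ∝ exp(-H_true(x))
H_eff = lambda x: x**2 / 2.0    # imperfect effective model (standard Gaussian)

x, samples = 0.0, []
for step in range(200000):
    # Heat-bath proposal: an independent draw from exp(-H_eff), i.e. N(0, 1)
    xp = random.gauss(0.0, 1.0)
    # Metropolis test with the ratio P(x') e^{-H_eff(x)} / (P(x) e^{-H_eff(x')})
    log_r = -(H_true(xp) - H_true(x)) + (H_eff(xp) - H_eff(x))
    if log_r >= 0 or random.random() < math.exp(log_r):
        x = xp
    samples.append(x * x)

# Exact value: <x^2> = 2 * Gamma(3/4) / Gamma(1/4) ≈ 0.676 under the target
print(round(sum(samples) / len(samples), 2))
```

In an actual self-learning Monte Carlo computation, the Gaussian proposal is replaced by a Markov chain built from the trained effective Hamiltonian, and the same correction factor restores exact detailed balance with respect to the true Hamiltonian.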


Copyright information

© 2021 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter


Cite this chapter

Tanaka, A., Tomiya, A., Hashimoto, K. (2021). Unsupervised Deep Learning. In: Deep Learning and Physics. Mathematical Physics Studies. Springer, Singapore. https://doi.org/10.1007/978-981-33-6108-9_6


  • DOI: https://doi.org/10.1007/978-981-33-6108-9_6


  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-33-6107-2

  • Online ISBN: 978-981-33-6108-9

  • eBook Packages: Physics and Astronomy (R0)
